Sunday, July 26, 2009

On syntax highlighting and artificial intelligence

So on Rosetta Code, we use GeSHi for syntax highlighting. The relationship between Rosetta Code, GeSHi, a programming language and the code written in that language is fairly simple. (The exact order of events inside GeSHi might be slightly different; I haven't delved deeply into its core)



Rosetta Code (by way of a MediaWiki parser extension) gives GeSHi a few pointers about how it wants the code formatted, the language the code sample will be in, and, finally, the code sample itself.



GeSHi takes the code example, and loads the language file named after the language in question. Each language file defines a PHP associative array that contains(among a couple other things) simple rules for how GeSHi can apply formatting to the code in a way that will clarify it to the viewer. These rules include lists of known keywords of various classifications, symbols used for normal commenting conventions and optional regex matching rules for each, among other things.



It's a perfectly reasonable, fairly static approach that allows syntax highlighting to cover a broad variety of languages without knowing how to parse that language's actual syntax, and so avoiding having a syntax error break the whole process.



Unfortunately, it requires Rosetta Code to be able to tell GeSHi what language a code sample is written in. It also leads to odd scenarios where a supported language and an unsupported language are so closely related that examples written for the unsupported language can be comfortably highlighted using the the rules for the supported language.



And I have yet to learn of a good way to do syntax highlighting for Forth. (The Forth developers appear to pretty much keep to their own community, and don't seem to do much in the way of outreach, which makes finding a solution relatively difficult, but I digress...)



So what does this have to do with artificial intelligence? Well, in identifying a language without being told what it is, of course!



A few solutions have been discussed. One approach that has been attempted had something to do with Markov Chains. The code is in the GeSHi repos, and I haven't looked at it.



One solution I suggested was to run the code example through all the supported languages (Yes, I know, that's expensive. Not something to be done in real time.), and select the ruleset based on how many rules(X) were matched for a language and how much of the code sample was identified(Y). Using a simple heuristic of (a*X)/(b*Y), you can account for a number of matched rules while hopefully accounting for an overly-greedy regex rule.



How can we take this a step farther? How about formatting languages we don't know about?



Well, many, many languages have rules in common. Common keywords, common code block identifiers, common symbols for comments, common symbols for quotation, etc. This tends to result from their being derived or inspired in some way by another language. For the sake of avoiding pedantry, I'll just say that C, C++, Perl, Python, PHP, Pascal and Java all have a few common ancestors.



One way would be to note the best N language matches, take an intersection of their common rules, and apply that intersection as its own ruleset. This would certainly work for many of the variants of BASIC out there, as well as specialized variants of common languages like C and low-level ISAs.

No comments:

Post a Comment