Monday, September 7, 2009

Introduction

I have wiki markup text that I need to parse. My first version, which I use in the DroidWiki application for Android, is a custom wiki parser. I wrote my own because I couldn't find parsing code on the internet that was light enough to run on an Android phone. That code runs a regular expression match for each wiki tag, so every line is matched repeatedly. Recently, I decided to try my hand at fancier parsing: just parse the text into a sequence of tokens. Below are the results of my research on this topic.

Using One-Character-At-A-Time Parsing

One way to solve the problem is one-pass parsing: the Java code looks at each character exactly once and isolates the tokens as it goes. Using a character iterator, it goes like this:

import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

StringCharacterIterator iter = new StringCharacterIterator(markup);

for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
    // process the char: is it one of the characters that starts any of the syntax tokens?
}
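To make the loop above a bit more concrete, here is a minimal sketch that splits the markup into runs of plain text and single syntax characters in one pass. The class name OnePassScanner and the set of syntax characters are assumptions made for illustration, not taken from DroidWiki. It only handles the splitting; recognizing multi-character tags and nesting in the same pass is where the code starts to grow.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.ArrayList;
import java.util.List;

class OnePassScanner {
    // Hypothetical set of characters that can start a wiki syntax token.
    private static final String SYNTAX_CHARS = "[]*=";

    /** Splits markup into runs of plain text and single syntax characters. */
    static List<String> scan(String markup) {
        List<String> pieces = new ArrayList<String>();
        StringBuilder text = new StringBuilder();
        CharacterIterator iter = new StringCharacterIterator(markup);

        for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
            if (SYNTAX_CHARS.indexOf(c) >= 0) {
                // A syntax character ends the current run of plain text.
                if (text.length() > 0) {
                    pieces.add(text.toString());
                    text.setLength(0);
                }
                pieces.add(String.valueOf(c));
            } else {
                text.append(c);
            }
        }
        // Flush whatever plain text is left at the end of the input.
        if (text.length() > 0) {
            pieces.add(text.toString());
        }
        return pieces;
    }
}
```

Each character is visited once, and the only state carried between iterations is the buffer holding the current run of plain text.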

This could be fast, but the code would be complicated, so I decided on a different approach.

Using the Interpreter Design Pattern

The idea:

1. Parse the text and convert it to a sequence of basic tokens:
   - every continuous piece of regular text is a token
   - every piece of syntax is a token (for example, when parsing HTML text, the less-than character would be a single token by itself)

2. Process the list of these basic tokens and aggregate them into complete syntactical elements (for example, every complete HTML tag would be a single token).

If you want, create:
   - a SimpleToken class for item 1 above,
   - a base Token class, specialized for the more significant or complex syntactical elements of item 2 above.
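
The two steps and the two kinds of classes can be sketched roughly as follows. This is only an illustration under several assumptions: SimpleToken extends Token here, the BoldToken class, the toHtml() method, and the single `*text*` bold rule are all hypothetical, and no HTML escaping is done.

```java
import java.util.ArrayList;
import java.util.List;

/** Base class for all tokens; toHtml() is a hypothetical "interpret" step. */
abstract class Token {
    abstract String toHtml();
}

/** Basic token from step 1: a run of plain text or one syntax character. */
class SimpleToken extends Token {
    final String text;
    final boolean syntax;
    SimpleToken(String text, boolean syntax) { this.text = text; this.syntax = syntax; }
    String toHtml() { return text; }
}

/** Aggregated token from step 2 (hypothetical rule: *text* means bold). */
class BoldToken extends Token {
    final String content;
    BoldToken(String content) { this.content = content; }
    String toHtml() { return "<b>" + content + "</b>"; }
}

class WikiTokenizer {
    private static final String SYNTAX_CHARS = "*"; // assumed syntax set

    /** Step 1: split the markup into basic tokens. */
    static List<SimpleToken> tokenize(String markup) {
        List<SimpleToken> tokens = new ArrayList<SimpleToken>();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < markup.length(); i++) {
            char c = markup.charAt(i);
            if (SYNTAX_CHARS.indexOf(c) >= 0) {
                if (text.length() > 0) {
                    tokens.add(new SimpleToken(text.toString(), false));
                    text.setLength(0);
                }
                tokens.add(new SimpleToken(String.valueOf(c), true));
            } else {
                text.append(c);
            }
        }
        if (text.length() > 0) tokens.add(new SimpleToken(text.toString(), false));
        return tokens;
    }

    private static boolean isStar(SimpleToken t) {
        return t.syntax && t.text.equals("*");
    }

    /** Step 2: aggregate basic tokens into complete syntactical elements. */
    static List<Token> aggregate(List<SimpleToken> basic) {
        List<Token> out = new ArrayList<Token>();
        int i = 0;
        while (i < basic.size()) {
            SimpleToken t = basic.get(i);
            // Pattern: '*' followed by plain text followed by '*'.
            if (isStar(t) && i + 2 < basic.size()
                    && !basic.get(i + 1).syntax && isStar(basic.get(i + 2))) {
                out.add(new BoldToken(basic.get(i + 1).text));
                i += 3;
            } else {
                out.add(t); // leave anything unmatched as a basic token
                i++;
            }
        }
        return out;
    }

    /** Runs both steps and lets every token render itself. */
    static String toHtml(String markup) {
        StringBuilder html = new StringBuilder();
        for (Token t : aggregate(tokenize(markup))) html.append(t.toHtml());
        return html.toString();
    }
}
```

With these assumptions, WikiTokenizer.toHtml("a *b* c") yields "a <b>b</b> c", while an unbalanced marker such as "a * b" passes through unchanged. Supporting another wiki element would mean adding another Token subclass and another pattern in aggregate().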