Revision as of 15:40, 19 January 2010

This tutorial shows how to parse a subset of a simple imperative
programming language called WHILE (introduced in the book
"Principles of Program Analysis" by Nielson, Nielson and Hankin). The language
includes only a few statements and basic boolean/arithmetic expressions, which
makes it good material for a tutorial.

This creates a language definition that accepts C-style comments and requires
that identifiers start with a letter and continue with alphanumeric
characters. Moreover, there are a number of reserved names that cannot be used
as identifiers.
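A definition along these lines could look like the sketch below. It is an illustration rather than the original listing: the record fields come from Parsec's token module, but the exact lists of reserved names and operators are assumptions chosen to fit the WHILE language.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Sketch of a language definition matching the description above.
-- The keyword and operator lists are assumptions, not the original ones.
languageDef :: LanguageDef st
languageDef =
  emptyDef { Token.commentStart    = "/*"      -- C-style block comments
           , Token.commentEnd      = "*/"
           , Token.commentLine     = "//"      -- C-style line comments
           , Token.identStart      = letter    -- first character of an identifier
           , Token.identLetter     = alphaNum  -- every following character
           , Token.reservedNames   = [ "if", "then", "else"
                                     , "while", "do", "skip"
                                     , "true", "false"
                                     , "not", "and", "or"
                                     ]
           , Token.reservedOpNames = [ "+", "-", "*", "/", ":="
                                     , "<", ">", "and", "or", "not"
                                     ]
           }
```

With this definition, an identifier parser built from it should accept a name such as x1 and reject a reserved word such as while.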

Having the above definition we can create a lexer:

> lexer = Token.makeTokenParser languageDef

lexer contains a number of lexical parsers that we can use to parse
identifiers, reserved words/operations, etc. Now we can select/extract them in
the following way:
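As a rough sketch of that extraction step, each field of the token parser record can be bound to a short top-level name. The stand-in lexer below is built from emptyDef with a few assumed reserved names so that the block compiles on its own; in the tutorial, the lexer defined earlier would be used instead.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Stand-in lexer so the sketch is self-contained (assumed reserved names).
lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser
  (emptyDef { Token.reservedNames   = ["while", "do", "skip"]
            , Token.reservedOpNames = [":="]
            })

identifier :: Parser String
identifier = Token.identifier lexer   -- parses an identifier

reserved :: String -> Parser ()
reserved = Token.reserved lexer       -- parses a reserved name

reservedOp :: String -> Parser ()
reservedOp = Token.reservedOp lexer   -- parses an operator

parens :: Parser a -> Parser a
parens = Token.parens lexer           -- parses p between ( and )

integer :: Parser Integer
integer = Token.integer lexer         -- parses an integer

semi :: Parser String
semi = Token.semi lexer               -- parses a semicolon

whiteSpace :: Parser ()
whiteSpace = Token.whiteSpace lexer   -- skips whitespace and comments
```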

This isn't really necessary, but should make the code much more readable (also
this is the reason why we used the qualified import of
Text.ParserCombinators.Parsec.Token). Now we can use them to parse the source
code at the token level. One of the nice features of these parsers is that
they take care of all whitespace after the tokens.

5 Main parser

As already mentioned, a program in this language is simply a statement, so the
main parser should basically only parse a statement. But remember to take care of
initial whitespace - our parsers only get rid of whitespace after the tokens!

> whileParser :: Parser Stmt
> whileParser = whiteSpace >> statement

Now because any statement might actually be a sequence of statements separated
by semicolons, we use sepBy1 to parse at least one statement. The result is a
list of statements. We also allow grouping statements with parentheses, which
is useful, for instance, in the while loop.
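The sequencing logic can be sketched as follows. To keep the block short and self-contained, the only single statement here is skip, and the Stmt type is cut down to two constructors; both are simplifications for illustration, and the lexer is a stand-in built from emptyDef.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Cut-down AST for this sketch: a sequence or a single skip.
data Stmt = Seq [Stmt] | Skip
  deriving (Show, Eq)

lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser (emptyDef { Token.reservedNames = ["skip"] })

-- A statement is either parenthesised or a semicolon-separated sequence.
statement :: Parser Stmt
statement = Token.parens lexer statement <|> sequenceOfStmt

-- sepBy1 guarantees at least one statement; a single statement is
-- returned as-is instead of being wrapped in Seq.
sequenceOfStmt :: Parser Stmt
sequenceOfStmt = do
  list <- sepBy1 statement' (Token.semi lexer)
  return $ case list of
    [single] -> single
    _        -> Seq list

-- The only single statement in this sketch.
statement' :: Parser Stmt
statement' = Token.reserved lexer "skip" >> return Skip

whileParser :: Parser Stmt
whileParser = Token.whiteSpace lexer >> statement
```

Under these assumptions, the input skip; skip should yield Seq [Skip,Skip], while skip on its own (or in parentheses) should yield just Skip.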

If you have a parser that might fail after consuming some input, and you still
want to try the next parser, you should look into the try combinator. For
instance, try p <|> q will try parsing with p and, if it fails, even after
consuming input, the q parser will be used as if nothing had been consumed
by p.
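A small illustration of the difference, using two keywords that share a prefix. The parser names withoutTry and withTry are made up for this example.

```haskell
import Text.ParserCombinators.Parsec

-- Without try: string "while" consumes "wh" before failing on "when",
-- so <|> does not fall back to the second alternative.
withoutTry :: Parser String
withoutTry = string "while" <|> string "when"

-- With try: the failed attempt is rewound, so the second alternative
-- sees the input from the beginning.
withTry :: Parser String
withTry = try (string "while") <|> string "when"
```

Running both on the input when, the first parser reports an error and the second returns "when".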

Now let's define the parsers for all the possible statements. This is quite
straightforward as we just use the parsers from the lexer and then use all the
necessary information to create appropriate data structures.
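One possible shape for these statement parsers is sketched below. To stay self-contained, expressions are reduced to stand-ins (a boolean literal for conditions, a variable or integer for arithmetic), and the data types are simplified assumptions rather than the tutorial's full AST.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Simplified AST (assumed for this sketch).
data BExpr = BoolConst Bool deriving (Show, Eq)
data AExpr = Var String | IntConst Integer deriving (Show, Eq)
data Stmt  = Assign String AExpr
           | If BExpr Stmt Stmt
           | While BExpr Stmt
           | Skip
           deriving (Show, Eq)

lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser
  (emptyDef { Token.reservedNames   = [ "if", "then", "else", "while"
                                      , "do", "skip", "true", "false" ]
            , Token.reservedOpNames = [":="]
            })

reserved :: String -> Parser ()
reserved = Token.reserved lexer

-- Stand-in expression parsers.
bExpression :: Parser BExpr
bExpression =   (reserved "true"  >> return (BoolConst True))
            <|> (reserved "false" >> return (BoolConst False))

aExpression :: Parser AExpr
aExpression =   fmap Var (Token.identifier lexer)
            <|> fmap IntConst (Token.integer lexer)

statement :: Parser Stmt
statement = ifStmt <|> whileStmt <|> skipStmt <|> assignStmt

-- if <cond> then <stmt> else <stmt>
ifStmt :: Parser Stmt
ifStmt = do
  reserved "if"
  cond <- bExpression
  reserved "then"
  stmt1 <- statement
  reserved "else"
  stmt2 <- statement
  return (If cond stmt1 stmt2)

-- while <cond> do <stmt>
whileStmt :: Parser Stmt
whileStmt = do
  reserved "while"
  cond <- bExpression
  reserved "do"
  body <- statement
  return (While cond body)

-- <var> := <expr>
assignStmt :: Parser Stmt
assignStmt = do
  var <- Token.identifier lexer
  Token.reservedOp lexer ":="
  expr <- aExpression
  return (Assign var expr)

skipStmt :: Parser Stmt
skipStmt = reserved "skip" >> return Skip
```

Each parser reads the tokens in source order and passes the collected pieces to the matching constructor.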

In the case of Prefix operators, it is enough to specify which operator should
be parsed and which data constructor is associated with it. Infix operators
are defined similarly, but it's necessary to add information about
associativity. Note that operator precedence depends only on the order of the
elements in the list: operators that appear earlier bind more tightly.
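An operator table for arithmetic might be sketched like this. The constructor names are assumptions, and the term parser is reduced to integer literals so that the block stays focused on precedence and associativity.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Assumed AST for this sketch.
data ABinOp = Add | Subtract | Multiply | Divide
  deriving (Show, Eq)

data AExpr = IntConst Integer
           | Neg AExpr
           | ABinary ABinOp AExpr AExpr
  deriving (Show, Eq)

lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser
  (emptyDef { Token.reservedOpNames = ["+", "-", "*", "/"] })

reservedOp :: String -> Parser ()
reservedOp = Token.reservedOp lexer

-- Rows are listed from highest to lowest precedence.
aOperators :: OperatorTable Char () AExpr
aOperators =
  [ [ Prefix (reservedOp "-" >> return Neg) ]
  , [ Infix (reservedOp "*" >> return (ABinary Multiply)) AssocLeft
    , Infix (reservedOp "/" >> return (ABinary Divide))   AssocLeft ]
  , [ Infix (reservedOp "+" >> return (ABinary Add))      AssocLeft
    , Infix (reservedOp "-" >> return (ABinary Subtract)) AssocLeft ]
  ]

-- Term parser reduced to integer literals for this sketch.
aExpression :: Parser AExpr
aExpression = buildExpressionParser aOperators (fmap IntConst (Token.integer lexer))
```

With this table, 1 + 2 * 3 groups the multiplication first, and 8 - 2 - 1 groups to the left because of AssocLeft.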

Finally, we have to define the terms. In the case of arithmetic expressions,
it is quite simple:
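A term parser for arithmetic expressions could be sketched as below: a term is a parenthesised expression, a variable, or an integer constant. The AST and the single-operator table are simplifications assumed for this example.

```haskell
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Expr
import Text.ParserCombinators.Parsec.Language (emptyDef)
import qualified Text.ParserCombinators.Parsec.Token as Token

-- Assumed AST with a single binary operator.
data AExpr = Var String
           | IntConst Integer
           | Add AExpr AExpr
  deriving (Show, Eq)

lexer :: Token.TokenParser ()
lexer = Token.makeTokenParser (emptyDef { Token.reservedOpNames = ["+"] })

-- Expressions are built from terms by the operator table.
aExpression :: Parser AExpr
aExpression = buildExpressionParser
  [ [ Infix (Token.reservedOp lexer "+" >> return Add) AssocLeft ] ]
  aTerm

-- A term: parenthesised expression, variable, or integer constant.
aTerm :: Parser AExpr
aTerm =   Token.parens lexer aExpression
      <|> fmap Var (Token.identifier lexer)
      <|> fmap IntConst (Token.integer lexer)
```

Because aTerm refers back to aExpression through parens, parenthesised sub-expressions can be nested to any depth.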