Sun, 18 Nov 2012

Developing a parser iteratively

This post describes a manageable way
to write a complex parser,
a little bit at a time, testing as you go.
This tutorial will "iterate" a parser
through one development step.
As the first iteration step,
we will use the example parser from
the previous tutorial in this series,
which parsed a Perl subset.

You may recall that the topic of that previous tutorial was pattern search.
Pattern search and iterative parser development are
essentially the same thing,
and the same approach can be used for both.
Each development stage of our Perl parser will do a pattern search
for the Perl subset it parses.
We can use the accuracy of this pattern search
to check our progress.
The subset we are attempting to parse is our "search target".
When our "searches" succeed in finding all instances
of the target,
we have successfully written a parser for that subset,
and can move on to the next step of the iteration.

What we need to do

This tutorial is the latest of
a series,
each of which describes one self-contained example of a Marpa-based parser.
In this tutorial we use the example from
the previous tutorial
as the first iteration step
in the iterative development of a Perl parser.
For the iteration step in this example, we will add two features.

The previous iteration step was more of a recognizer than a parser.
In particular, its grammar was too simplified to support a semantics,
even for the Perl subset it recognized.
We will fix that.

Having amplified the grammar, we will add a semantics.
It will be simple, but powerful enough to use in checking our progress
in developing the parser.

The format is documented here.
These eight lines were enough to describe arithmetic expressions sufficiently well
for a recognizer, as well as to provide the "scaffolding" for the unanchored search.
Nice compression, but now that we are talking about supporting a Perl semantics,
we will need more.

Adding the appropriate grammar is a matter of turning to the
appropriate section of the
perlop
man page
and copying it.
I needed to change the format and name the operators,
but the process was pretty much rote, as you can see:

The lexer

The lexer is table-driven.
I've used this same approach to lexing in every post
in this tutorial series.
Those interested in
an explanation of how the lexer works can
find one in the first tutorial.
Having broken out the operators, I had to rewrite
the lexing table,
but that was even more rote than rewriting
the grammar.
I won't repeat the
lexer table here --
it can be found in
the GitHub gist.
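
For readers who do not have the first tutorial at hand, here is a minimal sketch of what "table-driven" means in this context. It is not the lexer from the gist: the token names, the regexes, and the UNKNOWN fallback are all assumptions made for illustration. The idea is just that each row of a table pairs a token name with a pattern, and the lexer tries the rows in order at the current position.

```perl
use strict;
use warnings;

# Hypothetical lexing table: each row pairs a token name with a regex.
# Order matters: '**' must be tried before '*'.
my @lexer_table = (
    [ NUMBER   => qr/\d+/xms ],
    [ POWER    => qr/[*][*]/xms ],
    [ MULTIPLY => qr/[*]/xms ],
    [ PLUS     => qr/[+]/xms ],
    [ MINUS    => qr/[-]/xms ],
);

# Turn a string into a list of [ name, text ] tokens.
sub lex {
    my ($input) = @_;
    my @tokens;
    my $length = length $input;
    pos($input) = 0;
    TOKEN: while ( pos($input) < $length ) {
        next TOKEN if $input =~ m/\G\s+/gcxms;    # skip whitespace
        for my $row (@lexer_table) {
            my ( $name, $regex ) = @{$row};
            if ( $input =~ m/\G($regex)/gcxms ) {
                push @tokens, [ $name, $1 ];
                next TOKEN;
            }
        }
        # No row matched: consume one character as an UNKNOWN token
        # (an assumption of this sketch, so that lexing never fails).
        push @tokens, [ UNKNOWN => substr $input, pos($input), 1 ];
        pos($input) = pos($input) + 1;
    }
    return \@tokens;
}

print join( q{ }, map { $_->[0] } @{ lex('1 + 2**3') } ), "\n";
```

With a lexer of this shape, breaking out a new operator means adding one row to the table, which is why the rewrite is rote.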

Adding the semantics

Our semantics will create a syntax tree.
Here is that logic.
(Note that the first argument to these semantic closures
is a per-parse "object",
which we don't use here.)
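
To illustrate the shape of such closures, here is a small stand-alone sketch. It is not the code from the gist. The closure names do_number() and do_binary_op() are hypothetical, and the bottom-up evaluation that Marpa would perform is simulated by hand for the input "1 + 2 * 3". Each closure receives the per-parse object first, ignores it, and returns a node built from the values of its children.

```perl
use strict;
use warnings;

# Each closure gets the per-parse object first (ignored here),
# followed by the values of the rule's right-hand-side symbols.
sub do_number {
    my ( undef, $number ) = @_;
    return $number;
}

sub do_binary_op {
    my ( undef, $left, $op, $right ) = @_;
    return [ $op, $left, $right ];    # a tree node: operator, then operands
}

# Render a tree as a parenthesized prefix expression.
sub tree_to_string {
    my ($node) = @_;
    return $node if not ref $node;
    my ( $op, @children ) = @{$node};
    return '(' . join( q{ }, $op, map { tree_to_string($_) } @children ) . ')';
}

# Hand-simulated bottom-up evaluation of "1 + 2 * 3"
my $per_parse = {};
my $product   = do_binary_op( $per_parse,
    do_number( $per_parse, 2 ), '*', do_number( $per_parse, 3 ) );
my $tree = do_binary_op( $per_parse, do_number( $per_parse, 1 ), '+', $product );

print tree_to_string($tree), "\n";    # prints "(+ 1 (* 2 3))"
```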

There is some special logic in the
do_target()
method,
involving the "origin", or starting location of the target.
Perl arithmetic expressions,
when they are the target of an unanchored search,
are ambiguous.
For example, in the string "abc 1 + 2 + 3 xyz",
there are two targets ending at the same position:
"2 + 3" and "1 + 2 + 3".
We are interested only in the longest of these,
whose start location is indicated by the
$ORIGIN
variable.

The next logic will be familiar from our
pattern search tutorial.
It repeatedly looks for non-overlapping occurrences of
target,
starting from the end and going back to the beginning of the input.
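
The back-to-front search can be sketched without Marpa. In the sketch below, the list of [origin, end] spans stands in for what the recognizer knows about completed targets. That list, and the names last_span_before() and find_targets(), are assumptions of this sketch and not the code from the gist. Locations are counted in tokens, and a span covers the tokens from its origin up to, but not including, its end.

```perl
use strict;
use warnings;

# Of the spans ending at or before $limit, pick the one with the latest
# end; among spans sharing that end, pick the longest (earliest origin).
sub last_span_before {
    my ( $limit, $spans ) = @_;
    my $best;
    for my $span ( @{$spans} ) {
        my ( $origin, $end ) = @{$span};
        next if $origin >= $end;    # ignore empty spans
        next if $end > $limit;
        if (   not defined $best
            or $end > $best->[1]
            or ( $end == $best->[1] and $origin < $best->[0] ) )
        {
            $best = $span;
        }
    }
    return $best;
}

# Work from the end of the input back to the beginning,
# collecting non-overlapping targets.
sub find_targets {
    my ( $input_end, $spans ) = @_;
    my @results;
    my $limit = $input_end;
    while ( defined( my $span = last_span_before( $limit, $spans ) ) ) {
        unshift @results, $span;    # keep results in input order
        $limit = $span->[0];        # the next target must end by this origin
    }
    return \@results;
}

# Hypothetical spans for the tokens: abc 1 + 2 xyz 7 * 8
my @completed = ( [ 1, 2 ], [ 1, 4 ], [ 3, 4 ], [ 5, 6 ], [ 5, 8 ], [ 7, 8 ] );
my $results = find_targets( 8, \@completed );
print join( q{ }, map { "[$_->[0],$_->[1]]" } @{$results} ), "\n";
```

For this input the sketch keeps the spans for "1 + 2" and "7 * 8" and discards the shorter targets contained in them.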

This final code sample is the logic
that unites pattern search with incremental
parsing.
It is a loop through
@results
that prints the original text
and, depending on a flag,
its syntax tree.

Near the top of the loop,
the "$recce->set( { end => $end } )"
call sets the end-of-parse location to the end of the current
result.
At the bottom of the loop,
we call
"$recce->reset_evaluation()".
This is necessary to allow us to evaluate the
input stream again, but with a new
$end
location.

The
VALUE
sub-loop is
where the
$ORIGIN
variable
is set.
In the semantics,
do_target()
checks this.
In the case of an ambiguous parse,
do_target()
turns any target which does not
cover the full span from
$origin
to
$end
into a Perl
undef,
which will
eventually become
the value of its parse.
The logic in the
VALUE
loop
ignores parses whose value is a Perl undef,
so that only the longest target for each
$end
location is printed.
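
The interplay between $ORIGIN, do_target() and the VALUE loop can be sketched in isolation. This is a simplified stand-in, not the code from the gist: the list of candidate parses is hypothetical, the helper longest_values() is invented for the sketch, and the origin of each target is passed to do_target() explicitly, whereas the real closure has to obtain it from the parser.

```perl
use strict;
use warnings;

our $ORIGIN;    # start location of the longest target, set by the caller

# Stand-in for the target's semantic closure. Passing the origin as an
# argument is an assumption of this sketch.
sub do_target {
    my ( undef, $origin, $tree ) = @_;
    return if $origin != $ORIGIN;    # shorter target: its value becomes undef
    return $tree;
}

# Hypothetical parses ending at the same location in "abc 1 + 2 + 3 xyz"
my @parses = (
    { origin => 3, tree => [ '+', 2, 3 ] },
    { origin => 1, tree => [ '+', [ '+', 1, 2 ], 3 ] },
);

# Evaluate every candidate, keeping only values that are defined.
sub longest_values {
    my ( $origin, @candidates ) = @_;
    local $ORIGIN = $origin;
    my @values;
    VALUE: for my $parse (@candidates) {
        my $value = do_target( {}, $parse->{origin}, $parse->{tree} );
        next VALUE if not defined $value;    # ignore the shorter targets
        push @values, $value;
    }
    return @values;
}

my @kept = longest_values( 1, @parses );
print scalar(@kept), " value kept\n";
```

Only the parse for "1 + 2 + 3" survives, because it is the only one whose origin matches $ORIGIN.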