Marpa resources

Fri, 13 May 2011

This post in one of
a series
inspired by
a web discussion
about what
it would take to replace yacc
and its cousins as the industry standard
for parser generators.
My interest in the question grows out of
Marpa,
my project to take the many improvements made
over the decades to Earley's algorithm,
and turn them into a practical parser generator.

Requirement 6: Available as a Library

To be an adequate yacc replacement,
an algorithm should be available as a library.
In the Perl context,
this means available as a module.
I interpret this requirement loosely
enough to regard the traditional implementations
of LALR parsers,
such as the yacc parser generator,
as meeting this test.
True, they usually
generate the code for a custom parser,
which then must be
compiled as a separate step.
But an LALR parser generator could be implemented
as a library.

The reason that
yacc and its relatives
are traditionally not implemented
as libraries is revealing:
the chances of a practical grammar
"just working" with
yacc
are pretty much zero.
In practice you almost never go from writing a yacc
grammar
directly to a successful execution.
In context, therefore,
having to invoke
the C compiler as a separate stage is
only a small additional effort.
And it does open up a lot of extra options,
such as the
use of alternative compilers, linkers
and other tools;
the full range of options to
these tools;
the opportunity to custom modify the C code; etc., etc.

Marpa fulfills Requirement 6.
It is implemented as a Perl
module, and its grammar pre-processing and parsing has been translated
into a C library (libmarpa), in which form it runs much faster.

The major parsing technique that clearly fails this requirement
is hand-written recursive descent.
Hand-written recursive descent parsers are,
by definition, not available in
a library or any other ready-made form.
A hand-written recursive descent parser is straightforward
to write,
so much so that writing one these days is often preferred
to struggling with yacc.
Nonetheless code it yourself is what you must do.

A Requirement I Dropped

In this list of requirements, I included the requirements
in
Russ Cox's post
and all but one from
the Might-Darais paper.
Those I included I sometimes strengthened,
and I added two more to the list,
dealing with error detection.
Error detection is important to include --
I attribute
the present turn against the LALR-based family of
parser generators (which includes yacc)
mainly to LALR's grotesquely bad error detection properties.

What was the requirement I dropped?
Might & Darais suggest that the ideal algorithm
needs to fit on
a single page --
in effect that a high-quality general parser
must be something you could whiteboard for your
Google interview.
To be sure,
this would be a useful and charming property in any algorithm.
Unsurprisingly, the Might-Darais algorithm has it.

I say "unsurprisingly", because when authors of parsers list
requirements for the perfect parser, they have a tendency to
shoot the arrow first, then set up the target near where
the arrow landed.
I could be accused of this.

Picasso and Angkor Wat

In this later years, Picasso could create a world in a few brush strokes.
While writing Marpa, there were moments when
a problem which appeared complex
dissolved into a simple truth.
The nicest of these was when
the largely incompatible algorithms
of Joop Leo and Aycock-Horspool turned out to
be a perfect fit at all the points where it mattered.
Even so, there were not enough of these moments
to reverse the trend.
Nobody will ever call the Marpa algorithm simple.

But disappointment is what happens when you search
for the wrong thing.
The straightness of the Parthenon's lines is an illusion.
Simple straight lines do not look straight.
The builders of the Parthenon wanted every line to appear
perfectly straight to the human eye, and as a result
there is
not a prefectly straight line
anywhere in the temple.

In South Asia,
the great temple complexes glory in their complexity.
And in its Frieze,
the Parthenon did not even pretend to be simple.
Their creators knew that
beauty is beyond pattern.

Note 3:
Marpa is available on CPAN in
Pure Perl
and
XS
form.
Marpa will parse anything you can write in BNF.
If a grammar is one of the kinds in practical use,
(such as yacc's LALR or recursive descent's LL(1)),
you can expect Marpa to parse it in linear time -- O(n).
To be specific, Marpa parses any
LR-regular grammar in O(n).

Note 4: A yacc
user must also create her own lexer.
This was considered acceptable since there were tools
specifically for creating yacc-compatible lexers,
and since, in exchange for the trouble of writing a yacc
parser, she expected a extremely efficient parser,
one which often justified a hand-written lexer.

Note 5:
I did not include every improvement to Earley's proposed
in the literature.
In particular, Marpa does not use lookahead.
Prior to Marpa,
Grune and Jacobs had already concluded
that the tradeoffs for lookahead in
Earley's parsers were dubious.

No amount of lookahead
will make a parser faster
than linear -- O(n).
For the LR(k) grammars,
that is, LR grammars with a lookahead of k,
Marpa is already O(n) for all finite values of k.
The same is true of the LL(k) grammars for all
finite values of k.
In fact,
Marpa is O(n) for
LR grammars with infinite lookahead,
as long as the lookahead is a regular expression.
The overhead from adding
explicit lookahead to Marpa would be very real.
The benefit would be hard to find.