Marpa resources

Wed, 03 Aug 2011

I have just released
Marpa::XS 0.008000.
With this release the core Marpa algorithm has been converted to C,
vastly speeding it up.
Marpa::XS is still alpha, but the additional development needed
at this point is a matter of packaging
(See Note 1).

It is my hope that Marpa
will become the standard parsing algorithm for problems too
big for regular expressions.

Marpa parses all
classes of grammar that are currently in practical use
in linear time.
(See Note 2).

Marpa is a general BNF parser -- that means if you feed it anything
written in BNF, it "just parses" it.
This includes grammars which are left-recursive, right-recursive and
ambiguous -- even infinitely ambiguous.

Marpa never goes exponential -- worst case, even for highly ambiguous
grammars, is O(n**3), which is considered optimal
(See Note 3).

Limitations and Comparisons

The foremost limitation of Marpa is, of course,
that it is alpha.
Development is well advanced, but the interface remains
subject to change.

There are several parsing tasks
where current technology will be superior
even to a stable, production Marpa.
For problems which work well as regexes (compact and with no backtracking),
Marpa will never replace a good regular expression engine.
On the other hand, Marpa may well be the answer to a lot of problems
which are forced to fit into the regular expression paradigm
for lack of anything better.
Hand-written recursive descent could compete with Marpa, but
not having to write your own parser is a huge advantage.

I've made no secret that I consider yacc and LALR theoretical milestones
and industry traditions which are best honored in the breach.
I've pointed out LALR's horrendous error-handling properties
elsewhere
(See Note 4).
LALR does offer some apparent speed advantages,
but in the context of practical applications these are
more apparent than real.

Notes

Note 1:
Inside the code are
many relics of the development process,
and these need
to be cleaned up.
The user documentation is finished,
but the internals documentation needs a lot
of work.

Some additional C code may be written to help in the
evaluation phase but, while this code could produce
some significant additional efficiencies, I don't consider it part of the
core Marpa development project for two reasons.
First, evaluation is to a large degree a matter of calling higher-level
and application code.

Second, the core Marpa algorithm was new
and extremely difficult mathematics --
built on top of a large body of prior work to be sure,
but new nonetheless.
With the evaluation code, my original thinking involves making
making choices of what to use in existing mathematics --
evaluation consists of problems
like tree traversal, where there is lots
of prior work to consult.

Note 2:
To be specific, Marpa parses any LR-regular grammar in linear time -- O(n).
LR-regular is a vast class of grammars that includes LALR,
LR(k) for all k,
and LL(k) for all k.
This means that Marpa parses,
in linear time,
every grammar parsed by yacc,
recursive descent and regular expressions.

Note 3:
The phrase "considered optimal" elides some
irrelevant bits of theory.
It would be a bit of a
surprise if it is possible to
do general BNF parsing
in O(n), but nobody has proved that it can't be done.
The Valiant algorithm parses general BNF
and is O(n**2.376...) or better.
But Valiant's algorithm is only faster for huge problems,
and for those it
needs a machine with many terabytes of main memory
to deliver on its speed claims.
So it won't be competing with Marpa any time soon.