
Mon, 07 Nov 2011

In this post, I pit Marpa against the Perl regex engine.
The example I will use is unanchored searching for balanced parentheses.
I have claimed that many problems now tackled with regexes are better
solved with a more powerful parser, like Marpa.
I believe the numbers in this post back up that claim.

To be very clear,
I am NOT claiming that Marpa should or can replace
regexes in general.
For each character,
all an RE
(regular expression) engine needs to do
is to compute a transition from
one "state" to another state based on that character --
essentially a simple lookup.
It's the sort of thing a modern C compiler should optimize
into a series of machine instructions that
you can count
on the fingers of one hand.
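
To make the "simple lookup" concrete, here is a minimal sketch of a table-driven state machine in Python. The pattern it recognizes, a(b|c)*d, and the names TRANSITIONS, ACCEPTING and dfa_accepts are illustrative assumptions of mine, not anything taken from the Perl regex engine.

```python
# Transition table for the regular expression a(b|c)*d.
# State 0 is the start state; state 2 is the only accepting state.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 1, "c": 1, "d": 2},
    2: {},
}
ACCEPTING = {2}


def dfa_accepts(text):
    """Return True if the whole of text matches a(b|c)*d."""
    state = 0
    for ch in text:
        # The entire per-character cost is this one table lookup.
        state = TRANSITIONS[state].get(ch)
        if state is None:  # no transition defined: reject immediately
            return False
    return state in ACCEPTING
```

A real RE engine compiles the table from the pattern and does the lookup in C, but the work per character has the same shape.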

Marpa is much more powerful than a regular expression engine,
and to deliver this power
Marpa makes a list of all the possibilities
for each
token.
Tricks are used to compress these per-token lists,
and Marpa's code to process them is heavily
optimized.
But even so,
Marpa's processing requires more than a handful of
machine instructions.
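
The idea of a per-token list of possibilities can be sketched with a naive Earley recognizer, the algorithm from which Marpa descends. This is only an illustration under my own assumptions: the grammar (B for one balanced group, L for a list of groups), the item layout and the function names are made up for the example, the grammar is assumed to have no empty rules, and none of Marpa's compression tricks or optimizations are attempted.

```python
# Grammar for a single balanced group of parens.
# B is one group; L is a non-empty list of groups.
GRAMMAR = {
    "B": [("(", ")"), ("(", "L", ")")],
    "L": [("B",), ("L", "B")],
}


def earley_sets(grammar, start, tokens):
    """Build one set of items per token position.

    An item is (lhs, rhs, dot, origin): a rule, how far into it we are,
    and the position where this attempt at the rule began.
    Assumes the grammar has no empty right-hand sides.
    """
    sets = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        sets[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(sets[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: every rule for the expected nonterminal.
                    new_items = [(symbol, alt, 0, i) for alt in grammar[symbol]]
                else:
                    # Scan: advance over a matching token into the next set.
                    if i < len(tokens) and tokens[i] == symbol:
                        sets[i + 1].add((lhs, rhs, dot + 1, origin))
                    new_items = []
            else:
                # Complete: advance every item that was waiting for lhs.
                new_items = [
                    (l2, r2, d2 + 1, o2)
                    for (l2, r2, d2, o2) in list(sets[origin])
                    if d2 < len(r2) and r2[d2] == lhs
                ]
            for item in new_items:
                if item not in sets[i]:
                    sets[i].add(item)
                    worklist.append(item)
    return sets


def recognizes(grammar, start, tokens):
    """True if tokens, taken as a whole, form one `start` symbol."""
    final = earley_sets(grammar, start, tokens)[-1]
    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for (lhs, rhs, dot, origin) in final
    )
```

Each entry of the list returned by earley_sets is the "list of possibilities" for one position in the input, which is why the per-token cost is more than a single table lookup.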

In this context, I think some of the numbers below
may surprise people.
RE's beat everything else as long as they stick
to their game.
But these days they are often stretched beyond their limits,
usually without an appreciation of
how quickly their efficiency deteriorates
outside those limits.

Unanchored searches for balanced parens

I chose the problem solved
by
Regexp::Common::balanced --
unanchored searches for balanced parens.
"Unanchored" here means that the search
is not anchored to the beginning,
or any other specific point of the string.
Unanchored searches need
not only to identify the matching string,
but also to determine where in the searched string to find
the target.

When it comes to unanchored matches,
most users want the "first longest" match.
That is, they want the first match but,
when one match is completely contained in another one,
they want the longest match.
This is the problem in its hardest form.
It is simple to find
the match which ends first.
"First longest" needs the match which STARTS first.
"First longest" is the problem that Regexp::Common::balanced addresses.
For this benchmark,
I programmed Marpa to return exactly the same results
that Regexp::Common::balanced returns.
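
As a rough illustration of what "first longest" means for a single parenthesized group, here is a stack-based sketch in Python. It is my own simplification, not the code used in the benchmark, and I have not checked that it reproduces every result of Regexp::Common::balanced; the function name is hypothetical.

```python
def first_longest_group(text):
    """Return (start, end) of the balanced group that starts first.

    `end` is exclusive, so text[start:end] is the match.
    Returns None if text contains no balanced group.
    """
    open_positions = []  # indices of '(' that have not been closed yet
    best = None
    for index, ch in enumerate(text):
        if ch == "(":
            open_positions.append(index)
        elif ch == ")" and open_positions:
            start = open_positions.pop()
            # Inner groups close before the groups that contain them,
            # so keep replacing the candidate whenever an earlier
            # start position turns up.
            if best is None or start < best[0]:
                best = (start, index + 1)
    return best
```

Because a group that contains another group always starts earlier, picking the earliest start also picks the outermost group, so the inner "()" pairs never win over the group that encloses them.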

The examples I will use in this post
are sets of parens of varying length.
All the examples will have a prefix, a balanced paren "target", and
a short, unbalanced, trailer.
If the string is of length N,
the prefix consists of N-8 left parens.
The target is always this string: "(()())".
The trailer is always two left parens: "((".
Here, with spacing added for clarity, is the test string for length 20:

(((((((((((( (()()) ((
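
The construction described above (a prefix of N-8 left parens, the fixed target, and the two-paren trailer) can be written as a short Python helper. The function name make_test_string is my own; the benchmark itself was written in Perl and C.

```python
TARGET = "(()())"   # the balanced target, always the same
TRAILER = "(("      # the short, unbalanced trailer


def make_test_string(length):
    """Build the test string of the given total length."""
    prefix_length = length - len(TARGET) - len(TRAILER)  # this is N-8
    if prefix_length < 0:
        raise ValueError("length must be at least 8")
    return "(" * prefix_length + TARGET + TRAILER
```

For example, make_test_string(3000) produces the longest test string used in the results.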

The Results

Executions per second, by implementation:

Number of     libmarpa   Marpa::XS     Regexp::Common::   tchrist's   Marpa::PP
parens in     (pure C)   (mixed C      balanced           regex       (Pure Perl)
test string              and Perl)
-----------   --------   -----------   ----------------   ---------   -----------
         10    4524.89        111.71            3173.30    33429.33         47.39
        100    1180.64         58.96              62.09      197.25         15.35
        500     252.40         19.50               2.43        7.58          4.09
       1000     117.16         10.28               0.53        1.84          2.14
       2000      56.07          5.47               0.12        0.34          1.08
       3000      36.35          3.72               0.05        0.13          0.74

The above results are in executions per second -- a higher number means a faster algorithm.
These numbers are what happens when regexes are pushed beyond their limits.
Regexp::Common::balanced goes
quadratic
on these examples,
while all versions of Marpa stay linear.
(Here linear and quadratic refer to run time.
The results above are reported in executions per second,
and you need to take the reciprocal to get the time per execution.)
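
One way to read the growth rate off the table is to take reciprocals and compare two string lengths. The sketch below does that in Python for the 1000 and 2000 rows; the figures are copied from the table, while the function name and the choice of rows are mine. With only two data points and rounded rates, the result is a rough estimate, not a proof.

```python
import math

# Executions per second, copied from the results table.
RUNS_PER_SECOND = {
    "libmarpa": {1000: 117.16, 2000: 56.07},
    "Regexp::Common::balanced": {1000: 0.53, 2000: 0.12},
}


def growth_exponent(rates, n1, n2):
    """Estimate k in time ~ n**k from the rates at lengths n1 and n2."""
    time1 = 1.0 / rates[n1]  # seconds per execution at length n1
    time2 = 1.0 / rates[n2]  # seconds per execution at length n2
    return math.log(time2 / time1) / math.log(n2 / n1)


for name, rates in RUNS_PER_SECOND.items():
    print(name, round(growth_exponent(rates, 1000, 2000), 2))
```

The estimate comes out close to 1 for libmarpa and close to 2 for Regexp::Common::balanced, which is what linear and quadratic run times would look like.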

libmarpa, the pure C version of Marpa, is faster than Regexp::Common::balanced
on even the shortest example I tested.
Marpa::XS, the XS (mixed C and Perl) version, catches up with it when the
length of the test strings gets a little past 100 characters.
You would expect that Marpa::PP, which is implemented entirely in Perl,
would not have a chance against the Perl regex engine,
which is implemented in C.
But somewhere in the low hundreds Marpa::PP also catches up, and by the time
we are testing strings of 500 characters, Marpa::PP is running about 1.7 times as fast.

As the length of the test strings increases,
Marpa's relative advantage grows.
At 3000, Marpa::XS is over 74 times as fast as Regexp::Common::balanced,
and libmarpa is well over 700 times as fast.
Even Marpa::PP is nearly 15 times as fast,
and steadily improving its advantage.

[ In the comments, Tom Christiansen shares a Perl regex which is faster than Regexp::Common::balanced. I've included results for it in the table above. For discussion of it, see the comments. ]

And I hardly even cheated!

In this comparison, I tried to be "fair".
"Fair" can be hard to define in a benchmark.

The Choice of Tests

The test strings,
and the problem (unanchored searching for balanced parens)
were chosen to highlight the limits of Perl regexes.
On the other hand,
this problem is typical of the things programmers want to do,
as well as of the sort of thing they ask Perl regexes to do.

Official versus Developer's Versions

For Marpa::XS and Marpa::PP, I insisted on using official distributions out on CPAN.
My benchmarking script is
available as a gist on github.
These benchmarks were done using only the DOCUMENTED interfaces of
Marpa::XS 0.020000
and
Marpa::PP 0.010000.
Both these versions had several useful capabilities which are not in the documented
interface,
and, especially with the shorter test strings,
both Marpa::XS and Marpa::PP pay a real price for
my decision not to use them.
But I wanted to demonstrate the kinds of speeds that real users can get,
using what is actually on CPAN now, as it is documented TODAY.

The libmarpa numbers, on the other hand,
are for a developer's version.
The libmarpa library has never had a separate release,
and is not yet documented.
A stable libmarpa is part of Marpa::XS
and the latest code is in Marpa::R2.
The version used for these tests is the one
in
Marpa::R2
and
my benchmarking code
is in
the Marpa::R2 github repository.

Precompilation

Parsers, like Marpa and yacc, are designed for repeated use.
Regular expression engines, on the other hand, are often used in "one-shot"
applications.
When the application is viewed as a regex, it seems fair to include any precompilation
in the benchmark times.
If the application is thought of as parsing, it seems fair to allow the algorithms to see
the grammar first, without the clock running, and optimize the heck out of it.
Which is fair here?
The choice would make a real difference.
Marpa does a lot of precomputation, more than any other parsing algorithm I know of.

I decided that this test would be about pitting Marpa against regexes,
on the regexes' own turf and using their rules.
In all the tests above, for every repetition, Marpa was forced to redo
all its precomputations, while the clock was running.
For shorter tests in particular,
this put Marpa at a real disadvantage.
But it makes the results clearly relevant to the use of Marpa inside a regex engine.
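
The difference between the two ways of running the clock can be sketched with Python's own re module, used here only as a stand-in: it is neither Marpa nor the Perl regex engine, and the pattern, the text and the function names are assumptions chosen for illustration. In the "one-shot" variant the compilation is redone inside the timed loop, which is the rule Marpa was held to in the tests above.

```python
import re
import time

PATTERN = r"\(\)"                   # deliberately trivial pattern
TEXT = "(" * 50 + "()" + "(("


def executions_per_second(run_once, repetitions=200):
    """Time `repetitions` calls of run_once and return the rate."""
    start = time.perf_counter()
    for _ in range(repetitions):
        run_once()
    elapsed = time.perf_counter() - start
    return repetitions / max(elapsed, 1e-9)  # guard against a zero reading


def one_shot():
    """Pay the precompilation cost on every call, clock running."""
    re.purge()  # clear the module's cache of compiled patterns
    return re.compile(PATTERN).search(TEXT)


PRECOMPILED = re.compile(PATTERN)   # compiled once, before the clock starts


def reused():
    """Search with a pattern that was compiled ahead of time."""
    return PRECOMPILED.search(TEXT)


print("one-shot:", executions_per_second(one_shot))
print("reused:  ", executions_per_second(reused))
```

The gap between the two rates is the price of precomputation, and it matters most when the input is short.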

Reading the input string

Both libmarpa and Regexp::Common::balanced have a big advantage over
Marpa::XS and Marpa::PP.
Marpa::XS and Marpa::PP have to use Perl to convert the input string
into their internal format.
For Perl's regex engine and libmarpa, this is done in C.
I decided to require both algorithms to do their
string manipulation "with the clock running",
very much to the disadvantage of Marpa::XS
and Marpa::PP.

This disadvantage could be called unfair,
since the choice of language for
string manipulation is about interface and convenience,
and does not really have anything to do with the speed of the
underlying algorithms.
But I felt that,
when string conversion times were included,
run times were more realistic --
more like what would be encountered
in the actual applications that regexes are asked to deal with.
This handicap makes Marpa::XS's performance
all the more impressive.

A faster regex engine

I also want to suggest the possibility of using libmarpa to extend the Perl
regex engine.
It is possible to efficiently identify,
for any regex, the presence
of features that are problematic for an RE-based recognition engine.
A regex implementation could check for these and select a recognition
engine accordingly -- the traditional RE-based engine for simpler regexes or,
when it seems beneficial,
a Marpa-based recognition engine.

Notes

"regular expression":
In the pure mathematical sense,
a regular expression (RE) is a state machine
that recognizes only patterns that use
concatenation (ab),
repetition (a*),
alternation (a|b),
or some combination of these.
Perl regexes are sometimes regular expressions
or their equivalent.
More often they are not.
For example, any Perl regex which captures substrings is
not equivalent to a regular expression.

"token":
In the example in this post, "token" can be taken as a synonym for character.
RE's typically work with the individual characters
of character strings.
In this post, so does Marpa.
Typically, general purpose parsers like Marpa let a lower level
gather one or more individual characters into "tokens".

"paren":
Saying "parenthesis" gets tedious.
I often abbreviate it to "paren".

"quadratic":
By quadratic, I mean O(n^2) where n
is the length of the test string.
That Regexp::Common::balanced goes quadratic is my conjecture --
I have no proof.
But many of the simpler
approaches to unanchored search are quadratic,
and the benchmark numbers suggest this is the case here.