Mon, 28 Jun 2010

The Garden Path

syntax error at (eval 25) line 1, near ";4"
syntax error at (eval 25) line 1, near "}}

TPP is based on an algorithm called LALR.
(To avoid confusion when discussing various programs which parse Perl,
I will call the one that comes with the Perl distribution
the Tradition Perl Parser, or TPP.)
If you look at
perly.y,
you will find a real achievement
in LALR parsing.
and therefore in computer language parsing in general.
But LALR has its limits and even in the TPP
these show up every now and then.

Let's call our first example the Garden Path Code Block.
What is happening here?
Perl is being taken
"down the garden path" between the inner set of
curly braces.
TPP starts out down the path and, based on what it sees,
decides that it is parsing an anonymous hash.
TPP commits to that choice and then trainwrecks on the semicolon.
PPI
does no better.
PPI recognizes the Garden Path Code Block, but it parses it
as an anonymous hash, which is wrong.

I am prototyping a new Perl syntax analyzer using
Marpa,
my new parser.
Marpa parses anything you can write in BNF, using ideas based on those
by Jay Earley.
Marpa::Perl is still a prototype,
but it has no difficulty in
recognizing the Garden Path Code Block as a code block.

LALR's problem is that it is based on a state machine.
LALR's state machine ran at speeds and used memory
in keeping with the computer technology of 1969, which is
when LALR was invented.
But the LALR state machine needs to make decisions based
on limited context.
When the LALR parser in
perly.y
sees that semicolon, it is pretty much game over.

LALR has always been
a tight squeeze
for the grammars of practical programming languages.
This is despite the fact that LALR has had decades to
condition
language designers to accept its idea of what
"practical" means in a programming language.

For Marpa, the Garden Path Code Block is not
even a serious challenge.
Marpa is based on Jay Earley's ideas.
These were the subject of
my last post.
Briefly, the core idea is to create a table of "Earley items".
Earley items are short records indicating parsing progress,
keyed to locations in the input.
For an Earley-based parser, a "garden path" is dealt with by having
Earley items record the multiple possibilities.
This need to deal with alternative possibilities
is not dissimilar to the one
which motivates other parsers to use more expensive strategies,
such as backtracking or parallel processing.

What the Man Page Says

Suppose we go all the way down the "garden path",
and it still is not clear whether the curly braces
enclose an anonymous hash or a code block?
If we alter our example to replace the deadly semicolon
with an insidious comma, this is what it looks like:

{42;{1,2,3,4}}

And for Perl 5.10, this is the result:

{'1' => 2,'3' => 4}

PPI seems to agree with Perl 5.10.

Marpa::Perl, on the other hand, sees that this code is ambiguous.
It recognizes two parses.
The first parse is the one Perl and PPI see: an anonymous hash.
Marpa::Perl's second parse treats both sets of curlies as surrounding
code blocks.
In scalar context, the value which results from this second parse
is the number 4.

In cases of desperation,
we look at the documentation.
It takes some searching to find it,
but there is
a very relevant discussion in
a section of the perlref man page.
It is worth reading.
Of its several examples,
this is the most interesting:

sub showem { { @_ } } # ambiguous (currently ok, but may change)

Here Perl's rules of thumb result in the "{ @_ }"
being parsed as a code block --
the opposite of the behavior in our previous example.

The threat made in the comment --
that TPP's current behavior may change --
would be a hard thing to make good on.
Assuming array context,
the effect of having the
perlref
example flip semantics
would be to change
the return value from an
array into a hash ref.
That kind of silent change is not the kind
of thing I like to see in production code.

Most Perl programmers are probably not aware of this gotcha.
Even well-informed programmers
can easily write ambiguous code by accident,
and will receive no warning of their mistake.
Perhaps a good application for Marpa::Perl
would be spotting ambiguities like these.

How to Disambiguate

Once spotted, ambiguous curlies are easy to fix.
A standard way to disambiguate
an anonymous hash
is to prefix the
curly braces
with a plus sign:

{42;+{1,2,3,4}}

TPP, PPI and Marpa::Perl have no trouble
recognizing that the inner curly brackets enclose an anonymous hash in
the above.

A programmer can disambiguate code blocks by
placing a semicolon as the first thing inside the curly brackets.

{42;{;1,2,3;4}}

The inner curlies in the above are recognized
as defining a code block by all three parsers:
TPP, PPI and Marpa::Perl.