Marpa resources

Sun, 02 Dec 2012

Scannerless parsing

I've been working
on a "scannerless" Marpa interface.
"Scannerless" means that the user does not need to write
a separate lexer --
the lexer (scanner) is included in the parser.
One of my working examples is
the synopsis from
the main Marpa::R2 POD page,
rewritten to do its own lexing:

rules with a tilde ("~") to separate
LHS from RHS: these indicate rules whose
whitespace is to be left as-is

single-quoted strings, to tell Marpa which
character to look for; and

square-bracketed character classes, to
tell Marpa to look for a class of characters.
Their interpretation is done by Perl,
and therefore the allowed classes are exactly those
accepted by your version of Perl.

Valid strings in this language are "15329 + 42 * 290 * 711",
"42*3+7",
"3*3+4* 4",
along with all their whitespace variants.

My recent posts have been tutorial.
My work on scannerless parsing is
not quite ready for a tutorial presentation,
so this post will be conceptual.
It is about an interesting issue that arises in
scannerless parsing,
one which Perl 6 also had to solve,
and which Marpa solves in a new and different way.
That issue is whitespace.

Dealing with whitespace

For the statements with a declaration operator of
::=,
whitespace is handled automatically by Marpa.
Valid strings in the above language are
"42*3+7",
"42 * 3 + 7" and
"42 * 3+7",
all of which yield 133 as the answer.
The trick is to, on one hand, allow whitespace to be optional
and, on the other hand, recognize that strings like "42"
must be a single number.
That is, the parser should not recognize optional whitespace
between the two digits and decide that
"42",
is actually two numbers:
"4" and
"2".

The Perl 6 project has already taken on scannerless parsing.
My methods for dealing with whitespace are based on theirs.
Central to
their solution is "smart whitespace".
("Smart whitespace" is my term --
the
Perl 6 doc
is more matter-of-fact.)
Smart whitespace is whitespace which is optional, except between
word characters.
Stated another way, smart whitespace is either explicit whitespace,
or a word boundary.
In the case of "42",
"4" and
"2" are both word characters, so there is no
word boundary between them, and therefore no smart whitespace.

Implementing smart whitespace

Left parsers (like that which Perl 6 uses)
often know very little about the context of the parse.
But left parsers do know the current "character transition" --
what the previous character was,
and what the current character is.
In a left parser, finding word boundaries for the
purpose of detecting smart whitespace fits in
nicely with the way it works in general.

Marpa, of course,
also knows the previous and current characters.
It is certainly possible for
Marpa to check every transition for a word boundary.
But in Marpa's case, this check would
be an additional overhead, handling just one special case.
It'd be nice if we could look for word boundaries in a cool Marpa-ish way,
preferably one with efficiency advantages.

Out come the Ruby Slippers

"Ruby Slippers" parsing, as a reminder, is new with Marpa,
despite seeming a very obvious concept.
It amounts to adjusting the input to the parser based on what
the parser wants.
This can be seen as assuring the parser that whatever it wishes
for will happen, the same power that was conferred on Dorothy
in
Wizard of Oz
by a happy choice of footware.

To make the Ruby Slippers work in this case,
we make a word boundary a special kind of virtual token,
and we define smart whitespace to be one of two things:

A sequence of one or more characters of
real, physical whitespace.

A virtual word-boundary token.

We then proceed normally with the parse,
until there's a problem.
When the parser reports a problem,
we ask it if it is looking for one
of the virtual word boundary tokens.
If so, we give it one and continue.
Why does life have to be difficult?

Comments on this post
can be sent to the Marpa Google Group:
marpa-parser@googlegroups.com