Marpa resources

Sun, 11 Nov 2012

Pattern searches

We use regular expressions for pattern searching these days.
But what if your search target is not a regular expression?
In this post I will show how to use Marpa to search text files for
arbitrary context-free expressions.

This tutorial builds on earlier tutorials.
It is possible to simply dive into it,
but it may be easier
to start with two of my earlier posts,
here
and
here.

The grammar

I will use arithmetic expressions as
the example of a search target.
Even the arithmetic subset of Perl expressions is quite complex,
but in this case we can get the job done
with eight lines of grammar and a lexer driven
by a table of just over a dozen lines.
Here is the grammar:

This grammar uses
Marpa::R2's BNF interface.
It takes considerable advantage of the fact that we are not
parsing
these expressions, but
recognizing
them.
Because of this, we don't have to specify whether expressions left- or right-associate.
We can also ignore what operators mean and group them according to syntax only
-- binary, prefix unary and postfix unary.
Similarly, we can ignore the precedence within these large groups.
This leaves us with numbers, scalars,
parentheses,
and binary, prefix unary and postfix unary operators.
(To keep this example simple, we restrict the primaries
to numeric constants and Perl scalars.)

What we are searching for is defined by the
target
symbol.
For
target
you could substitute
the start symbol of
any context-free grammar,
and the structure of this example will still work.
To turn a parser for
target
into a pattern searcher, we add a new start
symbol (unimaginatively named "start")
and two rules that
allow the target to have a
prefix.

Ambiguous parsing

To do an anchorless pattern search,
this example will use ambiguous parsing.
This grammar always has at least one parse going,
representing the prefix for
the zero or more targets
that our parser
expects to find in the future.
The prefix will never end, because
any token (as indicated by a token
named, literally,
any_token)
extends it.

If we are in the process of recognizing a
target,
we will have one or more other parses going.
I say "one or more" because the search method
described in this post
allows target to be ambiguous.
But arithmetic expressions,
the target pattern used in this example,
are not ambiguous.
So our example will have
at most two parses active at any point:
one for the prefix and another for the target.

Ambiguous parsing has a serious potential downside --
it is not necessarily linear
and therefore not necessarily efficient.
But Marpa can parse many classes of ambiguous grammar in linear time.
Grammars like the one in this post --
a prefix and an unambiguous search target --
fall into one of the linearly parseable classes.
Keeping the prefix going requires a tiny constant overhead per token.

Order is significant here.
In particular
two-character operators are checked for first.
This guarantees that
two consecutive minus signs
will be seen as an
decrement operator, and not as a double negation.

Ambiguous lexing

The very careful reader may have noticed that
any_token
is not in the lexing table.
The main loop is written so that every token is read as an
any_token.
If no token from the lexing table is accepted,
the next character in the input stream
is read as an
any_token.
If a token from the lexing table
is
accepted,
then it gets read twice,
once as an
any_token,
and once as the token type taken from the lexing table
entry.

Ambiguous lexing is a familiar technique to
the Natural Language Processing community.
Engish, in particular, is a language that abounds
in lexemes that can play multiple roles.
The word "sort", for example, can easily be
an noun, a verb or an adjective.

The Ruby Slippers

The main loop will also be a simple case of the use
of the Ruby Slippers.
For those unfamiliar,
the "Ruby Slippers" parsing technique handles difficult lexing
and parsing problems by asking the parser, at the problem point,
what it is looking for,
and providing it.
This seems a fairly obvious approach,
but the Ruby Slippers are new with Marpa --
traditional parsers could not easily
determine where they were in a parse.

One way to use the Ruby Slippers is to ask the parser in
advance what it is looking for.
The code that follows uses another method.
Instead of determining in advance what tokens to read,
it simply feeds tokens to the parser.

Token rejection is a "soft" error -- it costs
little to try, and little to retry.
The following code can
efficiently determine which entry in the lexing table is appropriate,
simply by trying each of them in order.
If the
alternative()
method returns a Perl
undef,
indicating that a token was rejected,
then the main loop will try later entries in the lexing table.

When a token is accepted,
the main loop can safely assume that it is on the right track.
Marpa is 100% accurate about
which tokens can and cannot result in a successful parse.

The main loop

The main loop iterates through input looking for tokens.
Whitespace is skipped.
Comments are not skipped.
Finding arithmetic expressions in
strings and/or comments can be useful.
We will assume that is the case here.

The
earleme_complete()
method tells Marpa that all the alternatives
at one location have been entered,
and that the parse should now move on to the next location.
(Marpa's idea of location is called an "earleme", in honor of the great
parsing theorist, Jay Earley.)

How to parse without really trying

At this point, I want to draw the reader's attention to the code
that deals with special cases for the minus sign.
Specifically, to the fact that there is no such code.
The more familiar you are with PPI and/or
perly.y,
the more remarkable this will seem.

To take one example, PPI correctly realizes that the minus
sign in
"1+2-3" is a binary operator.
However PPI fails on "(1+2)-3" --
it thinks the minus sign is part of the number "-3".
Why don't the authors of PPI just look at the Perl
interpreter and copy the logic there?
Take a glance at perly.y
and toke.c
and you will know the answer to that question.

What is PPI's problem here?
The problem is that,
without knowing where you are in the expression,
you cannot tell whether a minus sign is a unary
operator or a binary operator.
And the parse engines for PPI and for Perl itself,
while quite different in many respects,
share a property common to traditional parsers --
in determining context
they offer the lexer, respectively,
little and no help.

In the code in this example,
Marpa's alternative() method is, by accepting
and rejecting tokens, guiding the lexer to the right choice.
Because of Perl's grammar, a minus sign at a given position
cannot be both a unary operator and a binary operator.
And Marpa is 100% accurate in its knowledge of which
tokens are possible.
So Marpa's
alternative() method
always knows whether a minus sign can be
a unary or binary operator and accepts
or rejects the token accordingly.

This is the Ruby Slippers in action --
a very simple solution to what for the Perl
interpreter and PPI
is a very complicated problem.
When I developed the Ruby Slippers technique,
my most serious problem
was convincing myself that something
so simple could really work.

Finding the targets

Once the parse is complete, it remains to find
and print the "targets" found
by the search.
In
a previous post,
I showed how,
given a symbol name,
to find the last occurrence of the symbol in a Marpa parse.
That routine needed to be modified to allow repeated searches,
but the change was straightforward.
The code is in the
gist,
and the ideas behind it were explained
in
the previous post,
so I won't repeat them here.