Introducing the problem

This technique will work for
languages other than Perl.
In fact, this technique could be used to look for several different
target languages
at once,
for example, when searching a database
of emails that might include code segments
in Perl, HTML, C, etc.

For clarity, I will often refer to the
Perl and C programs
being sought as targets
in text as "code segments".
In the context of this blog post,
a string is only considered to be a valid
"code segment" if it is also a valid
and complete program.

Perl all by itself presents just about every parsing difficulty there is,
and it will work fine for this discussion.
We will search for the "longest first match", which is harder
than most alternatives,
but which is also what applications usually want.
For example,
take the following:

abcdef ++$a; ++$b; ghijkl

Here the null string is the first valid Perl code segment,
and to be pedantic about it,
it is the longest first match.
But zero length Perl code segments are
not of interest to most applications,
so we'll ignore them from here on out.
Ignoring zero-length Perl code segments,
the string "++$a" is a first match,
but is still not the longest first match.
" ++$a; ++$b; "
is valid Perl code,
and that is the longest first match.

The search can get considerably more complicated.
Consider this code:

abcdef sub f { ++$a; ++$b; } ghijkl

Here
" sub f { ++$a; ++$b; } "
is the longest first match, but until the closing curly brace is seen it is not
clear where the longest first match will start, much less end.

The Basic Idea

We note that no longest first match can start after some other match ends.
That means there are two phases in the search -- one where we should allow
targets to start, and one where we should not.

Let's assume that,
whenever two different code segments share some of the input tokens,
one contains the other -- that is, if strings
AB
and
BC
are valid code segments, then
ABC
is valid code.
(We'll call this the Overlap Closure Property.)
Our strategy will be to parse in two "modes".
In prefix mode, we track all possible Perl code segments.
Prefix mode ends when the first non-zero length Perl code segment ends.
At that point we know that one of the current candidates must
be the longest first match.

After prefix mode,
we continue to read tokens,
building all our candidate code segments,
and keeping track of where the most recent one ended.
We do this until either parsing cannot continue,
or we have run out of tokens.
As we'll show more carefully below,
once we can no longer continue parsing,
the Perl code segment that ended most recently will be
the longest first match.
If several end at the same location,
the longest of them will be the longest first match.

To actually implement this idea,
I'll need to use a series of tricks:
Prefixing, the Ruby Slippers
and the Bigfoot Maneuver.
I'll describe these next,
then I'll put them together as an algorithm.

The technique I outline in this post
has the advantage that it does not require
rewriting the rules
and symbols of the original, non-partial-parsing, grammar.
New rules and symbols are added "on top of" the original
grammar.

Prefixing

Prefixing is the simplest of the tricks.
As long as it is possible for a longest first match to start,
my Marpa-based Perl parser is in "prefix mode".
Once the end of a non-trivial Perl code segment is seen,
"prefix mode" is turned off.

The
<target>
symbol represents the longest first match --
the "target" that we are looking for.
The plus sign after
<non_perl_token>
indicates that
<non_perl_prefix>
is a non-zero length sequence of
<non_perl_token>
symbols.
<non_perl_token>'s
are just "aliases"
for the application's normal tokens.
Marpa allows tokens to be ambiguous in this way.

While prefixing is turned on, every token is read as
both its normal self within the original grammar,
and as a
<non_perl_token>.
Initially, prefixing is turned on.
It will be turned off as described later.

The Ruby Slippers

The Ruby Slippers technique, as many of my readers will recall,
is wishful thinking, brought to life and made effective within the world
of parsing.
So, let's ask ourselves, what could we wish for that would make finding
the longest first match easy?

We close our eyes, click our sanguine heels and wish
that the longest first match had markers in the input,
one at its beginning, and one at its end.
As Glinda gently urges us on,
we add that to our grammar:

<target> ::= <target_start_marker> <prog> <target_end_marker>

Here prog is the top level symbol of the Perl grammar --
the original grammar's "start" symbol.

Recall that in the Ruby Slippers technique, the grammar assumes that the
world is unrealistically easy, orderly, etc.
And the lexer makes the parser's wishes come true.
For the partial Perl parser,
we write the lexer so that, as long as prefixing is on,
whenever the parser wants a
<target_start_marker>,
it gives it one.

The Bigfoot Maneuver

In parsing, the Bigfoot maneuver uses "bigfoot" tokens.
We call them "bigfoot" tokens because,
while we say they're out there,
we never actually encounter one,
and in fact are fortunate not to do so.

Marpa, at every point, knows which tokens it is expecting.
We can use this feature, along with "bigfoot" tokens, to signal events to the
lexer.
When we want the parser to signal some event to the lexer, we arrange for it
to expect a "bigfoot" token.
Let's look again at this rule:

<target> ::= <target_start_marker> <prog> <target_end_marker>

Here
<target_end_marker>
is a "bigfoot" token.
We won't ever see one, but the fact that the parser is looking
for one will tell us that we have found a Perl
prog.

We'll use another bigfoot token to deal with
zero-length Perl code segments.
A null string is a legal Perl code segment, but we don't want to count
zero-length code segments as longest first matches.
We could deal with this
by rewriting the Perl grammar slightly,
in order to exclude zero length Perl code segments.
But it is desirable to leave Perl's rules and symbols untouched.
So instead, we introduce another pair of rules

<line>
and
<decl>
are two of the original Perl grammar's symbols.
<line>
and
<decl>
are never zero length.
Every Perl
prog
is a sequence of zero or more
<line>'s
and
<decl>'s.

We define a "non-trivial" Perl code segment as one that contains a
<line>
or a
<decl>.
To signal the lexer that we have found a non-trivial Perl code segment, we use
<non_trivial_target_end>
as a bigfoot token.

The Strategy

We can now describe the algorithm.
The program to find all the Perl in an arbitrary text is a loop,
which walks through the input finding one longest first match at a time.
Each search for a longest first match is implemented as a separate parse.
When a longest first match is found, a new search begins at the next token
after the match.

To find a longest first match,
we start out in prefix mode.
In prefix mode,
every Perl token is read in two ways.
First, it is read
the same way
that it was read in the original Perl grammar.
Second, it is read
as a
<non_perl_token>.

In addition, as long as we are in prefix mode,
whenever the parser demands a
<target_start_marker>,
we provide it.
This means that while in prefix mode,
we are tracking all possible Perl code segments,
as well as extending the string of
<non_perl_token>'s to
act as the prefix to any new target candidate that we encounter.
Note that the last prefix token -- the one that we read just before we turn prefix mode off --
will be in all of our target candidates.
This is important because it guarantees that they will all overlap.

When we see that a
<non_trivial_target_end>
bigfoot token is expected,
we turn prefix mode off.
We now have at least one candidate for a target and,
since we are not extending the prefix, we will start no new
target candidates.
For the rest of this parse,
we will only lengthen the already started Perl code segments.

Whether in prefix mode or not,
every time the parser requests a
<target_end_marker>
bigfoot token,
we record our location.
We never actually provide a
<target_end_marker>
as input.
Bigfoot tokens are not Ruby Slippers tokens.
This means that rules that have bigfoot tokens are
"virtual" rules in the sense that they will never actually
be completed.

Parsing ends when it is "exhausted"
or when we run out of tokens.
An "exhausted" parse is one which cannot continue successfully,
although it may have already succeeded.
(In fact, since the partial parser extends the prefix forever
unless a match is found,
in our case an "exhausted" parse MUST be a successful parse.)

Once parsing ends,
we call the recognizer's
progress()
method to get a list of all the
prog's ending
at the last location where we expected a
<target_end_marker>
bigfoot.
The longest of the completed
prog rules
found in our
progress()
report for that last location is our longest first match.
If there is no last location,
we will also have run out of tokens,
and we are done processing the input.

Other approaches

The above approach is not the only way to do partial parsing in Marpa.
In fact, you can do it much more simply using only the Prefixing technique,
along with a similar "Postfixing" technique.
The problem is that this method easily goes cubic (O(n3)),
and the worst case
(the one where most or all of the input text is a single Perl code segment)
is also one of practical interest.
The approach outlined above will be linear (O(n)) for all cases of practical interest.
I qualify my assertion by saying "cases of practical interest"
because in fact Perl parsing is, in the general case, undecidable.

The method outlined in this post could be extended
to deal with languages that do not have
the Overlap Closure Property.
In fact,
it is not completely clear that Perl,
in general,
has the Overlap Closure Property.
The C language does not in general have the Overlap Closure Property,
at least not if the preprocessor is taken into account.
Using the C preprocessor, it would be easy
to construct strings
X,
Y,
Z
where
X,
Y,
Z,
XY
and
YZ
are all completely valid C,
but
XYZ
has a syntax error.
But, for Perl and C code of practical interest,
it is not unreasonable to expect the Overlap Closure Property will hold.