Marpa resources

Wed, 06 Jan 2016

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'"
--
George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly.
It was not clear which way they were headed.
Would most programming be done with
general-purpose languages?
Or would programmers create a language for every task domain?
Or even for every project?
And, if lots of languages were going to be created,
what kinds of languages would be needed?

It was in that context that Čulik and Cohen,
in
a
1973 paper,
outlined what they thought programmers would want and should have.
In keeping with the spirit of the time,
it was quite a lot:

Programmers would want to extend their grammars with new syntax,
including new
kinds of expressions.

Programmers would also want to use tools that automatically generated new syntax.

Programmers would not want to, and especially
in the case of auto-generated syntax
would usually not be able to,
massage the syntax into very restricted forms.
Instead, programmers would create grammars and languages
which required unlimited lookahead to disambiguate,
and they would require parsers which could handle these grammars.

Finally, programmers would need to be able to rely on
all of this parsing being done in linear time.

Today, we think we know that
Čulik and Cohen's vision was naive,
because we think we know that parsing technology cannot support it.
We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars.
The "eyeball" class of grammars contains all the grammars that a human
can parse at a glance.
If a grammar is in the eyeball class,
but a computer cannot parse it,
it presents an interesting choice. Either,

your computer is not using the strongest practical algorithm; or

your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them)
who don't believe that everything the mind can do reduces
to a machine computation.
But even those people
will tend to go for the choice in this case:
There must be some practical computer parsing algorithm which
can do at least as well at parsing as a human
can do by "eyeball".
In other words, the class of "reasonable grammars" should
contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars"
were the grammars that
a deterministic parse engine
could parse if it had a lookahead that was infinite,
but restricted to distinguishing between regular expressions.
They called these the LR-regular, or LRR, grammars.
And the LRR grammars
do
in fact seem to be a good first approximation
to the eyeball class.
They do not allow lookahead that contains things
that you have to count, like palindromes.
And, while I'd be hard put to eyeball every possible string for every possible regular expression,
intuitively the concept of scanning for a regular expression
does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical.
But in 1991, Joop Leo discovered a way to adopt Earley's algorithm to parse the LRR grammars
in linear time, without doing the lookahead.
And Leo's algorithm does have a practical implementation:
Marpa.