2019 Event Schedule

Have you viewed the TPCiP schedule yet? Visit the 2019 TPCiP event schedule and check out the "Sched" app (App Store, Google Play) for your viewing convenience. Make an account so you can plan out your daily conference schedule!

Sign up for a master-class tutorial!

Here's the lineup:

June 16:

Practical Perl 6 with Jeff Goff (full day) -- $115

Introduction to Git (even for non-developers!) with John Anderson (half day) -- $65

Setting you up the bomb: interactive git rebase for the win with John Anderson (half day) -- $65

Birds of a Feather (BoF) event registration

Do you have a gathering you'd like to host? Do you need room space to do that? We'd love to help! By registering your BoF early you can guarantee the most visibility through promotion. Registration does not guarantee room assignment. Rooms will be reserved for BoFs on a first come, first served basis. Registration submissions will be accepted until June 1st. Submit your BoF information today.

Conference Tickets and Hotel Room Reservations

The conference events and recommended lodging are conveniently in one place:

The organizers can help you with your reservation if you need an accessible room. Please email admin@perlconference.us for assistance.

Sponsor Spotlight: Geekuni

Get your newbies up to speed! Do you wish you could expand your Perl team without putting everything on hold? Our sponsor, Geekuni, is offering a 2-week FREE TRIAL of online Perl training for two team members - worth $750! Send us an email at admin@perlconference.us with subject "Geekuni FREE TRIAL" and we'll put you in touch. There is limited space available - book your free trial now!

Want to sponsor The Perl Conference?

Sponsors can also donate directly to The Perl Conference in Pittsburgh! Often the organizers are able to link a sponsor directly to an event at the conference (wifi, coffee break, etc.) where the cost of the event is in line with the donation amount. Email treasurer@perlfoundation.org for more information or to donate directly to this year's event.

Sponsors of $500 or more will be provided a table at our sponsor expo and job fair.

The challenge

A recent
blog post by Michael Arntzenius ended with a friendly challenge to Marpa.
Haskell list comprehensions are something that
Haskell's own parser handles only with difficulty.
A point of Michael's critique of Haskell's parsing was
that Haskell's list comprehension could be even more powerful if not
for these syntactic limits.

Michael wondered aloud if Marpa could do better.
It can.

The problem syntax occurs with the "guards",
a very powerful facility of
Haskell's list comprehension.
Haskell allows several kinds of "guards".
Two of these "guards" can have the same prefix,
and these ambiguous prefixes can
be of arbitrary length.
In other words,
parsing Haskell's list comprehension requires
either lookahead of arbitrary length,
or its equivalent.

To answer Michael's challenge,
I extended my Haskell subset parser to deal with
list comprehension.
That parser, with its test examples, is online.[1]
I have run it for examples thousands of tokens long and,
more to the point,
have checked the Earley sets to ensure that Marpa
will stay linear,
no matter how long the ambiguous prefix gets.[2]

Earley parsing, which Marpa uses,
accomplishes the seemingly impossible here.
It does the equivalent of infinite lookahead efficiently,
without actually doing any lookahead or
backtracking.
That Earley's algorithm can do this has been a settled
fact in the literature for some time.
But today Earley's algorithm is little known even
among those well acquainted with parsing,
and to many, claiming the equivalent of infinite lookahead,
without actually doing any lookahead at all,
sounds like a boast of magical powers.

In the rest of this blog post,
I hope to indicate how Earley parsing follows more than
one potential parse at a time.
I will not describe Earley's algorithm in full.[3]
But I will show that no magic is involved,
and that in fact the basic ideas behind Earley's method
are intuitive and reasonable.

A quick cheat sheet on list comprehension

List comprehension in Haskell is impressive.
Haskell allows
you to build a list using a series of "guards",
which can be of several kinds.
The parsing issue arises because two of the guard types --
generators and boolean expressions --
must be treated quite differently,
but can look the same over an arbitrarily long prefix.

Generators

Here is one example of a Haskell generator,
from the test case for this blog post:
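
(The display below is a reconstruction, not the verbatim test code; the identifiers list and xss are those quoted in the description that follows.)

    list = [x | [x, 1729, 99] <- xss]

(For instance, if xss is [[3, 1729, 99], [4, 1729, 99], [5, 6, 7]], then list is [3, 4].)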

This says to build a list of x's
such that the guard
[x, 1729, 99 ] <- xss
holds.
The clue that this guard is a generator is the
<- operator.
The <- operator
will appear in every generator,
and means "draw from".

The LHS of the <- operator is a pattern
and the RHS is an expression.
This generator draws all the elements from xss
which match the pattern [x, 1729, 99 ].
In other words, it draws out
all the elements of xss,
and tests that they
are lists of length 3
whose last two subelements are 1729 and 99.

The variable x is set to the first subelement.
list will be a list of all those x's.
In the test suite, we have
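something along these lines (a reconstruction, not the verbatim test code; ys is assumed to be a list already in scope):

    list2 = [ x
            | [x, 1729, 99] <- xss    -- generator
            , [x, 1729, 99] == ys     -- Boolean guard
            , [x, 1729, 99] <- xss    -- generator
            ]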

Boolean guards versus generators

From the parser's point of view, Boolean guards
and generators start out looking the same --
in the examples above, three of our guards start out
the same -- with the string [x, 1729, 99 ],
but

in one case (the Boolean guard),
[x, 1729, 99 ] is the beginning of an expression; and

in the other two cases (the generators),
[x, 1729, 99 ] is a pattern.

Clearly patterns and expressions can look identical.
And they can look identical for an arbitrarily long time --
I tested the Glasgow Haskell Compiler
(GHC)
with identical expression/pattern prefixes
thousands of tokens in length.
My virtual memory eventually gives out,
but GHC itself never complains.[8]
(The comments "insert more here" show the points at which the
comma-separated lists of integers can be extended.)

The problem for parsers

So Haskell list comprehension presents a problem for parsers.
A parser must determine whether it is parsing an expression or
a pattern, but it cannot know this for an arbitrarily long time.
A parser must keep track of two possibilities at once --
something traditional parsing has refused to do.
As I have pointed out[9],
belief that traditional parsing "solves" the parsing problem is
belief in human exceptionalism --
that humans have calculating abilities that Turing machines do not.
Keeping two possibilities in mind for a long time is trivial for
human beings -- in one form we call it worrying,
and try to prevent ourselves from doing it obsessively.
But it has been the orthodoxy that practical parsing algorithms
cannot do this.

Arntzenius has a nice summary of the attempts to parse this
construct while only allowing one possibility at a time --
that is, deterministically.
Lookahead clearly cannot work -- it would have to be arbitrarily
long.
Backtracking can work, but can be very costly
and is a major obstacle to quality error reporting.

GHC avoids the problems with backtracking by using post-processing.
At parsing time, GHC treats an ambiguous guard as a
Boolean.
Then, if it turns out that it is a generator,
it rewrites it in post-processing.
This inelegance incurs some real technical debt --
either a pattern must always be a valid expression,
or even more trickery must be resorted to.[10]

The Earley solution

Earley parsing deals with this issue by doing what
a human would do --
keeping both possibilities in mind at once.
Jay Earley's innovation was to discover a way for a computer
to track multiple possible parses
that is compact,
efficient to create,
and efficient to read.

Earley's algorithm maintains an "Earley table"
which contains "Earley sets",
one for each token.
Each Earley set contains "Earley items".
Here are some Earley items from Earley set 25
in one of our test cases:
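
(The items below are illustrative rather than verbatim: the symbol names are invented for this display, but the dot positions and origins match the discussion that follows.)

    origin=22  <comprehension> ::= '[' <expression> '|' . <guards> ']'
    origin=25  <guards>        ::= . <guard>
    origin=25  <guards>        ::= . <guards> ',' <guard>
    origin=25  <guard>         ::= . <pattern> '<-' <expression>
    origin=25  <guard>         ::= . <expression>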

In the code, these represent the state of the parse just after
the pipe symbol ("|") on line 4 of our test code.

Each Earley item describes progress in one rule of the grammar.
There is a dot (".") in each rule,
which indicates how far the parse
has progressed inside the rule.
One of the rules has the dot just after the pipe symbol,
as you would expect, since we have just seen a pipe symbol.

The other four rules have the dot at the beginning of the RHS.
These four rules are "predictions" -- none of their symbols
have been parsed yet, but we know that these rules might occur,
starting at the location of this Earley set.

Each item also records an "origin": the location in the input where
the rule described in the item began.
For predictions the origin is always the same as the Earley set.
For the first Earley item, the origin is 3 tokens earlier,
in Earley set 22.

The "secret" of non-determinism

And now we have come to the secret of efficient non-deterministic parsing --
a "secret"
which I hope to convince the reader is not magic,
or even much of a mystery.
Here, again, are two of the items from Earley set 25:
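
(In the same illustrative notation as before.)

    origin=25  <guard> ::= . <pattern> '<-' <expression>
    origin=25  <guard> ::= . <expression>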

At this point there are two possibilities going forward --
a generator guard or a Boolean expression guard.
And there is an Earley item for each of these possibilities in the Earley set.

That is the basic idea -- that is all there is to it.
Going forward in the parse, for as long as both possibilities stay
live, Earley items for both will appear in the Earley sets.

From this point of view,
it should now be clear why the Earley algorithm can keep track
of several possibilities without lookahead or backtracking.
No lookahead is needed because all possibilities are in the
Earley set, and selection among them will take place as the
rest of the input is read.
And no backtracking is needed because every possibility
was already recorded -- there is nothing new to be found
by backtracking.

It may also be clearer why I claim that Marpa is left-eidetic,
and how the Ruby Slippers work.[13]
Marpa has perfect knowledge of everything in the parse so far,
because it is all in the Earley tables.
And, given left-eidetic knowledge, Marpa also knows what
terminals are expected at the current location,
and can "wish" them into existence as necessary.

The code, comments, etc.

A permalink to the
full code and a test suite for this prototype,
as described in this blog post,
is
on Github.
In particular,
the permalink of the
test suite file for list comprehension is
here.
I expect to update this code,
and the latest commit can be found
here.

Footnotes

1.
If you are interested in my Marpa-driven Haskell subset parser,
this blog post
may be the best introduction.
The code is
on Github.
↩

2.
The Earley sets for the ambiguous prefix immediately reach a size
of 46 items, and then stay at that level.
This is experimental evidence that the Earley set
sizes stay constant.

And, if the Earley items are examined,
and their derivations traced,
it can be seen that
they must repeat the same Earley item count
for as long as the ambiguous prefix continues.
The traces I examined are
here,
and the code which generated them is
here,
for the
reader who wants to convince himself.

The guard prefixes of Haskell are ambiguous,
but (modulo mistakes in the standards)
the overall Haskell grammar is not.
In the literature on Earley's,
it has been shown that for an unambiguous grammar,
each Earley item has a constant amortized cost in time.
Therefore,
if a parse produces
Earley sets that are all below a constant size,
it must have linear time complexity.
↩

3.
There are many descriptions of Earley's algorithm out there.
The
Wikipedia page on Earley's algorithm
(accessed 27 August 2018)
is one good place to start.
I did
another very simple introduction to Earley's in
an
earlier blog post,
which may be worth looking at.
Note that Marpa contains
improvements to Earley's algorithm.
Particularly, to fulfill Marpa's claim of linear time for all
LR-regular grammars, Marpa uses Joop Leo's speed-up.
But Joop's improvement is not necessary or useful
for parsing
Haskell list comprehension,
is not used in this example,
and will not be described in this post.
↩

Always approach a case with an absolutely blank mind.
It is always an advantage.
Form no theories, just simply observe and draw inferences from your observations.
—
Sherlock Holmes, quoted in "The Adventure of the Cardboard Box".

It is a capital mistake to theorize before one has data.
—
Holmes, in "A Scandal in Bohemia".

I make a point of never having any prejudices, and of following docilely wherever fact may lead me.
—
Holmes, in "The Reigate Puzzle".

When you have eliminated the impossible, whatever remains, no matter how improbable, must be the truth.
—
Holmes, in "The Sign of Four".

In imagination there
exists the perfect
mystery story.
Such a story presents
the essential clues, and compels us to form our own
theory of the case.
If we
follow the plot carefully, we arrive at the complete
solution for ourselves just before the author's disclosure
at the end of the book. The solution itself, contrary to
those of inferior mysteries, does not disappoint us;
moreover, it appears at the very moment we expect it.
Can we liken the reader of such a book to the scientists,
who throughout successive generations continue to seek
solutions of the mysteries in the book of nature? The
comparison is false and will have to be abandoned later,
but it has a modicum of justification which may be
extended and modified to make it more appropriate to
the endeavour of science to solve the mystery of the
universe.
—
Albert Einstein
and Leopold Infeld. [1]

The Sherlock Holmes approach

Einstein assumed his reader's idea of methods of proper investigation,
in science as elsewhere,
would be similar to those of Conan Doyle's Sherlock Holmes.
I will follow Einstein's lead in starting there.

The deductions recorded in the Holmes canon
often involve
a lot
of theorizing.
To make it a matter of significance what the dog in "Silver Blaze" did in the night,
Holmes needs a theory of canine behavior,
and Holmes' theory sometimes outpaces its pack of facts by a considerable distance.
Is it really true that only dangerous people own
dangerous dogs?[2]

Holmes's methods, at least as stated in the Conan Doyle stories,
are incapable of solving anything
but the fictional problems he encounters.
In real life, a "blank mind" can observe nothing.
There is no "data" without theory, just white noise.
Every "fact" gathered relies on many prejudgements about what is
relevant and what is not.
And you certainly cannot characterize anything as "impossible",
unless you have, in advance, a theory about what is possible.

The Einstein approach

Einstein, in his popular account
of the evolution of physics,
finds the Doyle stories "admirable"[3].
But to solve real-life mysteries, more is needed.
Einstein begins his description of his methods at the start
of his Chapter II:

The following pages contain a dull report of
some very simple experiments.
The account will be boring
not only because the description of experiments is uninteresting
in comparison with their actual performance,
but also because the meaning of the experiments does
not become apparent until theory makes it so. Our
purpose is to furnish a striking example of the role of
theory in physics.
[4]

Einstein follows with a series of the kind of experiments
that are performed in high school physics classes.
One might imagine these experiments allowing an observer
to deduce the basics of electromagnetism
using materials and techniques available for centuries.

But, and this is Einstein's point,
this is not how it happened.
The theory came
first,
and the experiments were devised afterwards.

In the first pages
of our book we compared the role
of an investigator
to that of a detective who, after
gathering the requisite facts, finds the right solution
by pure thinking. In one essential this comparison must
be regarded as highly superficial. Both in life and in
detective novels the crime is given. The detective must
look for letters, fingerprints, bullets, guns, but at least
he knows that a murder has been committed. This is
not so for a scientist. It should not be difficult to
imagine someone who knows absolutely nothing about
electricity, since all the ancients lived happily enough
without any knowledge of it. Let this man be given
metal, gold foil, bottles, hard-rubber rod, flannel, in
short, all the material required for performing our
three experiments. He may be a very
cultured person,
but he will probably put wine into the bottles, use the
flannel for cleaning, and never once entertain the idea
of doing the things we have described.
For the detective
the crime is given, the problem formulated: who
killed Cock Robin?
The scientist must, at least in part,
commit his own crime, as well as carry out the investigation.
Moreover, his task is not to explain just one
case, but all phenomena which have happened
or may
still happen. — Einstein and Infeld [5]

Committing our own crime

If, then,
we must commit the crime of theorizing before the facts,
where does our theory come from?

Science is not just a collection of laws,
a catalogue of unrelated facts.
It is a creation of the human mind,
with its freely invented ideas and concepts.
Physical theories try to form a picture of reality
and to establish its connection
with the wide world of sense impressions.
Thus the only justification for our mental structures
is whether and in what way our theories form such
a link. — Einstein and Infeld [6]

In the case of planets moving around the sun
it is found that the system of mechanics works
splendidly.
Nevertheless we can well imagine that another system,
based on different assumptions,
might work just as well.
Physical concepts are free creations
of the human mind, and are not,
however it may seem,
uniquely determined by the external world.
In our endeavor to understand reality
we are somewhat like a man trying
to understand the mechanism of a closed watch.
He sees the face and the moving hands,
even hears its ticking,
but he has no way of opening the case.
If he is ingenious
he may form some picture of a mechanism
which could be responsible
for all the things he observes,
but he may never be quite sure
his picture is the only one
which could explain his observations.
He will never be able
to compare his picture with the real mechanism
and he cannot even imagine the possibility
or the meaning of such a comparison.
But he certainly believes that,
as his knowledge increases,
his picture of reality will become
simpler and simpler
and will explain a wider and wider range
of his sensuous impressions.
He may also believe in the existence
of the ideal limit of knowledge
and that it is approached
by the human mind.
He may call this ideal limit
the objective truth. -- Einstein and Infeld [7]

It may sound as if Einstein believed that the soundness of
our theories is a matter of faith.
In fact, Einstein was quite comfortable with putting it
exactly that way:

However, it must be admitted
that our knowledge of these laws is only imperfect
and fragmentary, so that,
actually the belief
in the existence of basic all-embracing laws
in Nature also rests on a sort of faith.
All the same this faith has been largely
justified so far by the success of
scientific research. — Einstein [8]

I believe that every true theorist
is a kind of tamed metaphysicist,
no matter how pure a "positivist" he may
fancy himself.
The metaphysicist believes that the logically
simple is also the real.
The tamed metaphysicist believes
that not all that is logically simple
is embodied in experienced reality,
but that the totality of all sensory experience
can be "comprehended" on the basis of a
conceptual system built on premises of great
simplicity.
The skeptic will say this is a "miracle creed."
Admittedly so, but it is a miracle creed
which has been borne out to an amazing extent by
the development of science. — Einstein [9]

The liberty of choice, however,
is of a special kind;
it is not in any way similar to the liberty of a
writer of fiction.
Rather, it is similar to that of a man engaged
in solving a well-designed puzzle.
He may, it is true, propose
any word as the solution;
but, there is only one
word which really solves the puzzle in all its
parts.
It is a matter of faith that nature
—
as she is perceptible to our five senses
—
takes the character of such a
well-formulated puzzle.
The successes reaped up to now
by science do,
it is true,
give a certain encouragement for this faith. --
Einstein [10]

The puzzle metaphor of the last quote is revealing.
Einstein believes there is a single truth,
but that we will never know what it is —
even its existence can only be taken as a matter of faith.
In this metaphor, existence is a crossword puzzle whose answer
we will never know, and whose very solvability
must be taken on faith.

The very fact that the totality of our sense experience
is such that by means of thinking
(operations with concepts,
and the creation and use of definite functional relations
between them,
and the coordination of sense experiences to these concepts)
it can be put in order,
this fact is one which leaves us in awe,
but which we shall never understand.
One may say that
"the eternal mystery of the world
is its comprehensibility". — Einstein [11]

In my opinion,
nothing can be said a priori
concerning the manner in which the concepts
are to be formed and connected,
and how we are to coordinate them to sense experiences.
In guiding us in the creation of such an order
of sense experiences,
success alone is the determining factor.
All that is necessary is to fix a set of rules,
since without such rules the acquisition
of knowledge in the desired sense would be impossible.
One may compare these rules with the rules of a game
in which,
while the rules themselves are arbitrary,
it is their rigidity alone which
makes the game possible.
However, the fixation will never be final.
It will have validity only for a special field
of application. — Einstein [12]

There are no eternal theories in science.
It always happens that some of the facts
predicted by a theory
are disproved by experiment.
Every theory has its period of
gradual development and triumph,
after which it may experience a
rapid decline. — Einstein and Infeld
[13]

In our great mystery story there are no problems
wholly solved and settled for all time. — Einstein and Infeld
[14]

This great mystery story
is still
unsolved.
We
cannot
even be sure that it has a final solution. —
Einstein and Infeld [15]

Choosing a "highway"

In most of the above,
Einstein is focusing on his work in a "hard" science: physics.
Are his methods relevant to "softer" fields of study?
Einstein thinks so:

The whole of science is nothing
more than a refinement of everyday thinking.
It is for this reason that the critical thinking
of the physicist cannot possibly be restricted to
the examination of the concepts of his own
specific field.
He cannot proceed without considering critically
a much more difficult problem,
the problem of analyzing the nature of everyday
thinking. — Einstein
[16]

Einstein's collaboration with Infeld was, like the "Timeline",
a description of the evolution of ideas,
and in the Einstein–Infeld book they describe their approach:

Through the maze of
facts and concepts we had to choose some highway
which seemed to us most characteristic and significant.
Facts and theories not reached by this road had to be
omitted. We were forced, by our general aim, to make
a definite choice of facts and ideas. The importance of a
problem should not be judged by the number of pages
devoted to it. Some essential lines of thought have been
left out, not because they seemed to us unimportant,
but because they do not lie along the road we have
chosen. — Einstein and Infeld [17]

Truth and success

Einstein says that objective truth, while
it exists, is not to be attained in the hard sciences,
so it is not likely he thought that a historical
account could outdo physics in this respect.
For Einstein, as quoted above,
"success alone is the determining factor".

Success, of course, varies with what the audience
for a theory wants.
In a very real sense,
I consider a theory that can predict the
stock market more successful than
one which can predict perturbations of planetary orbits
invisible to the naked eye.
But this is not a reasonable expectation when applied
to the theory of general relativity.

Among the expectations reasonable for a timeline of parsing
might be these:

It helps choose the right parsing algorithm for practical
applications.

It helps a reader to understand articles in the
literature of parsing.

It helps guide future research.

It predicts the outcome of future research.

When I wrote the first version of Timeline,
its goal was none of these.
Instead I intended it to explain the sources behind my own
research in the Earley/Leo lineage.

With such a criterion of "success",
I wondered if Timeline would have an audience
much larger than one,
and was quite surprised when it started getting thousands of
web hits a day.
The large audience Timeline 1.0 drew
was a sign that there is a large appetite
out there for
accounts of parsing theory,
an appetite so strong that anything resembling
a coherent account
was quickly devoured.

In response to the unexpectedly large audience,
later versions of the Timeline widened
their focus.
Timeline 3.1
was broadened to give good coverage
of mainstream parsing practice
including a lot of new material and original analysis.
This brought in a lot of material on topics
which had little or no influence on my Earley/Leo work.
The parsing of arithmetic expressions,
for example,
is trivial in the Earley/Leo context,
and before my research for Timeline 3.0
I had devoted little attention to
approaches that I felt amounted to
needlessly doing things the hard way.
But arithmetic expressions are at the borderline of power
for traditional approaches
and parsing arithmetic expressions was a central motivation
for the authors of the algorithms that have so far
been most influential on mainstream parsing.
So in
Timeline 3.1
arithmetic expressions became a recurring theme,
being brought back for detailed examination time and time again.

Is the "Timeline" false?

Is the "Timeline" false?
The answer is yes, in three increasingly practical senses.

As Einstein makes clear,
every theory that is about reality
will eventually be proved false.
The best a theory can hope for is the fate of
Newton's physics —
to be shown to be a subcase of a larger theory.

In a more specific sense,
the truth of any theory of parsing history depends
on its degree of success in explaining the facts.
This means that the truth of the "Timeline" depends on which facts
you require it to explain.
If arbitrary choices of facts to be explained are allowed,
the "Timeline" will certainly be seen to be false.

But can the "Timeline" be shown to be false
for criteria of success which are non-arbitrary?
In the next section, I will describe four non-arbitrary
criteria of success,
all of which are of practical interest,
and for all of which the "Timeline" is false.

The Forever Five

"Success" depends a lot on judgement,
but my studies have led me to conclude that all but five algorithms
are "unsuccessful" in the sense that,
for everything that they do,
at least one other algorithm does it better in practice.
But this means there are five algorithms which do solve
some practical problems
better than any other algorithm,
including each of the other four.
I call these the "forever five" because,
if I am correct,
these algorithms will be of permanent interest.

My "Forever Five" are regular expressions, recursive descent, PEG, Earley/Leo and Sakai's
algorithm.[18]
Earley/Leo is the focus of my
Timeline, so that an effective
critique of my "Timeline"
could be a parsing historiography centering on any of the other four.

For example, of the five, regular expressions are the most limited in parsing power.
On the other hand, most of the parsing problems you encounter in practice
are handled quite nicely by regular expressions.[19]
Good implementations of regular expressions are widely available.
And, for speed, they are literally unbeatable -- if a parsing problem is a
regular expression, no other algorithm will beat a dedicated regular expression
engine for parsing it.

Could a
Timeline
competitor be written which
centered on regular expressions?
Certainly.
And if immediate usefulness to the average programmer is the criterion
(and it is a very good criterion),
then the
Regular Expressions Timeline
would certainly give
my timeline a run for the money.

What about a PEG Timeline?

The immediate impetus for this article was
a very collegial inquiry
from Nicolas Laurent, a researcher whose main interest is PEG.
Could a
PEG Timeline
challenge mine?
Again, very certainly.

Because there are at least some
problems for which PEG is superior to everything else,
my own Earley/Leo approach included.
As one example, PEG
could be a more powerful alternative to regular expressions.

That does not mean that I might not come back with
a counter-critique.
Among the questions that I might ask:

Is the PEG algorithm being proposed still a future project,
or does it have an implementation?

What claims of speed and time complexity are made?
Is there a way of determining in advance of runtime how fast
your algorithm will run?
Or is the expectation of practical speed
on an "implement and pray" basis?

Does the proposed PEG algorithm match human parsing
capabilities?
If not, it is a claim for human exceptionalism,
of a kind not usually accepted in modern computer science.
How is exceptionalism justified in this case?

The search for truth is more precious
than its possession. -- Einstein, quoting Lessing[20]

Footnotes

2.
"A dog reflects the family life.
Whoever saw a frisky dog in a gloomy family, or a sad dog in a happy one?
Snarling people have snarling dogs, dangerous people have dangerous ones."
From "The Adventure of the Creeping Man".
↩

18.
Three quibbles:
Regular expressions do not find structure,
so pedantically they are recognizers,
not parsers.
Recursive descent is a technique for creating a family of algorithms,
not an algorithm.
And the algorithm first described by Sakai is more commonly
called CYK, from the initials of three other researchers who re-discovered
it over the years.
↩

19.
A lot of this is because programmers learn to formulate problems in
ways which avoid complex parsing so that,
in practice,
the alternatives are
using regular expressions or rationalizing away the
need for parsing.
↩

20.
"The Fundaments of Theoretical Physics", in
Ideas and Opinions, p. 335.
↩

In
a previous post,
I outlined a method for using the Marpa algorithm as the basis for
better combinator parsing.
This post follows up
with a trial implementation.

For this trial,
I chose the most complex example from the classic 1996 tutorial
on combinator parsing by
Hutton and Meijer[1].
Their example implements the offside-rule parsing of a functional language --
parsing where whitespace is part of the syntax.[2]
The Hutton and Meijer example is for Gofer,
a now obsolete implementation of Haskell.
To make the example more relevant,
I wrote a parser for Haskell layout
according to the Haskell 2010 Language Report[3].

For tests,
I used the five examples (2 long, 3 short) provided
in the 2010 Report[4],
and the four examples given in the "Gentle Introduction" to Haskell[5].
I implemented only enough of the Haskell syntax to run
these examples,
but this was enough to include a substantial subset of Haskell's
syntax.

This description of the implementation includes many extracts from
the code.
For those looking for more detail,
the full code and test suite for this example are
on Github.
While the comments in the code do not amount to a tutorial, they are
extensive.
Readers who like to "peek ahead" are encouraged to do so.

Layout parsing and the off-side rule

It won't be necessary to know Haskell to follow this post.
This section will describe Haskell's layout informally.
Briefly, these two code snippets should have the same effect:
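
(The displays below are reconstructions, built from the declarations quoted later in this section.) First, implicit layout:

    let y   = a*b
        f x = (x+y)/y
    in  f c + f d

And the same code with explicit layout:

    let { y = a*b ; f x = (x+y)/y }
    in  f c + f d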

In my test suite, both code snippets produce the same AST.
The first code display uses Haskell's implicit layout parsing,
and the second code display uses explicit layout.
In each, the "let" is followed by a block
of declarations
(symbol <decls>).
Each decls contains one or more
declarations
(symbol <decl>).
For the purposes of determining layout,
Haskell regards
<decls> as a "block",
and each
<decl> as a block "item".
In both displays, there are two items in
the block.
The first item is
y = a*b,
and the second
<decl> item
is f x = (x+y)/y.

In explicit layout, curly braces surround the block,
and semicolons separate each
item.
Implicit layout follows the "offside rule":
The first element of the laid out block
determines the "block indent".
The first non-whitespace character of every subsequent non-empty line
determines the line indent.
The line indent is compared to the block indent.

If the line indent is deeper than the block indent,
then the line continues the current block item.

If the line indent is equal to the block indent,
then the line starts a new block item.

If the line indent is less than the block indent
(an "outdent"),
then the line ends the block.
An end of file also ends the block.

Lines containing only whitespace are ignored.
Comments count as whitespace.
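
Here is a worked sketch of these rules (a constructed example, not one from the test suite; the first declaration is split across two lines to show continuation):

    let y = a *
              b          -- deeper than the block indent: continues the item
        f x = (x+y)/y    -- equal to the block indent: starts a new item
    in  f c + f d        -- outdent: ends the block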

Explicit semicolons can be used
in implicit layout:
If a semicolon occurs in implicit layout,
it separates block items.
In our test suite, one example uses an explicit semicolon inside an implicitly laid-out block.
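
(A sketch of such a case; again a reconstruction, not the verbatim test file.)

    let y = a*b;  z = a/b    -- the semicolon separates two items on one line
        f x = (x+y+z)/y
    in  f c + f d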

The examples in the displays above are simple.
The two long examples from the 2010 Report are
more complicated:
6 blocks of 4 different kinds,
with nesting twice reaching
a depth of 4.
The two long examples in the 2010 Report are the same,
except that one uses implicit layout and the other uses
explicit layout.
In the test of my Haskell subset parser,
both examples produce identical ASTs.

There are additional rules,
including for tabs, Unicode characters and
multi-line comments.
These rules are not relevant in the examples I took from the Haskell literature;
they present no theoretical challenge to this parsing method;
and they are not implemented in this prototype Haskell parser.

The strategy

To tackle Haskell layout parsing, I use a separate
combinator for each layout block.
Every combinator, therefore, has its own block and item symbols,
and its own block indent;
and each combinator implements exactly one method of layout -- explicit or implicit.

From the point of view of its parent combinator,
a child combinator is a lexeme,
and the parse tree it produces is the
value of the lexeme.
Marpa can automatically produce an AST,
and it adds lexeme values to the AST as leaves.
The effect is that Marpa automatically assembles
a parse tree for us from the tree segments returned by the
combinators.

Ruby Slippers semicolons

In outlining this algorithm, I will start by explaining
where the "missing" semicolons come from in the implicit layout.
Marpa allows various kinds of "events",
including events on discarded tokens.
("Discards" are tokens thrown away, and not used in the parse.
The typical use of token discarding in Marpa is for the handling of whitespace
and comments.)

In the code, an event named 'indent' is set, which
happens when Marpa finds a newline followed by zero or more
whitespace characters.[9]
This does not capture the indent of the first line of a file,
but that is not an issue --
the 2010 Report requires that the first indent be treated as a
special case anyway.

Indent events, like others, occur in the main read loop
of each combinator. Outdents and EOFs are dealt with by terminating
the read loop.[11]
Line indents deeper than the current block indent are dealt with by
resuming the read loop. [12]
Line indents equal to the block indent trigger the reading of a
Ruby Slippers semicolon.

Ruby Slippers

In Marpa, a "Ruby Slippers" symbol is one which does not actually occur
in the input.
Ruby Slippers parsing is new with Marpa,
and made possible because Marpa is left-eidetic.
By left-eidetic, I mean that Marpa knows, in full
detail, about the parse to the left of its current position,
and can provide that information to the parsing app.
This implies that Marpa also knows which tokens are acceptable
to the parser at the current location,
and which are not.

Ruby Slippers parsing enables a very important trick which
is useful in "liberal"
parsing -- parsing where certain elements might be in some sense
"missing".
With the Ruby Slippers you can design a "liberal" parser with
a "fascist" grammar.
This is, in fact, how the Haskell 2010 Report's
context-free grammar is designed --
the official syntax requires explicit layout,
but Haskell programmers are encouraged to omit most of the explicit
layout symbols,
and Haskell implementations are required to "dummy up" those
symbols in some way.
Marpa's method for doing this is left-eideticism and Ruby Slippers
parsing.

The term "Ruby Slippers" refers to a widely-known scene in the "Wizard of Oz" movie.
Dorothy is in the fantasy world of Oz, desperate to return to Kansas.
But, particularly after a shocking incident in which orthodox Oz wizardry
is exposed as an affable fakery,
she is completely at a loss as to how to escape.
The "good witch" Glenda appears and tells Dorothy that in fact she's always
had what she's been wishing for.
The Ruby Slippers, which she had been wearing all through the movie,
can return her to Kansas.
All Dorothy needs to do is wish.

In Ruby Slippers parsing,
the "fascist" grammar "wishes" for lots of things that may not be in
the actual input.
Procedural logic here plays the part of a "good witch" -- it tells
the "fascist" grammar that what it wants has been there all along,
and supplies it.
To do this,
the procedural logic has to have a reliable way of knowing what the parser
wants.
Marpa's left-eideticism provides this.

Ruby Slippers combinators

This brings us to a question
I've postponed -- how do we know which combinator
to call when?
The answer is Ruby Slippers parsing.
First, here are some lexer rules for "unicorn" symbols.
We use unicorns when symbols need to appear in Marpa's lexer,
but must never be found in actual input.
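
(A sketch of those rules, reconstructed from the description below; in Marpa's DSL, the "~" rules are lexer rules, and the symbol names are those used in the surrounding text.)

    unicorn        ~ [^\d\D]
    L0_unicorn     ~ unicorn
    ruby_i_decls   ~ unicorn
    ruby_x_decls   ~ unicorn
    ruby_semicolon ~ unicorn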

<unicorn> is defined to match
[^\d\D].
This pattern is all the symbols which are not digits
and not non-digits -- in other words, it's impossible that this
pattern will ever match any character.
The rest of the statements declare other unicorn lexemes
that we will need.
<unicorn> and
<L0_unicorn> are separate,
because we need to use
<unicorn> on the RHS of some lexer rules,
and a Marpa lexeme can never occur
on the RHS of a lexer rule.[15]
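
Roughly, the rule in question (again a reconstruction from the description below; in Marpa's DSL, parenthesized RHS symbols are hidden from the semantics):

    laidout_decls ::= ('{') ruby_x_decls ('}')
                    | ruby_i_decls
                    | L0_unicorn decls L0_unicorn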

In the above Marpa rule,

<decls> is the symbol from the 2010 Report;

<ruby_i_decls> is a Ruby Slippers symbol for
a block of declarations with implicit layout.

<ruby_x_decls> is a Ruby Slippers symbol for
a block of declarations with explicit layout.

<laidout_decls> is a symbol (not in the 2010 Report)
for a block of declarations covering all the possibilities for
a block of declarations.

It is the expectation of a
<laidout_decls> symbol that causes child
combinators to be invoked.
Because <L0_unicorn> will never be found
in the input,
the
<decls> alternative will never match --
it is there for documentation and debugging reasons.[17]
Therefore Marpa, when it wants a
<laidout_decls>,
will look for a
<ruby_x_decls>
if an open curly brace is read;
and a
<ruby_i_decls> otherwise.
Neither <ruby_x_decls>
nor
<ruby_i_decls> will ever be found in the
input,
and Marpa will reject the input,
causing a "rejected" event.

Rejected events

In this code, as often,
the "good witch" of Ruby Slippers does her work through
"rejected" events.
These events can be set up to happen when, at some parse
location, none of the tokens that Marpa's internal lexer
finds are acceptable.

In the "rejected" event handler,
we can use Marpa's left eideticism to find out what
lexemes Marpa would consider acceptable.
Specifically, there is a terminals_expected()
method which returns a list of the symbols acceptable
at the current location.

Once we "grep" out all but the symbols with the "ruby_" prefix,
there are only 4 non-overlapping possibilities:

Marpa expects a
<ruby_i_decls>
lexeme;

Marpa expects a
<ruby_x_decls>
lexeme;

Marpa expects a
<ruby_semicolon>
lexeme;

Marpa does not expect
any of the Ruby Slippers lexemes;

If Marpa does not expect any of the Ruby Slippers
lexemes, there was a syntax error in the Haskell code.[19]

If a <ruby_i_decls>
or a <ruby_x_decls>
lexeme is expected, a child combinator is invoked.
The Ruby Slippers symbol determines
whether the child combinator looks for implicit
or explicit layout.
In the case of implicit layout, the location of
the rejection determines the block indent.[20]

If a
<ruby_semicolon>
is expected, then the parser is at the point where a
new block item could start,
but none was found.
Whether the block was implicit or explicit,
this indicates we have reached the end of the block,
and should return control to the parent combinator.[21]

To explain why
<ruby_semicolon>
indicates end-of-block,
we look at both cases.
In the case of an explicit layout combinator,
the rejection should have been caused by a closing
curly brace, and
we return to the parent combinator
and retry it.
In the parent combinator, the closing curly brace
will be acceptable.

If we experience a "rejected" event while
expecting a
<ruby_semicolon> in an implicit layout
combinator,
it means we did not find an explicit semicolon;
and we also never found the right indent for creating a Ruby semicolon.
In other words, the indentation is telling us that we are at the end
of the block.
We therefore return control to the parent combinator.

Conclusion

With this, we've covered the major points of this Haskell prototype
parser.
It produces an AST whose structure and node names are those of
the 2010 Report.
(The Marpa grammar introduces non-standard node names and rules,
but these are pruned from the AST in post-processing.)

In the code, the grammars from the 2010 Report are included for
comparison, so a reader can easily determine what syntax we left out.
It might be tedious to add the rest,
but I believe it would be unproblematic, with one interesting exception:
fixity.
To deal with fixity, we may haul out the Ruby Slippers again.

The code, comments, etc.

A permalink to the
full code and a test suite for this prototype,
as described in this blog post,
is
on Github.
I expect to update this code,
and the latest commit can be found
here.
Links for specific lines of code in this post are usually
static permalinks to earlier commits.

2.
I use
whitespace-significant parsing as a convenient example
for this post,
for historical reasons and
for reasons of level of complexity.
This should not be taken to indicate that I recommend it
as a language feature.
↩

9.
Single-line comments are dealt with properly by lexing them
as a different token and discarding them separately.
Handling multi-line comments is not yet implemented --
it is easy in principle but
tedious in practice and the examples drawn from the
Haskell literature did not provide any test cases.
↩

15.
The reason for this is that by default a Marpa grammar determines
which of its symbols are lexemes using the presence of those
symbols on the LHS and RHS
of the rules
in its lexical and context-free grammars.
A typical Marpa grammar
requires a minimum of explicit lexeme declarations.
(Lexeme declarations are statements with the :lexeme
pseudo-symbol on their LHS.)
As an aside,
the Haskell 2010 Report is not always careful about the lexer/context-free
boundary,
and adopting its grammar
required more use of Marpa's explicit lexeme declarations
than usual.
↩

17.
Specifically, the presence of a
<decls> alternative silences the usual warnings about
symbols inaccessible from the start symbol.
These warnings can be silenced in other ways,
but at the prototype stage it is convenient to check that
all symbols supposed to be accessible through
<decls> are in fact accessible.
There is a small startup cost to allowing the extra symbols
in the grammars,
but the runtime cost is probably not measurable.
↩

19.
Currently the handling of these is simplistic.
A practical implementation of this method would want better reporting.
In fact, Marpa's left eideticism allows some interesting things
to be done in this respect.
↩

Language popularity

Github's
linguist
is seen as the most trustworthy tool
for estimating language popularity[1],
in large part because it reports its result as
the proportion of code in a very large dataset,
instead of web hits or searches.[2]
It is ironic, in this context,
that
linguist
avoids looking at the code,
preferring to use
metadata -- file name and the vim and shebang lines.
Scanning the actual code is linguist's last resort.[3]

How accurate is this?
For files that are mostly in a single programming language,
currently the majority of them,
linguist's methods are probably very accurate.

But literate programming often requires mixing languages.
It is perhaps an extreme example,
but much of the code used in this blog post
comes from a Markdown file, which contains both C and Lua.
This code is "untangled" from the Lua by ad-hoc scripts[4].
In my codebase,
linguist
identifies this code simply
as Markdown.[5]
linguist then ignores it,
as it does all documentation files.[6]

Currently, this kind of homegrown
literate programming may be so rare
that it is not worth taking into account.
But if literate programming becomes more popular,
that trend might well slip under
linguist's radar.
And even those with a lot of faith in
linguist's numbers should be happy to
know they could be confirmed by more careful methods.

Token-by-token versus line-by-line

linguist avoids reporting results based on looking at the code,
because careful line counting for multiple languages
cannot be done with traditional parsing methods.[7]
To do careful line counting,
a parser must be able to handle ambiguity in several forms --
ambiguous parses, ambiguous tokens, and overlapping variable-length tokens.

The ability to deal with
"overlapping variable-length tokens" may sound like a bizarre requirement,
but it is not.
Line-by-line languages (BASIC, FORTRAN, JSON, .ini files, Markdown)
and token-by-token languages (C, Java, Javascript, HTML)
are both common,
and even today commonly occur in the same file (POD and Perl,
Haskell's Bird notation, Knuth's CWeb).

Deterministic parsing can switch back and forth,
though at the cost of some very hack-ish code.
But for careful line counting,
you need to parse line-by-line and token-by-token
simultaneously.
Consider this example:
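
(The display below is a reconstruction that matches the line and column references later in this section; column positions count from the start of each code line, ignoring the indentation of this display.)

    int main() {    /* a C comment, starting mid-line
    \begin{code}
       */ int b = 1;
       int a = 42;
       return a;    /* another C comment
    \end{code}
    */ }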

A reader can imagine that this code is part of a test case using code
pulled from a LaTeX file.
The programmer wanted to indicate the copied portion of code,
and did so by commenting out its original LaTeX delimiters.
GCC compiles this code without warnings.

It is not really the case that LaTeX is a line-by-line language.
But in literate programming systems[8],
it is usually required that the
\begin{code}
and
\end{code}
delimiters begin at column 0,
and that the code block between them be a set of whole lines so,
for our purposes in this post,
we can treat LaTeX as line-by-line.

In the LaTeX parse, line alignment is respected perfectly:
The first and last are ordinary LaTeX lines,
the 2nd and 6th are commands bounding the code,
and lines 3 through 5 are a code block.

The C tokenization, on the other hand,
shows no respect for lines.
Most tokens are a small part of their line,
and the two comments start in the middle of
a line and end in the middle of one.
For example, the first comment starts at column 17
of line 1 and ends at column 5 of line 3.[10]

What language is our example in?
Our example is long enough to justify classification,
and it compiles as C code.
So it seems best to classify this example as C code[11].
Our parses give us enough data for a heuristic
to make a decision capturing this intuition.[12]

Earley/Leo parsing and combinators

In a series of previous posts[13],
I have been developing a parsing method that
integrates
Earley/Leo parsing and combinator parsing.
Everything in my previous posts is available
in Marpa::R2,
which has been in Debian stable since jessie.

The final piece, added in this post, is the
ability to use variable length subparsing[14],
which I have just added to Marpa::R3,
Marpa::R2's successor.
Releases of Marpa::R3
pass a full test suite,
and the documentation is kept up to date,
but R3 is alpha, and the usual cautions[15]
apply.

Earley/Leo parsing is linear for a superset
of the LR-regular grammars,
which includes all other grammar classes in practical use,
and Earley/Leo allows the equivalent of infinite lookahead.[16]
When the power of Earley/Leo gives out,
Marpa allows combinators (subparsers)
to be invoked.
The subparsers can be anything, including
other Earley/Leo parsers,
and they can be called recursively[17].
Rare will be the grammar of practical interest that
cannot be parsed with this combination of methods.

The example

The code that ran this example is available on Github.
In previous posts,
we gave larger examples[18],
and our tools and techniques have scaled.
We expect that the variable-length subparsing
feature will also scale -- while it was not available in
Marpa::R2, it is not in itself new.
Variable-length tokens have been available in other Marpa interfaces for
years, and they were described in Marpa's theory paper.[19]

The grammars used in the example of this post are minimal.
Only enough LaTex is implemented
to recognize code blocks; and
only enough C syntax is implemented to recognize comments.

Footnotes

2.
Their methodology is often left vague,
but it seems safe to say the careful line-by-line counting
discussed in this post
goes well beyond the techniques used in
the widely-publicized lists of "most popular programming
languages".

In fact, it seems likely these measures do not use line
counts at all,
but instead report the sum of blob sizes.
Github's linguist does give a line count but
Github does not vouch for its accuracy:
"if you really need to know the lines of code of an entire repo, there are much better tools for this than Linguist."
(Quoted from
the resolution of
Github linguist issue #1331.)
The Github API's list-languages command reports language sizes
in bytes.
The API documentation
is vague, but it seems the counts are the
sum of blob sizes,
with each blob classed as one and only one language.

4.
This custom literate programming system is not documented or packaged,
but those who cannot resist taking a look can find the Markdown
file it processes here,
and its own code
here
(permalinks accessed 2 October 2018).
↩

5.
For those who care about getting
linguist
as
accurate as possible,
there is a workaround:
the
linguist-language
git attribute.
This still requires that each blob be
reported as containing lines of only one language.
↩

6.
For the treatment of Markdown, see
the linguist README.md
(permalink accessed as of 30 September 2018).
↩

7.
Another possibility is a multi-scan approach -- one
pass per language.
But that is likely to be expensive.
At last count there were 381 languages in
linguist's
database.
Worse, it won't solve the problem:
"liberal" recognition even of a single language
requires more power than available from
traditional parsers.
↩

8.
For example, these line-alignment requirements match
those in
Section 10.4 of the 2010 Haskell Language Report.
↩

11.
Some might think the two LaTeX lines should be counted as LaTeX and,
using subparsing of comments, that heuristic can be implemented.
↩

12.
To be sure, a useful tool would want to include considerably more of
C's syntax.
It is perhaps not necessary to be sure that a file compiles
before concluding it is C.
And we might want to class a file as C in spite of a
fleeting failure to compile.
But we do want to lower the probability of a false positive.
↩

14.
There is documentation of the interface,
but it is not a good starting point
for a reader who has just started to look at the Marpa::R3 project.
Once a user is familiar with Marpa::R3's standard DSL-based
interface,
they can start to learn about its alternatives here.
↩

15.
Specifically,
since Marpa::R3 is alpha,
its features are subject
to change without notice, even between micro releases,
and changes are made without concern for backward compatibility.
This makes R3 unsuitable for a production application.
Add to this that,
while R3 is tested, it has seen much less
usage and testing than R2, which has been very stable for
some time.
↩

16.
Technically, a grammar is LR-regular if it can be parsed
deterministically using a regular set as its lookahead.
A "regular set" is a set of regular expressions.
The regular set itself must be finite,
but the regular expressions it contains
can match lookaheads of arbitrary length.
↩

Announcing Timeline 3.1

I have just released
version 3.1 of my Parsing Timeline.
It is a painless introduction to
a fascinating and important story
which is scattered through
one of the most
forbidding literatures in computer science.
Previous versions of this timeline have been,
by far,
the most popular of my writings.

A third of Timeline 3.1 is new,
added since the 3.0 version.
Much of the new material is adapted from previous
blog posts, both old and recent.
Other material is completely new.
The sections that are not new with 3.1
have been carefully reviewed and
heavily revised.