Marpa resources

Mon, 31 Dec 2012

In
a previous post, I showed a self-parsing grammar,
written in Marpa's new BNF interface.
That grammar was in a tradition going back to the 70's.
Following the tradition, I cheated a bit.
That grammar required,
but did not include, a lexer to make a prepass over
its input.

This post contains a self-parsing
and self-lexing grammar,
the one for Marpa's forthcoming Scanless interface.
This grammar is about as self-contained as a grammar can get,
short of being encoded into a
Universal
Turing machine.

Many readers will
prefer to be introduced to the Scanless interface
by way of
a simpler example,
but based on the response to the previous post I know
there are others who share my fascination with
self-description and self-exemplification.
And there is something to be said for reading an example
that is a final authority on itself.

This is certainly a practical example.
The grammar that follows is used
to parse itself and all other grammars written for the
Marpa's Scanless interface.
It is also used to parse the strings written
for Marpa's BNF interface,
the Scanless interface's predecessor.

Starting out

The grammar in this blog post is abridged a bit,
and rearranged for ease of explanation.
The original is
here.

# Copyright 2012 Jeffrey Kegler

The file starts with legalese, which I've cut.
(The grammar is under GNU's LGPL 3.0.)
Note the hash comment -- since this is a self-describing self-lexer,
the grammar will eventually tell us how it deals with hash comments.

The
rules
symbol is the start symbol.
Our grammar consists of a series of one or more rules,
each one of which falls into one of five types.

By my count,
three of the rules in our self-describing grammar are
themselves self-describing.
The first and last rule above are two of them.
The definition of
rules
is itself a series
of one or more
rule's.
And the definition of
rule
is a
<priority rule>,
which is
one of the 5 types of
rule
that are allowed.

The first two rules exemplify two of the other rule types:
the first is a
<start rule>
and the second is a
<quantified rule>.
A start rule is defined as
consisting of the
:start
pseudo-symbol,
followed by a
::=
operator,
followed by a
symbol:

The parentheses can be ignored.
Their purpose is to make life easier on the semantics
-- they surround symbols
that are hidden from the semantics.

"What I say" versus "what I mean"

When we define a grammar rule,
sometimes we are asking Marpa to do exactly what
we say, character for character.
For example, in the
"<op declare bnf> ~ '::='" rule above,
we are saying that the symbol
<op declare bnf>
is exactly the string '::=',
nothing more and nothing less.
"Do exactly what I say" rules are called lexical rules,
and for them we use the match operator ("~").

In other cases, we are asking Marpa to "do what I mean".
That is, we are saying that "the rule essentially consists of these symbols,
but I also want you
to do what is reasonable with whitespace and comments."
An example of a "do what I mean" rule was
"rules ::= rule+".
Within, between, before and after the
rule
symbols,
there will be comments and whitespace.
The whitespace will sometimes be optional,
and will sometimes be required.

Spelling all the comment and whitespace handling
out would be very tedious and error-prone.
We want there to be rules of a kind that "do what we mean".
Marpa's "do what I mean" rules are called "structural" rules,
and are identified by the BNF operator ("::=").

Some structural rules have lexical content within them.
An example was
"<start rule> ::= (':start' <op declare bnf>) symbol".
That is basically a structural rule, where Marpa should "do what I mean"
with whitespace and comments.
But it also contains a string,
':start'.
When a string or a character class
occurs inside a structural rule,
Marpa knows how to treat them properly.
Marpa knows, for example,
that whitespace is not OK
between the "a" and the "r" of ":start".

The "what I mean" versus "what I say" distinction
corresponds very closely to the distinction in Perl 6 grammars
between the "rule" and "token".
It also corresponds to the traditional division of labor in compilers,
between the lexer and the parser proper.

Rules

Above we saw an example of a quantified rule: "rules ::= rule+".
Here is the definition:

A quantified rule contains a left hand side (LHS) symbol name,
one of the two declaration operators,
a
single symbol,
a plus or minus "quantifier",
and an adverb list.
The adverb list can be empty, as it was in our example.
(In fact the adverb list has been empty in every rule so far.)

Most rules, including most of the rules we've already seen,
are priority rules.
Priority rules are
so-called because in their most complicated form they can express
a precedence scheme.
The typical rule in a grammar is a priority rule with only
one priority -- a "simple" priority rule.
All priority rules in this grammar will be of the simple kind.

Within priorities, there can be alternatives,
and we have seen examples of these.
The self-defining rule that defined a
rule
stated that it was one of a set of 5 possible
rule types (priority rule being among those types).
In that self-defining rule,
the different types of rule were alternatives within a single
priority.

And, as long as we are on the subject of self-defining rules,
there are three of them in this grammar, of which we have previously
identified two.
The definition of
<priority rule>
is the third and last self-defining rule --
it is itself a priority rule.

Symbols

We've used
symbol,
<symbol name>,
and
<single symbol>
a few times.
It's time to see how they are defined:

At this point,
symbol
and
<symbol name>
are
essentially the same thing:
someday there may be another way to specify symbols
other than by name.
<single symbol>
means any expression guaranteed
to produce a single symbol.
symbol
is obviously one;
a character class is the other.

The rules above contain our first mention
of character classes and,
by coincidence,
our first use of character classes.
Character classes are enclosed in square brackets,
and look exactly like Perl character classes.
In fact, they are implemented as Perl character classes, memoized for
efficiency.

Now that we know what a symbol can be,
let's look at how right hand sides are built up:

Three of the unused
adverbs have to do with the associativity (right/left/group)
of priorities.
Since all our "prioritized" rules are trivial (have only one priority),
this grammar does not use them.
(Their use is described in
the documentation of Marpa's BNF interface.)
We also will not see
action
adverbs in this grammar,
for reasons explained below.

These adverbs are used for quantified rules.
One specifies a "separator" that can go between items of the series.
The other specifies whether separation is "proper" or not.
(When
proper
is 0, a separator is allowed after the last item of a series.
When that is the case,
the separator does not really always separate two items
and in that sense the separator is not "proper".)

Discarded tokens

:discard ~ whitespace
whitespace ~ [\s]+

The two rules say that sequences of whitespace are recognized
as tokens, then discarded.
Perl-style comments are handled in the same way:

"Unterminated final hash comments" deal with the special
case of hash comments at the end of a file,
when that file is not properly
terminated with a newline.
The
<unterminated final hash comment>
symbol is an example of how
a long angle bracketed symbol name can make things clearer.
Without the long name, it might not be evident what that rule and symbol
were trying to accomplish.

Long symbol names
have a cleaner look than comments,
but they are not a panacea.
They have
the special advantage that the description goes wherever the symbol
name goes.
But when the description is too long,
that advantage becomes a disadvantage.

Strings

In the next snippet, defining single-quoted strings,
the description is clearly too long for the symbol name,
so much of it does go into a comment.

# In single quotes strings and character classes
# no escaping or internal newlines, and disallow empty string
<single quoted string>
~ ['] <string without single quote or vertical space> [']
<string without single quote or vertical space>
~ [^'\x{0A}\x{0B}\x{0C}\x{0D}\x{0085}\x{2028}\x{2029}]+

Note two Unicode vertical whitespace codepoints,
U+2028 and U+2029, are included.
The implementation only supports 7-bit ASCII,
so for the moment these accomplish nothing.
But when Unicode support is added, the grammar won't need to be changed.

These are Perl character classes,
and are passed unaltered to Perl for interpretation.
Marpa needs to recognize that they start and end with square brackets.
And it also must recognize enough of their internals
to deal with escaped square brackets,
which makes
them the most
complicated lexeme of the grammar.

Longest tokens matching

When there is a choice of lexicals, Marpa follows a longest tokens matching strategy.
The effect is usually that it does what you mean.
(In part this is because longest token match
is the usual default for regular expressions and
lexical analyzers, so that programmers are trained by their tools
to really mean a longest match
whenever they specify a match.)

Most of what longest-token-match does is obvious.
For example,
in the rule "weeknights ~ 'Mon' | 'Tue' | 'Wed' | 'Thu'",
it recognizes that
weeknights
is one symbol and not two symbols like
"wee" and "knights".
Longest tokens match
means that if you have a grammar where you specify both
++
and
+
as operators, Marpa will always prefer
the longer operator:
++.

Because this is Marpa, there is a slight difference from the traditional longest
token
matching.
Note that in Marpa's matching strategy, "tokens" is plural.
If more than one possibility has the same length, Marpa will try them all.
This plays a role in our meta-grammar.
For example,
separator
is a keyword.
But it is also a valid symbol name.
Marpa allows it to be both, and figures out which is meant at the
structural level, based on context.

Semantics

In practice, a grammar is usually tied tightly to a single semantics.
This is an exception.
The Scanless interface's meta-grammar is also the grammar for Marpa's
BNF interface, and the BNF interface has a different semantics.

For most grammars in Marpa's BNF or Scanless interface,
the semantics would be specified using
action
adverbs.
Examples of the normal method of specifying
semantics are in the documentation for
the BNF interface
and for
the Scanless interface.

In this special-case grammar,
there are no
action
adverbs.
Marpa waits until it knows which interface the grammar will be used for.
At that point the internals use the symbol names to map rules to
actions on a "just in time"
basis.

Comments

Comments on this post can be sent to the Marpa Google Group:
marpa-parser@googlegroups.com