Like the Token class, the AST class is very minimal. The
framework assumes that each AST node will have a string-valued
type attribute, and that the children of a node are obtainable
using the __getitem__ method. Again, you can roll your own
class with this type of interface too.
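A minimal node class satisfying this interface might look like the following sketch; the kids list and the addkid helper are just one way of storing children, not something the framework requires:

```python
class AST:
    def __init__(self, type):
        # SPARK only assumes a string-valued `type` attribute...
        self.type = type
        self.kids = []

    def addkid(self, kid):
        self.kids.append(kid)
        return self

    def __getitem__(self, i):
        # ...and children reachable via __getitem__.
        return self.kids[i]
```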

I prefer n-ary trees for my ASTs rather than binary trees, so the
example in examples/paper is quite a bit different from my own.

If, at some point, the input is not matched by any of the regular
expressions you've supplied, then the error method will get called
(in previous versions an assertion was raised instead). This method is
passed the input string and the position within the string at which the
error occurred.

GenericScanner provides a default implementation which prints out
an error message and raises a SystemExit exception.
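For example, if you'd rather have a catchable exception than a SystemExit, you might override error along these lines. ScannerError is a name invented here, and the one-line GenericScanner is only a stand-in for the real class in spark.py:

```python
class GenericScanner:                   # stand-in for spark.GenericScanner
    pass

class ScannerError(Exception):          # hypothetical exception class
    pass

class MyScanner(GenericScanner):
    def error(self, s, pos):
        # Report where scanning failed instead of raising SystemExit.
        raise ScannerError("lexical error at position %d: %r"
                           % (pos, s[pos:pos + 10]))
```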

I'm deliberating whether this should be a straight exception, or
whether I should modify the interface to permit some form of error recovery...

GenericParser should work with any Backus-Naur form (BNF) grammar,
including grammars
with empty right-hand sides and ambiguous grammars.
There are apparently a few rather obscure cases where Earley's parsing algorithm
fails, but they're not likely to crop up in any reasonable application.
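For instance, an empty right-hand side is written by simply leaving the right side of ``::='' blank in a rule's docstring. A sketch (the one-line GenericParser is a stand-in for the real class in spark.py):

```python
class GenericParser:                    # stand-in for spark.GenericParser
    pass

class MyParser(GenericParser):
    def p_stmts(self, args):
        '''
            stmts ::= stmts stmt
            stmts ::=
        '''
```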

GenericParser can often run faster if it knows the types of its
tokens; more specifically, if it can obtain a token's type as a string. You
can tell the parser this by supplying a typestring method, which is
passed a token and must either return that token's type as a string, or
None if it's unable to (in the latter case, the parser will simply
fall back on a less efficient method of operation).

For example, if a token's type is always stored under the name type,
you might have the following code:
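Something along these lines (the one-line GenericParser is a stand-in for the real class in spark.py):

```python
class GenericParser:                    # stand-in for spark.GenericParser
    pass

class MyParser(GenericParser):
    def typestring(self, token):
        # The token's type lives in its `type` attribute; returning
        # None instead would make the parser fall back on its slower,
        # more general matching.
        return token.type
```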

You're using a very general parsing algorithm
implemented in Python; it's not going to be blazingly fast,
sorry.

Action code.

The entire input has been recognized by
GenericParser before any of your action code is
executed. This may restrict the types of things you do on-the-fly
in your parser actions. It's one argument for building an AST
that you can traverse any way you want.

Watch your method names.

I've been bitten by this a few times.
Python won't warn you if you inadvertently redefine a method,
which I've done when cutting and pasting a bit too freely. It's
a bugger to track down too.
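A contrived illustration of the pitfall (Calculator is a made-up class); the second definition silently replaces the first, and no warning is issued:

```python
class Calculator:                       # made-up example class
    def p_expr(self, args):
        ' expr ::= expr + term '
        return 'addition'

    def p_expr(self, args):             # same name: silently shadows the first!
        ' expr ::= expr - term '
        return 'subtraction'
```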

Since the IPC7 paper, I have added some rudimentary ambiguity-resolution code.
It was undocumented and far from satisfactory, and it has been supplanted by
the following interface.

Ambiguities are not resolved until the parse tree is being traversed. In
other words, the input is already known to be syntactically correct; it's
just a matter of deciding which parse tree to build when there are multiple
choices.

When an ambiguity is reached, a choice will need to be made between two
or more rules. These rules must reside in different p_ methods.
The method names which correspond to the conflicting rules, minus the
initial ``p_,'' are gathered together in a list. This list is
sorted by the length of the rules' right-hand side - shorter rules appear
earlier in the list - and passed to the resolve method. The
resolve method must choose an element of the list and return its choice.

GenericParser supplies a default implementation of resolve
which always selects the rule with the shortest right-hand side (assuming
the conflicting rules reside in different p_ methods, of course).
This is enough to handle the ``dangling-else'' conflict.

Now some examples. The first one always picks the rule with the shortest
right-hand side (this is the default as supplied):
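Written out explicitly, that behaviour amounts to the following sketch; since the list arrives sorted shortest right-hand side first, returning its first element suffices (the one-line GenericParser is a stand-in for the real class):

```python
class GenericParser:                    # stand-in for spark.GenericParser
    pass

class MyParser(GenericParser):
    def resolve(self, rules):
        # The list is sorted by right-hand side length, shortest
        # first, so element 0 is the shortest rule.
        return rules[0]
```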

If you have an ambiguity, but aren't sure where, you may want to override
the default resolve with one that simply prints out the list it's
passed. This allows you to find the guilty parties, and to ensure that there
are no duplicates in the list.
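A debugging version along those lines might be (again, the one-line GenericParser is only a stand-in):

```python
class GenericParser:                    # stand-in for spark.GenericParser
    pass

class MyParser(GenericParser):
    def resolve(self, rules):
        # Print each conflict so the guilty rules can be identified,
        # then fall back on the default choice.
        print('ambiguity:', rules)
        return rules[0]
```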

This class will construct syntax trees for you automatically (or at
least with minimal intervention). Instead of having your parser be a
subclass of GenericParser, you make it a subclass of
GenericASTBuilder. GenericASTBuilder is a subclass of
GenericParser, meaning the parsing interface is unchanged.

The constructor for GenericASTBuilder takes an extra argument,
which is the class that you want AST nodes to be.

It's actually simpler than it sounds. The rest of this section gives
a ``cookbook'' explanation. The class of AST nodes is called ``AST''
in the examples below.

Sometimes these are called ``concrete syntax trees.'' By heterogeneous
I mean that the leaves of the tree are instances of tokens rather than
instances of the AST class. (See the GenericASTTraversal section
for more on this.)

To make a homogeneous parse tree, the leaves need to be changed from
token instances into AST instances. When GenericASTBuilder
encounters a leaf, it calls GenericASTBuilder.terminal, so you
simply need to supply your own version of it:
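A sketch of such a method; the attr attribute is my own convention for carrying along a token's value, and the one-line GenericASTBuilder is a stand-in for the real class in spark.py:

```python
class GenericASTBuilder:                # stand-in for spark.GenericASTBuilder
    pass

class AST:                              # the AST node class used below
    def __init__(self, type):
        self.type = type
        self.kids = []
    def __getitem__(self, i):
        return self.kids[i]

class MyBuilder(GenericASTBuilder):
    def terminal(self, token):
        # Wrap the leaf token in an AST node so the whole tree is
        # made of AST instances.
        node = AST(token.type)
        node.attr = getattr(token, 'attr', None)    # keep the token's value, if any
        return node
```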

To use GenericASTBuilder for building an abstract syntax tree,
there should be a fairly straightforward mapping between the parse tree
and the AST you want. Just like GenericASTBuilder.terminal was
supplied in the last section, you'll now supply a
GenericASTBuilder.nonterminal method. The arguments to this method
are the nonterminal it's trying to build a node for, and the node's children.

For example, if I wanted to flatten the parse tree out a bit, I could
skip allocating new nodes if there's only one child:
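A sketch of that idea; the stand-in GenericASTBuilder here simply builds an AST node per nonterminal, which is my guess at the base class's behaviour:

```python
class AST:                              # node class handed to the builder
    def __init__(self, type, kids):
        self.type = type
        self.kids = kids
    def __getitem__(self, i):
        return self.kids[i]

class GenericASTBuilder:                # stand-in for spark.GenericASTBuilder
    def nonterminal(self, type, args):
        return AST(type, args)

class MyBuilder(GenericASTBuilder):
    def nonterminal(self, type, args):
        # With a single child there's nothing worth keeping at this
        # level, so pass the child up instead of allocating a node.
        if len(args) == 1:
            return args[0]
        return GenericASTBuilder.nonterminal(self, type, args)
```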

This was called the ASTTraversal class in the IPC7 paper. For
consistency, I've renamed it and placed it in spark.py along
with the other generic classes. For backward compatibility,
ASTTraversal is still present in its old spot as a wrapper; new
code should use GenericASTTraversal.

I got a great suggestion about heterogeneous ASTs: use the already-present
token instances as leaves of the AST. I was all ready to add support
into GenericASTTraversal so that it only traversed a node's children
if the node had a __getitem__ method present. Then it
suddenly occurred to me that GenericASTTraversal already supports
heterogeneous ASTs: if you want to use tokens as leaves, just add a
method to your token class:

class MyToken:
    ...
    def __getitem__(self, i): raise IndexError
    ...

This way the interface to GenericASTTraversal is kept simple.

If you want to prevent further traversal into a subtree during a
preorder traversal, calling the prune method will do the trick. I
found I needed this when generating code from ASTs.
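To illustrate, here's a heavily simplified stand-in for GenericASTTraversal that mimics how prune can work (an exception caught by the traversal loop), together with a subclass that cuts off a subtree; the real implementation in spark.py will differ in detail:

```python
class _Prune(Exception):
    pass

class GenericASTTraversal:              # simplified stand-in for spark.py's class
    def __init__(self, ast):
        self.ast = ast

    def prune(self):
        raise _Prune()

    def preorder(self, node=None):
        if node is None:
            node = self.ast
        try:
            # Dispatch on the node's type: n_<type> if present.
            getattr(self, 'n_' + node.type, self.default)(node)
        except _Prune:
            return                      # skip this node's children entirely
        for kid in node:
            self.preorder(kid)

    def default(self, node):
        pass

class AST:                              # minimal node class for the example
    def __init__(self, type, kids=()):
        self.type = type
        self.kids = list(kids)
    def __getitem__(self, i):
        return self.kids[i]

class Lister(GenericASTTraversal):
    def __init__(self, ast):
        GenericASTTraversal.__init__(self, ast)
        self.seen = []
        self.preorder()

    def default(self, node):
        self.seen.append(node.type)

    def n_skipme(self, node):
        self.seen.append(node.type)
        self.prune()                    # don't descend into this subtree
```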

To decouple GenericASTTraversal from the node implementation,
GenericASTTraversal now calls the typestring method to
get the node's type as a string. Like its counterpart in GenericParser,
it takes a node as an argument, and must return a string (it may not return
None, unlike the one in GenericParser).

The default implementation simply returns the node's type field; this
behaviour is backwards-compatible.

GenericASTMatcher is a class that is designed for generating code
from ASTs. You supply a set of patterns (in docstrings, of course) and
actions; GenericASTMatcher will find a way to cover the AST with
your patterns, then invoke the corresponding actions. Actions are called
in left-to-right, bottom-up order.

As in GenericParser, the methods you supply are prefixed with
p_, which in this context stands for ``pattern.'' The argument
to the p_ methods is the AST node where the pattern is rooted.

Patterns are just trees linearized into prefix
form, using parentheses to denote
subtrees. On the left-hand side is a symbol which the pattern is
effectively ``rewritten'' to if the pattern matches. For example,
the pattern V ::= term ( V * V ) would correspond to:

      term
     / | \       =>    V
    V  *  V
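A sketch of a p_ method using that pattern; how the action computes its result is my own invention, and the one-line GenericASTMatcher is a stand-in for the real class in spark.py:

```python
class GenericASTMatcher:                # stand-in for spark.GenericASTMatcher
    def __init__(self, ast=None):
        self.ast = ast

class AutoInterpret(GenericASTMatcher):
    def p_multiply(self, node):
        ' V ::= term ( V * V ) '
        # node is the `term` node; its children are V, '*', V.
        node.value = node[0].value * node[2].value
```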

You'd use the above class as follows:

interpreter = AutoInterpret(ast)
interpreter.match()
print ast.value

You may optionally supply an AST to match. This is so you can
create one instance of your matcher class, then have it match different
ASTs. For instance, you could use a matcher for implementing a
language's arithmetic expressions, and use a GenericASTTraversal
for the rest.

AST nodes must support __getitem__ and __cmp__ methods
to use GenericASTMatcher.

If there's a conflict between two patterns, then
GenericASTMatcher will choose the longest one. Ideally,
the entire matching engine will eventually be replaced by a more
sophisticated one that'll find an ``optimal'' covering.

Parentheses considered harmful.

You may end up with some strange
things happening if you have a terminal/nonterminal named ``(''
or ``)'' as they're delimiting the pattern's tree structure.

Patterns.

GenericASTMatcher's engine will accept patterns
that are more general than those described above. These restrictions
may be enforced in a later release, however.

You can override t_ and p_ methods now in the expected fashion;
in other words, defining p_foo in a subclass hides p_foo in
its superclass. (Previously duplicate method names were not removed when
the Generic* classes did their reflection, giving strange results.)

Several of the Generic* classes keep references
to the various t_ and p_ methods. Unfortunately this
results in a circular reference, as Darrell Gallion pointed out, which
Python's current garbage collector won't collect.

This will only be an issue if you create and destroy Generic*
classes repeatedly; simply using a single instance of
GenericScanner repeatedly will not consume extra memory.

I probably won't fix this, since full GC is likely coming soon
to a Python near you, and it won't affect many users. If you
do need to handle this problem, contact me and I can advise you
how to change the SPARK code.

How do I keep track of the line number?

There's currently no
automatic way to do this. What I do is keep a line
number attribute, say lineno, for each token. Then
in my GenericScanner subclass I'll have a method like:

def t_nl(self, s):
    r'\n'
    self.lineno = self.lineno + 1

(I set self.lineno to 1 in my subclass' tokenize method.)

What about inherited attributes?

For inherited attributes - ones
that propagate from the root to the leaves of a tree - I just set
variables in my subclass rather than pass them around explicitly.
You may want to avoid relying on this while parsing unless you
know exactly how and when your parsing actions are being invoked.

How do I distribute a project that uses SPARK?

As far as I'm
concerned, you need only include spark.py. I'd appreciate it
if you mentioned SPARK in your documentation and/or web pages. And
if you send me the URL of your project's home page, I can add it
to the SPARK web page.