John Lilley <jlilley@empathy.com> wrote:
>I've had this idea recently that surely has been explored by those
>more knowledgeable than I, so I want to know if there are scholarly
>articles available on what I would call a "tree-based" approach to
>source-code parsing. Essentially, I would take the following steps:
>
>1) Lex the input stream into tokens as usual.
>
>2) Convert a token stream directly into a degenerate "AST" [...]
>
>3) Walk the flat "AST" using LL or LALR methods and match gross
>features of the language that are invariant regardless of local
>syntax.
>
>4) Perform successive refinement steps of the AST. [...]

You've pretty much described a Lisp or Scheme compiler.

In Lisp and Scheme compilers, the gross chunking of expressions is
done with parentheses (but you could have more kinds of delimiters
if that's what you want). Then compilation proceeds by simple
recursive descent over a tree. The recursive descent is roughly
LL(0), but that can be generalized to give you more
conventional-looking syntax.
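To make the "degenerate AST" step concrete, here is a rough sketch in
Python (not from the original post; the function names are mine) of
chunking a token stream into a tree using nothing but the paren
delimiters, before any language-specific parsing happens:

```python
def read_sexpr(tokens):
    """Convert a flat token stream into nested lists (a 'degenerate AST').

    The only structure recognized here is the delimiters; everything
    else is left as an uninterpreted token for later passes.
    """
    token = next(tokens)
    if token == "(":
        node = []
        while True:
            sub = read_sexpr(tokens)
            if sub == ")":
                return node
            node.append(sub)
    return token

def parse(source):
    # A crude lexer: split on whitespace after padding the delimiters.
    tokens = iter(source.replace("(", " ( ").replace(")", " ) ").split())
    return read_sexpr(tokens)

print(parse("(define (square x) (* x x))"))
# → ['define', ['square', 'x'], ['*', 'x', 'x']]
```

Later passes then walk this tree by recursive descent, refining each
node in context, which is essentially steps 3 and 4 of the proposal.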

The compile process is context-sensitive and transformational, yet
basically simple.

Computer scientists often think that cfg's are the way to go, and that
context-sensitive grammars and transformational grammars are
impractical.

Ironically, people have been merrily using fundamentally
transformational languages since 1963, when macros were added to Lisp.

A lot of people think that Lisp macros are just a weird hack, but
they're an example of something deeply powerful and wonderful. They
let you define most of the language from within the language, by
specifying transformations from new constructs into simpler ones that
the compiler can understand directly.
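The transformation idea can be sketched in a few lines of Python (an
illustration of the mechanism, not the actual Lisp implementation): a
user-registered rule rewrites a new construct, `when`, into the
primitive `if` that the compiler understands directly, and the
expander applies such rules recursively over the tree.

```python
MACROS = {}

def defmacro(name):
    """Register a transformer for the construct named `name`."""
    def register(transformer):
        MACROS[name] = transformer
        return transformer
    return register

def expand(tree):
    """Recursively expand macros in a nested-list expression."""
    if not isinstance(tree, list) or not tree:
        return tree
    head = tree[0]
    if head in MACROS:
        # Expand the result too, in case the macro produced more macros.
        return expand(MACROS[head](tree))
    return [expand(t) for t in tree]

@defmacro("when")
def expand_when(form):
    # (when test body...) => (if test (begin body...) #f)
    _, test, *body = form
    return ["if", test, ["begin", *body], "#f"]

print(expand(["when", ["ready?"], ["launch"], ["log"]]))
# → ['if', ['ready?'], ['begin', ['launch'], ['log']], '#f']
```

The compiler proper only ever sees the core forms; the rest of the
language is defined by transformations like this one.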

Recent work in Scheme has "cleaned up" Lisp macros, fixing the scoping
problems and giving you a clear framework for adding productions to
the grammar and telling the compiler how to transform the recognized
code into equivalent expressions. So you have an extensible parser
built right into the language. This will be a standard feature of
Scheme, as of the soon-to-be-released new edition of the Scheme
standard.

This facility can be used to straightforwardly encode many
context-sensitivities that real languages exhibit and cfg parsers
miss. For example, you can encode the fact that a break can only
occur in the context of a loop by implementing the loop construct as a
macro, and having that macro introduce a local macro that defines
break. The normal scope rules of the language then ensure that a
break is legal inside a loop, but a syntax error outside of one.

Since the parser understands scope, it can naturally handle most scope
issues "for free". And as in the example above, many other
context-sensitivities can be straightforwardly expressed and
automatically checked by casting them in terms of transformations that
introduce new scopes.

My research group is currently working on an extensible compiler front
end that gives you the power and elegance of Scheme macros (and much
more), while supporting "normal looking" LL(k) syntax, e.g., Java's.
(Right now it's only LL(1), and rudimentary in general, but it
basically works.)

It's ironic that many programming language folk scoff at Lisp and
Scheme's "trivial" syntax, failing to notice that the surface syntax
is boring largely so that the higher-level syntax can be extremely
sophisticated. The most interesting work in parsing has been done in
Lisp and Scheme, and in languages inspired by them.

Context-free grammars are boring and very clumsy; I think computer
scientists missed the boat in the late 60's and early 70's.

Too bad people think "transformational grammar is hard", when *some*
transformational grammars are really quite simple and tractable---and
really the only kind of thing that makes sense for programming
languages. Luckily we can design programming languages to be much
easier to deal with than natural languages!