Introduction

Parsers, interpreters and compilers are coming more and more within reach of the ordinary programmer. In the very beginning, such tools were handcrafted. Then came the time of parser and compiler generators (e.g., YACC, the popular UNIX parser generator, or ANTLR), and currently we are seeing grammars and parsers integrated directly into a host language. boost::spirit, for example, is a C++ library which provides a domain-specific sublanguage (using operator overloading and expression templates). Perl 6 goes a step further and makes grammars part of the language. These tools help a great deal, but they still require quite a lot of knowledge about the underlying theoretical framework.

This is the first part of a series of articles which explain the most basic concepts and practices of parsing and compiling. It uses toy programming languages and their interpreters to expose the concepts. For each toy language, sample programs are presented. These samples are handwritten, but in a style which shows how the task could be automated. This introduction to language processing addresses newbies as well as more experienced programmers. The former get enough explanatory text, the latter get C++ code samples which show how language processing can be done with minimal baggage. The whole article series strives for minimality both in theory and in implementation.

Steps for building an interpreter

Let's take a command line calculator for arithmetic formulas as an introductory example. The following excerpt shows a sample session, where lines starting with '>>' indicate program output.

To implement this, the calculator program must accomplish the following tasks:

Task: It must recognize the structure and the components of the input line, which consists of arithmetical expressions, assignments and formula names. Errors must be reported in a user-friendly, informative way.
Sketch of solution: This is done by a process called parsing. The effort invested in parsing pays off in the quality of the error messages for syntax errors and in the error recovery. A poorly designed parser will produce follow-on errors after the first error or will not be able to continue parsing at all.

Task: It must evaluate the arithmetic formula and store the result in a map which associates the formula name with the value of the expression. Furthermore, it must store the formula expression itself in order to recalculate the formula when a component changes its value through another evaluation.
Sketch of solution: Evaluation of an expression or statement to a result value can only be done after successful parsing. Evaluation may either be executed directly during parsing, or it may be deferred until the parsing process has constructed an appropriate intermediate "database" and finally has generated efficient "code" for execution by a real or virtual processor.

Task: It must cope with runtime errors like division by zero or taking the root of a negative number.
Sketch of solution: Handling of runtime errors is done by the runtime system. The amount of runtime error handling is a tradeoff between efficiency and convenience and must be adapted to the expectations of the target user group.
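The second and third tasks can be sketched in a few lines of C++. This is only an illustration under assumptions: the names Calculator, Define, Value, CheckedDivide and CheckedSqrt are hypothetical and are not taken from the article's sources. A formula is stored as a callable object rather than as a number, so a dependent formula is recomputed whenever its value is requested, and the two checked operations show where the runtime system can report errors.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical sketch: the map associates a formula name with the formula
// itself (a callable), so the value is recalculated on every request.
class Calculator {
public:
    using Formula = std::function<double(const Calculator&)>;

    void Define(const std::string& name, Formula f) {
        formulas_[name] = std::move(f);
    }

    double Value(const std::string& name) const {
        auto it = formulas_.find(name);
        if (it == formulas_.end())
            throw std::runtime_error("unknown formula: " + name);
        return it->second(*this);
    }

private:
    std::map<std::string, Formula> formulas_;
};

// Runtime checks for the two error cases named in the text.
inline double CheckedDivide(double a, double b) {
    if (b == 0.0) throw std::domain_error("division by zero");
    return a / b;
}

inline double CheckedSqrt(double x) {
    if (x < 0.0) throw std::domain_error("square root of a negative number");
    return std::sqrt(x);
}
```

Throwing exceptions is just one possible choice here; a calculator aimed at casual users might instead print a message and keep the previous value.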

Parsing and Grammars

Parsing is the process of associating a hierarchical structure with a source text. A programmer who indents her source text appropriately (e.g., she indents the body of loops) does a simple form of parsing. Parsing is the first task in language processing and one of the most important ones, because it guides all the later steps of a translation system.

The formal framework for parsing is set by a grammar which describes the source language. A grammar consists of a set of rules which, taken together, associate a unique tree structure with any correct input text. Grammars are divided into different classes according to their expressiveness. On the lowest level are grammars for regular expressions. These grammars cannot describe recursively nested constructs, which are needed to express, for example, the fact that a statement has other statements as its components. The grammar class commonly used to describe programming languages is called "context free".

Each rule of a context free grammar has the form R: a b c | d ; where R is the name of the rule (also called a Nonterminal), : separates the rule name from the right hand side of the rule, a, b, c and d are either terminals or names of further rules (Nonterminals), and | separates alternatives. This is further explained in the following table:

Notion: Terminal
Samples: 'for'[a-zA-Z][0-9]+;
Meaning: Describes a string of the input directly. It is common practice to use a subset of the notation which is used in regular expressions for describing terminals. The only difference is the notation for strings. In context free grammars, we write e.g., 'for', where we would write just for in a regular expression.

Notion: Quantifiers
Samples: ? * +
Meaning: Same meaning as in regular expressions. These three quantifiers are sufficient.

Notion: Alternative
Samples: |
Meaning: Same meaning as in regular expressions.

Notion: Grouping
Samples: ( ... )
Meaning: Same meaning as in regular expressions.

Notion: Nonterminal, Rule
Samples: EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;
Meaning: EnclosedDigits in the sample is a Nonterminal. In this sample, it is both the name of a rule and used on the right hand side of the rule. For each Nonterminal used somewhere in a grammar rule, there must be exactly one corresponding rule.

The following sample context free grammar rule describes a string of digits enclosed in an arbitrary number of parentheses:

EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;

This grammar rule associates the input string ((123)) with the following parse tree:

EnclosedDigits
    '(' EnclosedDigits ')'
        '(' EnclosedDigits ')'
            '123'

In the coming samples, we will use the following linear form as a tree notation:

EnclosedDigits<'(' EnclosedDigits<'(' EnclosedDigits<'123'> ')'> ')'>

Interestingly, the rule EnclosedDigits matches the complete input string ((123)) from the first to the last character, and it also matches inner parts like (123) and 123. This is possible because of the recursive nature of context free grammars. When a grammar is used to recognize an input string, we say that we use the grammar for parsing.

A grammar can also be used to generate strings of the language by a process called substitution. Substitution starts on the right hand side of the so-called start rule and means replacing a Nonterminal by one of the alternatives on the right hand side of the rule for that Nonterminal. By repeated substitutions, we can generate all input strings which belong to the language described by the grammar. This shall be shown with the following sample start rule:

Expression: Expression '+' Expression | '(' Expression ')' | [0-9]+ ;

The following table shows one possible substitution history:

Before Substitution: Expression
Step Description: Substitute by first alternative of Expression rule
After Substitution: Expression '+' Expression
Parse Tree: Expression< Expression '+' Expression >

Before Substitution: Expression '+' Expression
Step Description: Substitute marked Nonterminal by second alternative of Expression rule

The resulting generated language string is 1001+(37). Note that the order of substitutions (whether you first substitute a Nonterminal occurring on the far right or on the far left) is not specified by a context free grammar. A parser therefore needs not only a context free grammar to do its job, it also needs a parsing strategy, which is nothing more than a built-in rule telling which of the possible substitutions is carried out first.

Common parsing strategies are LL and LR. Both parser types read the input from left to right, which is the reason for the first L in LL and LR. An LL parser always substitutes the leftmost Nonterminal first, whereas an LR parser substitutes the rightmost Nonterminal first. Typically, LL and LR parsers do not accept the same grammar rules. An LL parser is not happy with left recursive rules, whereas an LR parser is not happy with right recursive rules. This is exemplified by the following sample grammar rules:

Parsing Strategy   Grammar                    Comment
LL                 Digits: [0-9] Digits? ;    Match first a terminal, then recurse.
LR                 Digits: Digits? [0-9] ;    Recurse, then match terminal.

The LR grammar Digits: Digits? [0-9] ; is left recursive, because its right hand side has the rule name as its first symbol. Fortunately, it is not difficult to rewrite an LR grammar into an LL grammar and vice versa, as we will see later.

Expressiveness of Context Free Grammars

Many important properties of existing programming languages either cannot be expressed with a context free grammar or are tedious to describe. Consider a language with procedures where the procedure name must occur both at the beginning and at the end:

Proc: 'proc' Name Parameters Body Name ';' ;

There is no way to describe that the Name at the beginning must be the same as the Name at the end. Take as another example a length limitation, e.g., the rule that an identifier cannot exceed 32 characters. While it is possible to describe this with a context free grammar rule, it requires either a lot of writing or the addition of a new quantifier (e.g., [a-zA-Z]{1,32}). In many cases it turns out to be preferable to check for the violation of a restriction after successful parsing rather than trying to express the restriction within the grammar. Generally speaking, a grammar for a programming language always accepts a superset of the legal constructs.

The real strength of context free grammars is the proper description of hierarchical relations, as shown by the following examples of grammars for simple arithmetic expressions. This is a poorly chosen grammar for expressions:

expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ;

This grammar does not express the fact that the precedence of '+' is not the same as that of '*'. The following is a much better grammar for expressions:

The source text 2*3+5 could be parsed by expr0 into the following tree:

expr0<
    expr0<'2'>
    '*'
    expr0< expr0<'3'> '+' expr0<'5'> >
>

whereas expr1 would result in this tree:

expr1<
    expr1<mul_expr<'2'> '*' '3' >
    '+'
    mul_expr<'5'>
>

Only the second tree reflects the precedence of the '+' and '*' operators. By the way, the grammar rule for expr1 is suitable for an LR parser, but not for an LL parser.
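How a layered grammar encodes precedence can be seen in a small evaluator. The sketch below assumes an LL-friendly variant of such a grammar, namely Expr: Term (('+'|'-') Term)* ; Term: Number (('*'|'/') Number)* ; Number: [0-9]+ ;. This variant and the function names are assumptions made for illustration, not the expr1 rule itself. Because multiplication is handled one level deeper than addition, 2*3+5 is grouped as (2*3)+5.

```cpp
#include <cassert>
#include <cctype>

// Number: [0-9]+ ;
static bool ParseNumber(const char*& p, double& value) {
    if (!std::isdigit(static_cast<unsigned char>(*p))) return false;
    value = 0.0;
    while (std::isdigit(static_cast<unsigned char>(*p)))
        value = value * 10.0 + (*p++ - '0');
    return true;
}

// Term: Number (('*'|'/') Number)* ;
static bool ParseTerm(const char*& p, double& value) {
    if (!ParseNumber(p, value)) return false;
    while (*p == '*' || *p == '/') {
        const char* save = p;            // position before the operator
        char op = *p++;
        double rhs = 0.0;
        if (!ParseNumber(p, rhs)) { p = save; break; }
        value = (op == '*') ? value * rhs : value / rhs;
    }
    return true;
}

// Expr: Term (('+'|'-') Term)* ;
static bool ParseExpr(const char*& p, double& value) {
    if (!ParseTerm(p, value)) return false;
    while (*p == '+' || *p == '-') {
        const char* save = p;            // position before the operator
        char op = *p++;
        double rhs = 0.0;
        if (!ParseTerm(p, rhs)) { p = save; break; }
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return true;
}
```

Called on the null-terminated string "2*3+5", ParseExpr yields 11; a grammar without the Term level would leave the grouping to chance.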

The Peril of Ambiguity

A grammar is ambiguous if a source text can be parsed into two different parse trees depending on the order of substitutions. The above sample expr0: expr0 ('+'|'-') expr0 | expr0 ('*'|'/') expr0 | [0-9]+ ; is ambiguous because the source text 2*3+5 can be parsed into both of the following trees:

expr0<expr0<'2'> '*' expr0< expr0<'3'> '+' expr0<'5'> > >

expr0<expr0<expr0<'2'> '*' expr0<'3'> > '+' expr0<'5'> >

depending on the substitution order. Ambiguity defeats the main purpose of grammars, namely the association of a source text with a unique hierarchical structure. Whether ambiguity of a grammar really results in different hierarchical structures not only depends on the grammar but also on the parsing strategy.

Recursive Descent Parsing

Recursive descent parsing is the simplest and most intuitive parsing method. It is an LL parsing method and therefore reads the input from left to right (the first L in LL) and always expands the leftmost Nonterminal first (the second L in LL). The name recursive descent is due to the fact that this parsing method uses a function for every grammar rule. The function for the start rule must parse the complete source. It does this by providing a function body, which implements the right hand side of the start rule in the following way:

in case of alternatives, it tries all the promising alternatives by executing the corresponding code for the alternative until one of them succeeds (matches the input text)

in case of a sequence of grammar symbols (tokens or Nonterminals), it executes the corresponding code for each symbol in the sequence until one of them fails or otherwise the code for the whole sequence succeeds.

Note that the implementation of a recursive descent parser sometimes requires backtracking. Backtracking means that you go back to a previous state of the computation. If, for example, the first part of the chosen alternative is successfully matched against the input text and then a mismatch occurs, we must reset the source position to the position before we entered the alternative and choose another alternative. Let's show this with a sample implementation. The rules:

The start grammar function MatchesEnclosedDigits takes an iterator (e.g., a character pointer) as input and either returns true, in which case it sets the iterator to the position just behind the recognized text, or it returns false, in which case it leaves the iterator unchanged. The close relationship between grammar rules and corresponding parsing functions makes it obvious that a recursive descent parser cannot work with grammar rules like:

Digits: Digits? [0-9] ;

because this would result in infinite recursion. Many grammars coming with language specifications are written for LR parsers (the well known YACC tool requires an LR grammar). But it is quite easy to rewrite an LR grammar into an LL grammar.
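A function with the contract described above can be sketched as follows for the rule EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;. This is a minimal sketch, not the article's original listing; it assumes the iterator is a character pointer (or behaves like one) into a null-terminated buffer.

```cpp
#include <cassert>
#include <cctype>

// EnclosedDigits: [0-9]+ | '(' EnclosedDigits ')' ;
// Returns true and moves `it` just behind the recognized text,
// or returns false and leaves `it` unchanged.
template <typename Iterator>
bool MatchesEnclosedDigits(Iterator& it) {
    Iterator start = it;  // remembered for backtracking

    // First alternative: [0-9]+
    if (std::isdigit(static_cast<unsigned char>(*it))) {
        do {
            ++it;
        } while (std::isdigit(static_cast<unsigned char>(*it)));
        return true;
    }

    // Second alternative: '(' EnclosedDigits ')'
    if (*it == '(') {
        ++it;
        if (MatchesEnclosedDigits(it) && *it == ')') {
            ++it;
            return true;
        }
    }

    it = start;  // backtrack: no alternative matched
    return false;
}
```

The function calls itself only after it has consumed a '(' character, so every recursive call works on a shorter rest of the input. With a left recursive rule, the recursive call would come first and nothing would ever be consumed.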

Parsing Expression Grammars (PEGs)

The idea behind the so called parsing expression grammar (PEG) formalism [1] is a "procedural interpretation" of context free grammars. A PEG grammar can be "executed" using the grammar and a source string as input. The result of the execution is either a match of the input string, in which case the parsing position is at the end of the recognized part, or otherwise the result is "doesn't match", in which case the input position remains unchanged. This statement is not only true for the grammar as a whole, but for each grammar rule. The PEG formalism is nothing else than a mapping of common practices in recursive descent parsing to grammars. It consists of the following grammar elements:

PEG construct: Sequence
Notation: e1 e2
Operational meaning: Match input against e1 moving the input position behind the input part described by e1, then match from new input position against e2. If the match fails for either e1 or e2, the input position is restored to the position before entering the Sequence and the result is fail, otherwise the input position is moved behind the recognized Sequence and the result is success.

PEG construct: Prioritized choice between Alternatives
Notation: e1 / e2
Operational meaning: Match input against e1 moving the input position behind the input part described by e1. If the match fails, restore the input position and match against e2. If the match fails again, restore to the position before entering the Alternative and the result is fail. Otherwise, if one of the alternatives succeeds, the input position is moved behind the source part which is recognized by the chosen alternative and the result is success.

PEG construct: Greedy Option
Notation: e?
Operational meaning: Match input against e, moving the input position behind the recognized part in case of success. If the match fails, restore the input position, but the result is in any case success.

PEG construct: Greedy zero or more occurrences
Notation: e*
Operational meaning: Interpret it like an unlimited number of e? e? e? e? ...

PEG construct: Greedy one or more occurrences
Notation: e+
Operational meaning: Interpret it like e e? e? e? ...

PEG construct: Peek predicate
Notation: &e
Operational meaning: Match the input against e without changing the input position (the input position remains the same whether the match succeeds or fails). The result is success if it matches, otherwise it is fail.

PEG construct: Not predicate
Notation: !e
Operational meaning: Match the input against e without changing the input position (the input position remains the same whether the match succeeds or fails). The result is fail if it matches, otherwise it is success.

PEG construct: Advance predicate
Notation: .
Operational meaning: Increase the input position by one. The result is success.
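The operational meanings above translate almost word for word into small C++ class templates. The following is a sketch under assumptions: the names Char, Seq, Or, Opt, Star, Plus, Peek and Not are hypothetical and are not claimed to be the templates used in the article's sources, the input is assumed to be null-terminated, and the . construct is left out. Every element restores the input position itself when it fails.

```cpp
#include <cassert>

template <char C>                       // terminal character 'C'
struct Char {
    template <typename It> static bool Matches(It& it) {
        if (*it != C) return false;
        ++it;
        return true;
    }
};

template <typename E1, typename E2>     // e1 e2
struct Seq {
    template <typename It> static bool Matches(It& it) {
        It start = it;
        if (E1::Matches(it) && E2::Matches(it)) return true;
        it = start;                     // restore on failure
        return false;
    }
};

template <typename E1, typename E2>     // e1 / e2 (prioritized choice)
struct Or {
    template <typename It> static bool Matches(It& it) {
        return E1::Matches(it) || E2::Matches(it);
    }
};

template <typename E>                   // e?
struct Opt {
    template <typename It> static bool Matches(It& it) {
        (void)E::Matches(it);
        return true;
    }
};

template <typename E>                   // e*
struct Star {
    template <typename It> static bool Matches(It& it) {
        for (;;) {
            It before = it;
            // Stop on failure, and also when e matched without consuming input.
            if (!E::Matches(it) || it == before) break;
        }
        return true;
    }
};

template <typename E>                   // e+
struct Plus {
    template <typename It> static bool Matches(It& it) {
        return E::Matches(it) && Star<E>::Matches(it);
    }
};

template <typename E>                   // &e
struct Peek {
    template <typename It> static bool Matches(It& it) {
        It probe = it;                  // work on a copy, never move `it`
        return E::Matches(probe);
    }
};

template <typename E>                   // !e
struct Not {
    template <typename It> static bool Matches(It& it) {
        It probe = it;
        return !E::Matches(probe);
    }
};
```

A rule such as S: 'for' &'(' ; then becomes the type Seq<Seq<Char<'f'>, Seq<Char<'o'>, Char<'r'>>>, Peek<Char<'('>>>, and parsing means calling its static Matches function with a character pointer.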

The following table shows the effect of PEG grammar constructs on sample input strings. The | character indicates the input position:

PEG construct            Input         Match Position   Match Result
S: 'for' ;               |for          for|             true
S: 'for' ;               |former       for|mer          true
S: 'for' ;               |afor         |afor            false
S: 'for' 'all' ;         |forall men   forall| men      true
S: 'former' / 'for' ;    |for          for|             true
S: 'former' / 'for' ;    |former       former|          true
S: 'for' / 'former' ;    |for          for|             true
S: 'for' / 'former' ;    |former       for|mer          true
S: 'for'?'mer' ;         |former       former|          true
S: 'for'?'mer' ;         |mer          mer|             true
S: 'for'?'former' ;      |former       |former          false
S: [0-9]* ;              |1903.535     1903|.535        true
S: [a-z .]+'.*'? ;       |ifi.go.*     ifi.go.|*        true
S: 'for' &'(' ;          |for(         for|(            true
S: 'for' &'(' ;          |for[         |for[            false
S: 'for' !'(' ;          |for[         for|[            true
S: 'for' !'(' ;          |for(         |for(            false

PEG versus regular expressions

Since PEGs cover context free grammars, they are more powerful than regular expressions. The most important difference between Parsing Expression Grammars and regular expressions is that the former support recursive definitions and the latter do not. But simple non-recursive situations often require shorter regular expression patterns, as can be shown for an email address recognizer.

The correct PEG email address recognizer is more complicated than the regular expression email address recognizer and uses implicit backtracking (in !EMailSuffix). When we parse the legal email address marc.bloom@blo.blo.uk with the PEG rule EMail: [A-Za-z0-9._%-]+'@'[A-Za-z0-9._%-]+'.'[A-Za-z]{2,4} ;, we will fail, because blo.blo.uk will be recognized by the greedy rule component [A-Za-z0-9._%-]+, so that there will be no input character left for '.'[A-Za-z]{2,4}. The regular expression pattern does not have this problem, because the input string blo.blo.uk will be wisely shared between the two parts of the regular expression [A-Za-z0-9._%-]+\.[A-Za-z]{2,4}. Fortunately, it turns out that such intricate backtracking (as used in the correct PEG rule) is seldom necessary for parsing.
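The behaviour described above can be reproduced with two handwritten matchers. The sketch assumes a corrected rule of the form EMail: EMailChar+ '@' (!EMailSuffix EMailChar)+ EMailSuffix ; with EMailSuffix: '.' [A-Za-z]{2,4} !EMailChar ;. This is one plausible way to use !EMailSuffix and not necessarily the exact grammar of the article, and all function names are hypothetical. Input is assumed to be a null-terminated string.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// EMailChar: [A-Za-z0-9._%-] ;
static bool IsEMailChar(char c) {
    return c != '\0' &&
           (std::isalnum(static_cast<unsigned char>(c)) ||
            std::strchr("._%-", c) != nullptr);
}

// EMailSuffix: '.' [A-Za-z]{2,4} !EMailChar ;   (assumed rule)
static bool MatchEMailSuffix(const char*& p) {
    const char* q = p;
    if (*q != '.') return false;
    ++q;
    int n = 0;
    while (n < 4 && std::isalpha(static_cast<unsigned char>(*q))) { ++q; ++n; }
    if (n < 2 || IsEMailChar(*q)) return false;
    p = q;
    return true;
}

// Naive rule: EMailChar+ '@' EMailChar+ '.' [A-Za-z]{2,4} ;
static bool MatchEMailGreedy(const char*& p) {
    const char* q = p;
    if (!IsEMailChar(*q)) return false;
    while (IsEMailChar(*q)) ++q;
    if (*q != '@') return false;
    ++q;
    if (!IsEMailChar(*q)) return false;
    while (IsEMailChar(*q)) ++q;   // greedy: also swallows the final ".uk"
    if (*q != '.') return false;   // so no '.' can be left at this point
    ++q;
    int n = 0;
    while (n < 4 && std::isalpha(static_cast<unsigned char>(*q))) { ++q; ++n; }
    if (n < 2) return false;
    p = q;
    return true;
}

// Assumed corrected rule: EMailChar+ '@' (!EMailSuffix EMailChar)+ EMailSuffix ;
static bool MatchEMail(const char*& p) {
    const char* q = p;
    if (!IsEMailChar(*q)) return false;
    while (IsEMailChar(*q)) ++q;
    if (*q != '@') return false;
    ++q;
    int count = 0;
    for (;;) {
        const char* probe = q;                 // !EMailSuffix: look ahead only
        if (MatchEMailSuffix(probe)) break;
        if (!IsEMailChar(*q)) break;
        ++q;
        ++count;
    }
    if (count == 0 || !MatchEMailSuffix(q)) return false;
    p = q;
    return true;
}
```

For marc.bloom@blo.blo.uk, MatchEMailGreedy fails while MatchEMail succeeds, because the lookahead stops the domain loop just before the final .uk.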

Procedural and template implementations of PEG elements

An ideal C/C++ PEG implementation is both close to a PEG grammar and efficient at runtime. This can be achieved with normal "C" code as well as with C++ templates. C++ templates prove to be more flexible, whereas normal C code has the advantage of causing less trouble in case of programmer errors, because error messages for templates are still difficult to decode. The rule S: 'C' ; has the following procedural implementation:
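A procedural matcher for such a rule can be as small as the following sketch. It is written here for illustration and is not the article's original listing; MatchesS and MatchesString are hypothetical names, and the input is assumed to be null-terminated.

```cpp
#include <cassert>

// S: 'C' ;
// Succeeds and advances if the current character is 'C',
// otherwise fails and leaves the position unchanged.
template <typename Iterator>
bool MatchesS(Iterator& it) {
    if (*it == 'C') {
        ++it;
        return true;
    }
    return false;
}

// A terminal string such as 'for': every character must match,
// otherwise the position is restored.
template <typename Iterator>
bool MatchesString(Iterator& it, const char* text) {
    Iterator start = it;
    for (; *text != '\0'; ++text, ++it) {
        if (*it != *text) {
            it = start;
            return false;
        }
    }
    return true;
}
```

With these two helpers, a rule like S: 'for' 'all' ; is just MatchesString(it, "for") && MatchesString(it, "all"), wrapped in code that restores the position if the second call fails.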

Note that the constructs . and e{LOW,HIGH} are not part of the originally proposed PEG elements, but are added by me. Applied to the email PEG recognizer rules, we get the following template implementation:

Note that EMail::Matches(it,&c) has two parameters (contrary to the table above). The first is an iterator, the second a pointer to the parsing context. This parsing context is necessary, if one wants to carry out some action during parsing.

Input Iterators

The parser reads the source for parsing. Storage medium, encoding, and end of input indicators can be different for various sources. In most cases, the storage medium is a file or a character string. The character encoding is either ASCII, UTF-8 or one of the many other Unicode encodings. The end of input can be detected either by the presence of a special terminator or by comparison against an end of input pointer.

Checking each input position against the end of input is in most cases not necessary. Most grammars expect only a subset of the possible input characters. If we know that the input is terminated by a special character, say '\0', we can use a start rule like: S: TopRule '\0';.

If we use a template parameter Iterator, we can cope with all the mentioned possibilities by providing a sensible iterator implementation. Our PEG implementation uses only the following iterator operations:

PEG iterator operation: int iterator::operator*()const
Meaning: Returns the current character value. In case of an iterator for an input range, it returns a negative value when at the end of range.

PEG iterator operation: iterator& iterator::operator++()
Meaning: Increments the input position. In case of an iterator for an input range, it does not increment the input position when at the end of the input range.
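An iterator for an input range that offers exactly these two operations might look like the following sketch. The class name RangeIterator and the helper Position are hypothetical; the value -1 is one possible choice for the negative end-of-range value.

```cpp
#include <cassert>

// Iterator over the character range [begin, end) without a terminator.
class RangeIterator {
public:
    RangeIterator(const char* begin, const char* end)
        : pos_(begin), end_(end) {}

    // Current character value, or -1 when at the end of the range.
    int operator*() const {
        return pos_ != end_ ? static_cast<unsigned char>(*pos_) : -1;
    }

    // Advances by one character, but never past the end of the range.
    RangeIterator& operator++() {
        if (pos_ != end_) ++pos_;
        return *this;
    }

    // Helper for callers that need the raw position (hypothetical).
    const char* Position() const { return pos_; }

private:
    const char* pos_;
    const char* end_;
};
```

Because no terminal of a grammar compares equal to a negative value, every match attempt at the end of the range simply fails, at the cost of one extra comparison per access.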

Why performance matters

The parser of a C++ compiler spends less than 20% of the compilation time on parsing. But there are many applications for parsing without sophisticated post-processing where speed really matters. Examples are auto-completion support for development environments, preprocessing of C/C++ source files, and extraction of items from large XML files.

There are two main sources of inefficiencies in parsing:

backtracking

runtime costs of unnecessary abstractions

In recursive descent parsing, backtracking can be avoided by rewriting the grammar in a way that avoids alternatives with the same beginning. This technique is known as left factorization and will be shown in one of the next articles of this series. In LR parsing (LALR parsing to be precise) as used by YACC (the UNIX parser generator), backtracking is avoided by the internal use of a stack for symbols which wait for further processing.

It is well known that handwritten recursive descent parsers using optimized grammars and tools like YACC are the best performers. But for the casual user, neither handwritten parsers nor compiler generation tools are ideal: the former require too much work, and the latter in most cases generate impenetrable and hard-to-maintain code. With Parsing Expression Grammars, the picture could change, because PEGs combine the easy-to-understand recursive descent technique with the possibility of an efficient implementation which is very close to the original grammar.

I therefore compared the runtime efficiency of three PEG implementations and of a hand-optimized parser. I took a grammar for arithmetical expressions as a test case, because such a grammar can be written in a way which requires no backtracking. All the tested parsers had to parse the expression 132*( firstOccurance + x2*( 1001/N55 )+19 ), which was repeated until it filled a buffer of 1000000 characters. The following table gives the results.

Implementation technique               CPU usage VC++ 7.0   CPU usage GCC 3.3.1
procedural PEG implementation          0.17 seconds         0.03 seconds
templated PEG implementation           0.211 seconds        0.041 seconds
begEnd templated PEG implementation    0.32 seconds         0.09 seconds
hand optimized PEG implementation      0.06 seconds         0.03 seconds

The results show that the templated and procedural PEG implementations come very close to the hand-optimized parser. They also show that using an iterator which always checks for the end of input condition has its price.

Points of interest

The term Parsing Expression Grammars was coined by Bryan Ford [1]. He also pointed out that this kind of recursive descent parsing with backtracking is common practice. He showed that only very few rules are needed to define a powerful framework for parsing. Instead of relying on grammar transformations which eliminate alternatives having the same beginning (so-called left factorization), he invented a parser which memoizes already taken paths and achieves linear parsing time for any PEG grammar (the packrat parser).

An important advantage of PEGs comes from the fact that a separate scanner is not necessary. When parsing with a scanner, one divides the parsing task into a low level part called scanning and a high level part, parsing proper. A scanner typically skips white space and comments, reads items like numbers, operators and identifiers, and returns an integer value to indicate what it has found. A scanner reduces the necessity to backtrack in some cases. On the other hand, it introduces a new level of abstraction which was never necessary from a theoretical point of view, and it can cause problems because of the lack of context information. A typical example is the closing angle bracket of template instantiations (the > symbol). You get a compiler error if you close two template instantiations at once using >> (one must use > > instead), because the scanner recognizes >> as a shift operator, a problem which does not arise if you do not use a scanner. Interestingly, Christopher Diggins [2], who presented his YARD parser on this site, does not mention the work of Bryan Ford, but uses exactly the PEG ideas, including scanner-less parsing. This is certainly due to the fact that these ideas are in the air.


Comments and Discussions

I found this basic interpreter in my source codes. I think it deserves a look. It could be an answer to a BASIC Interpreter. I have yet to work with it. I just found it about 20 minutes ago. I uploaded it here:
http://www.mediafire.com/?2sw9c9rj2ear83u

Hi,
I am trying to develop generalized algorithm for code instrumentation. This should be driven through rule based machanism, means based on the rules, it should instrument/inject the code. Rules can be add/modify/remove in future.

hi!!!!!!1
i am working on a program to convert .ini file into .xml files. is dis called a parser?
actually i am working on the program which was earlier written by sumbody else and now i hav to finish it. there are lot of errors in it.could anybody help me out plz.
thanx

Your task involves parsing, namely you must
parse the ini file and read parts of it
into an internal datastructure which then can
be used as basis for the output (the xml-file).
In most cases parsing an ini file is trivial and
can be done with e.g. a regular expression parser.
You can in any case apply the method and software described
here to parse an ini file. The method described here
allows to parse arbitrary complex input.

If you have difficulty with the program already written,
it may be better to start afresh,
because in most cases such a task can be structured
very clearly so that understanding the code
should not make any problem.
Best Regards
Martin

I just found your article and I noticed your reference to my work. Thank you very much. Incidentally I had never heard of PEG's nor Bryan Ford. I developed the YARD parser based on some of the ideas of the Spirit parser by Joel de Guzman. Great article, keep up the good work!

Hi Christopher,
I think the ideas which were put into Bryan Ford's
PEG are nothing else than a condensed form of quite old
parsing techniques. So old and well known that many people using them, like Joel de Guzman, fail to explain them.
By the way, I'm a frequent reader of your heron blog and find many refreshing ideas on your sites and in the design documents of heron.
My personal interests go in a similar direction to yours, namely what may be viable new programming paradigms and features. I came to the conclusion that multiparadigm programming should be the future, but that no single language can offer you all that is needed. I believe the future will be extensible languages. C++ with its template construct gives only a glimpse of what may be coming. The solution will be in the direction pointed out by Todd Veldhuizen in his doctoral thesis on active libraries (http://osl.iu.edu/~tveldhui/papers/). The ultimate language will only define a kernel of constructs and provide mechanisms to build complex new abstractions based on this kernel. The future library writer will be able to define how his library fits into the rest of the language, how it is compiled into the kernel language, which compile checks have to be done and what the debugger should show to the user.
The only problem with this vision is that it is some years away from realisation.
Best Regards and thanks for all your contributions and thoughts about computing and languages.
Martin Holzherr

Hi,
according to other commercial products like
EditPadPro (which automatically "hyperlinks" character sequences conforming to the email format), test@test.com.br is a legal email address.
The sample parser provided with this article is therefore right to give the answer success on this input
>> test@.test.com.br| : success
is therefore just what one would expect.
Best Regards
Martin Holzherr

This is not a guarantee that it is correct. EditPadPro is not bug-free, nor the best parser.
Try this : _@...........................com or a%%%%@%c%.om
EditPadPro hyperlinks the text.
This is an limitation about internal parser only.

Does anybody know where I can find comparisons between the different parsing methods (top-down: recursive descent; bottom-up: LR, SLR, LALR etc.) considering speed.
How big are the differences? Is it possible to write a hand crafted LR parser?

Hi,
As far as I remember, parsing speeds are not that different
between the various parsing methods.
If you look into the standard books about compilers,
the reason of the authors to prefer one method over the other
are generally not speed but the 'powerfulness' of the grammar,
which means that some parsing methods can handle more complex
grammars than others. e.g. a pure LL(1) recursive descent grammar
(does not backtrack, only looks one symbol ahead), has
even difficulty to parse the Pascal programming languages.
LR and LALR grammars are thought to be quite powerful.
Parsing methods like LL(k) or LL(*) with arbitrary lookahead
are even more powerful, but in earlier days such methods were
rejected, because backtracking was deemed as too complex and too
expensive in term of space and time.

You can write a hand crafted LR parser. When doing so,
you probably will first manually compute some parsing tables
and then allocate them statically somewhere in your program or
you can hardcode them using a jungle of gotos.
But hand crafted LR parsers are only practical for toy grammars.

Back to the performance question: I remember a report about
a parser generator generating LL(1) recursive parsers with
performance as main goal. The speed of this parser was impressive.
This is not very astonishing, because the less information has
to be saved from the current input and the simpler the
decisions to choose between alternatives, the faster the
parser will be. In the extreme case an LL(1) recursive parser just
looks at the next character (or the next symbol) and then jumps
to the appropriate alternative.
I will look into books and articles in the next days to find some numbers and figures.
Best Regards
Martin

I really appreciate articles on this subject. It's tricky business that most intermediate programmers need help with, in order to get moving. At least I did! Anyway I wanted to mention that using C or C++ to teach a subject like this makes it really hard to learn. Half the code in C and C++ has nothing to do with your algorithm, it's just memory management or baroque string manipulation stuff. It distracts from what you are trying to teach. I'm not saying don't write apps in C or C++, just that it may not be the best choice for teaching a subject like parsers and compilers.

There is a open source Basic interpreter that might be a good reference to use for this article. http://www.scriptbasic.com I host a peer support site for ScriptBasic that has a message board and wiki.

When I took a class in this subject years and years ago, we had to be very careful of the grammar definition we used, such as the 'left factoring' mentioned in the 'Points of Interest'. In keeping track of where you've been (a partial form of 'state' keeping, is it?), it appears that you can 'have it both ways' - allowing use of a semi-LL parser form and having a grammar that does not have to be left-factored or otherwise transformed into something not easily human-readable. On the Bryan Ford presentation linked to by this article, he refers to this as the 'Pragmatic Method'. It is too bad that it took so long for programmers to actually be pragmatic in this area (the class I mentioned was in the early 80's).

A good point. So many parsing methods were invented since the 80's (and before), but many of them were indeed too sophisticated or relied too much on heavy tools (like generators).
There is a deeper truth behind this history. A good abstraction in computer science must meet many goals, e.g. should be:
-powerful (covering many real cases)
-simple to understand, supporting thinking about it.
-efficient in terms of an implementation.

An abstraction which is meaningful in mathematical terms
might not be successful in practice.
Martin

First, regular expression grammars certainly can describe tree hierarchies, just not recursive ones. Most regular expression recognizers don't create those trees because regular expressions usually aren't used where such trees are needed. But anytime you use nested grouping constructs in a regular expression you're implicitly describing a tree.

The PEG versus regular expression email example is wrong in a couple of ways. The PEG grammar doesn't define the same language as the regular expression. For example, "x@..xx" is recognized by the PEG grammar but not the regular expression grammar. You also say that PEGs do less backtracking, but for this particular example, the regular expression doesn't require any backtracking, and the only reason the PEG does is that it was contrived that way; it's trivial to rewrite it to avoid backtracking. In fact, the difference between them isn't backtracking, but just that PEGs can describe some context free languages that can't be described by regular expressions.

The wording for this example is a bit confusing because you first display a complete PEG grammar, and then in the text you introduce a rule, "EMail: EMailChar+ '@' EMailChar+ EmailSuffix", which isn't part of the grammar. It would help if you mentioned that this rule was an alternative rule and not an example of applying the first grammar.

Ok Gerry, the chapter PEG versus regular expression will be updated.
Your remarks are all well placed.
Only one disagreement. "x@..xx" is recognized by the regular expression grammar. Not only according to my opinion, but also in the regular expression parser available under http://www.roblocher.com/technotes/regexp.aspx
using [A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}
as Pattern and x@..xx as Test String
Martin