"ng2010" <ng2010@att.invalid> wrote in message
> What elements of C++ make it so hard to parse? Is it a weakness of
> compiler designs rather than a weakness of the language design? I've read
> somewhere that the language requires potentially infinite look ahead.
> Why? And how do compilers handle it?
> [It's ambiguous syntax. Others can doubtless fill in the details. -John]

Well, there's "hard to parse" and then there's "hard to handle".

As John has observed, C++ has an ambiguous grammar.

I really don't like the phrase "hard to parse", because it's relative
to your parsing technology. GLR parsers are capable of parsing C++
easily in spite of the ambiguous grammar.

People who insist that C++ is hard to parse are generally those who
insist on using LL or LALR parser generators. Their problem is the
classic one of looking for one's keys under the lamppost.

DMS's parser generator accepts a grammar definition that is very close
to what's in the standard, with some deviations: a few exist because even
the standard grammar isn't always convenient, but most are there because
our parsers handle a wide variety of C++ dialects (ANSI, GCC, Microsoft,
...) and contain extensions to support those dialects, too
(including what we call "dialect conditionals" [you'll see one below
in an attribute grammar rule example]).

One convenient property of DMS's parser generator is that it will *tell*
you where (some) ambiguities are. [Some say this computation is
impossible; in the abstract they are correct: you can't compute *all*
possible ambiguities for an arbitrary grammar. But you can compute
*some* by doing a depth-limited search over abstract sequences of
tokens for nonterminals.]

Here's the output for DMS's GCC3 dialect of C++, for a pretty shallow search:

A fair number of these ambiguities are induced by people's classic
interest in overloading semantics on syntax, mostly by defining
identifiers in the grammar that carry an implied type.

You can attempt to resolve ambiguities while parsing (to avoid the need
to capture multiple ambiguous parses) by feeding the parser type
information as it runs. When you do that, you tangle parsing with
semantic information collection, and you get the classic mess that we
see in most real C and C++ parsers.

By using a parser that simply captures the ambiguities during parsing
(such as GLR), you can avoid that awful tangle completely and thus get
a parser directly from the grammar. The raw syntax for DMS's GCC3 C++
parser is **2318** lines according to wc.

What you get out of a GLR parser is an abstract syntax DAG, with subtrees
for separate derivations of various nonterminals, Ambiguity nodes where
multiple parses can occur, and subtree sharing under the ambiguity nodes.

Now we get to the distinction between "hard to parse" and "hard to
handle". C++ has a complex type system, and arguably an even more
complex scoping mechanism that requires rather ugly lookup rules. Much
of the C++ reference manual is devoted to explaining the interactions
between identifiers and lookups.

With DMS, we encode "name and type" resolution using an attribute
grammar (AG), which you can think of as a functional program coded in
terms of syntax rules. The AG defines how information is passed up and
down instances of parse trees, and mostly what is passed are symbol
table scopes and identifier lists. Our AGs are augmented by procedural
code that actually builds, inserts into, or inspects specific scopes
and scope links.

Our AGs also have a nice property to support handling ambiguities:
if an AG rule, when executed, declares an error, then the subtree in
which it triggers is simply deleted, from that point upward to any
parent ambiguity node. Voila, inconsistent interpretations simply
vanish from the tree, and any remaining tree is consistent. So our C++
name resolver simply does name resolution on all the variant
interpretations of each ambiguous rule, and wrong interpretations
simply vanish. What's left is the "parse" tree you really wanted.

The AG-decoration of the grammar rule "statement = expression_statement" is:

To give you a sense of the difference in complexity between the grammar
and the full name and type resolver, the AG for C++ is **281295** lines.
The comparison with the grammar size isn't quite fair, as the full AG
handles *all* the dialects of C++ and I only counted the size of the
grammar rules specific to GCC3.

But a two-orders-of-magnitude difference between handling the basic
semantics and handling the syntax, I think, clearly makes the point:
the real trouble isn't that "C++ is hard to parse", it's that "C++ is
hard to handle" correctly.

If you believe that writing any kind of code/specification takes time
linear in its size, this suggests that what will hurt you by far the
most isn't the parsing, but rather the name and type resolution for a
language like C++.

Other languages aren't as complex IMHO, but have much the same
differential between "syntax" and "semantics". So what this suggests is
that building a parser is easy, and building the support to understand
the language is a lot harder. Most people don't seem to understand
this; I continually hear "if I just had a parser for X I could...".

Beyond name and type resolution, if you really want to manipulate
programs, lie control and data flow analyses, and more. These aren't
small either.

The argument for a tool like DMS is that the community in general can't
afford to build all this standard machinery, and by doing it once that
cost can be amortized.

(One can argue that cost has been amortized by GCC, but if you want to build
a general purpose program analysis and transformation tool, GCC isn't the
answer by a long shot).