The problem is that a single input string corresponds to more than one
possible parse tree. That is, it is a valid part of the language, but we don't
know what it means for certain!
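
To make "more than one possible parse tree" concrete, here is a minimal sketch that counts the parse trees of a token string. It assumes a deliberately ambiguous toy grammar, e : e '-' e | NUM, which is not the grammar discussed in these notes; the function name count_parses is also just illustrative.

```python
from functools import lru_cache

def count_parses(tokens):
    """Count parse trees for the ambiguous toy grammar  e : e '-' e | NUM ."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways tokens[i:j] can be derived from e.
        if j - i == 1:
            return 1 if tokens[i] != "-" else 0
        total = 0
        # Try every '-' as the top-level operator of this span.
        for k in range(i + 1, j - 1):
            if tokens[k] == "-":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parses(["1", "-", "2", "-", "3"]))
```

Any count above 1 means the string is in the language but its meaning is not pinned down: here 1 - 2 - 3 can be grouped as (1 - 2) - 3 or as 1 - (2 - 3).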

This is a genuine problem with Java and with C, which takes extra work by
compiler-writers to solve: every identifier has to be checked (e.g. by
LEX) to see if it has already appeared in a class or typedef
declaration, in which case it is definitely a type_name; otherwise it is
an ordinary id and can't become a type_name. We would also need
to modify the grammar slightly to make this distinction clear.
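
The identifier check can be sketched as a small hand-written tokeniser that keeps a table of type names. This is only an illustration under stated assumptions: the token kinds (TYPE_NAME, ID), the pre-loaded types int and float, and the simplified handling of typedef and class are all hypothetical, and a real LEX specification would look different.

```python
import re

# One token per match: an identifier/keyword, or any single non-space character.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\S)")

def tokenize(source):
    """Return (kind, text) pairs, classifying identifiers via a type-name table."""
    type_names = {"int", "float"}   # assumed built-in types
    tokens = []
    after_class = False             # the next identifier names a new class
    in_typedef = False              # the last identifier before ';' names a new type
    last_ident = None
    for text in TOKEN_RE.findall(source):
        if text in ("typedef", "class"):
            tokens.append((text.upper(), text))
            after_class = text == "class"
            in_typedef = in_typedef or text == "typedef"
        elif text[0].isalpha() or text[0] == "_":
            if after_class:
                type_names.add(text)
                after_class = False
            kind = "TYPE_NAME" if text in type_names else "ID"
            tokens.append((kind, text))
            last_ident = text
        else:
            if text == ";" and in_typedef and last_ident is not None:
                type_names.add(last_ident)   # typedef ... NAME ;
                in_typedef = False
            tokens.append((text, text))
    return tokens

for kind, text in tokenize("typedef int length; length a; b = a;"):
    print(kind, text)
```

After the typedef has been read, length is reported as a TYPE_NAME, while a and b remain ordinary IDs, so the grammar never has to guess.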

Ambiguous grammars are, by definition, going to be difficult to handle no
matter what tools we use. The assumption made with languages designed for
computers is that we do our best to make them unambiguous. Therefore, we would
normally expect any tools we use, like YACC, only to have to handle
unambiguous grammars. Given that, can they handle any unambiguous
grammar?

Unfortunately, the answer is "no" - there are unambiguous grammars that
tools like YACC and JAVACC can't handle. Luckily, for most
good tools, you are unlikely to come across such a grammar, and if you do, you
can usually modify the grammar to overcome the problems but still recognise
the same language.

Equally unfortunately, there is no general algorithm that decides whether a
grammar is ambiguous or not - the best that can be done is to try to create a
parser, but if the process fails the tool can't tell us whether this is because
the grammar is really ambiguous or just because the grammar is too confusing
for the kind of parser we are trying to make.

How to confuse parsers

The decision that a parser repeatedly makes is: given what it has already read
of the input, and the grammar rules it has already recognised, what grammar
rule comes next? The more input the parser can look at before it has to make a
decision, the more likely it is to be able to avoid confusion and get it
right.

For example, suppose we look at languages where assignment is a particular
kind of statement, rather than an operation that can be embedded in any
expression:

An LL(1) parser trying to compile this language would have difficulties
distinguishing between assignments (e.g. a=x;) and procedure calls, i.e.
functions/methods returning void (e.g. a(x);).
This is because an LL(1) parser has to decide which kind of statement it is
looking at after seeing only 1 symbol (i.e. a), and it isn't until we
see the = or ( that we can tell what is intended. Suppose we
used a more complex algorithm, such as LL(3) - even this couldn't decide
between e.g. a.b=x and a.b(x). In fact, no matter how
far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can
always be confused by a sufficiently complicated target in an
assignment or call.
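
The argument about fixed lookahead can be checked mechanically. The sketch below models an LL(k) decision very crudely, as "do the first k tokens of the two kinds of statement differ?", and assumes targets of the form id.id.id...; the helper names are purely illustrative.

```python
def target(depth):
    """Tokens for a target with `depth` dotted components, e.g. a.b for depth 1."""
    return ["id"] + [".", "id"] * depth

def assignment(depth):
    return target(depth) + ["=", "id", ";"]

def call(depth):
    return target(depth) + ["(", "id", ")", ";"]

def can_decide(k, depth):
    """True if the first k tokens differ between the two kinds of statement."""
    return assignment(depth)[:k] != call(depth)[:k]

def confusing_depth(k):
    """Smallest number of dotted components that defeats k tokens of lookahead."""
    depth = 0
    while can_decide(k, depth):
        depth += 1
    return depth

print(can_decide(1, 0))    # a=x;    versus a(x);    with 1 token of lookahead
print(can_decide(3, 1))    # a.b=x;  versus a.b(x);  with 3 tokens of lookahead
print(confusing_depth(10))
```

Whatever k we pick, confusing_depth(k) always terminates with a target long enough to hide the = or ( from the parser, which is the point made above.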

There are two kinds of solutions - the parser can use a variable amount of
lookahead, as JAVACC can be asked to do, so it reads as far as the
= or ( before making a decision - or we can rewrite the
grammar, by left-factorising it (as mentioned in 5.2.1),
so that the two kinds of statement are merged until we can make the decision:
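
As a rough illustration of the left-factorising idea (not the grammar from these notes), here is a hand-written recursive-descent sketch. It assumes a statement has the shape target rest ';' where target is id ('.' id)* and rest is either '=' id or '(' id ')'; these rule names and the simplified right-hand sides are assumptions.

```python
def parse_statement(tokens):
    """Parse one statement; return (kind, target, value)."""
    pos = 0

    def take(expected=None):
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def ident():
        tok = take()
        if not tok.isidentifier():
            raise SyntaxError(f"expected an identifier, got {tok!r}")
        return tok

    # Common prefix shared by both kinds of statement: id ('.' id)*
    target = [ident()]
    while pos < len(tokens) and tokens[pos] == ".":
        take(".")
        target.append(ident())

    # Only now do we choose, using a single token of lookahead.
    tok = take()
    if tok == "=":
        kind, value = "assign", ident()
    elif tok == "(":
        kind, value = "call", ident()
        take(")")
    else:
        raise SyntaxError(f"expected '=' or '(', got {tok!r}")
    take(";")
    return (kind, ".".join(target), value)

print(parse_statement(["a", ".", "b", "=", "x", ";"]))
print(parse_statement(["a", "(", "x", ")", ";"]))
```

The decision between assignment and call is postponed until the whole target has been consumed, so one token of lookahead is always enough.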

An LR(1) parser has no difficulty dealing with the original grammar, as it
will have read to the end of the statement, and seen the = or (
on the way, before it has to decide whether to recognise an assignment or a
call.

It is possible to construct unambiguous grammars that would confuse any LR(n)
parser (as well as any LL(n) parser) e.g. palindromes - strings that are
their own mirror images, such as abba or abacaba:

The problem is that, although it is perfectly obvious to us what to do - find
the middle, and work out to both ends - LR(n) and LL(n) read strictly
left-to-right, and can only locate the middle of the string by using their
finite lookahead to find the end of the string. This cannot work once the
string is long enough, relative to n, that its end lies beyond the lookahead
(the exact length at which this happens differs for LL(n) and LR(n)).
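
For contrast, the "obvious to us" method is easy to write down once we are allowed to know where the string ends. A minimal sketch (the function name is illustrative):

```python
def is_palindrome(s):
    """Find the middle of s, then compare symbols while working out to both ends."""
    # For odd lengths both indices start on the middle symbol;
    # for even lengths they start on the two symbols either side of the middle.
    left, right = (len(s) - 1) // 2, len(s) // 2
    while left >= 0 and s[left] == s[right]:
        left -= 1
        right += 1
    return left < 0

print(is_palindrome("abba"), is_palindrome("abacaba"), is_palindrome("abca"))
```

Note that the very first step uses len(s), i.e. the position of the end of the input, which is exactly the information a strictly left-to-right parser with fixed lookahead does not have when it needs it.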

Once an ambiguity has been pointed out in a grammar, it is usually clear
enough to the user what the problem is, even if it isn't obvious what to do
about it. However, what kinds of error messages are reported by tools like
YACC, and how easy is it to find the corresponding ambiguity or
confusion?

YACC has no problems with this grammar, but what if we modify it?
It might be nice to be able to write the example above simply as
void fred(int a, b, float x, z).
We could try rewriting the grammar like this:

param : type_name ids
;
ids : id
| ids ',' id
;

but now, YACC reports a shift/reduce conflict, and the details from
the y.output file are:

That is, when the generated parser sees a , after a list of identifiers
in a param, it doesn't know whether that , (and the id it
expects after) is part of the same param (in which case it should shift, to
include them as part of the RHS) or the start of the next param (in which case
it should reduce this RHS and start a new RHS).

This is not ambiguous, just confusing to YACC, as it needs more
lookahead to see if the next few symbols are e.g. , a b (a is a
type_name, b is a parameter name of type a) or , a , or , a )
(a is a parameter name of the current type). The way
to make this clear to YACC is to rewrite the grammar so that it can
see more of the input before having to make a decision:
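
The same idea, seeing more of the input before committing, can be sketched as a hand-written parser. This is not the rewritten YACC grammar, just an illustration that assumes well-formed input, a token list ending in ')' and a lexer that cannot tell type names from other identifiers; parse_params is a hypothetical name.

```python
def parse_params(tokens):
    """Parse  type name {, name | , type name} )  into (type, name) pairs."""
    current_type, name = tokens[0], tokens[1]
    params = [(current_type, name)]
    pos = 2
    while tokens[pos] == ",":
        first, following = tokens[pos + 1], tokens[pos + 2]
        if following in (",", ")"):
            # ", a ," or ", a )": a is another name of the current type.
            params.append((current_type, first))
            pos += 2
        else:
            # ", a b": a is a type name and b is a parameter of that type.
            current_type = first
            params.append((current_type, following))
            pos += 3
    if tokens[pos] != ")":
        raise SyntaxError(f"expected ')', got {tokens[pos]!r}")
    return params

print(parse_params("int a , b , float x , z )".split()))
```

The choice made at each comma depends on the symbol two places beyond it, which is more than the single symbol of lookahead that YACC has at the point of the conflict.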

That is, when it sees id ) it doesn't know whether the id is a
variable giving a value or a type name, so it doesn't know which rule to use
to recognise the id.

Assuming we don't already know what the problem is, this hasn't helped much,
but we can get more information by working back through the states in the
y.output file to try to find how we get here. To do so, we need to look
for states that include shift 8 or goto 8. In this example, all
we find is:

so the input must include ( id ), which
can be recognised either as a type-cast or as an expression.

This is a big hint about the source of the ambiguity in the grammar, but more
by luck than anything else - YACC remains confused even if we make
the grammar unambiguous, by removing the rule sub_exp : '-' sub_exp.
YACC still reports the same reduce/reduce conflict for this modified
grammar, as it is confused by an input as simple as ( a ) - it has to
decide whether this is a value in an expression or a type-cast before it reads
past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or
( a ) - 99 (i.e. the value a - 99).

Luckily, the solution to the general problem of the ambiguity - to somehow
get LEX to distinguish between identifiers that are really type names
(or class names)
and all other identifiers - also solves this confusion for YACC.

Most of the time, an ambiguous grammar results from an error made by the
implementor of a programming language. Sometimes, however, it is the fault of
the language designer. Many languages are defined in such a way that some part
is either inherently ambiguous or confusing (e.g. not LR(1)). Does this
matter? We should not limit language designers to what a particular type of
parser generator can cope with, but on the other hand there is no particular
merit in making a language harder to compile if a small change can simplify
the problem.

An example of this is a well-known problem with conditional
statements: the dangling else. Most imperative languages permit
conditional statements to take two slightly different forms:

if ( ... ) ...
if ( ... ) ... else ...

so the else d in if (a) if (b) c else d
could be associated either with if (a) or with if (b).

Most languages attempt to fix this problem by stating that the second
interpretation is more natural, and so is correct, although some languages have
different rules.
Whatever the language definition, it is an extra rule that anyone learning the
language has to remember.
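
The usual rule, that an else belongs to the nearest unmatched if, is what a straightforward recursive-descent parser does anyway. A minimal sketch, assuming single-identifier conditions and statements and a tuple representation chosen purely for illustration:

```python
def parse(tokens):
    """Parse  stmt : 'if' '(' id ')' stmt [ 'else' stmt ] | id  into nested tuples."""
    pos = 0

    def stmt():
        nonlocal pos
        if tokens[pos] == "if":
            cond = tokens[pos + 2]          # skip 'if' '(' ... ')'
            pos += 4
            then_part = stmt()
            # Greedily take an else if one follows: it binds to this, nearest, if.
            if pos < len(tokens) and tokens[pos] == "else":
                pos += 1
                return ("if", cond, then_part, stmt())
            return ("if", cond, then_part)
        tok = tokens[pos]
        pos += 1
        return tok

    return stmt()

print(parse("if ( a ) if ( b ) c else d".split()))
```

The result nests the else d inside the inner if (b), leaving the outer if (a) without an else part.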

Similarly, the compiler writer has to deal with this special case: if
we use a tool like YACC we get a shift/reduce conflict
- do we shift the else to get if (b) c else d,
or do we reduce the if (b) c as it stands, so we get
if (a) ... else d?
To overcome this problem, we can rewrite the grammar to say explicitly "you
can't have an unmatched then (logically) immediately before an
else - the
then and the else must be paired up":

You will need your version of the calculator for the exercise of
4. Remove the %lefts and see what error messages you
get from YACC. Do you understand how they arise? (Use the
-v flag with YACC to obtain the file y.output.)

Modify the calculator in 3.1 to make the input
format more flexible: create a grammar rule that recognises none, one or
more spaces, and insert it where appropriate in the original grammar. You
are likely to get lots of unexpected shift/reduce and reduce/reduce
conflicts, so run your answer through YACC to check it.

The grammar:
s = 'a' 'b' | 'a' s 'b' ;
will recognise a string of as followed by an equal number of
bs. This grammar has been extended to:
s = 'a' 'b' | 'a' s 'b' | 'a' s {error} | s 'b' {error} ;
to detect errors but this gives shift/reduce and reduce/reduce conflicts
(try it for yourself). Write a conflict-free grammar to detect errors.

The grammar for ANSI C given by Kernighan & Ritchie is in
$CS2121/e*/c_grammar/*. Run it through YACC to discover the
conflicts, and try to work out why they are there and how you could improve
the grammar to remove them.

How would you use LEX and YACC to recognise a
language like SML, where the precedence, associativity and fixity of operators
can be varied?