From this question, I gather that whether unambiguous CF grammars can be parsed in linear time is an open problem. I'd like to know what the major roadblocks to achieving this are. That is, what made the attempts to produce such a parser fail?

2 Answers

There are some observations that one can make which suggest that all 'usual' algorithms cannot be extended in a simple way to parse arbitrary unambiguous context free grammars.

Firstly, note that all 'usual' algorithms proceed from left to right and in a 'walking' fashion: they 'walk' through the input, either moving left or right one character, or doing something like reducing or predicting that 'simplifies' parts of the input into nonterminals. They usually don't 'jump' in the input, that is, skip over a large part of it, nor do they maintain any information aside from packing parts of the input into nonterminals.

Now consider this unambiguous grammar for even palindromes:
$$ \begin{array}{rl}S ::= & a S a \\ | & b S b \\ | & b b \\ | & a a \end{array} $$
Consider how we might parse this with a 'usual' algorithm. None of the 'usual' algorithms look at the length of the input, which means that the end of the input will come as a 'surprise' to them. As a result, such algorithms can only start doing reductions after reading the entire input.

This already hints at the problem: the parser needs to find the 'midpoint' of the palindrome, but no 'usual' algorithm stores such information. In fact, one expects that any 'usual' algorithm will check the rest of the string before deciding on any of the reductions, because it cannot be sure what the rest of the string looks like (it keeps forgetting it), which implies an $O(n^2)$ running time. Consider what $LR(n)$ would do, that is, $LR$ augmented with as many tokens of lookahead as the rest of the string: any reduction it makes needs the rest of the string as lookahead.
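The midpoint problem can be made concrete with a small sketch. The recognizer below is not any of the standard algorithms; it is a hypothetical left-to-right simulation (the name `recognize_even_palindrome` and the comparison counter are assumptions made for illustration) that never looks at the input length while scanning. Since it cannot know where the middle is, it treats every position as a possible midpoint and keeps each guess alive until the following characters contradict it.

```python
def recognize_even_palindrome(text):
    """Recognize S ::= aSa | bSb | aa | bb strictly left to right.

    Returns (accepted, comparisons). The input length is only used
    once the end of the input has been reached.
    """
    live_midpoints = []  # m means: "the first half is text[:m]"
    comparisons = 0
    for i, ch in enumerate(text):
        if ch not in ("a", "b"):
            return False, comparisons
        survivors = []
        for m in live_midpoints:
            mirror = 2 * m - 1 - i  # position that text[i] has to match
            if mirror >= 0:  # otherwise the guess ran past the start
                comparisons += 1
                if text[mirror] == ch:
                    survivors.append(m)
        live_midpoints = survivors
        # Without knowing the length, the midpoint might be right here.
        live_midpoints.append(i + 1)
    # Only now do we learn which guess, if any, was the real midpoint.
    accepted = any(2 * m == len(text) for m in live_midpoints)
    return accepted, comparisons
```

On an input such as `"a" * n` no guess is ever contradicted, so the number of live guesses grows with the position and the total number of comparisons is about $n^2/4$, even though the grammar is unambiguous.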

A more formal way of stating that this grammar needs unbounded lookahead to be parsed is the assertion that the grammar is not $NLR(k)$ for any $k$. It is unknown whether the language itself is not $NLR(k)$ for any $k$, but it is suspected this is not the case. Note also that Earley's algorithm takes $O(n^2)$ time to parse inputs with this grammar.
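The quadratic behaviour of Earley's algorithm on the palindrome grammar can be observed directly. What follows is a minimal sketch of an Earley recognizer, assuming a grammar without $\varepsilon$-rules (which holds here) and making no attempt at optimization; the names `GRAMMAR` and `earley_item_count` are chosen for this illustration only. It returns whether the input is accepted together with the total number of Earley items created, which is a reasonable proxy for the work done.

```python
GRAMMAR = {"S": [("a", "S", "a"), ("b", "S", "b"), ("b", "b"), ("a", "a")]}

def earley_item_count(text, grammar=GRAMMAR, start="S"):
    """Return (accepted, total_items) for an epsilon-free grammar."""
    n = len(text)
    # chart[i] holds items (lhs, rhs, dot, origin) valid after i tokens.
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predict: expand the nonterminal after the dot.
                    new_items = [(symbol, alt, 0, i) for alt in grammar[symbol]]
                else:
                    # Scan: move the dot over a matching terminal.
                    if i < n and text[i] == symbol:
                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                    new_items = []
            else:
                # Complete: advance every item that was waiting for lhs.
                new_items = [
                    (plhs, prhs, pdot + 1, porigin)
                    for plhs, prhs, pdot, porigin in chart[origin]
                    if pdot < len(prhs) and prhs[pdot] == lhs
                ]
            for item in new_items:
                if item not in chart[i]:
                    chart[i].add(item)
                    worklist.append(item)
    accepted = any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[n]
    )
    return accepted, sum(len(items) for items in chart)
```

Calling `earley_item_count("a" * n)` for doubling values of `n` shows the item count growing roughly fourfold each time: every even-length block of `a`s ending at the current position is a completed $S$, so each chart set holds a number of items proportional to its position.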

One might note that the grammar above can easily be parsed by an algorithm for linear grammars or by a parser that first finds out the length of the string; however, the grammar $T ::= S c S$ with $S$ as above is neither linear nor easily parseable even if the algorithm knows the length of the entire string.
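For contrast, here is a sketch of the "find out the length first" approach for the plain palindrome grammar. It is an illustration only (the function name `is_even_palindrome_known_length` is an assumption): once the length is known, the midpoint is fixed and a single pass of comparisons suffices.

```python
def is_even_palindrome_known_length(text):
    """Recognize S ::= aSa | bSb | aa | bb when the length is known."""
    n = len(text)
    # The grammar only derives non-empty strings of even length over {a, b}.
    if n == 0 or n % 2 != 0:
        return False
    if any(ch not in ("a", "b") for ch in text):
        return False
    # The midpoint is n // 2, so each character is compared exactly once.
    return all(text[i] == text[n - 1 - i] for i in range(n // 2))
```

This runs in $O(n)$ time, but it depends on 'jumping' to the far end of the input, which is exactly what the 'usual' algorithms do not do. For $T ::= S c S$ the total length alone no longer determines where each palindrome's midpoint lies.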

This suggests that any algorithm that is able to handle any unambiguous language must be quite different from what we normally use to parse context free grammars.

Which attempts? The most honest answer might be that the roadblock is that nobody has figured out how to do it.

However, one may also point to the fact that simply knowing that the grammar in question is unambiguous provides very little of a handle on its shape that can be used to start parsing. Remember that it is undecidable whether a grammar is unambiguous, which intuitively tells us that there exist unambiguous grammars with arbitrarily weird and convoluted structures.

As an illustration, suppose the top of the input grammar had
$$ \begin{array}{rl}S ::= & A_1A_2\ldots A_n \\ | & B_1 B_2 \ldots B_n \end{array} $$
and for each $i$ except one there is some string generated by both $A_i$ and $B_i$ (but the structure of the grammars rooted at $A_i$ and $B_i$ can otherwise be quite different). How would we get started parsing an input for such a grammar? If we are to avoid backtracking, it would seem that in the worst case we'll need to check the part of the input that corresponds to the $j$ where $A_j$ and $B_j$ distinguish the string. But even deciding whether $A_j$ and $B_j$ generate any strings in common is undecidable.

Perhaps the more pertinent question is how a proof that the general problem is superlinear can be so elusive.