Edit: I realise that part of my problem is that I don't have a clear definition of it, which makes the question of whether it is detectable hard to answer.

I'm therefore already happy with any reference in which this particular problem is discussed at all. I haven't found any such reference myself, and with any luck I can derive a good definition of my problem from it, which will hopefully lead to a solution.

Original:

Suppose we have this lexical definition:

X := a*
Y := a

and this context-free grammar:

S := X Y

and we use any lexer and parser combination to generate a recogniser for this language, in which the lexer uses the 'maximal munch' or 'longest match first' rule.

The specification might seem to be trivially equivalent to the regular expression $a^+$, but it isn't: in fact, it recognises no strings at all. The reason is that, because of 'maximal munch', $X$ 'eats' all the $a$ characters present in the input, leaving none for $Y$ to consume, so the parser always rejects the input.
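A minimal illustration of the failure, with Python's greedy regex engine standing in for a maximal-munch lexer (a sketch, not a real lexer/parser pipeline):

```python
import re

# A maximal-munch lexer matches X := a* greedily, so on any input of
# a's it consumes everything, and Y := a finds nothing to match.
def recognise(s):
    x = re.match('a*', s)              # X eats as many a's as possible
    rest = s[x.end():]                 # whatever X left over
    return re.fullmatch('a', rest) is not None   # Y needs exactly one 'a'

print(recognise('a'), recognise('aaa'))   # False False
```

Every input consisting of $a$'s is rejected, even though $a^+$ "should" be the language.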

I'd like to know if it is decidable if such a problem is present in a given lexer and parser specification.

Note that this is decidable if only a single token is the culprit. Let $L_T$ be the language generated by some token (= regular expression) $T$, and let $L_R$ be the 'follow' language of this token, that is, the language of all strings that may follow the occurrence of $T$ in the specification. In the example above, $T = X = a^*$ and $R = Y = a$, though in general $R$ will be a lot more complicated.

The problem can then be formulated as (where $+\!\!\!\!+\,$ denotes concatenation):

$(L_T +\!\!\!\!+\, L_R) \cap L_T \neq \emptyset$

If this intersection is nonempty, then $T$ will eat up the match made by $R$. As $L_T$ is regular and $L_R$ is context-free, the intersection $(L_T +\!\!\!\!+\, L_R) \cap L_T$ is again context-free, and emptiness of a context-free language is decidable. (A better, less stringent rule might be that $L_T +\!\!\!\!+\, L_R$ is not a subset of $L_T$; I'm not sure.)
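For the example above, the check can be sketched with bounded enumeration standing in for a proper automaton construction (the actual decision procedure would build automata for $L_T$ and $L_R$ and test emptiness of their product; the bound `max_len` is an assumption for illustration):

```python
from itertools import product
import re

def lang(regex, alphabet='a', max_len=5):
    """All strings over `alphabet` up to `max_len` exactly matching
    `regex` -- a finite stand-in for the full language."""
    pat = re.compile(regex)
    return {''.join(w) for n in range(max_len + 1)
            for w in product(alphabet, repeat=n)
            if pat.fullmatch(''.join(w))}

L_T = lang('a*')                       # T = X = a*
L_R = lang('a')                        # R = Y = a
concat = {u + v for u in L_T for v in L_R if len(u + v) <= 5}
print(bool(concat & L_T))              # True: T can eat R's match
```

The intersection is nonempty, flagging the $a^* \, a$ specification as problematic.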

Unfortunately, grammars like $a\, b\, a^*\, b^*\, a\, b$ (split into 6 tokens) do not accept the string $abab$: under maximal munch, $a^*$ and $b^*$ eat the second $ab$, starving the final $a\, b$. Here no single token is the culprit, so the above method doesn't work.

Searching the web turned up nothing, not even that anyone else has ever noticed this problem, although I might just be using the wrong keywords. This surprised me somewhat, so chances are I'm either wrong or this is never a problem in practice.

I stumbled upon the above problem when toying with modularised parsing, but it isn't at all specific to modularised parsing (though it can more easily become a problem there if someone forgets to declare whitespace somewhere, in which case I'd like to warn the user, hence the question above).

2 Answers

To answer in the large, the fact that the syntax of programming languages is not context-free is not news (see e.g. Floyd, 1962).

To answer more precisely, in a context of scannerless parsing like yours, a way to implement maximal munch is to employ so-called follow restrictions (van den Brand et al., 2002) by forbidding some language to follow a given rule. In your example, you could write a restriction X -/- a forbidding an $a$ after the token $X$. Provided your forbidden languages are regular, these restrictions can be compiled back into the grammar (however in van den Brand et al.'s formalism, the forbidden language can be context-free and $X$ can be any nonterminal, and this clearly leads to an undecidable emptiness problem).
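As a toy rendering of how such a follow restriction interacts with maximal munch (this is only a sketch; the actual mechanism in van den Brand et al. filters reductions in the SGLR parse table):

```python
import re

def allowed_ends(pattern, s, pos, forbidden_next):
    """All end positions where `pattern` may stop matching at `pos`,
    honouring a follow restriction: the match must not be immediately
    followed by a character in `forbidden_next`."""
    pat = re.compile(pattern)
    ends = set()
    for endpos in range(pos, len(s) + 1):
        m = pat.match(s, pos, endpos)
        if m and m.end() == endpos:            # a match stopping exactly here
            if endpos == len(s) or s[endpos] not in forbidden_next:
                ends.add(endpos)
    return ends

s = 'aaa'
# X := a* with restriction X -/- a: only the maximal match survives,
# which is exactly the maximal-munch behaviour.
print(allowed_ends('a*', s, 0, 'a'))   # {3}
print(allowed_ends('a*', s, 0, ''))    # {0, 1, 2, 3} without the restriction
```

The restriction prunes every non-maximal stopping point, which is why compiling restrictions back into the grammar reduces the question to ordinary emptiness checking.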

Another formalism for scannerless parsing is that of parsing expression grammars (Ford, 2004), which has a greedy, maximal-munch type semantics, and an undecidable emptiness problem.

Now, it doesn't look like your specific maximal munch semantics would allow you to reduce from these undecidable problems. For a start, it seems to me that, rather than emptiness of the generated language, a tool should more broadly warn the user about any case where a token might "eat" a (non-empty) prefix of its follow language, i.e. whenever $(L_T\cdot\mathrm{Pref}_+(L_R))\cap L_T\neq\emptyset$, regardless of whether this prefix is mandatory or not. This would capture your $aba^\ast b^\ast ab$ example if I understand correctly how you would tokenize it.

To conclude, here is an attempt to formalize your notion of maximal-munch-caused emptiness: let $\langle N,T,P,S\rangle$ be a context-free grammar with nonterminal alphabet $N$, terminal alphabet $T$, production set $P$ and axiom $S$. Each terminal symbol $X\in T$ is associated with a regular language $L_X\subseteq\Sigma^\ast$ used for its tokenization.

For every occurrence of a terminal symbol $X$ in some production $A\to \alpha X\beta$ of your grammar, where $\alpha,\beta$ are sequences of mixed terminals and nonterminals, construct the follow language $L_{\beta,A}$ of this particular occurrence (this is a context-free language) and consider the residual language of the token $L^{\text{max-munch}}_X=(L_X^{-1}\cdot L_X)\cap\Sigma^+$, which is a regular language of strings that will be "eaten up" by the maximal munch semantics. In fact, the language $L^{\text{max-munch}}_X$ is the language one would put in the follow restriction for $X$.

Then, if $$L_{\beta,A}\subseteq L^{\text{max-munch}}_X\cdot\Sigma^\ast\;,$$ or equivalently $L_{\beta,A}\cap (\Sigma^\ast\backslash(L^{\text{max-munch}}_X\cdot\Sigma^\ast))=\emptyset$, any string allowed to follow this occurrence of $X$ will be "eaten", and the language of the rule $A\to\alpha X\beta$ is empty. Using the classical algorithm for emptiness checking with this extra twist for handling terminal symbols might solve your emptiness problem.
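On the question's first example, the formalization can be checked with bounded enumeration standing in for the automaton constructions (the bound `N` and the enumeration are assumptions for illustration; the real test would use the regular/context-free machinery described above):

```python
from itertools import product
import re

SIGMA, N = 'a', 6

def lang(regex):
    """Strings over SIGMA up to length N exactly matching regex
    (a finite stand-in for the full language)."""
    pat = re.compile(regex)
    return {''.join(w) for k in range(N + 1)
            for w in product(SIGMA, repeat=k) if pat.fullmatch(''.join(w))}

L_X = lang('a*')       # token X := a*
L_follow = lang('a')   # follow language of X in S := X Y, i.e. L_Y

# (L_X^{-1} . L_X) restricted to Sigma^+: nonempty suffixes v such that
# u in L_X and uv in L_X for some u -- the strings X would "eat".
max_munch = {w[len(u):] for u in L_X for w in L_X
             if w.startswith(u) and len(w) > len(u)}

# L_follow subset of max_munch . Sigma^*: every allowed follower has a
# prefix that X eats, so the rule S := X Y derives nothing.
empty = all(any(w[:i] in max_munch for i in range(1, len(w) + 1))
            for w in L_follow)
print(empty)   # True
```

Here $L^{\text{max-munch}}_X$ comes out as $a^+$ (up to the bound), every string in the follow language $\{a\}$ starts with something in it, and the rule is correctly flagged as empty.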

Looks like this exactly answers my question, thanks. The only question that remains (and that can most likely only be answered by simply trying it out in practice) is how many false positives you get (and possibly whether the running time of this check is worth it).
– Alex ten Brink, Dec 11 '11 at 13:32

I would use an item-based approach to solve the problem. While generating an FSA, generate the language (a list of items) that reaches each state, and generate a procedure to resolve the longest-match rule. Use tuples and sets to keep track of the parse:

For the first example: a*a+

The initial state is: .a*.a+
The final and only other state is: .a*.a+.

Parsing successive "a"s produces the tuples (on each transition an "a" is simply shifted from the second token to the first):

{(,a)}

{(a,a)}

{(aa,a)}

...
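The trace above can be reproduced mechanically; here is one reading of the transition (an illustrative sketch, not the answer's general transition function):

```python
# A tuple (t, r) records the text claimed by a* and the text still
# required by the trailing token. Under maximal munch each new "a" is
# absorbed by a*, so r is never satisfied.
def step(states, sym):
    return {(t + sym, r) for (t, r) in states}

states = {('', 'a')}
for _ in range(3):
    print(states)
    states = step(states, 'a')
# {('', 'a')}  {('a', 'a')}  {('aa', 'a')}
```

The second component never empties, which is another way of seeing that the specification derives nothing.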

I worked through the second example; there is nothing interesting there other than moving the "." past the "*" operator.

In the usual lex, a transition concatenates a symbol to the token. In general, a transition takes the set of tuples and forms a new set based on the operators in the regular grammar and the symbol associated with the transition. I don't have the time to work out the general form of the transition functions.

In the past I have done symbolic operations, including stripping off the left most character, on grammars with regular operators. The math part is easy but detailed. I no longer have my notes.

I can see that the other issues, operating on the tuples to produce new tuples, follow the same pattern. The production of the procedures associated with transitions can be done at lex-generation time.

I get what you're suggesting, but my problem doesn't lie with parsing regular expressions. My problem is the interplay of regular expression parsing using lexers combined with context-free grammar parsing (with say an LR parser). I don't think you can 'solve' my problem, which is why I'm interested in detecting these situations and also whether this problem has been encountered before.
– Alex ten Brink, Dec 12 '11 at 19:26