(* The purpose of this algorithm is to find, for each pair of a state [s]
   and a terminal symbol [z] such that looking at [z] in state [s] causes
   an error, a minimal path (starting in some initial state) that actually
   triggers this error. *)

(* This is potentially useful for grammar designers who wish to better
   understand the properties of their grammar, or who wish to produce a
   list of all possible syntax errors (or, at least, one syntax error in
   each automaton state where an error may occur). *)

(* The problem seems rather tricky. One might think that it suffices to
   compute shortest paths in the automaton, and to use [Analysis.minimal] to
   replace each non-terminal symbol in a path with a minimal word that this
   symbol generates. One can indeed do so, but this yields only a lower bound
   on the actual shortest path to the error at [s, z]. Indeed, several
   difficulties arise, including the fact that reductions are subject to a
   lookahead hypothesis; the fact that some states have a default reduction,
   hence will never trigger an error; the fact that conflict resolution
   removes some (shift or reduce) actions, hence may suppress the shortest
   path. *)

(* We explicitly choose to ignore the [error] token. Thus, we disregard any
   reductions or transitions that take place when the lookahead symbol is
   [error]. As a result, any state whose incoming symbol is [error] is found
   unreachable. It would be too complicated to have to create a first error
   in order to be able to take certain transitions or drop certain parts of
   the input. *)

(* We never work with the terminal symbol [#] either. This symbol never
   appears in the maps returned by [Lr1.transitions] and [Lr1.reductions].
   Thus, in principle, we work with ``real'' terminal symbols only. However,
   we encode [any] as [#] -- see below. *)

module Run (X : sig

  (* If [verbose] is set, produce various messages on [stderr]. *)
  val verbose: bool

  (* If [statistics] is defined, it is interpreted as the name of
     a file to which one line of statistics is appended. *)
  val statistics: string option

end) = struct

(* Build a module that represents words as (hash-consed) strings. Note:
   this functor application has a side effect (it allocates memory, and
   more importantly, it may fail). *)

module W = Terminal.Word(struct end)
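
(* Illustration only. This is a minimal sketch of what hash-consing words
   could look like; it is not the implementation of [Terminal.Word]. The
   module [Sketch_word] and all of its functions are hypothetical, and
   symbols are modelled as plain integers. Each distinct word receives a
   unique integer code, so comparing two words amounts to comparing two
   integers. *)

module Sketch_word = struct

  (* A word is represented by its integer code. *)
  type word = int

  (* Two tables: symbol list to code, and code back to symbol list. *)
  let table : (int list, word) Hashtbl.t = Hashtbl.create 17
  let reverse : (word, int list) Hashtbl.t = Hashtbl.create 17
  let next = ref 0

  (* [intern w] returns the unique code of [w], allocating one if needed. *)
  let intern (w : int list) : word =
    match Hashtbl.find_opt table w with
    | Some code ->
        code
    | None ->
        let code = !next in
        incr next;
        Hashtbl.add table w code;
        Hashtbl.add reverse code w;
        code

  let epsilon = intern []

  (* [elements code] recovers the symbols of an interned word. *)
  let elements (code : word) : int list = Hashtbl.find reverse code

  (* Concatenation re-interns its result, so codes remain unique. *)
  let append w1 w2 = intern (elements w1 @ elements w2)

  let length w = List.length (elements w)

end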

(* We begin with a number of auxiliary functions that provide information about the LR(1) automaton. These functions could perhaps be moved to a separate module. We keep them here, for the moment, because they are not used anywhere else. *)

(* [can_reduce s prod] indicates whether state [s] is able to reduce
   production [prod] (either as a default reduction, or as a normal
   reduction). *)

let can_reduce s prod =
  match Invariant.has_default_reduction s with
  | Some (prod', _) when prod = prod' ->
      true
  | _ ->

(* [foreach_terminal_not_causing_an_error s f] applies the function [f] to every terminal symbol [z] such that [causes_an_error s z] is false. This could be implemented in a naive manner using [foreach_terminal] and

(* [reduction_path_exists s w prod] tests whether the path determined by the
   sequence of symbols [w] out of the state [s] exists in the automaton and
   leads to a state where [prod] can be reduced. It further requires [w] to
   not contain the [error] token. Finally, if it sees the [error] token, it
   sets the flag [grammar_uses_error]. *)

let grammar_uses_error =
  ref false

let rec reduction_path_exists s (w : Symbol.t list) prod : bool =
  match w with
  | [] ->
      can_reduce s prod
  | (Symbol.T t) :: _ when Terminal.equal t Terminal.error ->
      grammar_uses_error := true;
      false
  | a :: w ->
      match SymbolMap.find a (Lr1.transitions s) with
      | s ->
          reduction_path_exists s w prod
      | exception Not_found ->
          false

(* Suppose [s] is a state that carries an outgoing edge labeled with a non-terminal symbol [nt]. We are interested in finding out how this edge can be taken. In order to do that, we must determine how, by starting in [s], one can follow a path that corresponds to (the right-hand side of) a production [prod] associated with [nt]. There are in general several such productions. The paths that they determine in the automaton form a "star". We represent the star rooted at [s] as a trie. For every state [s], the star rooted at [s] is constructed in advance, before the algorithm runs. While the algorithm runs, a point in the trie (that is, a sub-trie) tells

(* [star s] creates a (new) trie whose source is [s], populated with its branches. (There is one branch for every production [prod] associated with every non-terminal symbol [nt] for which [s] carries an outgoing

(* Every (sub-)trie has a unique identity. (One can think of it as its
   address.) [compare] compares the identity of two tries. This can be
   used, e.g., to set up a map whose keys are tries. *)
val compare: trie -> trie -> int

(* [source t] returns the source state of the (sub-)trie [t]. This is
   the root of the star of which [t] is a sub-trie. In other words, this
   tells us "where we come from". *)

(* [accepts prod t] tells whether the current state of the trie [t] is
   the end of a branch associated with production [prod]. If so, this
   means that we have successfully followed a path that corresponds to
   the right-hand side of production [prod]. *)
val accepts: Production.index -> trie -> bool

(* Since every (sub-)trie has a unique identity, its identity can serve
   as a unique integer code for this (sub-)trie. We allow this conversion,
   both ways. This mechanism is used only as a way of saving space in the
   encoding of facts. *)
val encode: trie -> int
val decode: int -> trie

(* The productions that we can reduce in the current state. In other words, if this list is nonempty, then the current state is the end of one (or several) branches. It can nonetheless have children. *)

(* We keep a mapping of integer identities to tries. Whenever a new
   identity is assigned, this mapping must be updated. *)

let tries =
  let s : Lr1.node = Obj.magic () in (* yes, this hurts *)
  let dummy = {
    identity = -1;
    source = s;
    current = s;
    productions = [];
    transitions = SymbolMap.empty
  } in
  MenhirLib.InfiniteArray.make dummy

(* [insert t w prod] updates the trie (in place) by adding a new branch,
   corresponding to the sequence of symbols [w], and ending with a reduction
   of production [prod]. We assume [reduction_path_exists t.current w prod]
   holds, so we need not worry about this being a dead branch, and we can
   use destructive updates without having to set up an undo mechanism. *)

(* Check whether the path [w] leads to a state where [prod] can be
   reduced. If not, then some transition or reduction action must have
   been suppressed by conflict resolution; or the path [w] involves the
   [error] token. In that case, the branch is dead, and is not added.
   This test is superfluous (i.e., it would be OK to add a dead branch)
   but allows us to build a slightly smaller star in some cases. *)
if reduction_path_exists t.current w prod then
  insert t w prod

(* A trie [t] is nontrivial if it has at least one branch, i.e., contains at
   least one sub-trie whose [productions] field is nonempty. Trivia: a trie
   of size greater than 1 is necessarily nontrivial, but the converse is not
   true: a nontrivial trie can have size 1. (This occurs if all productions
   have zero length.) *)

let trivial t =
  t.productions = [] && SymbolMap.is_empty t.transitions

[position] (that is, a sub-trie), a [word], and a [lookahead] assumption. Such a fact means that this [position] can be reached, from the source state [Trie.source position], by consuming [word], under the assumption that the next input symbol is [lookahead]. *)

(* The lookahead symbol fits in 8 bits. *)

(* In the largest grammars that we have seen, the number of unique words is
   about 3.10^5, so a word should fit in about 19 bits (2^19 = 524288). In
   the largest grammars that we have seen, the total star size is about
   64000, so a trie should fit in about 17 bits (2^17 = 131072). *)

(* On a 64-bit machine, we have ample space in a 63-bit word! We allocate 30
   bits for [word] and the rest (i.e., 25 bits) for [position]. *)

(* On a 32-bit machine, we are a bit more cramped! In Menhir's own
   fancy-parser, the number of terminal symbols is 27, the number of unique
   words is 566, and the total star size is 546. We allocate 12 bits for
   [word] and 11 bits for [position]. This is better than refusing to work
   altogether, but still not great. A more satisfactory approach might be to
   revert to heap allocation of facts when in 32-bit mode, but that would
   make the code somewhat ugly. *)

let w_lookahead =
  8

let w_word =
  if Sys.word_size < 64 then 12 else 30

let w_position =
  Sys.word_size - 1 - (w_word + w_lookahead) (* 25, on a 64-bit machine *)
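
(* Illustration only. The following sketch shows one way in which three
   fields of the widths discussed above could be packed into a single
   OCaml integer and extracted again. The module [Sketch_pack], the field
   order (position in the high bits, lookahead in the low bits), and the
   function names are assumptions made for this example; the encoding
   actually used by this file may differ. The widths are local copies, so
   that the sketch stands on its own. *)

module Sketch_pack = struct

  let w_lookahead = 8
  let w_word = if Sys.word_size < 64 then 12 else 30
  let w_position = Sys.word_size - 1 - (w_word + w_lookahead)

  (* [mask width] is an integer whose [width] low bits are set. *)
  let mask width = (1 lsl width) - 1

  (* [pack position word lookahead] checks that every component fits in
     its field, then combines the three components by shifting. *)
  let pack position word lookahead =
    assert (0 <= position && position <= mask w_position);
    assert (0 <= word && word <= mask w_word);
    assert (0 <= lookahead && lookahead <= mask w_lookahead);
    (position lsl (w_word + w_lookahead))
    lor (word lsl w_lookahead)
    lor lookahead

  (* The accessors use a logical shift, so that a position which occupies
     the topmost bit is still recovered as a nonnegative integer. *)
  let position fact = fact lsr (w_word + w_lookahead)
  let word fact = (fact lsr w_lookahead) land mask w_word
  let lookahead fact = fact land mask w_lookahead

end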

This guarantees that we discover shortest paths. (We never insert into the
queue a fact whose priority is less than the priority of the last fact
extracted out of the queue.) *)

(* [LowIntegerPriorityQueue] offers very efficient operations (essentially
   constant time, for a small constant). It exploits the fact that priorities
   are low nonnegative integers. *)

module Q = LowIntegerPriorityQueue

let q =

| None ->
    ()
| Some position ->
    (* ...then insert an initial fact into the priority queue. *)
    (* In order to respect invariants 1 and 2, we must distinguish two
       cases. If [s] is solid, then we insert a single fact, whose
       lookahead assumption is [any]. Otherwise, we must insert one
       initial fact for every terminal symbol [z] that does not cause
       an error in state [s]. *)
    let word = W.epsilon in

For every triple of [position], [a], and [z], we store at most one fact (whose word has minimal length). Indeed, we are not interested in keeping track of several words that produce the same effect. Only the shortest such word is of interest.

Thus, the total number of facts accumulated by the algorithm is at most [T.n^2], where [T] is the total size of the tries that we have constructed, and [n] is the number of terminal symbols. (This number can be quite large. [T] can be in the tens of thousands, and [n] can be over one hundred. These figures lead to a theoretical upper bound of 100M. In practice, for T=25K and n=108, we observe that the algorithm gathers about 7M facts.) *)
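
(* Illustration only. The bound [T.n^2] stated above can be evaluated
   directly. The module [Sketch_bound] and the function [max_facts] are
   hypothetical names introduced for this example. With T = 10000 and
   n = 100, the formula gives 100M; with the figures T = 25K and n = 108
   quoted above, it gives 291.6M, so the 7M facts observed in practice are
   a small fraction of the theoretical maximum. *)

module Sketch_bound = struct

  (* [max_facts ~t ~n] is the upper bound [t * n^2] on the number of facts,
     where [t] is the total trie size and [n] the number of terminals. *)
  let max_facts ~t ~n = t * n * n

end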

(* We need to query the set of facts in two ways. In [register], we must test whether a proposed triple of [position], [a], [z] already appears in the set. In [query], we must find all facts that match a pair [current, z], where [current] is a state. (Note that [position] determines [current], but the converse is not true: a position contains more information besides the current state.) To address these needs, we use a two-level table. The first level is a matrix indexed by [current] and [z]. At the second level, we find sets of facts, where two facts are considered equal if they have the same triple of [position], [a], and [z]. In fact, we know at this level that all facts

(* Compare the two positions first. This can be done without going
   through [Trie.decode], by directly comparing the two integer
   identities. *)
let c = Pervasives.compare (identity fact1) (identity fact2) in
assert (c = Trie.compare (position fact1) (position fact2));

let m = table.(i) in
(* We crucially rely on the fact that [M.add] guarantees not to
   change the set if an ``equal'' fact already exists. Thus, a
   later, longer path is ignored in favor of an earlier, shorter
   path. *)
let m' = M.add fact m in
m != m' && begin
  incr count;
  table.(i) <- m';
  true
end
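
(* Illustration only. This sketch isolates the idiom used above: detecting
   whether an insertion changed a set by comparing the old and new sets
   with physical equality. It assumes a set built with the standard
   library functor [Set.Make], whose [add] is documented (since OCaml
   4.03) to return its argument unchanged when an equal element is
   already present. Whether the module [M] of this file is such a set is
   not visible here. The module [Sketch_register], the record type [fact],
   and its fields are hypothetical. *)

module Sketch_register = struct

  (* A simplified fact: only [key] takes part in the comparison, so two
     facts with the same key and different words are considered equal. *)
  type fact = { key : int; word : string }

  module M = Set.Make (struct
    type t = fact
    let compare f1 f2 = compare f1.key f2.key
  end)

  (* [register set fact] adds [fact] to [!set] unless an equal fact is
     already there. It returns [true] if and only if the set has grown.
     The first fact registered under a given key is therefore kept. *)
  let register (set : M.t ref) (fact : fact) : bool =
    let m = !set in
    let m' = M.add fact m in
    m != m' && begin
      set := m';
      true
    end

end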

assert (not (Terminal.equal z any));
(* If the state [current] is solid then the facts that concern it are
   stored in the column [any], and all of them are compatible with [z].
   Otherwise, they are stored in all columns except [any], and only
   those stored in the column [z] are compatible with [z]. *)

It maintains a set of quadruples [s, nt, w, z], where such a quadruple means that in the state [s], the outgoing edge labeled [nt] can be taken by consuming the word [w], under the assumption that the next symbol is [z]. Again, the terminal symbol [a], given by [W.first w z], plays a role. For each quadruple [s, nt, a, z], we store at most one quadruple [s, nt, w, z]. Thus, internally, we maintain a mapping of [s, nt, a, z] to [w]. For greater simplicity, we do not allow [z] to be [any] in [register] or [query]. Allowing it would complicate things significantly, it seems. *)

such that, in state [s], the outgoing edge labeled [nt] can be taken by consuming the word [w], under the assumption that the next symbol is [z], and the first symbol of the word [w.z] is [a]. The symbol [a] can be [any].

The function [foreach] can be either [foreach_terminal] or of the form
[foreach_terminal_not_causing_an_error _]. It limits the symbols [z] that
are considered. *)

val query: Lr1.node -> Nonterminal.t -> Terminal.t ->
           (* foreach: *) ((Terminal.t -> unit) -> unit) ->

(* At a high level, we must implement a mapping of [s, nt, a, z] to [w]. In
   practice, we can implement this specification using any combination of
   arrays, hash tables, balanced binary trees, and perfect hashing (i.e.,
   packing several of [s], [nt], [a], [z] in one word). Here, we choose to
   use an array, indexed by [s], of hash tables, indexed by a key that packs
   [nt], [a], and [z] in one word. According to a quick experiment, the
   final population of the hash table [table.(index s)] seems to be roughly
   [Terminal.n * Trie.size s]. We note that using an initial capacity of 0
   and relying on the hash table's resizing mechanism has a significant
   cost, which is why we try to guess a good initial capacity. *)
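
(* Illustration only. The following sketch shows the general shape of the
   data structure described above: an array, indexed by state, of hash
   tables whose keys pack three small integers into one. The module
   [Sketch_edges], the 8-bit field width, the use of plain integers for
   [s], [nt], [a], [z], and the use of strings for words are all
   assumptions made for this example. *)

module Sketch_edges = struct

  (* Hypothetical field width for [a] and [z]. *)
  let bits = 8

  (* [pack nt a z] packs the three components into one integer key. *)
  let pack nt a z =
    assert (0 <= nt);
    assert (0 <= a && a < 1 lsl bits);
    assert (0 <= z && z < 1 lsl bits);
    (nt lsl (2 * bits)) lor (a lsl bits) lor z

  (* One hash table per state, created with a guessed initial capacity so
     as to limit the cost of resizing. *)
  let make ~states ~capacity : (int, string) Hashtbl.t array =
    Array.init states (fun _ -> Hashtbl.create capacity)

  (* [register table s nt a z w] records [w] unless the key [s, nt, a, z]
     already has an entry. It returns [true] if an entry was added. *)
  let register table s nt a z w =
    let key = pack nt a z in
    if Hashtbl.mem table.(s) key then
      false
    else begin
      Hashtbl.add table.(s) key w;
      true
    end

  (* [query table s nt a z] looks up the word stored for this key. *)
  let query table s nt a z =
    Hashtbl.find_opt table.(s) (pack nt a z)

end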

We can limit ourselves to symbols that do not cause an error in state
[s]. Those that do certainly do not have an entry; see the assertion
in [register] above. *)
foreach_terminal_not_causing_an_error s (fun a ->

(* [new_edge s nt w z] is invoked when we discover that in the state [s], the outgoing edge labeled [nt] can be taken by consuming the word [w], under the assumption that the next symbol is [z]. We check whether this quadruple already exists in the set [E]. If not, then we add it, and we compute its consequences, in the form of new facts, which we insert into the priority queue for later examination. *)

Hence, this state is not solid. In order to satisfy invariant 2, we must create a fact whose lookahead assumption is not [any]. That's fine, since our lookahead assumption is [z]. In order to satisfy invariant 1, we must check that [z] does not cause an error in this state. *)

(* Throughout this rather long function, there is just one [fact]. Let's name its components right now, so as to avoid accessing them several times. (That could be costly, as it requires decoding the fact.) *)

(* The state [target] is solid, i.e., its incoming symbol is terminal. This state is always entered without consideration for the next lookahead symbol. Thus, we can use [any] as the lookahead assumption in the new fact that we produce. If we did not have [any], we would have to produce one fact for every possible lookahead symbol. *)

(* We need to know how this nonterminal edge can be taken. We query [E] for a word [w] that allows us to take this edge. In general, the answer depends on the terminal symbol [z] that comes *after* this word: we try all such symbols. We must make sure that the first symbol of the word [w.z] satisfies the lookahead assumption