Learning regular languages with large alphabets

Algorithm used to produce state machines

Vertices are added iteratively. The process of adding a vertex v is as follows:

Choose a random ambassador node w and form an edge v → w.

Generate two random numbers x and y that are geometrically distributed with means f / (1 – f) and fb / (1 – fb). Node v selects x in-edges and y out-edges of w to nodes that are not yet visited. If there are not enough nodes available, it selects as many as it can.

Node v forms out-edges to the end-points of the selected edges from and to w and applies step (2) recursively for each of those nodes. As the process continues, nodes cannot be revisited.

The adapted version of the above algorithm accepts three additional parameters:

a : an upper limit on the size of the alphabet.

l : the probability that a state loops to itself (which cannot occur in the straightforward Forest-fire algorithm)

p : the probability of parallel edge labels (which also cannot occur in the conventional algorithm).

Moreover, the following rules have been added:

Every time a state is added, it has a 50% chance of being an accepting or rejecting state.

To ensure that each state is reachable, every time a state is added, instead of connecting an edge from the new state to an ambassador state, the reverse edge is added (from the ambassador to the new state).

Every time a new edge is added, it is labelled with an element from the alphabet. The possible alphabet choices are curtailed to ensure that a selected element will not cause the machine to be non-deterministic. If it is impossible to select an element that will produce a deterministic machine, the edge is not added.

At the end of the process, the machine is checked for equivalent states, and these are merged. The number of states n was chosen to produce machines that have about 50 states (though this can vary slightly due to the minimisation process).

For the final competition machines, the following parameters were chosen:

a = {2,5,10,20,50} (see table above)

n = varied according to alphabet size – chosen to ensure that size of machines is about 50 states after minimization.