CFG represents a context-free grammar. It holds productions in the prod attribute, a dictionary mapping a symbol to the list of its possible productions. Each production is a tuple of symbols. A symbol can be either a terminal or a nonterminal; the two are distinguished as follows: nonterminals have entries in prod, while terminals do not.

gen_random is a simple recursive algorithm for generating a sentence starting from the given grammar symbol. It randomly selects one of the symbol's productions and iterates over it, recursing into nonterminals and emitting terminals directly.

Here's an example usage of the class with a very simple natural-language grammar:
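
Neither the class code nor the grammar itself survived in this excerpt, so here is a minimal sketch of CFG and gen_random as described above, together with a guessed grammar consistent with the sample output that follows (the exact productions in the original may differ):

```python
import random
from collections import defaultdict

class CFG:
    def __init__(self):
        self.prod = defaultdict(list)

    def add_prod(self, lhs, rhs):
        """Add production alternatives for nonterminal lhs; alternatives
        in rhs are separated by '|', symbols by spaces."""
        for prod in rhs.split('|'):
            self.prod[lhs].append(tuple(prod.split()))

    def gen_random(self, symbol):
        """Generate a random sentence starting from symbol."""
        sentence = ''
        rand_prod = random.choice(self.prod[symbol])
        for sym in rand_prod:
            if sym in self.prod:   # nonterminal: recurse into it
                sentence += self.gen_random(sym)
            else:                  # terminal: emit it directly
                sentence += sym + ' '
        return sentence

# A guessed grammar consistent with the sample sentences below
cfg = CFG()
cfg.add_prod('S', 'NP VP')
cfg.add_prod('NP', 'Det N | Name')
cfg.add_prod('VP', 'V NP')
cfg.add_prod('Det', 'a | the | my | his')
cfg.add_prod('N', 'cat | elephant | suit | jeans')
cfg.add_prod('Name', 'Joe | she | he | I')
cfg.add_prod('V', 'kicked | followed | shot')

print(cfg.gen_random('S'))
```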

the suit kicked my suit
she followed she
she shot a jeans
he shot I
a elephant followed the suit
he followed he
he shot the jeans
his cat kicked his elephant
I followed Joe
a elephant shot Joe

The problem with the simple algorithm

Consider the following grammar:

cfgg = CFG()
cfgg.add_prod('S', 'S S S S | a')

It has a single nonterminal, S, and a single terminal, a. Trying to generate a random sentence from it sometimes results in a RuntimeError exception being thrown: maximum recursion depth exceeded while calling a Python object. Why is that?

Consider what happens when gen_random runs on this grammar. In the first call, it has a 50% chance of selecting the SSSS production and a 50% chance of selecting a. If SSSS is selected, the algorithm recurses into each of the four S symbols. The chance of all four of those calls resulting in a is only 0.5^4 = 0.0625, so there's a 0.9375 chance that at least one will expand into SSSS again. As this process continues, the chances get slimmer and slimmer that the algorithm will ever terminate. This isn't good.
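
One way to make this precise is to view the expansion as a branching process (my framing, not the article's): each S independently becomes a with probability 0.5 or four fresh S's with probability 0.5. The probability q that a single S's expansion eventually terminates is then the smallest solution of q = 0.5 + 0.5·q^4 in [0, 1], which a fixed-point iteration finds quickly:

```python
# Termination probability q of expanding one S satisfies
#   q = P('a') + P('S S S S') * q**4 = 0.5 + 0.5 * q**4
# (smallest fixed point in [0, 1]; found by iterating from 0)
q = 0.0
for _ in range(100):
    q = 0.5 + 0.5 * q ** 4
print(q)  # roughly 0.544, i.e. about 46% of runs never terminate
```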

You may now think that this is a contrived example and real-life grammars are better behaved. Unfortunately, this isn't the case. Consider this (rather ordinary) arithmetic expression grammar:
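
(The grammar listing didn't survive in this excerpt; judging from the discussion of FACTOR and (EXPR) below, it was plausibly along these lines — the nonterminal names come from that discussion, but the exact alternatives are my guess.)

```python
from collections import defaultdict

class CFG:
    """Minimal version of the CFG class described earlier."""
    def __init__(self):
        self.prod = defaultdict(list)

    def add_prod(self, lhs, rhs):
        for prod in rhs.split('|'):
            self.prod[lhs].append(tuple(prod.split()))

cfg = CFG()
cfg.add_prod('EXPR', 'TERM + EXPR | TERM - EXPR | TERM')
cfg.add_prod('TERM', 'FACTOR * TERM | FACTOR / TERM | FACTOR')
cfg.add_prod('FACTOR', 'ID | NUM | ( EXPR )')
cfg.add_prod('ID', 'x | y | z')
cfg.add_prod('NUM', '0 | 1 | 2 | 3')
```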

When I try to generate random sentences from it, fewer than 30% of the runs terminate [2].

The culprit here is the (EXPR) production of FACTOR. An expression can get expanded into several factors, each of which can once again result in a whole new expression. Just a couple of such derivations can be enough for the whole generation process to diverge. And there's no real way to get rid of this, because (EXPR) is an essential derivation of FACTOR, allowing us to parse expressions like 5*(1+x).

Thus, even for real-world grammars, the simple recursive approach is an inadequate solution. [3]

An improved generator: convergence

We can employ a clever trick to make the generator always converge (in the mathematical sense). Think of the grammar as representing an infinite tree:

The bluish nodes represent nonterminals, and the greenish nodes represent possible productions. If we think of the grammar this way, it is obvious that the gen_random method presented earlier is a simple n-ary tree walk.

The idea of the algorithm is to attach weights to each possible production and select the production according to these weights. Once a production is selected, its weight is decreased and passed recursively down the tree. Therefore, once the generator runs into the same nonterminal and considers these productions again, there will be a lower chance for the same recursion to occur. A diagram shows this best:

Note that initially all the productions of expr have the same weight and will be selected with equal probability. Once term-expr is selected, the algorithm takes note of this. When the same choice is presented again, the weight of term-expr is decreased by some factor (in this case, a factor of 2). It can still be selected again, but then for the next round its weight will be 0.25. This only applies within the same tree branch: if term-expr is selected in some other, unrelated branch, its weight is unaffected by this selection.
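
The diagram isn't reproduced here, but the decay it describes is easy to show numerically. With the factor-of-2 decrease assumed above, a production selected n times along the current branch has weight 0.5^n, and its chance of being picked over a competing production whose weight is still 1.0 shrinks accordingly:

```python
decay = 0.5  # the factor-of-2 decrease described above
weights = [decay ** n for n in range(4)]
print(weights)  # [1.0, 0.5, 0.25, 0.125]

# Probability of picking it over a single competing production
# whose weight is still 1.0:
probs = [round(w / (w + 1.0), 3) for w in weights]
print(probs)  # [0.5, 0.333, 0.2, 0.111]
```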

This improvement solves the divergence problem of the naive recursive algorithm. Here's its implementation (it's a method of the same CFG class presented above):

def gen_random_convergent(self,
        symbol,
        cfactor=0.25,
        pcount=defaultdict(int)):
    """ Generate a random sentence from the grammar, starting with
        the given symbol.

        Uses a convergent algorithm - productions that have already
        appeared in the derivation on each branch have a smaller
        chance to be selected.

        cfactor - controls how tight the convergence is.
                  0 < cfactor < 1.0

        pcount is used internally by the recursive calls to pass on
        the productions that have been used in the branch.
    """
    sentence = ''

    # The possible productions of this symbol are weighted
    # by their appearance in the branch that has led to this
    # symbol in the derivation
    #
    weights = []
    for prod in self.prod[symbol]:
        if prod in pcount:
            weights.append(cfactor ** (pcount[prod]))
        else:
            weights.append(1.0)

    rand_prod = self.prod[symbol][weighted_choice(weights)]

    # pcount is a single object (created in the first call to
    # this method) that's being passed around into recursive
    # calls to count how many times productions have been
    # used.
    # Before recursive calls the count is updated, and after
    # the sentence for this call is ready, it is rolled-back
    # to avoid modifying the parent's pcount.
    #
    pcount[rand_prod] += 1

    for sym in rand_prod:
        # for non-terminals, recurse
        if sym in self.prod:
            sentence += self.gen_random_convergent(
                sym,
                cfactor=cfactor,
                pcount=pcount)
        else:
            sentence += sym + ' '

    # backtracking: clear the modification to pcount
    pcount[rand_prod] -= 1

    return sentence
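
The method calls a weighted_choice helper that isn't shown in this excerpt. A minimal implementation consistent with its use above (returning an index chosen with probability proportional to its weight) could be:

```python
import random

def weighted_choice(weights):
    """Return index i with probability weights[i] / sum(weights)."""
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i
    return len(weights) - 1  # guard against floating-point drift
```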

Note the cfactor parameter of the algorithm. This is the convergence factor - the factor by which a production's weight is multiplied each time the production is selected. After being selected N times, the weight becomes cfactor to the power N. I've plotted the average length of the generated sentence from the expression grammar as a function of cfactor:

As expected, the average length grows with cfactor. If we set cfactor to 1.0, this becomes the naive algorithm where all the productions are always of equal weight.
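
The plot itself isn't included in this excerpt, but the trend is easy to reproduce with a self-contained version of the convergent algorithm (a standalone function rather than the article's method, hard-coding the pathological S grammar from earlier):

```python
import random
from collections import defaultdict

# The pathological grammar from earlier: S -> S S S S | a
PROD = {'S': [('S', 'S', 'S', 'S'), ('a',)]}

def weighted_choice(weights):
    # pick index i with probability proportional to weights[i]
    rnd = random.random() * sum(weights)
    for i, w in enumerate(weights):
        rnd -= w
        if rnd < 0:
            return i
    return len(weights) - 1

def gen_convergent(symbol, cfactor, pcount=None):
    if pcount is None:
        pcount = defaultdict(int)
    # weight = cfactor ** (times this production was used on this branch)
    weights = [cfactor ** pcount[p] for p in PROD[symbol]]
    rand_prod = PROD[symbol][weighted_choice(weights)]
    pcount[rand_prod] += 1
    sentence = ''
    for sym in rand_prod:
        if sym in PROD:
            sentence += gen_convergent(sym, cfactor, pcount)
        else:
            sentence += sym + ' '
    pcount[rand_prod] -= 1  # backtrack: don't affect sibling branches
    return sentence

random.seed(42)
for cf in (0.1, 0.25, 0.5):
    lengths = [len(gen_convergent('S', cf).split()) for _ in range(300)]
    print(cf, sum(lengths) / len(lengths))
```

Every run terminates, and the average length should grow with cfactor; with cfactor set to 1.0 the weights never decay and we're back to the naive algorithm.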

Conclusion

While the naive algorithm is suitable for some simplistic cases, for real-world grammars it's inadequate. A generalization that employs weighted selection using a convergence factor provides a much better solution that generates sentences from grammars with guaranteed termination. This is a sound and relatively efficient method that can be used in real-world applications to generate complex random test cases for parsers.

Some algorithms, like this one by Randal Schwartz, assign fixed weights to each production. While this could be used to decrease the chances of divergence, it's not really a good general solution to our problem. It does, however, work great for simple, non-recursive grammars like the one presented in his article.