10 September 2007

How many well-formed formulas are there?

The following is exercise 1.1.2 of "A Mathematical Introduction to Logic", by Herbert Enderton:

Show that there are no [well-formed formulas] of length 2, 3, or 6, but that any other positive length is possible.

Enderton defines a wff as follows: every sentence symbol An is a wff. If α and β are wffs, then so are (¬ α), (α ∧ β), (α ∨ β), (α → β), and (α ↔ &beta); furthermore, that's all of the wffs. (The parentheses count as symbols. The sentence symbol An counts as a single symbol.)

The problem is pretty straightforward. A, (A ∧ B), and (A ∧ (B ∧ C)) are wffs of length 1, 5, and 9 respectively. If α is a wff of length n, then (¬ α) is a wff of length n+3; thus we can generate wffs with lengths in the following sequences: {1, 4, 7, ...}, {5, 8, 11, ...}, {9, 12, 15, ...}. This is all the integers except 2, 3, and 6.

Now, each of the ways of forming new wffs from old takes one or two wffs, adds them together, and adds three new symbols. So we can't get a wff of length 2 or 3 because it would have to come from a wff of length -1 or 0. Furthermore, there are no wffs of length 6, because they'd either have to be of the form (¬ α) where α has length 3, or (α * β) where * is ∧, ∨, →, or ↔ and α and β have total length 3, which would mean that one of α and β has length 2.

The natural follow-up question, to me at least, is "how many wffs are there"? This is silly to ask because if we have infinitely many sentence symbols, there are of course infinitely many wffs of any length for which there is at least one wff. So let's restrict ourselves to one sentence symbol, A. Let fn be the number of wffs of length n. Let F(z) = Σn fn zn be the generating function of this sequence. Then we have

f(z) = z + z3f(z) + 4z3f(z)2

where the first term on the right-hand side corresponds to the formula which consists of just the sentence symbol, the second term to all those formulas of type (¬ α), and the third term to all those forms of type (α * β). (The coefficient 4 comes up because there are four things that "*" could be.)

which is a bit surprising; the coefficients don't grow "smoothly". This is because, for example, wffs of length 9 are of the form (A * (A * A)) or ((A * A) * A) where * is as above (and yes, these two forms are different), so there are 32 of these; the only wff of length 10 is (¬ (¬ (¬ A))), since any other such formula would come from two formulas with total length 7, and one of those would have to be of length 2, 3, or 6 which is impossible. But I'm more interested in the asymptotics.

It's a general principle that the coefficients of a generating function grow roughly like fn ~ (1/ρ)n, where ρ is the absolute value of the singularity of f(z) = Sigma;n fn zn; this is true up to a "subexponential factor" which has to do with the nature of that singularity (polar, algebraic, etc.) The singularities of f(z) in this case arise from taking the square root of numbers near 0; thus they are located at the roots of the polynomial 1 - 2z3> +z6 -16z4. The smallest root is about ρ = 0.4728339090, so the number of wffs of length n grows like (1/ρ)n, or about 2.11n, times some subexponential factor. If you take larger coefficients they do have a ratio of about 2.11.

The subexponential factor is probably something like n-3/2 -- that is, the number of wffs of length n is probably about 2.11n n-3/2 -- because that particular subexponential factor often arises when one analyzes the asymptotics of tree-like structures. Don't ask me to explain that.

The interpretation this brings to mind is that if you write a wff by just writing symbols at random, at any given step there are in some sense "on average" 2.11 symbols that you could write down which would keep the wff grammatical. There are eight possible symbols, so this suggests that the best special-purpose compression algorithm for this formal language would shorten formulas to log(2.11)/log(8), or about 36%, of their original length.