September 06, 2006

[Language_] A statistical language model assigns probabilities to the strings in a language. Speech recognition and other language technologies rely on language models to judge the likelihood of an utterance. The field has long been dominated by n-gram language models, which estimate the probability of a word conditional on the previous n-1 words (where n is typically 3). Data sparseness is a big problem: no corpus contains enough samples of all n-word sequences. Here I present some experiments using Google counts as a language model.
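To make the n-gram idea concrete, here is a minimal sketch of a trigram model trained on a whitespace-tokenized toy corpus. This is a generic maximum-likelihood estimate, not the Google-based model of the experiments below; the toy sentence is made up.

```python
from collections import defaultdict

def train_trigrams(tokens):
    # Count trigrams and their bigram prefixes; the ratio gives the
    # maximum-likelihood estimate P(w3 | w1, w2).
    tri = defaultdict(int)
    bi = defaultdict(int)
    for i in range(len(tokens) - 2):
        tri[tuple(tokens[i:i+3])] += 1
        bi[tuple(tokens[i:i+2])] += 1
    return tri, bi

def p_trigram(tri, bi, w1, w2, w3):
    # Returns zero when the bigram prefix was never seen -- exactly the
    # data sparseness problem described above.
    denom = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / denom if denom else 0.0

tokens = "the cat sat on the mat and the cat slept".split()
tri, bi = train_trigrams(tokens)
print(p_trigram(tri, bi, "the", "cat", "sat"))  # 0.5: one of two "the cat" continuations
```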

Some rough estimates to give the reader intuition on the sparseness of data: English has roughly 100,000 unique words. This means the number of unique bigrams is on the order of 10 billion, and the number of unique trigrams is on the order of 10^15. The distributions are far from flat: the most frequent word, "the", occurs approximately 5% of the time. The frequency of a "typical" English noun is in the 1/10,000 - 1/50,000 range, so getting a good sample of its concordance (say, observing it 1000 times) requires a corpus of 50 million words. A typical novel contains 200,000 words. A year of the Wall Street Journal has 20 million. The Gigaword news corpus from LDC contains roughly one billion (as the name suggests).
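The arithmetic behind these estimates is simple; all the numbers below are the ones quoted in the text.

```python
# Back-of-envelope checks of the sparseness estimates.
vocab = 100_000                 # rough count of unique English words
unique_bigrams = vocab ** 2     # on the order of 10 billion
unique_trigrams = vocab ** 3    # on the order of 10^15

# A noun seen once per 50,000 words needs this many words of corpus
# to be observed ~1000 times:
corpus_needed = 1000 * 50_000

print(unique_bigrams, unique_trigrams, f"{corpus_needed:,}")
```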

How many words does Google have? Well, I took some sample words and did some calculations. The columns of the table below give:
1. word
2. bnc frequency in a million words
3. google count in millions of words
4. implied total number of words in google if it had the same ratios as bnc
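Column 4 follows from columns 2 and 3 by a simple proportion: if a word occurs f times per million BNC words and Google reports c million hits, then a Google corpus with BNC ratios would contain c/f * 10^12 words. The word counts below are made-up placeholders, not rows from the actual table.

```python
def implied_google_words(bnc_per_million, google_millions):
    # If Google had the same word ratios as the BNC, its total size
    # would be (google count in words) / (bnc frequency as a fraction):
    # (google_millions * 1e6) / (bnc_per_million / 1e6)
    return google_millions / bnc_per_million * 1e12

# Hypothetical word: 10 occurrences per million in BNC, 100M Google hits.
print(implied_google_words(10.0, 100.0))  # 1e13 words
```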

My first intuition, to use the most frequent words, backfired. See, Google gives document counts, not occurrence counts. That means words that occur multiple times in a document get under-counted. So less frequent words give a more accurate guess (their document and occurrence counts will be close). Looking at the above data, I take the number of English words indexed by Google to be 1e13.

We still need to do smoothing. I used the baseline smoothing described by Chen and Goodman and looked for the optimal parameters on some test data (it was hard to find pieces of text that do not exist in Google). Basically, I searched for coefficients to multiply each k-gram estimate, for k=0...n, that gave me the highest likelihood. The biggest bang for the buck comes from 4-grams: the increase in model likelihood from 3-grams to 4-grams is comparable to the one from 2-grams to 3-grams. Going to 5-grams, not so hot.

Here are the results for the file A2/A20.txt from bnc (132 lines, 1557 words). a0 is multiplied by 1/vocabulary_size, which is assumed to be 1e6. a1 is multiplied by the frequency of the word, which is assumed to be google_count/1e13. a2..a6 are the coefficients of the bigram...six-gram conditional probabilities.
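The interpolated estimate described above can be sketched as a weighted sum of a uniform term (a0), a unigram term (a1), and the conditional k-gram terms (a2..a6). The component probabilities in this sketch are placeholders; in the actual experiment they come from Google counts.

```python
VOCAB_SIZE = 1e6      # assumed vocabulary size
GOOGLE_TOTAL = 1e13   # estimated total words indexed by Google

def interpolate(coeffs, word_google_count, kgram_probs):
    # coeffs = [a0, a1, ..., a6]
    # kgram_probs = [P(w|w-1), P(w|w-2 w-1), ..., P(w|w-5..w-1)]
    p = coeffs[0] / VOCAB_SIZE                       # uniform term
    p += coeffs[1] * (word_google_count / GOOGLE_TOTAL)  # unigram term
    for a, pk in zip(coeffs[2:], kgram_probs):       # bigram..six-gram
        p += a * pk
    return p

# Toy call with made-up coefficients and probabilities:
print(interpolate([1.0, 1.0, 1.0, 0, 0, 0, 0], 1e7, [0.5, 0, 0, 0, 0]))
```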

The resulting bit rate is 8.5 bits per word. The best models reported in Chen and Goodman dip below 7 bits with 10 million words of data and trigrams. The possible explanations are:
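For reference, "bits per word" here is the cross-entropy of the test text under the model: the average negative log2 probability assigned to each word. The probabilities below are illustrative, not from the actual experiment.

```python
import math

def bits_per_word(probs):
    # Average negative log2 probability over the test words.
    return -sum(math.log2(p) for p in probs) / len(probs)

# A model assigning 1/4 to every word costs exactly 2 bits per word:
print(bits_per_word([0.25, 0.25, 0.25, 0.25]))  # 2.0
```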

- I ignored sentence breaks. (The first word of a sentence is conditioned on the last words of the previous sentence.)
- They only used news data. (Their results on the Brown corpus are significantly worse.)
- Google counts are not very accurate. Here are some examples:
  223 cherished memory is
  497 most cherished memory is
  958 author of the legend
  1010 and author of the legend
- Bias due to the document vs. occurrence count issue hits the most frequent words badly. (The frequency of "the" is 0.1% according to my estimate.)
- I did nothing special for numbers, proper nouns, etc. (They are not clear about the treatment of unknown words, but most likely they just relied on 1/vocabulary_size, which they took to be about 50,000.)

Which of these reasons dominates, or whether there is another significant reason, is a matter for future research :)