@CodeInChaos: Nice find. The Wikipedia list is also a good example of how much the source material affects word frequencies. I'm pretty sure that, for example, in almost any other corpus "median" would not be the 122nd most common English word (and, in particular, it would not outrank "average", which is 142nd in the Wikipedia list).
– Ilmari Karonen, Apr 23 '12 at 14:11

1 Answer

One nice way to generate a word list is to distill it from a large body of text.

There are a few interesting effects with this creation method:

The word frequencies depend a lot on the chosen texts. For example, if you feed it Wikipedia, the language is rather formal, and common everyday words are pretty rare. It might be interesting to compare Simple Wikipedia with the normal Wikipedia.

For some texts you also get common misspellings of words. That's nice for some uses, but bad if you want to use the list for spell checking. It should be fine in your case.

I found a nice word list created from the English Wikipedia that lists the words in order of frequency, so you can truncate it if you want fewer words. Unfortunately, it doesn't contain the frequency of each word, but since the source is available, adding this feature shouldn't be hard.