Friday, February 15, 2008

Fun with aspell word lists

The GPL spell checking program aspell has support for many languages, including my native language, Dutch. Since it's open source, so are the dictionaries, which means that we should be able to extract a word list. And complete word lists for a language are simply fun to play with.

Unfortunately aspell is too advanced to use a plain text word list. But there is a way to dump it:

aspell dump master

This will print the entire word list for your default language. You can specify the language used with -l:

aspell -l nl dump master

The argument to -l is the ISO 639 language code (see man aspell for details). The argument master tells aspell to use the systemwide dictionary, not your personal wordlist. The dictionary must be installed on your system; on Ubuntu the Dutch language package is called aspell-nl.

When we run aspell dump master for Dutch we get something unexpected:

blaat/MWPG
bloeit/KU
bloot/G
blootte
blote/N

There are strange tags attached to the end of many words. These are affix flags, and they represent variations of that word. (Although there is an English affix file, no affix tags are printed if we dump an English dictionary.) We can expand the affix tags into all possible variations by sending the dump through aspell expand, which prints each word together with its variations on a single line.

If we now pipe this through tr we get all variations on separate lines as well. Thus the final command to get a word list for any aspell-supported language becomes:

aspell -l nl dump master | aspell -l nl expand | tr ' ' '\n'

(Note that this breaks for words that originally contained spaces. The Dutch word list does not have these, though.)
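The tr step can also be done in a few lines of Python, which makes it easy to drop duplicates along the way. This is only a sketch: the function name split_expanded is made up for illustration, and it assumes, as the pipeline above does, that aspell expand prints the word forms on each line separated by spaces. The sample lines are written by hand, not real aspell output.

```python
def split_expanded(lines):
    """Split lines of `aspell expand` output into a sorted list of unique words.

    Assumes the word forms on each line are separated by whitespace, so
    (like the tr version) it breaks on entries that contain spaces themselves.
    """
    words = set()
    for line in lines:
        words.update(line.split())
    return sorted(words)


# Hand-written sample input, only to show the shape of the data.
sample = ["bloot gebloot", "blootte", "bloot"]
for word in split_expanded(sample):
    print(word)
```

To use it on the real thing, feed it sys.stdin instead of the sample list and put the script at the end of the pipe in place of tr.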

Then I got interested in how these affixes work. Take blaat (to bleat) for example; it is followed by M, W, P and G. The affix definition file /usr/lib/aspell/nl_affix.dat contains a block of lines for each of these characters that defines its meaning.

So what does this all mean? SFX and PFX stand for suffix (ending) and prefix (beginning). The first line of each block gives the number of lines in the rest of the block; the Y or N before the number indicates whether the suffix may be combined with prefixes or vice versa.

The line SFX M 0 ten t simply creates the plural past form blaatten, by appending -ten if the original word ends in t. We also see suffixes for other cases in which consonant doubling is required.

The next suffix SFX W is more interesting. This one takes care of various conjugations, including the past tense and the past participle (‘voltooid deelwoord’). We see that -te is appended when the word ends in k, f, s, t or p, and -de otherwise. Any Dutch person will immediately recognize this as the dreaded ‘kofschipregel’ that is the cause of so many spelling errors. In this case it gives rise to the word forms blaatte and blaatten.
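The -te versus -de choice is simple enough to write down directly. The sketch below mirrors only the condition as described above (ends in k, f, s, t or p); the function name add_te_or_de is made up, and the real SFX W block contains more lines than this one decision.

```python
# Endings that, according to the description above, take -te instead of -de.
TE_ENDINGS = ("k", "f", "s", "t", "p")


def add_te_or_de(word):
    """Append -te if the word ends in one of TE_ENDINGS, and -de otherwise."""
    return word + ("te" if word.endswith(TE_ENDINGS) else "de")


print(add_te_or_de("blaat"))
print(add_te_or_de("speel"))
```

For blaat this gives blaatte, the form mentioned above; speel (from spelen, to play) is my own example of a word that gets -de.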

The suffix SFX P at ten aat takes care of the infinitive. Note that the double a has been replaced by a single one; the pronunciation remains identical. (Dutch works in mysterious ways…) This replacement is done by the third field on the line, which indicates the text to strip off the end of the word; so far, it has been 0, which means to strip off nothing. (The SFX M z is the exception; as far as I can tell, it is not used anywhere in the aspell dictionary.) We take blaat, which matches the condition -aat, so we strip off -at and stick -ten in its place, resulting in blaten.

Finally, we have a prefix rule PFX G, which sticks ge- before anything, leading to the form geblaat (bleating, as in “the bleating of the sheep”).
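Putting the fields together, applying a single rule line comes down to: check the condition, strip, append. Here is a minimal sketch of that, using the two SFX lines quoted above. The function names are made up, and treating the condition as a regular expression anchored at the end of the word is my assumption about how the field is meant to be read, not something taken from the aspell sources. For the prefix I only know that it sticks ge- in front, so that part is reduced to plain concatenation.

```python
import re


def apply_suffix(word, strip, add, condition):
    """Apply one SFX rule line; return the new word, or None if it does not apply.

    strip:     text to remove from the end of the word ("0" means nothing)
    add:       text to append after stripping
    condition: pattern that the end of the original word must match
    """
    # Assumption: the condition behaves like a regex anchored at the word end.
    if not re.search("(?:" + condition + ")$", word):
        return None
    if strip != "0":
        if not word.endswith(strip):
            return None
        word = word[: len(word) - len(strip)]
    return word + add


def apply_prefix(word, add):
    """Stick a prefix in front of the word (no stripping, no condition)."""
    return add + word


print(apply_suffix("blaat", "0", "ten", "t"))     # SFX M 0 ten t
print(apply_suffix("blaat", "at", "ten", "aat"))  # SFX P at ten aat
print(apply_prefix("blaat", "ge"))                # PFX G
```

This prints blaatten, blaten and geblaat, the same forms derived by hand above. A rule whose condition does not match simply returns None, so expanding a word means trying every line in the blocks named by its flags and keeping whatever comes back.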

The file nl_affix.dat also contains a list of general replace rules (for example, replacing g by ch and vice versa) and specific ones (kado by cadeau). These rules are used when suggesting possible corrections for a misspelled word.

Thanks. This is really useful. I'm now some way toward generating a list of two-letter words for playing Scrabble in Welsh. Now to work out how to get the computer to count the seven digraphs (ch, dd, ff, ng, ll, rh, th) as single letters...

Thanks Thomas! This was really helpful for me (I needed a way to load an English dictionary onto my Android phone that I got in China, and did it with a combination of aspell, a text file, and an app to add it to the user dictionary).

Before, I kept on seeing the /q etc suffixes on everything and didn't know how to convert them into a workable file.

Hello. I found your blog while looking for information about exporting the word list from hunspell/aspell and ispell. I was trying to convert the file american-huge (ispell) to aspell, so that I could dump it with aspell from the command line, but I wasn't able to find out how.

Do you know where I can get the file for aspell or, better, how I can do it myself?
