Introduction

The basic algorithm for spell checking is simple: for each word in the text, check if it's in your dictionary. If it is not, report a misspelling. The dictionary is usually implemented with a hash table. Handling affixes and capital letters adds some complexity, but the program is still straightforward.

Suggesting the correct word is harder. The usual approach is to check all variations of the word obtained by inserting a letter, replacing a letter with another, skipping a letter, or transposing adjacent letters. For the misspelled word "teh", these variations include eh, th, te (a letter skipped), eth, the (adjacent letters transposed), tea, tec, ted, and so on (a letter replaced), and ateh, teah, tech, teha, and so on (a letter inserted).

The program looks up each variation in the dictionary and reports the ones that are found (in this case, tea, tech, eh, and the). The most probable spelling suggestions are then shown to the user. This approach is used in Ispell and in Peter Norvig's well-known spelling corrector written in Python.
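As an illustration, the candidate-generation step can be sketched as follows. This is not the code of Ispell or the Python corrector; the function name and the 32-character word limit are arbitrary simplifications. A real spellchecker would look up each variant in the dictionary as it is produced.

```c
#include <string.h>

/* Generate every variant of `word` at edit distance 1 (deletion,
   transposition, replacement, insertion) into `out`.
   Returns the number of variants generated. */
static int edits1(const char *word, char out[][32], int max)
{
    int n = (int)strlen(word), count = 0;
    if (n + 2 > 32) return 0;

    /* a letter skipped */
    for (int i = 0; i < n && count < max; i++) {
        memcpy(out[count], word, i);
        strcpy(out[count] + i, word + i + 1);
        count++;
    }
    /* adjacent letters transposed */
    for (int i = 0; i + 1 < n && count < max; i++) {
        strcpy(out[count], word);
        out[count][i] = word[i + 1];
        out[count][i + 1] = word[i];
        count++;
    }
    /* a letter replaced */
    for (int i = 0; i < n; i++)
        for (char c = 'a'; c <= 'z' && count < max; c++) {
            if (c == word[i]) continue;
            strcpy(out[count], word);
            out[count][i] = c;
            count++;
        }
    /* a letter inserted */
    for (int i = 0; i <= n; i++)
        for (char c = 'a'; c <= 'z' && count < max; c++) {
            memcpy(out[count], word, i);
            out[count][i] = c;
            strcpy(out[count] + i + 1, word + i);
            count++;
        }
    return count;
}
```

For a 3-letter word like "teh", this already produces 184 candidates (3 deletions, 2 transpositions, 75 replacements, and 104 insertions), which hints at why this brute-force approach is slow.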

This article outlines a new solution to the problem based on ternary DAGs.

Ternary Search Tries

In his book Algorithms in C, Robert Sedgewick describes a data structure that is well suited to finding similar words: the ternary search trie (TST).

Each node of the tree contains three links: the middle one leads to the tree for the next character; the left and right links lead to the nodes of the characters that are smaller and larger than the current one. Each level of the TST is essentially a BST (binary search tree) for searching the character in the current position. When the character is found, the program follows its middle link, which leads to the tree of all words starting with the already scanned characters.

The figure shows a TST for the words red, read, reading, and reads. The null links are not shown. Because one word may be a prefix of another (e.g., read and reading), a special node is needed to denote the end of a word. A node with a null character conveniently serves this purpose (shown as an empty circle in the figure). The word characters are stored in the nodes, so there is no need to keep the strings themselves in memory, as with a hash table.
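A minimal sketch of the node structure and the insertion and exact-search routines, loosely following Bentley and Sedgewick (the names and layout are illustrative, not those of the attached programs):

```c
#include <stdlib.h>

/* One TST node: a character plus three links, as described above. */
typedef struct Tnode {
    char splitchar;
    struct Tnode *lokid, *eqkid, *hikid;
} Tnode;

/* Recursive insertion. The terminating '\0' of the C string
   becomes the end-of-word node. */
static Tnode *insert(Tnode *p, const char *s)
{
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        p->splitchar = *s;
    }
    if (*s < p->splitchar)
        p->lokid = insert(p->lokid, s);
    else if (*s > p->splitchar)
        p->hikid = insert(p->hikid, s);
    else if (*s != '\0')
        p->eqkid = insert(p->eqkid, s + 1);
    return p;
}

/* Iterative exact search: follow the BST links until the current
   character is found, then follow the middle link. */
static int contains(const Tnode *p, const char *s)
{
    while (p) {
        if (*s < p->splitchar)
            p = p->lokid;
        else if (*s > p->splitchar)
            p = p->hikid;
        else {
            if (*s == '\0')
                return 1;          /* reached the end-of-word node */
            s++;
            p = p->eqkid;
        }
    }
    return 0;
}
```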

You can find more details on TSTs in Sedgewick's book and in his Dr. Dobb's Journal article co-authored with Jon Bentley.

When the dictionary is stored as a TST, you can traverse the adjacent nodes and quickly find spelling suggestions for a given word. Only the words that are actually in the TST are checked. For example, when looking for suggestions for "readimg", you don't need to check all of readiAg, readiBg, readiCg, etc. Only the beginnings of existing words (readiLy, readiEr, readiNess, readiNg, etc.) are examined.

Stemming

It would be interesting to try a TST for spelling correction, but a full TST is just too large: a medium-sized English dictionary (100,000 words) requires 320,000 nodes, which take around 5 MB.

Many nodes are occupied by affixes: -s, -ed, -ing, un-, etc. Spellcheckers based on hash tables usually strip affixes before looking up a word, so that the hash table contains only the root words.

Designing a stemming engine that can be customized for any language is a challenging task. Ispell follows this path; you can find the inflection rules of different languages in the .aff files of Ispell dictionaries.

Language                  Number of rules
Estonian                      9249
Finnish (medium-sized)        8861
Polish                        5454
Spanish                       3274
Czech                         2544
Russian                       1077
Danish                         380
German                         361
Brazilian Portuguese           271
Italian                        210
Dutch                          116
English                         45

More rules do not necessarily mean a more complex morphology: some dictionary authors keep inflected variants in the word list, while others describe them with additional affix rules.

Ternary DAGs

When using a TST, we can take another approach. Note that a TST "compresses" equal prefixes; e.g., in the figure above, read, reading, and reads share the nodes for the first four characters. What if we compress the suffixes, too? The resulting data structure has only 55,000 nodes instead of 320,000. Let's call it a ternary DAG (a ternary directed acyclic graph).

Miyamoto et al. used this data structure for pattern matching, but it is not limited to storing the suffixes of a single word. An arbitrary set of words can be stored in the DAG:

The DAG contains the words red, read, reading, reads, and listen, listening, listens. The middle links of N in listen and D in read lead to the same subgraph containing the suffixes -ing, -s, and a zero-length suffix.

Building a ternary DAG

The DAG can be built from a TST. Let's make a hash table of the TST subtrees (a node and all its children are hashed). When identical subtrees are found, replace all links to them with a link to a single subtree.
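This compression step can be sketched as hash-consing: compress each node's children first, then look the node up in a hash table of already-seen subtrees. The node layout, hash function, and table size below are illustrative choices, not those of the attached create_dictionary.c.

```c
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical TST node layout. */
typedef struct Tnode {
    char splitchar;
    struct Tnode *lokid, *eqkid, *hikid;
} Tnode;

static Tnode *insert(Tnode *p, const char *s)
{
    if (!p) { p = calloc(1, sizeof *p); p->splitchar = *s; }
    if (*s < p->splitchar)      p->lokid = insert(p->lokid, s);
    else if (*s > p->splitchar) p->hikid = insert(p->hikid, s);
    else if (*s != '\0')        p->eqkid = insert(p->eqkid, s + 1);
    return p;
}

/* Hash table of already-seen subtrees, keyed by a bottom-up hash. */
#define NBUCKETS 65536
typedef struct Entry { uint64_t hash; Tnode *node; struct Entry *next; } Entry;
static Entry *buckets[NBUCKETS];

/* Because children are interned first, two subtrees are identical
   exactly when their root fields (character and three links) match. */
static int same(const Tnode *a, const Tnode *b)
{
    return a->splitchar == b->splitchar && a->lokid == b->lokid &&
           a->eqkid == b->eqkid && a->hikid == b->hikid;
}

/* Post-order hash-consing: compress the children, hash the node,
   and either reuse an identical subtree or register this one. */
static Tnode *compress(Tnode *p, uint64_t *out_hash)
{
    uint64_t hl, he, hh;
    if (!p) { *out_hash = 0; return NULL; }
    p->lokid = compress(p->lokid, &hl);
    p->eqkid = compress(p->eqkid, &he);
    p->hikid = compress(p->hikid, &hh);
    uint64_t h = (uint8_t)p->splitchar + 1;
    h = h * 1000003u ^ hl;
    h = h * 1000003u ^ he;
    h = h * 1000003u ^ hh;
    for (Entry *e = buckets[h % NBUCKETS]; e; e = e->next)
        if (e->hash == h && same(e->node, p)) {
            free(p);                  /* share the existing subtree */
            *out_hash = h;
            return e->node;
        }
    Entry *e = malloc(sizeof *e);
    e->hash = h; e->node = p;
    e->next = buckets[h % NBUCKETS];
    buckets[h % NBUCKETS] = e;
    *out_hash = h;
    return p;
}
```

After compress() runs, words such as heads and locks share a single subgraph for the common ending -s.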

By using this procedure, you can automatically find common endings. Affix stripping is not necessary, which has multiple advantages:

Spelling suggestion search in the TST is simplified when all inflected forms are stored in the dictionary.

The program can find more common endings than a human who compiles an affix list. The affix list for English usually contains -ed, -ing, and -s. The program will also find -ment in agreement and increment, -logist in biologist and psychologist, etc. It can even find "suffixes" that make no sense from a linguistic viewpoint, e.g., -ead in head and read, but this is not a problem for spellchecking.

The program can find not only single common endings, but also whole subtrees that include multiple endings. In the figure above, there is a subtree that includes -ing, -s, and a zero-length suffix.

A found subtree may itself be part of another subtree. Given the words lock, locking, head, heads, headed, heading, and news, the program compresses the TST into the following DAG (the roots are shown as a single node to save space):

As mentioned above, each level of the TST is a BST. A word list is usually sorted alphabetically, so inserting it in order degenerates the BSTs into linked lists. You can apply a standard balancing algorithm to fix this.
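One simple way to sketch the rebalancing: collect the siblings of each level in order, relink them into a perfectly balanced BST, and recurse into the middle links. This is illustrative code; create_dictionary.c may do it differently.

```c
#include <stdlib.h>

typedef struct Tnode {
    char splitchar;
    struct Tnode *lokid, *eqkid, *hikid;
} Tnode;

static Tnode *insert(Tnode *p, const char *s)
{
    if (!p) { p = calloc(1, sizeof *p); p->splitchar = *s; }
    if (*s < p->splitchar)      p->lokid = insert(p->lokid, s);
    else if (*s > p->splitchar) p->hikid = insert(p->hikid, s);
    else if (*s != '\0')        p->eqkid = insert(p->eqkid, s + 1);
    return p;
}

/* In-order walk of one level's BST; a level holds at most 256 nodes
   because the characters on a level are distinct. */
static void collect(Tnode *p, Tnode **out, int *n)
{
    if (!p) return;
    collect(p->lokid, out, n);
    out[(*n)++] = p;
    collect(p->hikid, out, n);
}

/* Relink a sorted node array as a perfectly balanced BST. */
static Tnode *rebuild(Tnode **a, int lo, int hi)
{
    if (lo > hi) return NULL;
    int mid = (lo + hi) / 2;
    a[mid]->lokid = rebuild(a, lo, mid - 1);
    a[mid]->hikid = rebuild(a, mid + 1, hi);
    return a[mid];
}

/* Balance the BST of one TST level, then recurse into the middle
   links of every node on that level. Returns the new level root. */
static Tnode *balance_level(Tnode *p)
{
    Tnode *nodes[256];
    int n = 0;
    collect(p, nodes, &n);
    if (n == 0) return NULL;
    for (int i = 0; i < n; i++)
        nodes[i]->eqkid = balance_level(nodes[i]->eqkid);
    return rebuild(nodes, 0, n - 1);
}
```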

The attached program (create_dictionary.c) implements these techniques.

Searching

Searching in a ternary DAG is no different from searching in a TST. You can use the recursive algorithm by Sedgewick and Bentley for exact matching.

A fuzzy matching algorithm for the DAG examines adjacent nodes to find spelling suggestions. Common error patterns (an inserted letter, a deleted letter, etc.) can be expressed elegantly using recursion.
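The following sketch illustrates the idea on a plain TST; it works unchanged on the DAG, since the search never modifies the structure. It handles a wrong, missing, or extra letter with a budget of one error. Transpositions and missing spaces, which the attached spellchecker.c supports, are omitted, and all names are illustrative.

```c
#include <stdlib.h>
#include <string.h>

typedef struct Tnode {
    char splitchar;
    struct Tnode *lokid, *eqkid, *hikid;
} Tnode;

static Tnode *insert(Tnode *p, const char *s)
{
    if (!p) { p = calloc(1, sizeof *p); p->splitchar = *s; }
    if (*s < p->splitchar)      p->lokid = insert(p->lokid, s);
    else if (*s > p->splitchar) p->hikid = insert(p->hikid, s);
    else if (*s != '\0')        p->eqkid = insert(p->eqkid, s + 1);
    return p;
}

/* Collected suggestions (a real program would use a callback). */
static char found[32][32];
static int nfound;

static void report(const char *word)
{
    if (nfound < 32)
        strcpy(found[nfound++], word);
}

/* Report every dictionary word reachable from the query with at most
   `edits` single-letter errors. `out` accumulates the word built so
   far. Duplicates are possible and would be filtered in a real program. */
static void suggest(const Tnode *p, const char *s, int edits,
                    char *out, int depth)
{
    if (!p || edits < 0)
        return;
    suggest(p->lokid, s, edits, out, depth);   /* other letters here */
    suggest(p->hikid, s, edits, out, depth);
    if (p->splitchar == '\0') {                /* end of a dictionary word */
        if ((int)strlen(s) <= edits) {         /* leftover letters are "extra" */
            out[depth] = '\0';
            report(out);
        }
        return;
    }
    out[depth] = p->splitchar;
    if (*s == p->splitchar)
        suggest(p->eqkid, s + 1, edits, out, depth + 1);      /* letters match */
    else if (*s != '\0')
        suggest(p->eqkid, s + 1, edits - 1, out, depth + 1);  /* wrong letter */
    suggest(p->eqkid, s, edits - 1, out, depth + 1);          /* missing letter */
    if (*s != '\0')
        suggest(p, s + 1, edits - 1, out, depth);             /* extra letter */
}
```

Note how the pruning happens automatically: a branch is abandoned as soon as the error budget is exhausted, so only the beginnings of existing words are ever visited.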

The attached spelling corrector (spellchecker.c) handles a single missing letter, an extra letter, a wrong letter, transposed letters, and a missing space.

Performance Comparison

The attached benchmark contains a hash table spelling corrector (benchmark.c) similar to the one found in Ispell. It generates all possible variants of the misspelled word and tries to find them in the hash table.

The tests were run on a Pentium M processor; execution time was measured in ticks using the RDTSC instruction. The dictionary was the medium-sized word list from SCOWL (100,000 English words).

Algorithm      Spell checking time    Spelling correction time    Size, MB
                 MSVC       GCC           MSVC        GCC
hash table        900      1400        125'700    167'600           2.2
ternary DAG      1700      1700         81'400    111'200           0.4

Results:

The hash table implementation is faster for spellchecking. This was expected: a balanced tree is usually slower than a hash table for lookups because it makes more memory accesses.

The ternary DAG implementation is faster for spelling correction.

The ternary DAG takes much less memory than the hash table.

No affix stripping is used in the benchmark (all inflected forms are stored in the hash table). In a real-world spell checker, the hash table would contain only the root words, so it would be smaller, but also slower, which means the ternary DAG results are really not so bad.

Instead of keeping the word itself in the hash table, you can store a single bit that indicates whether the word is correct, as described in the book Programming Pearls and in blog posts on Bloom filters. In this case, the hash table can be even smaller than the ternary DAG; however, an incorrect word will be reported as correct whenever its hash value collides with that of a correct word.
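A sketch of this one-bit-per-word scheme; the bit-array size and the hash function are arbitrary illustrative choices, not taken from the article or the attached code:

```c
#include <stdint.h>

/* One bit per hash value instead of storing the words themselves. */
#define NBITS (1u << 20)              /* power of two, 128 KB of bits */
static uint8_t bits[NBITS / 8];

/* djb2-style string hash, reduced to the bit-array size. */
static uint32_t hash(const char *s)
{
    uint32_t h = 5381;
    while (*s)
        h = h * 33 + (uint8_t)*s++;
    return h & (NBITS - 1);
}

static void add_word(const char *w)
{
    uint32_t h = hash(w);
    bits[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* May return 1 for a wrong word whose hash collides with a correct one. */
static int maybe_correct(const char *w)
{
    uint32_t h = hash(w);
    return (bits[h / 8] >> (h % 8)) & 1;
}
```

The false-positive rate depends on how full the bit array is; a Bloom filter reduces it further by setting several bits per word with independent hashes.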

Using the Code

Run the create_dictionary program to generate the dictionary. Include spellchecker.c and spellchecker.h in your project. The spell-checking functions are declared in the header file.
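The exact declarations are not reproduced here; based on the function names used in this article, spellchecker.h presumably declares something along these lines (only ST_find_exact and ST_enum_suggestions are named in the text; the handle type and the open/close function names are my assumptions):

```c
/* Hypothetical sketch of the spellchecker.h interface. */
typedef struct ST_dictionary ST_dictionary;

/* Load a compiled dictionary file produced by create_dictionary. */
ST_dictionary *ST_open_dictionary(const char *filename);

/* Return nonzero if the word is spelled correctly. */
int ST_find_exact(ST_dictionary *dict, const char *word);

/* Call the user-supplied callback for each spelling suggestion. */
void ST_enum_suggestions(ST_dictionary *dict, const char *word,
                         void (*callback)(const char *suggestion, void *ctx),
                         void *ctx);

void ST_close_dictionary(ST_dictionary *dict);
```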

First, open the dictionary, then call ST_find_exact for each word of the text. Splitting the text into words is your program's responsibility. If the user wants to see spelling suggestions, call ST_enum_suggestions to find them. Finally, close the dictionary.

The spellchecker_test.c demo program checks the spelling of the text in the console window.

Future Work and Ideas

Improving the suggestion engine. The quality of the spelling suggestions is far from that of Microsoft Word or Aspell. The program suggests only words that are within an edit distance of 1 from the misspelled word; a better spelling corrector would also check edit distance 2. The suggestions should be sorted from the most probable to the least probable. Sounds-like data and a list of common errors could also be incorporated into the corrector.

Inflection rules support. When using the create_dictionary program, you should provide a word list with all the inflected forms (e.g., address, addresses, address's, addressing, addressed). It's not a problem for English, for which the number of forms is relatively small. For synthetic languages, it may be handier to use a word list with roots only and a separate list of inflection rules, like Ispell does.

Adding words at runtime. The current program compiles the dictionary into ternary DAG format, and then reads from it. Online construction of the ternary DAG is possible, but requires keeping more information in memory (a hash table for TST, and auxiliary counters in the nodes). In this case, a more complex algorithm for adding words should be implemented to use the existing suffixes in DAG and to keep the BSTs balanced.

Conclusion

A ternary DAG is a compact representation of a spell-checking dictionary. It is faster than a hash table for spelling correction, though not for spellchecking. With a ternary DAG, you can also avoid affix stripping, which complicates a spell checker.

About the Author

Peter lives in Siberia, the land of sleeping sun, beautiful mountains, and infinitely deep snow. He recently started a wiki about algorithms and code optimization, where people could share their ideas, learn, and teach others.

Comments and Discussions

One little thing: is hashing of all subgraphs the most efficient way to collapse identical subgraphs into one? I wonder if there is some clever algorithm out there that does it in time proportional to the number of leaves of the original DAG...

I'm aware that the hash-all-subgraphs approach can be made to work in O(number_of_leaves) time provided the hashtable is large enough to cover all practical problem sizes, but "linear" algorithms relying on sufficiently-large hashtables are somehow always slightly unsatisfying to me... Also I'm aware that the DAG compaction is a one-time cost so it's really unimportant how long it takes ... But I'm still curious!