Introduction

Computers are very good at being precise. Writing code to compare strings is easy and the code will find the smallest differences between those strings.

But what if you want the comparisons to be a little less precise? For example, what if you want to find a string but you're not completely sure about the spelling? In this case, it can be useful to be able to compare strings based on how similar they sound, even though the spelling may be different.

The Soundex Algorithm

Soundex was developed by Robert C. Russell and Margaret K. Odell in 1918. It allows the comparison of names that sound similar but are spelled differently. The rules for Soundex are simple.

The first letter of the word becomes the first character of the Soundex code and is not converted to a number.

Replace consonants with digits as follows (after the first letter):

b, f, p, v => 1

c, g, j, k, q, s, x, z => 2

d, t => 3

l => 4

m, n => 5

r => 6

a, e, i, o, u, y, h, w are not coded

Two adjacent letters with the same number are coded as a single number. Letters with the same number separated by an h or w are also coded as a single number. Letters with the same number separated by a vowel are coded separately.

Continue until you have one letter and three numbers. If you run out of letters, fill in 0s until there are three numbers.

Using this algorithm, both "Robert" and "Rupert" return the same string "R163" while "Rubin" yields "R150". "Ashcraft" and "Ashcroft" both yield "A261".
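The rules above can be sketched in a few lines of C#. This is an illustrative sketch only, not the class from Listing 1: the name SoundexSketch and the helper CodeOf() are hypothetical. It assumes plain ASCII letters, skips any non-letter characters, and assumes (as in the common American variant of Soundex) that the first letter's own digit suppresses an identical digit that follows it directly.

```csharp
using System;
using System.Text;

public static class SoundexSketch
{
    // Maps a letter to its Soundex digit.
    // '-' marks h and w (skipped, do not separate equal digits).
    // '0' marks vowels and y (not coded, but they do separate equal digits).
    private static char CodeOf(char c)
    {
        switch (char.ToUpperInvariant(c))
        {
            case 'B': case 'F': case 'P': case 'V':
                return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z':
                return '2';
            case 'D': case 'T':
                return '3';
            case 'L':
                return '4';
            case 'M': case 'N':
                return '5';
            case 'R':
                return '6';
            case 'H': case 'W':
                return '-';
            default:
                return '0';
        }
    }

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        var result = new StringBuilder(4);
        char last = '0';

        foreach (char c in word)
        {
            if (!char.IsLetter(c)) continue;   // assumption: ignore punctuation, digits, spaces

            char code = CodeOf(c);
            if (result.Length == 0)
            {
                // The first letter is kept as a letter; remember its digit so that
                // an identical digit right after it is not emitted (assumption).
                result.Append(char.ToUpperInvariant(c));
                last = code;
            }
            else if (code != '-')              // h and w leave 'last' untouched
            {
                if (code != '0' && code != last) result.Append(code);
                last = code;
                if (result.Length == 4) break; // one letter and three digits
            }
        }

        // Pad with zeros if the word ran out of letters.
        return result.Length == 0 ? string.Empty : result.ToString().PadRight(4, '0');
    }
}
```

With this sketch, SoundexSketch.Encode("Robert") and SoundexSketch.Encode("Rupert") both return "R163", and SoundexSketch.Encode("Ashcraft") returns "A261" because the h between s and c does not separate the two 2s.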

Listing 1 shows my Soundex class implemented in C#. It contains one public method, Encode(). This is a static method that encodes a string according to the Soundex rules.

The Metaphone Algorithm

The English language is rather complex and inconsistent. While Soundex is useful, it certainly has its deficiencies. In 1990, Lawrence Philips developed the Metaphone algorithm to address some of these deficiencies.

Metaphone codes use the 16 consonant symbols 0BFHJKLMNPRSTWXY. The '0' represents "th", 'X' represents "sh" or "ch", and the others represent their usual English pronunciations. The vowels AEIOU are also used, but only at the beginning of the code. The encoding rules are as follows.

Drop duplicate adjacent letters, except for C.

If the word begins with 'KN', 'GN', 'PN', 'AE', or 'WR', drop the first letter.

Drop 'B' if it follows 'M' at the end of the word.

'C' transforms to 'X' if followed by 'IA' or 'H' (unless, in the latter case, it is part of '-SCH-', in which case it transforms to 'K'). 'C' transforms to 'S' if followed by 'I', 'E', or 'Y'. Otherwise, 'C' transforms to 'K'.

'D' transforms to 'J' if followed by 'GE', 'GY', or 'GI'. Otherwise, 'D' transforms to 'T'.

Drop 'G' if it is followed by 'H' and that 'H' is neither at the end of the word nor before a vowel. Drop 'G' if it is followed by 'N' or 'NED' at the end of the word.

'G' transforms to 'J' if before 'I', 'E', or 'Y', and it is not in 'GG'. Otherwise, 'G' transforms to 'K'. Reduce 'GG' to 'G'.

Drop 'H' if it comes after a vowel and not before a vowel.

'CK' transforms to 'K'.

'PH' transforms to 'F'.

'Q' transforms to 'K'.

'S' transforms to 'X' if followed by 'H', 'IO', or 'IA'.

'T' transforms to 'X' if followed by 'IA' or 'IO'. 'TH' transforms to '0'. Drop 'T' if followed by 'CH'.

'V' transforms to 'F'.

'WH' transforms to 'W' if at the beginning. Drop 'W' if not followed by a vowel.

'X' transforms to 'S' if at the beginning. Otherwise, 'X' transforms to 'KS'.

Drop 'Y' if not followed by a vowel.

'Z' transforms to 'S'.

Drop all vowels except a vowel at the beginning of the word.
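Most of these rules are context-sensitive: what a letter becomes depends on the letters around it. As an illustration only, the following sketch implements just two of the rules, the silent initial letter and the 'C' transformation. It is not the Metaphone class from Listing 2 and is nowhere near a complete encoder; the names MetaphoneRuleSketch, DropSilentInitial(), and EncodeC() are hypothetical, and treating any 'CH' directly preceded by 'S' as the '-SCH-' case is an assumption.

```csharp
using System;

public static class MetaphoneRuleSketch
{
    // Rule: if the word begins with KN, GN, PN, AE, or WR, drop the first letter.
    // Returns the word in upper case so later rules can compare letters directly.
    public static string DropSilentInitial(string word)
    {
        string w = word.ToUpperInvariant();
        foreach (string prefix in new[] { "KN", "GN", "PN", "AE", "WR" })
        {
            if (w.StartsWith(prefix, StringComparison.Ordinal))
                return w.Substring(1);
        }
        return w;
    }

    // Rule for the 'C' at index i of an upper-case word.
    // The checks are ordered from most to least specific: "CIA" must be
    // tested before the plain "CI" case, otherwise it would never match.
    public static char EncodeC(string w, int i)
    {
        string rest = w.Substring(i + 1);   // letters that follow the 'C'

        if (rest.StartsWith("IA", StringComparison.Ordinal))
            return 'X';

        if (rest.StartsWith("H", StringComparison.Ordinal))
        {
            // Assumption: "CH" directly after 'S' is the '-SCH-' case.
            bool afterS = i > 0 && w[i - 1] == 'S';
            return afterS ? 'K' : 'X';
        }

        if (rest.Length > 0 && "IEY".IndexOf(rest[0]) >= 0)
            return 'S';

        return 'K';
    }
}
```

For example, DropSilentInitial("Knight") returns "NIGHT", EncodeC("CHAIR", 0) returns 'X', EncodeC("SCHOOL", 1) returns 'K', and EncodeC("CELL", 0) returns 'S'. A full encoder would apply a rule like this for every letter while walking the word, which is why it has to keep track of its position and the surrounding letters.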

Metaphone uses a more complex set of rules to provide more accurate phonetic comparisons. Listing 2 shows my Metaphone class. Since it is more complex than the Soundex class and needs to track its state in variables, the Encode() method is not static and the class must be instantiated before use.

The price of that complexity is speed: Metaphone doesn't run as fast as Soundex. In return, it's quite a bit more accurate.

Conclusion

As previously mentioned, the English language does not follow a simple set of rules. It can be even more complex when you're working with names, since names may originate from countries that speak different languages. And, of course, using these algorithms with other languages would require a complete overhaul.

As a result, Metaphone has its deficiencies as well. Many variations of this algorithm have been developed, including yet another algorithm from Philips called Double Metaphone. If the code I've presented doesn't quite meet your needs, you can try to tweak it or research some of the other algorithms that have been developed.

The attached download includes the source code and a sample program that will search a list of words using either algorithm.
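The idea behind such a search can be sketched as follows. This is not the sample program from the download, just a minimal illustration: the names PhoneticSearchSketch and FindSoundAlikes() are hypothetical, and the encoder is passed in as a delegate so that either algorithm's Encode() method could be supplied.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PhoneticSearchSketch
{
    // Returns every word whose phonetic code equals the code of the query.
    // 'encode' is any function that turns a word into its phonetic code.
    public static List<string> FindSoundAlikes(
        IEnumerable<string> words, string query, Func<string, string> encode)
    {
        string target = encode(query);
        return words.Where(w => encode(w) == target).ToList();
    }
}
```

A call might pass a static Soundex encoder directly as the third argument, or wrap an instance method in a lambda such as w => metaphone.Encode(w), where metaphone is an instance created by the caller. For a large word list that is searched repeatedly, it would likely be worth encoding each word once and storing the codes rather than re-encoding on every search.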