The Soundex Algorithm

Cultural differences and input errors can lead to words being spelled differently from what a user expects, which makes information difficult to locate quickly. The Soundex algorithm can alleviate this by assigning codes based upon the sound of words, so that words that sound alike receive the same or similar codes.

Adding Soundex Character Codes

The GetSoundex method calls several private methods that have yet to be defined. The first is the AddCharacter method, which encodes a letter as a Soundex character and appends it to the code. The first letter is copied to the Soundex string; subsequent letters are converted to digits first and added only if they are not duplicates of the previous digit.
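The behaviour described above can be sketched in Python as follows. This is an illustration rather than the class's actual code: the name add_character, the use of a list to hold the code, and the _DIGITS lookup (a stand-in for the GetSoundexDigit method) are all assumptions made for the example.

```python
# Hypothetical stand-in for GetSoundexDigit: encodable letters map to "1".."6";
# every other letter becomes the "." placeholder.
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_DIGITS = {letter: digit for group, digit in _GROUPS.items() for letter in group}


def add_character(code, letter):
    """Append the Soundex character for an upper-case letter to the list `code`."""
    if not code:
        code.append(letter)  # the first letter is copied unchanged
        return
    digit = _DIGITS.get(letter, ".")
    if digit != code[-1]:  # skip a repeat of the previous digit or placeholder
        code.append(digit)


code = []
for letter in "TUTTLE":
    add_character(code, letter)
print("".join(code))  # T.34.
```

The double T contributes a single 3, and the placeholders for U and E remain in the intermediate string until a later step removes them.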

Determining Soundex Digits

The GetSoundexDigit method encodes letters as digits. The letter is converted to a value between one and six according to the algorithm rules. If the letter is not encodable, a full stop (period) character is used as a placeholder. The placeholders ensure that duplicates are not removed when separated by a vowel, H, W or Y.
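A minimal Python sketch of this lookup is shown below. The letter groups are those of the standard Soundex table. The function name get_soundex_digit and the _SOUNDEX_GROUPS tuple are assumptions made for the illustration.

```python
# Letter groups from the standard Soundex table.
_SOUNDEX_GROUPS = (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
)


def get_soundex_digit(letter):
    """Return "1".."6" for an encodable letter, otherwise the "." placeholder."""
    letter = letter.upper()
    if len(letter) == 1:
        for letters, digit in _SOUNDEX_GROUPS:
            if letter in letters:
                return digit
    return "."  # vowels, H, W, Y and anything else


print("".join(get_soundex_digit(c) for c in "TYMCZAK"))  # 3.522.2
```

In the output, C and Z produce adjacent 2s that a duplicate check would collapse, whereas the placeholder for A keeps the 2 from K separate so that it survives.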

Comparing Strings

The second public method of the Soundex class compares two strings to determine if they sound alike. In this case we use a simple algorithm. Firstly, the two strings are encoded using the Soundex algorithm. Next, the pairs of characters at each of the four positions are compared. The method returns the number of matching pairs. A result of four indicates the best possible match and zero the worst possible match. These values are useful when sorting a list of possible matches with the most likely appearing first.
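The comparison can be sketched in Python as below. To keep the example self-contained it includes a compact encoder. The names get_soundex and compare, the decision to skip characters outside A to Z, and the finishing steps (removing placeholders, then padding with zeroes or trimming to four characters) are assumptions for the sketch, not a reproduction of the class.

```python
_DIGITS = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _DIGITS[_letter] = _digit


def get_soundex(value):
    """Encode a word as a four-character Soundex code (sketch)."""
    code = ""
    for letter in value.upper():
        if not "A" <= letter <= "Z":
            continue  # assumption: ignore anything that is not A-Z
        if not code:
            code = letter  # first letter is copied unchanged
        else:
            digit = _DIGITS.get(letter, ".")
            if digit != code[-1]:
                code += digit
    # Assumed finishing steps: drop placeholders, then pad or trim to four.
    return (code.replace(".", "") + "0000")[:4]


def compare(first, second):
    """Count the positions (0 to 4) at which the two Soundex codes agree."""
    return sum(a == b for a, b in zip(get_soundex(first), get_soundex(second)))


print(compare("Robert", "Rupert"))  # 4
print(compare("Robert", "Rubin"))   # 2
print(compare("Robert", "Alice"))   # 0

# Sorting candidates so that the most likely matches appear first.
names = ["Alice", "Rubin", "Rupert"]
print(sorted(names, key=lambda name: compare("Robert", name), reverse=True))
# ['Rupert', 'Rubin', 'Alice']
```

Because short codes are padded with zeroes, two unrelated short words can still score a match on the padding positions, so the result is best treated as a rough ranking rather than a precise measure.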

Variations

There are variations on the Soundex algorithm, which makes it difficult to compare Soundex codes generated by different systems. One variation is to identify when the first few consonants of a word are encoded as the same numeric digit and remove this duplication. We can add this modification by changing the AddCharacter method to include an additional check: while the Soundex code contains only its first letter, a digit is not added if it duplicates the digit that the first letter itself would produce.
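One way this modification might look in Python is sketched below. As before, add_character, the list-based code and the _DIGITS lookup are assumptions made for illustration, and encode is a hypothetical helper that only exists to drive the example.

```python
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_DIGITS = {letter: digit for group, digit in _GROUPS.items() for letter in group}


def add_character(code, letter):
    """Variant that also suppresses digits duplicating the first letter's digit."""
    if not code:
        code.append(letter)
        return
    digit = _DIGITS.get(letter, ".")
    if len(code) == 1:
        # Only the first letter is stored so far: compare against its digit.
        previous = _DIGITS.get(code[0], ".")
    else:
        previous = code[-1]
    if digit != previous:
        code.append(digit)


def encode(word):
    """Hypothetical driver: returns the intermediate code, placeholders included."""
    code = []
    for letter in word.upper():
        add_character(code, letter)
    return "".join(code)


print(encode("Pfister"))  # P.23.6
```

Without the extra check, the F in "Pfister" would be added as a 1 directly after the P, giving P123 once placeholders are removed and the code is trimmed; with it, the result is P236.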

Another common variation is to treat the letters H, W and Y differently from vowels, ignoring them completely and removing duplicate codes that are separated by one of the three letters. You can use this variation by modifying the GetSoundexDigit method so that it does not return a placeholder for H, W or Y.
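The following Python sketch shows one possible form of this change. Returning an empty string for H, W and Y is an assumption about how the modification could be made, as are the names get_soundex_digit and get_soundex, the ignore_hwy parameter (added only so that both behaviours can be compared), and the finishing steps of removing placeholders and fixing the length at four characters.

```python
_GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
_DIGITS = {letter: digit for group, digit in _GROUPS.items() for letter in group}


def get_soundex_digit(letter, ignore_hwy=True):
    """Return a digit, the "." placeholder, or "" for an ignored H, W or Y."""
    if ignore_hwy and letter in ("H", "W", "Y"):
        return ""  # no placeholder, so digits on either side stay adjacent
    return _DIGITS.get(letter, ".")


def get_soundex(value, ignore_hwy=True):
    """Encode a word as a four-character Soundex code (sketch)."""
    code = ""
    for letter in value.upper():
        if not "A" <= letter <= "Z":
            continue  # assumption: ignore anything that is not A-Z
        if not code:
            code = letter  # first letter is copied unchanged
            continue
        digit = get_soundex_digit(letter, ignore_hwy)
        if digit != code[-1]:
            code += digit  # appending "" leaves the code unchanged
    # Assumed finishing steps: drop placeholders, then pad or trim to four.
    return (code.replace(".", "") + "0000")[:4]


print(get_soundex("Ashcraft", ignore_hwy=False))  # A226
print(get_soundex("Ashcraft"))                    # A261
```

In "Ashcraft" the S and C both encode as 2 and are separated only by H. When H is a placeholder both digits are kept; when H is ignored the second 2 is treated as a duplicate and dropped.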