Abstract

Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.

This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations which can be employed in a variety of solutions, to create more useful, effective searches of proper names in databases and other collections.

Introduction

When building solutions which perform searches of textual data given a search criteria, it is often desirable to account for the way in which the search terms are pronounced, compared to the pronunciation of the data being searched, producing as a result those records, which "sound like" the search terms. This type of comparison is known as "phonetic matching", and the algorithms by which it is performed are called "phonetic matching algorithms" or sometimes "phonetic encoding algorithms."

A significant amount of effort has been expended in search of better phonetic matching algorithms, with mixed success. One of the oldest phonetic matching algorithms, and likely the most familiar, is the Soundex algorithm, which was patented in the early 20th century, as an algorithm used on mechanical punch-card based machines. Soundex produces a key, based on a simplistic computation, which is supposed to represent the pronunciation of the original word -- or at least the first part of the word.

If one has ever implemented Soundex in a solution with more than a few records, one is aware of its limitations. It fails to match similar-sounding words, and proceeds to match inappropriate words. While Soundex is simple and fast, it is sadly lacking as a general purpose phonetic matching algorithm.

In 1990, Lawrence Phillips published the Metaphone algorithm in Computer Language magazine. Phillips' algorithm took into account several English pronunciation rules, yielding a more reliable phonetic key, representing the sound of a word, in particular a proper name. While an improvement over Soundex, Metaphone still failed to match some obviously homophonic words, such as "Bryan" and "Brian."

Finally, in the June 2000 issue of C/C++ User's Journal, Phillips published the Double Metaphone phonetic matching algorithm. Double Metaphone implemented additional heuristics to correct previous shortcomings of Metaphone, and produced a second, "alternate" key, which represented the name's native, as opposed to American, pronunciation. With these enhancements, Phillips had created a phonetic matching algorithm reliable enough for main stream use.

This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see the link above.

Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.

If one is only interested in how to implement phonetic matching for a particular language, one may still benefit from the background information in this part of the series.

Background

Phillips' original Double Metaphone implementation took the form of a C++ class, MString, which derived from the MFC string class CString. One simply called the DoubleMetaphone method, and two keys were produced for the word contained in the string. The first, or primary key represented the pronunciation of the word by American pronunciation conventions, while the second, or alternate key represented the native pronunciation of a foreign word or surname. The alternate key can be, and usually is, empty, indicating no alternate native pronunciation.

The Metaphone keys are four letters long (three for short words). As a consequence, the Metaphone keys represent the pronunciation of part of the word, specifically the first part. Empirical tests by Phillips and the author have confirmed that four letters seems to be the "magic length", balancing tolerance with relevance of matches.

To compare two words for a phonetic match, one compares the primary and alternate keys of the first word with the primary and alternate keys of the second. The two words are said to match phonetically if any one (or more) of the four combinations match. Relative match strength can be computed with the following table:

Primary Key = Primary Key

Strongest Match

Secondary Key = Primary Key

Normal Match

Primary Key = Secondary Key

Normal Match

Alternate Key = Alternate Key

Minimal Match

Depending upon the implementation requirements, results may be filtered to include only certain match levels.

The improved implementation

The implementation presented in this article takes the form of a C++ template, taking a single template parameter, which specifies the key length to produce. As stated above, four letters seems ideal, but the template allows the programmer some experimentation.

This implementation has several advantages over Phillips' original code:

First, it is not based on MFC; therefore the MFC runtimes are not required to use the template. In fact, it is not based on any Windows-specific elements; therefore it should be usable with nothing more than a modern compiler with template support.

Second, it is not derived from a string class. In the author's opinion, the entity being represented by the Double Metaphone implementation class should be the Metaphone keys. This leads to the next advantage…

Third, it implements the C++ == and != comparison operators, in terms of the Metaphone keys, thereby making this implementation ideally suited to use as a key in certain container classes, which will be discussed in more detail below.

Fourth, due to minor optimizations and the fully inline nature of templates, this implementation computes keys approximately 2x faster than Phillips' original implementation, based on simplistic tests.

To use this implementation, simply include the header file:

#include"DoubleMetaphone.h"

and instantiate a class based on the template by specifying the key length, then pass the word to be keyed to the constructor or the computeKeys method:

Note that if no alternate key is present for the word associated with the instance, getAlternateKey returns NULL.

As a convenience to users of the class, the char* passed to the constructor or computeKeys is returned by getOriginalWord. Note that this is not a copy of the original word, but rather a copy of the string pointer from which the current keys were computed. If this pointer is freed after it is passed to the class, do not expect the pointer returned by getOriginalWord to be valid.

To compare two words for phonetic similarity using the four-way comparison described in the previous section, simply instantiate two DoubleMetaphone objects, pass the two words to the constructors or the computeKeys methods of the respective objects, and compare them:

Practical use of the implementation

The previous section provided a very brief discussion of the implementation's interface, and how to compute Metaphone keys for a word, while this section demonstrates the construction of a phonetic search application, Word Lookup, which retrieves all like-sounding names, given a search name.

While this solution is hardly representative of a production-grade system, it is demonstrative of the key principles of phonetic matching w/ Double Metaphone, and provides a test environment within which, various parameters may be tested and tuned to a particular environment. For example, try varying the length of Metaphone keys produced, and note what effect this has on the match results. Or, try using a list of words instead of proper names; does the phonetic matching perform as well for general words as it does for proper names? Each application of phonetic matching is different; therefore it is reasonable to expect that some tuning will be required for maximal performance in a given application.

In practice, when implementing a solution which calls for phonetic matching, one does not find oneself comparing two words for similarity as in the above code snippets. Rather, a collection of records, consisting perhaps of last names or some other candidate data, must be searched in its entirety, producing one or more matching records.

For this article, the aforementioned collection will be an STL multimap class, with a key type of an STL string class containing a Metaphone phonetic key (primary or alternate), and a value type of an STL string class containing a name. The collection will be populated from a list of 21k names, obtained from the Moby project. For the complete source code for this sample application, and a copy of the name list, download the source archive associated with this article.

The STL multimap class was chosen for two reasons: first it is simple and familiar, and second, it implements an associative container associating one key with one or more values. This is critical since the objective is to associate one key (a Metaphone phonetic key representing the sound of a word) with multiple values (all words for which Double Metaphone produces that key).

Before searching can take place, the container must be populated. To do this, the words are read from the file (one per line), each word is passed to the computeKeys method on an instance of DoubleMetaphone4 to compute the word's keys, and an entry added to the multimap mapping the DoubleMetaphone4 class' keys with the string value class:

In the above snippet, file is of type ifstream, and has been open to the list of names being used to populate the map. While file is not at the end of the file, a line is read into line, and then line is passed to computeKeys, to compute the Metaphone keys for the word contained in line. An entry is then added to the wordMap (the multimap mapping one string to multiple string instances) mapping the primary Metaphone key of the word to the word itself. If an alternate key exists for the word, another entry is placed in wordMap, mapping the alternate key to the word as well.

One must keep in mind that the multimap maps one key to multiple values, therefore at the end of this loop, for any given phonetic key in the multimap, one or more words will be associated with that key. As a result, when searching for phonetically matching words which sound like a search word, the primary (and alternate, if present) Metaphone keys for the search word are used to query the multimap, and all words present in the multimap at either of the two keys have some level of phonetic similarity to the search word. A snippet from the word lookup sample application which performs arbitrary word searches appears below:

The above snippet contains the implementation of phoneticMatch, a function from the Word Lookup sample application which searches a multimap for all words phonetically matching a search word, searchWord. All matching words are placed in an STL set, matchingWords, for use by the caller.

Clearly, the search process is remarkably simple, considering the sophisticated results being produced. First, a DoubleMetaphone4 instance is created with the search word, causing the Metaphone keys for the search word to be computed. Next, a for loop iterates over the words in the map associated with the primary Metaphone key of the search word, placing each matching word in the results set, matchingWords. (For those not familiar with the STL multimap, lower_bound returns an iterator pointing to the first value associated with a key, and upper_bound returns an iterator pointing just beyond the last value associated with a key; therefore traversing the range between the two iterators yields all values associated with a given key). Finally, if an alternate Metaphone key exists for the search word, all words associated with that alternate key are also placed in the results set.

The final component of Word Lookup which will be examined here is the computeMatchLevel function. This simple function implements the result scoring technique described in the table in the previous section. It simply determines which Metaphone keys of a search word match which Metaphone keys of a candidate word, returning a match score from 1 (strong, Primary-Primary match) to 3 (minimal, Alternate-Alternate match). In Word Lookup, computeMatchLevel is called for every word matching the search word. The score returned by computeMatchLevel is printed to the console along with the matching word itself, to give the user a sense of the strength of the match:

When experimenting with the Word Lookup application, one often encounters matches of dubious validity; however these matches almost always have a match score of 3, indicating that the Double Metaphone algorithm considers them only distant matches. This is encouraging to the practitioner, as it means the width of the net cast in the search for phonetic matches, can be tuned by limiting results to a certain match level, with reasonable assurance that valid matches will not be rejected.

The Unsigned Short optimization

While the implementation techniques in the previous sections are quite effective and reasonably efficient, the astute reader may be troubled by the significant number of string comparisons implicitly performed by the multimap class during word searches. Additionally, while a 21,000 word list requires minimal memory, scaling the list to hundreds of thousands, or millions, would result in a significant memory footprint due to the number of STL string class instances used to store not only the words themselves, but the phonetic keys for those words.

Fortunately, an alternative exists. In his 2000 CUJ article, Phillips proposed representing four-character Metaphone keys as four, four-bit nibbles, together forming a 16-bit unsigned short. Since a four-bit nibble can represent values from hex 0 to hex f (0-15 decimal), and the Metaphone algorithm produces keys from only 12 characters, one could assign each character a numerical equivalent, thereby representing a key as a "string" of nibbles, packed into 16-bit integer.

While Phillips did not implement this optimization in his MString class, I have implemented it in the ShortDoubleMetaphone class, which is a subclass of DoubleMetaphone4 -- that is, it inherits from the class which results from specifying a key length of 4 to the DoubleMetaphone template. The reason for deriving ShortDoubleMetaphone from DoubleMetaphone4, instead of making it another template with a variable key length, should be obvious; key lengths greater than four cannot be represented in a 16-bit short, and key lengths less than four are not sufficiently specific, and yield a significant number of weak matches.

The use of ShortDoubleMetaphone is almost identical to DoubleMetaphone4; in fact, since it is a subclass of DoubleMetaphone4, it can be used in exactly the same way, though doing so will not realize the benefit of the unsigned short optimization.

ShortDoubleMetaphone adds two methods: getPrimaryShortKey and getAlternateShortKey. As their names imply, they return an unsigned short value representing their respective keys. getAlternateShortKey returns METAPHONE_INVALID_KEY (0xffff) to indicate the lack of an alternate key. Using these methods, and the unsigned short type, in place of the string equivalents, are all that is required to convert from a DoubleMetaphone - based implementation to a ShortDoubleMetaphone one.

While not discussed here, the source archive includes a version of the Word Lookup sample which uses ShortDoubleMetaphone instead of DoubleMetaphone. Other than a 50%-150% improvement in lookup speed, depending upon the data, there are no significant differences in behavior between the two versions.

Conclusion

This article has explored the phonetic matching problem, introduced the Double Metaphone algorithm as a candidate solution, and demonstrated the use of the DoubleMetaphone template class implementation of Double Metaphone in practical applications. While this information is important to the development of a phonetic matching system, it is missing some significant pieces of information: how to adapt this technology to a relational database system where the vast majority of information requiring phonetic searching is likely to be stored; and how to use Double Metaphone from other modern programming languages.

This article does not address these questions, primarily because the objective was simply to introduce phonetic matching, Double Metaphone, and the implementation in C++. The reader is encouraged to continue reading this article series, for further exploration of Double Metaphone applications, as well as examples of phonetic matching using relational databases, and other programming languages. Part II introduces a COM component implementation of Double Metaphone, which is callable from Visual Basic, Visual FoxPro, Delphi, and any other COM-compatible language. Part II also introduces phonetic matching against a relational database, and includes a sample Visual Basic application to that end. Part III demonstrates using the COM implementation from ASP and VBScript, including phonetic searching against a database. Part IV introduces the author's SQL Server extended stored procedure, which enables computation of Double Metaphone keys from within SQL. Part IV also discusses optimizations of relational database phonetic matching solutions. Part V explores the author's .NET Double Metaphone implementation, and includes a sample which performs phonetic searches against a relational database. Finally, Part VI concludes the discussion of Double Metaphone with an examination of alternate phonetic matching techniques, and pointers to other resources and Double Metaphone implementations.

Share

About the Author

My name is Adam Nelson. I've been a professional programmer since 1996, working on everything from database development, early first-generation web applications, modern n-tier distributed apps, high-performance wireless security tools, to my last job as a Senior Consultant at BearingPoint posted in Baghdad, Iraq training Iraqi developers in the wonders of C# and ASP.NET. I am currently an Engineering Director at Dell.

I have a wide range of skills and interests, including cryptography, image processing, computational linguistics, military history, 3D graphics, database optimization, and mathematics, to name a few.