Abstract

Simple information searches -- name lookups, word searches, etc. -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, as well as the propensity for humans to misspell surnames, this simplistic criterion often yields less than desirable results, in the form of reduced result sets, missing records that differ by a misplaced letter or different national spelling.

This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations, which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.

Introduction

This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.

Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.

Background

Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even if the reader has no C++ experience.

The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."

Double Metaphone works by producing one or possibly two phonetic keys, given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.

The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.

The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.

To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:

Word 1 Primary - Word 2 Primary

Word 1 Primary - Word 2 Alternate

Word 1 Alternate - Word 2 Primary

Word 1 Alternate - Word 2 Alternate

If one of the words has no alternate key, the comparisons involving that key are simply skipped.

Depending upon which of the above comparisons matches, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.
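
The four comparisons and the resulting match levels can be sketched roughly as follows. This is a minimal illustration, not the author's code: the `MatchStrength` enum and the `PhoneticMatch.Compare` helper are hypothetical names, and the keys are assumed to be plain strings, with a null or empty string standing for "no alternate key".

```csharp
using System;

// Match levels described above, ordered from weakest to strongest
// so that callers can filter with a simple >= comparison.
public enum MatchStrength { None, Minimal, Medium, Strong }

public static class PhoneticMatch
{
    // Compares the primary and alternate keys of two words.
    // Assumption: a null or empty alternate means "no alternate key".
    public static MatchStrength Compare(
        string primary1, string alternate1,
        string primary2, string alternate2)
    {
        bool hasAlt1 = !string.IsNullOrEmpty(alternate1);
        bool hasAlt2 = !string.IsNullOrEmpty(alternate2);

        // Primary vs. primary: strong match.
        if (primary1 == primary2) return MatchStrength.Strong;

        // Primary vs. alternate (either direction): medium match.
        if (hasAlt2 && primary1 == alternate2) return MatchStrength.Medium;
        if (hasAlt1 && alternate1 == primary2) return MatchStrength.Medium;

        // Alternate vs. alternate: minimal match.
        if (hasAlt1 && hasAlt2 && alternate1 == alternate2) return MatchStrength.Minimal;

        return MatchStrength.None;
    }
}
```

An application that wants to exclude the weakest level could keep only results where `Compare(...) >= MatchStrength.Medium`.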

.NET implementation

The .NET implementation of Double Metaphone is very similar in design and use to the C++ implementation presented in Part I. To use the .NET implementation, simply add the Metaphone.NET.dll assembly to your project's references in Visual Studio .NET, import the nullpointer.Metaphone namespace into the source files, and instantiate the DoubleMetaphone or ShortDoubleMetaphone classes, for string and unsigned short Metaphone keys, respectively.

For example, to compute the Metaphone keys for the name "Nelson", code similar to that listed below may be used (C# code listed; the .NET implementation is callable from VB.NET, J#, and all other .NET languages):

As with all of the implementations presented in this article series, a sample application written in C#, CS Word Lookup, is presented to demonstrate the use of the .NET implementation. CS Word Lookup uses a Hashtable collection class to map each Metaphone phonetic key to an ArrayList containing the words that produce that key.
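
The key-to-word-list mapping can be sketched as below. This is not the CS Word Lookup source: it swaps in the generic `Dictionary` and `List` classes for `Hashtable` and `ArrayList`, and the `PhoneticIndex` class and its `keysFor` delegate are hypothetical. The delegate stands in for whatever routine computes the phonetic key or keys of a word.

```csharp
using System;
using System.Collections.Generic;

// Maps each phonetic key to every word that produces it.
public class PhoneticIndex
{
    private readonly Dictionary<string, List<string>> _wordsByKey =
        new Dictionary<string, List<string>>();

    // Hypothetical delegate: returns the phonetic key(s) for a word.
    private readonly Func<string, IEnumerable<string>> _keysFor;

    public PhoneticIndex(Func<string, IEnumerable<string>> keysFor)
    {
        if (keysFor == null) throw new ArgumentNullException("keysFor");
        _keysFor = keysFor;
    }

    // Files the word under each of its keys.
    public void Add(string word)
    {
        foreach (string key in _keysFor(word))
        {
            List<string> bucket;
            if (!_wordsByKey.TryGetValue(key, out bucket))
            {
                bucket = new List<string>();
                _wordsByKey[key] = bucket;
            }
            bucket.Add(word);
        }
    }

    // Returns every indexed word sharing at least one key with the query,
    // in insertion order and without duplicates.
    public IList<string> Lookup(string word)
    {
        var results = new List<string>();
        foreach (string key in _keysFor(word))
        {
            List<string> bucket;
            if (!_wordsByKey.TryGetValue(key, out bucket)) continue;
            foreach (string candidate in bucket)
            {
                if (!results.Contains(candidate)) results.Add(candidate);
            }
        }
        return results;
    }
}
```

In a real application the delegate would return the primary key and, when present, the alternate key, so that a single lookup covers every key combination.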

Performance notes

While the .NET CLR performs reasonably well, it must be stated that the C++ implementation of Double Metaphone will likely perform significantly faster than the .NET version, due primarily to the fact that the C++ version judiciously avoids memory allocation and buffer copies, while the .NET implementation is unable to avoid such constructs. The ambitious reader is encouraged to optimize the .NET implementation, perhaps through the use of the unsafe keyword, to perform direct memory access, at the expense of CLR compliance.

Conclusion

This brief article introduced the author's .NET implementation of Double Metaphone, including code snippets and a brief discussion of performance issues. Continue to Part VI for a review of alternative phonetic matching techniques, and a list of phonetic matching resources, including links to other Double Metaphone implementations.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

My name is Adam Nelson. I've been a professional programmer since 1996, working on everything from database development, early first-generation web applications, modern n-tier distributed apps, high-performance wireless security tools, to my last job as a Senior Consultant at BearingPoint posted in Baghdad, Iraq training Iraqi developers in the wonders of C# and ASP.NET. I am currently an Engineering Director at Dell.

I have a wide range of skills and interests, including cryptography, image processing, computational linguistics, military history, 3D graphics, database optimization, and mathematics, to name a few.

Hello, thanks for the great article. I found the C# version not so idiomatic for my habits, so I tried to create a new one also because I needed a portable library. You can find the code here: http://1drv.ms/SvaVlK. A few comments on it:

a) some of the optimizations are not strictly needed in C# (e.g., it's probably useless to store the length of a string when strings are immutable). In fact, once I removed those data members, I could rewrite the whole class as a static helper (which also avoids creating a new instance for each key generation).

b) the usage of StringBuilder looks a bit weird to me. You can just use Append rather than using a for loop to append each single character.

c) there is probably a bug in AreStringsAt: I tried with the word "c" and I got an exception; I could fix it by adding a bounds check for length: if ((start < 0) || (start + length > word.Length)) return false.

d) once the main class became a helper, I refactored the short-version to be just a helper too which takes a string and packs it into bits, using an int (which is CLS-compliant) rather than an ushort (should anyone need an ushort, a cast on the caller side will suffice). This allowed for lengths up to 8 (adding a Take for avoiding overflows).

e) avoiding ref parameters where not needed is usually a good thing, so I removed them and added a Tuple as return value.
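
The packing idea in point (d) might look something like the sketch below. It is not the commenter's code: the `KeyPacker` name, the symbol set, and the 4-bit codes are all assumptions made for illustration, and the packed values will not match those produced by ShortDoubleMetaphone.

```csharp
using System;
using System.Linq;

public static class KeyPacker
{
    // Assumed set of key symbols. A symbol's position plus one is its 4-bit code,
    // so code 0 never appears and keys of different lengths stay distinct.
    private const string Symbols = "AFHJKLMNPRSTX0";

    // Packs up to 8 key characters into a 32-bit int, 4 bits per character.
    public static int Pack(string key)
    {
        if (key == null) throw new ArgumentNullException("key");

        uint packed = 0;
        // Take(8) caps the input at 8 symbols (8 x 4 bits = 32 bits).
        foreach (char c in key.Take(8))
        {
            int code = Symbols.IndexOf(c) + 1;
            if (code == 0) throw new ArgumentException("Unexpected key symbol: " + c);
            packed = (packed << 4) | (uint)code;
        }
        // Reinterpret the bits as a signed int, which is CLS-compliant.
        return unchecked((int)packed);
    }
}
```

With this particular encoding, a four-character key always fits in the low 16 bits, so a caller that needs a ushort can cast the result.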

Is it acceptable to compile the source into a strongly signed dll for inclusion with another commercially available strongly signed dll? If so, are there any pitfalls we should be aware of to ensure we comply with licensing concerns?

Adam, Thanks for your work on these implementations - we’ve been using the extended stored procedure successfully for the past 2½ years. We’ve recently upgraded to SQL Server 2005 and will soon be changing to 64-bit hardware, which requires us to make some changes since 32-bit dlls aren’t supported on the new hardware. We would like to change this over to a CLR implementation, since Microsoft has deprecated extended stored procedures for SQL Server 2005. I’d like to request your help with a couple of issues:

1. Converting the DoubleMetaphone and ShortDoubleMetaphone classes to .NET 2.0, with interfaces suitable for use with the new CREATE ASSEMBLY statement (requires a static method), and accessible via a SQL scalar user-defined function (requires a single output parameter that matches a native SQL data type). We can handle this conversion ourselves, but I was hoping you might take an interest since the days of xp_metaphone.dll appear to be numbered.

2. The .NET implementation you published doesn’t return the same primary and alternate keys as the COM implementation for some names. (We found 1389 differences out of 159,289 names we have indexed.) I took a quick step through in debug and couldn’t see where the problem is, but based on spot checks it appears that the .NET implementation is the one with problems. Here are some examples; I’ll be happy to send you the entire list of differences if you’d like.

AGNEW, ALLOIS: No alternate key from .NET
ALLECIA, ARCHILLA: Different alternate keys
AUTHIER: This case might represent a gap in the algorithm, since neither the COM nor the .NET implementations return the keys I expected. The anglicized pronunciation is au-thir´ (key 0R), while the French pronunciation is o-tya´ (key T).
BAUMB, BAUX: Different primary keys
BEAUBIER, ROZIER: Alternate, primary keys out of sync

Re-packaging the metaphone impl into a static class with a scalar function shouldn't be too hard. It shouldn't take but a few minutes.

This is the first I've heard of output disparities between the COM and .NET impl. Thanks for bringing it to my attention, and with test data no less. I'll investigate further to see about fixing the problem. I might not get to it until the weekend.

Mike:
Thanks for looking into this, and my apologies for the delayed response.

I've put together a test rig that runs a list of names through Phillips' original Double Metaphone impl, my C++ impl, and my C# impl. I didn't see the exception you reported for the 'CAESAR' case, but I do see several names producing different results under C# vs C++. I'm looking into this now.

Regarding SQL Server, it seems a static class with the [SqlFunction] attribute wrapping the existing DoubleMetaphone class would do the trick.

Adam,
Thanks for your response. We ran into a few glitches with the CLR implementation for SQL Server:

1. SQL Server apparently doesn't allow namespaces in CLR classes, so we had to remove this from your original source.
2. Only a single dll file can be registered via the CREATE ASSEMBLY statement, so we had to combine the source files in order to use ShortDoubleMetaphone.
3. A SQL scalar UDF can only return a single parameter, so there wasn't a clean mapping to replace xp_metaphone with its separate output parameters for the primary and alternate metaphone keys. We opted to combine the values into a single BINARY(4) output parameter and then parse this back into two SMALLINTs after the UDF call, but this seems like a kludge. This is also where we ran into the glitch for the "WF" input parameter - we got x0000 back instead of the expected xFFFF for the alternate key.
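
The workaround in point 3, returning both 16-bit keys in one four-byte value and splitting it again afterwards, could be sketched on the .NET side as follows. The `KeyPair` class and its method names are hypothetical, and the byte order (primary key first, high byte first) is an assumption chosen for this example.

```csharp
using System;

public static class KeyPair
{
    // Combines two 16-bit keys into four bytes: primary first, then alternate,
    // each stored high byte first.
    public static byte[] ToBinary4(ushort primary, ushort alternate)
    {
        return new byte[]
        {
            (byte)(primary >> 8), (byte)(primary & 0xFF),
            (byte)(alternate >> 8), (byte)(alternate & 0xFF)
        };
    }

    // Splits the four bytes back into the two 16-bit keys.
    public static void FromBinary4(byte[] packed, out ushort primary, out ushort alternate)
    {
        if (packed == null || packed.Length != 4)
            throw new ArgumentException("Expected exactly four bytes.", "packed");

        primary = (ushort)((packed[0] << 8) | packed[1]);
        alternate = (ushort)((packed[2] << 8) | packed[3]);
    }
}
```

A round trip through `ToBinary4` and `FromBinary4` should return the original pair, including an alternate of 0xFFFF, which makes it a convenient check when chasing a discrepancy like the x0000 result described above.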

What we have is working, but I would still be interested in your thoughts regarding a well-thought-out approach for SQL 2005.

Sorry if my 16:16 9 Jan '07 posting was unclear - the exception occurred at line 139 for an input of "C" rather than "CAESAR". The change to use 5 spaces of padding has corrected this.

Mike:
I've implemented the fixes you proposed, and my test rig now confirms the C# impl produces identical results for all 21k test names, including 'CAESAR' and 'WJ'. I'm going to update the article with the new code, but that's done via email and may take some time; in the meanwhile, I could send you the code if you like.

First, I just want to thank the author for this code. It works great. I'm looking at using the extended stored procedure as well as the .net assembly.
So my question is how do we know if a particular word has no alternate key when we are using the unsigned short version of the keys?

Thanks. When I asked the question, I was trying to figure out what value represented the lack of an alternate key for a word. In the SQL xp implementation, you get a null value, but with ShortDoubleMetaphone, you get 65535. For several reasons, we wanted to compute the metaphone keys in our .Net app and compare them against a table of keys in SQL.
The only tricky part was figuring out how to translate between SQL server's smallint values and .Net's UInt16 values. So what I ended up doing is converting the results of the SQL XP to the equivalent UInt16 values and storing those in the key tables. Here is what worked well for us:
--This is the value used to represent a null or invalid metaphone key
DECLARE @maxKeyValue int
SET @maxKeyValue = 65535
EXEC master..xp_metaphone @WorkWord, @primaryMetaphoneTemp output, @alternateMetaphoneTemp output
IF @alternateMetaphoneTemp IS NULL
    SET @alternateMetaphone = @maxKeyValue
ELSE IF @alternateMetaphoneTemp < 0
    --convert this smallint value to the equivalent unsigned int value
    SET @alternateMetaphone = @alternateMetaphoneTemp + @maxKeyValue + 1
ELSE
    SET @alternateMetaphone = @alternateMetaphoneTemp
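
The same signed-to-unsigned translation can also be done on the .NET side. The sketch below uses a hypothetical `KeyConvert` helper; it relies only on the fact that a smallint and a UInt16 are both 16 bits wide, so the conversion is a reinterpretation of the same bits.

```csharp
using System;

public static class KeyConvert
{
    // Value reported above for a missing alternate key in ShortDoubleMetaphone.
    public const ushort NoKey = 65535;

    // Reinterprets a SQL smallint as the equivalent unsigned 16-bit key.
    // Negative values map to value + 65536, matching the T-SQL above.
    public static ushort FromSmallInt(short value)
    {
        return unchecked((ushort)value);
    }

    // Reinterprets an unsigned 16-bit key as a SQL smallint.
    public static short ToSmallInt(ushort key)
    {
        return unchecked((short)key);
    }
}
```

Note that a smallint of -1 converts to 65535, the same value used for "no key", so a NULL alternate and an alternate stored as -1 cannot be told apart after conversion.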

I think I might have found a small bug in your otherwise excellent code (thanks for doing this!).

When running CSWordLookup with this dictionary file the function nullpointer.Metaphone.DoubleMetaphone.areStringsAt(start,length,strings) failed with an index out of range error. I added a simple check to fix it. Modified function:

private bool areStringsAt(int start, int length, params String[] strings)
{
    if (start < 0 || m_word.Length < length)
    {
        //Sometimes, as a result of expressions like "current - 2" for start,
        //start ends up negative. Since no string can be present at a negative offset, this is always false
        return false;
    }
    //(the remainder of the method is not shown in the post)
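
The check posted above compares only the word length against length and does not take the start offset into account. A stricter variant, in the spirit of the `start + length > word.Length` test suggested in an earlier comment, is sketched below. It is a self-contained illustration with a hypothetical `WordProbe` class, not the library's actual method, and it takes the word as a parameter instead of reading the m_word field.

```csharp
using System;

public static class WordProbe
{
    // Returns true if the substring of word at [start, start + length)
    // equals any of the candidate strings.
    public static bool AreStringsAt(string word, int start, int length, params string[] candidates)
    {
        // Reject any range that is not fully inside the word. Writing the test as
        // "start > word.Length - length" avoids overflow in "start + length".
        if (word == null || start < 0 || length < 0 || start > word.Length - length)
        {
            return false;
        }

        string target = word.Substring(start, length);
        foreach (string candidate in candidates)
        {
            if (candidate == target) return true;
        }
        return false;
    }
}
```

With this check, a one-letter word such as "C" simply yields false for any two-character probe instead of throwing.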

I have read your work "Implement Phonetic ("Sounds-like") Name Searches with Double Metaphone".
It is very interesting. Recently I found a paper (Phonetic String Matching: Lessons from Information
Retrieval - Justin Zobel, Philip Dart) talking about approximate string matching.
I'm planning to experiment with the Editex algorithm. Do you know where I can find more data about this?

I was almost ready to use the metaphone method when I stumbled across your articles on double metaphone. You did a VERY good job of explaining it and offering examples. The only thing I wish for was the source code in VB, but not a big deal.
The only major thing I'll need to add is looking up on multiple words.

Does your version of the .NET implementation produce exactly the same results as the original Phillips version? I need to know this because we are currently using the Phillips version and want to ensure the compatibility of both versions for comparisons.

Gary:
My .NET implementation should be completely compatible with Phillips' original version; in fact, it should be algorithmically identical. In the course of development, I generated a corpus of DMetaphone keys using Phillips' original code, and compared this to the same corpus processed with my implementation, and only when the entire corpus of ~14k names matched did I consider my algorithm done.

Therefore, my answer to your question is yes, my impl SHOULD produce the same Double Metaphone keys as Phillips' impl, given the same input. Should you find names for which this assertion does not hold, that would constitute a bug in my implementation, and I would very much like to know about it.

Thomas:
Since different languages use different phonemes, any phonetic matching algorithm designed to support one language (say, English) will require some modification to fully support another language (say, German).

However, when Phillips designed Double Metaphone, he had ethnic pronunciations in mind, including German. Several special cases exist in his algorithm to deal with German names like "Schmidt". Therefore, if you are limiting your application to phonetic matching of German surnames, Double Metaphone might work acceptably well without modification, though obviously that depends entirely on the application.

If Double Metaphone is not adequate, modifying the code will likely be tedious and error prone. From a simple perusal of Phillips' Double Metaphone, all of the special cases are clearly the result of exhaustive trial and error until a workable algorithm was produced. Therefore, any attempts to extend the algorithm further will likely involve a similar process.

Thus, if you find Double Metaphone does not suit your needs, I suggest you consider some of the alternative techniques described in Part VI. You might also read the referenced papers, and have a look at Second String, the Java toolkit providing a number of approximate string matching techniques.

I had been holding off hyperlinking between pages, since after editing, the URLs will change. However, it's now been several days since I first posted, and the articles remain unedited, therefore I have implemented your suggestion. Each time an article makes reference to another article, that reference should now be a hyperlink to that article.

Hopefully when the editors post my articles to their final destinations, the editors will update the hyperlinks as well.