Abstract

Simple information searches -- name lookups, word searches, and the like -- are often implemented in terms of an exact match criterion. However, given both the diversity of homophonic (pronounced the same) words and names, and the propensity for humans to misspell surnames, this simplistic criterion often yields disappointing results: shrunken result sets, missing records that differ by a single misplaced letter or an alternate national spelling.

This article series discusses Lawrence Phillips' Double Metaphone phonetic matching algorithm, and provides several useful implementations which can be employed in a variety of solutions to create more useful, effective searches of proper names in databases and other collections.

Introduction

This article series discusses the practical use of the Double Metaphone algorithm to phonetically search name data, using the author's implementations written for C++, COM (Visual Basic, etc.), scripting clients (VBScript, JScript, ASP), SQL, and .NET (C#, VB.NET, and any other .NET language). For a discussion of the Double Metaphone algorithm itself, and Phillips' original code, see Phillips' article in the June 2000 CUJ, available here.

Part I introduces Double Metaphone and describes the author's C++ implementation and its use. Part II discusses the use of the author's COM implementation from within Visual Basic. Part III demonstrates use of the COM implementation from ASP and with VBScript. Part IV shows how to perform phonetic matching within SQL Server using the author's extended stored procedure. Part V demonstrates the author's .NET implementation. Finally, Part VI closes with a survey of phonetic matching alternatives, and pointers to other resources.

Background

Part I of this article series discussed the Double Metaphone algorithm, its origin and use, and the author's C++ implementation. While this section summarizes the key information from that article, readers are encouraged to review the entire article, even those with no C++ experience.

The Double Metaphone algorithm, developed by Lawrence Phillips and published in the June 2000 issue of C/C++ Users Journal, is part of a class of algorithms known as "phonetic matching" or "phonetic encoding" algorithms. These algorithms attempt to detect phonetic ("sounds-like") relationships between words. For example, a phonetic matching algorithm should detect a strong phonetic relationship between "Nelson" and "Nilsen", and no phonetic relationship between "Adam" and "Nelson."

Double Metaphone works by producing one or possibly two phonetic keys given a word. These keys represent the "sound" of the word. A typical Double Metaphone key is four characters long, as this tends to produce the ideal balance between specificity and generality of results.

The first, or primary, Double Metaphone key represents the American pronunciation of the source word. All words have a primary Double Metaphone key.

The second, or alternate, Double Metaphone key represents an alternate, national pronunciation. For example, many Polish surnames are "Americanized", yielding two possible pronunciations, the original Polish, and the American. For this reason, Double Metaphone computes alternate keys for some words. Note that the vast majority (very roughly, 90%) of words will not yield an alternate key, but when an alternate is computed, it can be pivotal in matching the word.

To compare two words for phonetic similarity, one computes their respective Double Metaphone keys, and then compares each combination:

Word 1 Primary - Word 2 Primary

Word 1 Primary - Word 2 Alternate

Word 1 Alternate - Word 2 Primary

Word 1 Alternate - Word 2 Alternate

Obviously, if any of the keys in these comparisons are not produced for the given words, the comparisons involving those keys are skipped.

Depending upon which of the above comparisons matches, a match strength is computed. If the first comparison matches, the two words have a strong phonetic similarity. If the second or third comparison matches, the two words have a medium phonetic similarity. If the fourth comparison matches, the two words have a minimal phonetic similarity. Depending upon the particular application requirements, one or more match levels may be excluded from match results.

SQL Server extended stored procedure

In parts II and III of this series, sample applications in Visual Basic and Active Server Pages have been presented, which query a Microsoft Access database of words and Double Metaphone keys for phonetic similarity. In both of these applications, all phonetic matching logic existed at the application layer; that is, the database was simply a gopher, retrieving exact matches of specific criteria produced by the application layer.

For some applications, this arrangement is desirable. It places all application logic within the application, making it easier to maintain, as well as move to an alternate database platform. Indeed, any logic more advanced than an INNER JOIN is often moved to a "business logic" layer by software design purists, often for good cause, but sometimes simply because the latest trendy development fad calls for it.

In reality, certain functionality can be implemented more efficiently, or sometimes exclusively, within the relational database itself. The situations which lead to such a predicament are difficult to foresee from a generalist's perspective, yet when they do crop up, they must be dealt with. Hence the SQL Server implementation of Double Metaphone: not everyone will need it, but there will be situations in which it is the best, or only, option for a phonetic matching system.

Using the XP

SQL Server extended stored procedures (XPs from now on) are implemented as Win32 DLLs. To install an XP, copy its DLL to the Binn directory in the SQL Server install path, then run sp_addextendedproc from the master database. For the Double Metaphone XP, the command looks like this:

use master
exec sp_addextendedproc 'xp_metaphone', 'XPMetaphone.dll'

Once this command has been executed, a new XP will be available, which for all intents and purposes is equivalent to a SQL stored procedure declared like so:
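The declaration might be sketched as follows; the article's text names only the @word and @alternateKey parameters, so the name @primaryKey and the varchar length are assumed for illustration:

```sql
-- Conceptual declaration only: xp_metaphone is really implemented in a
-- Win32 DLL, not in T-SQL. @primaryKey is an assumed parameter name.
CREATE PROCEDURE xp_metaphone
    @word         varchar(255),          -- word to encode
    @primaryKey   smallint OUTPUT,       -- primary Double Metaphone key
    @alternateKey smallint OUTPUT        -- alternate key, or NULL if none
```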

If you have been reading the previous articles, the use of this XP should be obvious. The @word parameter contains the word for which to compute Double Metaphone keys, and the two output parameters receive the computed keys; @alternateKey is set to NULL if no alternate key is computed. Note that both key parameters are smallint: the XP uses the unsigned short optimization to compute Double Metaphone keys. The text form of the keys is not available from the XP.

To use the XP, simply call it as you would any stored procedure. When querying a table of keys, make sure all four possible key comparisons are performed. The weaker comparisons may of course be omitted if your particular application needs to exclude less relevant results. An example block of SQL to query a table called Words follows:
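A sketch of such a query appears below; the Words table and its word, key1, and key2 columns come from the article, while the search word 'nelson' and the master..xp_metaphone calling convention are assumptions:

```sql
-- Compute the keys for the search word, then match against all four
-- key combinations in one pass.
DECLARE @primaryKey smallint, @alternateKey smallint
EXEC master..xp_metaphone 'nelson', @primaryKey OUTPUT, @alternateKey OUTPUT

SELECT word
FROM   Words
WHERE  key1 IN (@primaryKey, @alternateKey)   -- primary-primary, primary-alternate
   OR  key2 IN (@primaryKey, @alternateKey)   -- alternate-primary, alternate-alternate
```

If @alternateKey is NULL, the IN comparisons against it simply never match, so no special-case logic is needed.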

Executing the above SQL against a table populated with words from the 21,000-name sample data returns all words phonetically similar to the search word.

Note that the Double Metaphone keys for the words in the Words table have been pre-computed and stored in the key1 and key2 columns. This is critical for tolerable database performance: computing the keys for each word during the search would defeat any ability of the database to optimize the query, forcing a full scan on every search.

Implementation tips

This section contains some observations regarding implementation of phonetic matching, specifically in SQL Server. While SQL Server is targeted specifically, most of these tips are equally applicable to any modern RDBMS.

Make intelligent index decisions

As with any other database search, index decisions have a significant impact on the performance of phonetic searches. Since phonetic searches deal only with the phonetic key columns, special attention should be paid to optimizing the query performance of these columns. Of course, if the phonetic key columns share a table with other data upon which searches are required, the relative performance gain of optimizing one query over the other must be weighed. As with many database optimization decisions, it is often useful to try a few different configurations, to determine which yields the best performance.

For the sample database queried above, which consists of a single table of words, I have created a clustered index on the key1, key2 column set, which causes the data within the table to be physically stored in order of key1, then key2. As a result, when SQL Server is searching for matching data using the primary Double Metaphone key (by far the most common match), all data being searched are located physically together on disk, meaning faster read times and better read cache utilization. This optimization makes little difference with a 21,000 record database, but were the table to scale to 2,100,000 or 21,000,000 records, clustering by the phonetic key would dramatically improve phonetic match speed.
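In T-SQL, that clustered index might be created as follows (the index name is illustrative):

```sql
-- Physically order the Words table by its phonetic keys.
CREATE CLUSTERED INDEX IX_Words_Key1Key2 ON Words (key1, key2)
```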

Unfortunately, only one clustered index per table is possible, for obvious reasons. Therefore, in many cases, a clustered index will already exist on some equally performance-critical column. In this case, a non-clustered index will often be better than none at all. Another alternative, with additional benefits, is described in the next tip.

Leave existing application tables unchanged

When building a new application, designing Double Metaphone into the database schema is relatively easy. However, when adding phonetic matching to an existing application, modifying the schema may be undesirable or forbidden. Further, even in a new application, mixing Metaphone keys and other phonetic matching artifacts in with the rest of the schema risks tying the schema to a technique that may not remain suitable. Phonetic matching research continues, and there is little reason to believe improved algorithms will not be released in the short term; it therefore behooves anyone considering phonetic matching to de-couple the specific algorithm selection from the immutable parts of the application as much as possible.

To this end, it makes a great deal of sense to create separate tables to contain phonetic key data, related to the original table by the latter's primary key. For example, if a Customers table contains a list of customers, one might create a CustomerPhoneticKeys table which contains the phonetic keys for each customer's last name, as well as the customer ID, which relates back to the Customers table. This also solves the problem from the previous tip of requiring a clustered index on the phonetic key columns: a clustered index might be placed on phonetic key columns of the CustomerPhoneticKeys table, thereby yielding the performance gains of a clustered index without any modification to the underlying table.
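Such a table might be declared as follows; the Customers and CustomerPhoneticKeys names come from the text, while the column names and types are assumed for illustration:

```sql
-- Phonetic keys kept apart from the application's own Customers table,
-- related back to it by the customer ID.
CREATE TABLE CustomerPhoneticKeys (
    CustomerID   int      NOT NULL
        REFERENCES Customers (CustomerID),
    LastNameKey1 smallint NOT NULL,   -- primary Double Metaphone key
    LastNameKey2 smallint NULL        -- alternate key, if one was computed
)

-- The clustered index lives here, leaving Customers untouched.
CREATE CLUSTERED INDEX IX_CustomerPhoneticKeys_Keys
    ON CustomerPhoneticKeys (LastNameKey1, LastNameKey2)
```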

When performing searches including phonetic criteria, simply JOIN with the phonetic keys table and add the phonetic match criteria to the WHERE clause.
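Such a search might be sketched as follows, assuming @primaryKey and @alternateKey have already been computed for the search name via the XP, and using the hypothetical CustomerPhoneticKeys column names from above:

```sql
-- Phonetic criteria applied through a JOIN with the keys table.
SELECT c.*
FROM   Customers c
       INNER JOIN CustomerPhoneticKeys k
           ON k.CustomerID = c.CustomerID
WHERE  k.LastNameKey1 IN (@primaryKey, @alternateKey)
   OR  k.LastNameKey2 IN (@primaryKey, @alternateKey)
```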

Use triggers to keep phonetic keys updated

Regardless of where the phonetic key data are stored, it is critical that they remain current. If phonetic matching is being added to an existing application, this can present a significant problem, since the existing application layers will not compute updated phonetic keys when submitting updates to the database. Even when building new applications with phonetic functionality, depending upon the constraints, computation of updated phonetic data at the application level may be undesirable or infeasible.

Fortunately, SQL Server allows the programmer to write "hooks" which are called when table events (insert, update, and delete) take place. By writing a trigger which calls the Double Metaphone XP and updates phonetic key data automatically, one can be assured of the freshest phonetic keys without additional programming. For example, to update the phonetic keys for my Words test table, I might create the following trigger (thanks to Diego Mijelshon for pointing out the naive error in my trigger code in a previous revision of this article):
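A sketch of such a trigger follows, using the cursor-based approach the author describes adopting after Diego Mijelshon's correction; the trigger name, varchar length, and master..xp_metaphone calling convention are assumptions:

```sql
CREATE TRIGGER trg_Words_UpdateKeys ON Words
FOR INSERT, UPDATE
AS
IF UPDATE(word)
BEGIN
    DECLARE @word varchar(255)
    DECLARE @primaryKey smallint, @alternateKey smallint

    -- Cursor over ALL affected rows; the inserted pseudo-table may
    -- contain more than one record.
    DECLARE curWords CURSOR LOCAL FAST_FORWARD FOR
        SELECT word FROM inserted

    OPEN curWords
    FETCH NEXT FROM curWords INTO @word
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Compute fresh keys for this word
        EXEC master..xp_metaphone @word, @primaryKey OUTPUT, @alternateKey OUTPUT

        -- Re-key the row; this does not touch the word column, so the
        -- trigger cannot recurse indefinitely
        UPDATE Words
        SET    key1 = @primaryKey, key2 = @alternateKey
        WHERE  word = @word

        FETCH NEXT FROM curWords INTO @word
    END

    CLOSE curWords
    DEALLOCATE curWords
END
```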

If any records are inserted into the Words table, or updates are applied to it, this trigger is executed. If the word column is affected, the phonetic keys are re-computed and the Words table is updated again with new keys. Since this second update does not change the word column, the trigger does not recurse infinitely, though it will be called once more due to the update. Note that the SQL Server "recursive triggers" database option must be enabled for this trigger to work correctly.

Obviously, if the previous tip is implemented, and the phonetic keys are located in a separate table, the trigger would not update the table it is monitoring, but rather the table containing the phonetic keys which are to be kept current. In this situation, recursive triggers need not be enabled.

Now that this trigger is in place, one can insert and change words in the Words table at will, assured that the phonetic keys will always be up to date.

Compute match score in SELECT statement

In the Word Lookup sample application in Part I, a simple C function computed the match score for a word in the search results, based on which phonetic key from the search word matched which phonetic key from the result word. When performing phonetic matching with SQL, this computation can be performed inline; if needed, the results can be constrained by the resulting match score values. Consider this SQL:
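A sketch of such a query follows; the score values mirror the strong/medium/minimal match levels described earlier, and the search word and scoring scale are illustrative:

```sql
DECLARE @primaryKey smallint, @alternateKey smallint
EXEC master..xp_metaphone 'nelson', @primaryKey OUTPUT, @alternateKey OUTPUT

SELECT word,
       CASE
           WHEN key1 = @primaryKey   THEN 3  -- primary-primary: strong
           WHEN key1 = @alternateKey THEN 2  -- primary-alternate: medium
           WHEN key2 = @primaryKey   THEN 2  -- alternate-primary: medium
           WHEN key2 = @alternateKey THEN 1  -- alternate-alternate: minimal
           ELSE 0
       END AS MatchScore
FROM   Words
WHERE  key1 IN (@primaryKey, @alternateKey)
   OR  key2 IN (@primaryKey, @alternateKey)
ORDER BY MatchScore DESC
```

Repeating the same CASE expression in the WHERE clause (e.g. requiring it to be >= 2) would constrain the results to a minimum match strength.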

The key here is the use of the CASE statement to evaluate a column to a different value depending upon which of a set of conditions is true. By using the same expression in the WHERE clause, results can be limited to a minimum match score just as readily.

Wrap search logic in stored procedures

When building a database application, it is often tempting to build SQL queries dynamically from within the application code, to perform the required insert, update, delete, and select operations. Unfortunately, this has a few problems:

First, it is very inefficient. Building dynamic strings in most languages is expensive, requiring several memory allocations and copies. Once the dynamic strings are built and passed to the database, the queries have to be parsed and validated, then a query plan computed. All of this takes a significant amount of time.

Second, it is difficult to modify. Any change in the database requires modification of application layer code. This is not de-coupling.

Third, it is potentially insecure. If care is not taken, attackers can insert database commands into search strings, which are then executed. Depending upon the database, these commands could lead to data loss or compromised security.

Fourth, it is ugly and bothersome to code.

A stored procedure, on the other hand, has none of these problems (except perhaps the fourth, depending upon one's disposition).

While these arguments are valid in the general case, a more compelling argument can be made for applications requiring phonetic matching functionality: the phonetic matching logic is contained entirely in the database, which arguably is where it belongs.
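A wrapper procedure might be sketched as follows; the procedure name, parameter length, and result shape are assumptions, with the Words table and key columns as described earlier:

```sql
-- Hypothetical search wrapper: the caller supplies only the search word,
-- and all phonetic matching details stay inside the database.
CREATE PROCEDURE sp_SearchWords
    @searchWord varchar(255)
AS
BEGIN
    DECLARE @primaryKey smallint, @alternateKey smallint
    EXEC master..xp_metaphone @searchWord, @primaryKey OUTPUT, @alternateKey OUTPUT

    SELECT word
    FROM   Words
    WHERE  key1 IN (@primaryKey, @alternateKey)
       OR  key2 IN (@primaryKey, @alternateKey)
END
```

An application would then simply execute, for example, exec sp_SearchWords 'nelson'.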

With the search logic wrapped in a stored procedure, the application needs only the procedure name and the search criteria to perform a phonetic search: no reference to Double Metaphone, phonetic keys, or any other details is required. The mechanics of phonetic matching have been abstracted into the data store itself, which in many cases is the appropriate design.

While it was stated above, it should be re-iterated: by wrapping phonetic searches in stored procedures, and automatically updating phonetic keys with either stored procedures or triggers, no phonetic matching support is required anywhere else in the application. This is particularly appealing if one does not control the application code, or if the application code will be written in multiple versions to service different needs or platforms.

Conclusion

This article has introduced the Double Metaphone XP and several tips for building phonetic matching functionality, with SQL Server in particular and relational databases in general. The reader should now be well-equipped to design and build phonetic matching systems based on his or her particular requirements.

Continue on to Part V for a look at the .NET implementation of Double Metaphone, and finally Part VI for a survey of alternative phonetic matching techniques, and pointers to additional resources and implementations.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

My name is Adam Nelson. I've been a professional programmer since 1996, working on everything from database development, early first-generation web applications, modern n-tier distributed apps, high-performance wireless security tools, to my last job as a Senior Consultant at BearingPoint posted in Baghdad, Iraq training Iraqi developers in the wonders of C# and ASP.NET. I am currently an Engineering Director at Dell.

I have a wide range of skills and interests, including cryptography, image processing, computational linguistics, military history, 3D graphics, database optimization, and mathematics, to name a few.

Comments and Discussions

Before I start, I have to say you did a great job on this article series!

Nonetheless, your trigger code is not 100% right. It assumes there will be only one record updated, when there could be more.
Unfortunately, SQL Server (up to version 2000, at least) doesn't have a FOR EACH ROW kind of trigger, like Oracle and others do.

One alternative is checking whether SELECT COUNT(*) FROM inserted > 1 and, if that's the case, using a cursor to update each row.

The other is creating two helper FUNCTIONs, and then executing something like

You are, of course, right about the trigger. How I missed such an egregious error I do not know. Good catch and thanks for the heads up.

I've updated this article to reflect a somewhat more cogent trigger implementation. I've decided to just use a cursor regardless of the row count in inserted, for simplicity, even though I realize cursor performance lies somewhere between "three-legged dog" and "abysmal". Perhaps I will escape through that (in)famous authorial trapdoor: "[performance enhancements] left as an exercise to the reader."