I am particularly interested in address query parsing, including matching and weighting parts of the query, dealing with misspellings and variations, as well as in details about the physical data storage (e.g. schemas for direct relational database queries, approaches to data indexing etc.).

I have studied some documents about ArcGIS 10 geocoding, but they only touch lightly on actual implementation details. Detailed documentation of other high-quality production implementations could also be helpful. The more technical the better. Theoretical algorithm papers are also great.

5 Answers

Daniel W. Goldberg, John P. Wilson, and Craig A. Knoblock

Abstract: This article presents a survey of the state of the art in geocoding practices through a cross-disciplinary historical review of existing literature. We explore the evolving concept of geocoding and the fundamental components of the process. Frequently encountered sources of error and uncertainty are discussed, as well as existing measures used to quantify them. An examination of common pitfalls and persistent challenges in the geocoding process is presented, and the traditional methods for overcoming them are described.

The paper Mapperz linked to is very good and has a lot of citations that will probably be of interest, but I don't think it does a very good job of describing string matching and its importance to the geocoding process. It briefly mentions Soundex, but Soundex isn't the only option, and in my opinion not even the best one for addresses. That said, it lists quite a few citations pertinent to the topic, so those papers will be of interest to you.
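To make the Soundex limitation concrete, here is a minimal sketch of the classic American Soundex code (first letter plus three digits; this is my own illustrative implementation, not anything from the linked paper). Notice that it tolerates a typo like "MIAN" for "MAIN", but it also assigns the identical code to a completely unrelated street name like "MOON":

```python
def soundex(word: str) -> str:
    """Classic American Soundex: keep the first letter, encode the rest
    as digits, collapse runs, and pad/truncate to four characters."""
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit

    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""

    first = word[0]
    digits = []
    prev = codes.get(first, "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate duplicate codes
        code = codes.get(ch, "")  # vowels map to "" and reset the run
        if code and code != prev:
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]

print(soundex("MAIN"), soundex("MIAN"), soundex("MOON"))  # all "M500"
```

The typo and the false positive get exactly the same score, which is why a graded measure like edit distance tends to work better for address tokens.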

This thread on the Stats Exchange site talks about fuzzy matching between two sets of strings, and the same techniques apply when matching addresses. In particular, I think using edit distances makes more sense than Soundex, especially for address details that have no Soundex analog. Calculating the Levenshtein distance between two strings is not all that complicated, and there are plenty of examples floating around the internet (here is one in Python).
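For reference, here is a sketch of the standard two-row dynamic-programming Levenshtein computation (my own version, not the linked example). Unlike Soundex, it returns a graded score, so "MAIN ST" vs. the typo "MIAN ST" comes out close (distance 2) rather than simply equal or unequal:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # ensure b is the shorter string (smaller rows)

    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("MAIN ST", "MIAN ST"))  # 2
```

A transposed-letter typo like this costs 2 here; the Damerau-Levenshtein variant, which counts a transposition as a single edit, is often a better fit for hand-typed addresses.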

I have just spent the past hour trying to find out how ESRI implements its spelling sensitivity and its various candidate and match scores. I have found nothing but high-level descriptions (the best of which were in this PDF and the 9.3 online help section). So if anyone can point me to more detailed documentation, I, like the OP, would appreciate it.