In order to efficiently search using levenshtein distance, you need an efficient, specialised index, such as a bk-tree. Unfortunately, no database system I know of, including MySQL, implements bk-tree indexes. This is further complicated if you're looking for full-text search, instead of just a single term per row. Off-hand, I can't think of any way that you could do full-text indexing in a manner that allows for searching based on levenshtein distance.

unfortunately this result in it being 10% slower. I have however implemented the string length, he proposes using string at max or smaller, I have implemented a compare on only string +/- 1 length.
–
Andrew ClarkMar 13 '09 at 14:42

Sorry for the noob question but when I copy this to a text file leven, and then run \. leven, I get multiple errors from MySQL 5: ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server... near '' at line 4.
–
maxNov 1 '12 at 7:15

I am setting up a search based on Levenshtein or Damerau-Levenshtein (probably the latter) for multiple searches over an indexed text, based on a paper by by Gonzalo Navarro and Ricardo Baeza-yates: link text

After building a suffix array (see wikipedia), if you are interested in a string with at most k mismatches to the search string, break the search string into k+1 pieces; at least one of those must be intact. Find the substrings by binary search over the suffix array, then apply the distance function to the patch around each matched piece.

I had a specialized case of k-distance searching and after installing the Damerau-Levenshtein UDF in MySQL found that the query was taking too long. I came up with the following solution:

I have a very restrictive search space (9 character string limited to numeric values).

Create a new table (or append columns to your target table) with columns for each character position in your target field. ie. My VARCHAR(9) ended up as 9 TINYINT columns + 1 Id column that matches my main table (add indexes for each column). I added triggers to ensure that these new columns always get updated when my main table gets updated.

where s is your search string and m is the required number of matching characters (or m = 9 - d in my case where d is the maximum distance I want returned).

After testing I found that a query over 1 million rows that was taking 4.6 seconds on average was returning matching ids in less than a second. A second query to return the data for the matching rows in my main table similarly took under a second. (Combining these two queries as a subquery or join resulted in significantly longer execution times and I'm not sure why.)

Though this is not Damerau-Levenshtein (doesn't account for substitution) it suffices for my purposes.

Though this solution probably doesn't scale well for a larger (length) search space it worked for this restrictive case very well.

The function given for levenshtein <= 1 above is not right -- it gives incorrect results for e.g., "bed" and "bid".

I modified the "MySQL Levenshtein distance query" given above, in the first answer, to accept a "limit" that will speed it up a little. Basically, if you only care about Levenshtein <= 1, set the limit to "2" and the function will return the exact levenshtein distance if it is 0 or 1; or a 2 if the exact levenshtein distance is 2 or greater.

This mod makes it 15% to 50% faster -- the longer your search word, the bigger the advantage (because the algorithm can bail earlier.) For instance, on a search against 200,000 words to find all matches within distance 1 of the word "giggle," the original takes 3 min 47 sec on my laptop, versus 1:39 for the "limit" version. Of course, these are both too slow for any real-time use.