Description

The Levenshtein distance is defined as the minimal number of
characters you have to replace, insert or delete to transform
str1 into str2.
The complexity of the algorithm is O(m*n), where n and m are the
lengths of str1 and str2 (rather good when compared to
similar_text(), which is O(max(n,m)**3), but still expensive).

In its simplest form the function will take only the two
strings as parameters and will calculate just the number of
insert, replace and delete operations needed to transform
str1 into str2.

A second variant will take three additional parameters that
define the cost of insert, replace and delete operations. This
is more general and adaptive than variant one, but not as
efficient.
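
As a small illustration of the two variants described above (the sample strings and the cost values are only illustrative choices, not part of the manual text):

```php
<?php
// Variant one: every insertion, replacement and deletion counts as 1.
$plain = levenshtein('kitten', 'sitting');

// Variant two: explicit costs, passed in the order cost_ins, cost_rep, cost_del.
// Replacement is priced at 2 here (an illustrative choice), the same as a
// deletion followed by an insertion.
$weighted = levenshtein('kitten', 'sitting', 1, 2, 1);

echo $plain, "\n";
echo $weighted, "\n";
```

With unit costs, the cheapest path uses two replacements and one insertion. Once a replacement costs as much as a delete plus an insert, the total rises accordingly.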

Parameters

str1

One of the strings being evaluated for Levenshtein distance.

str2

One of the strings being evaluated for Levenshtein distance.

cost_ins

Defines the cost of insertion.

cost_rep

Defines the cost of replacement.

cost_del

Defines the cost of deletion.

Return Values

This function returns the Levenshtein distance between the
two argument strings, or -1 if one of the argument strings
is longer than the limit of 255 characters.


    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));

    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);
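
The fragments above are incomplete, so here is a hedged, self-contained sketch of how the two functions might fit together. The signature of utf8_to_extended_ascii() (with the map passed by reference), the preg_match_all() pattern and the final levenshtein() call are assumptions filled in for illustration. Only the map update, the strtr() call and the body of levenshtein_utf8() come from the fragments themselves.

```php
<?php
// Assumed signature: $map is shared between calls, hence passed by reference.
function utf8_to_extended_ascii($str, &$map)
{
    // Assumed pattern: a UTF-8 lead byte followed by its continuation bytes.
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches)) {
        return $str; // plain ASCII string, nothing to remap
    }

    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc) {
        if (!isset($map[$mbc])) {
            $map[$mbc] = chr(128 + count($map));
        }
    }

    // finally remap non-ascii characters to single bytes
    return strtr($str, $map);
}

function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);

    // Assumed final step: compare the single-byte encodings.
    return levenshtein($s1, $s2);
}

// "\xC3\xB4" is the UTF-8 encoding of "o" with a circumflex.
echo levenshtein_utf8('notre', "n\xC3\xB4tre"), "\n";
```

Because each distinct multi-byte character is mapped to one byte from 128 upwards, this approach can only distinguish up to 128 different non-ASCII characters per comparison.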

Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.

I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds.
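
The poster's code is not reproduced here. The following is only a minimal sketch of the general one-dimensional-array technique, assuming unit costs and byte-wise comparison. The function name levenshtein_1d() is hypothetical.

```php
<?php
// Levenshtein distance using a single row that is overwritten in place.
function levenshtein_1d($a, $b)
{
    $m = strlen($a);
    $n = strlen($b);
    if ($m === 0) {
        return $n;
    }
    if ($n === 0) {
        return $m;
    }

    // $row[$j] = distance between the first $i bytes of $a and the first $j bytes of $b
    $row = range(0, $n);

    for ($i = 1; $i <= $m; $i++) {
        $diag = $row[0];   // value for ($i - 1, 0), the upper-left neighbour
        $row[0] = $i;
        for ($j = 1; $j <= $n; $j++) {
            $above = $row[$j];   // value for ($i - 1, $j), about to be overwritten
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $row[$j] = min(
                $above + 1,        // deletion
                $row[$j - 1] + 1,  // insertion
                $diag + $cost      // match or replacement
            );
            $diag = $above;
        }
    }

    return $row[$n];
}

echo levenshtein_1d('kitten', 'sitting'), "\n";
```

Memory use is proportional to the length of the second string, and no string length limit applies. The running time is still O(m*n).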

At the time of this manual note the user-defined variant of levenshtein() is not implemented yet. I wanted something like that, so I wrote my own function. Note that this doesn't return the levenshtein() difference, but instead an array of operations to transform one string into another.

Please note that the difference-finding part (resync) may be extremely slow on long strings.

I really like [the manual's] example for the use of the levenshtein function to match against an array. I ran into the need to specify the sensitivity of the result. There are circumstances when you want it to return false if the match is way out of line. I wouldn't want "marry had a little lamb" to match with "saw viii" simply because it was the best match in the array. Hence the need for sensitivity:
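
The poster's function is not shown, so the following is only a sketch of the sensitivity idea. It assumes the sensitivity is expressed as a maximum allowed distance. The name closest_match() and the parameter $maxDistance are hypothetical.

```php
<?php
// Return the closest word from $words, or false if even the best
// candidate is further away than $maxDistance edits.
function closest_match($input, array $words, $maxDistance)
{
    $closest = false;
    $shortest = -1;

    foreach ($words as $word) {
        $lev = levenshtein($input, $word);
        if ($shortest < 0 || $lev < $shortest) {
            $closest = $word;
            $shortest = $lev;
        }
    }

    // Reject the best match when it is still "way out of line".
    if ($shortest < 0 || $shortest > $maxDistance) {
        return false;
    }
    return $closest;
}

var_dump(closest_match('carrrot', array('apple', 'carrot', 'banana'), 2));
var_dump(closest_match('mary had a little lamb', array('saw viii'), 3));
```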

Using PHP's example along with Patrick's comparison percentage function, I have come up with a function that returns the closest word from an array, and assigns the percentage to a referenced variable:
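
Neither the poster's function nor Patrick's percentage function is reproduced here. As a rough sketch, assume the percentage is defined as (1 - distance / length of the longer string) * 100. That formula and the name closest_word() are assumptions for illustration only.

```php
<?php
// Return the closest word and write a similarity percentage into $percent.
function closest_word($input, array $words, &$percent = null)
{
    $closest = null;
    $shortest = -1;

    foreach ($words as $word) {
        $lev = levenshtein($input, $word);
        if ($shortest < 0 || $lev < $shortest) {
            $closest = $word;
            $shortest = $lev;
        }
    }

    if ($closest === null) {
        $percent = 0.0;   // nothing to compare against
        return null;
    }

    // Assumed percentage definition, see the note above the code.
    $longest = max(strlen($input), strlen($closest));
    $percent = ($longest > 0) ? (1 - $shortest / $longest) * 100 : 100.0;

    return $closest;
}

$word = closest_word('carrrot', array('apple', 'carrot'), $percent);
printf("%s (%.2f%%)\n", $word, $percent);
```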

Try combining this with metaphone() for a truly amazing fuzzy search function. Play with it a bit, the results can be plain scary (users thinking the computer is almost telepathic) when implemented properly. I wish spell checkers worked as well as the code I've written.

I would release my complete code if reasonable, but it's not, due to copyright issues. I just hope that somebody can learn from this little tip!
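
Since the poster's code is not available, this is only a guess at the simplest form of the combination: comparing the metaphone() keys of two words instead of the words themselves. The name phonetic_distance() is hypothetical.

```php
<?php
// Edit distance between the phonetic keys of two words.
function phonetic_distance($a, $b)
{
    return levenshtein(metaphone($a), metaphone($b));
}

// Words that sound alike tend to get a small phonetic distance
// even when their spellings differ more.
echo phonetic_distance('Thompson', 'Tomson'), "\n";
echo levenshtein('Thompson', 'Tomson'), "\n";
```

A distance of 0 means the two words share the same phonetic key. Ranking candidates by this value first and by plain levenshtein() second is one possible way to combine the two.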

One application of this is looking for a similar match instead of an exact one. You can compute the distance from a word to every entry in a dictionary and sort the results to see which entries are the most similar. Of course, this will be quite a resource-consuming task.
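
A short sketch of that ranking idea, assuming a small in-memory word list. The name rank_by_distance() is hypothetical.

```php
<?php
// Map each dictionary word to its distance from $input, closest first.
function rank_by_distance($input, array $dictionary)
{
    $distances = array();
    foreach ($dictionary as $word) {
        $distances[$word] = levenshtein($input, $word);
    }
    asort($distances);   // sort by distance, keeping the word keys
    return $distances;
}

print_r(rank_by_distance('carrot', array('banana', 'carrots', 'carrot')));
```

Every dictionary entry costs one levenshtein() call, which is why this gets expensive for large dictionaries.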

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest = $word;
        $shortest = $lev;
    }
}

I am using this function to avoid duplicate information on my client's database.

After retrieving a series of rows and assigning the results to an array, I loop over it with foreach, comparing each value with the user-supplied string using levenshtein().

It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".

Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.
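
The workflow described above could look roughly like this. The function name similar_names(), the threshold of 2 edits and the case-insensitive comparison are all assumptions, and the names would normally come from a database query rather than a literal array.

```php
<?php
// Collect the existing names that are within $maxDistance edits of $input.
function similar_names($input, array $existing, $maxDistance = 2)
{
    $similar = array();
    foreach ($existing as $name) {
        // compare case-insensitively (an assumption, not stated in the note)
        if (levenshtein(strtolower($input), strtolower($name)) <= $maxDistance) {
            $similar[] = $name;
        }
    }
    return $similar;
}

$matches = similar_names('Jon Smit', array('John Smith', 'Mary Jones'));
if ($matches) {
    echo "There's a similar client with this name: ", implode(', ', $matches), "\n";
}
```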

    $best_i = 0;
    $best_lcs = 0;
    foreach ($left as $i => $lcs_left) {
        $option = $lcs_left + $right[$i];
        if ($best_lcs < $option) {
            $best_lcs = $option;
            $best_i = $i;
        }
    }
    return self::lcs(self::substr($a, 0, $best_i), self::substr($b, 0, $bl >> 1))
         . self::lcs(self::substr($a, $best_i), self::substr($b, $bl >> 1));
}
?>

This is a classic implementation in which several tricks are used:

1. The strings are exploded into multi-byte characters in O(n lg n) time.
2. Instead of searching for the longest path in a precomputed two-dimensional array, we search for the best point that lies in the middle column. This is achieved by splitting the second string in half and recursively calling the algorithm twice. The only thing we need from the recursive call are the values in the middle column. The trick is to return the last column from each recursive call, which is what we need for the left part, but requires one more trick for the right part: we simply mirror the strings and the array so that the last column is the first column. Then we just find the row which maximizes the sum of lengths in each part.
3. One can prove that the time consumed by the algorithm is proportional to the area of the (imaginary) two-dimensional array, thus it is O(n*m).

I wrote this function to have an "intelligent" comparison between data to be written to a DB and already existing data. It not only calculates distances but also balances the distances for each field.

<?php
/*
This function calculates a balanced percentage distance between an array of strings
"$record" and a compared array "$compared", balanced through an array of
weights "$weights". The three arrays must have the same indices.
For an unbalanced distance, set all weights to 1.
The formula used is:
percentage distance = sum(field_levenshtein_distance * field_weight) / sum(record_field_length * field_weight) * 100
*/
function search_similar($record, $weights, $compared, $precision=2) {
    $field_names = array_keys($record);
    // start both accumulators at 0 so that += never acts on an undefined variable
    $record_weight = 0;
    $weighted_distance = 0;
    # "Weighted length" of $record and "weighted distance".
    foreach ($field_names as $field_key) {
        $record_weight += strlen($record[$field_key]) * $weights[$field_key];
        $weighted_distance += levenshtein($record[$field_key], $compared[$field_key]) * $weights[$field_key];
    }
    # Building the result..
    if ($record_weight) {
        return round(($weighted_distance / $record_weight * 100), $precision);
    } elseif ((strlen(implode("", $record)) == 0) && (strlen(implode("", $compared)) == 0)) { // empty records
        return round(0, $precision);
    } elseif (array_sum($weights) == 0) { // all weights == 0
        return round(0, $precision);
    } else {
        return false;
    }
    /*
    Be very careful to distinguish a 0 result from a false result.
    The function returns 0 ('0.00' if $precision is 2, and so on) if:
    - $record and $compared are equal (even if $record and $compared are empty);
    - all weights are 0 (the meaning could be "no care about any field").
    Conversely, the function returns false if $record is empty, but the weights
    are not all 0 and $compared is not empty. That case would otherwise cause a
    "division by 0" error.
    I wrote this kind of check:

Regarding the post by fgilles on April 26th 2001, I suggest not using the levenshtein() function to test for over-uppercasing unless you've got plenty of time to waste on your host. ;) Anyhow, I think it's a useful feature, as I get really annoyed when reading whole messages in uppercase.

PHP's levenshtein() function can only handle up to 255 characters, which is not realistic for user input (the first paragraph of this post alone has 285 characters). If you choose to use a custom function able to handle more than 255 characters, efficiency is an important issue.

I think a 10% share of uppercase letters is enough for written English (maybe other languages like German, which use more capital letters, need more). With some sentences in uppercase (everybody has the right to shout occasionally), 20% would be enough; so I use a threshold of 30%. When it is exceeded, I lowercase the whole message.

Hope you find it useful and it helps keeping the web free of ill-mannered people.
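
The poster's function is not included above. A minimal sketch of the described rule, assuming the percentage means uppercase letters divided by all ASCII letters, might look like this. The name tame_uppercase() is hypothetical.

```php
<?php
// Lowercase the whole message when too large a share of its letters is uppercase.
function tame_uppercase($message, $threshold = 0.30)
{
    $letters = preg_match_all('/[A-Za-z]/', $message);
    if (!$letters) {
        return $message;   // no letters at all, nothing to judge
    }

    $upper = preg_match_all('/[A-Z]/', $message);

    return ($upper / $letters > $threshold) ? strtolower($message) : $message;
}

echo tame_uppercase('PLEASE HELP ME WITH THIS'), "\n";
echo tame_uppercase('Please help me with this'), "\n";
```

This needs only two passes over the message and no levenshtein() call, so the 255-character limit does not matter here.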

For spell checking applications, the delay could be tolerable if you assume the typist got the first two or three characters of each word right. Then you'd only need to calculate distances for a small segment of the dictionary. This is a compromise, but one I think a lot of spell checkers make.

For an example of site search using this function, look at the PHP manual search button on this page. It appears to be doing this for the PHP function list.
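
The prefix compromise for spell checking could be sketched as follows, assuming an in-memory dictionary and a prefix of two characters. The name suggest() and its parameters are hypothetical.

```php
<?php
// Suggest the closest dictionary word that shares the first
// $prefixLength characters with $word, or null if there is none.
function suggest($word, array $dictionary, $prefixLength = 2)
{
    $prefix = substr($word, 0, $prefixLength);
    $best = null;
    $bestDistance = -1;

    foreach ($dictionary as $candidate) {
        // skip every word outside the small dictionary segment
        if (strncmp($candidate, $prefix, strlen($prefix)) !== 0) {
            continue;
        }
        $distance = levenshtein($word, $candidate);
        if ($bestDistance < 0 || $distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }

    return $best;
}

var_dump(suggest('recieve', array('receive', 'deceive', 'record')));
```

With a sorted dictionary, the matching segment could be located by binary search instead of scanning every entry.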