Distance Between Strings

It's a nice feature of the latest word processor programs that they are capable of suggesting
a replacement for a mistyped word. Spelling checkers know how to evaluated distance between a misspelled word and the words in its files. Words whose evaluated distance is the smallest are offered as candidates for replacement. The applet below helps you acquaint yourself with two possible distances between strings. These metric functions attempt to ascribe a numeric (actually, integer) value to the degree of dissimilarity between two strings. That the functions do indeed satisfy the metric axioms can be shown by induction. (Which is a good exercise too.)

Hamming Distance

The Hamming distance H is defined only for strings of the same length. For two strings s and t, H(s, t) is the number of places in which the two string differ, i.e., have different characters.

Levenshtein Distance

The Levenshtein (or edit) distance is more sophisticated. It's defined for strings of arbitrary length. It counts the differences between two strings, where we would count a difference not only when strings have different characters but also when one has a character whereas the other does not. The formal definition follows.

For a string s, let s(i) stand for its ith character. For two characters a and b, define

r(a, b) = 0 if a = b. Let r(a, b) = 1, otherwise.

Assume we are given two strings s and t of length n and m, respectively. We are going to fill an (n+1)×(m+1) array d with integers such that the low right corner element d(n+1, m+1) will furnish the required values of the Levenshtein distance L(s, t).

(Type your strings in the two edit controls at the bottom of the applet. Click "Do It!".)

You can experiment with the applet to verify the triangle inequality before trying to prove it. You may try to discover other features of functions H and L. If strings s and t have the same length then both H(s, t) and L(s, t) are defined. Must they be equal?

Both Hamming and Levenshtein functions are metrics. The only portion of the definitions that may appear non-trivial is the triangle inequality. We show it by induction.

What do functions H and L stand for? H(s, t), by definition, is the number of places at which strings s and t differ. In other words, it takes H(s, t) changes to make string t out of string s. The triangle inequality

H(s, t)≤ H(s, r) + H(r, t)

just asserts that changing s to t "via" r could not be worse than changing s into t directly.

L(s, t) has a similar interpretation, although the allowed "changes" are more sophisticated, e.g. one is allowed to add or remove a character. These are known as edit operations. L(s, t) is the minimum number of edit operations that transform s into t. (As a matter of fact, this assertion requires a proof of its own, see below.) However, to edit s into t one may first edit s into r and then r into t. That such a change from s to t is not necessarily optimal is exactly the subject of the triangle inequality:

L(s, t)≤ L(s, r) + L(r, t).

Now, why is L(s, t) is the minimum number of edit operations that transform s into t. Let's look more carefully at the definition:

(1)

d(i, j) = min(d(i-1, j)+1, d(i, j-1)+1, d(i-1, j-1) + r(s(i), t(j)))

To estimate the effort needed to transform s(0, i) - the first i characters of s - into t(0, j) - the first j characters of t - consider the three terms in (1), which correspond to the following situations:

Use d(i, j-1) operations to edit s(0, i) into t(0, j-1) and then append t(j).

Use d(i-1, j-1) operations to edit s(0, i-1) into t(0, j-1) and then replace s(i) with t(j), if necessary.

If the three terms on the right in (1) are optimal, then the term on the left is also optimal, for, on the last step of editing, one can't do better than making a single change or not making any change at all.