Turn a Parsnip into a Turnip with Edit Distance Algorithms : Page 2

Edit distance algorithms tell you how different two strings are from each other. That lets you see the differences between different versions of a string or file, or add differencing tools to your applications.

by Rod Stephens

Dec 10, 2009

Page 2 of 2

Programming Edit Distance

For any two strings, you could build a program that constructs the edit graph and then performs a shortest path calculation on it to find the best solution and the edit distance. For information on finding shortest paths in general, see my article Network Know-How: Finding Shortest Paths. The edit graph, however, has some special structure that we can use to our advantage.

The basic idea is to calculate the shortest paths between the start node in the upper left corner to every node in the graph. Because all of the links lead to the right and down (or both in the case of diagonals), we can calculate the distances starting in the upper left corner and sweeping to the right and down.

Start by labeling the upper left corner with the distance 0, because the path from that node to itself is zero.

Next label the nodes in the leftmost column with the values 1, 2, 3, and so forth. The paths to these nodes correspond to adding successive letters from the final string. In Figure 1, the “t” node has cost 1, the “u” node has cost 2, and so forth.

Similarly label the nodes across the top row of the graph with the values 1, 2, 3, and so forth. These paths correspond to removing letters from the initial string. Now you can begin sweeping through the graph to fill in the other values.

Consider each node in the second column from top to bottom. There are three possible paths that you could take to get to this node.

First, you could get to the node from the node above. That move corresponds to adding a new letter to the string so the cost to get to this node is 1 greater than the cost needed to get to the node above.

Second, you could get to the node from the node to the left. That move corresponds to removing a letter from the original string so the cost to get to this node is 1 greater than the cost needed to get to the node to the left.

Third, you could get to the node along a diagonal link if one ends at this node. Diagonal links have zero cost so the cost is the same as the cost to get to the node to the upper left.

To find the best cost to get to this node, the algorithm considers all three of these possible paths and then picks the best one. It records the node’s new distance and, since you’ll probably want to know what the best path is, it records from which node the path came.

The StringEditDistance program shown in Figure 2 uses this algorithm. Notice that the result follows the blue path in Figure 1: it deletes “pa,” adds “tu,” and deletes “s.”

Listing 1 shows the MakeEditGraph function that program StringEditDistance uses to build the edit graph and calculate node distances. It builds the graph’s array and initializes the leftmost column and top row. It then moves through the array from left-to-right filling in other node distances.

Listing 2 shows the DisplayResults routine that the program uses to display the results. It follows the best path through the edit graph backwards from the finish node to the start node. At each node it visits, it checks the type of change that was made to get to this node (deletion, insertion, or leaving a letter alone) and it adds an appropriate letter to the beginning of the output. As you can see in Figure 2, the program draws deleted letters in crossed-out red, inserted letters in underlined blue, and letters that are the same in both strings in black.

Comparing Files Line-by-Line

Example program FileStringEditDistance shown in Figure 3 uses the same algorithm to compare two files character-by-character. (If you look very closely at Figure 3, you’ll see that the program removed the first “i” in the line that declares the variable distance1. See if you can figure out why. I’ll give you the answer near the end of this article.)

Figure 3. File Guile: Program FileStringEditDistance finds the difference between two files treated as strings.

Program FileStringEditDistance does a reasonable job of comparing files. By looking at Figure 3, you can easily tell which characters were added or deleted.

Unfortunately program FileStringEditDistance has a big drawback: it can potentially use a lot of memory. The files used in Figure 3 were relatively small but they hold more than 2,500 characters so the edit graph array contains more than 6.5 million entries. A typical chapter in one of my books could easily contain more than 50,000 characters. The edit graph for two versions of such a file would contain more than 2.5 billion entries! Not only would that use up a huge chunk of memory, but it would take the program quite a while to build the graph.

Fortunately it’s easy to modify the algorithm to compare the files one line at a time. Simply replace the two strings with two arrays of strings. Instead of comparing characters within two strings, you now compare strings within two arrays of strings.
The FileLineEditDistance example program shown in Figure 4 uses this approach to compare files line-by-line.

Download the examples and take a look at their text to see the differences. In fact, you can use the FileLineEditDistance program to compare the files as shown in Figure 4.

Different Differences

Download the examples and give them a try. Then if you’re feeling adventurous, make some modifications. There are several variations that you might like to try.

For example, these examples find a single shortest path through the edit graph but not necessarily the best. For example, suppose you wanted to find the difference between the strings BBBBC and BC. The StringEditDistance program drops the middle Bs to give BBBBC but you might prefer that it drop the first Bs so it can keep the two remaining characters together as in BBBBC.

A variation of the FileLineEditDistance program might compare lines and, if the lines are different, then compare the text within the lines. The result would be almost the same as the result given by the FileStringEditDistance program but without the enormous edit graph.

Another popular variation on the FileLineEditDistance program displays the two files side-by-side with corresponding lines highlighted. (That’s what Visual Source Safe’s different tool does.)

If you keep all of your files in a source control system or you like Microsoft Word’s revision tracking system, then you may never need these techniques. But for other stray files, these methods can be useful. You can even add them to your applications so users can easily find the changes in different versions of scripts, memos, and other text they have saved.

Rod Stephens is a consultant and author who has written more than a dozen books and two hundred magazine articles, mostly about Visual Basic. During his career he has worked on an eclectic assortment of applications for repair dispatch, fuel tax tracking, professional football training, wastewater treatment, geographic mapping, and ticket sales. His VB Helper web site receives more than 7 million hits per month and provides three newsletters and thousands of tips, tricks, and examples for Visual Basic programmers.