Introduction

You've probably used the Unix diff program, or one of its Win32 descendants, or some program that depends directly on it, such as patch, or the CVS source code maintenance system. All of these require calculating the smallest set of differences between two sequences (or equivalently, figuring out the longest subsequences that are the same). It's an intuitively simple problem, but surprisingly difficult to solve and implement correctly. That's what this module does, and it does it well.

Genealogy

The algorithm implemented in the module was originally published in a May 1977 Communications of the ACM article entitled "A Fast Algorithm for Computing Longest Common Subsequences" by J. W. Hunt and T. G. Szymanski. The Unix diff program was originally written by D. McIlroy and J. W. Hunt, who described their work in Bell Laboratories Technical Report 41 in 1976. This module was originally written by Mark-Jason Dominus, but the existing version is by Ned Konz. It's available on CPAN (alternate).

Details

The difficulty of the algorithm precludes sitting down and just "grinding it out"; when you need it, this module's a life-saver. It provides three separate ways to get at the algorithm -- in other words, three interfaces to the core functionality. The documentation mentions these in order of increasing generality, but they're also slightly different in what they do. They are:

LCS

This routine takes two array references, and returns the longest common subsequence. In other words, the returned result is an array (or a reference to one) containing the longest run of elements which are not different.

diff

The output of this routine is the smallest set of changes that are required to bring the two input arrays into agreement. It's similar to diff output. The result is a multidimensional array; each element is a so-called "hunk" -- that is, one logical group of differences. Within each hunk are one or more arrays of three elements each: a + or - sign, the index of the item to be inserted or deleted, and the data itself.

This is a lot easier to show than to describe. This example appears in the POD, although I've reformatted it very slightly. Suppose the two input arrays contain:

This function works in a callback style: it examines each input array an element at a time, and calls a supplied routine depending on whether the element is an element of the LCS or not. Up to five callbacks may be defined; referring to the input sequences as A and B, they are:

MATCH

Called when the element under consideration is a member of the LCS;

DISCARD_A and DISCARD_B

Called when the item in the respective array has to be discarded to bring the lists into agreement; and

A_FINISHED and B_FINISHED

This is really a special case of the above. When one array &lq;runs out&rq; of elements, this routine is called if it is available (otherwise the DISCARD callback for the other is invoked).

When would you use each one? You might consider using LCS if all you need is the longest match between the arrays. You could use diff to print diff-like output. For example, here's an extremely simple diff program (more for illustration than usefulness):

Finally, traverse_sequences is very handy when you already know what you need to do in response to a difference (say, delete an item).

All of these routines accept a "key generation function" as an optional final argument. This is needed because internally the routine uses numeric comparison (the <=> operator, in principle). But what if the values you have are not numeric? Or if you have a more complicated data structure, such as an array of hashes (such as object instances)? The key generation function should, given a reference to an object in your array, return a numeric value that can be used to meaningfully compare it to another element. That is, it should return some sort of hash or key that represents the element. If the elements are logically the same, it should return the same value (note that the ref operator would almost never be a good choice.)

Drawbacks

To me, the key generation functions have always seemed unnatural. I'm more used to being able to supply the actual comparison function, in the style of sort or grep. But that may cause a significant performance hit -- I don't know.

Another seeming inflexibility is that the input arrays have to be in memory. It would be nice to be able to process files without having to slurp them, by providing routines which returned the next element for the given array. (But my guess is that in the worst case -- having no elements in common -- that would still end up reading both files completely into memory.)

The above two are minor nit-picks; the only real problem I have with this module is that the documentation is hard to understand. The first time I downloaded it from CPAN, it took a good deal of trial-and-error to figure out what it was really doing. But once I got the hang of it, it's been extremely useful. Hopefully this node will help in that direction.

Summary

Algorithm::Diff is fast and relatively flexible. It implements a difficult algorithm in a nice package. Take the time to read and understand the documentation, and it will serve you well.

Update 2002-Aug-21

Brother tye has written an instructive node discussing the interfaces for returning data in Algorithm::Diff icult to use. There he also shows code for a more simplified interface that may be incorporated into the module in the future. If you're trying to figure out how to use the module, that code may be of great help.

When putting a smiley right before a closing parenthesis, do you:

Use two parentheses: (Like this: :) )
Use one parenthesis: (Like this: :)
Reverse direction of the smiley: (Like this: (: )
Use angle/square brackets instead of parentheses
Use C-style commenting to set the smiley off from the closing parenthesis
Make the smiley a dunce: (:>
I disapprove of emoticons
Other