An O(ND) Difference Algorithm for C#

This article is about comparing text files and the proven, best and most famous
algorythm to identify the differences between them. The source code that you can
find in the download implements a small class with a simple to use API that just
does this job. You should have it in the bag of your algorythms.

Beside the class that implements the algorythm there is also a sample web application
that compares 2 files and generates html output with a combined and colored document.

The algorythm was first published 20 Years ago under "An O(ND) Difference
Algorithm and its Variations" by Eugene Myers Algorithmica Vol. 1 No. 2, 1986, p
251 . You can find a copy if it at
http://xmailserver.org/diff2.pdf. In this article you can find a abstract
recursive definition of the algorythm using some pseudo-code that needs to be transferred
to a existing programming language.

There are many C, Java, Lisp implementations public available of this algorythm
out there on the internet. Before I wrote the C# version I discovered that almost
all these implementations seem to come from the same source (GNU diffutils) that
is only available under the (unfree) GNU public license and therefore cannot be
reused as a source code for a commercial or redistributable application without
being trapped by the GNU license.

There are very old C implementations that use other (worse) heuristic algorithms.
Microsoft also published source code of a diff-tool (windiff) that uses some tree
structures. Also, a direct transfer from a C source to C# is not easy because there
is a lot of pointer arithmetic in the typical C solutions and I wanted a managed
solution. I tried a lot sources but at least found no usable solution written for
the .NET platform.

These are the reasons why I implemented the original published algorithm from the
scratch and made it available without the GNU license limitations under a BSD style license. The history
of this implementation is back to 2002 when I published a Visual Studio add-in that
also can compare files, see
http://www.codeproject.com/KB/macros/WebReports8.aspx. I found no more bugs in
the last 3 years so I think that the code is stable.

I did not need a high performance diff tool. I will do some performance tweaking
when needed, so please let me know. I also dropped some hints in the source code
on that topic.

How it works (briefely)

Comparing the characters of 2 huge text files is not easy to implement and tends
to be slow. Comparing numbers is much easier so the first step is to compute unique
numbers for all textlines. If textlines are identical then identical numbers are
computed.

There are some options before comuting these numbers that normally are usefull
for some kind of text: stripping off space characters and comparing case insensitive.

The core algorithm itself will compare 2 arrays of numbers and the preparation
is done in the private DiffCodes method and by using a Hashtable.

The methods DiffText and DiffInts.

The core of the algorythm is built using 2 methods:LCS: This is the divide-and-conquer implementation of the longes
common-subsequence algorithm.SMS: This method finds the Shortest Middle Snake.

To get some usable performance I did some changes to the original algorithm. The
original algorithm was described using a recursive approach and comparing zero indexed
sequences and passes parts of these sequences as parameters. Extracting sub-arrays
and rejoining them is very CPU and memory intensive. To avoid copying these sequences
as arrays around the used arrays together with the lower and upper bounds are passed
while the sequences are not copied around all the time. This circumstance makes
the LCS and SMS functions more complicate.

I added some code to the LCS function to get a fast response on sub-arrays that
are identical, completely deleted or inserted instead of recursively analysing them
down to the single number case.

The result is stored in 2 arrays that flag for modified (deleted or inserted)
lines in the 2 data arrays. These bits are then analyzed to produce a array of objects
that describe the found differences.

Read the original article if you want to understand more.

The API

To use this functionality you only have to call one of the DiffText methods. They
all get a pair of strings and return an array of items that describe the inserts
and deletes between the 2 strings. There are no "changes" reported. Instead you
can find a "insert" and "deleted" pair.

DiffText(string TextA, string TextB)

Find the difference in 2 texts, comparing by textlines without any conversion.
A array of Items containing the differences is returned.

License

Built-in Self-TEST

The class has now a built-in self-test, that's working without changing the code.
If you compile the diff and debug class files together you get a exe file that's
testing some simple diff-scenarios that where used to discover the bugs in the Version
1 and 2 (OutOfArrayBounds typically).