Introduction

During one of my daily visits to CodeProject, I ran across an excellent article by Aprenot: A Generic - Reusable Diff Algorithm in C#. Aprenot’s method of locating the Longest Common Sequence of two sequential sets of objects works extremely well on small sets. As the sets get larger, the algorithm begins experiencing the constraints of reality.

At the heart of the algorithm is a table that stores the comparison results of every item in the first set to every item in the second set. Although this method always produces a perfect difference solution, it can require an enormous amount of CPU and memory. The author solved these issues by only comparing small portions of the two data sets at a time. However, this solution makes the assumption that the changes between the data sets are very close together, and reports inefficient results when the data sets are large with dispersed changes.

This article will present an algorithm based on the one presented by aprenot. The goals of the algorithm are as follows:

Maintain the original concepts of Generic and Reusable.

Correctly handle large data sets.

Greatly lower the number of comparisons necessary.

Greatly lower the memory requirements.

The Problem Defined

Given a data set of 100,000 items, with the need to compare it to a data set of similar size, you can quickly see the problem with using a table to hold the results. If each element in the result table was 1 bit wide, you would need over a Gigabyte of memory. To fill the table, you would need to execute 10 billion comparisons.
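
The arithmetic behind those figures can be checked with a few lines. This is only an illustrative sketch; the class and method names are hypothetical and are not part of the downloadable source.

```csharp
public static class TableCost
{
    // Number of cells (and therefore comparisons) in a full comparison table.
    public static long Cells(long sourceCount, long destCount)
    {
        return sourceCount * destCount;
    }

    // Bytes needed if every cell is stored as a single bit (rounded up).
    public static long BytesAtOneBitPerCell(long cells)
    {
        return (cells + 7) / 8;
    }
}
```

For two lists of 100,000 items, Cells returns 10,000,000,000 and BytesAtOneBitPerCell returns 1,250,000,000, which is about 1.25 GB.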

Some Terminology

There are always two sets of items to compare. To help differentiate between the two sets, they will be called Source and Destination. The question we are usually asking is: What do we need to do to the Source list to make it look like the Destination list?

Generic and Reusable

In order to maintain the generic aspects of the original algorithm, a generic structure is needed. I chose to make use of C#’s interface.

Both the Destination list and Source list must implement IDiffList. It is assumed that the list is indexed from 0 to Count()-1. Just like the original article, the IComparable interface is used to compare the items between the lists.

Included in the source code are two structures that make use of this interface: DiffList_TextFile and DiffList_BinaryFile. They both know how to load their respective files into memory and return their individual items as IComparable structures. The source code examples for these objects should be more than adequate for expanding the system to other object types. For example, the system can be easily expanded to compare rows between DataSets or directory structures between drives.
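
As a rough illustration of the shape such an interface might take, here is a minimal sketch. Count() and the IComparable item type come from the description above; the accessor name GetByIndex and the StringDiffList class are assumptions made for this example and may not match the downloadable source.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the list abstraction. GetByIndex is an assumed name.
public interface IDiffList
{
    int Count();
    IComparable GetByIndex(int index);
}

// Hypothetical in-memory implementation over lines of text.
public sealed class StringDiffList : IDiffList
{
    private readonly List<string> _lines;

    public StringDiffList(IEnumerable<string> lines)
    {
        _lines = new List<string>(lines);
    }

    public int Count()
    {
        return _lines.Count;
    }

    // string implements IComparable, so each line can be returned directly.
    public IComparable GetByIndex(int index)
    {
        return _lines[index];
    }
}
```

Any other item type works the same way, as long as the items returned can be compared with CompareTo.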

The Overall Solution

The problem presented earlier is very similar to the differences in sorting algorithms. A shell sort is very quick to code but highly inefficient when the data set gets large. A quick sort is much more efficient on a large dataset. It breaks up the data into much smaller chunks using recursion.

The approach this algorithm takes is similar to that of a quick sort. It breaks up the data into much smaller chunks and processes those smaller chunks through recursion. The steps the algorithm takes are as follows:

Find the current Longest Matching Sequence (LMS) of items.

Store this LMS in a results pile.

Process all data left above the LMS using recursion.

Process all data left below the LMS using recursion.

These steps recursively repeat until there is no more data to process (or no more matches are found). At first glance, you should be able to easily understand the recursion logic. What needs further explanation is Step 1. How do we find the LMS without comparing everything to everything? Since this process is called recursively, won’t we end up re-comparing some items?
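
Before looking at Step 1 in detail, the recursion itself can be sketched as follows. This is a simplified illustration, not the engine's code: it works on string arrays, finds the LMS by naive brute force with none of the short circuits or caching discussed next, and all type and method names are hypothetical.

```csharp
using System.Collections.Generic;

public static class RecursiveDiffSketch
{
    // A matched run: starts at DestIndex / SourceIndex and covers Length items.
    public struct Match
    {
        public int DestIndex, SourceIndex, Length;
    }

    // Step 1 (naive version): try every destination/source pair in the ranges.
    static Match FindLongestMatch(string[] source, string[] dest,
        int destStart, int destEnd, int sourceStart, int sourceEnd)
    {
        var best = new Match { DestIndex = -1, SourceIndex = -1, Length = 0 };
        for (int d = destStart; d <= destEnd; d++)
        {
            for (int s = sourceStart; s <= sourceEnd; s++)
            {
                int len = 0;
                while (d + len <= destEnd && s + len <= sourceEnd
                       && dest[d + len] == source[s + len])
                {
                    len++;
                }
                if (len > best.Length)
                {
                    best = new Match { DestIndex = d, SourceIndex = s, Length = len };
                }
            }
        }
        return best;
    }

    static void ProcessRange(string[] source, string[] dest,
        int destStart, int destEnd, int sourceStart, int sourceEnd,
        List<Match> results)
    {
        // Nothing left to compare in one of the lists.
        if (destStart > destEnd || sourceStart > sourceEnd) return;

        Match lms = FindLongestMatch(source, dest,
            destStart, destEnd, sourceStart, sourceEnd);
        if (lms.Length == 0) return;   // no matches in this range

        results.Add(lms);              // Step 2: store the LMS

        // Step 3: everything above the LMS.
        ProcessRange(source, dest, destStart, lms.DestIndex - 1,
            sourceStart, lms.SourceIndex - 1, results);

        // Step 4: everything below the LMS.
        ProcessRange(source, dest, lms.DestIndex + lms.Length, destEnd,
            lms.SourceIndex + lms.Length, sourceEnd, results);
    }

    public static List<Match> Diff(string[] source, string[] dest)
    {
        var results = new List<Match>();
        ProcessRange(source, dest, 0, dest.Length - 1, 0, source.Length - 1, results);
        return results;
    }
}
```

The matches come back in the order they were discovered (longest first within each range), not in list order, which is why a sort is needed when the results are gathered.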

Finding the Longest Matching Sequence

First, we need to define where to look for the LMS. The system needs to maintain some boundaries within the Source and Destination lists. This is done using simple integer indexes called destStart, destEnd, sourceStart and sourceEnd. At first, these will encompass the entire bounds of each list. Recursion will shrink their ranges with each call.

To find the LMS, we use brute force looping with some intelligent short circuits. I chose to loop through the Destination items and compare them to the available Source items. The pseudo code looks something like this:

For each Destination Item in Destination Range
    Jump out if we can not mathematically find a longer match
    Find the LMS for this destination item in the Source Range
    If there is a match sequence
        If it’s the longest one so far – store the result.
        Jump over the Destination Items that are included in this sequence.
    End if
Next For
If we have a good best match sequence
    Store the match in a final match result list
    If there is space left above the match in both
      the Destination and Source Ranges
        Recursively call function with the upper ranges
    If there is space left below the match in both
      the Destination and Source Ranges
        Recursively call function with the lower ranges
Else
    There are no matches in this range so just drop out
End if

The Jumps are what gives this algorithm a lot of its speed. The first mathematical jump looks at the result of some simple math:

This formula calculates the theoretical best possible match the current Destination item can produce. If it is less than (or equal to) the current best match length, then there is no reason to continue in the loop through the rest of the Destination range.
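
One way to express that test is sketched below. It assumes the theoretical best match is simply the number of items left in the Destination range from the current index onward; the class and method names are hypothetical.

```csharp
public static class FirstJump
{
    // Upper bound on any match starting at destIndex: the number of items
    // remaining in the destination range (assumed form of the formula).
    public static int MaxPossibleLength(int destIndex, int destEnd)
    {
        return destEnd - destIndex + 1;
    }

    // True when the outer loop can stop, because no later destination item
    // can produce a match longer than the best one found so far.
    public static bool CanStop(int destIndex, int destEnd, int bestLengthSoFar)
    {
        return MaxPossibleLength(destIndex, destEnd) <= bestLengthSoFar;
    }
}
```

With destEnd = 9 and a best match of 3 so far, the loop would continue at destIndex 6 (four items remain) and stop at destIndex 7 (only three remain).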

The second jump is more of a leap of faith. We ignore overlapping matching sequences by jumping over the destination indexes that are internal to a current match sequence, and therefore, cut way down on the number of comparisons. This is a really big speed enhancement. If the lists we are testing contain a lot of repetitive data, we may not come up with the perfect solution, but we will find a valid solution fast. I found in testing that I had to manually create a set of files to demonstrate the imperfect solution. In practice, it should be very rare. You can comment out this jump in the source code and run your own tests. I have to warn you that it will greatly increase the calculation time on large data sets.

There is a very good chance during recursion that the same Destination Item will need to be tested again. The only difference in the test will be the width of the Source Range. Since the goal of the algorithm is to lower the number of comparisons, we need to store the previous match results. DiffState is the structure that stores the result. It contains the index of the first matching item in the source range and a length, so that we know how far the match extends. DiffStateList stores the DiffStates by destination index. The loop simply requests the DiffState for a particular destination index, and DiffStateList either returns a pre-calculated one or a new uncalculated one. A simple test is performed to see whether the DiffState needs to be recalculated given the current Destination and Source ranges. DiffState will also store a status of 'No Match' when appropriate.
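
A minimal sketch of such a cache follows. The class names DiffState and DiffStateList come from the text, but every member name, the MatchStatus enum and the exact validity test are assumptions made for illustration. The test relies on the fact that recursion only ever shrinks the ranges, so a cached 'No Match' stays valid.

```csharp
public enum MatchStatus { Unknown, NoMatch, Matched }

// Hypothetical cache entry for one destination index.
public sealed class DiffState
{
    public MatchStatus Status { get; private set; }
    public int SourceStart { get; private set; }
    public int Length { get; private set; }

    public DiffState()
    {
        Status = MatchStatus.Unknown;
        SourceStart = -1;
    }

    public void SetMatch(int sourceStart, int length)
    {
        Status = MatchStatus.Matched;
        SourceStart = sourceStart;
        Length = length;
    }

    public void SetNoMatch()
    {
        Status = MatchStatus.NoMatch;
        SourceStart = -1;
        Length = 0;
    }

    // A cached match is reusable only if it still lies entirely inside the
    // current source range and does not run past the destination range.
    public bool IsUsable(int sourceStart, int sourceEnd, int maxDestLength)
    {
        if (Status == MatchStatus.Unknown) return false;
        if (Status == MatchStatus.NoMatch) return true; // ranges only shrink
        return SourceStart >= sourceStart
            && SourceStart + Length - 1 <= sourceEnd
            && Length <= maxDestLength;
    }
}

public sealed class DiffStateList
{
    private readonly DiffState[] _states;

    public DiffStateList(int destCount)
    {
        _states = new DiffState[destCount];
    }

    // Returns the cached state for a destination index, creating an
    // uncalculated one on first request.
    public DiffState GetByIndex(int destIndex)
    {
        if (_states[destIndex] == null)
        {
            _states[destIndex] = new DiffState();
        }
        return _states[destIndex];
    }
}
```

The caller would recalculate a state whenever IsUsable returns false for the current ranges.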

If a DiffState needs to be recalculated, an algorithm similar to the one above is called.

For each Source Item in Source Range
    Jump out if we can not mathematically find a longer match
    Find the match length for the Destination Item
      on the particular Source Index
    If there is a match
        If it’s the longest one so far – store the result
        Jump over the Source Items that are included in the match
    End if
End for
Store the longest sequence or mark as 'No Match'

The Jumps are again giving us more speed by avoiding unnecessary item comparisons. I found that the second jump cuts the run time by two-thirds on large data sets. Finding the match length for a particular destination item at a particular source index is just the result of comparing the item lists in sequence at those points and returning the number of sequential matches.
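
The inner search and the match-length calculation can be sketched like this. It is an illustration of the pseudo code above working on plain IComparable arrays, not the engine's actual code, and all names are hypothetical.

```csharp
using System;

public static class SourceSearchSketch
{
    // Counts how many consecutive items match, starting at the given
    // destination and source indexes and staying inside both ranges.
    public static int MatchLength(IComparable[] source, IComparable[] dest,
        int destIndex, int sourceIndex, int destEnd, int sourceEnd)
    {
        int length = 0;
        while (destIndex + length <= destEnd
               && sourceIndex + length <= sourceEnd
               && dest[destIndex + length].CompareTo(source[sourceIndex + length]) == 0)
        {
            length++;
        }
        return length;
    }

    // Finds the longest run in source[sourceStart..sourceEnd] that matches
    // the run beginning at dest[destIndex]. Returns false for 'No Match'.
    public static bool LongestSourceMatch(IComparable[] source, IComparable[] dest,
        int destIndex, int destEnd, int sourceStart, int sourceEnd,
        out int bestIndex, out int bestLength)
    {
        bestIndex = -1;
        bestLength = 0;
        for (int s = sourceStart; s <= sourceEnd; s++)
        {
            // First jump: the remaining source items cannot beat the best so far.
            if (sourceEnd - s + 1 <= bestLength) break;

            int length = MatchLength(source, dest, destIndex, s, destEnd, sourceEnd);
            if (length > 0)
            {
                if (length > bestLength)
                {
                    bestIndex = s;
                    bestLength = length;
                }
                // Second jump: skip the source items covered by this match
                // (the loop increment moves past the last one).
                s += length - 1;
            }
        }
        return bestLength > 0;
    }
}
```

The second jump is where the speed-for-accuracy trade shows up. With source items 1, 1, 1, 2 and destination items 1, 1, 2, this sketch reports the two-item match at source index 0 and never examines the three-item match that starts at source index 1.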

Gathering the Results

After the algorithm has run its course, you will be left with an ArrayList of valid match objects. I use an object called DiffResultSpan to store these matches. A quick sort of these objects will put them in the necessary sequential order. DiffResultSpan can also store the necessary delete, addition and replace states within the comparison.

To build the final result ArrayList of ordered DiffResultSpans, we just loop through the matches filling in the blank (unmatched) indexes in between. We return the result as an ordered list of DiffResultSpans that each contains a DiffResultSpanStatus.
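
The gap-filling step can be sketched as follows. This is an assumption-laden illustration: it uses a generic List instead of ArrayList, and the ResultSpan class, the SpanStatus names and the rule for splitting a gap into replace plus add or delete are all hypothetical stand-ins for DiffResultSpan and DiffResultSpanStatus.

```csharp
using System;
using System.Collections.Generic;

public enum SpanStatus { NoChange, Replace, DeleteSource, AddDestination }

public sealed class ResultSpan
{
    public SpanStatus Status;
    public int DestIndex, SourceIndex, Length;
}

public static class ReportBuilder
{
    // matches: the NoChange spans found by the engine, in any order.
    public static List<ResultSpan> Build(List<ResultSpan> matches,
        int sourceCount, int destCount)
    {
        var ordered = new List<ResultSpan>(matches);
        ordered.Sort((a, b) => a.DestIndex.CompareTo(b.DestIndex));

        var report = new List<ResultSpan>();
        int dest = 0, source = 0; // first index not yet reported in each list
        foreach (ResultSpan match in ordered)
        {
            AddGap(report, dest, match.DestIndex, source, match.SourceIndex);
            report.Add(match);
            dest = match.DestIndex + match.Length;
            source = match.SourceIndex + match.Length;
        }
        AddGap(report, dest, destCount, source, sourceCount); // trailing items
        return report;
    }

    // Describes the unmatched items in dest[destFrom, destTo) and
    // source[sourceFrom, sourceTo).
    static void AddGap(List<ResultSpan> report,
        int destFrom, int destTo, int sourceFrom, int sourceTo)
    {
        int destGap = destTo - destFrom;
        int sourceGap = sourceTo - sourceFrom;
        int common = Math.Min(destGap, sourceGap);
        if (common > 0)
            report.Add(new ResultSpan { Status = SpanStatus.Replace,
                DestIndex = destFrom, SourceIndex = sourceFrom, Length = common });
        if (destGap > common)
            report.Add(new ResultSpan { Status = SpanStatus.AddDestination,
                DestIndex = destFrom + common, SourceIndex = sourceFrom + common,
                Length = destGap - common });
        else if (sourceGap > common)
            report.Add(new ResultSpan { Status = SpanStatus.DeleteSource,
                DestIndex = destFrom + common, SourceIndex = sourceFrom + common,
                Length = sourceGap - common });
    }
}
```

Sorting by destination index is enough here because matches produced by the recursion never cross: a match that comes earlier in the Destination list also comes earlier in the Source list.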

You can now process the ArrayList as is necessary for your application.

Algorithm Weaknesses

When using the algorithm described above, and when the data sets are completely different, it will compare every item in both sets before it finds out that there are no matches. This can be a very time consuming process on large datasets. On the other hand, if the data sets are equivalent, it will find this out in one iteration of the main loop.

Although it should always find a valid difference, there is a chance that the algorithm will not find the best answer. I have included a set of text files that demonstrate this weakness (source.txt and dest.txt). There is a sequence of 5 matches that is missed by the system because it is overlapped by a previous smaller sequence.

Algorithm Update 6/10/2004

To help address the 2nd weakness above, three levels of optimization were added to the Diff Engine. Tests identified that large, highly redundant data produced extremely poor difference results. The following enum was added:

This can be passed to an additional ProcessDiff() method. The engine is still fully backward compatible and will default to the original Fast method. Only tests can identify if the Medium or SlowPerfect levels are necessary for your applications. The speed differences between the settings are quite large.
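
For orientation, here is a sketch of what such an enum and its effect on the jump might look like. The level names Fast, Medium and SlowPerfect come from the text; the enum name, the exact member spelling and the jump rule for each level are assumptions, so check Engine.cs for the real behavior.

```csharp
// Hypothetical reconstruction of the optimization levels.
public enum DiffEngineLevel
{
    Fast,        // assumed: always jump over items inside a match
    Medium,      // assumed: jump only when the match is a new best
    SlowPerfect  // assumed: never jump, examine every item
}

public static class JumpPolicy
{
    // Returns how many destination items to advance after examining one
    // item whose match length is matchLength.
    public static int Advance(DiffEngineLevel level, int matchLength,
        int bestLengthSoFar)
    {
        switch (level)
        {
            case DiffEngineLevel.Fast:
                return matchLength > 0 ? matchLength : 1;
            case DiffEngineLevel.Medium:
                return matchLength > bestLengthSoFar ? matchLength : 1;
            default:
                return 1;
        }
    }
}
```

Under these assumptions, the slower levels trade extra comparisons for a better chance of finding a long match that begins inside a shorter one.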

The differences in these levels affect when or if we jump over sections of the data when we find existing match runs. If you are interested, you will find the changes in the ProcessRange() method. They start on line 107 of Engine.cs.

Included Files

The project DiffCalc is the simple front-end used to test the algorithm. It is capable of doing a text or binary diff between two files.

The DLL project DifferenceEngine is where all the work is done.

Engine.cs: Contains the actual diff engine as described above.

Structures.cs: Contains the structures used by the engine.

TextFile.cs: Contains the structures designed to handle text files.

BinaryFile.cs: Contains the structures designed to handle binary files.

Special Note

The first line in Structures.cs is commented out. If you uncomment this line, you will lower the memory needs of the algorithm by using a Hashtable instead of an allocated array to store the intermediate results. It does reduce the speed by a percent or two. I believe the decrease is due to the reflection that becomes necessary. I am hoping that the future .NET Generics will solve this issue.

Summary

If you take a step back from the algorithm, you will see that it is essentially building the table described in aprenot's original article. It simply attempts to intelligently jump over large portions of the table. In fact, the more the two lists are the same, the quicker the algorithm will run. It also stores what it does calculate in a more efficient structure instead of rows & columns.

Hopefully, others will find good uses for this code. I am sure there are some more optimizations that will become apparent over time. Drop me a note when you find them.

History

May 18, 2004 'Off by One' error on line 106 in Engine.cs fixed. Thanks go to Lutz Hanusch for identifying the bug.

May 27, 2004 Incrementor was missing on line 96 in Results.cs in the demo. Thanks go to Cash Foley for identifying the bug.

June 10, 2004 Speed optimizations led to very poor diff results on redundant data. Two slower diff engine levels have been added to solve this problem. Thanks go to Rick Morris for identifying this weakness.

Comments and Discussions

I found I needed to make a change to your very cool code. I was getting a difference reported when there wasn't one. There have to be many real differences for this to occur. I did some poking in the code and found something subtle you may want to look into. It may not matter, and I may have broken something to cause this, but I figured I should pass it along.

In ProcessRange(), toward the bottom, there is the processing for what comes after the match. There is one spot where you have a ">" that may need to be a ">=". This is the last conditional, where ( sourceEnd > upperSourceStart ). It might need to be ( sourceEnd >= upperSourceStart ).

My apologies if I broke this to make it occur in the first place. And thanks again for a wizard contribution!

Thanks. This is exactly what I was looking for. I have a question regarding this. Currently, if there is a change in any line, it highlights the entire line in red and green. Is it possible to highlight the exact difference within the line?

I was looking for a diff viewer and took a look at this CodeProject project. It looks like a good piece of code if you want to make your own diff engine, but if you need a diff viewer, it's easier to pick up something already available.

Very nice, but I think it would be better and more professional if only the words or characters that were inserted, deleted or modified were highlighted instead of the whole line. Can someone do this? If you can, please reply to me after improving it. I am waiting for your reply. God bless you.


I am not the author, but I feel it is not right to ask him to work for you. He has already done a very good job by posting this nice article for us, and you are asking him to do your work.

Why don't you try modifying the code yourself and share it here or use it? Also, please don't post with tags like "urgent". It may not be urgent for the author, and moreover, tags like these will discourage people from answering your question.

I've implemented a tool which uses your diff algorithm to create changes similar to TFS's Annotate feature and am wondering how I could take the next leap by ignoring whitespace. I thought about some possibilities like preprocessing input (removing whitespace before the differ is invoked) or postprocessing results (removing results based on whitespace only), but I wonder if there is a much easier way to do it.

Thanks for your help in advance, Ramin

PS: Have you migrated that code to .NET 4.0 to take advantage of new features such as generics, LINQ and the like?

However, as is, if lines are added just before the last line of text then the last line is treated as a replaced line rather than a 'No Change' line. Just an 'off by one' error fixed below. To find the modified code search for "// Original:".

I was using XDelta3 (a command line app) to compare two VHD files and then send the delta to my off-site server for backups. I saw this and it looks great, but you read the entire file into the ArrayList, and BinaryReader only accepts an int as the argument for ReadBytes. That limits your file size to around 2 GB (int.MaxValue). Not that I want to use all my memory, or that the framework will even let me make an ArrayList that large.

10+ from me. I love this work, it's clean and concise and doesn't fail on a 96 megabyte file like the Eugene Myers algorithm. It may not have the backing of the big league but I cannot find a failure for the life of me.

I have been researching difference algorithms that have been ported to .NET for about a week now. The first one I decided to go with was an implementation of the Eugene Myers diff algorithm as it seemed bullet proof and there are visual descriptions about how it works. Not to mention it seems to be the standard and creme de la creme of differencing algorithms according to the majority of the world. You can find implementations here and here.

The logic seems undeniable. I must have read the white paper 5 times and reviewed at least 4 implementations each of which I tried.

However, I've had nothing but problems with all of them. The returned differencing items are very unreliable from set to set and the amount of code required to do anything real is insane. There are structures such as StartA, StartB, InsertedB, and DeletedA. The problem lies in the interpretation of the results from one set to another. The structures cannot be evaluated stand alone as the values in the structure mean different things depending on the change type (inserted, deleted, unchanged etc...).

Here's an example that is just plain impossible with the algorithm. When you require multiple change sets to be diffed and merged back into one to get the complete view of all of the changes. Anyone know how subversion does this? (I know probably the Myers diff algorithm).

Here's a full blown example of the problem (no the auto-text is not always all caps):

FILE A

This is file A's text
it has multiple lines that are
always different than file C.

FILE B

This is file B's text. This text is always
different than file A's text. These two files
are really sections of File C.
Because this section is much longer than file A
It may have multiple inserts from some other
program.

FILE C

This is the main file it contains multiple lines and contains the text of both file A and file B. However
another program had this file before I did and inserted
some text in between the file A text and some in between the
file B text like this:

This is file A's text
it has multiple lines that are

SOMEONE PUT SOME TEXT HERE

always different than file C.

This is file B's text. This text is always

SOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTERED

different than file A's text. These two files
are really sections of File C.
Because this section is much longer than file A
It may have multiple inserts from some other

SOME AUTO FOOTER TEXT THAT SOME PROGRAM ENTERED

program.

__________________________________________________________________

Now after you have saved the results of both diffs try to figure out how to replace the A file text in File C with a token like this {FileA} and the same for file B like this {FileB}. You cannot, at least not reliably. Because the diffItems startA, startB, and insertedB do not make sense from set to set and you cannot properly determine the line numbers in the source and destination files. What am I missing? I would love it if you could change my mind.

Hey, I am trying to compare .css or .js files with this, but when downloading those files from the internet the formatting is lost, so even a single-word mismatch highlights the whole thing. Can you help with this? Also, how can I show your results in an ASP.NET project?

Hi, I'm a student and new to .NET programming. I'm trying to implement this difference engine in my web application assignment to compare 2 versions of a text article, and so far I'm stuck. Can anyone please help me out? I have 2 multiline textboxes containing a bunch of words and a button. After clicking the button, the differences and similarities would be shown in the textboxes. I would be very grateful for any help. Please email me at kap.kidlat@yahoo.com. Thanks.

There is a bug in Engine.cs at line 69:

//jump over the match
sourceIndex += curBestLength;

This jump is not correct because it creates a chance of losing a better match. Consider this case:

source = "1112"
destination = "112"

When we find the match "11", we jump to source[2] (that is, to the remaining source substring "12") and so lose the best match "112", which starts at source[1].

A handy little diff engine. I also managed to cheat & get it to do a word by word comparison by replacing spaces between words with newlines and convert the generated results to HTML to render it in browsers instead of listviews.

I am developing a tool for my company. I found this module really useful and would like to include it to my program. Is there any license associated to the demo project and the diff module? Thanks a lot!

Kudos for the good work. Just one thing: there are situations in which it is not easy to properly implement the IComparable interface, and sometimes it is not even possible at all. And anyway, why would a diff engine need to know which of two items is less or greater? Unless it is a problem to use .NET Framework version 2.0 or higher, I'd suggest replacing IComparable with IEquatable. Regards, M.

Compliments on the very nice and useful article. I'm developing an engine to compare different types of files. First of all, I convert these files to one file type like XML or just text files. Now I want to compare the two files and present the difference as a percentage, like '10% difference', or instead of presenting the difference, I would like to present the match as a percentage, like 'Match: 90%'. Is it possible to implement this in your solution? If so, how would you do this?

Thanks for your article. I needed something to compare VBA code in two workbooks. While I didn't read your code, I implemented your idea in VBA Excel and it works quite well. I get around some of the limitations by slicing the VBA I am comparing into individual Subs and Functions and running the diff algorithm on each of these pieces of code. This also avoids the diff problems that occur when one moves Subs and Functions within a module.

I have posted the VBA recursive portion as well as the output engine (each called separately for each Sub and Function in the Excel workbooks I am comparing):

            Else
                If iSRCBlk1(iRow1) = 0 Then
                    ''Stop
                Else
                    bAdvance1 = False
                End If
                If iSRCBlk2(iRow2) = 0 Then
                    ''Stop
                Else
                    bAdvance2 = False
                End If
            End If
        End If
        If bAdvance1 Then
            Set rng = ws.Cells(iRow, 1)
            rng.Value = "'" & sSRC1(iRow1)
            If IsEntry(rng.Value, s0) Then
                iRowEntry = rng.Row
            End If
            ws.Cells(iRow, 5) = "'" & iSRCBlk1(iRow1) & "/" & iRow1
            iRow1 = iRow1 + 1
        End If
        If bAdvance2 Then
            Set rng = ws.Cells(iRow, 2)
            If Not bContentEqual Then
                rng.Value = "'" & sSRC2(iRow2)
            End If
            ws.Cells(iRow, 6) = "'" & iSRCBlk2(iRow2) & "/" & iRow2
            iRow2 = iRow2 + 1
        End If
        If Not bContentEqual Then
            If bIs1 And bIs2 Then
                If ws.Cells(iRow, 1).Value <> ws.Cells(iRow, 2).Value Then
                    Set rng = ws.Cells(iRow, 2)
                    rng.Interior.ColorIndex = COLOR_INDEX_YELLOW
                End If
            End If
        End If
        If iRow1 >= UBound(sSRC1) And iRow2 >= UBound(sSRC2) Then
            Exit Do
        End If
    Loop

The C# project seems to have been written with an older version of Visual Studio. I have Visual Studio 2005 Express Edition. When I open the project, it says that the project was written in an older version and that, to make it compatible, I need to convert it using built-in functionality of the Express Edition. But the conversion fails for some reason, so I am not able to open the project. Is there any other way I can open the project?