Introduction

I wrote this article mostly for fun and as a candidate entry for The Code
Project's Lean and Mean competition to write a text diffing tool. The library
and demo application are both written in C# using VS 2010 Beta and tested on a
64-bit Windows 7 RC machine. Some of the articles submitted so far do not
really do a diff in the proper sense of the word, which gave me all the more
impetus to write one that does a standard diff. I primarily referred to the
Wikipedia entry on the LCS algorithm, but also referred to
this
article (university lecture notes). I've called the class SimpleDiff<T>
because, while it does do a proper LCS-based diff, it could probably be further
optimized, though I've applied some basic optimizations of my own.

Performance/Memory info

The implementation is intended to be more mean than lean:
the primary objective was to write a reasonably fast diff engine. This does not
mean it cannot be made faster, nor does it mean it cannot be made leaner without
losing much speed. For the test files provided by Code Project,
here are the timing results for five iterations, along with the average.

35.0069 ms
33.3978 ms
34.1942 ms
34.9919 ms
34.2274 ms

Average: 34.3636 ms

The tests were performed on a 64-bit Core i7 machine with 12 GB RAM.

The memory usage (in bytes) is probably not as impressive as it might have been for a C++
app, or for one that did not load both files into memory in their entirety.

Paged memory: 6,492,160
Virtual memory: 0
Working set: 6,766,592

Not surprisingly (given the 12 GB of RAM), virtual memory usage was 0. But the
working set (delta) was 6.7 MB. The two biggest memory hogs are the fact that we
keep both files in memory throughout the process, and the fact that the LCS
matrix can be pretty huge for files of any non-trivial size. I've considered a
version that never fully loads either file, but I couldn't think of a reasonable
way to avoid the performance drop that invariably accompanies frequent disk
reads for fetching/seeking data.

Note on memory calculation

The memory consumption was calculated using the
Process class's PeakWorkingSet64 and related
properties. I took the values once, invoked the code, then read the
values again and calculated the delta. To account for JIT memory, the
diff class was created once, but not used, prior to calculating the
memory.
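That measurement approach can be sketched with the real System.Diagnostics.Process API (the peak counters named above all exist on Process; the harness structure and the stand-in workload are my own, not the article's code):

```csharp
using System;
using System.Diagnostics;

class MemoryDeltaDemo
{
    static void Main()
    {
        Process proc = Process.GetCurrentProcess();
        proc.Refresh(); // the memory properties are cached snapshots

        long pagedBefore   = proc.PeakPagedMemorySize64;
        long virtualBefore = proc.PeakVirtualMemorySize64;
        long workingBefore = proc.PeakWorkingSet64;

        // ... invoke the code being measured here ...
        byte[] ballast = new byte[10 * 1024 * 1024]; // stand-in workload
        ballast[0] = 1;

        proc.Refresh();

        // The deltas approximate what the invoked code cost.
        Console.WriteLine("Paged delta:   {0:N0}", proc.PeakPagedMemorySize64 - pagedBefore);
        Console.WriteLine("Virtual delta: {0:N0}", proc.PeakVirtualMemorySize64 - virtualBefore);
        Console.WriteLine("Working delta: {0:N0}", proc.PeakWorkingSet64 - workingBefore);
    }
}
```

Note that Refresh() matters: without it, the Process object keeps serving the values it captured when it was created.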

Using the demo app

The demo app takes two parameters: the source and target files (or the left and
right files, as you may prefer). It prints out all the lines, using a ++ prefix to indicate a line added to the left file and a -- prefix to denote a line removed from it. Here are
some screenshots that show it run against Chris M's sample files and also
against my simple test files.

Figure 1: The screenshot shows the left and right files as well as the
console output.

Figure 2: The screenshot shows a partial output of comparing the test files
provided by the competition creators.

If you are wondering about line numbers, note that this display is a
function of the calling program; my diff library does not really provide any
output - it just exposes an event that callers can hook onto. For my console
app I chose to imitate the Unix diff program (though not identically), but it'd
be trivial to add line numbers. It'd be equally simple to write a WPF or
WinForms UI for this library.

Class design

It's a generic class where the generic argument specifies the type of the
comparable item. In our case, since we are comparing text files, this would be
System.String (each item representing a line of text). Using the class
looks something like this:
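The article's original snippet is not reproduced here, but a typical consumption pattern can be sketched from the names that do appear in the article and its build errors: SimpleDiff<T>, RunDiff, and DiffEventArgs<T> with its DiffType and LineValue properties. The event name and constructor signature below are my assumptions, not the library's confirmed API:

```csharp
// Hypothetical usage sketch -- the LineUpdate event name and the
// (T[], T[]) constructor are guesses; only SimpleDiff<T>, RunDiff and
// DiffEventArgs<T>.DiffType/.LineValue are named in the article.
string[] left  = File.ReadAllLines(args[0]);
string[] right = File.ReadAllLines(args[1]);

var diff = new SimpleDiff<string>(left, right);
diff.LineUpdate += (sender, e) =>
{
    // e.DiffType says whether the line was added, removed or unchanged;
    // e.LineValue carries the line itself.
    Console.WriteLine(e.DiffType + " " + e.LineValue);
};
diff.RunDiff();
```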

When RunDiff is called the first time, the most important and also the most
time-consuming part of the code is executed, once: that's where we calculate the
LCS matrix for the two arrays. But prior to that, as an optimization, there's
code that looks for any text that can be skipped at the beginning or end of the
diff arrays. The reason we can safely do this is that if identical lines are
removed from both ends, and the LCS calculated for the middle portion, then we
can add back the trimmed content to get the LCS for the original array pair. If
that's not clear, consider these two strings (imagine we are comparing characters
and not strings):
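As an illustration of my own (the article's original example strings are not reproduced above): compare "ABCxyzDEF" with "ABCpqDEF". The common prefix "ABC" and common suffix "DEF" can be trimmed off, the LCS matrix built only for "xyz" versus "pq", and the trimmed pieces added back afterwards. A minimal self-contained sketch of that trimming step:

```csharp
using System;

public static class SkipDemo
{
    // Length of the common prefix of two strings.
    public static int PreSkip(string a, string b)
    {
        int n = Math.Min(a.Length, b.Length), i = 0;
        while (i < n && a[i] == b[i]) i++;
        return i;
    }

    // Length of the common suffix, bounded so it never overlaps the
    // already-trimmed prefix in either string.
    public static int PostSkip(string a, string b, int preSkip)
    {
        int max = Math.Min(a.Length, b.Length) - preSkip, i = 0;
        while (i < max && a[a.Length - 1 - i] == b[b.Length - 1 - i]) i++;
        return i;
    }

    public static void Main()
    {
        string left = "ABCxyzDEF", right = "ABCpqDEF";
        int pre  = PreSkip(left, right);        // 3 -> "ABC"
        int post = PostSkip(left, right, pre);  // 3 -> "DEF"
        // Only the middle portions go through the expensive LCS step.
        Console.WriteLine(left.Substring(pre, left.Length - pre - post));   // xyz
        Console.WriteLine(right.Substring(pre, right.Length - pre - post)); // pq
    }
}
```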

The inline comments should be self-explanatory. Interestingly, my assumptions
about what the JIT optimizer would take care of turned out to be rather
inaccurate and hazy. Of course, I did not run detailed enough tests to draw any
serious conclusions, but to be safe it's probably best to do some level of
optimizing on your own instead of always thinking, "hey, the pre-JIT will catch
that one". Once the LCS is calculated, all that's left is to traverse the
matrix and fire the events as required, remembering also to iterate through
the skipped entries, if any:
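The traversal itself is the standard LCS backtrack. The article's code isn't shown above, so here is a compact self-contained sketch of the same idea (mine, not the author's; the library fires events where this version appends to a list):

```csharp
using System;
using System.Collections.Generic;

public static class LcsWalk
{
    // Standard LCS matrix: m[i, j] = LCS length of a[0..i) and b[0..j).
    public static int[,] BuildMatrix<T>(IList<T> a, IList<T> b)
    {
        var m = new int[a.Count + 1, b.Count + 1];
        for (int i = 1; i <= a.Count; i++)
            for (int j = 1; j <= b.Count; j++)
                m[i, j] = a[i - 1].Equals(b[j - 1])
                    ? m[i - 1, j - 1] + 1
                    : Math.Max(m[i - 1, j], m[i, j - 1]);
        return m;
    }

    // Recursive backtrack from the bottom-right corner, emitting one entry
    // per line: "  " unchanged, "++" added, "--" removed.
    // Note: recursion depth grows with the length of the diff.
    public static void Walk<T>(int[,] m, IList<T> a, IList<T> b, int i, int j, List<string> output)
    {
        if (i > 0 && j > 0 && a[i - 1].Equals(b[j - 1]))
        {
            Walk(m, a, b, i - 1, j - 1, output);
            output.Add("   " + a[i - 1]);
        }
        else if (j > 0 && (i == 0 || m[i, j - 1] >= m[i - 1, j]))
        {
            Walk(m, a, b, i, j - 1, output);
            output.Add("++ " + b[j - 1]);
        }
        else if (i > 0)
        {
            Walk(m, a, b, i - 1, j, output);
            output.Add("-- " + a[i - 1]);
        }
    }

    public static void Main()
    {
        var left  = new[] { "one", "two", "three" };
        var right = new[] { "one", "2", "three" };
        var lines = new List<string>();
        Walk(BuildMatrix(left, right), left, right, left.Length, right.Length, lines);
        lines.ForEach(Console.WriteLine);
    }
}
```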

I did not spend too much time trying to optimize ShowDiff since
my profiling showed that it was nowhere near as time consuming as the LCS
calculation: 87% of the execution time was spent in the LCS matrix loops.

Note on hashing

Many C++ implementations compute the hash of items before comparing them. In
our case, since we are using .NET, that'd actually be a de-optimization because
we'd lose the benefits of string interning. In most cases, the majority
of lines will be the same (in real-life scenarios, only a small percentage of lines
change between file versions). And since we use Object.Equals,
which does a reference comparison as its first step, and because identical
strings are interned, this comparison is extremely fast. Where we do slow down
is when we compare long lines that differ by one character at the right end
of the line - that'd give us our worst-case false-compare time.
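As a minimal illustration of the interning behaviour that Object.Equals benefits from (one caveat worth flagging: the CLR automatically interns string literals, while runtime-built strings are only pooled if you intern them explicitly via String.Intern):

```csharp
using System;

class InternDemo
{
    static void Main()
    {
        string a = "the quick brown fox";
        string b = "the quick brown fox";

        // Identical literals share one interned instance, so a reference
        // check succeeds without comparing a single character.
        Console.WriteLine(ReferenceEquals(a, b));                // True

        string part = "brown fox";
        string c = "the quick " + part; // built at runtime, not interned

        Console.WriteLine(ReferenceEquals(a, c));                // False
        Console.WriteLine(a.Equals(c));                          // True (char-by-char fallback)

        // Interning the runtime string maps it onto the pooled instance.
        Console.WriteLine(ReferenceEquals(a, string.Intern(c))); // True
    }
}
```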

Conclusion

I had initially thought of writing this in C++/CLI so I could mix types -
which would be especially useful when creating the array. The .NET array's big
disadvantage is that it's zero-initialized, and while that's one of the fastest
operations any CPU can typically perform, it's still time consuming given
the large size of the array. I could have avoided that by using a native array.
But the lack of IntelliSense drove me nuts after a few minutes, and I gave up and
went back to C#. Maybe if I get time I'll write another version which does
part of the calculations in native code that the C# code can P/Invoke, though
that may bring in inefficiencies of its own. Anyway, any suggestions and
criticisms are extremely welcome.


About the Author

Nish is a real nice guy who has been writing code since 1990 when he first got his hands on an 8088 with 640 KB RAM. Originally from sunny Trivandrum in India, he has been living in various places over the past few years and often thinks it’s time he settled down somewhere.

Nish has been a Microsoft Visual C++ MVP since October, 2002 - awfully nice of Microsoft, he thinks. He maintains an MVP tips and tricks web site - www.voidnish.com where you can find a consolidated list of his articles, writings and ideas on VC++, MFC, .NET and C++/CLI. Oh, and you might want to check out his blog on C++/CLI, MFC, .NET and a lot of other stuff - blog.voidnish.com.

Comments and Discussions

Excepting leading/trailing matching lines, SimpleDiff.ShowDiff() recurses once for each line in the diff! This will quickly lead to a stack overflow when you have thousands of lines. Here is an iterative version that does not suffer from this problem:
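[The poster's replacement code is not reproduced above. The standard way to de-recurse such a traversal is to walk from the bottom-right corner of the matrix in a loop, collect the operations into a list, and reverse it at the end; the sketch below is mine under that approach, not the poster's actual code.]

```csharp
using System;
using System.Collections.Generic;

public static class IterativeWalk
{
    // Standard LCS matrix: m[i, j] = LCS length of a[0..i) and b[0..j).
    public static int[,] Matrix<T>(IList<T> a, IList<T> b)
    {
        var m = new int[a.Count + 1, b.Count + 1];
        for (int i = 1; i <= a.Count; i++)
            for (int j = 1; j <= b.Count; j++)
                m[i, j] = a[i - 1].Equals(b[j - 1])
                    ? m[i - 1, j - 1] + 1
                    : Math.Max(m[i - 1, j], m[i, j - 1]);
        return m;
    }

    // Iterative backtrack: a loop and a list instead of recursion, so the
    // stack depth stays constant no matter how many lines the files have.
    public static List<string> Diff<T>(int[,] m, IList<T> a, IList<T> b)
    {
        var ops = new List<string>();
        int i = a.Count, j = b.Count;
        while (i > 0 || j > 0)
        {
            if (i > 0 && j > 0 && a[i - 1].Equals(b[j - 1])) { ops.Add("   " + a[--i]); j--; }
            else if (j > 0 && (i == 0 || m[i, j - 1] >= m[i - 1, j])) ops.Add("++ " + b[--j]);
            else ops.Add("-- " + a[--i]);
        }
        ops.Reverse(); // we walked backwards through the matrix
        return ops;
    }
}
```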

The problem arises from the optimizing methods CalculatePreSkip() and CalculatePostSkip(). The preSkip between the left file and the right file is calculated to be 4 (word1 word2 word3 word3). The postSkip is calculated as 4 as well (word3 word3 word4 word5). This is because, while CalculatePreSkip() makes sure _preSkip is always less than the number of lines in either file, CalculatePostSkip() only checks that _postSkip is less than what remains in the left file (11 lines - 4 preSkip = 7 lines), ignoring the length of the right file. Hence we end up in a situation where totalSkip is greater than the number of lines in the right file, causing CreateLCSMatrix() to bail out immediately, never initializing _matrix, and causing a negative index reference in ShowDiff (you only check whether rightIndex == 0, not whether it is < 0).

To fix this problem all we need to do is to update CalculatePostSkip() as follows:
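[The patch itself is not shown above. From the description, the fix is one extra bound: the post-skip scan must stop before it overlaps the pre-skip in either file, not just the left one. Below is a standalone reconstruction of that corrected check; the names and structure are my guesses at the described method, not the actual patch.]

```csharp
using System;
using System.Collections.Generic;

public static class PostSkipFix
{
    // Mirrors the fix described above: bound the suffix scan by what
    // remains in BOTH files after preSkip has been trimmed.
    public static int PostSkip<T>(IList<T> left, IList<T> right, int preSkip)
    {
        int skip = 0;
        while (skip < left.Count - preSkip
            && skip < right.Count - preSkip   // the bound the original code missed
            && left[left.Count - 1 - skip].Equals(right[right.Count - 1 - skip]))
        {
            skip++;
        }
        return skip;
    }
}
```

With this bound in place, preSkip + postSkip can never exceed the length of the shorter file, so CreateLCSMatrix() always gets a valid (possibly empty) middle region.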

Would you be able to assist me with the following error messages while trying to build:

Error 71 'VoidNish.Diff.DiffEventArgs<T>.DiffType.get' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\DiffEventArgs.cs 7 36 VoidNish.Diff
Error 72 'VoidNish.Diff.DiffEventArgs<T>.DiffType.set' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\DiffEventArgs.cs 7 41 VoidNish.Diff
Error 73 'VoidNish.Diff.DiffEventArgs<T>.LineValue.get' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\DiffEventArgs.cs 9 30 VoidNish.Diff
Error 74 'VoidNish.Diff.DiffEventArgs<T>.LineValue.set' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\DiffEventArgs.cs 9 35 VoidNish.Diff
Error 75 'VoidNish.Diff.SimpleDiff<T>.ElapsedTime.get' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\SimpleDiff.cs 28 39 VoidNish.Diff
Error 76 'VoidNish.Diff.SimpleDiff<T>.ElapsedTime.set' must declare a body because it is not marked abstract or extern C:\Projects\Private\Sandbox\FIFF\VoidNish.Diff\SimpleDiff.cs 28 52 VoidNish.Diff

I also get warnings:
Warning 1 The referenced component 'Microsoft.CSharp' could not be found.
Warning 77 Could not resolve this reference. Could not locate the assembly "Microsoft.CSharp". Check to make sure the assembly exists on disk. If this reference is required by your code, you may get compilation errors. C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Microsoft.Common.targets VoidNish.Diff

It will be simpler if you can upload the executable to your article. Then I do not have to compile it at all.

Thanks,
Ilka

I would imagine if you could understand Morse Code, a tap dancer would drive you crazy.
-- Mitch Hedberg (American Comedian, 1968-2005)

I tried a C++ version, done by another competitor, and his version is 2 or 3 times faster than yours. Maybe this is because C++ is more efficient than C#, in terms of speed.

The only C++ version submitted so far does not do a proper diff, so any speed comparison is meaningless. While C++ code can definitely be more performant than C# code, in this particular case the core time-consuming function is calculating the LCS matrix, and I don't think doing that in unmanaged code will give a 2-3 times speed boost. There are some optimizations possible with native code that can improve speed, but not by that much.

reinaldohf wrote:

I also wanted to ask you if you plan to make it generate HTML file output. If yes, it would be nice if it had color highlighting, so it would be useful for documenting source code. Do you agree?

If I get time, I do intend to write a simple UI that can view the diff output, and yeah with colors too

I did a quick test using patl (http://code.google.com/p/patl/) using VS2010 and profile-guided optimization. I got a runtime 1/15th of yours based on the file provided by Chris.

In short there are much better ways of implementing LCS.

Thanks for the feedback, Arash.

Oh yes, there are better ways to implement LCS, but the divide between C# and C++ is thin there. Those same better methods can possibly be done in C# too. My point was that merely implementing an algorithm in C++ will not give a 2-3 times increase in speed. The same algorithm in C++ and C# should yield very close speed results.

Also I'd like to point out that running a speed test is mostly meaningless unless it's all done on the same machine.

Once again, thanks for your very good feedback. It's quality comments such as yours that make CodeProject such an awesome place.

I think your code is very well built, but you may have overlooked that one of the requirements to the competition is that you use as little memory as possible. Your code is reading in the entire contents of both files and then doing an in-memory comparison. Since the solution would need to work on any combination of text files (some of which are quite large), this poses a serious problem for this approach.

My take on this is that this one focuses on the "mean" rather than on the "lean". For the actual LCS matrix calculation, both files are indeed kept in memory - but where that compromises on memory, it makes up for it with better performance. Thanks for the feedback, it's pushed my thinking in some new directions. So I really appreciate your post.

Need is a considerably relative word. The spirit of the contest that this code was written for is to produce code that leaves the smallest memory footprint possible with optimizations for speed. I.e. "back in the days of yore" when we had very little conventional memory for our code to run in.

The code written here is beautiful and it got a solid vote from me. No doubt about that, but I have to completely disagree that it satisfies the "using as little memory as possible" requirement for the contest given that file comparisons can be hundreds upon thousands of lines long (server log files come to mind) -- the code here will read the entire contents of those files and do an in-memory comparison.

It will be up to the CodeProject judges and our peers to determine whether this approach has merit or not, but my point was really to the author, asking if he had considered the memory part of the equation or not. As he already answered: he had considered and dismissed it, under the notion that if his code ran fast enough, it would compensate for using potentially vast amounts of memory.

Winning? No chance to win, no need to spend time? Is it just about winning?

I think the spirit of the contest was to "remember the good old days" and try to put ourselves back in time and see if we are still able to fit in the machines of that time. It's not about winning, it's a friendly contest between friends, and I think most of the people here thought "will I be able to do it?" way before even thinking about winning...

"Maybe if I get time I'll write another version which may do part of the calculations in native code and the C# code can P/Invoke it, though that itself may bring in inefficiencies of its own. Anyway, any suggestions and criticisms are extremely welcome. "

You do know that somewhere, a cute fluffy bunny has just died as a result of you switching over to the dark side of .NET, Karma is maintained at a cost.

Good job though, but I'd have liked to see some memory usage notes in there.

+5 from me.

"WPF has many lovers. It's a veritable porn star!" - Josh Smith

As Braveheart once said, "You can take our freedom but you'll never take our Hobnobs!" - Martin Hughes.

Thanks Pete, and as I was replying to Chris down below, I may add some timing/memory stats to the article.

And as for the intellisense bit, I'd normally have been okay with some loss of intellisense, but in VS 2010 the complete absence of intellisense was a rather negative experience, I might as well have used a plain text editor.

Thanks for the suggestions, man - I didn't go the static route for mostly personal preference reasons, but also because it leaves more flexibility for thread safety and class extensibility (however unlikely the chance for that may be).

An enumerator would be interesting! I guess it'd be pretty simple to write a helper class that'd serve as an enumerator for this one.

Once again thanks for the ideas - it's this sort of peer review/feedback that enhances the CP author experience for me. Got my 5! (your post that is).

Thanks Chris. I didn't think the timing/memory stats would be interesting since it'd all be relative to the test machine and the size of the files. But I guess it would give some sort of perspective to readers. I'll try and update the article with some sample timings/memory usage figures.

Oh, and yeah, as for the getting old part, perhaps we need to start a Code Project retiree home (fully WiFi-ed and all rooms will have the latest gadgets, and a free msdn dev subscription to its residents, including full beta/alpha program partnership).