Diff, Patch, and Friends

Diff is designed to show you the
differences between files, line by
line. It is fundamentally simple to use, but takes a little
practice. Don't let the length of this article scare you; you can
get some use out of diff by reading only the first page or two. The
rest of the article is for those who aren't satisfied with very
basic uses.

While diff is often used by developers to show differences
between different versions of a file of source code, it is useful
for far more than source code. For example, diff comes in handy
when editing a document which is passed back and forth between
multiple people, perhaps via e-mail. At Linux Journal, we have experience with this. Often both the
editor and an author are working on an article at the same time,
and we need to make sure that each (correct) change made by each
person makes its way into the final version of the article being
edited. The changes can be found by looking at the differences
between two files.

However, it is hard to show off how helpful diff can be in
finding these kinds of differences. To demonstrate with files large
enough to really show off diff's capabilities would require that we
devote the entire magazine to this one article. Instead, because
few of our readers are likely to be fluent in Latin, at least
compared to those fluent in English, we will give a Latin example
from Winnie Ille Pu, a translation by
Alexander Lenard of A. A. Milne's Winnie-the-Pooh
(ISBN 0-525-48335-7). This will make it harder for
the average reader to see differences at a glance and show how
useful these tools can be in finding changes in much larger
documents.

You may be able to find one or two changes after some careful
comparison, but are you sure you have found
every change? Probably not: tedious,
character-by-character comparison of two files should be the
computer's job, not yours.

The file names and last dates of modification are
shown in a “header” at the top. The dates may not mean anything
if you are comparing files that have been passed back and forth by
e-mail, but they become very useful in other circumstances.

The file names (in this case, 1 and 2) are preceded
by --- and +++.

After the header comes a line that includes
numbers. We will discuss that line later.

The lines that did not change between files are
shown preceded by spaces; those that are different in the different
files are shown preceded by a character which shows
which file they came from. Lines
which exist only in a file whose name is preceded by
--- in the header are preceded by a
- character, and vice-versa for lines preceded
by a + character. Another way to remember this
is to see that the lines preceded by a -
character were removed from the first
(---) file, and those preceded by a
+ character were added to
the second (+++) file.

Three spelling changes have been made:
“desendendi” has been corrected to “descendendi”, “non
nunquam” has been corrected to “nonnunquam”, and “no” has been
corrected to “eo”.

Perhaps the main thing to notice is that you didn't need this
description of how to interpret diff's output in order to find the
differences. It is rather easy to compare two adjacent lines and
see the differences.
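
To try this on files of your own, a minimal sketch follows. The file names old.txt, new.txt, and changes.diff and their contents are invented for illustration; they are not the article's files 1 and 2.

```shell
# Hypothetical sample files, invented for this sketch.
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\n' > new.txt

# diff exits with status 1 when the files differ; "|| true" keeps a
# script running under "set -e" from stopping at this line.
diff -u old.txt new.txt > changes.diff || true

# Show the result: a two-line header (--- and +++), one line with
# numbers, then the body, where "-beta" comes from old.txt and
# "+BETA" comes from new.txt.
cat changes.diff
```

The unchanged lines (alpha and gamma) appear in the output preceded by a space.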

It's not always this easy

Unfortunately, if too many adjacent lines have been changed,
interpretation isn't as immediately obvious; but by knowing that
each marked line has been changed in some way, you can figure it
out. For instance, in this comparison, where the file 3 contains
the damaged contents, and file 4 (identical to file 2 in the
previous example) contains the correct contents, three lines in a
row are changed, and now each line with a difference is not shown
directly above the corrected line:

It takes a little more work to find the added mistakes:
“nodum” for “modum” and “exitare” for “exstare”. Imagine if
50 lines in a row had each had a one-character change, though. This
begins to resemble the old job of going through the whole file,
character-by-character, looking for changes. All we've done is
(potentially) shrink the amount of comparison you have to
do.

Fortunately, there are several tools for finding these kinds
of differences more easily. GNU Emacs has “word diff”
functionality. There is also a GNU “wdiff” program which helps
you find these kinds of differences without using Emacs.

Let's look first at GNU Emacs. For this example, files 5 and
6 are exactly the same, respectively, as files 3 and 4 before. I
bring up emacs under X (which provides me with colored text), and
type:

M-x ediff-files RET
5 RET
6 RET

In the new window which pops up, I press the space bar, which
tells Emacs to highlight the differences. Look at Figure 1 and see
how easy it is to find each changed word.

GNU wdiff is also very useful, especially if you aren't
running X. A pager (such as less) is all that is required—and that
is only required for large differences. The exact same set of files
(5 and 6), compared with the command wdiff -t 5
6, is shown in Figure 2.

If you are getting extra character sequences like
ESC[24 instead of getting underline and reverse
video, it's probably because you are using
less, which by default doesn't
pass through all escape characters. Use less -r
instead, or use the more pager. Either should work.

wdiff uses the termcap database (that's
what the -t option is for) to find out how to
enable underline and reverse video, and not all termcap entries are
correct. In some instances, I've found that the
linux termcap entry works well for other
terminals, since the codes for turning underline and reverse video
on and off don't differ very much across terminals. To use the
linux termcap entry, you can do this:

TERM=linux wdiff -t 5 6 | less -r

This will work only with Bourne shell derivatives such as
bash, not with csh or tcsh. But since you need to do this only to
correct for a broken termcap database, this limitation shouldn't be
too much of a problem.
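
The VAR=value prefix sets the variable for that one command only and leaves the rest of the session untouched. A minimal sketch of that behavior, in which a small sh -c command stands in for wdiff (the stand-in is an assumption made so the example runs without wdiff installed):

```shell
# Remember the terminal type the session started with (may be empty).
before=${TERM-}

# The assignment applies only to the command on the same line.
seen=$(TERM=linux sh -c 'printf %s "$TERM"')
echo "inside the command: $seen"

# Afterwards the session's own TERM is unchanged.
after=${TERM-}
echo "session before: $before, after: $after"
```

Under csh or tcsh, the standard env utility gives the same effect: env TERM=linux wdiff -t 5 6 | less -r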

wdiff isn't always built with the termcap support needed to
underline and reverse video, and it's not always what you want even
if you have a working termcap database, so there's an alternate
output format that is just as easy to understand. We'll kill two
birds with one stone by also showing off wdiff's ability to deal
with re-wrapped paragraphs while showing off its ability to work
without underline and reverse video. File 8 is the same as the
correct file 2, shown at the beginning of this article, but file 7
(the corrupted one) now has much shorter lines, which makes them
even harder to compare “by eye”:

Remember the + and -
characters? They mean the same thing with wdiff as they mean with
diff. (Consistent user interfaces are wonderful.)
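
A minimal sketch of this plain-text output, assuming wdiff is installed (the sketch skips the comparison if it is not). The files old.txt and new.txt are invented for illustration and are not the article's files 7 and 8. In wdiff's default output, removed words are wrapped in [- -] and added words in {+ +}.

```shell
# Hypothetical sample files: the same sentence, wrapped differently,
# with one word changed.
printf 'the quick brown fox\njumps over\n' > old.txt
printf 'the quick red fox jumps\nover\n' > new.txt

if command -v wdiff >/dev/null 2>&1; then
    # wdiff, like diff, exits with status 1 when the files differ.
    result=$(wdiff old.txt new.txt || true)
    printf '%s\n' "$result"
else
    echo "wdiff is not installed; skipping the comparison"
fi
```

Because wdiff compares word by word, the moved line break is not reported as a change; only the brown/red substitution is marked.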

Chunks

Near the beginning of this article, I promised to explain
this line:

-1,9 +1,9

It describes the chunk in which diff found differences. In each
file, the chunk starts on line 1 and is 9 lines long (the number
after the comma is a line count, not an ending line number). In
this small example, the chunk contains the whole file. With larger
files, only the changed lines and the lines around them, called the
context, are shown.
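
To see how the numbers behave in a larger file, here is a minimal sketch. The 20-line files old.txt and new.txt are invented for illustration, and the seq utility is assumed to be available.

```shell
# Hypothetical 20-line file, and a copy with only line 10 changed.
seq 1 20 > old.txt
sed '10s/.*/ten/' old.txt > new.txt

# Keep only the line with the numbers. In diff's actual output the
# pair is wrapped in @@ markers.
header=$(diff -u old.txt new.txt | grep '^@@' || true)
echo "$header"    # prints: @@ -7,7 +7,7 @@
```

The chunk starts at line 7 (three lines of context before the change on line 10) and is 7 lines long, so it ends at line 7 + 7 - 1 = 13.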

In files 9 and 10, I've inserted a lot of blank lines in the
middle of the paragraph, in order to show what multiple chunks look
like. File 9 is damaged, file 10 is correct (except for the blank
lines in the middle of the paragraph):

So you see that one seven-line chunk starting at line 1 and
one seven-line chunk starting at line 33 are shown
here.

You should notice several things here:

There is one header at the top of each
chunk.

Blank lines are included as part of a chunk's
context.

Lines that are not changed and that are not within
three lines of a changed line are not included in any chunk.

“Patches” (or “diffs”) are the output of the diff
program. They include all the chunks of changes between the two
files.
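
The points above can be checked with a minimal sketch. The 40-line files old.txt and new.txt are invented for illustration and are not the article's files 9 and 10.

```shell
# Hypothetical 40-line file with two widely separated changes.
seq 1 40 > old.txt
sed -e '3s/.*/three/' -e '35s/.*/thirty-five/' old.txt > new.txt

diff -u old.txt new.txt > changes.diff || true

# One header line per chunk: two changes far apart give two chunks.
grep -c '^@@' changes.diff    # prints: 2
grep '^@@' changes.diff
```

The first chunk is only six lines long because just two lines of context exist before line 3. Lines 7 through 31 are more than three lines away from any change, so they appear in neither chunk.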

Other formats

This only scratches the surface of diff. For one thing, the
number of lines of unchanged context (three by default) is
configurable. Instead of using the -u option, you can use the -U
lines option to specify any
reasonable number of lines of context. You can even specify
-U 0 if you don't want any context at
all, though that is rarely useful.
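
A minimal sketch of the effect of -U, again using invented 20-line files named old.txt and new.txt:

```shell
# Hypothetical 20-line file, and a copy with only line 10 changed.
seq 1 20 > old.txt
sed '10s/.*/ten/' old.txt > new.txt

# One line of context on each side: the chunk shrinks to lines 9-11.
one=$(diff -U 1 old.txt new.txt | grep '^@@' || true)
echo "$one"    # prints: @@ -9,3 +9,3 @@

# No context at all: only the removed and added lines remain after
# the two-line file header.
diff -U 0 old.txt new.txt | tail -n +3 || true
```

With less context the patch is smaller, but a program applying it later has less surrounding text to match against.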

What does the -u (or -U
lines) argument mean? It
specifies the unified diff format, which is
the particular format covered here. Other formats include:

“context diffs” which have the same information
as unified diffs, but are less compact and less readable

“ed script diffs” or “normal diffs” which are
in a format that can be easily converted into a form that can be
used to cause the (nearly obsolete) editor ed to automatically
change another copy of the old file to match the new file. This
format has no context and could easily be replaced by -U
0, except for compatibility with older software and the
POSIX standard.

You will almost never want to create context or normal diffs,
but it may be useful to recognize them from time to time. Context
diffs are marked by the use of the character !
to mark changes, and normal diffs are marked by the use of the
characters < and > to
mark changes.

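In the header and chunk markers of context diffs, the *
character is used in place of the unified diff's
- character, and the -
character is used in place of the + character.
Within the body, lines that changed are marked with !, while lines
that were only removed or only added still carry - and +.
The context diff format was designed before the unified diff
format, but the unified diff format's choice of characters is
mnemonic and therefore preferable.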

Context diffs repeat all context twice for each
chunk. This is a waste of space in files, but far more importantly,
it separates the changes too widely, making patches less
human-readable.

Normal, old-style diffs are very compact and use
very little space. They are useful in situations where you don't
normally expect a human to read them, where saving space makes a
lot of sense, and where they will never be applied to files which
have changed. For example, RCS (covered in the May 1996 issue of
LJ) uses a format almost identical to
old-style diffs to store changes between versions of files. This
saves space and time in a situation where any context at all would
be a waste of space.
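
To recognize the three formats side by side, here is a minimal sketch that runs the same comparison three ways. All file names are invented for illustration.

```shell
# Hypothetical sample files with a single changed line.
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\nBETA\ngamma\n' > new.txt

diff    old.txt new.txt > normal.diff  || true   # normal (old-style) diff
diff -c old.txt new.txt > context.diff || true   # context diff
diff -u old.txt new.txt > unified.diff || true   # unified diff

# Normal diff: a terse "2c2" command, then lines marked < and >.
cat normal.diff

# Context diff: changed lines are marked with "!" and the context
# appears twice, once for each file.
grep -e '^! ' context.diff

# Compare the sizes of the three results.
wc -l normal.diff context.diff unified.diff
```

The context diff is the longest of the three because it repeats alpha and gamma for each file.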

Using Patches

When someone changes a file that other people have copies of
(source code, documentation, or just about any other text file),
they often send patches instead of (or in addition to) making the
entire new file available. If you have the old file and the
patches, you might wish that you could have a program apply the
patches. You might think that normal diff format, which was made to
look like input to the ed program, would be the best way to
accomplish this.

As it turns out, this is not true.

A program called patch is designed
specifically to apply patches to
files (that is, to change the files as specified in the patch). It
recognizes all of these patch formats and applies them. With
unified and context diffs, patch can usually apply a patch
even if lines have been added to or removed from the
file since the patch was made, by looking for the unchanged context
lines. Only if the context lines have themselves been changed is
patch likely to fail.

To apply patches with patch, you normally have a file
containing the patch (we'll call it
patchfile), and then run patch:

patch < patchfile

Patch is very verbose. If it gets confused by anything, it
stops and asks you in English (it was written by a linguist, not a
computer scientist) what you want to do. If you want to learn more
about patch, the man page is unusually readable.
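
A minimal sketch of the whole round trip, including the case where the recipient's copy has changed since the patch was made. The file names (old.txt, new.txt, mine.txt) are invented for illustration. The sketch names the file to patch explicitly rather than relying on patch to choose one from the header.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# The sender's side: an old and a new version, and the patch between them.
seq 1 20 > old.txt
sed '10s/.*/ten/' old.txt > new.txt
diff -u old.txt new.txt > patchfile || true

# The recipient's side: a copy of the old file that has since gained
# an extra line at the top, so the patch's line numbers are off by one.
{ echo 'zero'; cat old.txt; } > mine.txt

# patch finds the right place by matching the context lines.
patch mine.txt < patchfile

sed -n '11p' mine.txt    # prints: ten
```

patch reports that it applied the chunk at an offset, and the recipient's extra first line is preserved.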

Other Related Tools

If you read the RCS article in the May issue (Take
Command: Keeping Track of Change,
LJ #25, May 1996), you may have noticed that
the article talked a bit about a program called rcsdiff. rcsdiff is
really just a front end to diff. That is, it looks for arguments
that it understands (such as revision numbers and the filename) and
prepares two files representing the two versions of the file you
are examining. It then calls diff with the remaining options. The
RCS article used -u to get the unified format
without explaining what it meant, but you can use
-c to get context diffs, or use -U
lines to choose the amount of
context you get in a unified diff, or use any other diff options
you like.

You may notice that rcsdiff produces more verbose output than
normal diff. From the RCS article:

It looks just like a normal unified diff except for the first
5 lines.

This doesn't prevent you from sending patches to people. The
patch program is extremely good about ignoring extraneous
information. It can even ignore news or mail headers, extra
comments written in a file outside a patch, and people's signatures
following patches. Patch tells you when it is determining whether
text is part of a patch or not by saying
“Hmm...”

If you don't care how two files differ,
but just want to know whether they differ, the
cmp program will tell you. It works not only for text files, but
also for binary files. In this example, the files 5 and 6 are
different; 2 and 4 are the same:

cmp 5 6
5 6 differ: char 159, line 4
cmp 2 4

Notice that when two files are the same, cmp doesn't say
anything at all. It reports explicitly only when the files
differ. For use in shell scripts, cmp also returns
true (exit status 0) if the files are the same and false (nonzero)
if they differ, as shown by this shell session:
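
A minimal sketch of such a session, using files a.txt, b.txt, and c.txt invented for illustration. The -s option tells cmp to print nothing and report only through its exit status.

```shell
# Hypothetical files: a.txt and b.txt match, c.txt does not.
printf 'same\n' > a.txt
printf 'same\n' > b.txt
printf 'different\n' > c.txt

# cmp succeeds (status 0) when the files are identical.
if cmp -s a.txt b.txt; then
    echo "a.txt and b.txt are identical"
fi

# Capture the status of a failing comparison without stopping a
# script that runs under "set -e".
status=0
cmp -s a.txt c.txt || status=$?
echo "cmp a.txt c.txt returned $status"    # prints: cmp a.txt c.txt returned 1
```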

There are several other programs with related functionality.
In particular, diff3 can be used to merge together two different
files that have both been edited from a common ancestor file. That
common ancestor must exist in order for diff3 to work
correctly.
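
A minimal sketch of such a merge, assuming GNU diff3 and its -m option. The file names (base.txt, mine.txt, yours.txt, merged.txt) are invented for illustration. The order of the arguments matters: the common ancestor goes in the middle.

```shell
# Hypothetical common ancestor and two independently edited copies.
seq 1 10 > base.txt
sed '2s/.*/two/'  base.txt > mine.txt     # my edit touches line 2
sed '9s/.*/nine/' base.txt > yours.txt    # your edit touches line 9

# Merge both sets of changes; with no overlapping edits there are no
# conflicts and diff3 exits with status 0.
diff3 -m mine.txt base.txt yours.txt > merged.txt

sed -n '2p;9p' merged.txt    # prints "two" then "nine"
```

If both copies had changed the same line in different ways, diff3 would mark the conflict in the output instead of choosing one version silently.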

The info pages which are shipped with diff are probably
installed on your system. If you want to learn more about diff, try
the command info diff or use info mode from
within emacs or jed.

diff, wdiff, patch, and emacs are available via ftp from the
canonical GNU ftp archive, prep.ai.mit.edu, in the directory
/pub/gnu/.