Generic Diff Format Specification

Status of this Document

This document is a NOTE made available by the W3 Consortium for discussion
only. This indicates no endorsement of its content, nor that the Consortium
has, is, or will be allocating any resources to the issues addressed by the
NOTE.

This document was submitted and should be considered along with
NOTE-drp.

The Consortium is pleased to have received this specification as a submission,
and acknowledges the interest in such a mechanism indicated by several
member companies.

The proposal will be discussed at the W3C workshop on "push" technology in
September.

That workshop will give participants a chance to comment on the path the
Consortium should take. At this time we are not promising any allocation
of further W3C resources to this proposal after the Push Workshop.

This technology impacts URI specifications and HTTP, and includes a new XML MIME
type. Much of the functionality should be able to be provided by the Resource
Description Framework currently being developed in the metadata activity.
There are other components of the DRP proposal that have requirements in
common with other W3C and IETF activities (notably DSig Manifests and WEBDAV
versioning) and should this proposal be actively considered, W3C will work
to ensure the harmonization of the related activities.

Abstract

This document provides a specification of a generic file format for representing
the differences between two files.

Table of Contents

1. Introduction

2. The Generic Diff Format

3. Conclusion

3.1 References

1. Introduction

This document describes the Generic Diff Format (GDIFF). The GDIFF format
can be used to efficiently describe the differences between two arbitrary
files. The format does not make any assumptions about the type or contents
of the files, and thus can be used to describe the differences between text
files as well as binary files. The GDIFF format is itself a binary file format.

This proposal does not describe how to compute the differences between two
files. It only defines how the resulting differences can be described in
an efficient and generic manner. The proposal describes the GDIFF file format
and how to interpret it.

2. The Generic Diff Format

The GDIFF format is primarily useful in applications which compute the
differences between two versions of a file. The resulting differences can
be stored in a file using the GDIFF format. The differences described by
the GDIFF file can later be applied to the old file to obtain the new file.

The GDIFF format is particularly useful in situations where it is more efficient
to distribute the differences between two versions of a file, rather than
the entire new version of the file.

Any file differencing algorithm can be used to compute the differences between
the old and new versions of a file. An example is the rsync algorithm. The
result can be expressed using the GDIFF format that is described here.

To apply a GDIFF file, you need random access to the old version of the file,
and sequential access to the GDIFF file. The new version of the file is produced
on the output stream.

File Format

The GDIFF format is a binary format. The MIME type of a GDIFF file is
"application/gdiff". All binary numbers in a GDIFF file are stored in
big-endian format (most significant byte first).

Each diff stream starts with the 4-byte magic number (value 0xd1ffd1ff),
followed by a 1-byte version number (value 4). The version number is followed
by a sequence of 1-byte commands which are interpreted in order. The last
command in the stream is the end-of-file command (value 0).
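
The header layout just described can be checked with a few lines of code. The following is a minimal sketch, not part of the format definition: the names GDIFF_MAGIC, GDIFF_VERSION and read_header are hypothetical, and rejecting any version other than 4 is an assumption about how a reader might behave.

```python
import io
import struct

GDIFF_MAGIC = 0xD1FFD1FF   # 4-byte magic number from the format description
GDIFF_VERSION = 4          # 1-byte version number from the format description


def read_header(stream):
    """Read and validate the 5-byte GDIFF header; return the version number.

    `stream` is any binary file-like object positioned at the start of
    the diff stream. (Hypothetical helper, for illustration only.)
    """
    header = stream.read(5)
    if len(header) != 5:
        raise ValueError("truncated GDIFF header")
    # ">IB": big-endian, unsigned 32-bit magic followed by unsigned 8-bit version.
    magic, version = struct.unpack(">IB", header)
    if magic != GDIFF_MAGIC:
        raise ValueError("not a GDIFF stream (bad magic number)")
    if version != GDIFF_VERSION:
        raise ValueError("unsupported GDIFF version: %d" % version)
    return version


# Smallest possible stream: header followed by the end-of-file command.
empty_diff = io.BytesIO(bytes.fromhex("d1ffd1ff04") + b"\x00")
print(read_header(empty_diff))
```

After read_header returns, the stream is positioned at the first command byte.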

The GDIFF commands are listed in the table below.

Name   Cmd    Followed By          Action
EOF    0      (none)               End of File
DATA   1      1 byte               append 1 data byte
DATA   2      2 bytes              append 2 data bytes
DATA   <n>    <n> bytes            append <n> data bytes
DATA   246    246 bytes            append 246 data bytes
DATA   247    ushort, <n> bytes    append <n> data bytes
DATA   248    int, <n> bytes       append <n> data bytes
COPY   249    ushort, ubyte        copy <position>, <length>
COPY   250    ushort, ushort       copy <position>, <length>
COPY   251    ushort, int          copy <position>, <length>
COPY   252    int, ubyte           copy <position>, <length>
COPY   253    int, ushort          copy <position>, <length>
COPY   254    int, int             copy <position>, <length>
COPY   255    long, int            copy <position>, <length>

There are two kinds of GDIFF commands. The first kind is the DATA command
(1 through 248). Each data command is followed by a number of data bytes
which are copied onto the output stream.

The second kind of GDIFF command is the COPY command (249 through 255). Each
COPY command is followed by two arguments: position and length. The arguments
specify the portion of the old file that must be copied onto the output stream.
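
The two kinds of commands can be interpreted with a small loop. The sketch below is one possible reading of the command table, not a reference implementation: the function name apply_gdiff and its helpers are hypothetical, and treating a truncated stream or a negative argument as an error is an assumption. It needs random access to the old file and only sequential access to the GDIFF stream.

```python
import io
import struct

# Argument layouts as big-endian struct format strings.
DATA_LENGTH_FORMATS = {247: ">H", 248: ">i"}
COPY_FORMATS = {
    249: ">HB", 250: ">HH", 251: ">Hi",
    252: ">iB", 253: ">iH", 254: ">ii",
    255: ">qi",
}


def _read_exact(stream, count):
    """Read exactly `count` bytes or raise if the stream ends early."""
    if count < 0:
        raise ValueError("negative length in GDIFF stream")
    data = stream.read(count)
    if len(data) != count:
        raise ValueError("unexpected end of stream")
    return data


def apply_gdiff(old, patch, out):
    """Apply the GDIFF stream `patch` to the seekable stream `old`,
    writing the new file to `out`."""
    magic, version = struct.unpack(">IB", _read_exact(patch, 5))
    if magic != 0xD1FFD1FF or version != 4:
        raise ValueError("not a version 4 GDIFF stream")
    while True:
        cmd = _read_exact(patch, 1)[0]
        if cmd == 0:                      # EOF
            return
        if cmd <= 246:                    # DATA, length is the command itself
            out.write(_read_exact(patch, cmd))
        elif cmd <= 248:                  # DATA, explicit ushort or int length
            fmt = DATA_LENGTH_FORMATS[cmd]
            (length,) = struct.unpack(fmt, _read_exact(patch, struct.calcsize(fmt)))
            out.write(_read_exact(patch, length))
        else:                             # COPY <position>, <length>
            fmt = COPY_FORMATS[cmd]
            position, length = struct.unpack(
                fmt, _read_exact(patch, struct.calcsize(fmt)))
            if position < 0:
                raise ValueError("negative position in GDIFF stream")
            old.seek(position)
            out.write(_read_exact(old, length))


# Usage with illustrative streams: copy "AB", insert "XY", copy "CD".
old = io.BytesIO(b"ABCDEFG")
patch = io.BytesIO(bytes.fromhex("d1ffd1ff04")
                   + bytes([249, 0, 0, 2])      # COPY position 0, length 2
                   + bytes([2]) + b"XY"         # DATA, 2 bytes
                   + bytes([249, 0, 2, 2])      # COPY position 2, length 2
                   + bytes([0]))                # EOF
out = io.BytesIO()
apply_gdiff(old, patch, out)
print(out.getvalue())  # b'ABXYCD'
```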

If a number larger than 2^31-1 bytes is needed for a command
that takes only int arguments, the command must be split
into multiple commands. This may be necessary when dealing with very large
files.
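
Splitting an oversized COPY might look like the sketch below. The helper name encode_long_copy is hypothetical, and always emitting command 255 (long position, int length) is a simplifying assumption; an encoder is free to pick a shorter command when the arguments fit.

```python
import struct

INT_MAX = 2**31 - 1  # largest value of the signed 32-bit int type


def encode_long_copy(position, length):
    """Encode a COPY of `length` bytes starting at `position` as one or
    more command-255 records, each with a length of at most INT_MAX."""
    encoded = bytearray()
    while length > 0:
        chunk = min(length, INT_MAX)
        # ">Bqi": command byte, 64-bit signed position, 32-bit signed length.
        encoded += struct.pack(">Bqi", 255, position, chunk)
        position += chunk   # the next command continues where this one ends
        length -= chunk
    return bytes(encoded)


# A copy of 2^31 bytes does not fit in one int length, so it becomes two commands.
commands = encode_long_copy(0, 2**31)
print(len(commands) // 13)  # each command-255 record is 13 bytes long
```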

Types

byte - 8 bit signed

ubyte - 8 bit unsigned

ushort - 16 bit unsigned, most significant byte first

int - 32 bit signed, most significant byte first

long - 64 bit signed, most significant byte first
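
For implementers, these types map directly onto fixed-width big-endian integers. The sketch below shows one way to decode them with Python's struct module; the table GDIFF_TYPES and the function decode are hypothetical names introduced only for illustration.

```python
import struct

# GDIFF type name -> struct format character (used with the ">" big-endian prefix).
GDIFF_TYPES = {
    "byte": "b",    # 8 bit signed
    "ubyte": "B",   # 8 bit unsigned
    "ushort": "H",  # 16 bit unsigned
    "int": "i",     # 32 bit signed
    "long": "q",    # 64 bit signed
}


def decode(type_name, raw):
    """Decode the bytes `raw` as a single value of the named GDIFF type."""
    (value,) = struct.unpack(">" + GDIFF_TYPES[type_name], raw)
    return value


print(decode("ushort", b"\x01\x02"))       # 0x0102 = 258
print(decode("int", b"\xff\xff\xff\xff"))  # all bits set, signed: -1
```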

Example

To illustrate the use of the GDIFF format we will use two input streams
old and new as an example and prepare a simple GDIFF file by
hand:

Note that in this case the resulting GDIFF file is larger than the new file.
This is not normally the case when the files get larger. Also note that there
can be many different GDIFF files which produce the same result. The size
of the resulting GDIFF file largely depends on the similarity between the
two input files, and the ability of the diff algorithm to find a small
set of differences.
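
The effect on size can be reproduced with a short script. Everything in the sketch below is an illustrative assumption: the byte strings old and new, the chosen list of operations, and the helper names copy_cmd and data_cmd. The helpers pick the shortest command whose argument types fit, and assume that lengths have already been split to fit in an int.

```python
import struct


def copy_cmd(position, length):
    """Encode one COPY with the shortest command (249..255) that fits."""
    if position > 2**31 - 1:
        return struct.pack(">Bqi", 255, position, length)
    pos_fmt, base = ("H", 249) if position <= 0xFFFF else ("i", 252)
    if length <= 0xFF:
        len_fmt, offset = "B", 0
    elif length <= 0xFFFF:
        len_fmt, offset = "H", 1
    else:
        len_fmt, offset = "i", 2
    return struct.pack(">B" + pos_fmt + len_fmt, base + offset, position, length)


def data_cmd(payload):
    """Encode one DATA command (1..248) carrying `payload`."""
    if not payload:
        raise ValueError("DATA needs at least one byte (0 is the EOF command)")
    if len(payload) <= 246:
        return bytes([len(payload)]) + payload
    if len(payload) <= 0xFFFF:
        return struct.pack(">BH", 247, len(payload)) + payload
    return struct.pack(">Bi", 248, len(payload)) + payload


old = b"ABCDEFG"
# One of many possible operation lists that turn `old` into "ABXYCDBCDE".
ops = [("copy", 0, 2), ("data", b"XY"), ("copy", 2, 2), ("copy", 1, 4)]

patch = struct.pack(">IB", 0xD1FFD1FF, 4)   # magic number and version
new = b""
for op in ops:
    if op[0] == "copy":
        _, position, length = op
        patch += copy_cmd(position, length)
        new += old[position:position + length]
    else:
        patch += data_cmd(op[1])
        new += op[1]
patch += b"\x00"                            # EOF command

print(new, len(new), len(patch))  # b'ABXYCDBCDE' 10 21
```

With inputs this small, the 5-byte header, the EOF command and the per-command arguments outweigh the bytes saved by copying, so the 21-byte GDIFF stream is larger than the 10-byte new file.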

3. Conclusion

The GDIFF format is a simple diff format that can efficiently describe the
differences between two files of any type. It is defined in this proposal
to provide a simple interoperability format for binary differencing.