Monday, November 2, 2015

More on bsdiff and delta compression

bsdiff is a simple delta compression algorithm, and it performs well compared to its open and closed source competitors. (Note bsdiff doesn't scale to large files due to memory, but that's somewhat fixable by breaking the problem up into blocks.) It also beats LZHAM in static dictionary mode by a large margin, and I want to understand why.

It conceptually works in three high-level phases:

1. Build suffix array of original file

2. Preprocess the new file, using the suffix array to find exact or approximate matches against the original file. This results in a series of 24 byte patch control commands, followed by a "diff" block (bytes added to various regions from the original file) and an "extra" block (unmatched literal data).

3. Separately compress this data using bzip2, with a 512KB block size.

As a quick experiment, switching step 3 to LZMA or LZHAM results in easy gains (no surprise as bzip2 is pretty dated):

The bsdiff control block data can be viewed as a sort of program that outputs the new file, given the old file and a list of "addcopyskip x,y,z" instructions along with three automatically updated machine "registers": the offsets into the old file data, the diff bytes block, and the extra bytes blocks. bsdiff then compresses this program and the two data blocks (diff+extra) individually with bzip2.

I think there are several key reasons why bsdiff works well relative to LZHAM with a static dictionary (and LZMA too, except it doesn't have explicit support for static dictionaries):

- LZHAM has no explicit way of referencing the original file (the seed dictionary). It must encode high match distances to reach back into the seed dictionary, beyond the already decoded data. This is expensive, and grows increasingly so as the dictionary grow larger.

- bsdiff is able to move around its original file pointer either forwards or backwards, by encoding the distance to travel and a sign bit. LZHAM can only used previous match distances (from the 4 entry LRU match history buffer), or it must spend many bits coding absolute match distances.

- LZHAM doesn't have delta matches (bsdiff's add operation) to efficiently encode the "second order" changes talked about in Colin Percival's thesis. The closest it gets are special handling for single byte LAM's (literals after matches), by XOR'ing the mismatch byte vs. the lookahead byte and coding the result with Huffman coding. But immediately after a LAM it always falls back to plain literals or exact matches.

- The bsdiff approach is able to apply the full bzip2 machinery to the diff bytes. In this file, most diff bytes are 0 with sprinklings of 2 or 3 byte deltas. Patterns are readily apparent in the delta data.

After some thought, one way to enhance LZHAM with stronger support for delta compression: support multibyte LAM's (perhaps by outputting the LAM length with arithmetic coding when in LAM mode), and maybe add the ability to encode REP matches with delta distances.

Or, just use LZHAM or LZMA as-is and just do the delta compression step as a preprocess, just like bzip2 does.

About Me

Back in the day I worked for several years at Digital Illusions on things like the first shipping deferred shaded game ("Shrek" - 2001), software renderers, and game AI. Then, after working for Microsoft at Ensemble Studios for 5 years as engine lead on Halo Wars, I took a year off to create "crunch", an advanced DXTc texture compression library. I then worked 5 years at Valve, where I contributed to Portal 2, Dota 2, CS:GO, and the Linux versions of Valve's Source1 games. I was one of the original developers on the Steam Linux team, where I worked with a (somewhat enigmatic) multi-billionare on proving that OpenGL could still hold its own vs. Direct3D. I also started the vogl (Valve's OpenGL debugger) project from scratch, which I worked on for over a year. In my spare time I work on various open source lossless and texture compression projects: crunch, LZHAM, miniz, jpeg-compressor, and picojpeg.