New Options in the World of File Compression

In the beginning, there was compress. People used it on their data files,
and it was good. Their files became smaller, and more data could be
crammed onto disc platters than ever before. The joys of LZW compression
were enjoyed by many.

In 1992, something better came along: gzip. It used a different
compression algorithm (LZ77 plus Huffman coding) that provided even smaller
files. As a bonus, it was free of any pesky patent-encumbered algorithms.
The masses were even happier, as they could cram even more data into their
systems, and no royalty payments for patents were required.

In 1996, Julian Seward released bzip2, which used a combination of the
Burrows-Wheeler transform and other compression techniques to achieve
compression performance even better than gzip's. It required more CPU
power and more memory, but, with the ever-escalating capabilities of
computers, this became less and less of an issue over time.

For many years, gzip and bzip2 were the de facto compression standards in
the world of free software, with gzip being used on time-sensitive
compression tasks, and bzip2 being used when maximum file compression was
desired.

However, in the year 2000, something new came along. Igor Pavlov released
a program called 7-zip, which featured a new algorithm called LZMA. This
algorithm provided very high compression ratios, although it did require
major RAM and CPU time.

Unfortunately, there were two problems that made 7-zip less than ideal for
Linux/BSD/Unix users. The first is that it was written for Microsoft
Windows. Eeek! This was thankfully addressed in 2004, with the release of
a cross-platform port called p7zip. A second problem is that 7-zip (and
p7zip) used a file format called .7z. This is a multi-file archive format
similar in functionality to .zip files. Unfortunately, with its
Windows-based roots, the .7z file format made no provision for Unix-style
permissions, user/group information, access control lists, or other such
information. These limitations are show-stoppers for people doing backups
on multi-user systems.

Then in 2004, Igor Pavlov released the LZMA SDK (Software Development Kit).
Though intended for application writers, this development kit also
contained a little gem of a command-line utility called lzma_alone. This
program could be used much like gzip and bzip2, to create .lzma files.
When combined with tar, this provided excellent file compression with
proper Unix compatibility.

Less than a year after the release of the LZMA SDK, Lasse Collin released
the LZMA Utils. This was initially a set of wrapper scripts around
lzma_alone that provided lzma (with command-line options very similar to
those of gzip and bzip2) instead of the less common p7zip-style options
used by lzma_alone. Later lzma releases were entirely in C. Then, in
2009, Lasse Collin released the XZ Utils, xz being the main utility. This
new utility continues to use LZMA compression, but, instead of producing
raw LZMA data streams, it wraps the resulting data stream in a well-defined
file format containing various magic bytes, stream flags, and cyclic
redundancy checks. Thus was born the .xz file format.

In 2008, Antonio Diaz released a similar utility called lzip. Like xz, it
uses LZMA compression, but, instead of creating .xz files, it creates .lz
files. This format is different in detail, but has many of the same
features as .xz files, such as magic bytes, cyclic redundancy checks, etc.
Additionally, lzip can create multi-member files, and can split output into
multiple volumes.

As of this writing, there are now four command-line utilities (and three
file formats) that use LZMA, providing excellent file compression results:
lzma_alone by Igor Pavlov, lzma and xz by Lasse Collin, and lzip by Antonio
Diaz. Does this mean we're in for a VHS/Betamax-style format war? It's
hard to say. (Fortunately, you're not limited to using just one. These
are utilities, not VCRs. There's plenty of room for all of them on your
hard drive. I have all four on mine.)

I myself prefer lzma_alone, as it's maintained by the person who actually
invented the LZMA algorithm and understands it best. However, the file
format is minimal, and xz and lzip offer significant advantages with their
magic bytes and data integrity checks. It's also difficult to build
lzma_alone, and it has no manpage. The XZ Utils are the easiest to build (as
they feature a modern autotools-based configure script), but they currently
lack manpages for the main xz and lzma utilities. Lzip falls in between.
It requires some manual hacking to get the compiler flags the way you want,
but it does contain a nice manpage.

At some point, one of these may become the predominant way of using LZMA
compression, but, as of today, you can expect to see all three file formats
out there "in the wild". I know I have.

How do these utilities compare in compression performance? It turns
out, there's little difference. Here's a table of results, showing how
lzma_alone, xz, lzip, bzip2, gzip, and compress perform on the source
tarball from ghostscript-8.64. (I skipped Lasse Collin's lzma, since it's
just a symlink to xz, now.) Exact versions were lzma_alone-4.65,
xz-4.999.8beta, lzip-1.5, bzip2-1.0.5, gzip-1.3.12, and ncompress-4.2.4.2.

Compression results on all three LZMA-based utilities were quite similar,
with lzma_alone doing the best by a whisker. All three did much better
than bzip2, gzip, and compress, though taking much longer. Lzip
decompression was about 30% slower than the other two LZMA-based utilities,
but it's still markedly faster than bzip2.

How can you take advantage of these new utilities? Well, if you're lucky,
your distribution will have one or more of these available as pre-compiled
packages. I run Debian (Lenny) 5.0, which has lzma_alone and an earlier
version of Lasse Collin's LZMA Utils (which contains lzma, but not xz)
available. For those not provided by your distribution, you'll have to
download the source code and compile it yourself. Here are links to the
three major programs:

For those who wish to build lzma_alone, I offer this tarball: lzma_alone_patches.tar.bz2,
which contains some minimal patches, build instructions, and a manpage. To
use it, you'll still need to download the original LZMA SDK from the
Web site mentioned above. As for the XZ Utils and Lzip, they are quite
straightforward to build and install.

How can you convert existing tarballs from one file compression scheme to
another? With Unix pipes, of course. The following examples show how:
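
As a minimal sketch of such a pipeline: the old file is decompressed to
standard output and fed straight into the new compressor, so the
uncompressed tarball never touches the disk. The helper function recompress
and the file names are purely illustrative and not part of any of these
packages. The sketch assumes both tools accept the gzip-style -d and -c
options, as gzip, bzip2, lzma, xz, and lzip do; lzma_alone uses a different
command-line syntax and is not covered here.

```shell
# recompress DECOMPRESSOR COMPRESSOR INPUT OUTPUT
# Decompress INPUT to stdout and pipe it directly into COMPRESSOR.
# Assumes both programs understand the gzip-style -d and -c options.
recompress() {
    "$1" -dc "$3" | "$2" -c > "$4"
}

# Hypothetical file names, shown only as usage examples:
#   recompress gzip  xz   example.tar.gz  example.tar.xz
#   recompress bzip2 lzip example.tar.bz2 example.tar.lz
#   recompress gzip  lzma example.tar.gz  example.tar.lzma
```

The one-off equivalent, typed directly at the prompt, would be something
like: gzip -dc example.tar.gz | xz -c > example.tar.xz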

If you have many tarballs to convert, you might consider
downloading and installing the littleutils
package. This package contains three scripts (to-lzma, to-xz, and to-lzip)
that will convert multiple gzip- and bzip2-compressed files into
.lzma, .xz, or .lz format, respectively. The -k option is particularly
useful, as it will delete the original file only if the new one is smaller.
Otherwise the original file will be preserved. To convert an entire
directory of tarballs to .lzma format, simply type the following:

to-lzma -k *.tar.gz *.tar.bz2
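
The rule behind -k can be sketched in a few lines of shell. This is only an
illustration of the idea, not the code that littleutils actually uses. The
function name keep_smaller is hypothetical, and discarding the new file when
it is not smaller is an assumption.

```shell
# keep_smaller ORIGINAL NEW
# Keep whichever of the two files is smaller and delete the other.
# Assumption: when NEW is not smaller, NEW is discarded and ORIGINAL stays.
keep_smaller() {
    orig_size=$(( $(wc -c < "$1") ))
    new_size=$(( $(wc -c < "$2") ))
    if [ "$new_size" -lt "$orig_size" ]; then
        rm -f -- "$1"    # new file wins; remove the original
    else
        rm -f -- "$2"    # no gain; keep the original
    fi
}
```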

After that shameless plug for my own software, I'll conclude this article
by urging people to start using at least one of these LZMA-based
compression utilities, particularly if you distribute compressed tarballs.
LZMA-compressed files take less time to download, and take less time to
decompress (at least compared to bzip2). Even in a world of broadband
Internet connections, multi-gigahertz processors, and cavernous hard
drives, these utilities will save time and space.

Brian Lindholm is a Virginia Tech graduate and middle-aged mechanical engineer
who started programming in BASIC on a TRS-80 Model I (way back in 1980). In
the late eighties, he moved to Pascal and C on an IBM PC-compatible.

Over the years, Brian became increasingly disgruntled with the instability
and expense of the various Microsoft operating systems. In particular,
he hated not being in full control of his system. Most
fortunately for him, however, he had a college roommate who ran Linux (way
back in the Linux 0.9 and Slackware 1.0 days). That introduction was all
he needed.

Over the years, he's slowly learned more and more, and now manages to keep his
Debian system running happy and stable (even through four major upgrades: 2.2
to 3.0 to 3.1 to 4.0 to 5.0). [A point of note: His Debian system has NEVER
crashed on its own. EVER. Only power failures, attempts to boot off the wrong
partition, errant hits of the reset button, a cracked DVD, and a particularly
flaky IDE Zip drive ever managed to take it down.] He
loves VIM and has found Perl amazingly useful at work.

In his non-Linux life, Brian helps design power generation equipment (big power
plant stuff) for a living, occasionally records live music for people, reads
too much science fiction, and gets out on the Appalachian Trail as often as he
can.