Every so often, I dink around with benchmarking common lossless compressors. One of the best sites for it is, I think, Werner Bergmans’ Maximum Compression, a rather comprehensive running benchmark of just about every lossless compressor under the sun. Really, there are a lot of them. What you have to understand about the world of compressors is that they are very often academic projects or toys that very smart people play with in their free time. There are also companies (but not many) that invest in their own proprietary algorithms for lossless compression.

Here’s the catch, though: the quality of a compressor isn’t measured by its compression ratio alone. The PAQ series of compressors[1], for instance, offers great compression and really, truly awful compression times. The same goes for the highest compression levels of WinRK (a proprietary Win32 format with an accompanying GUI). But disk is cheap: nobody really cares about a fraction of a percentage of compression efficiency, do they? What people really want is for their (inevitable) archiving GUI to take less time doing what it does.

In this spirit, I have compiled not so much an exhaustive list of possible compression algorithms (I’ll leave that to Werner, who is very good at what he does) as a short list of the most common formats, tested on three different (relatively well-known) corpuses: the Calgary Corpus, the newer Canterbury Corpus, and Andrew Tridgell’s 1999 Large Corpus[2]. The first two of these are the corpuses used to test the very kind of academic project I’ve avoided. I dislike using them because they are small, which means there is significantly less opportunity for variations between compression formats to manifest themselves. In the interest of verifiability, however, I have used them. I also included Tridgell’s large corpus because it’s been my experience that results on small test corpuses vary too much due to disk I/O latency and other benchmarking vagaries.
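For the curious, numbers of this sort can be gathered with a very small harness that simply times each command-line tool against a tar of the corpus and records the resulting file size. A rough sketch follows; the corpus path and the exact flags are placeholders for illustration, not my actual script:

```python
import os
import subprocess
import time

# Hypothetical corpus tarball; substitute a tar of Calgary, Canterbury, etc.
TARBALL = "corpus.tar"

# (command line to run, name of the file it produces)
RUNS = [
    (["gzip", "-kf", "-1", TARBALL], TARBALL + ".gz"),
    (["gzip", "-kf", "-9", TARBALL], TARBALL + ".gz"),
    (["bzip2", "-kf", "-9", TARBALL], TARBALL + ".bz2"),
    (["7z", "a", "-mx=9", TARBALL + ".7z", TARBALL], TARBALL + ".7z"),
]

original = os.path.getsize(TARBALL)
for cmd, produced in RUNS:
    start = time.time()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.time() - start
    size = os.path.getsize(produced)
    # command, wall-clock time, compressed size, ratio against the raw tar
    print(f"{' '.join(cmd):<40} {elapsed:7.3f}s {size:>13,} {size / original:.3f}")
```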

What will follow is a data table for each corpus, followed by some brief observations about each.
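(In each table, the ratio is simply the compressed size divided by the size of the raw tar; gzip’s fast run on the Canterbury set, for example, produces 872,570 bytes from a 2,821,120-byte tar, for a ratio of roughly 0.309.)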

The Calgary Corpus dates back to the late 80s. It’s become the test to perform, but it may or may not adequately represent the standard compressor workload in 2008. You’ll notice that WinRAR’s maximum setting produces the smallest archive, and more quickly than the neighboring 7-Zip runs. Notice, too, that among the lowest times there tends to be a sort of “bottoming-out” point, where the compressor is no longer limited by the CPU but by the speed of the disk.

Canterbury Corpus

| Codec | Setting | Enc. Time (s) | Dec. Time (s) | Size (bytes) | Ratio |
|-------|---------|---------------|---------------|--------------|-------|
| tar   |         | 0.000 | 0.000 | 2,821,120 | 1.000 |
| gzip  | fast    | 0.140 | 0.062 | 872,570   | 0.309 |
| gzip  | default | 0.249 | 0.062 | 739,066   | 0.262 |
| gzip  | best    | 1.138 | 0.062 | 736,223   | 0.261 |
| bzip2 | fast    | 0.390 | 0.156 | 584,964   | 0.207 |
| bzip2 | default | 0.514 | 0.171 | 570,856   | 0.202 |
| bzip2 | best    | 0.390 | 0.156 | 570,856   | 0.202 |
| zip   | -1      | 0.140 | 0.078 | 872,795   | 0.309 |
| zip   | default | 0.343 | 0.062 | 739,286   | 0.262 |
| zip   | -9      | 1.170 | 0.062 | 736,443   | 0.261 |
| 7z    | 1       | 0.280 | 0.930 | 569,953   | 0.202 |
| 7z    | 6       | 1.950 | 0.124 | 487,919   | 0.173 |
| 7z    | 9       | 2.199 | 0.124 | 485,391   | 0.172 |
| rar   | m1      | 0.218 | 0.124 | 772,369   | 0.274 |
| rar   | m3      | 1.232 | 0.093 | 515,831   | 0.183 |
| rar   | m5      | 1.170 | 0.561 | 427,178   | 0.151 |

I’m still not entirely able to figure out the Canterbury Corpus; it’s ostensibly an “update” to the aging Calgary Corpus. One would think that, having been created more than a decade after its predecessor with the express purpose of more accurately representing the compressor workload of 2001, it would at least be larger (hard disks and file sizes have grown since 1989, believe it or not). In fact it isn’t, which was somewhat of a disappointment to me, as I saw entirely the same trends as with the previous corpus. Is that an accurate assessment of the algorithms in question? Maybe not; read on.

Large Corpus

| Codec | Setting | Enc. Time (s) | Dec. Time (s) | Size (bytes) | Ratio |
|-------|---------|---------------|---------------|--------------|-------|
| tar   |         | 0.000   | 0.000  | 247,933,952 | 1.000 |
| gzip  | fast    | 7.347   | 2.698  | 65,782,177  | 0.265 |
| gzip  | default | 13.072  | 3.151  | 53,870,968  | 0.217 |
| gzip  | best    | 21.855  | 2.449  | 53,536,722  | 0.216 |
| bzip2 | fast    | 40.591  | 9.360  | 52,791,871  | 0.213 |
| bzip2 | default | 54.506  | 10.567 | 39,372,759  | 0.159 |
| bzip2 | best    | 54.228  | 10.935 | 39,372,759  | 0.159 |
| zip   | -1      | 6.349   | 2.208  | 65,782,411  | 0.265 |
| zip   | default | 12.682  | 2.527  | 53,871,197  | 0.217 |
| zip   | -9      | 21.529  | 2.433  | 53,536,951  | 0.216 |
| 7z    | 1       | 19.578  | 6.608  | 47,343,400  | 0.191 |
| 7z    | 6       | 128.645 | 4.035  | 26,373,931  | 0.106 |
| 7z    | 9       | 172.677 | 3.712  | 24,722,887  | 0.100 |
| rar   | m1      | 9.016   | 4.446  | 48,939,730  | 0.197 |
| rar   | m3      | 125.128 | 3.868  | 31,916,951  | 0.129 |
| rar   | m5      | 138.435 | 23.852 | 29,200,310  | 0.118 |

Most interestingly, in Tridgell’s large corpus we finally see 7-Zip spring ahead of WinRAR in terms of pure compression ratio (and, in some cases, in speed too). I’m not an expert on compression, so I can’t tell you why certain efficiencies only manifest themselves over large datasets, but clearly 7-Zip wins in the more modern case where large datasets (mostly text, if Tridgell’s description is accurate) are involved.
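One plausible explanation is window size: DEFLATE, the engine behind gzip and zip, can only point back at matches within a 32 KB sliding window, while LZMA’s dictionary can span many megabytes, so redundancy spread widely across a large file is invisible to the former and easy pickings for the latter. A toy demonstration of the effect, sketched here with Python’s zlib and lzma modules purely as an illustration (the block size and preset are arbitrary):

```python
import os
import zlib
import lzma

# Build data with long-range redundancy: one 1 MB random block repeated
# 16 times. The repeats are 1 MB apart, far beyond DEFLATE's 32 KB window
# but comfortably inside LZMA's multi-megabyte dictionary.
block = os.urandom(1024 * 1024)
data = block * 16

deflate_size = len(zlib.compress(data, 9))      # gzip/zip-style DEFLATE
lzma_size = len(lzma.compress(data, preset=6))  # LZMA, default-ish preset

print(f"original: {len(data):>12,} bytes")
print(f"DEFLATE:  {deflate_size:>12,} bytes")  # ~16 MB: the repeats go unnoticed
print(f"LZMA:     {lzma_size:>12,} bytes")     # ~1 MB: the repeats collapse to matches
```

On files of a few hundred kilobytes, like those in the Calgary and Canterbury sets, there is far less long-range redundancy to exploit, which is presumably part of why the small corpuses understate the gap.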

Clearly, the LZMA algorithm (the heart of 7-Zip) is something to be proud of; not only is it GPL, but it often outperforms the popular WinRAR in both pure compression and efficiency. I’m a little surprised that the 7-Zip *nix port, p7zip, hasn’t gained more traction on Linux, but I suppose old ways die hard. The cheapness of disk and bandwidth nowadays rather points to transparent compression as the ideal, rather than to whichever archiving format posts the best purely numeric results.

For those of you looking for a decent free archiving program, check 7-Zip out; for those of you who lust after data tables of compression benchmarks, give Werner’s a look: it’ll satiate your desire for tabular results in ways you never thought possible.

1. These compressors are, for the record, GPL; this phenomenon is actually pretty rare. For some unknowable reason, much of the research work in compression has gone on in the land of Windows and its sometimes-associated proprietary formats. Meanwhile, Linux has had gzip, bzip2, and zlib, and that’s about it, with a few notable exceptions. [↩]

2. A note of history: Andrew Tridgell, usually associated with the Samba project, developed a ‘large’ corpus in 1999 to test his fork of bzip2, known as rzip. This latter format was tuned to allow for better compression on large files by implementing a much larger history buffer. [↩]