Technical Details

Maximum Expansion Factor

zlib's compression method, an LZ77 variant called deflation, emits compressed data as a sequence of blocks. Various block types are allowed, one of which is stored blocks—these are simply composed of the raw input data plus a few header bytes. In the worst possible case, where the other block types would expand the data, deflation falls back to stored (uncompressed) blocks. Thus for the default settings used by deflateInit(), compress(), and compress2(), the only expansion is an overhead of five bytes per 16 KB block (about 0.03%), plus a one-time overhead of six bytes for the entire stream. Even if the last or only block is smaller than 16 KB, the overhead is still five bytes. In the absolute worst case of a single-byte input stream, the overhead therefore amounts to 1100% (eleven bytes of overhead, one byte of actual data). For larger stream sizes, the overhead approaches the limiting value of 0.03%.

deflateInit2() includes arguments for windowBits and memLevel that permit settings other than the defaults. These parameters can be used to adjust the memory required for the compressor as well as the decompressor for special applications, though usually with a reduction in compression. For various combinations of the allowed values for those parameters, the expansion can be larger than for the default settings. The worst case choice of parameters can result in an expansion of at most 13.5%, plus eleven bytes.

The deflateBound() and compressBound() functions can be used to provide an upper limit on the expansion in order to permit the allocation of an output buffer assured to be large enough to hold the entire compressed output. deflateBound() takes into account any deflateInit2() options.

Maximum Compression Factor

Empirically, the deflate method is capable of compression factors exceeding
1000:1. (The test case was a 50MB file filled with zeros; it compressed to
roughly 49 KB.) Mark loves to calculate stuff like this and reports
that the theoretical limit for the zlib format (as opposed to its
implementation in the currently available sources) is 1032:1. To
quote him,

The limit comes from the fact that one length/distance pair can
represent at most 258 output bytes. A length requires at least one
bit and a distance requires at least one bit, so two bits in can
give 258 bytes out, or eight bits in give 1032 bytes out. A dynamic
block has no length restriction, so you could get arbitrarily close
to the limit of 1032:1.

He goes on to note that the current implementation limits its dynamic
blocks to about 8 KB (corresponding to 8MB of input data); together with
a few bits of overhead, this implies an actual compression limit of about
1030.3:1. Not only that, but the compressed data stream is itself likely
to be rather compressible (in this special case only), so running it through
deflate again should produce further gains.

By way of comparison, note that a version of run-length encoding optimized
for this sort of unusual data file -- that is, by using 32-bit integers for
the lengths rather than the more usual 8-bit bytes or 16-bit words -- could
encode the test file in five bytes. That would be a compression factor of
10,000,000:1 (or 10.000.000:1 for you Europeans, or
107:1
for all of you engineers and scientists whose browsers support superscripts).

Finally, please note that this level of compression is extremely rare
and only occurs with really trivial files (e.g., a megabyte of zeros).
More typical zlib compression ratios are on the order
of 2:1 to 5:1.

Compression Factor Design Quirk

A design choice in the zlib implementation (as opposed to the zlib
and deflate specifications) limits match distances to
2windowBits - 262 rather than the
2windowBits that one might naively
expect. This limitation mainly affects applications that try to optimize
decoder memory usage by reducing the window size for small files; in some
cases, compression might be degraded because an otherwise valid match
(according to the spec) near the theoretical distance limit won't be found
by zlib.

One workaround would be to
set the window size to the next larger power of two for encoding (but no
more than the maximum size of 32 KB, obviously) and then modify the
CINFO field in the zlib header after the stream is compressed.
Such an approach is not recommended, however,
and should be attempted only by experts.

Memory Footprint

zlib's memory footprint can also be specified fairly precisely. It
is larger for compression than for decompression, and the exact requirements
depend on how the library was compiled.

The memory requirements for compression depend on two parameters,
windowBits and memLevel:

For the default values of 15 and 8, respectively, this is 256 KB. Both
windowBits and memLevel can be set to lower values at compile time via
the MAX_WBITS and MAX_MEM_LEVEL macros, but only at a cost in compression
efficiency.

The memory requirements for decompression depend only on windowBits,
but this is, in a sense, a harsher limitation: whereas data streams
compressed with a smaller window will merely be a bit larger than they
would have otherwise, a reduced window size for decompression means that
streams compressed with larger windows cannot be decompressed at all.
Having said that:

inflate memory usage (bytes) = (1 << windowBits) +
1440*2*sizeof(int)

Typically, therefore, inflate() requires no more than 44 KB of storage
on a 32-bit machine--this includes the 32768-byte sliding window and 11520
bytes of inflate_huft allocations. There are a few additional (fixed)
amounts of memory usage not included here, but they are small compared to
these items.

Adler-32 versus CRC-32

This section uses superscripts, which are not supported by some
older browsers.

Both Adler-32 and CRC-32 (cyclic redundancy check) are 32-bit checks.
But while the CRC can take on any 32-bit value (232 possibilities),
Adler-32 is limited to 655212 possibilities. So the probability
of a false positive on random errors for CRC-32 is
2.3283 x 10-10, whereas it is very slightly higher for
Adler-32 at 2.3294 x 10-10.

The above assumes that all the values are accessible given the amount of
data. That is true after only four bytes for the CRC-32, but Adler-32
requires, on the average, about 0.5 KB of data to get rolling--or
1 KB if it's ASCII data (text). So if the Adler-32 is used on
significantly less than about a kilobyte, it will be noticeably weaker than
a CRC-32 on the same small block.

A properly constructed CRC-n has the nice property that less than
n bits in error is always detectable. This is not always true for
Adler-32--it can detect all one- or two-byte errors but can miss some
three-byte errors. However, Adler-32 has been constructed to minimize the ways
to make small changes in the data that result in the same check value, through
the use of sums significantly larger than the bytes and by using a prime
(65521) for the modulus. It is in this area that some analysis is deserved,
but it has not yet been done.

This last potential weakness is not a major concern in the application of
Adler-32 to zlib (or any other history-based compressor), since if
there is an error at some point in a stream, it will be massively propagated
after that. It would be of concern in an application with transmission or
storage that has a borderline signal-to-noise ratio, for which small numbers
of random errors are expected. For that sort of application one would
certainly want to use a CRC or, better yet, Reed-Solomon error-correction
coding. But even in this case, if the data being transmitted or stored uses
some sort of history-dependent compression (as in zlib) and was
compressible to begin with, then an Adler-32 used after decompression
would be adequate since the decompressor would significantly amplify any small
errors in the compressed stream. (For incompressible data, most modern
compressors operate in a pass-through mode, so the original comment about
using a CRC or ECC holds.)

The main reason for Adler-32 is, of course, speed in software implementations.
The authors wanted a check on zlib's decompression, but not a significant
speed penalty just for the check. So Mark came up with the Adler-32 as a
faster but still effective alternative to the CRC-32.