Researchers are searching for some way to combine them, to get the benefits of both. [1]

A typical paragraph of text has several different kinds of patterns. Some archive software applies a series of different transformations, each transformation targeting a different kind of pattern in the text.

Such a transformation is often based on some kind of model of how the file was produced.

Compression software (especially image compression software) often sends the raw data through one or more stages of decorrelation -- such as Karhunen–Loève transform[2], color space transformation, linear filtering, DFT, transform coding, etc. -- the "precompressor" stages -- before feeding the result into an entropy compressor.

Many transforms used in compression, counter-intuitively, produce an intermediate result that requires more bits of dynamic range to store. Transforming raw 24-bit RGB pixels into 26-bit YCoCg-R pixels[2] or transforming raw 16-bit audio samples into 17-bit delta coded differences may seem to "waste" bits, but it improves net compression performance: In many cases, the entropy compressor gives smaller compressed file sizes when given such "de-correlated" intermediate data than when given the original raw data. (The improvement in results resulting from such transformations is often called "coding gain").
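The coding gain from delta coding can be sketched in a few lines. This toy example (all names are mine) measures the zeroth-order entropy a back-end entropy coder would see, before and after delta coding a smooth "audio-like" signal: the differences need a wider worst-case range, but they cluster near zero, so fewer bits per symbol are needed on average.

```python
# A minimal sketch of "coding gain": delta-coding a smooth signal
# concentrates values near zero, lowering the zeroth-order entropy
# that a back-end entropy coder sees.
import math
from collections import Counter

def entropy_bits_per_symbol(values):
    """Zeroth-order (Shannon) entropy of a symbol sequence, in bits."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# A slowly varying signal: neighboring samples are highly correlated.
signal = [int(1000 * math.sin(i / 50)) for i in range(5000)]

# Delta coding: store differences. The worst-case range grows by one
# bit, but typical values cluster tightly around zero.
deltas = [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]

raw_h = entropy_bits_per_symbol(signal)
delta_h = entropy_bits_per_symbol(deltas)
# delta_h comes out well below raw_h for this signal.
```

The same effect is what makes the image and audio transforms above worthwhile despite the extra bit of dynamic range.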

The Red, Green, and Blue layers of an image are typically highly correlated. So compressing the R, G, and B layers independently ends up storing much of the same data 3 times.

Even though "decorrelating" a 24-bit-per-pixel RGB image into YCoCg-R requires *more* bits (26 bits) to hold each pixel, decorrelating helps later stages in the compression pipeline do a better job. So independently compressing the Y, Co, and Cg planes usually results in a smaller compressed file than independently compressing the R, G, and B layers of the same file.
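The lossless YCoCg-R transform mentioned above is short enough to show in full: 8-bit R, G, B map to an 8-bit Y plus 9-bit (signed) Co and Cg, for 26 bits total, and the inverse recovers the original pixel exactly.

```python
# The lossless YCoCg-R forward and inverse transforms. The ">> 1"
# shifts make the transform exactly invertible in integer arithmetic.
def rgb_to_ycocg_r(r, g, b):
    co = r - b            # 9 bits, signed
    t = b + (co >> 1)
    cg = g - t            # 9 bits, signed
    y = t + (cg >> 1)     # 8 bits
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    t = y - (cg >> 1)     # each step undoes the matching forward step
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b
```

Every 24-bit RGB pixel round-trips exactly, which is what makes this transform usable in lossless pipelines.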

Speech compression often uses a relatively simple model of the human voice. The glottis produces a buzz at some frequency. The vocal tract resonates or dampens that frequency and its harmonics. The tongue and lips occasionally produce hissing sibilants and popping plosive sounds.

While in principle the linear predictive filter coefficients could be transmitted directly, typically the system is set up so (after entropy decoding) the received bits represent reflection coefficients,[3] log area ratios, line spectral pairs, or some other representation of the filter coefficients.

The "zlib compressed data format specification" is a standardized way of packaging compressed data in a file, with a simple header that indicates the type of compression and helps detect the most common kinds of data corruption. It is often used to store data compressed with DEFLATE with a window size up to 32K.

The LZX file format and compression algorithm were invented by Jonathan Forbes and Tomi Poutanen. LZX is used in a popular stand-alone Amiga file archiver. Microsoft Compressed HTML Help (CHM) files use LZX compression. Windows Imaging Format (".WIM") files support several data compression methods, including LZX. Xbox Live Avatars use LZX compression.

LZX uses both LZ77-style sliding window compression and dynamic Huffman codes, somewhat like DEFLATE uses both. The LZ77-style window size (WINDOW_SIZE) is some power-of-two, at least 2^15 and at most 2^21. LZX uses several Huffman tree structures, each of them canonical.

The most important Huffman tree in LZX is the "main tree", composed of 256 elements corresponding to all possible byte values, plus 8*NUM_POSITION_SLOTS elements corresponding to matches. (NUM_POSITION_SLOTS ranges from 30 slots to 50 slots.) The second most important Huffman tree in LZX is the "length tree", composed of 249 elements. Other trees have smaller roles.

The LZX compressor arbitrarily divides up a file into one or more blocks. Each block is one of 3 types: "verbatim block", "aligned offset block", "uncompressed block". The "uncompressed block" has a few more things in the header, and then a literal copy of the uncompressed data.

In the data portion of the "verbatim block" and "aligned offset block" blocks, LZX uses length-limited Huffman codewords; LZX limits each codeword to at most 16 bits wide. Each "verbatim block" and "aligned offset block" has a larger header that gives the information (in a very compact differentially-compressed form) required for the decoder to regenerate the various Huffman trees used in that block.

The header is decoded to get the codeword length for each symbol. Each length is 0 to 16 bits (inclusive), where a length of 0 indicates that the corresponding symbol never occurs in this block, and 16-bit codewords are used for symbols that occur extremely rarely. A variety of techniques are used to compress the header into relatively few bits.
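Because the trees are canonical, the codeword lengths alone are enough to reconstruct every codeword. A sketch of that reconstruction (the function name is mine): symbols are sorted by length, then by symbol index, and consecutive codewords are assigned in that order.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codewords from per-symbol bit lengths.
    A length of 0 means the symbol is absent from this block.
    Returns {symbol: (codeword, length)}."""
    # Canonical ordering: shorter codes first, ties broken by symbol index.
    symbols = sorted((l, s) for s, l in enumerate(lengths) if l > 0)
    codes = {}
    code = 0
    prev_len = symbols[0][0]
    for length, sym in symbols:
        code <<= (length - prev_len)   # widen the code to the new length
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes
```

For lengths [2, 1, 3, 3] this yields the prefix-free set 10, 0, 110, 111, which is exactly what a decoder needs to rebuild the tree from the compact header.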

When the "distance" value is decoded, there are several special values. It happens quite often that two long phrases of English text are *almost* identical, but there is a small typo or substitution in the middle of the otherwise-identical long phrase. With the original LZ77, the result is 3 copy items: [long distance X to first part, length of first part], [distance Y to the insertion, length of insertion], [exactly the same long distance X to start of second part, length of second part]. LZX, rather than repeat exactly the same long distance X twice, uses a special "distance" value to indicate that we are "continuing" to copy a previous block after making a substitution in the middle. (The special distance value ends up occupying fewer bits than the full long distance).

"distance" == 0: "repeat the most recent match distance, which was not itself a repeated distance"

"distance" == 1: "repeat the second-most-recent match distance, which was not itself a repeated distance"

"distance" == 2: "repeat the third-most-recent match distance, which was not itself a repeated distance"

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in a verbatim block is:

Read a codeword from the compressed text using the main tree and decode it into a "main element".

If the "main element" is less than 256, emit the "main element" as a literal plaintext byte.

If the "main element" is 256 or more, we're going to have a copy item.

Break the "main element - 256" into two parts: a "slot" and a 3 bit "length".

If the "length" is the special all-ones value "7", read another codeword from the compressed text using the "length tree" and decode it into a "long length"; add that to the 7 already in the length.

If the "slot" is in the range 0..3, then set the distance the same as the slot.

If the "slot" is in the range 4..37, then extra = ((slot >> 1) - 1) more bits are needed to get the exact distance.

Read those extra raw "footer" bits from the compressed text (not using any Huffman tree, just the raw bits). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the extra raw footer bits.

If the distance is 0..2, it indicates a "repeated distance" (as described above) -- replace it with the actual distance we used before.

Do a normal LZ77-style copy from the previously decoded plaintext: copy from a distance (as described above) a total of "length + 2" bytes into the decoded plaintext.

Repeat until we've decoded the total number of bytes that the header says the uncompressed plaintext should contain.
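The steps above can be sketched as a runnable loop. The Huffman and bit-reading layers are stubbed out as iterators so the control flow itself is visible, and the repeated-distance bookkeeping is simplified to a plain list; all names are mine, and this is a sketch of the structure, not a conforming LZX decoder.

```python
# A sketch of the verbatim-block main loop. main_elements, length_codes
# and footer_bits are iterators standing in for the three kinds of
# reads from the compressed stream.
def decode_verbatim(main_elements, length_codes, footer_bits, recent, out_len):
    out = bytearray()
    while len(out) < out_len:
        elem = next(main_elements)
        if elem < 256:
            out.append(elem)              # literal plaintext byte
            continue
        # Copy item: split (elem - 256) into a slot and a 3-bit length.
        slot, length = (elem - 256) >> 3, (elem - 256) & 7
        if length == 7:                   # escape: read a "long length"
            length += next(length_codes)
        if slot < 4:
            distance = slot
        else:
            extra = (slot >> 1) - 1       # number of raw footer bits
            footer = next(footer_bits)    # raw bits, already assembled
            distance = ((2 | (slot & 1)) << extra) | footer
        if distance <= 2:                 # repeated-distance special value
            distance = recent[distance]   # (MRU reordering omitted here)
        else:
            recent[:] = [distance] + recent[:2]
        for _ in range(length + 2):       # minimum copy length is 2
            out.append(out[-distance])    # LZ77-style copy
    return bytes(out)
```

For example, the literals A, B, C followed by a copy item with slot 3 and stored length 3 (main element 283) reproduce "ABCABCAB".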

After the decompressor uses the block header to set up all the trees, the decompressor's main loop in an aligned block is:

Read a codeword from the compressed text using the main tree and decode it into a "main element".

If the "main element" is less than 256, emit the "main element" as a literal plaintext byte.

If the "main element" is 256 or more, we're going to have a copy item.

Break the "main element - 256" into two parts: a "slot" and a 3 bit "length".

If the "length" is the special all-ones value "7", read another codeword from the compressed text using the "length tree" and decode it into a "long length"; add that to the 7 already in the length.

If the "slot" is in the range 0..3, then set the distance the same as the slot.

If the "slot" is in the range 4..37, then extra = ((slot >> 1) - 1) more bits are needed to get the exact distance.

If extra > 3, read (extra - 3) raw "footer" bits from the compressed text (not using any Huffman tree, just the raw bits). Then read another codeword from the compressed text using the "aligned tree" and decode it to 3 bits (0 to 7). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the (extra - 3) raw footer bits, followed by the 3 aligned bits.

If extra = 3, read another codeword from the compressed text using the "aligned tree" and decode it to 3 aligned bits (0 to 7). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the 3 aligned bits.

If extra is 1 or 2, read those extra raw "footer" bits from the compressed text (not using any Huffman tree, just the raw bits). The distance is (in binary) an implied 1 bit, followed by the (slot & 1) bit, followed by the extra raw footer bits.

If the distance is 0..2, it indicates a "repeated distance" (as described above) -- replace it with the actual distance we used before.

Do a normal LZ77-style copy from the previously decoded plaintext: copy from a distance (as described above) a total of "length + 2" bytes into the decoded plaintext. (The minimum copy length is 2 bytes).

Repeat until we've decoded the total number of bytes that the header says the uncompressed plaintext should contain.
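The three distance branches of the aligned-offset block can be collapsed into a single function. The two reader callables stand in for the raw bitstream and the aligned tree; all names are mine.

```python
# Assemble an aligned-offset-block distance from the slot, the raw
# footer bits, and the 3 "aligned" bits decoded via the aligned tree.
def aligned_distance(slot, read_raw_bits, read_aligned_bits):
    """read_raw_bits(n): n raw footer bits as an integer.
    read_aligned_bits(): 3 bits (0 to 7) from the aligned tree."""
    if slot < 4:
        return slot
    extra = (slot >> 1) - 1
    base = 2 | (slot & 1)          # the implied 1 bit plus (slot & 1)
    if extra > 3:                  # footer bits, then 3 aligned bits
        footer = read_raw_bits(extra - 3)
        return (((base << (extra - 3)) | footer) << 3) | read_aligned_bits()
    if extra == 3:                 # no footer bits, just aligned bits
        return (base << 3) | read_aligned_bits()
    # extra is 1 or 2: raw footer bits only, no aligned bits
    return (base << extra) | read_raw_bits(extra)
```

Keeping the low 3 bits in their own Huffman tree is what lets the aligned block exploit distances that cluster on 8-byte boundaries.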

The Microsoft "cabinet" (".CAB") compressed archive file format supports 3 data compression methods: DEFLATE, LZX, and Quantum compression. The compressor that puts executable files into a ".CAB" archive may optionally "detect "CALL" instructions, converting their operands from relative addressing to absolute addressing, thus calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code." The ".CAB" decompressor, after doing LZX decompression described above, and if the "preprocessing" bit in the header was set to one, must convert those operands back to relative addressing. We will discuss such preprocessor/decorrelators in general, and Executable Compression specifically, in more detail in a later chapter.

To do:
Is this CALL-filter preprocessor used for all executable files in a cabinet, or just LZX-compressed files?

LZX DELTA (LZXD) is a variation of LZX modified to support efficient delta compression (patching).

An LZX DELTA decompressor takes the old version of some plaintext file (which must be exactly the same as the old version used by the compressor) and the LZXD-compressed patch file, and uses them to generate the new version of that plaintext file.

The main difference between LZXD and LZX is that the window is pre-loaded with the reference file, with the decompressor decoding the first byte of the output file into the window at the location immediately after the last byte of the reference file. Normal LZX copy items have a distance that is at most the full distance back to the first byte of the output file; LZXD allows a distance that reaches much further back to allow copies from the reference file.

LZXD supports many more "slots", up to 290 slots, which are encoded into the main tree. This allows LZXD copy items to reach much further into the past, letting it take advantage of very long reference files.

LZXD also supports much larger "length" values. Much as LZX has a "short length" of 0 to 7, where 7 is used as an escape to get a "long length" from 7 to 255, LZXD uses the same "short length" and "long length" and goes one step further, using a long length of 257 as an escape for an "extra-long length" that can copy up to 32 KBytes in a single copy.
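The length "escape ladder" described above can be sketched as a single function. The two callables stand in for the length tree and for the extra-long-length encoding (whose exact bit layout is a detail of the LZXD spec); all names are mine.

```python
# Decode a match length via the LZXD short/long/extra-long escape ladder.
def decode_match_length(short_len, read_length_code, read_extra_long):
    """short_len: the 3-bit length from the main element (0 to 7).
    read_length_code(): next codeword from the 249-entry length tree.
    read_extra_long(): the extra-long length (format detail, stubbed)."""
    if short_len < 7:
        return short_len + 2             # minimum copy length is 2
    length = 7 + read_length_code() + 2  # "long length" path
    if length == 257:                    # escape: extra-long length
        return read_extra_long()         # can reach up to 32 KBytes
    return length
```

Note how the escape value 257 corresponds exactly to the maximum codeword of the 249-entry length tree, so no codepoint is wasted.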

LZMA is one of the compression methods used in the open-source 7-Zip file archiver.

The LZ77-like compression part of LZMA has many similarities with the corresponding LZ77-like compression part of LZX. LZMA does not use Huffman codes -- instead, it uses context-coded binary arithmetic codes.

Many compressed file formats must be decompressed using an entropy decoder, followed by a whole series of transformations -- a kind of software pipeline.

There are 4 ways to implement such decompressors (and the corresponding compressors):

Multi-pass compression: each inner transformation reads all the input from some temporary file, and writes all its output to some temporary file, before the next stage starts.

Run every stage more-or-less simultaneously, using co-routines, threads, processes, or Unix pipes. Alas, these techniques are not supported in many embedded system environments.

Use decompression functions that call "down" to fetch more data. The outer application directly calls only the final transformation stage, which calls some intermediate transformation stage, which calls some other intermediate transformation stage, which calls the entropy decoder at the beginning of the pipeline, which calls the routine that fetches bytes from the file. The outer application repeatedly calls that final transformation stage until all the data is processed.

Use decompression functions that return "up" to fetch more data. The outer application directly calls each stage in turn. Each stage reads data from its input buffer, updates its internal state, and writes data to its output buffer. That output buffer is the input buffer of the next stage. The outer application repeatedly cycles through all the stages until all the data is processed.
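The "return up" style can be sketched with plain buffer-to-buffer stages. The stage functions here are toy stand-ins, not real compression transforms, and all names are mine; the point is the control structure: no stage ever calls another, and the outer loop owns the scheduling.

```python
# A "return up to fetch more data" pipeline: each stage consumes its
# input buffer and appends to its output buffer; the outer application
# cycles through the stages.
class Stage:
    def __init__(self, func):
        self.func = func              # the transformation itself
        self.inbuf = bytearray()
        self.outbuf = bytearray()

    def pump(self):
        """Consume everything in inbuf, append the result to outbuf."""
        self.outbuf += self.func(bytes(self.inbuf))
        self.inbuf.clear()

def run_pipeline(stages, chunks):
    out = bytearray()
    for chunk in chunks:                      # outer application loop
        stages[0].inbuf += chunk
        for prev, nxt in zip(stages, stages[1:]):
            prev.pump()
            nxt.inbuf += prev.outbuf          # output feeds next input
            prev.outbuf.clear()
        stages[-1].pump()
        out += stages[-1].outbuf
        stages[-1].outbuf.clear()
    return bytes(out)
```

This structure needs no threads or pipes, which is why it suits the embedded environments mentioned above.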

Converting software from a "call down to fetch more data" API to a "return up to fetch more data" API is difficult.[10]

Because modern CPUs process data faster than data can be read off a spinning disk or most external flash drives, it is possible to read and write a compressed, encrypted file in less time than it takes to read and write a normal plain text file containing the same data. [20]

Some researchers test encryption algorithms by trying to compress the encrypted data. Many encryption algorithms are specifically designed to produce ciphertext that is "indistinguishable from random noise".[27] If compression produces an output file that is smaller than the encrypted input file, that tells the researcher that there is a flaw in the encryption software -- either a bug in the implementation, or a correct implementation of an insecure algorithm.

Some people using lossless compression prefer to use "idempotent" (is there a better term for this?) compression software, even if it takes a little longer to compress.[28] In other words, when you compress a file with "maximum compression" (often "-9n") to create one compressed file, then uncompress it, then re-compress it, they expect the resulting second compressed file to be identical to the first compressed file.

This generally requires that each and every transformation implementation (at least when passed the "-9n" option) is idempotent. Some things that are not idempotent (and so should be avoided, at least when passed the "-9n" option) are:

Some compressors store the original filename and timestamp of the uncompressed file in the compressed file. To be idempotent, either the compressor must leave out the name and timestamp (that's what the "-n" typically does) or else the decompressor must restore the original name and timestamp.

For any given implementation of a decompression algorithm, there are almost always many different ways for a compressor to represent the original file -- many different compressed files will produce exactly the same output file. (The exception is "bijective compression".) When attempting to gain "maximum compression", we select the representations that are the shortest -- but even then there are usually several that give the same compressed file size. To be idempotent, the implementation of the compressor must always pick the same one.

idempotent (lossless but not bijective: ED = 1 but DE != 1). Nearly all compression algorithms designed for text or executable code are in this category.

non-drifting lossy (EDE = E). Most lossy compression algorithms designed for audio and video and still images are in this category. Quantization is a non-drifting lossy transform. Starting from an initial image I0, the first encode/decode cycle gives you a slightly different image I1 (hopefully different only in ways normal humans are unlikely to notice). Starting from I1 and encode/decode to produce I2, then encode/decode I2 to produce I3 doesn't give any more loss: I1 == I2 == I3 == I4 ...
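The non-drifting property (EDE = E) is easy to see with uniform quantization, the example named above. This toy sketch uses a step size of my own choosing:

```python
# Uniform quantization as a non-drifting lossy transform: the first
# encode/decode cycle loses information, but re-encoding the already
# quantized signal loses nothing further (I1 == I2 == I3 ...).
Q = 10  # quantizer step size (arbitrary for this example)

def encode(samples):
    return [round(s / Q) for s in samples]

def decode(codes):
    return [c * Q for c in codes]

i0 = [3, 14, 27, 96]           # original signal
i1 = decode(encode(i0))        # first generation: lossy
i2 = decode(encode(i1))        # second generation: no further loss
```

Once every sample sits exactly on a quantizer level, encode/decode becomes the identity, which is precisely the EDE = E condition.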

The general case: degrading. Every encode/decode cycle generally produces a slightly different image (such as a series of FAX transmissions or a multi-generational copy using physical paper copy machines).

Some transforms are known not to completely remove all redundancy. In particular, researchers have designed special LZ77 encoders and LZW encoders that encode extra "redundant bits" in the choice of which copy was selected, in a way that is backwards-compatible: Standard LZ77 or LZW decoders can extract the original file. Special LZ77 or LZW decoders can extract the original file plus these extra "redundant bits". These extra bits can be used for Reed-Solomon or other forms of error correction, or to store metadata about the original file such as a public signature or message authentication code.[30][31]

"lzx.c - LZX decompression source code" (has many comments on the difference between the documentation and the implementation, clarifying several details the "official" documentation neglects to specify)

"Encryption Using Huffman Coding". quote: "If you take a data file and compress it using Huffman coding techniques, you get something that looks random. Make the coding table the key... You could make it more difficult by arbitrarily permuting the ones and zeros and choosing random digraphs to encode... several dozen common digraphs as single symbols".

Douglas W. Jones. "Splay Tree Based Codes". quote: "Splay compression performs unexpectedly well on images... It is only an accident that splay trees are useful for encryption, and the security of variants on the algorithm is still largely an open question."