Quantum compression format

The Quantum compression format is a somewhat obscure method invented
by a company called Cinematronics (written by an author identified
as David Stafford, "davids a& cruzio.com"). Rights to it were
purchased at some point by Microsoft, and Quantum is one of the
possible compression methods in a CAB archive. This document does not
describe the Quantum header nor the CAB details, only the compressed data.

Overview

The Quantum compressor is a large-window LZ77-type compressor
combined with an integer arithmetic coder. In this it differs
from gzip/zlib and LZX which use Huffman compressors instead of arithmetic. An
arithmetic coder is typically more efficient, but slower. (As for the
arithmetic coding patents, Quantum doesn't use skew coding or bit
stuffing — it uses bit-plus-follow, which is apparently not
patented. And the models are not those of the patented Q-coder. So
you should be reasonably safe, but I Am Not A Lawyer)

Arithmetic basics

The Quantum arithmetic coder is similar to the bitwise coder described under
"Practical Arithmetic Coding with Renormalization" at Charles
Bloom's site. I will not describe it further here. What
differentiates Quantum is the statistical models it uses as input to
the coder.

Model Basics

All the models used by the Quantum compressor share the same structure
and update technique. All have a table with some number
of entries, each entry composed of a symbol and an associated
cumulative frequency. The cumulative frequency is the frequency of
all the symbols which are at a higher index in the table than that
symbol — thus the last entry in the table has a cumulative
frequency of 0. Each model also has a total frequency (the frequency
of all the symbols in the table), and a number time_to_reorder.

All the models are initialized with the symbols in symbol order in the
table, and with every symbol in the table having a frequency of 1.
Thus the initial total frequency is equal to the number of entries in
the table. The initial time_to_reorder value is 4.

The way the models are updated is as follows: Each time a symbol is
encoded or decoded, all the cumulative frequencies up to (but not
including) that symbol have 8 added to them. If this causes the total
frequency to exceed 3800, time_to_reorder is decremented.

If time_to_reorder has not reached zero, each cumulative
frequency in the table, starting with the last, is divided (with
truncation) by 2. If this causes the cumulative frequency to become
less than or equal to the cumulative frequency of the next entry, the
entry cumulative frequency is instead set to the next entry cumulative
frequency plus 1.

If time_to_reorder HAS reached zero, the frequencies (not cumulative
frequencies) for each entry in the table are computed. Each
frequency is divided by 2 with rounding. The table is then sorted by
frequency (highest first) using an in-place selection sort (not
stable!) and the cumulative frequencies recomputed. The
time_to_reorder is set to 50.

Selector and literal models

To start decompressing a Quantum archive, read a symbol from the code
stream using the selector model. This is a very simple model of 7
entries consisting of the symbols 0-6. What to do next depends on the
value of the selector. If the selector is 0,1,2, or 3, a symbol is
read using a literal model. These models encode symbols to be
directly output. They are, respectively:

Selector

Number of entries

Starting symbol

0

64

0

1

64

64

2

64

128

3

64

192

LZ Compression

Selectors 4,5, and 6 implement Lempel-Ziv matches. Selector 4 encodes
3 character matches, selector 5 implements 4 character matches,
and selector 6 encodes 5+ character matches. Two character matches are
encoded as literals. All three selectors use a position_slot and
extra_bits table given below, and selector 6 uses a length_slot and
length_extra_bits table.

position_slot

position_base

extra_bits

0

0

0

1

1

0

2

2

0

3

3

0

4

4

1

5

6

1

6

8

2

7

12

2

8

16

3

9

24

3

10

32

4

11

48

4

12

64

5

13

96

5

14

128

6

15

192

6

16

256

7

17

384

7

18

512

8

19

768

8

20

1024

9

21

1536

9

22

2048

10

23

3072

10

24

4096

11

25

6144

11

26

8192

12

27

12288

12

28

16384

13

29

24576

13

30

32768

14

31

49152

14

32

65536

15

33

98304

15

34

131072

16

35

196608

16

36

262144

17

37

393216

17

38

524288

18

39

786432

18

40

1048576

19

41

1572864

19

length_slot

length_base

length_extra_bits

0

0

0

1

1

0

2

2

0

3

3

0

4

4

0

5

5

0

6

6

1

7

8

1

8

10

1

9

12

1

10

14

2

11

18

2

12

22

2

13

26

2

14

30

3

15

38

3

16

46

3

17

54

3

18

62

4

19

78

4

20

94

4

21

110

4

22

126

5

23

158

5

24

190

5

25

222

5

26

254

0

The match selectors use several models to encode match length slots and
position slots. The size of each position model depends on the compression
table size. For example, if the table size is 1K, there is no need to
encode position slots of 1K or greater. The number of position slots
for each table size, and thus the position model size, is given in the
Appendix. The 4 and 5 selectors also
have a smaller maximum position slot regardless of table size; this
implies that small matches are not emitted beyond a certain point. (In
fact, the original compressor often doesn't emit small matches even within the
range of its position slot table, but that is an implementation
detail.)

Selector

Number of entries

Starting symbol

4

max 24

0

5

max 36

0

6 (position)

max 42

0

6 (length)

27

0

(note that the models are separate, even if they happen to be
the same size. That is, they each update separately and are not
affected by updates to each other)

To decompress a selector 4 or 5 match, read a symbol from the code
stream using the appropriate model. Look up the symbol in the
extra_bits table, and read that many bits. Add the position_base from the
position_slot table to the extra bits, and add one. This is the match
offset. The match length is 3 (selector 4) or 4 (selector 5).

To decompress a selector 6 match, read a symbol from the code stream
using the selector 6 length model. Look up the symbol in the
length_extra_bits table, and read that many bits. Add the length_base from
the length_slot table to the extra bits, and add 5. This is the match
length. Now read a symbol from the code stream using the selector 6
position model. Look up the symbol in the extra_bits table, and read
that many bits. Add the position_base from the position_slot table to the
extra bits just read, and add 1. This is the match offset.

Framing and Padding — Microsoft CAB

This paragraph is applicable only to the CAB implementation
In order to ensure that there are an integral number of uncompressed
bytes per CAB block, no LZ match will cause the uncompressed stream to
cross an $8000-byte boundary. Further, the arithmetic coder is
stopped at each $8000 byte boundary, with any remaining bits in the
block set to zero. The models and LZ window are not affected by
crossing such a boundary.

Checksum — Cinematronics original

This paragraph is applicable only to the Cinematronics implementation
Each time you output a byte, whether from a literal or a match, XOR it
into a checksum accumulator (inital value 0), then rotate the
accumulator one bit left. There is no EOF symbol, so keep reading
until you've output the number of bytes you know are in the file.
Then read 16 bits from the stream. These 16 bits should match your
checksum accumulator. You're done.

EOF when encoding

Note that because of the way the decoder works, it has already taken
in 16 + bit_plus_follow bits past the point the encoder had stopped.
Some of these bits are significant. The method used in the CACM87
coder to handle this does not seem to work. Instead, what seems to
work is: put a 0. Put bit-plus-follow 1s. Put the bottom 15 bits of
the low end of the current range. Then put the checksum.
(this is not the actual method Quantum uses)

The method used in CACM87 is, after the last symbol is processed
through the coder, the bit-plus-follow counter is incremented and the
low end range value is checked. If it is below 1/4 (0x3FFF), a one
(with follow) is output. Otherwise, a zero (with follow) is output.
I'm not sure whether this is valid or works 'by accident' in their
application.

I hit upon my method by considering what would happen if
more symbols had been sent, with very low cumulative frequency. In
that case, the high end of the range gets set equal (or nearly,
anyway) to the low end of the range, and the bits output are exactly
those from the low end of the range, with the bit-plus-follow bits
from the last real symbol, if any, after the high bit of the low end
of the range. (the constant 0 bit is because CS_L is always
less than 1/2 at the end of the encoding loop).

Hmm... new method based on the MACM-96 encoder. That one outputs the
code with the most trailing zeros. Not helpful efficiencywise for
Quantum, but it cleans things up a bit. The important thing to
realize is that all that is necessary is to send a code between CS_L
and CS_H. But at termination, CS_H & 0x8000 = 1 and CS_L & 0x8000 =
0. So the code to be sent is simply 0x8000. Since we're sending a 1
bit first, the follow bits are 0 and therefore we send 0x8000 +
bit_plus_follow zeros. This is also not the method Quantum uses; they
appear to send a zero bit first.

If anyone knows the "correct" method, or the one Quantum uses, please
let me know.