Fixed-Length Codes

Suppose we want to compress a 100,000-byte data file that we know contains
only the six letters A through F. Since we have only six
distinct characters to encode, we can represent each one with three bits rather
than the eight bits normally used to store characters:

Letter      A    B    C    D    E    F
Codeword    000  001  010  011  100  101

This fixed-length code encodes the file in 300 000 bits (37 500 bytes), saving 5/8 = 62.5% of the space.
Can we do better?
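As a quick sanity check on these numbers, a small Python computation (illustrative only):

```python
# Size of the file under the 3-bit fixed-length code,
# compared with the usual 8 bits per character.
n_chars = 100_000               # characters in the file
fixed_bits = 3 * n_chars        # 3 bits per character
original_bits = 8 * n_chars     # 8 bits per character

print(fixed_bits // 8)                 # 37500 bytes
print(1 - fixed_bits / original_bits)  # 0.625 of the space saved
```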

Variable-Length Codes

What if we knew the relative frequencies at which each letter occurred? It
would be logical to assign shorter codes to the most frequent letters and save
longer codes for the infrequent letters. For example, consider this code:

Letter          A    B    C    D    E    F
Frequency (K)   45   13   12   16   9    5
Codeword        0    101  100  111  1101 1100

Using this code, our file can be represented with

(45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4) × 1000 = 224 000 bits

or 28 000 bytes, saving 72% of the space. In fact, this is an
optimal character code for this file (which is not to say that
the file is not further compressible by other means).
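We can verify this count with a short Python sketch; the frequencies and codewords are taken from the table above:

```python
# Cost of the variable-length code: sum over letters of
# frequency × codeword length.
freq = {"A": 45_000, "B": 13_000, "C": 12_000,
        "D": 16_000, "E": 9_000, "F": 5_000}
code = {"A": "0", "B": "101", "C": "100",
        "D": "111", "E": "1101", "F": "1100"}

total_bits = sum(freq[c] * len(code[c]) for c in freq)
print(total_bits)   # 224000 bits, i.e. 28000 bytes
```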

Prefix Codes

Notice that in our variable-length code, no codeword is a prefix of any other
codeword. For example, we have a codeword 0, so no other codeword starts with 0.
And both of our four-bit codewords start with 110, which is not a codeword. A
code where no codeword is a prefix of any other is called a prefix code. Prefix codes are useful because they
make a stream of bits unambiguous: we can simply accumulate bits from the stream
until we have completed a codeword. (Notice that encoding is simple regardless
of whether our code is a prefix code: we just build a dictionary from letters to
codewords, look up each letter we're trying to encode, and append the codewords
to an output stream.) It turns out that prefix codes can always achieve
the optimal compression for a character code, so we're not losing
anything by restricting ourselves to this type of character code.

When we're decoding a stream of bits using a prefix code, what data structure
might we want to use to help us determine whether we've read a whole codeword
yet?

One convenient representation is to use a binary tree with the codewords
stored in the leaves so that the bits determine the path to the leaf. This
binary tree is a trie in which only the leaves map to letters. In our
example, the codeword 1100 is found by starting at the root, moving down the
right subtree twice and the left subtree twice:

Here we've labeled the leaves with their frequencies and the branches with the
total frequencies of the leaves in their subtrees. You'll notice that this is a full
binary tree: every nonleaf node has two children. This happens to be true of all
optimal codes, so we can tell that our fixed-length code is suboptimal by
observing its tree:

Since we can restrict ourselves to full trees, we know that for an alphabet C,
we will have a tree with exactly |C|
leaves and |C|−1
internal
nodes. Given a tree T corresponding to a prefix code, we also can compute
the number of bits required to encode a file:

B(T) = Σ_{c ∈ C} f(c) · dT(c)

where f(c)
is the frequency of character c
and dT(c)
is the depth of the character in the tree (which also is the length of the
codeword for c). We call B(T)
the cost of the tree T.
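A small Python sketch of this cost computation, assuming a hypothetical nested-tuple representation of the example tree (leaves are (character, frequency) pairs, internal nodes are (left, right) pairs):

```python
# B(T) = sum over leaves of f(c) * depth of c, computed by a
# recursive walk over the tree for the variable-length code above.
tree = (("A", 45),                       # codeword 0
        ((("C", 12), ("B", 13)),         # codewords 100, 101
         ((("F", 5), ("E", 9)),          # codewords 1100, 1101
          ("D", 16))))                   # codeword 111

def cost(node, depth=0):
    if isinstance(node[0], str):         # a leaf: (char, freq)
        return node[1] * depth
    left, right = node
    return cost(left, depth + 1) + cost(right, depth + 1)

print(cost(tree))   # 224 (in thousands of bits)
```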

Huffman's Algorithm

Huffman invented a simple algorithm for constructing such trees given the set
of characters and their frequencies. Like Dijkstra's algorithm, this is a greedy
algorithm, which
means that it makes choices that are locally optimal yet achieves a globally
optimal solution.

The algorithm constructs the tree bottom-up. Given a set of leaves
containing the characters and their frequencies, we repeatedly merge the two
subtrees with the smallest frequencies. We perform each merge by
creating a parent node labeled with the sum of the frequencies of its two
children. After |C|−1
mergings, we are left with a single tree.

As an example, use Huffman's algorithm to construct the tree for our input.

How can we implement Huffman's algorithm efficiently? The operation we need
to perform repeatedly is extraction of the two subtrees with smallest frequencies,
so we can use a priority queue. We can express this in ML as:
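A comparable sketch in Python (not the ML original), using the standard heapq module as the priority queue; the tuple tiebreaker and the left/right ordering of merged subtrees are illustrative choices:

```python
import heapq

# Build a Huffman tree by repeatedly extracting the two subtrees with
# the smallest frequencies from a priority queue and merging them.
def huffman(freqs):
    # Each heap entry: (frequency, tiebreaker, tree). Leaves are
    # characters; internal nodes are (left, right) pairs.
    heap = [(f, i, c) for i, (c, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two smallest subtrees
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))  # merged parent
        i += 1
    return heap[0][2]

# Read the codewords off the tree: 0 for left, 1 for right.
def codewords(tree, prefix=""):
    if isinstance(tree, str):             # a leaf
        return {tree: prefix}
    left, right = tree
    return {**codewords(left, prefix + "0"),
            **codewords(right, prefix + "1")}

freqs = {"A": 45, "B": 13, "C": 12, "D": 16, "E": 9, "F": 5}
codes = codewords(huffman(freqs))
print(sorted(len(w) for w in codes.values()))   # [1, 3, 3, 3, 4, 4]
```

With these tie-breaks the resulting codeword lengths (1, 3, 3, 3, 4, 4) match the table above, so the cost is again 224K bits.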

We won't prove that the result is an optimal prefix tree, but why does this
algorithm produce a valid and full prefix tree? We can see that every time we
merge two subtrees, we're differentiating the codewords of all of their leaves
by prepending a 0 to all the codewords of the left subtree and a 1 to all the
codewords of the right subtree. And every non-leaf node has exactly two children
by construction.

Let's analyze the running time of this algorithm if our alphabet has n
characters. Building the initial queue takes time O(n log n)
since each enqueue
operation takes O(log n) time. Then we perform n−1
merges, each of
which takes time O(log n). Thus Huffman's algorithm takes O(n log n)
time.

Adaptive Huffman Coding

If we want to compress a file with our current approach, we have to scan
through the whole file to tally the frequencies of each character. Then we use
the Huffman algorithm to compute an optimal prefix tree, and we scan the file a
second time, writing out the codewords of each character of the file. But that's
not sufficient. Why? We also need to write out the prefix tree so that the
decompression algorithm knows how to interpret the stream of bits.

So our algorithm has one major potential drawback: We need to scan the whole
input file before we can build the prefix tree. For large files, this can take a
long time. (Disk access is very slow compared to CPU cycle times.) And in some
cases it may be impractical: we may have a long stream of data that we'd like
to compress as it arrives, and it could be unreasonable to have to accumulate
all the data before we can scan it. We'd like an algorithm that allows us to
compress a stream of data without seeing all of the data in advance.

The solution is adaptive Huffman coding, which builds the prefix
tree incrementally in such a way that the code is always optimal for the
sequence of characters seen so far. We start with a tree that has a frequency of
zero for each character. When we read an input character, we increment the
frequency of that character (and the frequencies in all branches above it). We
then may have to modify the tree to maintain the invariant that the least
frequent characters are at the greatest depths. Because the tree is updated
incrementally, the decoder can simply update its own copy of the tree
after every character is decoded, so we don't need to include the prefix tree
along with the compressed data.
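A naive Python sketch of this idea: instead of updating the tree in place (as real adaptive coders such as the FGK algorithm do), encoder and decoder here each rebuild the tree from identical counts after every symbol; deterministic tie-breaking keeps the two trees in lockstep. All names are illustrative.

```python
import heapq

# Rebuild a Huffman code table from the current counts. Sorting the
# items and using an integer tiebreaker makes the tree deterministic,
# so encoder and decoder always agree.
def build_codes(counts):
    if len(counts) == 1:                       # degenerate one-symbol case
        return {next(iter(counts)): "0"}
    heap = [(f, i, c) for i, (c, f) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, i, (t1, t2)))
        i += 1
    def walk(t, prefix=""):
        if isinstance(t, str):
            return {t: prefix}
        return {**walk(t[0], prefix + "0"), **walk(t[1], prefix + "1")}
    return walk(heap[0][2])

def encode(text, alphabet):
    counts = {c: 0 for c in alphabet}          # all frequencies start at zero
    bits = []
    for ch in text:
        bits.append(build_codes(counts)[ch])   # encode with the current tree
        counts[ch] += 1                        # then update the counts
    return "".join(bits)

def decode(bits, alphabet):
    counts = {c: 0 for c in alphabet}          # mirror the encoder's state
    out, buf = [], ""
    for b in bits:
        buf += b
        table = {w: c for c, w in build_codes(counts).items()}
        if buf in table:                       # completed a codeword
            out.append(table[buf])
            counts[table[buf]] += 1            # mirror the encoder's update
            buf = ""
    return "".join(out)

msg = "ABACABA"
assert decode(encode(msg, "ABC"), "ABC") == msg
```

No tree is transmitted: the decoder reconstructs the encoder's tree purely from the characters it has already decoded.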