Huffman encoding [DRAFT]

There's quite a lot on the web about data compression. Some of it is
highly technical; other pages glory in saying "Well, you're probably
doing a school project or something, so we're not going to do your
assignment for you. Assuming you've done your homework and
know Huffman coding...".

We don't have many such axes to grind. We believe if you do have
an assignment, and are one of those people who reads the solution
before you try the assignment, then either you're very bright, or
you're a total schmuck who won't amount to much! Either way, here's
our simple approach to Huffman compression. We do assume that
you have rudimentary programming skills, understand simple binary
arithmetic, and know what ASCII is.

Introduction

You may wish to skip to the meat, but why not read our tiny intro?
In 1951, David Huffman wrote a little post-graduate assignment, which
was destined to become a computer-programming classic. Huffman encoding
is perhaps to computer science what fast Fourier transforms are to
electrical engineering - a dietary staple that crops up everywhere. Huffman's neat, for
it allows us in a predictable, quick fashion to optimally compress
data. "Hang on", you cry - "surely it cannot be optimal?" Well, yes,
this depends on what you mean by optimal. Huffman is optimal, provided
you are looking at your data as made up of a given alphabet of 'things'
that you want to represent in the shortest way you can. 'Letters' in
your alphabet will be optimally represented, with more common letters
taking up fewer bits, and less frequently used letters being allocated
far longer bit strings to represent them. This doesn't mean that you
cannot achieve more compression by looking at the relationship between
letters, and encoding this. But, at the end of the day, when you have
a whole lot of 'identifiable letters' lumped together, Huffman is often the
way to go. This is why we see Huffman encoding used together with other
compression schemes, such as the notorious Lempel-Ziv-Welch compression
used in GIF files (notorious because Unisys suddenly started asserting their
patent on LZW a few years back, a silly move that angered not a few webizens).

When we say 'optimal', we also fail to mention that there are different
ways of applying Huffman compression. For example, you might have a body
(corpus) of text that you want to compress, and know from the start
the relative frequencies of all the letters in this text. You could then
design a 'static' compression scheme and apply it to the text. Conversely,
you might have only the first part of the text (and already want to start
compressing) - you might then use an 'adaptive' scheme of Huffman encoding!
Or you might have several texts in different languages - each one might
need its own scheme for optimal compression. In addition, if you're sending
your compressed data down a wire, note that you also have to tell the person
on the other end something about how the data are compressed - and this
information about how compression has been arranged
takes up space, too. There are also tricks and traps in actually doing
the compression - and some data sets are inherently unfriendly when it
comes to doing what Huffman does - representing the letter frequencies
in the form of a 'binary tree'.

It's obvious if you count the frequencies of letters in a text - say Charles Dickens'
A Tale of Two Cities - that some letters occur more frequently than
others. In most large English texts, the letter 'e' is by far the most
common (there are one or two exceptions)! Now let's try some trickery.
We know that text is commonly stored as ASCII codes - and for normal
ASCII characters stored in 8-bit bytes, the highest bit of each byte is zero. Let's
go through our text, and wherever we encounter the 8-bit ASCII code for
'e' (binary 01100101), we replace it by the single bit '1'. We save seven
bits for every letter 'e'! If we now read our text from left to right,
we can see that if we first encounter a '0', then there will be seven
more bits that represent an 8-bit ASCII code; but if the first bit is a '1',
then this alone represents an 'e', and we can move on to the next bit
and examine it in turn, to see whether it's another e, or another 8-bit
character!
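
If you want to see this trick in the flesh, here is a minimal sketch of ours
in Python (no standard scheme, just the toy described above), using a string
of '0' and '1' characters rather than real packed bits, for clarity:

    # The naive one-letter trick: 'e' becomes the single bit '1'; any other
    # character keeps its 8 ASCII bits, whose leading (high) bit is '0'.
    def naive_encode(text):
        bits = []
        for ch in text:
            if ch == 'e':
                bits.append('1')
            else:
                bits.append(format(ord(ch), '08b'))  # high bit is 0 for ASCII
        return ''.join(bits)

    def naive_decode(bits):
        out, i = [], 0
        while i < len(bits):
            if bits[i] == '1':                   # a lone '1' is an 'e'
                out.append('e')
                i += 1
            else:                                # a '0' starts an 8-bit character
                out.append(chr(int(bits[i:i + 8], 2)))
                i += 8
        return ''.join(out)

    text = "see the bees"
    assert naive_decode(naive_encode(text)) == text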

We can see that the above approach is far from optimal - there are other letters,
apart from e, that occur frequently. Each letter has its own peculiar frequency, so it would seem appropriate to
represent some common letters with bit strings of one particular length,
and others less common with other, longer bit strings. Just as we did with our clumsy
e-xample, we need to uniquely distinguish between different strings,
depending on the particular bit(s) they start with. In our example, every
representation of e began (and ended there and then) with a binary one, everything
else started with '0'. You can also
work out that it's only really worthwhile using a single bit for the
letter 'e' if nearly half of the letters are 'e'. If a quarter of the
letters are 'e', then you might represent this letter by two bits, and
so on.
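
The pattern behind this is worth a one-line check: information theory says
a symbol with relative frequency p ideally gets about -log2(p) bits, so a
half justifies one bit, a quarter two bits, and so on:

    import math

    # Ideal code length for a symbol of relative frequency p is -log2(p).
    for p in (0.5, 0.25, 0.125):
        print(p, "->", -math.log2(p), "bits")   # 1.0, 2.0, 3.0 bits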

Let's have a look at how Huffman designed an optimal approach.
When he designed his approach to compression, Huffman was faced with
a problem. Let's consider a text, and an alphabet of symbols together with
their frequencies in that text (our example frequencies are swiped from
Lanaki's cryptography lesson).

You find that you rapidly get bogged down if you start
assigning codes to the most commonly occurring letters first.
There's a vast number of ways you can combine frequencies,
once you start moving down and looking at less and less frequently occurring
letters.
Try it for yourself! Huffman's answer was devastatingly simple -
just start at the other end, with the least common characters! The
trick is to take the frequencies of the two least common characters,
and add them together. You now have a 'metacharacter' that stands in
place of these two characters. You next take all the remaining frequencies
(including the new metacharacter), and decide which two characters now
are least common, and repeat the process until you have only two
characters left (one at least must be a metacharacter, unless your
alphabet has only two characters). You then
work back down to your initial characters, but this time,
every time you encounter a metacharacter and split it into two,
you can assign a '0' to the representation of one character (or
meta-character), and a '1' to the other branch.

We'll try an example in a moment. But you can immediately get a feel
for what you're doing. Every time you move back down towards the infrequently
used characters, as you split into either 0 or 1, you are creating a binary tree, with branches, and sub-branches, and sub-sub-branches...
Note that only
very rarely will such trees have branches that are completely symmetrical -
if a letter is very common, such a 'leaf' will occur very near the 'root'
of the tree, making the tree lopsided.
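
Here is a minimal sketch of the melding process in Python (ours, with the
frequencies taken from the example that follows); the tree comes out as
nested tuples:

    import heapq
    import itertools

    def build_tree(freqs):
        # Tie-breaking counter, so the heap never has to compare two trees.
        counter = itertools.count()
        heap = [(f, next(counter), ch) for ch, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)   # least common
            f2, _, b = heapq.heappop(heap)   # second least common
            # Meld the two into a 'metacharacter' and push it back.
            heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))
        return heap[0][2]

    tree = build_tree({'E': 36, 'S': 19, 'A': 17, 'D': 14, 'R': 10, 'Z': 4})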

An example

Consider the following simple alphabet, with corresponding frequencies:

Letter    Frequency (%)
E         36
S         19
A         17
D         14
R         10
Z         4

We progressively apply our Huffman algorithm, thus:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      E=36      DRZSA=64
          S=19      S=19      DRZ=28    SA=36     E=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

You can see what we've done - first we set up our table with the initial
values and frequencies. We then add the frequencies for R and Z (call this
RZ), and then re-rank everything so that items are still sorted by
frequency. We repeat this again and again, until after Pass 4, we have
only two items left - the metacharacter DRZSA, and the character E.
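
You can replay those passes mechanically. A short sketch of ours (the melded
names are simple concatenations, so their letter order may differ from the
table's where frequencies tie):

    freqs = [('E', 36), ('S', 19), ('A', 17), ('D', 14), ('R', 10), ('Z', 4)]
    ranked = sorted(freqs, key=lambda kv: -kv[1])
    pass_no = 0
    while len(ranked) > 2:
        (c1, f1), (c2, f2) = ranked[-1], ranked[-2]   # two least common
        ranked = ranked[:-2] + [(c2 + c1, f1 + f2)]   # meld into a metacharacter
        ranked.sort(key=lambda kv: -kv[1])            # re-rank by frequency
        pass_no += 1
        print("Pass", pass_no, ranked)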

Turning this into a binary tree is then simple - let's split our tree
into two, binary '1' for E, and binary '0' for DRZSA. E is a 'leaf', so
we turn to DRZSA, which we in turn split into two - '01' for DRZ, and
'00' for SA. We repeat the process for both metacharacters, deriving
'011' for D, and '010' for RZ. Similarly, SA is split into '001' for A,
and '000' for S. But wait, we still need to sort out RZ, so we split it
into '0101' (R) and '0100' (Z). We're finished. Here's a schematic
of the binary tree we've created:

- As desired, less frequent characters have more bits in their
  binary representation;

- At several points in the tree, where frequencies were the same,
  arbitrary choices have been made - we could have, for example,
  added DRZ (28%) to either SA (36%) or E (also 36%);

- We arbitrarily allocated '1' to E and '0' to DRZSA. We might
  just as easily have allocated '0' to E and '1' to DRZSA.

We can now make a table of bit strings for each character:

Bit string table

Character   Bit string   Length
E           1            1
D           011          3
A           001          3
S           000          3
R           0101         4
Z           0100         4

You can see that despite the constraints of the algorithm, we still
have room to play! There were several points where we made 'arbitrary
choices'. This is a good thing, as we will soon discover.
For there are problems...
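
Incidentally, a table like the one above can be read straight off the tree.
Continuing our earlier sketch, we walk the nested tuples, handing '1' to one
branch and '0' to the other; the lengths will match the table, though the
exact strings depend on which arbitrary choices the tree-builder made:

    def assign_codes(node, prefix=''):
        if isinstance(node, str):                    # a leaf: a real character
            return {node: prefix}
        a, b = node
        codes = assign_codes(a, prefix + '1')        # '1' to one branch...
        codes.update(assign_codes(b, prefix + '0'))  # ...'0' to the other
        return codes

    codes = assign_codes(tree)       # 'tree' from the earlier sketch
    for ch, bits in sorted(codes.items(), key=lambda kv: (len(kv[1]), kv[0])):
        print(ch, bits, len(bits))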

Doing the compression

Okay, so we now have our (optimal?) tree that represents the Huffman
codes for all of our characters. The question is, how do we use it?
(We'll leave actual encoding of the binary tree itself for later).
There are two distinct problems. These are:

- Encoding our text corpus;

- Telling someone how to decode the encoded data.

Encoding

Conceptually, this is simple. All one does is take the string we
wish to encode, and then, using a table of lengths and bit-strings
constructed as above, for every 'E' we encounter, substitute '1'
(binary); for every 'A', we substitute '001', and so on. The mechanics
of doing so are mildly tricky, however!

One way of doing things is to set aside binary words in memory
that are big enough to contain the longest binary Huffman string.
Also set up an array of lengths of each binary string corresponding
to a particular character.
We then work sequentially through the string to be translated (don't
do this just yet), looking up the binary word for each character, and
then tacking the correct number of bits onto our output string. If you
set things up correctly, this should be quick. Oh! Remember to somehow
record the length of the finished output string, so extra 0's or whatever
at the end don't get misinterpreted as data!
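
Here is one way that bookkeeping might look, as a sketch building on the
'codes' table from our earlier sketch; a serious implementation would shift
whole words at a time for speed:

    def encode(text, codes):
        out = bytearray()
        acc, acc_len, total_bits = 0, 0, 0
        for ch in text:
            for bit in codes[ch]:                # tack bits onto the output
                acc = (acc << 1) | (bit == '1')
                acc_len += 1
                total_bits += 1
                if acc_len == 8:                 # flush a completed byte
                    out.append(acc)
                    acc, acc_len = 0, 0
        if acc_len:                              # pad the last partial byte
            out.append(acc << (8 - acc_len))
        # total_bits must travel with the data, so padding isn't decoded.
        return bytes(out), total_bits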

Decoding - Canonical Huffman

Decoding is a bit more subtle, and actually colours our encoding scheme!
We need to tell the 'decoder' how things were encoded, and the encoder
therefore needs to find an efficient algorithm to do just that.

First, the decoder needs to know exactly how the string was encoded.
For this, we employ the trickery we referred to above. The magic term
is 'canonical Huffman'. The word canon comes from a Greek/Latin word meaning
a 'rule', often actually a rather arbitrary edict. Our edict is that, when encoding the bit strings:

- We will preserve "alphabetical order" in our alphabet;

- We will allocate '1's to shorter codes.

The reason for this strange set of rules will become apparent.
But first, let's take the Huffman table from above:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      E=36      DRZSA=64
          S=19      S=19      DRZ=28    SA=36     E=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

We next construct a tree, slightly different from the one above,
because we will now always allocate the '1's to shorter branches:

But you can see there's a problem - for although we've meticulously
allocated 1's to shorter codes (E gets a one, DRZSA gets a zero;
D gets 001 while RZ gets 000), we haven't preserved alphabetical order!
Take A, S and D, which have the same lengths. To preserve alphabetical
order, the value for A should be 001, D 010, and S 011. Also note that
even if we swap around S and A, making S=011 and A=010, D is still misplaced.

The solution is to create our tree as above, allocating codes, and
then, for each set of characters with bit strings of the same length ,
to re-allocate their codes in alphabetical order! So, for example, with
A, D and S, which are all of length three, we re-allocate values
- A becomes 001, D 010 and S 011. Note that nothing has really
changed, as we still need three bits for each character. We just
re-arranged things slightly, thus:

So why bother with this canonical rearrangement? Well, the answer is that
if we use canonical Huffman coding to encode a text, then all the person on
the other end needs in order to decode the text
is a list of the code lengths of all of the characters in the alphabet!
You can easily see why this is by working through our example... if we know that
there's only one code of length 1, then '1' must refer to E, so
everything starting with 0 must
refer to bit strings with length greater than 1. If there are no
characters of length 2, then we move on to length 3, where there are
now four possibilities (000, 001, 010 and 011 - remember that E gobbled
up the first '1', so 100 and so on are not possibilities).

Knowing that A, S and D are of length three, we can work out what
the corresponding codes are - because shorter strings eat the
'ones', 000 cannot refer to A, S or D. Because we preserve alphabetical
order, the values must be A=001, D=010, and S=011.

In summary, for the person receiving a canonical Huffman-encoded text to
effectively decode the text, all she needs is to know the length (in bits)
of each of the characters. She can work out all the rest! Here is our
table of bit strings - note that the actual strings are redundant
information, as we have NOW used canonical Huffman encoding.

Canonical bit string table

Character   Bit string   Length
E           1            1
A           001          3
D           010          3
S           011          3
R           0000         4
Z           0001         4
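
Here is a sketch of that reconstruction under our two edicts (alphabetical
order within a length, '1's to shorter codes). We hand out code values from
the longest length upward; stepping to a shorter length rounds the next free
value up, which leaves the numerically larger, 1-heavy codes for the shorter
lengths:

    def canonical_codes(lengths):
        by_length = {}
        for ch, n in lengths.items():
            by_length.setdefault(n, []).append(ch)
        codes, value, prev = {}, 0, None
        for n in sorted(by_length, reverse=True):         # longest codes first
            if prev is not None:
                shift = prev - n
                value = (value + (1 << shift) - 1) >> shift   # ceiling shift
            for ch in sorted(by_length[n]):               # alphabetical order
                codes[ch] = format(value, f'0{n}b')
                value += 1
            prev = n
        return codes

    codes = canonical_codes({'E': 1, 'A': 3, 'D': 3, 'S': 3, 'R': 4, 'Z': 4})
    # codes == {'R': '0000', 'Z': '0001', 'A': '001', 'D': '010',
    #           'S': '011', 'E': '1'} - exactly the table above.

    # Decoding then reads bits until they match a known code; the prefix
    # property guarantees no code is the start of another.
    def decode(bits, codes):
        inv = {v: k for k, v in codes.items()}
        out, cur = [], ''
        for b in bits:
            cur += b
            if cur in inv:
                out.append(inv[cur])
                cur = ''
        return ''.join(out)

    assert decode('0010101', codes) == 'ADE'   # 001 + 010 + 1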

Alternatives

Think for a moment.
We could also have arranged things thus:

Rank      Initial   Pass 1    Pass 2    Pass 3    Pass 4
HIGHEST   E=36      E=36      E=36      AS=36     EDRZ=64
          S=19      S=19      DRZ=28    E=36      AS=36
          A=17      A=17      S=19      DRZ=28
          D=14      D=14      A=17
          R=10      RZ=14
LOWEST    Z=4

(All we did was meld E and DRZ, rather than SA and DRZ as we did above.
Perfectly legal, as E and SA have identical frequency weightings of 36%.)
The corresponding canonical Huffman tree is:

Now you can see that even a canonical Huffman tree needn't be unique.
We still have some scope for playing! So which variant is better?
If you think about it, in our first tree, the letter 'E' was encoded
using just a single bit - here it gets two, but then here, A and S
also get two bits rather than three. Consider a hundred characters.
Using either encoding system, the number of bits required will be:

total bits = SIGMA ( freq_i * size_i ), summing i over the alphabet, 'A' to 'Z'

In other words, for each letter, multiply its frequency by its size in bits,
and add up these products to get the total number of bits.

For the first system, this works out at:

36*1 + 19*3 + 17*3 + 14*3 + 10*4 + 4*4 = 242 bits

For the second, we get:

36*2 + 19*2 + 17*2 + 14*3 + 10*4 + 4*4 = 242 bits

Intriguing, isn't it? Now you have to ask yourself, is this
just luck, or always the case?
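
You can check the sums quickly enough; a few lines of Python, assuming
frequencies per hundred characters:

    freqs = {'E': 36, 'S': 19, 'A': 17, 'D': 14, 'R': 10, 'Z': 4}
    first  = {'E': 1, 'S': 3, 'A': 3, 'D': 3, 'R': 4, 'Z': 4}   # lengths, tree 1
    second = {'E': 2, 'S': 2, 'A': 2, 'D': 3, 'R': 4, 'Z': 4}   # lengths, tree 2
    for sizes in (first, second):
        print(sum(freqs[ch] * sizes[ch] for ch in freqs))       # 242 both times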

Practical Points

{We need to make a few notes on e.g. Fibonacci}

NB: ceil(log2(alphabet size)) is the maximum number of 1's in a row!
That is, a code word may be longer than this, but then its high-order bits
must be zero, which helps with encoding!