Huffman Compression

When we encode characters in computers, we assign each an 8-bit code
based on an ASCII chart. But in most files, some characters appear more
often than others.
So wouldn't it make more sense to assign shorter codes for characters that
appear more often and longer codes for characters that appear less often?
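To see the potential savings, consider a quick back-of-the-envelope sketch. The message and the variable-length codes below are invented for illustration (they are not from the text), but the code table is a valid prefix code for these letter frequencies:

```python
# "beekeeper": 'e' appears 5 times, every other letter only once.
message = "beekeeper"

fixed_bits = len(message) * 8              # 9 chars * 8 bits = 72 bits

# An assumed prefix code: the frequent 'e' gets 1 bit,
# the rare letters get 3 bits each.
code = {'e': '0', 'b': '100', 'k': '101', 'p': '110', 'r': '111'}
variable_bits = sum(len(code[ch]) for ch in message)

print(fixed_bits, variable_bits)           # prints: 72 17
```

Seventeen bits instead of seventy-two, simply because the common character got the short code.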

This is exactly what Claude Shannon and
R.M. Fano were thinking when they
created one of the first compression algorithms in the late 1940s. However,
D.A. Huffman
published a paper in 1952 that slightly improved on their algorithm, and
Shannon-Fano coding was soon superseded by the appropriately named Huffman coding.

The Concept:

Huffman coding has the following properties:

Codes for more probable characters are shorter than ones for
less probable characters.

Each code can be uniquely decoded.

To accomplish this, Huffman coding creates what is called a "Huffman tree",
which is a binary tree such as this one:

Figure: a sample Huffman tree

To read the codes from a Huffman tree, start from the root and add a '0'
every time you go left to a child, and add a '1' every time you go right.
So in this example, the code for the character 'b' is 01 and the code for
'd' is 110.
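This traversal can be sketched in a few lines of Python. The tree below is an assumed one, shaped only so that 'b' comes out as 01 and 'd' as 110 to match the example; the remaining leaves ('a', 'c', 'e') are invented for illustration:

```python
# Internal nodes are (left, right) pairs; leaves are single characters.
# Assumed tree, consistent with 'b' -> 01 and 'd' -> 110 from the example.
tree = (('a', 'b'), ('c', ('d', 'e')))

def codes(node, prefix=""):
    """Walk the tree, appending '0' for each left step and '1' for each right step."""
    if isinstance(node, str):              # leaf: emit the accumulated code
        return {node: prefix}
    left, right = node
    table = codes(left, prefix + "0")
    table.update(codes(right, prefix + "1"))
    return table

table = codes(tree)
print(table['b'], table['d'])              # prints: 01 110
```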

As you can see, 'a' has a shorter code than 'd'. Notice that since all the
characters sit at the leaves of the tree (the ends), one code can never be
the prefix of another (e.g., 'a' being 01 while 'b' is 011 cannot happen).
Hence, this prefix-free property ensures that each code can be uniquely decoded.
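The prefix-free property is what makes greedy, bit-by-bit decoding work: as soon as the bits read so far match a code, that match is the only possible one. Here is a sketch using an assumed code table consistent with the example's codes for 'b' (01) and 'd' (110); the codes for 'a', 'c', and 'e' are invented for illustration:

```python
def decode(bits, code_table):
    """Decode a bit string by consuming bits until they match a code."""
    inverse = {code: ch for ch, code in code_table.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:                 # prefix-free: first match is unambiguous
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

# Assumed prefix-free table matching 'b' -> 01 and 'd' -> 110.
table = {'a': '00', 'b': '01', 'c': '10', 'd': '110', 'e': '111'}
print(decode('0111000', table))            # prints: bda
```

If some code were a prefix of another (say 'a' = 01 and 'b' = 011), the decoder could not tell whether to stop at 01 or keep reading; keeping every character at a leaf rules that out.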