Tutorial :Huffman compression algorithm

Question:

I've implemented file compression using huffman's algorithm, but the problem I have is that to enable decompression of the compressed file, the coding tree used, or the codes itself should be written to the file too. The question is: how do i do that? What is the best way to write the coding tree at the beggining of the compressed file?

Solution:1

There's a pretty standard implementation of Huffman Coding in the Basic Compression Library (BCL), including a recursive function that writes the tree out to a file. Look at huffman.C. It just writes out the leaves in order so the decoder can reconstruct the same tree.

BCL is also nice because there are some other pretty straightforward compression algorithm pieces in there, too. It's quite handy if you need to roll your own algorithm.

Solution:2

First off, have you considered using a standard compression Stream (like GZipStream in .net) ?

About how/where to write your data, you can manipulate a Streams position with Seek (even reserve space that way). If you know the size of the Tree ahead of time you can start writing after that position. But you may want to position the coding tree after the actual data, and just make sure you know where that starts. Ie, reserve a little space in front, write the compressed data, record the position, write the tree, go to the front and write out the position.

Solution:3

Assuming you compress on 8-bit symbols (i.e. bytes) and the algorithm is non-adaptive, the simplest way would be to store not the tree but the distribution of the values. For example by storing how often you found byte 0, how often byte 1, ..., how often byte 255. Then when reading back the file you can re-assemble the tree. This is the simplest solution, but requires the most storage space (e.g. to cover large files, you would need 4 bytes per value, i.e. 1kb).

You could optimize this by not storing exactly how often each byte was found in the file, but instead normalizing the values to 0..255 (0 = found least, ...), in which case you would only need to save 256 bytes. Re-assembling of the tree based on these values would result in the same tree. (This is not going to work as pointed out by Edmund and in question 759707 - see there for further links and answers to your question)

P.S.: And as Henk said, using seek() allows you to keep space at the beginning of the file to store the values in later.

Solution:4

Most implementations are using canonical huffman encoding. You have only to store the symbol lengths in a compact way. Hier an implementation: shcodec. Another way is using a semi-static huffman encoding (periodic rescale), then you have not to store any tree.

Solution:5

Instead of writing the code tree to the file, write how often each character was found, so the decompression program can generate the same tree.

Solution:6

The most naive solution would be to parse the compression tree in pre-order and write the 256 values in the header of your file.

Solution:7

Since every node in a huffman tree is either a branch with two children, or a leaf, you can use a single bit to represent each node unambiguously. For a leaf, follow immediately with the 8 bits for that node.

e.g. for this tree:

/\ /\ A B /\ C D

You could store 001[B]01[C]1[D]1[A]

(Turns out this is exactly what happens in the huffman.c example posted earlier, but not how it was described above).

Solution:8

it is better to send the frequencies of characters and build the tree at the receiving end. This data will be of constant size for a fixed alphabet. I guess this must be serializable and put in the file. Sending the tree depends on its implementation, for what I have tried, an array based approach leads to more memory left unused for the tree since, the tree may not be a balanced tree most of the time. If the tree was balanced then array representation would have generated the best option.