A Method for the Construction of Minimum-Redundancy Codes

This is a discussion on A Method for the Construction of Minimum-Redundancy Codes within the Contests Board forums, part of the Community Boards category.

This, of course, was the title of David A. Huffman's now-famous term paper describing one of the most important compression models ever. And even as some algorithms available today far exceed its capabilities, it nonetheless remains a crucial achievement in the field of information theory that cannot be ignored. Many modern encoders even use Huffman encoding as a preprocessing step due to its simplicity and general performance characteristics.

The challenge is to implement the algorithm in such a way that it produces the smallest output file, and this will be the primary factor in deciding the winner. Bragging rights will also be awarded to the fastest implementation, as well as to the one requiring the least amount of overhead.

There's no prize for being the winner, unfortunately, EXCEPT that if the submitted implementation produces smaller output files* than the solution I will be posting as a benchmark (which won't be an official entry), the winner WILL receive a $50 Gift Certificate.

Rules are as follows:

(1) The entry can be written in any language as long as:
- The source is provided, obviously
- An entry point to a C-style function with the following name and signature is provided:

The function should read data from in_file and write the output to out_file, compressing the data if mode_is_compress is non-zero and decompressing otherwise.
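The actual prototype isn't quoted in the thread, so the following is only a reconstruction from that description: the parameter roles (in_file, out_file, mode_is_compress) come from the rules, but the name huff_process and the return convention are stand-ins of my own.

```c
#include <stdio.h>

/* Hypothetical skeleton -- the contest's real function name isn't shown
 * in the thread, so "huff_process" is a placeholder.  Returns 0 on
 * success, non-zero on failure (also an assumption). */
int huff_process(const char *in_file, const char *out_file, int mode_is_compress)
{
    FILE *in = fopen(in_file, "rb");
    if (!in)
        return 1;
    FILE *out = fopen(out_file, "wb");
    if (!out) {
        fclose(in);
        return 1;
    }

    if (mode_is_compress) {
        /* gather symbol frequencies, build the Huffman code,
         * then emit the model followed by the encoded bitstream */
    } else {
        /* read the model back, then decode the bitstream */
    }

    fclose(in);
    fclose(out);
    return 0;
}
```

Per rule #2, every data structure the real body needs would be created and destroyed inside this function.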

(2) No non-constant global variables are allowed. All of the various data structures needed to process the data must be created and destroyed within the entry point. This restriction applies to external files as well.

(3) No assumptions should be made about the format of an uncompressed file. It may contain any byte value ranging from 0 to 255.

(4) No assumptions should be made about the size of an input file, except that it will be some value less than 2^32 bytes.

(5) An implementation that crashes during the test will be disqualified.

(6) No third party libraries will be allowed. Standard libraries are fine unless it is determined that they result in some 'unfair' advantage. If in doubt, just ask.

(7) Submissions must be original. No plagiarized works allowed.

(8) No entries employing non-Huffman compression schemes will be allowed.

(9) Entries shall be submitted via PM.

At this point, there isn't a formal deadline for submissions. I'm going to give it about a week or so to see who is going to participate, and at that point we can take an informal vote to determine what the deadline should be, based on everyone's schedules.

I've attached below the benchmark. It is, of course, compressed. Full source code will be provided at the end of the contest.

Good luck.

* Conditions/Restrictions:
- The implementation must not favor any particular data format (such as text, for example)
- The output file must consistently be more than four bytes smaller than the benchmark's. That may sound like an odd requirement, but the reasoning is that the benchmark could have been designed to output four fewer bytes; I opted to keep them because they improve error detection significantly.

So, if we don't know that the in_file is text, or null-terminated, how do we know how big it is? And how are we supposed to know that out_file points to enough space to write the output? Or are those parameters supposed to be FILE * instead?

The input file can be queried with something like 'stat', or by opening it and seeking to the end, etc. The output file won't exist until your function creates it. The reason I didn't opt for FILE* was just to be flexible, e.g. in case you want to use the standard streams instead.
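The seek-to-the-end approach mentioned above can be sketched in portable C (stat() would work just as well, but is POSIX rather than standard C):

```c
#include <stdio.h>

/* Determine the input size without assuming anything about its
 * contents: open in binary mode, seek to the end, ask for the offset.
 * Returns -1 if the file can't be opened or seeked. */
long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return -1;
    }
    long size = ftell(f);
    fclose(f);
    return size;
}
```

Note that long caps this at 2 GB on many 32-bit platforms; an entry relying on the full 2^32 range would want the platform's 64-bit variant.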

EDIT: And just to clarify, by "make no assumptions about the size of the input" I mean don't structure your implementation in such a way that it requires a certain maximum input size.

Define non-Huffman. If 99% of the compression takes place using non-Huffman methods and I then use Huffman to preprocess or post-process the data, does that count as a Huffman method? Or are we limited to only those methods specifically discussed in his paper?

Until you can build a working general-purpose reprogrammable computer out of basic components from Radio Shack, you are not fit to call yourself a programmer in my presence. This is cwhizard, signing off.

The output basically consists of two things: (1) the compression model and (2) the actual data that has been compressed. The former doesn't matter at all - you can use whatever format you want. The latter, however, must only use Huffman encoding on the symbols. That is to say, you can't use arithmetic coding to compress the data, obviously. And just to be clear, Huffman coding does not dictate what particular symbol gets mapped to which code - it's completely arbitrary. All that matters is that the code is unambiguous, and 'minimally redundant'.

So we can use any method we want to preprocess the data as long as the final post-process compression is done using Huffman?

Now we will actually need to know the nature of the input data because the preprocessor used depends on knowing what data is being compressed.

No. The input data must not be preprocessed in any way (unless by 'preprocess' you simply mean to analyze it, in which case that's perfectly fine). In other words, if you want to pass it through an LZW compressor for the sole purpose of gathering statistics, go for it - as long as the data is not compressed by anything other than the Huffman encoder by the time it ends up in the output file, rule #8 hasn't been violated.

EDIT: And just to be clear, this means you can't even apply something as simple as RLE on the data, as that is a form of compression. The two areas you should be concentrating on, then, are:

- Identifying the most efficient encoding of a given symbol. Many implementations don't address this correctly and yield larger than necessary codewords.
- Choosing an efficient format for the compression model/statistics. This is also an area that is often neglected.
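On the second point, one common compact-model format (an illustration on my part, not the benchmark's actual header) is to store only the code length of each symbol and rebuild canonical codes from those lengths, since both encoder and decoder can derive identical codes from lengths alone:

```c
#include <stdint.h>

/* Sketch of canonical code assignment: given one code length per
 * symbol (0 = symbol unused), rebuild the codes themselves.  Storing
 * just the 256 lengths (or a packed form of them) keeps the model
 * small; the exact on-disk layout is up to each entry. */
void canonical_codes(const uint8_t len[256], uint32_t code[256], int max_len)
{
    int count[64] = {0};   /* how many codes exist at each length */
    uint32_t next[64];     /* next code value to hand out per length */

    for (int s = 0; s < 256; s++)
        if (len[s])
            count[len[s]]++;

    uint32_t c = 0;
    for (int l = 1; l <= max_len; l++) {
        next[l] = c;
        c = (c + count[l]) << 1;
    }

    for (int s = 0; s < 256; s++)   /* symbol order breaks ties */
        code[s] = len[s] ? next[len[s]]++ : 0;
}
```

With lengths a=1, b=2, c=2 this yields the codes 0, 10, 11, and any decoder given the same three lengths reproduces them exactly.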

So as long as it's not compressed in the preprocessing stage, we can do anything we want, like convert it from one format to another?

Because if we are limited to only performing Huffman, I don't really see what the point of the contest is, because Huffman is a set algorithm.

Huffman encoding isn't exactly an algorithm, per se, though. It really just describes a method for setting up a direct mapping of fixed-length symbols to variable-length codes, in such a way that the variable-length codes can be detected unambiguously. That's basically it. It doesn't even address the format to be used for storing the compression model and statistics (of course, in some cases they aren't even necessary, say, when the frequency of each symbol is known a priori (or at least a close approximation), but that's pretty rare). As such, there are many different types of data structures in use, many of them quite inefficient - I've even seen one that requires as much as 3 kilobytes to store the compression model. That doesn't help much in maintaining respectable compression ratios!
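The pairing step the paper describes - repeatedly merging the two least-frequent nodes - can be sketched as follows. The O(n^2) minimum scan is for clarity only; a real entry would use a heap or two sorted queues:

```c
#include <stdint.h>

#define NSYM 256

/* Compute Huffman code lengths from byte frequencies by repeatedly
 * merging the two lightest live nodes, then counting each leaf's
 * depth.  (Edge case left out for brevity: a file with only one
 * distinct byte value gets length 0 here and needs special handling.) */
void huffman_lengths(const uint64_t freq[NSYM], uint8_t len[NSYM])
{
    uint64_t w[2 * NSYM];     /* node weights */
    int parent[2 * NSYM];     /* -1 = live root */
    int map[NSYM];            /* symbol -> leaf node index */
    int n = 0;

    for (int s = 0; s < NSYM; s++) {
        len[s] = 0;
        if (freq[s]) {
            map[s] = n;
            w[n] = freq[s];
            parent[n] = -1;
            n++;
        } else {
            map[s] = -1;
        }
    }

    int live = n;
    while (live > 1) {
        int a = -1, b = -1;   /* indices of the two lightest live nodes */
        for (int i = 0; i < n; i++) {
            if (parent[i] != -1)
                continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        w[n] = w[a] + w[b];   /* merged node */
        parent[n] = -1;
        parent[a] = parent[b] = n;
        n++;
        live--;
    }

    for (int s = 0; s < NSYM; s++) {
        if (map[s] < 0)
            continue;
        for (int i = map[s]; parent[i] != -1; i = parent[i])
            len[s]++;         /* depth = code length */
    }
}
```

Feeding the lengths this produces into a canonical-code builder gives the complete encoder side of the mapping.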

The second problem is that, strange as it may sound, many implementations don't actually select the shortest codeword for a given symbol. The reasoning is that by doing so you can guarantee that the longest codeword doesn't exceed some maximum number of bits, such as 8 (compared with the theoretical maximum of 255), which also allows you to make all sorts of optimizations as well. Nothing wrong with that approach, I guess, if you can get it to perform well, but from my experience this is not generally the case. I'd be interested to find out whether this is just because it isn't being implemented correctly, is just an accepted trade-off, or is some sort of flaw in the theory itself. Not sure.

So the whole point of this exercise is to explore the fundamental limits of the method, and see just how far it can be improved, really.

Larger dictionaries often lead to BETTER compression, not the other way around.

On real systems, the symbol length is generally limited to the largest register size the processor can natively handle. On x86 systems that's 32 bits. It's a tradeoff, but one that sacrifices extremely little compression for very large performance gains: bitwise operations longer than 32 bits on a 32-bit machine take far longer than those that are 32 bits or smaller. Symbol lengths longer than 32 are rare and generally represent symbols that themselves occur very few times in the original data. In a worst-case scenario, a 33-bit symbol would represent data that occurred at most 3.0303% of the time, while one that took 32 bits would be 3.125%, so you would lose at most 0.0009469% compression. The point being that the real gains in compression are to be had in the preprocessing section, that is, massaging the data to be more compressible in the first place; this is how all modern compression works. To vastly over-simplify things - JPEG does it by reducing the color count, MPEG does it by reducing the sub-frame count, MP3 does it by reducing bandwidth, ZIP and RAR both do it by RLE, symbol translation, and many other methods.

That's really not how Huffman compression works, though. First of all, the encoded symbol for any particular byte can be any of a number of possible sizes, ranging from 1 to 255 bits, which means that things don't usually fall on some even boundary (eg: 32 bits, say). For example, let's say the code for a certain symbol is 100000000000001. Encoding 3 such characters sends the 45-bit string 100000000000001100000000000001100000000000001 to the output stream. The next symbol might append 1 bit to that, another 5 bits. The only time padding is inserted is at the very end of the compression process, to square things up on a byte boundary.
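That bit-level appending can be sketched as below; the MSB-first order and the little BitWriter struct are my own choices for illustration, not something mandated in the thread:

```c
#include <stdio.h>
#include <stdint.h>

/* Append variable-length codes into a byte accumulator, MSB-first.
 * Padding happens exactly once, in flush_bits, at the very end. */
typedef struct {
    FILE    *out;
    uint8_t  acc;    /* partially filled byte */
    int      nbits;  /* bits currently held in acc */
} BitWriter;

void put_bits(BitWriter *bw, uint32_t code, int len)
{
    for (int i = len - 1; i >= 0; i--) {   /* emit MSB of the code first */
        bw->acc = (uint8_t)((bw->acc << 1) | ((code >> i) & 1));
        if (++bw->nbits == 8) {
            fputc(bw->acc, bw->out);
            bw->acc = 0;
            bw->nbits = 0;
        }
    }
}

void flush_bits(BitWriter *bw)             /* zero-pad the final byte */
{
    if (bw->nbits)
        fputc((uint8_t)(bw->acc << (8 - bw->nbits)), bw->out);
    bw->acc = 0;
    bw->nbits = 0;
}
```

Writing the 15-bit code above three times produces exactly the 45 bits from the example, and only the final three padding zeros fall outside the code stream.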

Next, there is no "massaging" of the data involved at any point. To do so would equate to some other form of compression, obviously, which defeats the purpose altogether. Think of it this way: your encoder should assume that whatever preprocessing is needed to reduce the input size has already been applied to the data. Its only job is to produce Huffman codes - nothing else. So it's really just a matter of delegation. Even modern compressors don't expect the Huffman encoder to perform RLE on the input, or any other form of compression - those things are done prior to sending it to the Huffman phase, if necessary.

I'm going to go ahead and pass. I know how compression works, I know how Huffman works. You are mistaken if you think that no compressors modify the data prior to the Huffman compression stage.
