8/12/2010

08-12-10 - The Lost Huffman Paper

"On the Implementation of Minimum Redundancy Prefix Codes"
by Moffat and Turpin , 1997 ,
is a brilliant work that outlines very clearly the best ways to write a huffman decoder. It's a completely modern, practical work, and
basically nobody has added anything beyond this technique in the last 13 years.

However, the importance of it was missed when it came out.
For many years afterwards people continued to publish "improvements" to Huffman decoding
(such as Sub-linear Decoding of Huffman Codes Almost In-Place (1998) )
which are just pure useless shit (I don't mean to single out that paper, there were probably hundreds of shitty foolish pointless papers on
"efficient huffman" written after Moffat/Turpin).

Most people in the implementation community also missed this paper (eg. zlib, JPEG, etc. people who make important use of huffman decodes have
missed these techniques).

I missed it too. Recently we did a lot of work on Huffman decoding at RAD, and after trying many techniques and lots of brainstorming, we
came up with what we thought was a brilliant idea :

Store the code in your variable bit input word left-justified in a register. The Huffman codes are numerically arranged such that for codes
of any given length, leaves (symbols) are lower values than nodes (branches). Then, the code for the first branch of each codelen can be left-justified in a
word, and your entire Huffman decode consists of :

(this returns a symbol in "canonical order" - that is most probable is 0 ; if your symbols are not
in order from most to least probable, you need an additional table lookup to reorder them).

This is really incredibly fucking hot. Of course it's obvious that it can be improved in various ways - you can use a fast table to skip
the first few steps, you can use a nextCodeLen table to skip blanks in the codeLen sequence, and you can use a binary search instead of linear
search. For known-at-compile-time huffman trees you could even optimize the binary search for the probability distribution of the codes and
generate machine code for the decoder directly.

All of those ideas are in the Moffat+Turpin paper.

ADDENDUM : this post happened to get linked up, so I thought I'd flesh it out and fill in some of the blanks I'm implying above, since
I'm sure you blog readers aren't actually going and reading the Moffat/Turpin paper like you should.

In all cases a practical Huffman decoder should use an initial table lookup to accelerate the first N bit step. (N usually 8-12 depending on
application). The reality is that what you do after that is not super important because it is rare (the majority of Huffman codes are short).
Because of this, there are advantages to using a right-justified "upside down" huffman code word, with the MSB at the bottom (I believe Zip does this)
because it means you can get the first N bits by doing just an AND with a constant (eg. get the "top" 8 bits by doing &0xFF).

There are two key efficiency issues for Huffman decoder implementation : 1. Reducing branches, and reducing the dependency-chain that leads to branches.
That is, doing math is not bad, but doing math to decide to branch is bad. 2. Avoiding variable shifts, because variable shifts are catastrophically
slow on some important platforms.

Finally, let's look at a real implementation of the Moffat/Turpin Huffman decoder. The variable bit input word is stored in a word left-justified with top
bits at the left. The first branch code at each code length is also left-aligned.

this does a branch on each code length, but avoids variable shifts. In some extreme cases (platforms with very slow variable shift,
and huffman trees with very low minimum code lengths), this unrolled branching version can even be faster than the peek table, in
which case you would simply set TABLE_N_BITS to zero.

In some cases, certain code lengths might not occur at all, and
you can avoid checking them by having an additional table of which codelengths actually occur (in practice this rarely helps).
This would look like :

where the 11 is not a codelen but the 11th code len which actually occurs, and you have the extra codeLenTable[] lookup.

Obviously you could just unroll directly starting at codelen=1 , and obviously this is also just a search. You are just trying to find
where bits lies in the sorted huff_branchCodeLeftAligned table. So instead of just a linear search you could binary search. However
note that lower code lens are much more likely, so you don't want to just binary search at the beginning. And the binary search makes
the branches much less predictable, so it's not always a win. However, as Moffat/Turpin describes, in some cases, for example if you
have a hard-coded Huffman tree, the huff_branchCodeLeftAligned can be constants and you can optimize the binary tree branch structure,
so you can do codegen to make an optimal decoder, like :

There's one final issue. Because the bit word is left aligned in the machine word, we can't make any branchCode value for "all bits are branches".
In particular, with this compare :

if ( bits < huff_branchCodeLeftAligned[11] )
{

when bits is all 1's (0xFFF...) we can't make a branchCodeLeftAligned that returns true. There are a few solutions for this, one is to use
<= branchCodeMinusOne , but then you have to make sure that you start with codeLen >= minCodeLen , because below that branchCode is zero and
you have a problem. Another solution is to make sure bits is never full; that is, if you have a 32 bit word, then only allow 31 bits (or less)
in your variable bit register. The final solution is the one I prefer :

The actual case of bits = all 1's in a 32 bit register should only occur 1 in 4 billion times, so we don't have to handle it fast, we just have
to handle it correctly. So I suggest you do the unrolled checks for decode, and then after you have checked all codelens up to maximum allowed
(24 or 32 or whatever your max codelen is), if bits was ~0 , it will have not decoded, so you can do :

Finally note that in most cases it is slightly more efficient to use a "semi-huffman" code rather than a true huffman code.
The semi-huffman code I propose is huffman up to codelen = 16 or so, and then simply flat after that. In most cases this affects
compression ratio microscopically (because the long code lengths are very rare) but can reduce complexity a lot. How does it
reduce complexity?

1. You don't have to do the proper length-limitted huffman tree construction. Instead you just build a normal unlimitted huffman
code length set, and then anything with code length >= 16 is flagged as part of the semi-huffman set.

2. When you transmit your code lengths, you don't have to send the lengths > 16, you just send lengths in [0,16] (0 means doesn't occur).

3. Your unrolled decoder only has to go up to 16 branches (if you table-accelerate, you do 8 bits by table then 8 more branches).

4. Then in the final case instead of just handling the ~0 case you handle all the "long" symbols with a flat code.

It looks pretty much like LZMA but with the arithmetic for match lens and literals replaced with quasi-adaptive-huffman , yes?

The code is actually readable for one thing which is nice.

BTW it would be nice if you write a little about it. A few things in particular have always befuddled me in LZMA :

1. Where does the state transition table come from exactly? Do you understand it intuitively?

2. How does his optimal parse work exactly? It looks like you are doing a full forward Dijkstra on some # of bytes forward (up to 1024), which is pretty damn slow. LZMA's is much faster than that so it must be something else.

Hi Charles,I copied the state transition table straight from Igor's decoder (lzmaDec.c). I haven't spent any time trying to really understand it yet.

His coding style makes my head hurt. I've tried several times to comprehend wtf is really going on in there.

Yea, the forward Dijkstra search is very expensive, but without it lzham's ratio isn't competitive. My flexible parsing experiments where mostly failures (in a comp. ratio sense). I'm still looking for a smarter way to do this.

The latest development version of lzham accelerates parsing by breaking up the lookahead buffer into (up to) four 3KB blocks and assigns each block to a worker thread to be independently parsed starting from the same state. After all blocks are parsed the main thread takes the lz decisions from each block and codes them using the real state. I added a special code that is issued after each block that tells the decompressor to reset its cur_state and match history.

I also moved a lot of redundant get_cost() calculations out of the innermost matchlen truncation loop, and simplified/tweaked a bunch of stuff.

The matchlen truncation loop is probably the next area I'm going to explore. Exhaustively trying all possible truncations doesn't seem worth the effort.

With these changes the new compressor is 3-4x faster on a core i7, and is now faster than 7za -mx9. But I'm still using way more overall CPU than LZMA.

"The matchlen truncation loop is probably the next area I'm going to explore. Exhaustively trying all possible truncations doesn't seem worth the effort."

I have a good solution to this for the backwards optimal parse, but it doesn't work with forwards parse. It's "LowestCostPrecedingPos"( http://cbloomrants.blogspot.com/2008/10/10-10-08-2.html ). I think a similar idea might work for forwards actually, but it requires a bit more book-keeping.