
08-11-10 - Huffman - Arithmetic Equivalence

Huffman to Arithmetic coder transformation.

This is something well known by "practitioners of the art" but I've never seen it displayed explicitly, so here we go.
We're talking about arbitrary-alphabet decoding here obviously, not binary, and static probability models mostly.

Let's start with our Huffman decoder (a bit of review here or here or here). For simplicity and symmetry, we will use a Huffman decoder that can handle code lengths up to 16, and we will use a table-accelerated decoder. The decoder looks like this :

this is very standard (more normally the huffTable would only accelerate the first 8-12 bits of decode, and you would then fall back to
some other method for codes longer than that). Let's expand out what Peek and Consume do exactly. For symmetry with the arithcoder I'm going to keep my bit buffer
right-aligned in a big-endian word.

it should be obvious what these do; _Peek grabs the top 16 bits of code for you to snoop. Consume removes the top "bits" from code, and
then streams in bytes to refill the bits while we are under count. (to repeat again, this is not how you should actually implement bit
streaming, it's slower than necessary).

Okay, now let's look at an Arithmetic decoder (a bit of review here or here and here). First let's start with the totally generic case. Arithmetic Decoding consists of getting the probability target, finding what symbol that corresponds to, then removing that symbol's probability range from the stream. This is :

Okay, it's not actually obvious that this is a correct arithmetic decoder (the details are quite subtle), but it is; in fact this is
just about the fastest arithmetic decoder in the world (the only thing you would do differently in real code is share the divide by cumulativeProbabilityTotal
so it's only done once).

Now, the problem of taking the Peek target and finding what symbol it specifies is actually the slowest part; there are various solutions: Fenwick trees, Deferred Summation, etc. For now we are talking about *static* coding, so we will use a table lookup.

To decode with a table we need a table from [0,cumulativeProbabilityTotal] which can map a probability target into a symbol. So when we get a value
from _Peek we look it up in a table to get the symbol, cumulativeProbabilityLow, and probability.

To speed things up, we can use cumulativeProbabilityTotal = a power of two to turn the divide into a shift. We choose cumulativeProbabilityTotal = 2^16.
(the longest code we can write with our arithmetic coder then has code length -log2(1/cumulativeProbabilityTotal) = 16 bits).

Okay, not bad, and we still allow arbitrary probabilities within [0,cumulativeProbabilityTotal], so this is more general than the Huffman decoder.
But we still have a divide, which is very slow. So if we want to get rid of that, we have to constrain a bit more :

Make each symbol probability a power of 2, so (cumProbHigh - cumProbLow) is always a power of 2 (< cumulativeProbabilityTotal). We will then store the log2
of that probability range. Let's do that explicitly :

we only need a tiny bit more now. First observe that an arithmetic symbol of log2Probability is written in (16 - log2Probability) bits, so let's call
that "codeLen". And we'll rename AC_log2range to AC_bitlen :

What's happening here - peek is 16 bits long, it's a window in the next 16 bits of "bits_code". First
we make "code" which is the top "codeLen" of "peek". "code" is our actual Huffman code for this symbol.
Then we know the top bits of bits_code are equal to code, so to turn them off, rather than masking we can
subtract. The equivalent cumProbLow is code << (16 - codeLen).
The arithmetic coder had to look up cumProbLow in a table because it is still actually a bit more
general than the Huffman decoder. In particular our arithmetic decoder can still handle probabilities
like [1,2,4,1] (with cumProbTot = 8). Because of that the cumProbLows don't hit the nice bit boundaries.
If you require that your arithmetic probabilities are always sorted [1,1,2,4], then since they are power of
two and sum to a power of two, each partial power of two must be present, so the cumProbLows must all hit
bit boundaries like the huffman codes, and the equivalence is complete.

So, you should now see clearly that a Huffman and Arithmetic coder are not completely different things. They are a
continuum on the same scale. If you start with a fully general Arithmetic coder it is flexible, but slow. You then
constrain it in various ways step by step, it gets faster and less general, and eventually you get to a Huffman coder.
But those are not the only coders in the continuum, you also have things like "Arithmetic coder with fixed power of
two probability total but non-power-of-2 symbol probabilities" which is somewhere in between in space and speed.

BTW not directly on topic, but I found this in my email and figure it should be in public :

Well, Adaptive Huffman is awful, nobody does it. So you have a few options :

Static Huffman :
    very fast
    code lengths must be transmitted
    can use table-based decode

Arithmetic with static probabilities scaled with total = a power of 2 :
    very fast
    can use table-based decode
    must transmit probabilities
    decode must do a divide

Arithmetic semi-adaptive ("Deferred Summation") :
    doesn't transmit probabilities

Arithmetic fully adaptive :
    must use Fenwick tree or something like that
    much slower, coder time no longer dominates (symbol search in tree time dominates)

Arithmetic by binary decomposition :
    can use fast binary arithmetic coder
    speed depends on how many binary events it takes to code symbols on average
It just depends on your situation so much. With something like image or audio coding you want to do special-cased things like turn amplitudes into log2+remainder, use a binary coder for the zero, perhaps do zero-run coding, etc., to avoid doing the fully general case of a multisymbol large alphabet coder.

9 comments:

This is a great illustration. It even suggests a template-parameterized implementation for table-based decoders. An obvious-but-interesting thing to me: this continuum extends to coding simpler than table-based Huffman.

I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter.

So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you to Rice/Golomb/Gamma/etc. codes.

"I wonder if it would be faster to do static huffman by compiling the tables directly into code. Maybe it is a lose because your branch targets are unpredictable, but maybe it is a win because you can more easily do things like decode multiple symbols, use immediate shifts, and have shorter load dependency chains. A fast implementation would likely look similar to a really simple indirect-threaded bytecode interpreter."

Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win.

"So this would be a compiled-table approach, but now what could you do if your bits->symbols mapping had structure that was easier to transpose into code? Specifically, let's get rid of the switch dispatch. I believe this leads you to Rice/Golomb/Gamma/etc. codes."

Ah yeah, that's an interesting idea. The most compelling case would come if your encoder was a bit flexible, eg. you had some space/speed tradeoff parameter and it could choose to build tables that were not quite optimal and had more directly encodable structure.

For example, one case that's very common is to have a fast decode table for the first 8-12 bits of the huffman code. Then, for the remainder of the symbols instead of decoding the true huffman code, you could just use a variable-length code of some sort, with very little loss because those symbols are rare anyway. But then speeding up that branch is also not very important. However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens > fastLen. Hmmm...

by the way, since it's not clear, that is not me bandwagonning on that meme, I actually hate that meme.

it is actually me making fun of myself, because my first reaction to this post HONESTLY was "holy crap, it's in color, Charles is using color, unbelievable". And when I went to post that I realized that perhaps I was overdoing it a bit, and should make fun of myself for it.

"Yeah we talked about doing this. If you had to actually interpret commands it would be a lose, but if you could compile directly into machine code and insert it in your decoder, it would be a win."

It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together.

A few random questions about table-based decoders.

What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses, but your load dependency chain shrinks. I suppose you mentioned that on in-order PPC there's some advantage to having 8-bit arrays, but you're going to be eating 3-5 cycles even if your L1 hits.

It seems like it might be advantageous to be able to decode many small symbols simultaneously. If Huffman is able to compress 8-bit bytes to 4 bits on average, you expect to be able to decode 2-3 bytes from 8-12 bits. Have you ever tried this? Dealing with these special cases might not be easy in table-based implementations, but might be more feasible with codegen.

"However one nice win of that approach is that sending the code lens can be done smaller and simpler because you don't send any of the code lens greater than fastLen. Hmmm..."

You could make it so that rare symbols are transmitted as literals with a common prefix. Then you send the rare symbols + code lens, and the code lens build a huffman tree that excludes the rare symbols, but includes the rare symbol marker. This was just the naive and obvious thing I thought of just now, so I'm sure you could do much better.

Whenever I see "code lens" I think that there is some deep philosophical implication. Like: "through the lens of the algorithm" instead of "I am too lazy to type gth."

"It would be fun to JIT your Huffman decoder. I imagine a simple table-based codegen (which I'm sure you have lying around) would be easy to throw together."

Yeah there are lots of fun things you could do with dynamic code. Unfortunately, it's forbidden on most consoles for security reasons.

"What is the size/speed tradeoff of looking up symbol length after decoding the symbol? If you looked up both together, you might get some more cache misses"

Yes, the exact details of this are platform dependent. We have tried looking them up together, looking them up in two arrays both indexed by peek, looking up sym then looking up codelen from sym, etc. The code posted is not what we ship with. There are platform quirks but generally the biggest issue is the dependency chain and when you need which values.

"It seems like it might be advantageous to be able to decode many small symbols simultaneously."

Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman.

"You could make it so that rare symbols are transmitted as literals with a common prefix."

Indeed. There's a whole spectrum to be explored where mutating the coding a bit to be not perfectly Huffman could reduce memory use and improve speed, and then you could play with the space/speed tradeoff.

"Yeah, this is in the old Huffman2 that's in crblib. It's not very useful in general because you generally have a mixed stream, like huffman-raw bits-huffman."

Depends on the anatomy of your code stream (and your decoder). If the raw bits are part of the same (logical) symbols, it's definitely a win to include them in the table.

Two examples: First one is LZ offsets. You typically have a couple different symbols to encode the magnitude of the offset followed by a couple of raw bits to specify it fully. If the full code (prefix + extra bits) is shorter than your full table size, you can just decode the whole thing at once: instead of doing a prefix-symbol decode and then a separate raw-bit read, you do a single table lookup that hands back the fully decoded offset.

Second example: JPEG-style value coding. Similar to the above, you have a huffman tree encoding run length and log2(abs(value)) followed by extra bits to specify sign and "mantissa". You can put decoding for sign+mantissa into the table for small values (which typically have short codes anyway). Now you have 3 paths: fast path (decode huff bits + trailing bits in one table lookup), normal path (decode huff bits in one table lookup) and slow path (multiple table lookups or one table lookup + linear scan afterwards). For something JPEG-ish, it's not unusual for 85% or more of all codes to go through the fast path even with moderate table sizes (did this with 9 bits index and 16 bits/table entry, so a 1k table total... pretty good deal).

So this definitely works if you can fold multi-step decoding of a single logical symbol into one table lookup. It's useless for anything that actually controls execution flow in the decoder, though. Reading two symbols that influence control flow separately adds N+M states your decoder can be in. Parsing them together introduces (N x M) states - usually a bad idea, unless a significant fraction is trivial or similar and can be unified.