10-10-08 : On the Art of Good Arithmetic Coder Use

I think it's somewhat misleading to have these arithmetic coder libraries. People think they can take them and just "do arithmetic coding".
The reality is that the coder is only a small part of the arithmetic coding process, and it's the easiest part. The other parts are subtle
and somewhat of an art.

The other crucial parts are how you model your symbols, how you present symbols to the coder, how you break up the alphabet, and different
types of adaptation.

If you have a large alphabet, like 256 symbols or more, you don't want to just code with that alphabet. It has a few problems. For one,
decoding is slow, because you need to do two divides, and then you have to binary search to find your symbol from a cumulative probability.
Aside from the speed issue, you're not compressing super well. The problem is you have to assign probabilities to a whole mess of symbols.
To get good information on all those symbols, you need to see a lot of coded characters. It's sort of like the DSP sampling problem -
to get information on a big range of frequencies in audio you need a very large sampling window, which makes your temporal coherence really
poor. In compression, many statistics change very rapidly, so you want quick adaptation, but if you try to do that on a large alphabet
you won't be gathering good information on all the symbols. Peter Fenwick has some good old papers on multi-speed adaptation by decomposition
for blocksorted values.

In compression most of our alphabets are highly skewed, or at least you can decompose them into a highly skewed portion and a flat portion.
Highly skewed alphabets can be coded very fast using the right approaches. First of all you generally want to sort them so that the
most probable symbol is 0, the next most probable is 1, etc. (you might not actually run a sort, you may just do this conceptually so that
you can address them in order from MPS to LPS). For example, blocksort output or wavelet coefficients already are sorted in this order.
Now you can do cumulative probability searches and updates much faster by always starting at 0 and summing forward from 0, so that you rarely
walk very far into the tree.
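A minimal sketch of that MPS-first walk (the function name and frequency table here are made up for illustration):

```python
# Find a symbol from a target cumulative frequency by walking from the
# most probable symbol (index 0) upward. With an MPS-first sorted,
# highly skewed alphabet the walk usually terminates after a step or two.

def find_symbol(freqs, target):
    # freqs is sorted most-probable-first; target is in [0, sum(freqs))
    cum = 0
    for sym, f in enumerate(freqs):
        if target < cum + f:
            return sym, cum          # symbol and its cumulative base
        cum += f
    raise ValueError("target out of range")

# Skewed counts: symbol 0 holds half the probability mass, so most
# lookups return on the very first iteration.
freqs = [8, 4, 2, 1, 1]
```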

You can also decompose your alphabet into symbols that more accurately represent its "nature" which will give you better adaptation. The
natural alphabet is the one that mimics the actual source that generated your code stream. Of course that's impossible to know, but you
can make good guesses. Consider an example :

The source generates symbols thusly :
It chooses Source1 or Source2 with probability P
Source1 codes {a,b,c,d} with fixed probabilities P1(a),P1(b),...
Source2 codes {e,f,g,h} with fixed probabilities P2(e),P2(f),...
The probability P of sources 1 and 2 changes over time with a semi-random walk
After each coding event, it either does P += 0.1 * (1 - P) or P -= 0.1 * P
If we tried to just code this with an alphabet {a,b,c,d,e,f,g,h} and track the adaptation
we would have a hell of a time and not do very well.
Instead if we decompose the alphabet and code a binary symbol {abcd,efgh} and then code
each half, we can easily do very well. The coding of the sub-alphabets {a,b,c,d} and
{e,f,g,h} can adapt very slowly and gather a lot of statistics to learn the probabilities
P1 and P2 very well. The coding of the binary symbol can adapt quickly and learn the
current state of the random decision probability P.
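The two adaptation speeds in that example can be sketched with a simple exponential-decay probability update; the class name, rates, and update rule below are all illustrative, not from any particular coder:

```python
# Illustrative sketch: one fast-adapting model for the source choice,
# slow-adapting models for the stable sub-alphabets.

class AdaptiveModel:
    def __init__(self, nsyms, rate):
        self.p = [1.0 / nsyms] * nsyms
        self.rate = rate                      # adaptation speed

    def update(self, sym):
        # exponential decay toward the observed symbol; probabilities
        # stay normalized because the moves cancel out
        for s in range(len(self.p)):
            target = 1.0 if s == sym else 0.0
            self.p[s] += self.rate * (target - self.p[s])

source_select = AdaptiveModel(2, rate=0.10)   # tracks P quickly
sub_models = [AdaptiveModel(4, rate=0.01),    # learns P1(a..d) slowly
              AdaptiveModel(4, rate=0.01)]    # learns P2(e..h) slowly

# coding one symbol from source 0 updates both levels:
source_select.update(0)
sub_models[0].update(2)
```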

This may seem rather contrived, but in fact lots of real world sources look exactly like this. For example if you're trying to compress
HTML, it switches between basically looking like English text and looking like markup code. Each of those is a separate set of symbols
and probabilities. The probabilities of the characters within each set are roughly constant (not really, but they're relatively constant
compared to the abruptness of the switch), but where a switch is made is random and hard to predict so the probability of being in
one section or another needs to be learned very quickly and adapt very quickly.

We can see how different rates of adaptation can greatly improve compression.

Good decomposition also improves coding speed. The main way we get this is by judicious use of binary coding. Binary arithmetic coders
are much faster - especially in the decoder. A binary arithmetic decoder can do the decode, modeling, and symbol find all in about 30 clocks
and without any division. Compare that to a multisymbol decoder which is around 70 clocks just for the decode (two divides), and that doesn't
even include the modeling and symbol finding, which is like 200+ clocks.
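To make the binary decode loop concrete, here's a toy binary arithmetic coder in the classic low/high (Witten-Neal-Cleary) style. This is an illustrative sketch only - a real fast coder works in integer registers with a fixed-point probability and none of this Python overhead:

```python
# Toy adaptive-probability-capable binary arithmetic coder, to show
# what a single binary "coding operation" is.

CODE_BITS = 32
MAX = (1 << CODE_BITS) - 1
HALF = 1 << (CODE_BITS - 1)
QUARTER = 1 << (CODE_BITS - 2)

class BinEncoder:
    def __init__(self):
        self.low, self.high, self.pending, self.bits = 0, MAX, 0, []

    def _emit(self, bit):
        self.bits.append(bit)
        self.bits.extend([bit ^ 1] * self.pending)
        self.pending = 0

    def encode(self, bit, p0):
        # p0 = probability of a 0 bit, strictly inside (0,1)
        split = self.low + int((self.high - self.low + 1) * p0) - 1
        if bit == 0:
            self.high = split
        else:
            self.low = split + 1
        while True:                       # renormalize
            if self.high < HALF:
                self._emit(0)
            elif self.low >= HALF:
                self._emit(1); self.low -= HALF; self.high -= HALF
            elif self.low >= QUARTER and self.high < HALF + QUARTER:
                self.pending += 1; self.low -= QUARTER; self.high -= QUARTER
            else:
                break
            self.low, self.high = self.low * 2, self.high * 2 + 1

    def finish(self):
        self.pending += 1
        self._emit(0 if self.low < QUARTER else 1)
        return self.bits

class BinDecoder:
    def __init__(self, bits):
        self.src, self.pos = bits, 0
        self.low, self.high, self.code = 0, MAX, 0
        for _ in range(CODE_BITS):
            self.code = (self.code << 1) | self._next()

    def _next(self):
        b = self.src[self.pos] if self.pos < len(self.src) else 0
        self.pos += 1
        return b

    def decode(self, p0):
        split = self.low + int((self.high - self.low + 1) * p0) - 1
        if self.code <= split:
            bit, self.high = 0, split
        else:
            bit, self.low = 1, split + 1
        while True:                       # renormalize, mirroring encode
            if self.high < HALF:
                pass
            elif self.low >= HALF:
                self.low -= HALF; self.high -= HALF; self.code -= HALF
            elif self.low >= QUARTER and self.high < HALF + QUARTER:
                self.low -= QUARTER; self.high -= QUARTER; self.code -= QUARTER
            else:
                break
            self.low, self.high = self.low * 2, self.high * 2 + 1
            self.code = self.code * 2 + self._next()
        return bit
```

Note the decode path: one split computation, one comparison, one renormalization loop - no division and no symbol search.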

Now, on general alphabets you could decompose your multisymbol alphabet into a series of binary arithmetic codes. The best possible way to
do this is with a Huffman tree! The Huffman tree tries to make each binary decision as close to 50/50 as possible. It gives you the minimum
total code length in binary symbols if you just wrote the Huffman codes, which means it gives you the minimum number of coding operations if you
use it to do binary arithmetic coding. That is, you're making a binary tree of coding choices for your alphabet but you're skewing your tree
so that you get to the more probable symbols with fewer branches down the tree.
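A small sketch of that claim: build the Huffman tree over the symbol probabilities, and the expected leaf depth is the expected number of binary coding operations per symbol (helper names here are mine):

```python
import heapq

# Build a Huffman tree and report each symbol's depth = the number of
# binary decisions used to reach (i.e. code) it.

def huffman_depths(probs):
    # heap entries: (probability, tiebreak, {symbol: depth})
    heap = [(p, i, {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        pa, _, da = heapq.heappop(heap)
        pb, _, db = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**da, **db}.items()}
        heapq.heappush(heap, (pa + pb, tie, merged))
        tie += 1
    return heap[0][2]

# A dyadic distribution: the MPS is reached in a single binary coding
# operation, and the expected depth equals the entropy (1.75 bits).
probs = [0.5, 0.25, 0.125, 0.125]
depths = huffman_depths(probs)
avg_ops = sum(probs[s] * d for s, d in depths.items())
```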

(BTW using the Huffman tree like this is good for other applications. Say for example you're trying to make the best possible binary search
tree. Many people just use balanced trees, but that's only optimal if all the entries have equal probability, which is rarely the case. With
non-equal probabilities, the best possible binary search tree is the Huffman tree! Many people also use self-balancing binary trees with
some sort of cheesy heuristic like moving recent nodes near the head. In fact the best way to do self-balancing binary trees with non equal
probabilities is just an adaptive Huffman tree, which has logN updates just like all the balanced trees and has the added bonus of actually
being the right thing to do; BTW to really get that right you need some information about the statistics of queries; eg. are they from a
constant-probability source, or is it a local source with very fast adaptation?).

Anyhoo, in practice you don't really ever want to do this Huffman thing. You sorted your alphabet and you usually know a lot about it so you
can choose a good way to code it. You're trying to decompose your alphabet into roughly equal probability coding operations, not because of
compression, but because that gives you the minimum number of coding operations.

A very common case is a log2 alphabet. You have symbols from 0 to N. 0 is most probable. The probabilities are close to geometric, like {P,P^2,P^3,...}.
A good way to code this is to write the log2 and then the remainder. The log2 symbol goes from 0 to log2(N) and contains most of the good
information about your symbol. The nice thing is the log2 is a very small alphabet, like if N is 256 the log2 only goes up to 9. That means coding it
is fast and you can adapt very quickly. The remainder for small log2's is also small and tracks quickly. The remainder for large log2's is a big alphabet,
but those values are super rare so we don't care about them.
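The decomposition itself is tiny; a sketch (function names are mine):

```python
# Split a nonnegative value into its log2 symbol (the bit length) and
# the remainder bits below the top bit. The log2 symbol is the small,
# fast-adapting part; the remainder is the nearly-flat part.

def log2_decompose(n):
    if n == 0:
        return 0, 0
    k = n.bit_length()               # log2 symbol: 1..9 for n up to 256
    return k, n - (1 << (k - 1))     # remainder has k-1 bits

def log2_compose(k, r):
    return 0 if k == 0 else (1 << (k - 1)) + r
```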

Most people now code LZ77 offsets and lengths using some kind of semi-log2 code. It's also a decent way to code wavelet or FFT amplitudes. As an
example, for LZ77 match lengths you might do a semi-log2 code with a binary arithmetic coder. The lengths {3,4,5,6} are super important and carry most of
the probability. So first code a binary symbol that's {3456} vs. {7+}. Now if it's 3456 send two more binary codes. If it's {7+} do a log2 code.

Another common case is the case that the 0 and 1 are super super probable and everything else is sort of irrelevant. This is common for example in
wavelets or DCT images at high compression levels where 90% of the values have been quantized down to 0 or 1. You can do custom things like code
run lengths of 0's, or code binary decisions first for {01},{2+} , but actually a decent way to generally handle any highly skewed alphabet is a unary
code. A unary code is the Huffman code for a geometric distribution in the case of P = 50% , that is {1/2,1/4,1/8,1/16,...} ; we code
our symbols with a series of binary arithmetic codings of the unary representation. Note that this does not imply that we are assuming anything about
the actual probability distribution matching the unary distribution - the arithmetic coder will adapt and match whatever the distribution is - it's just
that we are optimal in terms of the minimum number of coding operations only when the probability distribution matches the unary distribution.
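A sketch of the unary decomposition, with one adaptive probability per bit position so the coder tracks the real distribution no matter how un-unary it is; the update rule and rate below are illustrative:

```python
# Unary decomposition of a small alphabet: symbol s becomes s "keep
# going" decisions and one "stop" (the last symbol needs no stop).
# Each position gets its own adaptive bit probability.

class BitModel:
    def __init__(self):
        self.p1 = 0.5                     # adaptive probability of a 1
    def update(self, bit, rate=0.05):
        self.p1 += rate * (bit - self.p1)

def unary_bits(sym, nsyms):
    bits = [1] * sym
    if sym < nsyms - 1:
        bits.append(0)
    return bits

def unary_symbol(bit_iter, nsyms):
    s = 0
    while s < nsyms - 1 and next(bit_iter) == 1:
        s += 1
    return s

models = [BitModel() for _ in range(7)]   # one model per unary position
```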

In practice I use four arithmetic coder models :

1. A binary model, I usually use my rung/ladder but you can use the fixed-at-pow2 fractional modeller too.

2. A small alphabet model for 0-20 symbols with skewed distribution. This sorts symbols from MPS to LPS and does linear searches and probability
accumulates. It's good for order-N adaptive context modeling, N > 0.

3. A Fenwick Tree for large alphabets with adaptive statistics. The Fenwick Tree is a binary tree for cumulative probabilities with logN updates.
This is what I use for adaptive order-0 modeling, but really I try to avoid it as much as possible, because as I've said here, large alphabet
adaptive modeling just sucks.

4. A Deferred Summation semi-adaptive order-0 model. This is good for the semi-static parts of a decomposed alphabet, such as the remainder
portion of a log2 decomposition.
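A minimal sketch of the Fenwick tree in (3), with logN count update and logN cumulative frequency query (the interface names are mine):

```python
# Fenwick (binary indexed) tree over symbol counts: add() bumps a
# symbol's count, cumfreq(s) returns the total count of symbols < s.
# Both walk O(log N) tree nodes.

class FenwickTree:
    def __init__(self, nsyms):
        self.n = nsyms
        self.tree = [0] * (nsyms + 1)     # 1-based internal indexing

    def add(self, sym, delta=1):
        i = sym + 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i                   # step to the next covering node

    def cumfreq(self, sym):
        i, total = sym, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i                   # strip the lowest set bit
        return total
```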

Something I haven't mentioned that's also very valuable is direct modeling of probability distributions. eg. if you know your probabilities are
Laplacian, you should just model the Laplacian distribution directly, don't try to model each symbol's probability separately. The easiest way to do
this usually is to track the average value, and then use a formula to turn the average into probabilities. In some cases this can also make for
very fast decoding, because you can make a formula to go from a decoded cumulative probability directly to a symbol.
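As a sketch of the track-the-average idea for nonnegative values: a geometric distribution (the discrete analogue of one side of a Laplacian) has mean r/(1-r), which inverts to r = mean/(1+mean). The truncation handling and names below are illustrative:

```python
# Turn a tracked average into a full (truncated) geometric probability
# distribution P(s) = (1-r) * r^s, instead of adapting every symbol's
# probability independently.

def geometric_probs(mean, nsyms):
    r = mean / (1.0 + mean)             # from mean = r / (1 - r)
    probs = [(1.0 - r) * r ** s for s in range(nsyms)]
    total = sum(probs)
    return [p / total for p in probs]   # renormalize the cut-off tail
```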

ADDENDUM BTW : ASCII characters actually decompose really nicely. The top 3 bits are a "selector" and the bottom 5 bits are a "selection".
The probabilities of the bottom 5 bits need a lot of accuracy and change slowly; the probabilities of the top 3 bits change very quickly based
on what part of the file you're in. You can beat 8-bit order-0 by doing a separated 3-bit then 5-bit code. Of course this is how ASCII was
intentionally designed :