10/10/2008

I think it's somewhat misleading to have these arithmetic coder libraries. People think they can take them and just "do arithmetic coding".
The reality is that the coder is only a small part of the arithmetic coding process, and it's the easiest part. The other parts are subtle
and somewhat of an art.

The other crucial parts are how you model your symbol, how you present symbols to the coder, how you break up the alphabet, and different
type of adaptation.

If you have a large alphabet, like 256 symbols or more, you don't want to just code with that alphabet. It has few problems. For one,
decoding is slow, because you need to do two divides, and then you have to binary search to find your symbol from a cumulative probability.
Aside from the speed issue, you're not compressing super well. The problem is you have to assign probabilities to a whole mess of symbols.
To get good information on all those symbols, you need to see a lot of characters to code. It's sort of like the DSP sampling problem -
to get information on a big range of frequencies in audio you need a very large sampling window, which makes your temporal coherence really
poor. In compression, many statistics change very rapidly, so you want quick adaptation, but if you try to do that on a large alphabet
you won't be gathering good information on all the symbols. Peter Fenwick has some good old papers on multi-speed adaptation by decomposition
for blocksorted values.

In compression most of our alphabets are highly skewed, or at least you can decompose them into a highly skewed portion and a flat portion.
Highly skewed alphabets can be coded very fast using the right approaches. First of all you generally want to sort them so that the
most probable symbol is 0, the next most probable is 1, etc. (you might not actually run a sort, you may just do this conceptually so that
you can address them in order from MPS to LPS). For example, blocksort output or wavelet coefficients already are sorted in this order.
Now you can do cumulative probability searches and updates much faster by always starting at 0 at summing towards 0, so that you rarely
walk very far into the tree.

You can also decompose your alphabet into symbols that more accurately represent its "nature" which will give you better adaptation. The
natural alphabet is the one that mimics the actual source that generated your code stream. Of course that's impossible to know, but you
can make good guesses. Consider an example :

The source generates symbols thusly :
It chooses Source1 or Source2 with probability P
Source1 codes {a,b,c,d} with fixed probabilities P1(a),P1(b),...
Source2 codes {e,f,g,h} with fixed probabilities P2(e),P2(f),...
The probability P of sources 1 and 2 changes over time with a semi-random walk
After each coding event, it either does P += 0.1 * (1 - P) or P -= 0.1 * P
If we tried to just code this with an alphabet {a,b,c,d,e,f,g,h} and track the adaptation
we would have a hell of a time and not do very well.
Instead if we decompose the alphabet and code a binary symbol {abcd,efgh} and then code
each half, we can easily do very well. The coding of the sub-alphabets {a,b,c,d} and
{e,f,g,h} can adapt very slowly and gather a lot of statistics to learn the probabilities
P1 and P2 very well. The coding of the binary symbol can adapt quickly and learn the
current state of the random decision probability P.

This may seem rather contrived, but in fact lots of real world sources look exactly like this. For example if you're trying to compress
HTML, it switches between basically looking like English text and looking like markup code. Each of those is a seperate set of symbols
and probabilities. The probabilites of the characters within each set are roughly constantly (not really, but they're relatively constant
compared to the abruptness of the switch), but where a switch is made is random and hard to predict so the probability of being in
one section or another needs to be learned very quickly and adapt very quickly.

We can see how different rates of adaptation can greatly improve compression.

Good decomposition also improves coding speed. The main way we get this is by judicious use of binary coding. Binary arithmetic coders
are much faster - especially in the decoder. A binary arithmetic decoder can do the decode, modeling, and symbol find all in about 30 clocks
and without any division. Compare that to a multisymbol decoder which is around 70 clocks just for the decode (two divides), and that doesn't
even include the modeling and symbol finding, which is like 200+ clocks.

Now, on general alphabets you could decompose your multisymbol alphabet into a series of binary arithmetic codes. The best possible way to
do this is with a Huffman tree! The Huffman tree tries to make each binary decision as close to 50/50 as possible. It gives you the minimum
total code length in binary symbols if you just wrote the Huffman codes, which means it gives you the minimum number of coding operations if you
use it to do binary arithmetic coding. That is, you're making a binary tree of coding choices for your alphabet but you're skewing your tree
so that you get to the more probable symbols with fewer branches down the tree.

(BTW using the Huffman tree like this is good for other applications. Say for example you're trying to make the best possible binary search
tree. Many people just use balanced trees, but that's only optimal if all the entries have equal probability, which is rarely the case. With
non-equal probabilities, the best possible binary search tree is the Huffman tree! Many people also use self-balancing binary trees with
some sort of cheesy heuristic like moving recent nodes near the head. In fact the best way to do self-balancing binary trees with non equal
probabilities is just an adaptive huffman tree, which has logN updates just like all the balanced trees and has the added bonus of actually
being the right thing to do; BTW to really get that right you need some information about the statistics of queries; eg. are they from a
constant-probability source, or is it a local source with very fast adaptation?).

Anyhoo, in practice you don't really ever want to do this Huffman thing. You sorted your alphabet and you usually know a lot about it so you
can choose a good way to code it. You're trying to decompose your alphabet into roughly equal probability coding operations, not because of
compression, but because that gives you the minimum number of coding operations.

A very common case is a log2 alphabet. You have symbols from 0 to N. 0 is most probable. The probabilities are close to geometric like {P,P^2,P^3,...}
A good way to code this is to write the log2 and then the remainder. The log2 symbol goes from 0 to log2(N) and contains most of the good
information about your symbol. The nice thing is the log2 is a very small alphabet, like if N is 256 the log2 only goes up to 9. That means coding it
is fast and you can adapt very quickly. The remainder for small log2's is also small and tracks quickly. The remainder at the end is a big alphabet,
but that's super rare so we don't care about it.

Most people now code LZ77 offsets and lengths using some kind of semi-log2 code. It's also a decent way to code wavelet or FFT amplitudes. As an
example, for LZ77 match lengths you might do a semi-log2 code with a binary arithmetic coder. The {3,4,5,6} is super important and has most of
the probability. So first code a binary symbol that's {3456} vs. {7+}. Now if it's 3456 send two more binary codes. If it's {7+} do a log2 code.

Another common case is the case that the 0 and 1 are super super probable and everything else is sort of irrelevant. This is common for example in
wavelets or DCT images at high compression levels where 90% of the values have been quantized down to 0 or 1. You can do custom things like code
run lengths of 0's, or code binary decisions first for {01},{2+} , but actually a decent way to generally handle any highly skewed alphabet is a unary
code. A unary code is the huffman code for a geometric distribution in the case of P = 50% , that is {1/2,1/4,1/8,1/16,...} ; we code
our symbols with a series of binary arithmetic codings of the unary representation. Note that this does not imply that we are assuming anything about
the actual probability distribution matching the unary distribution - the arithmetic coder will adapt and match whatever distribution - it's just that
we are optimal in terms of the minimum number of coding operations only when the probability distribution is equal to the unary distribution.

In practice I use four arithmetic coder models :

1. A binary model, I usually use my rung/ladder but you can use the fixed-at-pow2 fractional modeller too.

2. A small alphabet model for 0-20 symbols with skewed distribution. This sorts symbols from MPS to LPS and does linear searches and probability
accumulates. It's good for order-N adaptive context modeling, N > 0

3. A Fenwick Tree for large alphabets with adaptive statistics. The Fenwick Tree is a binary tree for cumulative probabilities with logN updates.
This is what I use for adaptive order-0 modeling, but really I try to avoid it as much as possible, because as I've said here, large alphabet
adaptive modeling just sucks.

4. A Deferred Summation semi-adaptive order-0 model. This is good for the semi-static parts of a decomposed alphabet, such as the remainder
portion of a log2 decomposition.

Something I haven't mentioned that's also very valuable is direct modeling of probability distributions. eg. if you know your probabilites are
Laplacian, you should just model the laplacian distribution directly, don't try to model each symbol's probability directly. The easiest way to do
this usually is to track the average value, and then use a formula to turn the average into probabilities. In some cases this can also make for
very good decoding, because you can make a formula to go from a decoded cumulative probabilty directly to a symbol.

ADDENDUM BTW : Ascii characters actually decompose really nicely. The top 3 bits is a "selector" and the bottom 5 bits is a "selection".
The probabilities of the bottom 5 bits need a lot of accuracy and change slowly, the probabilities of the top 3 bits change very quickly based
on what part of the file you're in. You can beat 8-bit order0 by doing a separated 3-bit then 5-bit. Of course this is how ascii was
intentionally designed :

10-10-08

On LZ Optimal Parsing

So the LZ-Huffman I've done for RAD/Oodle does some new stuff with optimal parsing. I want to write up what I'm doing so I
remember it, and I'm also going to try to write about the semi-optimal parsing that a lot of people are doing now.

First let's review a bit. In LZ coding you have a lot of choices about how you code things. Greedily choosing the longest
match at each point dOpoes not produce the best code stream. There is a very good ancient paper by Storer and Szymanski that
introducing optimal parsing, and their method is both fast and 100% optimal, but it is only "optimal" in the sense of coding
the fewest number of literals possible. See
my old notes on Storer-Szymaski optimal parsing . Now, for the old LZSS coder
that used fixed bit coding (8 bit literals, 24 bit matches, or whatever), that optimality is also the same as code length optimality.
But in reality what we care about is code length optimality - we want to produce the shortest possible file - which is not
necessarilly the same as coding the fewest possible literals.

There are sort of two interesting domains of modern LZ coders, and optimal parsing matters a whole lot to both of them. One
domain is the super fast, super simple LZ coder like my LZH. Optimal parsing is important to LZH because the matches are so
important to compression, you really want to get the best matches you can. The other domain is something like LZMA or BALZ
(an ROLZ) that does a lot of context coding of literals. In those cases, the context coder for literals is so good that you
really want to only code good matches, and often the literals will be so cheap you want to send them as literals; also the cost
of literals and matches is highly variable for those coders.

Optimal parsing is a path search in a graph. For game developers you can think about A* as I talk about this. You start yourself
at the head of the file, and your goal is to walk to the end of the file in the shortest possible path. From each node in the
graph, your exit paths are the possible coding operations at that point in the file, eg. write a literal, write match X of length Y.
Each coding operation has a certain cost which is equal to the number of bits output if you do that coding operation, and then
you step along that link to the next point in the file as specified by the match length.

Note that with the original LZSS we only needed to consider either matching or not matching at each position in the file, since
any time you wrote a match the best thing to do was to write the maximum match length. With true code-optimal parsing, you have
to consider all possible match offsets and all possible match lengths at each location (!!). This is a huge number of coding
choices, but we can rule out almost all of them. (more on this later).

So, the brute force straightforward optimal parser should be obvious at this point. Start with an empty code stream and push position 0
on the stack. Repeat forever : pop the stack. Try all coding operations at the position specified in the stack. Push all of those
coding operations on the stack with the code stream they made, and the position of the end of the match. This works fine and
produces a 100% optimal parse, but is insanely slow. If you have N coding choices at each position it makes N^L operations for a
length L file.

With adaptive coding (eg. you're using an adaptive entropy coder for literals matches and offsets), you cannot get a 100% optimal
parse unless you do this full search, because it does not have the optimal subgraph property. That is if you have the optimal
path from A to C, and that path goes through B, that optimal path is not the same as taking he optimal path from A to B and the
optimal path from B to C. To speed things up we are going to start making approximations, and we are going to generally assume a
looser version of the optimal subgraph property. What this means in terms of LZ coding is that if you are at a certain position
in the file, the best way to code the whole file does not necessarily include the best way to code from {begin} to the current
position.

Let's revisit the original LZSS parse now. LZSS does no entropy coding, so it doesn't matter how you get to a given byte of the stream,
you will code the remainder in the same way. This is obviously a "Dynamic Programming" problem. You have shared sub-portions of the
problem that can be collapsed by storing their result in a table. The optimal path from point P to the end is the same regardless of
how you get to P. You could now proceed by doing the brute force approach and search depth-first, and each time you find a path, you
store all the sub-paths that you visited, and then you reuse them instead of recomputing them. Or you can do it a more elegant way which
is identical, which is to walk backwards through the file and at each position P store the optimal path from the P to the end. When you
find the optimal path at P obviously all the future subgraphs that you might need are already done because you're going backwards.
So, we now think of the LZSS optimal parse as a way of solving the graph search using dynamic programming.

The case of a static entropy coder is simpler so let me focus on that first. Offsets and lengths and literals are written in some # of
bits that is known in advance and we wish to make the shortest possible code stream. We can do a sort of LZSS backwards parse to
do this quickly. It's not as simple as LZSS because we cannot assume that the best match choice at each step is the longest one.

Optimal Parser for Static Statistics :

We're going to walk backwards like LZSS. At each position already considered (future positions in the file), we have already found the
optimal choice at that spot. We have various coding choices at the current position, either literal or various {offset,length} possibilities.
The total cost of each of those coices is : {Cost to code the current choice} + {Table cost at current pos + match length} , that is the
already computed cost to finish the walk after you step match length.

It should be obvious that we can solve this just by considering every coding operation possible at each position and walk backwards.
That gives us a total cost of O(N*L) , which is a lot better than O(N^L) , but it's still slow because N is very large. So let's start
making approximations.

What are all these coding choices that we have? We found various offsets that we matched at least min match length (usually 3). This
gives us a bunch of possible offsets and the maximum match length at each offset : {Off1,Len1} , {Off2,Len2}. In general we also have
to consider all lengths >= 3 at each offset.

The first approximation is the assumption that for any given match length, the lowest code cost comes from the smallest possible offset
that matches with that length. That is, we assume that if Off1 < Off2 , then cost(Off1) <= cost(Off2). This is certainly true for all
non-pathological files in the real world. This kind of assumption is also self-satisfying, that is by making this assumption, the optimal
parser will prefer shorter offsets, which makes them more likely, which makes their code length shorter. It's a positive feedback loop;
I'll talk more about feedback later.

With this approximation, we no longer need to consider all offsets and all match lengths. Now we only need to consider all match lengths,
and the offset for any length is chosen for us - it's the lowest offset that codes that length. Our code choices now look like :

Okay, so we could do this, but it's still too slow. We need to reduce our choices further.

One option is to only consider the longest possible length at each offset. That is, assume that if you code a length at a given offset, the best
choice is the longest length at that offset. So in our example you would only choose from {7 at Off1, 11 at Off2, 22 at Off3}.
That is sort of okay, but it actually gives up a lot of compression. It misses too many good
choices. Apparently there are some good coding paths that involve taking a shorter than maximal match.

To get to the really cool optimization, we need to stop and think about this intuitively. Why would you not write a maximal match? It should
almost always give us a shorter total code len, because writing a longer match is very cheap and lets us not code symbols after the match.
The cost of a given match is

Let's assume for now that the cost of the offset and the match length are seperable (they aren't used to predict each other). The cost of
each match length is an entropy code that generally codes shorter lengths in fewer bits. It's sub-linear in match length, but it is generally
increasing. However, it increases less than the average cost of future symbols. Note that for long match lengths, the cost of len and (len+1)
may be exactly equal depending on how you code your lengths. For example of you use a semi-log2 code, then the cost of many large match lengths
is identical. That means coding longer matches is free in terms of the current code cost.

The CostTable generally increases backwards linearly proportional to the average entropy. That is :

CostTable[ {end} ] = 0
CostTable[ end - 1 ] = cost to code form (end-1) to end
CostTable[ end - N ] = N * H
on average ; where H is the average entropy of one symbol

However, even though CostTable is linearly increasing backwards on average, it is not always. Consider for example what CostTable looks like
in the neighborhood of a very good very long match. Say at some position P this long match is available with length L. At position (P+1)
the best coding choice will be that same match with length (L-1), same at (P+2) etc. The cost table in this area will look like :

If L is large, like say 100 or more, the Cost(L) and the Cost(L-1) will be very very similar. In this region, CostTable[] is increasing as you
step backwards, but it's only increasing very slowly, far more slowly than the average slope of H. We see that CostTable goes backwards like N*H
in a line, but it deviates up and down across the line. Some areas you code worse than average entropy, some areas you code much better than average
entropy. The areas where you code better than average entropy are candidates for choosing a shorter than maximal match length!

What are we really doing as we step back through the file and choose the best match length? We have this constant table Cost(L) that tells us the
cost of coding each match length. This table is increasing, it's smallest at Cost(3) and largest at Cost(infinity). We find the maximum match length
at a given offset and the minimum at a given offset (the minimum is the smallest length that can't be coded from a lower offset). We take the Cost(L)
table in that range and add it onto CostTable() in that range, then pick the lowest value after the sum. Generally CostTable increases backwards
faster than Cost(L) increases forwards, but not always! As we saw in the example above with a long match, CostTable can increase backwards slower than
Cost(L) increases forwards.

So, how do we find this spots quickly? Well, when we are looking at coding at spot P, we have already visited all the CostTable spots > P. What
we'd like to know is, Is CostTable ahead of us increasing sublinearly? And if so, where is it MOST sublinear? That is, Where is CostTable[P] as far
below (end - P)*H ? That will tell us the most likely spot to code a less than maximal match.

We could build a binary search tree for these spots as we go backwards, which would be logN and all, but I used a simpler solution. At each spot P,
I store the best sublinear spot preceding that spot (preceding in file order, actually occuring later in the backwards walk). That is :

Normally when CostTable is increasing a lot as you step back, this does not get triggered at all. That is, CostTable[P+1] is usually a H smaller
than CostTable[P]. But in weird cases, the cost of coding this symbol was almost free, so BestSpotPreceding points back to us. BestSpotPreceding gets
updated in waves that terminate at cheap code spots. Normally you keep stepping back, then you hit a really cheap code spot and a wave percolates
forward, filling out BestSpotPreceding, then you step some more normally, then you hit a cheap code spot and it waves through again. Sort of carries in
an arithmetic coder. Anyway, this is technically still O(N*L), but it's a very cheap operation and it's way way less than N*L in practice.

So, how do we use this? Instead of considering only the maximum length at each offset ( {7 at Off1, 11 at Off2, 22 at Off3} in our previous example ),
we also consider matches at BestSpotPreceding[] in that offset. That is :

And those are all the matches we have to consider. I did this and found it hurt less than 0.001 bpb compared to the full optimal parse that considers all
possible coding operations at each position.

ADDENDUM #1 : I should mention there's another cool way to get this optimization if you don't want to do the BestSpotPreceding thing. We again want to
ramp up our intuition for the problem. The cases where we need to make a non-trivial decision about match lengths are cases when we are writing two
matches in a row, both greater than minimum length, and they overlap. That means there's a big range of ways we can code the same block. For example :

At pos P we find a match of length 7
At pos P+4 we find a match of length 10
The area from pos P to [P+14] can be coded with two matches, the ways are :
{4,10}
{5,9}
{6,8}
{7,7}

The area of overlap of the matches all codes to the same offset, it's the same offset that was found at P+4, but if you match past that at P, then
you just match less of it. The amount of overlap is the number of different coding choices we have.

We're coding the same two offsets and skipping the same number of symbols, so the only difference in code cost is the cost to code the match lengths.
The code costs of the choices are :

That is, code costs go up for larger lens, but they go up more slowly. It's like a sqrt(x) function - sharp at the beginning and then levels out.
That means the lowest total code cost is the one that's most skewed. You want to extend the longer match as much as possible because that's very cheap,
and you want to shorten the shorter match as much as possible because that saves you a lot. Intuitively you can think the difference between coding a
len of 3 and 4 is very big, but the difference between 511 and 512 is tiny tiny.

In our example that means the best choice is :
Cost(len 4) + Cost(len 10)

Note the similarity with the BestSpotPreceding heuristic. We try coding with our longest match len. Then we also try coding with the *shortest* match len
that still reaches the same match after we jump forward. We're trying the two most skew possibilities, our match as long as possible and our match as short
as possible, constrained to be getting the same match following. (matchoffset is already computed because we're going backwards).

The first one is that if we are doing a Huffman (or static arithmetic) coder to entropy code the length, offset & literal, then we can't actually know the
code length in advance. And the way we choose to do the coding will affect the statistics which will affect the code lengths, which affects how we choose
to do the coding, ad infinitum. It's a feedback loop.

As usual in these kind of optimization searches, we make the semi-static assumption. We make a good guess for the code lengths, find the optimal parse
for those code lengths, use that parse to guess new code lengths, optimal parse again, etc. Hopefully this smoothly converges. Well, it doesn't. The
problem is that the coding alphabet is actually pretty sparse. On medium size files there are not that many literals and there are plenty of offsets
and match lengths you might not code at all in one particular encoding choice. For example, in one pass you might not code any matches of length {31}
at all. Now when you do the next pass you need something reasonable in there. If you use the actual code lengths, you can strangely skew how you do
the coding, and it can actually ping pong a lot in the search and not settle down.

To improve this, I suggest smoothing the counts in the early passes. That is, the chance of a length 31 match should be somewhere between the chance of
a length 30 and length 32 match, regardless of the coincidence that we happened not to code with it in the last pass. To do this, I fit a laplacian
probability model to lengths and offsets, and in the early passes I blend that model in to smooth out the regularize the distribution. As I do more
passes I decrease the weight of that blended simple model and use the raw counts more.

In practice, you can avoid doing multiple passes of the actual optimal parser without hurting compression too much. You can do a first pass with just
a normal non optimal parser and build statistics from that. Smooth those statistics with a regularizing simple model. Then do the optimal parse with
those statistics, and you're done. Doing more passes will improve the coding decisions, though.

The other issue is that it's still a bit slow. Most of the slowness comes from the areas where very long matches are possible. If there's a long match
of length L at pos P, we also consider (L-1) at (P+1), and (L-2) at (P+2), etc., and for each of those we generally find many closer offsets to also
consider, roughly O(L) of them, which makes us O(L^2) over that region. On super degerate files like "pic" in the Calgary Corpus, this makes us very slow.
However, it's easy to fix. In this area with the long match, it really doesn't matter much what we do, because all of them are good choices. This is
a property that in areas of super high compression, the importance to the whole file is much less. You want to maximize your wins on the hard to compress
spots, because in terms of output bytes they are more important. So, in these big long match areas, we just cut out the whole area inside the long match,
and we always code the long match at the beginning. (actually I allow coding at P and P+1 up to P+K for a small K ; K is like 4 and L is like 100).

Optimal Parser for Adaptive Coders :

Now, with real modern coders you aren't doing static statistics, they're adapting as you go. You could just make a guess at the overall statistics and
use the semi-static optimal parser that I just described, and then code the stream with your adaptive coders. That's better than not optimal parsing at
all, but it's giving up a lot. Adaptive coders get a big win by changing their statistics as they scan through the file, and an optimal parser can make
much better decisions if it uses the actual local statistics. For example, in one part of the file, literals might code very well, and the optimal parser
should favor literals more, in another part of the file there might be lots of very short match lengths, and the optimal parser should code for that, etc.

One way to solve this is to use a different kind of semi-static approximation and optimal parse forward in blocks. For each block, you pretend your
coder are static and use some fixed code lengths. Now, run the semi-static optimal parser that I described above to find the optimal coding for the next
1000 bytes or so. Now step up through that coding and output codes with your adaptive entropy coder. Only output 900 bytes or so, don't go all the way
to the end of the block so that you avoid edge effects. Now hold use the current statistics to find the optimal parse for the next 1000 bytes.
This in fact works very well, and I believe was first used by RKive many years ago. He called it "dynamic programming" which is sort of right in the sense
that all backward optimal parses are a type of dynamic programming.

There's another form of adaptive optimal parse that a lot of people do now using forward walking. It's called "flexible parsing", and
I believe a form of this is used in LZMA (add: no, that's not LZMA).

The simplest form of "flexible parse" is just the old "lazy evaluation" of Zip. Lazy matching walks through the file and considers either writing {match}
or {literal + match at next pos}. It's a slightly non-greedy parser. Flexible parsing is basically the same thing, but we consider a few different choices.

To do this, we run through the whole file first and find the best match at each position. (you don't actually have to do this in practice, you could just
look ahead some amount N from the current position to avoid multiple passes). In any cases, the match search at each pos is only done once, and we have them
ready far enough in the future to do the flexible parse.

So, we consider a few coding choices at the current location. We choose the one that gives the lowest code cost to the end of the file. To make good decisions,
it needs to also consider the coding after you take the current operation. That is, it's like a recursive function :

now of course if we actually did this full recursion we would be doing the whole O(N^L) tree like we didn't want to do. But we don't need to do that because
we're only using this to make the decision at the current byte. That is, we can use the "short horizon" assumption - coding choices far in the future don't
matter that much to our current coding choice. It's like only considering a few moves ahead in a Chess program (which is a bad thing
to do there, but a good thing here). In fact, you only need to look 2 moves ahead here to make good coding choices.

Also, you don't need to consider very many N choices. N = 4 or 8 is plenty. Certainly consider coding a literal and coding the longest possible match, and
then heuristically pick some other likely good choices in between (such as the longest matches available with some shorter offsets, or the longest matches
possible from previous-match-references if you have those).

How do we terminate the recursion? After 2 steps (considering one op and then one more op after that) we just need an estimate of the cost of coding the rest
of the file. That's easy, it's just :

CodeCostEstimate( P ) = ( {end} - P ) * H

That is, just assume the remainder is coded at the entropy. The entropy you use here should be a conservative overestimate of the entropy; for example just
use the entropy of what you've done so far and add a fudge factor. This ensures that you actually prefer writing matches and don't do something strange if
you start the file in a very low entropy region.

This flexible parse sounds kind of expensive, but in fact with N = 4, it's only 16 table lookups to compute all the code lengths, then we pick the best one.
This of course is not a true optimal parse, but it's pretty damn good, and the fact that it's using the true current adaptive statistics
makes up for the fact that it's not doing the best possible parse.

ADDENDUM : I should note that this kind of forward parse is also better at making good decisions with PMR's (Previous Match References). PMR's are
cheap to code offsets, so they should always be considered in the optimal parse when they are possible matches. Going backwards you can't make
good decisions about trying to use PMR's because you haven't done the earlier parts of the file yet.

This has changed the parentheses of the multiplication, which lets us do math in 32 bits and also saves a divide. It's inaccuracte, we would
waste some range, so we check for being at the end of the range and put the wasted space on there :

Putting the waste on the last symbol is not great. To make the most of it, you should really reorganize your alphabet so that the most
probable symbol is the last one. Symtot is usually around (1<<14) which means the waste is on average (1<<13). Range is between (1<<24)
and (1<<32) so the fractional waste is between 2^-11 and 2^-19. This appears to come out to about a waste of 1 byte every 4096 on text with
the alphabet rearranged to put the space (0x20) at the end.

I should note that many many people have arithcoders that do this check to put the wasted range on the last symbol, but they *don't* do the work to make
sure the last symbol is the MPS (the most probable symbol). If you don't do that, it's rather pointless. With english text for example you might be putting the
extra range on the symbol "255" which basically never happens, so you are doing work for nothing.

Sean asked a good question that had me a little confused yesterday. Why do some of the renormalization updates shift in 1's to "high" and
some don't? You might see :

I thought maybe there was some subtle math reason why you would choose not to put in the 255. Nope. It's just another approximation + optimization.
Shifting in the 1 bits to range is the more accurate correct way to do it, but at that point range is already in 2^24 to 2^32, so adding on 255 doesn't
amount to much. In practice on my real file tests it doesn't even save 1 byte. I guess it should save something like 1 every 2^20. So, most people
just leave it out to save the 1 instruction.

10/07/2008

10-07-08

Frédéric Kayser sent me the link to this
FastAC page with a bunch of papers and source code by Amir Said.
It's pretty good stuff, I thought I'd give a synposis of what's in there :

The coder is almost a vanilla Michael Schindler style "range coder". There are a few little tweaks in there that differ from
the standard implementation.

1. Instead of a "virtual queue" for the bytes that might carry, he goes ahead and just writes them out, and then if a carry is
detected he goes back through the output bytes and increments them while they're 0xFF. This violates one of the rules of standard arithmetic coder
design, which is that you never touch output bytes after you stream them out, so that they can be transmitted or written to disk
or whatever. Now of course if you're just outputting to a big buffer in memory his way is totally fine, and it appears to be faster
than bothering with all the variables to account for the virtual queue.

2. He actually uses the full 32 bits for range instead of 31 bits like in the standard Schindler. We use 31 bits so that we can see
the carry in the top bit. Said has a neat little trick for detecting the carry :

(This maybe could be even faster if there were an elegant way to directly get the carry flag in C). Using the full 32 bits is nice because it
makes you byte aligned and eliminates some little ugly bit junk you have to do with shifting bytes 7 and 1 to get that last bit right.
This lets you avoid a LOT of shift and mask work which is a big win.

Now, as for the models, his "Adapative_Data_Model" is 100% exactly my Deferred Summation model from DCC96. You can read about
the original DefSum in my
papers section or get the ancient code in the
statistical coders section. From what I can tell there's nothing new or
interesting in the modeling, it's 100% just deferred summation to get you power of two CumProbMax so you can turn your division into
a shift, he uses my fast decode table, etc.

Anyway, the implementation is pretty good and the source code is pretty clean, so it's a decent place to start. However, I don't
recommend actually using it as is. The models are terrible and the API interface is very bad.

The reason why "Deferred Summation" never amounted to much is that arithmetic coding a big hunk of symbols with order-0 modeling is kind of pointless. You just never actually want to do that. If you're doing arithmetic coding, you're also doing higher order
modeling or LZ coding, or something, so that you don't just have static statistics. Providing API's that are locked to these
silly models is lame. The correct API interfaces for an arithmetic coder are the ones I have in "arithc" , just

The binary models are also silly. He does deferred summation for the bit model as well, which is not necessary. You can either do
the implicitly rescaling fixed-tot bit model that I just wrote about recently, or you can use a rung-ladder model like I have in crblib
which uses a table to find power of 2 probabilities from a state.

Said's codec is in fact very fast. It's almost 10% faster that any arithmetic coder I've ever seen. That's pretty damn good. I'm
going to modify it a bit to provide what I think are reasonable API's and post it.

Okay here it is :
cb_Arithmetic_Codec.zip ; a tiny .h with a modified version of FastAC.
This is pretty rough, I tried to modify the original FastAC code as little as possible.

Good article on Windows file caching . The Windows
file cache architecture is pretty sweet really. Files are cached in 256k blocks. The caching is also file "buffering".
That is, file IO in the application doesn't get any buffer just for that app - you get access to the cache pages, that is
your buffer. It means multiple apps on the same file share buffers. This is all pretty sweet. The caching is just a file
mapping of pages in the system address space. When you create a memory mapped file, you are just getting direct access to
those pages, which you can then lock to get pointers in your address space. You can lock & unlock the pages just like any
pages using VirtualAlloc and VirtualFree. When you write to pages, you can control them getting flushed out the write cache
by unlocking them.

Windows file caching is *extremely* greedy. It will basically throw away anything it has cached in order to cache the latest
request. This makes it look very good in benchmarks, but can be pretty bad in practice. For example, I was testing by
doing linear streaming IO across a 2 GB file. Windows caches every single byte of the 2 GB file so that future touches to it
are instant; in order to do that, it will evict the baby and the bathwater from the cache (including stuff like the cache of
the NTFS MFT table).

One thing I haven't been able to figure out is a way to manually control the prefetching that windows does. When you do a bunch
of ReadFile calls in a row, the cache will asynchronously prefetch ahead the next 256k chunk. This means that even if you just
call the synchronous ReadFile (or fread) to get bits of the file, the return is almost immediate. What I would like, however, is
a way to manually say "prefetch at this byte". The only way I can do that at the moment is to issue a read at that byte on a thread,
but that's not really ideal, it would be better to be able to signal the underlying cache directly.

Also it appears that if you use memory mapped files, windows stops doing the smart async prefetch ahead. Say you memory map a huge file,
then you run through streaming touching pages in order. Windows uses the virtual page fault from the memory manager to interrupt
and pull in pages from disk. It seems to not prefetch because you get a lot of faults and your app just stalls when you get to pages that
aren't in memory yet. Memory mapped files are pretty
awesome, it's cool that you just get access to the cache, but without the ability to request prefetches they're pretty worthless since the
plain old ReadFile beats them so badly.
(if you are in fact doing totally random file accesses on a huge file, then they are just about perfect).

Stdio in MSVC has a 4k buffer and then calls straight to ReadFile to fill it. The 4k buffer isn't really providing you any
"buffering" it just hides function call overhead and makes the reads page aligned; the real buffering is inside ReadFile.

If you're linking with the multithreaded libs, stdio becomes outrageously slow. Personally I like my threading model to work
such that individual objects, such as a "FILE" are not protected from concurrent use by the library. That is, once the API hands out an atomic
object to the app, it is up to the app if it wants to protect that object from multiple threads accessing it or not.
The library should only
do protection from access to shared/global data structures. (I know they can't do this with stdio because they want to make sure it's safe,
and they want to protect stdout/stdin/stdierr). Anyway, the result is that "getc" goes from being around 10 clocks
to being around 170 clocks. Fortunately, you can fix that pretty easily by just bringing back the macro yourself. Just
define this :

And use it to do your IO when you know the FILE is only being used by one thread.

Something I just discovered today : while Windows buffering is in fact very good for reads, it sucks for writes. If I write out a stream of
data through a buffered file, I only get around 25 MB/sec. If I write out chunks with async non-buffered calls, I get 100 MB/sec. (conversely,
buffered streaming input is actually faster than doing a bunch of async non-buffered reads).

10/06/2008

10-06-08

Followup on the "Russian Range Coder". So, first of all let me stress once again that this thing is pointless and you should not use it.
It's an "optimization" that gives up some compression and doesn't actually make you faster at all. You should never do approximations or
optimizations unless they actually help, even if they look like they should be faster. Best to stick with full quality everywhere lest it
come back to bite you.

But I was thinking about it and I realized you can improve the renorm loop. The original loop in the case of underflow just truncates the
range down to the bottom part to make the top bits always equal to the top bits of "low". But that can be really bad in some cases, since you might
be straddling a boundary, but most of your range could be above the straddle, not below. It's easy enough to just check whether you have
more range above or below the straddle :

That is, we either push the top bits of low up to the top bits of hi, or push the top bits of hi down to the
top bits of low, depending on which gives us more code space. Again, this doesn't hurt speed at all because
it's rare. In fact, that's why you should just use a proper virtual-queue range coder that handles underflow
right.

Just to make it super clear what's happening, here is an actual example :

Here "radArith" is a normal Michael Schindler range coder with virtual queue. My modified "russian range coder"
has about half the waste of the original (compared to the canonical range coder, not compared to entropy).
Note that the arithmetic coders here are all using a max cumulative frequency of 16384, while the "true entropy"
uses the exact character counts in the file.

BTW these are the links to the authors of the "Russian Range Coder" - they seem to all be dead. If anyone has live links
to these people let me know and I'll hook them up.

This small rearrangement of the math does in fact make a huge difference, but dude look,
it's literally just moving where you put the parentheses.

Of course it means you save a divide, and it means range can go up to 31 bits since you
never multiply by it. Getting range up to 31 bits means you can afford to shuffle out
whole bytes at a time. In CACM the multiplicataion has to fit in 32 bits so your range can only be
16 bits so outputting bytes is a no go; range stays between 15 and 16 bits all the time in
CACM while a "range coder" can go down to 23 bits; when it's down at 23 bits you only have 9
bits of coding precision because of the approximation of doing the divide before the multiply.

I'm sick of all these web pages that are like "LZMA uses a statistical algorithm called range coding".
No, it uses a fucking entropy coder called arithmetic coding.

Okay, this was just an angry rant, but it reminds me of something interesting. I recently stumbled on
some new arithmetic coders and they blew my mind for a minute until I figured out how they work.
(BTW For reference, the canonical "range coder" is by Michael
Schindler )

(BTW #2 it's sort of curious to me historically that I missed the range coder; I was so close to inventing it myself,
but somehow got right next to it and didn't see it. All the pieces of the range coder were thought of previously
by various people, including the original CACM guys, such as trying to use 32 bits, shifting out bytes, approximating
the division, and the virtual queue, but on their own each of those ideas doesn't quite work or make a big difference,
nobody put those ideas together and realized that together they make this "range coder" thing that's a big improvement).

The new arithmetic coders I've discovered are the fpaq0p binary range coder and "rc" the "russian range coder".
FPAQ0p is by Ilia Muraviev, who's done lots of good stuff; you can find it on
Matt Mahoney's PAQ page . The "russian range coder" used to have
a web page but it appears to have disappeared. I found the code in MRP
(Minimum Rate Predictors, a decent lossless image coder).

The "fpaq0p" arithmetic encoder is brutally simple. It's what you would write if you just
heard of arithmetic encoding; in fact it was my first idea for a coder ever, but it didn't work, so everybody
dropped it. Well, it's back. If you remember the super basic idea of an arithmetic coder is you're specifying
a code stream by writing an infinite precision number; at any time you know your desired number is in some
interval, low and high, you shrink the interval down to specify the symbol, and to do it in finite precision you shift
out top bits of low & high as they become the same. (See my ancient
write up of basic arithcoding )

Okay, good job, but this is badly broken. First of all, you're
overflowing your 32 bit code. Second of all, it doesn't handle carry
propagation, and it doesn't handle underflow. The problem with carries is
that range can step up to the next bit and push a bit over code size.
The problem with underflow is that we aren't handling the case of the range
getting very very small but while the top bytes are different.

In particular, when range_high comes down close to range_low from above and range_low comes up
close to range_high from below, but the top bytes are different,
eg, they're like 0x800000AA and 0x7FFFFFBB , the top bytes
are different so they don't get shifted out, but the range can get arbitrarily small.
(in particular the top byte of "low" is X, the top byte of "hi" is X+1, the second byte
of low is 0xFF and the second byte of hi is 0x00).

But fpaq0p works, so how does it do it?

The main trick is to only do binary coding. Then you just don't worry about underflow. With
the underflow problem, your range can get as small as only 1 , with 0x01000000 and 0x00FFFFFF, but you
implicitly add 1 to the top, so you still have enough range to code 2 symbols, which is what you need
for your binary alphabet to be decodable. Any alphabet larger than binary is not codable there.

The other minor trick is just about where you do your "+1". The high is always implicitly +1 above
what you actually store in range_high. We started it with 0xFFFFFFFF but we want an implicit
FFFFF... repeating to infinity after it. That's why we shift in FF, but we also have to do it right
in the range update in a way that doesn't blow the 32 bits or ever generate a carry.

Which looks like it's doing the same trivial thing, but is subtly different.

Note that in the horrible underflow case your coding precision can get as bad as only 1 bit (that is,
the probabilities are forced to 50/50 regardless of what they really are). Obviously that hurts
coding, but it's actually not that
bad, because it's statistically incredibly rare. In fact it happens something like 2^-24 of the time,
since all spots on the code range are equally likely. And even when it does happen you only lose a
few bits of coding efficiency.

but in practice it doesn't really matter because you're stalled out on branches anyway.

Also in a binary arithmetic coder you never actually divide by symtot. You use symtot = some fixed power of 2,
and you do a probability update that keeps tot the same. My code's got this old rung/ladder thing that does
different styles of adaptation with a power of 2 total. Or you can just do the super simple thing that everybody
seems to do these days :

Here "updshift" controls how fast your adaptation is vs. how precise you become in the long term. This update
makes no distinction between the initial rollup when you have no stats and the adapted state, which the rung/ladder
does. This is an approximation of a normal n0/n1 counter with an implicit max count of (1 << updshift).
Compare to an n0/n1 update :

BTW this approximation goes back to the ancient stuff by Howard & Vitter on fast arithmetic coding. Of course in practice you
also want to merge your modelling with arithmetic coding since they do the same random branch on bit.

Most people these days are good at designing their compressor to drive the binary arithmetic coder well. A well driven
binary arithmetic coder is generally coding probabilities close to 50%. That's better because it keeps precision
high and keeps you in the accurate part of the coarse probability update approximation. It's also better because
it means you're calling the coder many fewer times. If you only call the arithmetic coder with probablities
close to 50% it means you're only doing coding roughly as many times as the number of bits you output, which is low.
You accomplish this by coding symbols that are "natural" for the problem. For example, for LZ77 offsets you
code in the log2 of the offset (similarly with BWT MTF codes). etc. etc.

So I also found this range coder implementation called the "Russian Range Coder". It's extremely similar to
the standard Michael Schindler coder except the renorm & bit output loop is slightly different :

So what's going on here? The first part we can recognize is exactly the simple loop from fpaq0p that just outputs
the top byte when it's the same. But if you just do that you're fucked for multisymbol alphabets because of
underflow, so looky here, the second part handles that. It checks if the range is getting too small to resolve
symbols. But normally in the virtual queue / schindler coder that would mean setting up a carry queue and such.
This coder avoids that. It does it by just giving up a huge chunk of the range.

The bad case occurs when the range is small and the top bytes are off by 1. The second bytes are 0x00 and 0xFF.
With a virtual queue, you would put the top byte of "low" in the virtual queue spot, and count the number of 0xFF
bytes but don't write them yet. If you get a carry, you propagate the carry by writing the top of low+1, then
implicitly flipping the 0xFF's
to 0x00's and writing them. This is the virtual Q / schindler way.

Here we don't check for underflow and do the carry and such, we just smash down the "high" to the "low". We know
the top bytes are not equal, so the next bytes must be 00 and FF, eg. we have something like 0x0B00XXXX and 0x0AFFXXXX , we just
smash the high down to 0x0AFFFFFF. This bit :

which I think makes it clearer what's happening. You just chop down to make your top bytes the same, eliminating the bad
straddle situation which caused underflow. Now you can renormalize the simple way by shifting out equal top bytes and get
a big enough range back.

Now obviously this is an approximation that loses some efficiency, you're throwing away on average half your range in the cases
that you have this bad underflow, but again it's pretty rare. I don't know the exact probabilities, but the measured
efficiency is very close to the full coder.

In practice, however, this coder is not any faster than the full correct schindler coder, so there's no point to using
it. It is however, very simple and elegant once you see how it works.

ps. the other new thing is the Jarek Duda lattice coder thing which I have yet to understand. It seems to offer no advantages and
many disadvantages, but I would like to grok it for my own brain's happiness.

10/01/2008

So I looked into LZMA / 7-Zip a bit. So far as I know there's no real description of the algorithm available,
but there is source code here . This is from just browsing the source
code for a bit, so take it with a bit of skepticism.

LZMA is an arithmetic coded LZ + context coder. It has a couple of clever ideas that make it perform very well.
In fact it compresses better than Quantum and is also faster (at compression - Quantum is faster at decompression,
in fact the only disadvantage of LZMA is that its decomp is not super fast).

The match finder is very fast. I can't figure out what it's doing exactly. It makes strong hashes using the CRC.
The docs claim it then goes to a patricia trie, but my god I can't see that in the code.

It does lots of binary arithcoding. In fact I don't think it has a multisymbol arithcoder at all, it just breaks
larger alphabets into a series of binary codes. He does a pretty good job of breaking up the binary tree so that
the common cases of short lengths and low offsets get written in very few binary coding ops (that is, the coding
choices of 0 or 1 are close to equal probability).

There are some clever models. In general it's well tweaked to use few contexts but capture the salient information
in those contexts. For example, the order-1 literal model only uses the top N bits of the previous char as context, since
for most data the top bits are the good predictors and the bottom bits are semi-random.
One clever model is using the bottom 2 bits of position as part of the literal context. This is
great for capturing the very common pattern on binary data of a 0 every fourth byte, or any every-2 or every-4 bytes
kind of patterns (such as when you write a bunch of 32 bit floats and the exponent part is the same on all of them).

It also does a semi-optimal parse. From what I can tell it looks like it finds the match possibilities N steps ahead,
and then does a kind of semi-backward cost evaluation from the end of that sequence to pick the best choice at the
current coding spot. I believe N is only 4, so this is sort of like the old "lazy matching" but it looks ahead 4, and rather
than a rough heuristic like most people use in lazy matching he actually evaluates a coding cost like a true
optimal parse.

On text it's very good, but on binary it really kicks ass, I'm guessing the biggest win is because of the pos-conditioned contexts which are
really good for binary.