9/27/2008

On LZ and ACB

LZ77 coders are hugely inefficient. Everybody knows the basic idea of LZ77. You compress data using back references to previous data. You
either write a match, consisting of an offset & length, or a literal, consisting of one raw character.
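To make the reference point concrete, here's a minimal decoder sketch for that kind of token stream (the token format here is invented for illustration, not any particular coder's):

```python
# Minimal LZ77 decoder sketch. The token format is hypothetical:
# ('lit', byte) or ('match', offset, length), with offset counted
# backwards from the current end of the output.

def lz77_decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, offset, length = tok
            for _ in range(length):
                # copy byte-by-byte so overlapping matches
                # (length > offset) still work
                out.append(out[-offset])
    return bytes(out)

print(lz77_decode([('lit', ord('a')), ('lit', ord('b')), ('match', 2, 4)]))
# b'ababab'
```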

The inefficiencies are intuitively obvious. If the coding were perfect, then if you were the decoder, the encoded stream should look totally
random; you shouldn't be able to predict the code stream, but of course you can. One way is with context modeling. If you see a context
like "what the ", then it's highly likely the next sequence is "fuck", so offsets which point to that are far more likely. Similarly,
certain offsets point at certain words, and there are natural likely match lengths for those as well.

Someday soon I'm going to write about how I'm doing "Optimal Parsing". But the very fact that you can do optimal parsing
is a sign of the flaw of LZ77. The encoder has many choices of different ways to write the code stream, which means there
is massive redundancy in the code space. There are many different LZ77 code streams which all correspond to the same
input. The optimal parser tries to pick the best one of these, but it doesn't change the fact that all the ways to encode
that you did not choose are just wasted probability space in the coded bits.
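A toy illustration of that redundancy (token format invented for the example): here are three different parses that all decode to the same string, so whatever probability the coder reserved for the two we didn't send is wasted.

```python
# Three different LZ77 parses of the same string. Any probability the
# coder reserves for the parses we didn't pick is wasted code space.
# Tokens are hypothetical: ('lit', char) or ('match', offset, length).

def decode(tokens):
    out = []
    for t in tokens:
        if t[0] == 'lit':
            out.append(t[1])
        else:
            for _ in range(t[2]):
                out.append(out[-t[1]])
    return ''.join(out)

parse_a = [('lit', c) for c in 'abababab']                      # all literals
parse_b = [('lit', 'a'), ('lit', 'b'), ('match', 2, 6)]         # one long match
parse_c = [('lit', 'a'), ('lit', 'b'), ('match', 2, 2), ('match', 2, 4)]

assert decode(parse_a) == decode(parse_b) == decode(parse_c) == 'abababab'
```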

Let's look at how you might reduce the redundancy in LZ77 if you for some reason had to do LZ coding and wanted to go as far
with it as you can :

First of all, whether or not a match will be found is highly predictable. Most high compression LZ coders do a lot of context modeling
for the match flag, because you can tell by looking at the local neighborhood whether it's an area that's likely to match. There are also
finite-state patterns that arise in many types of data (particularly binary data), where it may be very likely that the data goes
match-literal-match-literal , or mat-lit-lit-mat-lit-lit-mat , or whatever. Apparently LZMA does this.
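A minimal sketch of that kind of match-flag modeling, assuming a simple count-based binary model conditioned on the last two flags (this is illustrative only, not LZMA's actual scheme):

```python
# Condition an adaptive binary probability on the last two match/literal
# decisions, so repeating patterns like mat-lit-lit become cheap to code.
# The counter scheme and update rule here are made up for illustration.

from math import log2

class MatchFlagModel:
    def __init__(self):
        # one (n_literal, n_match) counter pair per 2-bit flag history
        self.counts = {h: [1, 1] for h in range(4)}
        self.history = 0

    def cost_and_update(self, flag):
        n = self.counts[self.history]
        p1 = n[1] / (n[0] + n[1])            # P(match | recent history)
        bits = -log2(p1 if flag else 1 - p1)  # ideal code length
        n[flag] += 1
        self.history = ((self.history << 1) | flag) & 3
        return bits

m = MatchFlagModel()
# a repeating mat-lit-lit pattern quickly drops well under 1 bit/flag
total = sum(m.cost_and_update(f) for f in [1, 0, 0] * 4)
```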

Modeling the redundancy in the offset and the length is more interesting and difficult. The first thing we see is that they are not
really two separate parameters, they are highly correlated, but not in a trivial way. For example, say we simply find the longest possible
match and send the {offset,length} in the normal LZ77 way. There were very similar shorter matches that we passed up. When we send just
the {offset} , the decoder knows what the match string we want looks like, and it knows that we passed up all the other matches, so this
offset must be the longest one. That means it can mostly infer the match length. The decoder gets the offset, which specifies the match
string; then the decoder finds the other string in the window that has the longest prefix in common with the match string. The actual match
must be at least that prefix length + 1. You can then either just output that match, or send the amount by which the actual length exceeds
that minimum.
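A sketch of the decoder-side inference, under the assumption that the encoder always sent the longest match (the helper name is made up):

```python
# Decoder-side sketch: given only the offset, infer the minimum match
# length. Since the encoder sent the longest match, the chosen string
# must beat every other candidate in the window, so the length is at
# least one more than the best prefix any other position shares with it.

def implied_min_length(window, offset):
    match_pos = len(window) - offset
    best = 0
    for cand in range(len(window)):
        if cand == match_pos:
            continue
        n = 0
        while (match_pos + n < len(window) and cand + n < len(window)
               and window[match_pos + n] == window[cand + n]):
            n += 1
        best = max(best, n)
    return best + 1

# in window b"aaab", offset 4 points at "aaab"; the best other prefix
# ("aa" at position 1) has length 2, so the match must be >= 3 long
print(implied_min_length(b"aaab", 4))  # 3
```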

Now, we're still sending the offsets raw, but of course the offset is highly predictable too. Given the current context (or whatever
type of model you want to use), you're much more likely to match certain offsets than others. You shouldn't just code the offset as
a backward index into the window, you should assign a probability to each. The "ROLZ" and "PMR" LZ coders of today do this to some
extent.

(aside : ROLZ = reduced offset LZ; basically means only code matches that share some amount of context, perhaps also MTF the offsets within
the contexts; you code many fewer matches, but that's a good thing because ROLZ coders generally use a strong context model for literals;
in fact if you use a powerful PPM for literals then the fewer matches you code the better your compression is. PMR = previous match
reference, just codes references to previous matches in fewer bits because they are more likely, encoder also favors choosing PMR's over
other offsets).
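A rough sketch of the reduced-offset idea, keeping an MTF list of recent positions per 2-byte context (the class and method names, the context width, and the list size are all made up for illustration):

```python
# ROLZ-style reduced offsets: per 2-byte context, keep a short
# move-to-front list of recent positions; a match is coded as a small
# index into the current context's list instead of a raw window offset.

from collections import defaultdict, deque

class RolzOffsets:
    def __init__(self, slots=8):
        self.lists = defaultdict(deque)  # context -> positions, MTF order
        self.slots = slots

    def index_of(self, data, pos, target):
        """Index to code for a match from pos back to target, or None if
        target isn't in the reduced list (then it can't be coded)."""
        ctx = bytes(data[pos - 2:pos])
        lst = self.lists[ctx]
        if target not in lst:
            return None
        i = lst.index(target)
        lst.remove(target)
        lst.appendleft(target)  # move-to-front: cheaper next time
        return i

    def register(self, data, pos):
        """Record pos as a future match candidate under its context."""
        ctx = bytes(data[pos - 2:pos])
        lst = self.lists[ctx]
        if pos not in lst:
            lst.appendleft(pos)
            while len(lst) > self.slots:
                lst.pop()

r = RolzOffsets()
data = b"the cat and the dog"
for p in range(2, 14):
    r.register(data, p)
# position 2 is the only prior position whose context is "th",
# so a match back to it codes as index 0 in a one-entry list
print(r.index_of(data, 14, 2))  # 0
```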

We can code an arbitrary offset with some modeling by reordering them. Make offset 0 be the most likely, up to offset N the least likely.
One way to do this is to order by context match length. Say the longest context match is L symbols, then all the offsets that match with
a context of length L are first, then all the offsets that match at length (L-1), down to length (0). Within each context, order so that
more recently seen & used offsets are lower. So to code an offset, you still find the longest match, then you see where that offset is
in the list ordered by likelihood, and output that index. (alternatively you could just context code the offset in the normal context way).
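A sketch of that ordering, assuming we rank every candidate position by how long its preceding context matches ours, breaking ties by recency (the function name and context cap are invented):

```python
# Rank candidate match positions: longest context match first, and
# within equal context lengths the most recent position first. The
# coded offset is the chosen position's index in this ranking, which
# should usually be near zero.

def ranked_offsets(data, pos, max_ctx=8):
    def ctx_len(c):
        n = 0
        while (n < max_ctx and c - 1 - n >= 0
               and data[pos - 1 - n] == data[c - 1 - n]):
            n += 1
        return n
    return sorted(range(pos), key=lambda c: (-ctx_len(c), -c))

# in b"abab" at pos 4, position 2 shares the context "ab", so it ranks
# first; the rest tie at context length 0 and are ordered by recency
print(ranked_offsets(b"abab", 4))  # [2, 3, 1, 0]
```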

This is very closely related to LZP, Block Sorting, and Fenwick's Symbol Ranking stuff.

We can do something similar in a slightly different way by sorting. Consider all previous possible match spots. Sort them by their context
string, backwards, like for PPM. Now to match your current string, find the longest match with your forward string. This is the offset you
want to send. To send it, find your current context in the sorted list - you'll find the longest match (actually you'll be between two
strings in the list). Send the difference between your position in the list and the position of the longest match. Hopefully this delta is
usually small, because the best match usually comes from a context similar to yours. The match length works like before: the offset
implicitly sends a minimum length, and you send the additional length beyond that.
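A sketch of that sorted-context scheme (a crude stand-in for what ACB does; the names and the tiny context depth are just for illustration):

```python
# Sort previous positions by their reversed context, PPM/BWT style.
# The "offset" is then the delta in sorted order between where our own
# context would insert and the entry with the best forward match.

import bisect

def rev_ctx(data, pos, depth=2):
    """Context before pos, reversed so similar contexts sort together."""
    return data[max(0, pos - depth):pos][::-1]

def fwd_len(data, pos, cand):
    """Match length between the strings starting at pos and cand."""
    n = 0
    while (pos + n < len(data) and cand + n < len(data)
           and data[pos + n] == data[cand + n]):
        n += 1
    return n

def acb_delta(data, pos, depth=2):
    order = sorted(range(pos), key=lambda p: rev_ctx(data, p, depth))
    keys = [rev_ctx(data, p, depth) for p in order]
    here = bisect.bisect_left(keys, rev_ctx(data, pos, depth))
    best = max(range(len(order)), key=lambda i: fwd_len(data, pos, order[i]))
    return best - here

# the best match for "ab" at pos 6 sorts right next to our own context,
# because both are preceded by an 'x', so the delta is tiny
print(acb_delta(b"xaby_xab", 6))  # -1
```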

Obviously our goal has been to make the offset and match length both very close to zero. I mentioned block sort and
symbol ranking before; the statistics of the codes we're generating here should look very much like those statistics.

ACB (Buyanovsky's "Associative Coder") was never given much attention for a few reasons. 1) George's English is awful and he could never
write a decent paper on it, so it never got into academia; his terminology is all non-standard and very confusing.
2) It's way the hell off the optimal curve in terms of speed vs. compression: it's
slower than PPM and can't compress as well. It is, however, interesting to me in a theoretical sense, because it's about
as good as you can do with dictionary compression, and as we've seen it's extremely similar to LZP, block sorting,
and symbol ranking. In fact all of these coders wind up sucking in the same way: by simply turning your
correlation knowledge into an index with low mean, you're giving away a lot of good local information that a plain context modeller
could do more with.