9/16/2010

09-16-10 - Modern Arithmetic Coding from 1987

I was just reading through my old paper "New Techniques in Context Modeling and Arithmetic Encoding"
because I wanted to read what I'd written about PPMCB (since I'm playing with similar things now and PPMCB is very similar to the way modern context
mixers do this - lossy hashing and all that). (thank god I wrote some things down or I would have lost everything; I only wish I'd written more,
so many little heuristics and experimental knowledge has been lost).

Anyway, in there I found this little nugget that I had completely forggoten about. I wrote :

Step 3 uses a method introduced by F. Rubin [3], and popularized by G.V. Cormack�s DMC
implementation [1]. It is:
while ( R < 256 )
{
output(L>>16);
R <<= 8;
L = (L << 8) & 0xFFFFFF;
if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
}
This method is approximate, and therefore produces slightly longer output than the canonical
CACM-87 style normalization [5]. However, the output is byte-aligned, and the normalization loop is
almost always only performed once; these properties make this method extremely fast.
The line:
if ( (L + R) > 0xFFFFFF ) R = 0xFFFFFF - L;
is where the extra output comes from. It can be seen that R is shrunk to less than it need be, so
extra normalizations are needed, and more bytes are output. However, this only happens 1/65536th of
the time, so the excess output is negligible.

which is a little uglier than my version above, but equivalent (he has a lot of ugly +1/-1 stuff cuz he didn't get that quite right).

And actually in the Data Compression Using Dynamic Markov Modelling paper
that goes with the dmc.c code,
they describe the arithmetic coder due to Guazzo and it is in fact an "fpaq0p" style carryless arithmetic coder (it avoids carries
by letting range get very small and only works on binary alphabets). I don't have the original paper due to Guazzo so I can't confirm that
attribution. The one in the paper does bit by bit output, but as I've stressed before that doesn't characterize the coder,
and Cormack then did bytewise output in the implementation.

Anyway, I had completely forgotten about this stuff, and it changes my previous attribution of byte-wise arithmetic coding to Schindler '98 ;
apparently I knew about it in '95 and Cormack did it in '87. (The only difference between the Schindler '98 coder and the NTiCM '95 coder is
that I was still doing (Sl*L)/R while Schindler moved to Sl*(L/R) which is a
small but significant change).