Truffman Coding

1.0 Introduction

This page describes an invention in the craft of statistical bit coding.
We assume fixed (static) probability estimation and ignore
any context conditioning or other data modeling.
We consider only "simple" codes (in a sense we won't define rigorously)
and assume no token occurs with overwhelming (greater than about 70%)
probability.
Also, to keep the exposition simple we describe a conceptual decoder
which navigates a decoding tree step-by-step.
A practical decoder might be table-driven for faster performance.
For these various reasons, the page is not a comprehensive
tutorial on the application of this invention to data compression.

A statistical bit-code maps decision tokens to variable-length
bit strings with the intent of compression.
For example

Table 1

  Token   Probability (case 1)   Probability (case 2)   Compressed bit-string
  -----   --------------------   --------------------   ---------------------
  A       0.50                   0.36                   0
  B       0.25                   0.36                   10
  C       0.25                   0.28                   11

An optimal code will use k bits to encode a token of probability
2^-k. It will be seen that the codes
in Table 1 are optimal for the case-1 probabilities.
The case-2 probabilities are not powers of 1/2 and no simple
prefix code will be optimal.

The average coding cost in bits is
(- Σ p log2 q)
where p is a token's actual probability,
and q = 2^-k is the probability for
which the code would be optimal.
Using the examples in Table 1, the optimal and actual costs
for case-1 probabilities are 1.50 and 1.50 bits per token;
the optimal and actual costs for case-2 probabilities are
1.575 and 1.640 bits per token.
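These figures can be reproduced with a short Python check (the function names are my own):

```python
from math import log2

def actual_cost(probs, lengths):
    """Average bits per token under the given code lengths."""
    return sum(p * k for p, k in zip(probs, lengths))

def optimal_cost(probs):
    """The Shannon (entropy) bound in bits per token."""
    return -sum(p * log2(p) for p in probs)

lengths = [1, 2, 2]             # code lengths for A, B, C in Table 1
case1 = [0.50, 0.25, 0.25]
case2 = [0.36, 0.36, 0.28]

# case 1: optimal 1.500, actual 1.500; case 2: optimal ~1.575, actual 1.640
```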

The depicted code has minimal cost among all
simple prefix codes for either
of the two probability distributions.
(A simple prefix code of minimal cost is called a Huffman code.)
In case 1, this Huffman code
wastes zero compared with the limiting bound -- the cost of
an optimal Shannon (arithmetic) code.
In case 2, the Huffman code wastes 0.065 bits (1.640 - 1.575) per
token compared with the Shannon limit.
(That's a whopping 4% inefficiency, though that figure
is misleading since
practical Huffman codes usually encode many more than three tokens.)

A Huffman code will waste, typically, about 0.028 bits per
token compared with the Shannon limit. Our case-2 example
wastes much more than this because its dominant probabilities (36%)
are very poorly approximated by any power of 1/2.

This paper describes a new invention ("Truffman code")
which improves on Huffman coding by using a tiny amount of state
information. (A key part of the extra state is
a trit -- hence the name "TRuffman code.")
Instead of wasting 0.028 bits per token, a crude implementation
of the new invention wastes typically about 0.012 bits per token.
(This is a wild guess -- I've only done a few experiments.)
It might be possible to refine the invention to
reduce this to 0.009 or so (??).

Before explaining the Truffman code, we will show an easy way
to improve performance in our case-2 example.

Table 2

  Compound Token   Probability (case 2)   Compressed bit-string
  --------------   --------------------   ---------------------
  A A              0.13 (~.36*.36)        000
  B A              0.13 (~.36*.36)        001
  A B              0.13 (~.36*.36)        010
  B B              0.13 (~.36*.36)        011
  A C              0.10 (~.36*.28)        100
  B C              0.10 (~.36*.28)        101
  C                0.28                   11

By considering two adjacent tokens to be a single compound token,
a new Huffman code is fashioned with 2.720 average bits per compound
symbol compared with 2.709 for the Shannon limit; only 0.011 (0.4%)
wastage per compound symbol, or about 0.006 bits per original token.
This wastage compares favorably with the typical
Huffman code wastage of 0.028 bits per token because the probabilities of
the compound tokens are now all fortuitously close to powers of 1/2.
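The arithmetic behind Table 2 can be verified in a few lines; the probabilities, code lengths, and token counts below are read straight off the table:

```python
from math import log2

# (probability, code length in bits, original tokens per compound symbol)
table2 = (
    [(0.36 * 0.36, 3, 2)] * 4      # A A, B A, A B, B B
    + [(0.36 * 0.28, 3, 2)] * 2    # A C, B C
    + [(0.28, 2, 1)]               # C alone
)

bits_per_symbol = sum(p * k for p, k, _ in table2)     # about 2.720
tokens_per_symbol = sum(p * n for p, _, n in table2)   # about 1.72
entropy = -sum(p * log2(p) for p in (0.36, 0.36, 0.28))
shannon_per_symbol = entropy * tokens_per_symbol       # about 2.710
```

The wastage, bits_per_symbol - shannon_per_symbol, comes out to roughly 0.01 bits per compound symbol.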

In the example there are only three tokens, so there are at most
nine (3*3) compound tokens when at most two tokens are combined,
as in the hand-crafted example.
In practice, a Huffman code might support 200 tokens or so,
leading to as many as 40,000 compound tokens.
This could be unwieldy.
The new Truffman code does not increase the number of tokens;
it does not use compounding.
It achieves the effect of Table 2 in another way.

2.0 Truffman Code

In a Huffman code, one bit is used to encode a
decision of probability 50% and two bits for a 25% decision.
We'd like to use one trit for a decision of
probability 33% or so.
We'd also like to use the left-over space when a 25% decision
does not occur to encode a 75% decision, or the
left-over from a trit to encode a 67% decision.
This is effectively what the Truffman code achieves.

We now describe a conceptual Truffman decoder in detail.
The decoder will be in one of three states: (A) Base,
(B) Preserved-trit, (C) Preserved-bit.
As the names imply, a trit or bit is remembered in the Preserved
states. Hence, you may prefer to think of six states:
A, B0, B1, B2, C0, C1.

A conceptual Huffman decoder inputs bits (50-50 decisions).
The Truffman decoder may also input skewed bits (70-30 decisions).
The decoder will know from context whether it needs an
ordinary bit or a skewed bit.

In the Base state, two bits are input when a skewed bit is needed.
If those bits are both 1 then the 30% branch is taken.
If not, the 70% branch is taken and the decoder enters the
Preserved-trit state with the value of the two bits (00, 01, or 10)
being the preserved trit.

In the Preserved-trit state, when a skewed bit is needed,
no data is input; the preserved trit is used.
If the trit is 10, the 30% branch is taken.
Otherwise the 70% branch is taken and the decoder enters
the Preserved-bit state.
In the Preserved-bit state, the preserved trit (now restricted to
00 or 01) is used for the next needed bit, after which the decoder
goes to Base state. (It is possible to design the code
so that a skewed bit is never needed in the Preserved-bit state.)
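The conceptual decoder above can be sketched as a small state machine. This is an illustrative Python sketch under my own naming (the page gives no code); bits are consumed from an iterable of 0s and 1s:

```python
class TruffmanDecoder:
    """Conceptual decoder with Base, Preserved-trit, and Preserved-bit states."""
    BASE, TRIT, BIT = range(3)

    def __init__(self, bits):
        self.bits = iter(bits)
        self.state = self.BASE
        self.saved = None            # preserved trit (0..2) or leftover bit (0..1)

    def plain_bit(self):
        """An ordinary 50-50 bit."""
        if self.state == self.BIT:   # use the leftover bit, return to Base
            bit, self.state, self.saved = self.saved, self.BASE, None
            return bit
        return next(self.bits)

    def skewed_bit(self):
        """A ~70-30 decision; True means the ~30% branch was taken."""
        if self.state == self.BASE:
            two = 2 * next(self.bits) + next(self.bits)
            if two == 3:                              # '11': the 25% branch
                return True
            self.state, self.saved = self.TRIT, two   # preserve trit 00/01/10
            return False                              # the 75% branch
        if self.state == self.TRIT:
            if self.saved == 2:                       # trit '10': the 1/3 branch
                self.state, self.saved = self.BASE, None
                return True
            self.state = self.BIT                     # trit 00/01: keep its low bit
            return False                              # the 2/3 branch
        # The code is designed so a skewed bit is never needed in Preserved-bit.
        raise AssertionError("skewed bit requested in Preserved-bit state")
```

For example, on input bits 0,1 two successive skewed bits both take the ~70% branch without consuming further input, and the next ordinary bit is the preserved 1.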

Note that the optimal coding probability for the so-called
"70% branch" will alternate
between 75% and 67% as skewed-bit decoding alternates between
the Base state and the Preserved-trit state.
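A quick accounting confirms that this alternation is information-theoretically exact: the 75% branch consumes two input bits but preserves a trit worth log2(3) ≈ 1.585 bits, and the 67% branch consumes that trit while preserving one bit.

```python
from math import log2

trit = log2(3)   # information in one uniform trit, about 1.585 bits

# Base state: two bits in, a trit preserved on the 75% branch.
assert abs((2 - trit) - (-log2(3 / 4))) < 1e-12    # 75% branch costs 2 - log2(3)
assert abs(2 - (-log2(1 / 4))) < 1e-12             # 25% branch ('11') costs both bits

# Preserved-trit state: the trit is consumed, a bit preserved on the 2/3 branch.
assert abs(trit - (-log2(1 / 3))) < 1e-12          # 1/3 branch costs the whole trit
assert abs((trit - 1) - (-log2(2 / 3))) < 1e-12    # 2/3 branch costs log2(3) - 1
```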

I won't outline a Truffman encoder except to note that
the distance between needed skewed bits can, in principle,
be arbitrarily long; thus the Preserved-trit value may not be known
until long after that coding point is encountered.
This is no problem in my proof-of-concept implementation:
I allocate enough memory to contain the entire compressed file.
A simple workaround in a more practical system would be to impose
a rule such as: If a Preserved-trit remains pending for more than 4 tokens,
it collapses to a Preserved-bit; the efficiency loss from such a rule
might be very modest.

3.0 Constructing a Truffman Decoding Tree

There is an elegant algorithm due to David Huffman
for constructing a minimal-cost Huffman code; there are other
simple algorithms for constructing near-optimal ("good
enough") Huffman-like codes.
Constructing even a near-optimal Truffman code seems to be much more
challenging.
On this page I report only my initial effort.
I'm embarrassed: it will seem quite "kludgey" even with the kludgiest details
omitted.

The first step is to construct a histogram of an input training set
in order to obtain the fixed probabilities.
Then I triplicate that histogram, add an escape token,
and apply Huffman's algorithm for the minimal Huffman code.
(In the copies of the histogram I found it useful to, for example,
multiply all probabilities by 0.8 in one copy and by 0.65 in another.)
For example, if there are 100 tokens, 301 tokens will
be submitted to Huffman's algorithm.
(The escape token of very low probability may be convenient, e.g.
as an EOF indication.)
Practical Huffman codes will often impose an upper limit on coded
token lengths
(or almost equivalently, a lower limit on probabilities submitted
to Huffman's algorithm).
In my implementation, such bounds are convenient
for the post-processing kludges to function properly.
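The construction up to the Huffman step can be sketched as follows, assuming the 0.8 and 0.65 scale factors mentioned above; the code and its names are mine, not the author's implementation:

```python
import heapq

def huffman_lengths(weights):
    """Code length in bits for each weight, via Huffman's algorithm."""
    heap = [(w, [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * len(weights)
    while len(heap) > 1:
        wa, a = heapq.heappop(heap)    # merge the two least-probable nodes;
        wb, b = heapq.heappop(heap)    # every leaf below them grows one bit
        for leaf in a + b:
            lengths[leaf] += 1
        heapq.heappush(heap, (wa + wb, a + b))
    return lengths

counts = [36, 36, 28]                  # training histogram (case-2 tokens A, B, C)
scales = (1.0, 0.8, 0.65)              # one scale factor per copy of the histogram
pool = [c * s for s in scales for c in counts] + [0.001]   # triplicate, add escape
lengths = huffman_lengths(pool)        # 3*3 + 1 = 10 submitted tokens
```

Every Huffman code satisfies Kraft equality, so sum(2**-k for k in lengths) comes to exactly 1; the net per-token estimates are then formed by summing the three copies' 2**-k values.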

By summing the "Huffmanized" probabilities (probabilities which are
powers of 1/2) of the three instances of each token, one derives
a net probability estimate which is either 3, 4 or 5 times a
power of 1/2. Some automated post-processing, which I won't describe
in detail, reduces these three classes to
two, while ensuring they comply with certain constraints:
for example, odd-sized subclasses can cause trouble; the 2nd-least probable
token needs to be forced to class 3.
(I told you this was embarrassing.)
Finally tokens and nodes are combined into higher nodes in
an obvious way.
One can achieve slightly better ultimate performance by considering
the actual token probabilities rather than the "Huffmanized" estimates
when combining nodes.

Processing would be much simpler were the
histogram duplicated instead of triplicated.
And quadruplicating might have some advantages!
I've started to code both approaches, but efforts to optimize the
assigned classes may be tedious;
and the existing "proof of concept" implementation may be enough
for my present purpose.

4.0 Conclusion

Dynamic probability estimates are superior to static
estimates, and various high-speed arithmetic and
quasi-arithmetic codes have been developed.
For these reasons, as well as complexities associated with
generating a Truffman tree and using it to encode or decode,
the Truffman Code may have little practical utility.

Still, it is a fun concept. Perhaps someone would be interested enough to
pursue a conference presentation. Or perhaps, someone might
adopt an underlying concept and apply it to something of practical value.