9/15/2017

Oodle's modern encoders take a parameter called the "space-speed tradeoff".
(specifically OodleLZ_CompressOptions::spaceSpeedTradeoffBytes).

"speed" here always refers to decode speed - this is about the encoder making
choices about how it forms the compressed bit stream.

This parameter allows the encoders to make decisions that optimize for a space-speed goal
which is of your choosing. You can make those decisions favor size more, or
you can favor decode speed more.

If you like, a modern compressor is a bit like a compiler. The compressed data is a
kind of program in bytecode, and the decompressor is just an interpreter that
runs that bytecode. An optimal parser is like an optimizing compiler; you're considering
different programs that produce the same output, and trying to find the program that
maximizes some metric. The "space-speed tradeoff" parameter is a bit like -Ox vs -Os,
optimize for speed vs size in a compiler.

Oodle of course includes Hydra (the many headed beast) which can tune performance
by selecting compressors based on their space-speed performance.

But even without Hydra the individual compressors are tuneable, none more so than Mermaid.
Mermaid can stretch itself from Selkie-like (LZ4 domain) up to standard LZH compression
(ZStd domain).

I thought I would show an example of how flexible Mermaid is. Here's Mermaid level 4 (Normal)
with some different space-speed tradeoff parameters :

(* MSVC build of ZStd/LZ4 , not a fair speed measurement (they're faster in GCC), just use as a general reference point)

Point being - not only can Mermaid span a large range of performance but it's *good* at both ends of
that range, it's not getting terrible as it gets out of its comfort zone.

You may notice that as sstb goes below 128 you're losing a lot of decode speed and not gaining much
size. The problem is you're trying to squeeze a lot of ratio out of a compressor that just doesn't
target high ratio. As you get into that domain you need to switch to Kraken. That is, there comes a
point where squeezing the last drop out of Mermaid is a worse space-speed deal than just
making the jump to Kraken. And that's where Hydra comes in, it will do that for you at the right spot.

ADD : Put another way, in Oodle there are *two* speed-ratio tradeoff dials. Most people are just
familiar with the compression "level" dial, as in Zip, where higher levels = slower to encode, but
more compression ratio. In Oodle you have that, but also a dial for decode time :

CompressionLevel = trade off encode time for compression ratio
SpaceSpeedTradeoffBytes = trade off decode time for compression ratio

Perhaps I'll show some sample use cases :

Default initial setting :
CompressionLevel = Normal (4)
SpaceSpeedTradeoffBytes = 256
Reasonably fast encode & decode. This is a balance between caring about encode time, decode time,
and compression ratio. Tries to do a decent job of all 3.

To maximize compression ratio, when you don't care about encode time or decode time :
CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 1
You want every possible byte of compression and you don't care how much time it costs you to encode or
decode. In practice this is a bit silly, rather like the "placebo" mode in x264. You're spending
potentially a lot of CPU time for very small gains.

A more reasonable very high compression setting :
CompressionLevel = Optimal3 (7)
SpaceSpeedTradeoffBytes = 16
This still says you strongly value ratio over encode time or decode time, but you don't want to chase
tiny gains in ratio that cost a huge amount of decode time.

If you care about decode time but not encode time :
CompressionLevel = Optimal4 (8)
SpaceSpeedTradeoffBytes = 256
Crank up the encode level to spend lots of time making the best possible compressed stream, but make
decisions in the encoder that balance decode time.

etc.

The SpaceSpeedTradeoffBytes is a number of bytes that Oodle must be able to save in order to accept a
certain time increase in the decoder. In Kraken that unit of time is 25600 cycles on the artificial
machine model that we use. (that's 8.53 microseconds at 3 GHz). So at the default value of 256, it
must save 1 byte in compressed size to take an increased time of 100 cycles.
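
To make the arithmetic concrete, here is a minimal Python sketch of that exchange rate. The function names are hypothetical, and the accept rule is just one reading of the paragraph above (bytes saved per 25600-cycle unit must be at least the sstb value); it is not Oodle's actual encoder logic.

```python
# Only the 25600-cycle unit and the sstb parameter come from the text;
# everything else here is an illustrative assumption.
CYCLES_PER_UNIT = 25600  # Kraken's time unit on the artificial machine model


def cycles_per_byte(sstb):
    """Extra decode cycles the encoder will accept for each byte saved."""
    return CYCLES_PER_UNIT / sstb


def accepts_slower_choice(bytes_saved, extra_cycles, sstb):
    """True if saving bytes_saved is worth extra_cycles of decode time."""
    return bytes_saved * CYCLES_PER_UNIT >= extra_cycles * sstb


def unit_in_microseconds(clock_hz=3e9):
    """The 25600-cycle unit expressed in microseconds at a given clock."""
    return CYCLES_PER_UNIT / clock_hz * 1e6
```

With sstb = 256 one byte buys 100 cycles; with sstb = 16 the same byte buys 1600 cycles, which is why low values will chase small size gains at a large decode cost.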

I've spent some time in the last month looking into cases where ZStd beats Kraken & Mermaid.

Most of the time Kraken gets better ratio than ZStd,
but there were exceptions to that (mainly text), and it always kind of bothered me,
since Kraken is roughly a superset of ZStd (not exactly), and the differences are small,
it shouldn't have been winning by more than 1% (which is the variation I'd expect due to
small differences). On text files, I have no edge over ZStd, all my advantages are moot, so
we're reduced to both being pretty basic LZ-Huffs; so we should be equal, but I was losing.
So I dug in to see what was going on.

Thanks of course to Yann for making his great work open source so that I'm able to look at it; open source and sharing
code is a wonderful and helpful thing when people choose to do so voluntarily, not so nice when your work is stolen
from you against your will and shown to the world like phone-hacked dick-pics *cough* *assholes*. Since I'm
learning from open source, I figured I should give back, so I'm posting what I learned.

A lot of the differences are a question of binary vs. text focus. ZStd has some tweaking that clearly
comes from testing on text and corpora with a lot of text (like silesia). On the other hand, I've been
focusing very much on binary and that has caused me to miss some important things that only show up
when you look closely at text performance.

This is what I found :

Long hashes are good for text, bad for binary

ZStd non-optimal levels use hash lengths of 5 or even 6 or 7 at the fastest levels. This helps on
text because text has many long matches, so it's important to have a hash long enough that it can
differentiate between "boogie" and "booger" and put them in different hash table bins. (this is
most important at the fastest levels, which use a cache table with no ways).

On binary you really want to hash len 4 because there are important matches of exactly len 4, and longer hashes
can make you miss them.
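
A small Python sketch of the effect, under stated assumptions: the multiplicative hash, its constant, and the table size are illustrative choices, not what ZStd or Oodle actually use. A dict stands in for the cache table, with one position per bucket and the newest entry winning.

```python
def hash_at(data, pos, hash_len, table_bits=16):
    """Multiplicative hash of the hash_len bytes starting at pos (hash_len <= 8)."""
    key = int.from_bytes(data[pos:pos + hash_len], "little")
    # Move the bytes to the top of a 64-bit word so every one of them
    # influences the high bits of the product.
    key = (key << (64 - 8 * hash_len)) & 0xFFFFFFFFFFFFFFFF
    h = (key * 0x9E3779B185EBCA87) & 0xFFFFFFFFFFFFFFFF
    return h >> (64 - table_bits)


def build_cache_table(data, hash_len, table_bits=16):
    """One position per bucket, newest wins (a cache table with no ways)."""
    table = {}
    for pos in range(len(data) - hash_len + 1):
        table[hash_at(data, pos, hash_len, table_bits)] = pos
    return table
```

With hash_len = 4, "boogie" and "booger" feed the same four bytes ("boog") into the hash, so they always land in the same bucket and the later one evicts the earlier one. With hash_len = 6 the hashed bytes differ, so they will usually land in different buckets.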

Longer hashes help the fast modes a *lot* on text. If you care about fast compression of text
you really want those longer hashes.

This is a big issue and because of it ZStd fast modes will continue to be better than Oodle on text
(and Oodle will be better on binary); or we have to find a good way to detect the data type and
tune the hash length to match.

lazy2 is helpful on text

Standard lazy parsing looks for a match at ptr, if one is found it also looks at ptr+1 to see if
something better is there. Lazy2 also looks at ptr+2.
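
As a rough illustration, here is a minimal Python sketch of lazy and lazy2 parsing. It assumes a brute-force match finder and treats "better" as "strictly longer", which is a simplification; real parsers also weigh offset cost. The token format and function names are hypothetical.

```python
MIN_MATCH = 4


def longest_match(data, pos):
    """Brute-force longest earlier match at pos; returns (length, offset)."""
    best_len, best_off = 0, 0
    for start in range(pos):
        n = 0
        while pos + n < len(data) and data[start + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_off = n, pos - start
    return (best_len, best_off) if best_len >= MIN_MATCH else (0, 0)


def lazy_parse(data, lookahead=1):
    """lookahead=0 is greedy, 1 is standard lazy, 2 is lazy2."""
    out, pos = [], 0
    while pos < len(data):
        length, offset = longest_match(data, pos)
        # Defer the match if pos+1 (or pos+2 for lazy2) has a longer one.
        better_ahead = length > 0 and any(
            longest_match(data, pos + step)[0] > length
            for step in range(1, lookahead + 1)
            if pos + step < len(data)
        )
        if length == 0 or better_ahead:
            out.append(("lit", data[pos]))
            pos += 1
        else:
            out.append(("match", length, offset))
            pos += length
    return out


def decode(tokens):
    """Rebuild the bytes from literal and match tokens."""
    buf = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:
            _, length, offset = tok
            for _ in range(length):
                buf.append(buf[-offset])
    return bytes(buf)
```

When a deferral happens the parser emits one literal and re-evaluates from pos+1, so a better match at pos+2 is still reached after a second literal.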

I wasn't doing 2-ahead lazy parsing, because on binary it doesn't help much. But on text it's
a nice little win :

I once wrote that in codecs that do strong rep0 exclusion (rep0len1 literal can't occur immediately after a
match), you can just always send max-length matches, and not have to consider match length reductions.
(because max-length matches maintain rep0 exclusion but shorter ones violate it).

That is not quite right.
It tends to be true on binary, but is wrong on text.
The issue is that you only get the rep0 exclusion benefit if you actually send a literal after the match.

That happens often on binary. Binary frequently goes match-literal-match-literal , with some near-random
bytes between predictable regions.
Text has very few literals. Many text files go match-match-match which means the rep0 literal exclusion does
nothing for you.

On text files you often have many short & medium length overlapping matches, and trying len reductions is
important to find the parse that traces through them optimally.

AAAADDDGGGGJJJJ
BBBBBFFFHHHHHH
CCCEEEEEIII
and the optimal parse might be
AAABBBFFFHHHHHH
which you would only find if you tried the len reduction of A

this kind of thing. Text is all about making the best normal-match decisions.
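
A toy Python sketch of why length reductions can matter. The flat costs (9 bits per literal, 16 bits per match), the minimum match length of 3, and the brute-force match finder are all assumptions chosen to keep the example small; a real optimal parser uses statistics-driven costs.

```python
LIT_COST = 9     # assumed flat cost of a literal, in bits
MATCH_COST = 16  # assumed flat cost of any match, in bits
MIN_MATCH = 3


def longest_match_len(data, pos):
    """Length of the longest earlier match at pos (brute force)."""
    best = 0
    for start in range(pos):
        n = 0
        while pos + n < len(data) and data[start + n] == data[pos + n]:
            n += 1
        best = max(best, n)
    return best


def parse_cost(data, try_reductions):
    """Cheapest parse cost in bits, by forward dynamic programming."""
    n = len(data)
    cost = [float("inf")] * (n + 1)
    cost[0] = 0
    for pos in range(n):
        # Option 1: send a literal.
        cost[pos + 1] = min(cost[pos + 1], cost[pos] + LIT_COST)
        # Option 2: send a match, at max length only or at every length.
        max_len = longest_match_len(data, pos)
        if max_len >= MIN_MATCH:
            if try_reductions:
                lengths = range(MIN_MATCH, max_len + 1)
            else:
                lengths = (max_len,)
            for length in lengths:
                end = pos + length
                cost[end] = min(cost[end], cost[pos] + MATCH_COST)
    return cost[n]
```

On the input b"abcd#dxy#abcdxy" the max-length parse takes "abcd" and is then stuck with two literals, while the reduced parse takes "abc" and then the overlapping match "dxy", which comes out cheaper under these costs.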

Getting len 3 matches right in the optimal parser is really important on text

Part of the "text is all matches" issue. My codecs are mostly MML 4 in the non-optimal modes,
then I switch to MML3 at level 7 (Optimal3). Adding MML3 generally lets you get a bit more
compression ratio, but hurts decode speed a bit.

(BTW MML3 in the non-optimal modes generally *hurts* compression ratio, because they can't make
the decision correctly about when to use it. A len 3 match is always marginal, it's only
slightly cheaper than 3 literals (depending on the literals), and you probably don't want it if you
can find any longer match within those next 3 bytes. Non-optimal parsers just make these decisions
wrong and muck it all up, they do better with MML 4 or even higher sometimes. (there are definitely
files where you can crank up MML to 6 or 8 and improve ratio))

So, I was doing that *but* I was using the statistics from a greedy pre-pass to seed the optimal
parse decisions, and the greedy pre-pass was MML 4, which was biasing the optimal against len 3 matches.
It was just a fuckup, and it wasn't hurting me on binary, but when I compared to ZStd's optimal parse
on text I could immediately see it had a lot more len 3 matches than me.

(this is also an example of
the parse-statistics feedback problem, which I believe is the most important problem in LZ compression)

There's a lot of little clever nuggets that are hard to see. They aren't generally commented and they're buried in
chunks of copy-pasted code that all looks the same so it's easy to gloss over the variations.

and I thought - okay, look for a 4 byte rep match, if found take it unconditionally and don't look for
normal match. That's the same thing I do (I think it came from me?), no biggie.

But there's a wrinkle. The rep check is not at the same position as the normal match. It's at pos+1.

This is actually a mini-lazy-parse. It doesn't do a full match & rep find at pos & (pos+1). It's just
scanning through, at each pos it only does one rep find and one match find, but the rep find is offset
forward by +1. That means it will take {literal + rep} even if match is available, which a normal
non-lazy parser can't do.

(aside : you might think that this misses a rep find, when the literal run starts, right after a match,
it starts finding the first rep at pos+1 so there's a spot where it does no rep find. But that spot is where
the rep0 exclusion applies - there can be no rep there, so it's all good!)
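
A minimal Python sketch of that scan, written from the description above rather than from ZStd's source. The dict stands in for the hash table, the token format is hypothetical, and positions inside matches are not inserted, which a real encoder may handle differently.

```python
MIN_MATCH = 4


def match_len(data, a, b):
    """Length of the common run of data[a:] and data[b:], with a < b."""
    n = 0
    while b + n < len(data) and data[a + n] == data[b + n]:
        n += 1
    return n


def fast_parse(data):
    out, pos, rep = [], 0, 0
    last = {}  # 4-byte key -> most recent position (stand-in for the hash table)
    while pos < len(data):
        # The rep check is offset forward by one: test the last offset at pos+1.
        if rep and pos + 1 < len(data) and pos + 1 - rep >= 0:
            n = match_len(data, pos + 1 - rep, pos + 1)
            if n >= MIN_MATCH:
                # Take {literal + rep} without looking for a normal match.
                out.append(("lit", data[pos]))
                out.append(("match", n, rep))
                pos += 1 + n
                continue
        # Otherwise do one normal match find at pos.
        key = data[pos:pos + MIN_MATCH]
        cand = last.get(key)
        last[key] = pos
        if cand is not None and len(key) == MIN_MATCH:
            n = match_len(data, cand, pos)
            rep = pos - cand
            out.append(("match", n, rep))
            pos += n
            continue
        out.append(("lit", data[pos]))
        pos += 1
    return out
```

For example, on b"01234567890123X56789" the parser sends a match for "0123", then the literal "X" followed by a rep match for "56789" at the same offset, even though it never searched for a normal match at the "X".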

ADD : a couple more notes on ZStd (that aren't from the recent investigation) while I'm at it :

ZStd uses a unique approach to the lrl0-rep0 exclusion

After a match (of full length), that same offset cannot match again. If your offsets are in a rep match cache, the most
recently used offset is the top (0th) entry, rep0. This is the lrl0-rep0 exclusion.

rep0 is usually the most likely match, so it will get the largest share of the entropy coder probability space. Therefore
if you're in an exclusion where that symbol is impossible, you're wasting a lot of bits.

There are two ways that I would call "traditional" or straightforward data compression ways to model the lrl0-rep0 exclusion.
One is to use a single bit for (lrl == 0) as context for the rep-index coding event. eg. you have two entropy coding states
for offsets, one for lrl == 0 and one for lrl != 0. The other classical method would be to combine lrl with rep-index in a
larger alphabet, which allows you to model their correlation using only order-0 entropy coding. The minimum alphabet size here
is only 2 bits, 1 bit for (lrl == 0) or not, and one for (match == rep0) or not.

ZStd does not use either of these methods. Instead it shifts the rep index by (lrl == 0). That is, ZStd has 3 reps, and normally
they are in match offset slots 0,1,2. But right after the end of a match (when lrl is 0) those offset values change to mean rep
1,2,3 ; and there is no rep3, that's a virtual offset equal to (rep0 - 1).
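
The index shift can be sketched in a few lines of Python. This follows the description above (slots 0,1,2 and a virtual rep3), not ZStd's source; the function name is hypothetical and the update of the rep history after the match is deliberately left out.

```python
def resolve_rep(code, lrl, reps):
    """Map a rep slot (0, 1 or 2) to an offset, given the three rep offsets.

    reps[0] is the most recently used offset (rep0).
    """
    index = code + (1 if lrl == 0 else 0)  # shift by (lrl == 0)
    if index == 3:
        return reps[0] - 1  # the virtual "rep3"
    return reps[index]
```

With reps = [8, 16, 32], slot 0 means offset 8 after a literal run, but means offset 16 right after a match, and slot 2 right after a match means the virtual offset 7.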

I can't say how well the ZStd method here compares to the alternatives as it's a bit more effort to check than I'd like to do.
(if you want to try it, you could double the size of ZStd's offset coding alphabet to put 1 bit of lrl == 0 into the offset coding;
then the decode sequence grabs an offset and only pulls an lrl code if the offset bit says so).

ZStd uses TANS in a limited and efficient way

ZStd does not use TANS (FSE) on its literals, which are the largest class of entropy coded symbols. Presumably Yann found, like us,
that the compression gains on literals (over Huffman) are small, and the speed cost is not worth it. ZStd only uses TANS on the
LZ match components - LRL, offset, ML.

Each of these has a small alphabet (52,35,28), and therefore can use a small # of bits for the TANS tables (9,9,8). This is a sweet
spot for TANS, so it works well in ZStd.

For large alphabets (eg. 256 for literals), TANS needs a higher # of bits for its code tables (at least 11), which means 2048 entries
being filled. This makes the table setup time rather large. By cutting the table size to 8 or 9 bits you cut that down by 4-8X.
With large alphabets you also may as well just go Huff. But with small alphabets, Huff gets worse and worse. Consider the extreme -
in an alphabet of 2 symbols Huff becomes no compression at all, while TANS can still do entropy coding. With small alphabets to use Huffman you
need to combine symbols (eg. in a 2-bit alphabet you would code 4 at once as an 8-bit symbol). BUT that means going up to big decoder
tables again, which adds to your constant overhead.
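
The small-alphabet problem for Huffman is easy to check numerically. The Python sketch below is only an illustration with hypothetical helper names: it builds Huffman code lengths for a skewed 2-symbol source, coded one symbol at a time and then in blocks, and compares against the entropy.

```python
import heapq
import itertools
import math


def huffman_lengths(probs):
    """Huffman code length for each symbol, given its probability."""
    if len(probs) == 1:
        return [1]
    # The unique integer in each tuple breaks ties so lists are never compared.
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge adds one bit to the merged symbols
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths


def huffman_bits_per_symbol(p, block):
    """Average Huffman bits per binary symbol when coding `block` at once."""
    probs = [math.prod(p if b else 1 - p for b in bits)
             for bits in itertools.product((0, 1), repeat=block)]
    lengths = huffman_lengths(probs)
    return sum(q * n for q, n in zip(probs, lengths)) / block


def entropy(p):
    """Entropy in bits of a binary source with probabilities p and 1 - p."""
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
```

For p = 0.9, coding one symbol at a time costs exactly 1 bit per symbol (no compression), while the entropy is about 0.47 bits. Blocking 4 symbols gets Huffman much closer, at the price of a 16-entry alphabet, and the alphabet keeps growing as you block more.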

FSE uses the prime-scatter method to fill the TANS decode table. (this is using a relatively-prime step to just walk around the
circular array, using the property that you can just keep stepping that way and you will eventually hit every slot once and only once).
I evaluated the prime-scatter method before and concluded that the compression penalty was unacceptably large.
I was mistaken. I had just implemented it wrong, so my results were much worse than they should be.

(the mistake I made was that I did the prime-scatter in one pass; for each symbol, take the steps and fill
table entries, increment "from_state" as you step, "to_state" steps around with the prime-modulo. This causes
a non-monotonic relationship between from_state and to_state which is very bad. The right way to do it
(the way ZStd/FSE does it) is to use some kind of two-pass scheme, so that you do the shuffle-scatter first
(which can step around the loop non-monotonically) but then assign the from_state relationship in a second
pass which ensures the monotonic relationship).
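
Here is a minimal Python sketch contrasting the two. Assumptions: the step formula is the one I believe FSE uses (it is odd, hence relatively prime to the table size, for tables of 16 entries or more), from_state for a symbol with normalized count c runs over c .. 2c-1, and to_state is represented by the slot index. The two_pass flag and function names are hypothetical.

```python
def build_table(norm_counts, table_log, two_pass=True):
    """Return a list of (symbol, from_state) per slot, slot = to_state."""
    size = 1 << table_log
    assert sum(norm_counts) == size
    step = (size >> 1) + (size >> 3) + 3
    assert step % 2 == 1  # odd step visits every slot of a power-of-2 table once

    symbol_at = [None] * size
    from_state = [None] * size
    pos = 0
    for sym, count in enumerate(norm_counts):
        for k in range(count):
            symbol_at[pos] = sym
            # One-pass assignment: from_state follows the scatter order.
            # This is the mistake described above.
            from_state[pos] = count + k
            pos = (pos + step) & (size - 1)

    if two_pass:
        # Second pass: walk slots in increasing order and hand out each
        # symbol's from_states in increasing order, restoring monotonicity.
        next_from = list(norm_counts)
        for slot in range(size):
            sym = symbol_at[slot]
            from_state[slot] = next_from[sym]
            next_from[sym] += 1
    return list(zip(symbol_at, from_state))


def is_monotonic(table, sym):
    """True if sym's from_states increase with slot (to_state) order."""
    states = [f for s, f in table if s == sym]
    return states == sorted(states)
```

Both variants put the same symbols in the same slots; only the from_state assignment differs. With counts [5, 3, 8, 16] and a 32-entry table, the one-pass version gives symbol 0 the from_states 5, 8, 7, 6, 9 in slot order, while the two-pass version gives 5, 6, 7, 8, 9.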

With a correct implementation, prime-scatter's compression ratio is totally fine (*). The two-pass method that ZStd/FSE
uses would be slow for large alphabets or large L, but ZStd only uses FSE for small alphabets and small L.
The entropy coder and application are well matched. (* = if you special case singletons, as below)

The worst case for prime-scatter is low counts, and counts of 1 are the worst. ZStd/FSE uses a special case
for counts of 1 that are "below 1". Back in the "Understanding TANS" series I looked at the "precise sort" method
of table building and found that artificially skewing the bias to put counts of 1 at the end was a big win in
practice. The issue
there is that the counts we see at that point are normalized, and zeros were forced up to 1 for codeability.
The true count might be much lower. Say you're coding an array of size 64k and symbol 'x' only occurs 1 time.
If you have a TANS L of 1024 , the true probability should be 1/64k , but normalized forces it up to 1/1024.
Putting the singleton counts at the end of the TANS array gives them the maximum codelen (end of the array
has maximum fractional bits). The sort bias I did before was a hack that relies on the fact that most
singleton counts come from below-1 normalized probabilities. ZStd/FSE explicitly signals the difference, it
can send a "true 1" (eg. closest normalized probability really is 1/1024 ; eg. in the 64k array, count is near
64), or a "below 1" , some very low count that got forced up to 1. The "below 1" symbols are forced to the end
of the TANS array while the true 1's are allowed to prime-scatter like other symbols.
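
A rough Python sketch of that placement, written from the description above and not copied from ZStd/FSE. The function name, the below_one set, and the step formula are assumptions; each "below 1" symbol takes one slot at the end of the table, and the scatter walk skips those reserved slots.

```python
def scatter_with_below_one(norm_counts, below_one, table_log):
    """Fill a TANS symbol table, forcing "below 1" symbols to the end.

    norm_counts[s] is the normalized count of symbol s (1 for every symbol
    in below_one); below_one is the set of symbols flagged as "below 1".
    """
    size = 1 << table_log
    assert sum(norm_counts) == size
    step = (size >> 1) + (size >> 3) + 3  # assumed step, odd for size >= 16
    table = [None] * size

    # Reserve the last slots for the "below 1" symbols.
    high = size - 1
    for sym in sorted(below_one):
        table[high] = sym
        high -= 1

    # Scatter everything else (true 1's included), skipping reserved slots.
    pos = 0
    for sym, count in enumerate(norm_counts):
        if sym in below_one:
            continue
        for _ in range(count):
            table[pos] = sym
            pos = (pos + step) & (size - 1)
            while pos > high:
                pos = (pos + step) & (size - 1)
    return table
```

In the example counts [1, 1, 6, 24] with symbol 0 flagged as "below 1", symbol 0 lands in the last slot while symbol 1, a true 1, is scattered like any other symbol.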