(though you can get data out incrementally; you don't have to wait for the whole decode to get the first
byte out. But if you only needed the last byte of the block, it's strictly longer latency).

If you need fine-grained paging, you have to trade off the precise loading control you get from small
blocks against the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into
paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed
across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor
between files.)

You may have noticed that the chunked files (once you get past the very small 16k, 32k) are somewhat faster
to decode. This is due to keeping match references in the CPU cache in the decoder.

Limiting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for
staying in cache, but with a smaller compression win.

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend
fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has
a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.

If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal
intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls.
This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes.
Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones
don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes
that must be saved from the compressed output size before the encoder will choose a slower decode mode.
eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is
scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid
entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is
256 ; I recommend trying 128 to see what the effect is.
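
For illustration, here is one way you might scale the value down for small buffers. The linear scaling
against a 128k reference size is my assumption, not an Oodle-documented formula; simply trying 128 as
suggested above is also fine :

#include <stdint.h>

// Sketch : pick a spaceSpeedTradeoffBytes value scaled to a small buffer.
// The linear scale against the 128k "full size" is an assumption for illustration;
// the 256 default and the suggested 128 come from the discussion above.
static int32_t scaled_sstb(int32_t defaultSSTB, int64_t bufferSize)
{
    const int64_t kFullSize = 128 * 1024;
    if (bufferSize >= kFullSize)
        return defaultSSTB;
    int64_t v = (defaultSSTB * bufferSize) / kFullSize;
    return (int32_t)(v < 16 ? 16 : v);   // arbitrary floor so it never reaches zero
}

// eg. scaled_sstb(256, 64*1024) == 128 ; scaled_sstb(256, 32*1024) == 64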

2/09/2017

I've been fooling around with strtod for a few days for no good reason. It's been quite interesting.

Some little notes :

strtod in many compilers & standard libraries is broken. (see the excellent pages at
exploring binary for
details)

(to clarify "broken" : they all round-trip doubles correctly; if you print a double and scan it using the same compiler, you will
get the same double; that's quite easy. They're "broken" in the sense of not converting a full-precision string to the closest
possible double, and they're "broken" in the sense of not having a single consistent rule that tells you how any given full
precision numeral string will be converted)

The default rounding rule for floats is round-nearest (banker). What that means is when a value is exactly at the midpoint of
either round-up or round-down (the bits below the bottom bit are 100000..) , then round up or down to make the bottom bit of the
result zero. This is sometimes called "round to even" but crucially it's round to even *mantissa* not round to even *number*.

Our goal for strtod should be to take a full numeral string and convert it to the closest possible double using
banker rounding.

in this case the "banker bit" (bottom bit of mantissa bit range) is off, the next bit is on.

If this value was exact, you should round *down*. (if the banker bit was on, you should round up). If this value is not exact by even the
tiniest bit (as in, there are more significant bits below the range we have seen, eg. it was 1000....001), it changes to rounding up.

If you think of the infinity of the real numbers being divided into buckets, where each bucket corresponds to one of the
doubles that covers that range, then this value sits right on a bucket edge.

In practice the way to do strtod is to work in ints, so if you have the top 64 bits of the (possibly very long) value in an int, this is :

hi = a u64 with first 64 bits
top bit of "hi" is on
(hi>>11) is the 52 bits of mantissa + implicit top bit
hi & 0x400 is the "rounding bit"
hi & 0x800 is the "banker bit"
hi & 0x3ff are the bits under the rounding bit
hi & 0xfff == 0x400 is low boundary that needs more bits
hi & 0xfff == 0xBFF is high boundary that needs more bits
At 0x400 :
"rounding bit" is on, we're right on threshold
"banker bit" is off, so we should round down if this value is exact
but any more bits beyond the first 64 will change us to rounding up
so we need more bits
- just need to see if there are any bits at all that could be on
At 0xBFF :
"banker bit" is on, so if we were at rounding threshold, we should round up
(eg. 0xC00 rounds up, and doesn't need any more bits, we know that exactly)
the bits we have are just below rounding threshold
if the remaining bits are all on (FFFF)
OR if they generate a carry into our bottom bit
then we will round up
- need enough of the remaining value to know that it can't push us up to 0xC00
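
To make that case analysis concrete, here it is transcribed into a small C routine. This is just my
illustration of the logic above (the names and the "value_is_exact" flag are mine), not code from any
particular strtod :

#include <stdint.h>

typedef enum { ROUND_DOWN, ROUND_UP, NEED_MORE_PRECISION } round_decision;

// "hi" holds the first 64 bits of the value, left-justified (top bit on);
// (hi>>11) is the 52-bit mantissa + implicit top bit.
// "value_is_exact" means no nonzero bits or digits exist beyond what hi has seen.
static round_decision decide_round(uint64_t hi, int value_is_exact)
{
    uint64_t low12 = hi & 0xFFF;

    if (low12 == 0x400)   // right on the threshold, banker bit off
        return value_is_exact ? ROUND_DOWN : NEED_MORE_PRECISION;

    if (low12 == 0xBFF)   // just below the 0xC00 threshold, banker bit on
        return value_is_exact ? ROUND_DOWN : NEED_MORE_PRECISION;

    if (hi & 0x400)       // rounding bit on : at or above the midpoint
        return ROUND_UP;  // (the exact-midpoint, banker-off case was handled above)

    return ROUND_DOWN;    // below the midpoint
}

The final mantissa is then (hi>>11), plus one if the decision was ROUND_UP (watching for that increment
overflowing the mantissa and bumping the exponent).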

The standard approach to doing a good strtod is to start by reading the first 19 digits into a U64. You just use *= 10 integer multiplies
to do the base-10 to base-2 conversion, and then you have to adjust the place-value of the right hand side of those digits using a table to find a power of 10.
(the place-value is obviously adjusted based on where the decimal was and the exponent provided in the input string ; so if you are given
"70.23e12" you read "7023" as an int and then adjust pow10 exponent +10).

Once you have the first 19 digits in the U64, you know that the full string must correspond to an interval :

so the difficult cases that need a lot of digits to discriminate are the ones where the first 19 digits put you just below the
threshold of rounding, but the hi end of the interval is above it.

Older MSVC only used the first 17 digits so fails sooner. For example :

34791611969279740608512
=
111010111100000111010011011110000000011001100010011101000000000000000000000
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
this is right on a rounding threshold; because the banker bit is off it should round down
the correct closest double is
34791611969279739000000.000000
but if you bias that up beyond the digits that MSVC reads :
34791611969279740610310
=
111010111100000111010011011110000000011001100010011101000000000011100000110
1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
this should round up; the closest double is
34791611969279743000000.000000
but old MSVC gets it wrong
-----------
Another example :
2022951805990391198363682
=
110101100011000000110111111101000101100100011110011001000000000000000000000000000
is the double threshold
2022951805990391198666718
=
110101100011000000110111111101000101100100011110011001000000000100100111111011110
is definitely above the threshold, should round up
but if you only read the first 19 base-10 digits :
2022951805990391198000000
=
110101100011000000110111111101000101100100011110011000111111110000010001110000000
you see a value that is below the threshold and would round down
You have to look at the interval of uncertainty -
2022951805990391198000000
2022951805990391199000000
and see that you straddle a boundary and must get more digits.

There are two things that need refinement : getting more digits & doing your place value scaling to more
bits of precision. You can iteratively increase the precision of each, which refines your interval smaller
and smaller, until you either know which side of the rounding barrier to take, or you have got all the bits of
precision.

BTW this needing-lots-of-precision case is exactly the same as the arithmetic coder "underflow" case, where
your coding interval looks like this, in binary :

hi = 101101010101001 100000000000000.110101010101
lo = 101101010101001 011111111111111.01010101010
       ^             ^              ^- active coding zone, new symbols add at this fixed point position
       |             |
       |             +-- bad underflow zone! can't yet tell if this bit should be a 1 or 0
       |
       +-- hi and lo the same in top bits, these bits have been sent & renormalized out

anyway, just a funny parallel to me.

There are obviously nasty issues to be aware of with the FPU rounding mode, control word, precision,
or with the optimizer doing
funny things to your FPU math (and of course be aware of x86 being on the FPU and x64 being in SSE).
The best solution to all that is to just not use floats; use ints and construct the output floats by hand.

(there is an optimization possible if you can trust the FPU ; for numbers that fit completely in an int, you can
just use the FPU's int-to-float to do all the work for you, if the int-to-float rounding mode matches the
rounding mode you want from your strtod)
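
The easiest version of that shortcut doesn't even need to trust the FPU rounding mode, because the
conversion is exact; a sketch (the function and its interface are mine) :

#include <stdint.h>

// If the digits fit in 53 bits and there is no power-of-ten scaling left to apply,
// the int-to-double conversion is exact and no rounding decision is needed at all.
// With a trusted round-nearest FPU mode the same trick extends to larger ints and
// to small powers of ten that are exactly representable as doubles.
static int try_easy_case(uint64_t digits, int pow10, double *out)
{
    if (pow10 == 0 && digits < (1ull << 53))
    {
        *out = (double)digits;   // exact
        return 1;
    }
    return 0;
}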

Now, you may think "hey who cares about these boundary cases - I can round-trip doubles perfectly without them".
In fact, round-tripping doubles is easy : since they never have any bits on below the (52+1) of the mantissa,
you never get this problem of bits beyond the last one! (you only need 17 digits to round-trip doubles).

Personally I think that these kinds of routines should have well-defined behavior for all inputs. There should be
a single definite right answer, and a conforming routine should get that right answer. Then nutters can go off and
optimize it and fiddle around, and you can run it through a test suite and verify it *if* you have exactly defined
right answers. This is particularly true for things like the CRT. (this also comes up in how the compiler converts
float and double constants). The fact that so many compilers fucked this up for so long is a bit of a head scratcher
(especially since good conversion routines have been around since the Fortran days).
(the unwillingness of people to just use existing good code is so bizarre to me; they go off and roll their own and
invariably fuck up some of the difficult corner cases).

Anyway, rolling your own is a fun project, but fortunately there's good code to do this :

(BTW lots of people seem to have different versions of dtoa with various changes/fixes; for example gcc,
mozilla, chromium; I can't find a good link or home page for the best/newest version of dtoa; help?)

dtoa is excellent code in the sense of working right, using good algorithms, and being fast. It's terrible code in
the sense of being some appalling spaghetti. This is actual code from dtoa :

Note that dtoa can optionally check the FLT_ROUNDS setting but does not round correctly
unless it is 1 (nearest). Furthermore, in MSVC the FLT_ROUNDS value in the header is broken in some versions
(always returns 1 even if fesetenv has changed it). So, yeah. Don't mess with the float rounding mode please.

dtoa is partly messy because it supports lots of wacky stuff from rare machines (FLT_RADIX != 2 for example).
(though apparently a lot of the weird float format modes are broken).

An alternative I looked at is "floatscan" from MUSL. Floatscan produces correct results, but is very slow.

My variant here ("mine") is a little bit faster than dtoa. That basically just comes from using the modern 64-bit CPU. For example I
use mulq to do a 64x64 to 128 multiply, whereas he uses only 32*32 multiplies and simulates large words by doing the carries manually.
I use clz to left-justify binary, whereas he uses a bunch of if's. Stuff like that.
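
The 64-bit primitives in question look roughly like this on gcc/clang (MSVC has _umul128 and
_BitScanReverse64 for the same jobs); a sketch :

#include <stdint.h>

// 64x64 -> 128 multiply, returning the high 64 bits (the "mulq" part).
static uint64_t mul_hi64(uint64_t a, uint64_t b, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    return (uint64_t)(p >> 64);
}

// Left-justify a nonzero value so its top bit is set (the "clz" part),
// returning the shift so the caller can adjust the binary exponent.
static uint64_t left_justify(uint64_t x, int *shift)
{
    *shift = __builtin_clzll(x);   // undefined for x == 0, caller must check
    return x << *shift;
}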

The MUSL floatscan slowness is unfortunate. It seems to be O(N) in the # of digits even when the first 19 digits can unambiguously
resolve the correct answer. Basically it just goes straight into bigint (it makes an array of U32's which each hold 9 base-10 digits),
which is unnecessary most of the time.

Tips for using Oodle as efficiently and robustly as possible. (wrt your game runtime loading +
decoding only).

1. Use the new compressors (Kraken/Mermaid/Selkie/Hydra). Aside from having great performance, they
have the most robust and well-tested fuzz safety.

2. Pass the FuzzSafe_Yes option to OodleLZ_Decompress. The KMS decoders are always internally fuzz safe.
What the FuzzSafe_Yes option does is add some checks to the initial header decode which will cause the decode
to fail if the header byte has been tampered with to change it to a non-KMS compressor.

3. Use Oodle Core lib only, not Ext. Core lib is very light and tight. Core lib makes no threads, requires no
init, and will do no allocations to decode (if you pass in the memory).

6. If possible keep the Decoder memory around or pull from some shared scratch space rather than
allocating & freeing it every time.

7. If you're on a platform where Oodle is in a DLL or .so, sign it with a mechanism like authentisign
to ensure it isn't tampered with.

8. If possible do async IO and use double-buffering to incrementally load data and decompress, so that you are
maximally using the IO bus and the CPU at the same time (see the sketch after this list). If loading and
decoding large buffers, consider running the decoders "ThreadPhased".

9. If possible use Thread-Phased decoding on large buffers that can't be broken into multiple decodes.
The most efficient use of threading always comes from running multiple unrelated encode/decodes simultaneously,
rather than trying to multi-thread a single encode/decode.

10. Use a mix of compressors to put your decode time where it helps the most. eg. use the slower compressors
only when they get big size wins, and perhaps tune the compression to the game's latency needs : random access
data that needs to get loaded as fast as possible can use faster decoders, while background data that streams in
slowly can use slower decoders. Hydra is the best way to do this, since it will automatically model decode speed
and tune to maximize the space-speed tradeoff.
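
As a rough illustration of point 8, here is the shape of a double-buffered load+decode loop.
begin_read / wait_read / decode_chunk are hypothetical placeholders (not Oodle or OS APIs), and real code
would get per-chunk compressed sizes from a table of contents rather than assuming fixed-size chunks :

// Hypothetical async-read and decode calls, declared only so this compiles :
void begin_read(const char *path, long long offset, void *dst, long long size);
void wait_read(void *dst);
void decode_chunk(const void *comp, long long compSize, void *raw);

enum { CHUNK = 256 * 1024 };

// While chunk i is decoding on the CPU, chunk i+1 is already in flight on the IO bus.
static void stream_load(const char *path, int num_chunks, unsigned char out[][CHUNK])
{
    static unsigned char ioBuf[2][CHUNK];   // static just to keep it off the stack

    begin_read(path, 0, ioBuf[0], CHUNK);                  // prime the pipe
    for (int i = 0; i < num_chunks; i++)
    {
        wait_read(ioBuf[i & 1]);                           // chunk i is now resident
        if (i + 1 < num_chunks)                            // kick off the next read
            begin_read(path, (long long)(i + 1) * CHUNK, ioBuf[(i + 1) & 1], CHUNK);
        decode_chunk(ioBuf[i & 1], CHUNK, out[i]);         // decode overlaps the next IO
    }
}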

2/01/2017

02-01-17 | Oodle Hydra

Oodle Hydra - the many-headed beast.

Hydra is a meta-compressor which selects Kraken, Mermaid, or Selkie per block. It uses the speed fit model of each compressor to
make a lagrangian space-speed optimization decision about which compressor minimizes the desired lagrange cost
(size + lambda*time).

It turns out to be quite interesting.

(this is of course in addition to each of those compressors internally making space-speed decisions; each of them can enable or
disable internal processing modes using the same lagrange optimization model. (eg. they can turn on and off entropy coding
for various streams). And there are additional per-block implicit decisions such as choosing uncompressed blocks and huff-only blocks.)

Hydra is a single entry point to all the Oodle compressors. You simply choose how much you care about size vs. decode speed, which
corresponds to a certain lagrange lambda. In Oodle this is called "spaceSpeedTradeoffBytes". It's the # of bytes that compression must
save in order to take up N cycles more of decode time. You then no longer think about "do I want Kraken or Mermaid" ; Oodle makes the
right decision for you that optimizes the goal.
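
The per-block decision is then just an argmin over the candidate compressors. A sketch of the idea; the
struct, the names, and expressing lambda as bytes-per-cycle are placeholders of mine, not the real Oodle
speed model :

#include <stdint.h>

typedef struct
{
    const char *name;      // "Kraken", "Mermaid", "Selkie"
    int64_t compSize;      // compressed size of this block, in bytes
    int64_t decodeCycles;  // estimated decode time from that compressor's speed fit
} candidate;

// Minimize (size + lambda*time) per block; lambda converts decode cycles to bytes.
static int pick_block_compressor(const candidate *c, int count, double lambda_bytes_per_cycle)
{
    int best = 0;
    double bestCost = 0;
    for (int i = 0; i < count; i++)
    {
        double cost = (double)c[i].compSize + lambda_bytes_per_cycle * (double)c[i].decodeCycles;
        if (i == 0 || cost < bestCost) { bestCost = cost; best = i; }
    }
    return best;
}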

Hydra can interpolate the performance of Kraken & Mermaid to create a meta-compressor that targets the points in between.
That in itself is a somewhat surprising result. Say Kraken is at 1000 mb/s , Mermaid is at 2000 mb/s decode speed, but you really
want a compressor that's around 1500 mb/s with compression between the two. We don't know of a Pareto-optimal compressor that
is between Kraken and Mermaid, so you're sunk, right? No, you can use Hydra.

I should note that Hydra is very much about *whole corpus* performance. That is, if your target is 1500 mb/s, you may not hit that
on any one file - that file could go either all-Kraken or all-Mermaid. The target is hit overall. This is intentional and good,
but if for whatever reason you are trying to hit a specific speed for an individual file then Hydra is not the way to do that.

It leads to an idea that I've tried to advocate for before : corpus lagrange optimization for bit rate allocation. If you are dealing
with a limited resource that you want to allocate well, such as disk size or download size or time to load - you want to allocate that
resource to the data that can make the best use of it. eg. spend your decode time where it makes the biggest size difference.
(I encourage this for lossy bit rate allocation as well). So with Hydra some files decode slower and some decode faster, but when they
are slower it's because the time was worth it.

And now some reports. We're going to look at 3 corpora. On Silesia and gametestset, Hydra interpolates as expected. But then on PD3D,
something magic happens ...

You can see it visually in the loglog plot; if you draw a line between Kraken & Mermaid, the Hydra data points are above that
line (more compression) and to the right (faster).

What's happening is that once in a while there's a block where Mermaid gets the same or more compression than Kraken. While that's
rare, when it does happen you just get a big free win from switching to Mermaid on that block. More often, Mermaid only gets a little
bit less compression than Kraken but a lot less decode time, so switching is advantageous in the space-speed lagrange cost.

Crucial to Hydra is having a decoder speed fit for every compressor that can simulate decoding a block and count cycles needed to
decode on an imaginary machine. You need a model because you don't want to actually measure the time by running the decoder on
the current machine - it would take lots of runs to get reliable timing, and it would mean that you are optimizing for the exact
machine that you are encoding on. I currently use a single virtual machine that is a blend of various real platforms; in the future
I might expose the ability to use virtual machines that simulate specific target machines (because Hydra might make decisions
differently if it knows it is targeting PC-x64 vs. Jaguar-x64 vs. Aarch64-on-A57 , etc.).

Hydra is exciting to me as a general framework for the future of Oodle. It provides a way to add in new compression modes and
be sure that they are never worse. That is, you always can start with Kraken per block, and then new modes could be picked
block by block only when they are known to beat Kraken (in a space-speed sense). It lets you mix in compressors that you specifically
don't expect to be good in general on all data, but that might be amazing once in a while on certain data.

(Hydra requires compressors that carry no state across blocks, so you can't naively mix in something like PPM or CM/PAQ. To optimize
a switching choice with compressors that carry state requires a trellis-quantization-like lattice dynamic programming optimization
and is rather more complex to do quickly)