
05-17-16 | The Weissman Score

That's just wrong. You don't take a logarithm of something with units.
But there are aspects of it that are correct. W should be proportional to r (compression ratio), and a
logarithm of time should be involved. Just not like that.

with disk_speed_lo = 1 MB/s, which is neater, though it favors fast compressors more than you might like.
While it's a cleaner formula, I think it's less useful for practical purposes, where the bounded hi range focuses
the score on the area that most people actually care about.

I came up with this formula because I started thinking about how to summarize a score from the
Pareto charts I've made. What if you took the speedup value at several (log-scale) disk speeds,
say the speedup at 1 MB/s, 2 MB/s, and 4 MB/s, and just averaged them?
(Speedup is a good way to measure a compressor even if you don't actually care about speed.)
Well, rather than averaging a few sample points, why not average *all* points? That is, integrate to
get the area under the curve. Note that we're integrating in log-scale of disk speed.
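A minimal sketch of that integral, in my own notation (not from the post): `ratio` is the compression ratio, `decomp_speed` is decode speed in the same units as disk speed (e.g. MB/s), and the average is taken over log disk speed with a simple midpoint rule.

```python
import math

def speedup(disk_speed, ratio, decomp_speed):
    # Per raw byte: time to load uncompressed, vs.
    # time to load the compressed data plus time to decompress it.
    raw_time = 1.0 / disk_speed
    comp_time = 1.0 / (ratio * disk_speed) + 1.0 / decomp_speed
    return raw_time / comp_time

def weissman(ratio, decomp_speed, lo, hi, steps=1000):
    # Average speedup over disk speeds in [lo, hi], integrated in
    # log-scale of disk speed (midpoint rule on a log-spaced grid).
    log_lo, log_hi = math.log(lo), math.log(hi)
    total = 0.0
    for i in range(steps):
        s = math.exp(log_lo + (i + 0.5) * (log_hi - log_lo) / steps)
        total += speedup(s, ratio, decomp_speed)
    return total / steps
```

For example, `weissman(ratio=2.5, decomp_speed=200.0, lo=1.0, hi=256.0)` scores a hypothetical codec with a 2.5:1 ratio decoding at 200 MB/s over the 1-256 MB/s range. As disk speed goes to zero the speedup approaches the compression ratio, so the score is bounded above by it.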

ADD : this post started out as a not-sure-if-joking. But I actually think it's useful; I find it useful, anyway.

When you're trying to tweak out some space-speed tradeoff decisions, you get different sizes and speeds, and it can
be hard to tell whether a given tradeoff was good. You can plot all your options on a space-speed graph and try to
eyeball the Pareto frontier and take those points. But when iterating to optimize a parameter, you want a single simple score.

This corrected Weissman score is a nice way to do that. You have to choose what domain you're optimizing for:
size-dominant slower compressors should use Weissman 1-256; for a balance of space and super speed, use Weissman 1-inf (or 40-800);
for the fast domain (LZ4-ish), use a range like 100-inf. Then you can just iterate to maximize that number!

For whatever space-speed tradeoff domain you're interested in, there exists a Weissman score range (lo-hi disk speed parameters)
such that maximizing the Weissman score in that range gives you the best space-speed tradeoff in the domain you wanted.
The trick is choosing that lo-hi range. It doesn't necessarily correspond directly to actual disk or
channel speeds; there are other factors to consider, like latency and storage usage, that might cause you to bias
the lo-hi away from the actual channel speeds. For example, high speed decoders should always set the upper
speed to infinity, which corresponds to the use case where the compressed data might already be resident in RAM, so it
takes zero time to load.
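To make the range-dependence concrete, here's a toy comparison (all numbers invented for illustration) of two hypothetical codecs: one with a higher ratio but slow decode, one with a lower ratio but fast decode. Which one scores higher flips depending on the lo-hi range you pick:

```python
import math

def weissman(ratio, decomp_speed, lo, hi, steps=2000):
    # Average speedup over [lo, hi] disk speeds, integrated in log-scale.
    log_lo, log_hi = math.log(lo), math.log(hi)
    total = 0.0
    for i in range(steps):
        s = math.exp(log_lo + (i + 0.5) * (log_hi - log_lo) / steps)
        total += (1.0 / s) / (1.0 / (ratio * s) + 1.0 / decomp_speed)
    return total / steps

# Hypothetical codecs as (ratio, decode MB/s) -- invented numbers:
big  = (3.0, 100.0)    # compresses more, decodes slowly
fast = (2.0, 1000.0)   # compresses less, decodes fast

# On a slow-channel range the high-ratio codec wins;
# on a fast-channel range the fast decoder wins.
print(weissman(*big, 1.0, 16.0), weissman(*fast, 1.0, 16.0))
print(weissman(*big, 100.0, 1600.0), weissman(*fast, 100.0, 1600.0))
```

Iterating a compressor parameter to maximize the score in your chosen range then optimizes exactly the tradeoff that range encodes.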