6/06/2018

The Perils of Holistic Profiling

I have found that many evaluators of Oodle now try to time their entire process. That is, they profile the compressor by
measuring its effect on total load time, or on whole game frame time (as opposed to profiling just the
compression operation, perhaps in situ or perhaps in a test bench).

I believe what's happened is that many people have read about the dangers of artificial benchmarks. (For example, there are some famous
papers on the perils of profiling malloc with synthetic workloads, or on how profiling threading primitives in isolation is pretty useless.)

While those warnings do raise important issues, the right response is not to switch to timing whole operations.

For example, while timing mallocs with bad synthetic workloads is not useful (and perhaps even harmful), timing an entire application run
to decide whether one malloc is better than another can be just as misleading.

Basically I think the wrong lesson has been learned and people are oversimplifying. They have taken one bad practice (timing operations
by running them in a synthetic test bench over and over) and replaced it with another bad practice (timing the whole application).

The reality of profiling is far more complex and difficult. There is no one right answer. There is not a simple prescription of how
to do it. Like any scientific measurement of a complex dynamic system, it requires care and study. It requires looking at the specific
situation and coming up with the right measurement process. It requires secondary measurements to validate your primary measurements,
to make sure you are testing what you think you are.

Now, one of the appealing things about whole-process timing is that in one very specific case, it is the right thing to do.

IF the thing you care about is whole-process time, and the process is always run the same way, and you do the timing on the system the
process will actually run on, in the same application state and environment, AND, crucially, you are only allowed to make one change to the
process - then whole-process timing is right.

Let's first talk about the last issue, which is the "single change" problem.

Quite often a good change can appear to do nothing (or even be negative) for whole process time on its own. By looking at just
the whole process time to evaluate the change, you miss a very positive step. Only if another step is taken will the value of that
first step be shown.

A common case of this is if your process has other limiting factors that need to be fixed.

For example on the macroscopic level, if your game is totally GPU bound, then anything you do to CPU time will not show up at all
if you are only measuring whole frame time. So you might profile a CPU optimization and see no benefit to frame time. You can miss
big improvements this way, because they will only show up if you also fix what's causing the process to be GPU bound.
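A toy sketch of this masking effect (all numbers are invented for illustration; it assumes CPU and GPU work fully overlap, so frame time is set by the slower of the two):

```python
# Toy model of a GPU-bound frame : CPU and GPU work overlap,
# so the frame takes as long as the slower unit.
# All millisecond values here are made up for illustration.

def frame_time_ms(cpu_ms, gpu_ms):
    # With full overlap, frame time is limited by the slower side.
    return max(cpu_ms, gpu_ms)

before = frame_time_ms(cpu_ms=8.0, gpu_ms=16.0)   # GPU bound
after  = frame_time_ms(cpu_ms=4.0, gpu_ms=16.0)   # CPU work halved

# The 2x CPU optimization is invisible in whole-frame time :
assert before == after == 16.0

# Only once the GPU bottleneck is also fixed does it show up :
assert frame_time_ms(4.0, 6.0) < frame_time_ms(8.0, 6.0)
```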

Similarly at a more microscopic level, it's common to have a major limiting factor in a sequence of code. For example you might have a
memory read that typically misses cache, or an unpredictable branch. Any improvements you make to the arithmetic instructions in that
area may be invisible, because the processor winds up stalling on a very slow cache line fill from memory. If you are timing your
optimization work "in situ" to be "realistic" you can completely miss good changes because they are hidden by other bad code.
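The same structure applies at the instruction level. A minimal sketch with invented cycle counts, assuming an out-of-order core where independent ALU work executes underneath the memory stall:

```python
# Toy model : on an out-of-order core, independent ALU work can
# execute underneath a cache-miss stall, so the measured time of
# the region is roughly max(stall, alu_work), not their sum.
# All cycle counts here are invented for illustration.

def region_cycles(miss_stall, alu_work):
    return max(miss_stall, alu_work)

# Halving the arithmetic is invisible while the miss dominates :
assert region_cycles(miss_stall=300, alu_work=120) == region_cycles(300, 60)

# Fix the stall (better data layout, prefetch) and the very same
# ALU optimization now shows up in the timing :
assert region_cycles(20, 120) > region_cycles(20, 60)
```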

Another common example : maybe you convert some scalar code to SIMD. You think it should be faster, but you time it in app
and it doesn't seem to be. Maybe you're bound elsewhere. Maybe you're paying added latency round-tripping from
scalar to SIMD and back to scalar. Maybe your data needs to be reformatted to be stored in SIMD-friendly ways. Maybe the surrounding
code also needs to be converted to SIMD so that the pieces hand off more smoothly. There may in fact be a big win there that you
aren't seeing.

This is a general problem : greedy optimization, looking at steps one by one, can be very misleading when measuring
whole-process time. Sometimes individual steps are better evaluated by measuring just those steps in isolation, because
whole-process time obscures them. Sometimes you have to take a step that you believe to be good even if it doesn't show up
in measurements, and see whether further steps yield a non-greedy multi-step improvement.

Particular perils of IO timing

A very common problem I see is trying to measure data-loading performance, including IO time, which is
fraught with pitfalls.

If you're doing repeated timings, then you'll be loading data that is already in the system disk cache, so your IO speed may
just look like RAM speed. Is what matters to you cold-cache timing (the user's first run), or hot-cache timing? Or both?

Obviously there is a wide range of disk speeds, from very slow hard disks (as on consoles) in the 20 MB/s range up to SSDs and NVMe drives
in the GB/s range. Which are you timing on? Which will your user have? Whether you have slow seeks or not can be a huge factor.

Timing on consoles with disk simulators (or worse : host FS) is particularly problematic and may not reflect real world performance at all.

The previously mentioned issue of high-latency stalls hiding good changes is very common here. For example, doing lots of small IO calls
creates long idle times that can mask other improvements.
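A back-of-the-envelope model of that effect (the latency and bandwidth numbers are invented; it assumes each IO call pays a fixed per-call latency on top of the transfer):

```python
# Toy model : per-call latency makes many small IO requests far
# slower than one large request for the same total bytes.
# Latency and bandwidth numbers are invented for illustration.

def io_time_s(total_mb, num_calls, latency_s, bandwidth_mb_per_s):
    # Fixed cost per call, plus the raw transfer time.
    return num_calls * latency_s + total_mb / bandwidth_mb_per_s

# 100 MB at 200 MB/s with 5 ms of latency per call :
one_big   = io_time_s(100, 1,     0.005, 200)   # ~0.505 s
many_tiny = io_time_s(100, 10000, 0.005, 200)   # ~50.5 s

# The idle time from small calls dwarfs everything else, so any
# other optimization in the loader is invisible until it's fixed :
assert many_tiny > 50 * one_big
```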

Are you timing on a disk that's fragmented, or nearly full? Has your SSD been through lots of write cycles already, or does it need
rebalancing? Are you timing while other processes are also hitting the disk?

Basically it's almost impossible to accurately recreate the environment the user will experience. And the variation is not small;
it can be absolutely massive. A 1-byte read could take anything from 1 nanosecond (eg. data already in disk cache) to 100 milliseconds
(slow HD seek + other processes hitting the disk).

Because of the uncertainty of IO timing, I just don't do it.
I use a simulated "disk speed" and just set :

disk time = data size / simulated disk speed

Then the question is, well if it's so uncertain, what simulated disk speed do you use? The answer is : all of them. You cannot
say what disk speed the user will experience, there's a huge range, so you need to look at performance over a spectrum of disk speeds.

I do this by making a plot of what the total time for (load + decomp) is over a range of simulated disk speeds. Then I can examine
how the performance is affected over a range of possible client systems, without trying to guess the exact disk speed of the client
runtime environment.
For more on this, see :
Oodle LZ Pareto Frontier or
Oodle Kraken Pareto Frontier .
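The sweep described above can be sketched as follows. The two compressors "A" and "B" are hypothetical, their sizes and decode speeds are invented, and load and decompress are assumed not to overlap:

```python
# Sweep simulated disk speeds and compare total (load + decomp)
# time for two hypothetical compressors. All numbers are invented.
# Assumes load and decompress run sequentially (no overlap).

def total_time_s(compressed_mb, decomp_mb_per_s, raw_mb, disk_mb_per_s):
    disk_time   = compressed_mb / disk_mb_per_s   # disk time = data size / simulated disk speed
    decomp_time = raw_mb / decomp_mb_per_s        # decode cost, measured on raw size
    return disk_time + decomp_time

RAW_MB = 100.0
# Hypothetical : "A" compresses better but decodes slower than "B".
A = dict(compressed_mb=30.0, decomp_mb_per_s=100.0)
B = dict(compressed_mb=55.0, decomp_mb_per_s=1000.0)

for disk in (20, 50, 100, 500, 2000):             # MB/s : slow HD .. NVMe
    ta = total_time_s(A["compressed_mb"], A["decomp_mb_per_s"], RAW_MB, disk)
    tb = total_time_s(B["compressed_mb"], B["decomp_mb_per_s"], RAW_MB, disk)
    print(f"{disk:5d} MB/s : A = {ta:6.3f} s , B = {tb:6.3f} s")

# On a slow disk the stronger compressor wins; on a fast disk the
# faster decoder wins - which is why one disk speed is not enough :
assert total_time_s(30, 100, 100, 20)   < total_time_s(55, 1000, 100, 20)
assert total_time_s(30, 100, 100, 2000) > total_time_s(55, 1000, 100, 2000)
```

Plotting those totals against disk speed gives exactly the kind of curve the Pareto frontier posts show: each compressor wins over some range of client disk speeds, and the crossover points are what matter.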