Criterion’s Confidence Intervals: The Bootstrap

I’ve been doing some benchmarking with Criterion recently, and wanted to write up the results. However, I didn’t really understand how it was arriving at some of the statistics that it spits out, particularly the confidence intervals. One book on the subject later, and I think I’ve got it. I thought I’d write it down in case anyone else needed to know. (I’m not a statistics expert though, so please do send corrections if necessary.)

The mean result from Criterion is straightforward; the benchmark is run (typically 100 times) and the mean is calculated in the standard way (total time divided by count). But Criterion also gives 95% confidence intervals for the mean, and these require some more detailed explanation.

Confidence Intervals

Confidence intervals are a notoriously tricky part of statistics. A confidence interval of 95% means that if you run many separate experiments, and calculate a confidence interval for each of them, 95% of those intervals will contain the true mean. The kicker being that you have no way of knowing whether the interval you have for your one experiment is one of the 95% that contain the mean, or one of the 5% that don’t.
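To make that definition concrete, here is a small simulation (Python rather than Criterion’s Haskell, with arbitrary illustrative parameters) that runs many experiments, builds a textbook normal-approximation interval for each, and counts how many intervals contain the true mean:

```python
import random
import statistics

random.seed(42)

TRUE_MEAN, TRUE_SD = 10.0, 2.0   # arbitrary illustrative values
N, EXPERIMENTS = 30, 1000

covered = 0
for _ in range(EXPERIMENTS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    # Textbook normal-approximation 95% interval for the mean.
    half = 1.96 * statistics.stdev(sample) / N ** 0.5
    if m - half <= TRUE_MEAN <= m + half:
        covered += 1

print(f"{covered / EXPERIMENTS:.0%} of intervals contained the true mean")
```

With these settings the count comes out close to 95%, but any individual interval either contains 10.0 or it doesn’t, and you can’t tell which from that one interval alone.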

One way to calculate a confidence interval is to make an assumption about the distribution of the data. It is important to realise that this is an assumption. For example, you might be tempted to assume the times were normally distributed, but do you really believe that your times come from a perfectly symmetric bell-curve? Run times are often skewed, distributed more like an exponential, or simply too noisy to pick a distribution that could have generated them. (You can perform a statistical test for the validity of the normality assumption in your sample, if you’re interested.) If you did pick a distribution and calculate its parameters (e.g. the variance of a normal distribution), you could say where 95% of the values are likely to fall, because you know what shape the distribution is. But if you don’t know the shape of the real distribution (from which your noisy sample was drawn), you need to use a different technique; Criterion uses the bootstrap.

The Bootstrap

Imagine for a moment that rather than just one mean, you instead had 1000 means. A straightforward way to form a 95% confidence interval would be to arrange those means in order, and pick the 2.5% mark (roughly the 25th smallest mean) and the 97.5% mark (roughly the 975th smallest). This method makes no new assumptions, and would derive the interval just from the data. Of course, you don’t have the time to wait around and record enough samples to form 1000 means. (Some of my benchmarks took on the order of 20 minutes per benchmark to collect the 100 values for a single mean; for 1000 means I would have to wait a long time!)
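Here is that ordering trick as a Python sketch, pretending we really could afford 1000 experiments (an exponential distribution stands in for noisy run times; all the numbers are made up for illustration):

```python
import random
import statistics

random.seed(0)

# Pretend we could afford 1000 independent experiments of 100 runs each,
# with run times drawn from an exponential distribution with mean 50.
means = sorted(
    statistics.mean(random.expovariate(1 / 50.0) for _ in range(100))
    for _ in range(1000)
)

# The 2.5% and 97.5% marks: the 25th and 975th smallest means.
lo, hi = means[24], means[974]
print(f"95% interval for the mean: ({lo:.1f}, {hi:.1f})")
```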

The bootstrap technique generates a lot of means, by “faking” a lot more samples than you actually recorded and taking the mean of each of them. It takes your single sample of 100, and generates many more samples (in Criterion, 100000 by default) by randomly picking 100 values out. Obviously, if you picked 100 values from a set of 100 without replacement you’d always end up with the starting set; the bootstrap picks with replacement, so it is very likely to feature repeated original values in the new fake sample, and also to omit some of the original values in each new fake. The confidence interval can be formed from these fake samples as described in the previous paragraph.
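The resampling step can be sketched in a few lines of Python (`random.choices` draws with replacement; the sample itself is simulated here):

```python
import random
import statistics

random.seed(1)

# One real sample of 100 simulated run times.
sample = [random.expovariate(1 / 50.0) for _ in range(100)]

# One bootstrap resample: 100 draws *with replacement* from the sample.
fake = random.choices(sample, k=100)

# Because we drew with replacement, the fake sample repeats some
# original values and omits others.
print("repeats present:", len(set(fake)) < len(fake))
print("original values omitted:", len(set(sample) - set(fake)))
```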

That is the intuition behind the bootstrap method of calculating confidence intervals. In fact, Criterion uses the bootstrap with bias correction and acceleration (the so-called BCa bootstrap). These two extra measures affect how the confidence intervals are calculated by attempting to adjust for skew and sensitivity in the data. Unfortunately the reference books I have available are not very illuminating on the intuition behind bias and acceleration, so you and I will have to remain in the dark, pending a better reference.

The Code

Often, programmers learn a system best by tinkering and seeing what happens. In that spirit, here is a little code that takes a sample and prints out the bootstrapped mean confidence intervals for it (as Criterion would derive them):
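The original listing appears to be missing from this copy of the post. As a stand-in, here is a minimal Python sketch of the same computation (Criterion itself is Haskell and uses the BCa variant; this uses the plain percentile method described above, and `bootstrap_ci` is an illustrative name, not Criterion’s API):

```python
import random
import statistics

def bootstrap_ci(sample, resamples=100_000, confidence=0.95):
    """Percentile-bootstrap confidence interval for the mean.

    Criterion uses the BCa refinement; this is the plain percentile
    method for illustration.
    """
    n = len(sample)
    means = sorted(
        statistics.mean(random.choices(sample, k=n))
        for _ in range(resamples)
    )
    alpha = (1 - confidence) / 2
    return means[int(alpha * resamples)], means[int((1 - alpha) * resamples) - 1]

random.seed(2)
times = [random.gauss(100.0, 15.0) for _ in range(100)]  # fake benchmark times
lo, hi = bootstrap_ci(times, resamples=10_000)
print(f"mean {statistics.mean(times):.1f}, 95% CI ({lo:.1f}, {hi:.1f})")
```

Feeding it your own 100 measurements in place of the simulated `times` gives an interval of the same flavour as Criterion’s output, minus the bias correction and acceleration.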


Comments

Nice writeup, Neil. By the way, the bootstrap is not infallible: it’s sensitive to an autocorrelated sample. I haven’t written code to spot autocorrelated timing measurements and set off the alarm bells yet, although it’s not hard. (It doesn’t appear to be a common problem in practice.)
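The check Bryan describes can be sketched as a lag-1 autocorrelation coefficient on the sequence of measurements (Python for illustration; the data and any alarm threshold are hypothetical):

```python
import random
import statistics

def lag1_autocorrelation(xs):
    """Lag-1 sample autocorrelation: near 0 for independent measurements,
    large when each timing depends strongly on the previous one."""
    m = statistics.mean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((x - m) ** 2 for x in xs)
    return num / den

random.seed(3)
independent = [random.gauss(100.0, 5.0) for _ in range(200)]
drifting = [100.0 + 0.5 * i for i in range(200)]  # e.g. a thermal-throttling ramp

print(f"independent: {lag1_autocorrelation(independent):+.2f}")
print(f"drifting:    {lag1_autocorrelation(drifting):+.2f}")
```

A value close to 1 on your timing series would be a reason to distrust the bootstrap interval.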

An Introduction to the Bootstrap, by Efron and Tibshirani. Google books have it (http://is.gd/aWRUe) if you want to browse. It’s quite comprehensive and fairly readable for a statistics book, but I’m not sure I’d go so far as to recommend a purchase.

You have some underlying process that generates a distribution for an associated random variable. In this case I’m guessing the process would be the program and the associated random variable T would be the runtime. The distribution F of T is then the distribution of the runtimes.

Given the distribution F, there are various functionals, that is, functions from the space of distributions to the real numbers, that you are likely interested in. Classic examples would be the expected value, the variance, the quantiles, etc. Whatever functional you are interested in, let’s call it t(F).

Now, you can’t calculate t(F) because you don’t have F. But you can generate an approximation t~(T_) based on independent, identically distributed samples of the random variable T_ = T1, T2, …, TN. While t(F) is an exact quantity, t~(T_) is not. Classic examples would be the sample average approximating the expectation, the sample variance approximating the variance, etc.

The important thing is that t~(T_) is a function of the independent, identically distributed samples T_. A new set will give it a new value. This makes it a random variable, so it makes sense to talk about things like its bias: its expected difference from the true value, E[t~(T_)] – t(F). Of course we can’t calculate this either, as it requires knowing t(F), and, if we did, we wouldn’t be using t~(T_).

This is where the bootstrap idea comes in. T1, T2, …, TN form an empirical distribution F~ that we know exactly. We can, therefore, exactly calculate our functional t(F~) for it. We can also draw independent, identically distributed samples T_~ = T1~, T2~, …, TN~ from F~ and apply our approximation to get an estimate t~(T_~) of t(F~). Now we have two values we can compare to give us an idea of how well our approximation actually works.

Indeed, if we are willing to assume that the bias E[t~(T_~)] – t(F~) observed in our approximation applied to the empirical distribution F~ is not that much different from whatever the bias E[t~(T_)] – t(F) is in applying the approximation to the actual distribution F, we have an estimate for the latter.

This is the bootstrap bias correction.
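The recipe above can be sketched in Python, using the plug-in variance (a classically biased estimator) as the functional t; the data and sample size are arbitrary:

```python
import random
import statistics

random.seed(4)

def t(sample):
    """The functional: the variance of a distribution, computed by the
    plug-in formula (divide by n). Applied to the empirical distribution
    F~ this is exact; used as an estimator from a finite sample it is biased."""
    m = statistics.mean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

data = [random.gauss(0.0, 10.0) for _ in range(30)]
t_empirical = t(data)  # t(F~): exact, because we know F~ completely

# Estimate E[t~(T_~)] by resampling from F~ and averaging.
resampled = [t(random.choices(data, k=len(data))) for _ in range(10_000)]
bias_estimate = statistics.mean(resampled) - t_empirical

print(f"t(F~) = {t_empirical:.1f}, estimated bias = {bias_estimate:.1f}")
# The plug-in variance underestimates the true variance, so the
# estimated bias should come out negative.
```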

What assumption are we really making? Just that the difference between our functional and our approximation isn’t that sensitive to the actual distribution. This makes it valid to approximate it using the incorrect, but close (assuming enough samples were taken), empirical distribution.

It is not entirely unlike assuming that the derivative (the bias) of a function at some point a (the exact distribution) will be relatively close to that at some other point b (the approximate distribution) given that a and b are relatively close.

Hope that clarifies things a bit. Always good to have a bit of a feel for the underlying assumptions so as to avoid a Wall Street-style blow-up. : )