How to Describe Performance Comparision

I’m writing today about a pet peeve of mine: amateur benchmarking that reports its results badly. Let’s say I’m running benchmarks to compare three of my favorite programming languages: Unlambda, Whitespace, and Plankalkül. So I run some benchmarks, and come up with this table of results.

Benchmark

Unlambda

Whitespace

Plankalkül

Church Numeral Factorial

342 days

167 days

350 days

Tetration

270 days

410 days

450 days

Busy Beaver

1,344 days

996 days

700 days

Iterated Ackermann

20 days

43 days

50 days

Based on this information, I shall now make three arguments:

This data proves that Unlambda is way faster than Whitespace. Whitespace takes 0.49, 1.52, 0.74, and 2.15 times as long, respectively, for these problems. So on average, whitespace takes 1.22 times as long (22% longer) to run programs that Unlambda.

This data proves that Whitespace is way faster than Unlambda. Unlambda takes 2.05, 0.66, 1.35, and 0.47 times as long, respectively, for these problems. So on average, unlambda takes 1.13 times as long (13% longer) to run programs than Whitespace.

This data proves that Plankalkül is faster than either one! In total, running all these programs took 1,976 days in Unlambda, 1,616 days in Whitespace, and 1,550 days in Plankalkül.

So which argument is correct? None of them, of course. The last argument, regarding Plankalkül, is pretty obviously wrong. Had I chosen different sizes for the problem sets, I could have made any of these results change by any constant factor… and in doing so, I could make any one of these languages perform best (by increasing the problem set of the problem it wins at). So clearly, I need to compare ratios of time, not totals.

The first two don’t fare much better, though. The problem here is that taking a ratio is an inherently biased operation. I pick one result to serve as the base, and the other to scale according to that base. If I then take an arithmetic mean (what most people mean by “average”) the answer can depend on which benchmark was chosen as the base! There are two ways to fix this.

First, I can take a geometric mean. In other words, multiply all the numbers together, and then take the nth root of the result. This is good enough to make the results come out fairly. Unfortunately, the raw ratios are still misleading to human readers who aren’t paying really close attention. A language taking 10 times as long seems like a big deal! But a language taking 0.1 times as long doesn’t catch most people’s eyes as much. Or you might see it the other way around. In any case, there’s clearly an asymmetry there that is likely to give readers false ideas about the data.

The other method, which I prefer, is to take the log of the ratios. You can then take the ordinary arithmetic mean (mathematically, the arithmetic mean of logs is equal to the log of the geometric mean), and there isn’t any kind of asymmetry between results in each direction. Positive numbers mean one language did better, and negative numbers the other; and the magnitude fairly represents by how much.

You’re better off either way than by reporting in ways (like the original three) that are just plain wrong.