Lies, damned lies, and benchmarks: is IE9 cheating at SunSpider?

The SunSpider JavaScript performance benchmark, devised by the developers of the WebKit browser engine, is used and quoted widely as a measure of browser scripting performance. Mozilla developer Rob Sayre recently noticed a surprising result while looking at Internet Explorer 9's performance in this test: on one of the many subtests that SunSpider performs, Internet Explorer 9 was finishing almost instantly.

In and of itself, that's not necessarily very interesting; several of the subtests in SunSpider are near-instant in the browser. However, it piqued the developer's curiosity. He made some minor changes to the test—changes that don't alter the result of the calculation the test performs and that, naively at least, should be treated as equivalent—and saw Internet Explorer 9 slow down considerably. He filed a bug against Internet Explorer.

Sayre's bug report was conservative—he suggested that an optimization that Internet Explorer 9's Chakra JavaScript engine was performing was fragile, and was easily disabled by minor alterations to the code that it should disregard. Coverage earlier today of the same issue was less guarded: Internet Explorer 9 was accused of cheating in the test. The allegation is that Microsoft has built a specific optimization into Chakra that detects, and bypasses, the specific code in SunSpider, but which has no other purpose. In other words, the optimization will not do anything to improve the browser's performance in any other scenario.

Historical precedent

Such a move would not be completely unprecedented. The SPEC organization produces benchmarks for processors, mail servers, JVMs, and a range of other tasks. Its CPU benchmarks are commonly used for evaluating processor performance across a wide range of mostly real-world tasks; it is, or at least was, an important suite of tests. An old version of the benchmark, SPEC2000, included a test of floating point performance called 179.art.

179.art was a program to test image recognition using a neural network. The benchmark code is representative of real-world code; it has not been engineered for maximum performance, and in fact does a number of things that hurt its performance in various ways. A programmer wanting to make 179.art go faster would have a range of reasonably simple changes he could make to yield a healthy performance improvement. But the SPEC tests do not allow programmers to make changes; only optimizations performed by the compiler are permitted.

So what happened was that compiler vendors modified their compilers to specifically detect that they were compiling 179.art, and applied these specific changes. Sun was probably the first to do so, but eventually such optimizations became widespread, with performance gains of 30 times or more becoming common. The optimizations were in no sense general-purpose: they accelerated 179.art, but would not increase the performance of any other piece of code on the planet. Sometimes, they were not even safe: minor code variations that should have changed the result of the calculations would still be subjected to the 179.art-specific optimizations, resulting in broken programs.

Against this backdrop, the suspicion of Internet Explorer is at least somewhat understandable. High-profile benchmarks carry with them a lot of bragging rights, and if Microsoft were to tweak Chakra to ensure it got good results, they certainly wouldn't be the first.

Dead code elimination

But that's probably not what's happened here.

The exact optimization in question here is one of a class called dead code elimination. A surprisingly common feature of many programs is that they contain pieces of code that are pointless—dead code. There are two main kinds of dead code. Sometimes, there is no possible pathway through the program that can result in a particular piece of code being executed. The code is said to be "unreachable." One common scenario that leads to unreachable code is when a programmer wants to temporarily skip part of a program; they will do something such as prepend the code with "if(false)" to allow it to be bypassed. Sometimes unreachable code is a bug; a programmer writes a piece of code expecting it to be executed, without noticing that the program quits a few lines above.
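Both flavors of unreachable code might look something like this in JavaScript (these are hypothetical fragments, not code from SunSpider or any real program):

```javascript
// Hypothetical examples; not taken from SunSpider or any real program.

// Deliberately unreachable: the programmer has temporarily disabled
// the block by guarding it with a condition that is always false.
function processOrder(order) {
  if (false) {
    order.total *= 2; // skipped for now; this line can never execute
  }
  return order.total;
}

// Accidentally unreachable: the function returns a few lines above,
// so the final statement can never run. This is most likely a bug.
function findFirstNegative(values) {
  for (const v of values) {
    if (v < 0) return v;
  }
  return null;
  console.log("scan complete"); // unreachable
}
```

In either case, a compiler can safely discard the unreachable statements, because no execution of the program can ever observe them.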

The other major kind of dead code is more general; it's code that can be reached and executed, but whose results are never used. This kind of dead code is a common pitfall in benchmark programs. Because benchmarks don't do any useful work, they have a common tendency to perform some slow, expensive task (the one whose performance they are attempting to measure) and then simply ignore the result. After all, a benchmark is generally concerned only with how long something takes. Since this is pointless, compilers are quite entitled to remove the slow-but-ignored calculations. It makes the program run faster, and since the results were ignored anyway, it doesn't change the output of the program.
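A minimal sketch of this pitfall, with made-up function names, might look like the following; the expensive call's result is thrown away, so a sufficiently aggressive compiler may remove it:

```javascript
// Hypothetical benchmark-style code; the names are made up.
function expensiveSum(n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += i;
  }
  return total;
}

function timeIt() {
  const start = Date.now();
  expensiveSum(1000000);     // result discarded: this call is dead code
  return Date.now() - start; // only the elapsed time is reported
}
```

Because nothing ever reads the value `expensiveSum` produces, a compiler is entitled to delete the call entirely, at which point the "benchmark" measures nothing at all.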

The subtest of SunSpider with the anomalous results is called "cordic." It tests a function that computes the sine and cosine of a number using a CORDIC algorithm. In many ways this is highly artificial: JavaScript contains built-in sine and cosine functionality, functionality that will be much faster than performing the computation in this way, so it is not something real programs would ever do. And true to many benchmarks, the test does not bother using the results that it has computed. This makes the entire test susceptible to dead code elimination.
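A rough sketch of the idea follows. This is not the actual SunSpider code, just an illustration of the same shape: a simplified CORDIC-style sine/cosine routine, called in a loop with its results discarded:

```javascript
// Illustration only: a simplified CORDIC-style sine/cosine, not the
// actual SunSpider test code. CORDIC rotates a vector through
// successively smaller angles, using adds, subtracts, and halving.
function cordicSinCos(angle) {
  const ITERATIONS = 24;
  // Start with x scaled by 1/K, the inverse of the cumulative CORDIC
  // gain, so no correction is needed at the end.
  let x = 0.6072529350088813;
  let y = 0;
  let z = angle; // residual angle still to rotate through
  let pow = 1;   // 2^-i for the current iteration
  for (let i = 0; i < ITERATIONS; i++) {
    const d = z >= 0 ? 1 : -1; // rotate toward the target angle
    const nx = x - d * y * pow;
    const ny = y + d * x * pow;
    z -= d * Math.atan(pow);
    x = nx;
    y = ny;
    pow /= 2;
  }
  return { cos: x, sin: y }; // accurate for |angle| <= pi/2
}

// Like the benchmark, compute values in a loop and then ignore them.
// Every call here is dead code, so the whole loop can be eliminated.
for (let i = 0; i < 1000; i++) {
  cordicSinCos(i / 1000);
}
```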

And lo, this is what Internet Explorer 9 does. It—accurately—treats the entire test as dead code, and so removes the whole lot. This makes for a very fast benchmark result indeed.

So the optimization itself is legitimate and of a kind that is common to compilers. It's one of the best optimizations there is, in fact—the optimized code runs instantly, as it has been entirely removed, and you don't get much faster than instant.

But what about the fact that small and apparently irrelevant modifications prevented the optimization from kicking in? Compiler optimization is a tricky thing, and compiler authors tend to be very conservative about which optimizations they apply. A program that produces the wrong answer is far worse than a program that's a little bit slow. As a result, if a fragment of code does not exactly follow the pattern the compiler expects, the compiler won't apply the optimization.

The compilation process is normally a multistage affair. The compiler reads the program source code, checking that it follows the grammatical rules of the language, and builds a kind of in-memory representation of the program; along the way, it typically also verifies that the program "makes sense." This representation is then turned into actual executable code.

Optimizations can be performed both on the in-memory representation, and during the generation of executable code. Different kinds of optimization make sense at different stages. If the intermediate representation, or the final executable code, matches a pattern the optimizer is looking for, the optimization will be applied.

A fragile optimization

Small changes, even changes that should be innocuous, can alter the intermediate representations that the optimizer actually examines. The changes that Sayre made to the cordic test were small, and didn't fundamentally alter the structure of the test code. However, the impact that those changes may have had on the intermediate representations is hard to predict—we don't know exactly what Microsoft's compiler is doing, how it's representing the program internally, or what exact patterns it is looking for. It might well be that the pattern matching is just particularly fragile, and that small changes are throwing it off, and preventing the optimization.

Experimentation by readers of Hacker News paints a complex picture. Some modifications defeated the optimization, but others did not. Microsoft has described a few other code fragments that fit the pattern and trigger the optimization. Sayre has performed further analysis of the browser's behavior; functions that use a limited range of mathematical operations (including addition, subtraction, and incrementing) can be eliminated as dead code. Functions that use other mathematical operations, however—including multiplication and division—will not be.
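Based on that reported analysis (the exact patterns Chakra matches are not public, and these functions are made up for illustration), the distinction might look something like this:

```javascript
// Hypothetical functions; the exact patterns Chakra matches are not public.

// Uses only addition, subtraction, and incrementing; per Sayre's
// analysis, a call whose result is ignored can be eliminated.
function addLoop(n) {
  let a = 0;
  for (let i = 0; i < n; i++) {
    a += i - 1;
  }
  return a;
}

// Structurally similar, but multiplies; reportedly not eliminated.
function mulLoop(n) {
  let a = 1;
  for (let i = 1; i < n; i++) {
    a *= i;
  }
  return a;
}

addLoop(100000); // result ignored: a candidate for elimination
mulLoop(20);     // result ignored too, but reportedly left in place
```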

It happens—whether by coincidence or by design—that the mathematical operations used by the cordic test are all on the "permitted" list, and as such can be eliminated. However, other functions that use those same operations can be eliminated too. While it's possible, and, I think, likely, that this optimization was "inspired" by cordic, it has been written in such a way that it has applicability beyond that test. This really is an optimization that can apply to other functions, just as long as they meet certain criteria.

To my mind, that makes it a legitimate optimization. If it could only accelerate cordic, it would be illegitimate, as the optimization would have no "real-world" application. But examples have been constructed that also get optimized in the same way. It's not yet clear how much real-world code gets optimized like this, but it's certainly possible that some does, and that's good enough for me.

This is essentially the same standard that SPEC uses for determining the legitimacy of compiler optimizations: an optimization that can only apply to the test is forbidden. But an optimization that can apply to other programs too (even if only a limited number of other programs) is acceptable.