Wednesday, April 12, 2017

Retiring Octane

The genesis of Octane

The history of JavaScript benchmarks is a story of constant evolution. As the web expanded from simple documents to dynamic client-side applications, new JavaScript benchmarks were created to measure workloads that became important for new use cases. This constant change has given individual benchmarks finite lifespans. As web browser and virtual machine (VM) implementations begin to over-optimize for specific test cases, benchmarks themselves cease to become effective proxies for their original use cases. One of the first JavaScript benchmarks, SunSpider, provided early incentives for shipping fast optimizing compilers. However, as VM engineers uncovered the limitations of microbenchmarks and found new ways to optimizearound SunSpider’s limitations, the browser community retired SunSpider as a recommended benchmark.

Designed to mitigate some of the weaknesses of early microbenchmarks, the Octane benchmark suite was first released in 2012. It evolved from an earlier set of simple V8 test cases and became a common benchmark for general web performance. Octane consists of 17 different tests, which were designed to cover a variety of different workloads, ranging from Martin Richards’ kernel simulation test to a version of Microsoft’s TypeScript compiler compiling itself. The contents of Octane represented the prevailing wisdom around measuring JavaScript performance at the time of its creation.

Diminishing returns and over-optimization

In the first few years after its release, Octane provided a unique value to the JavaScript VM ecosystem. It allowed engines, including V8, to optimize their performance for a class of applications that stressed peak performance. These CPU-intensive workloads were initially underserviced by VM implementations. Octane helped engine developers deliver optimizations that allowed computationally-heavy applications to reach speeds that made JavaScript a viable alternative to C++ or Java. In addition, Octane drove improvements in garbage collection which helped web browsers avoid long or unpredictable pauses.

By 2015, however, most JavaScript implementations had implemented the compiler optimizations needed to achieve high scores on Octane. Striving for even higher benchmark scores on Octane translated into increasingly-marginal improvements in the performance of real web pages. Investigations into the execution profile of running Octane versus loading common websites (such as Facebook, Twitter, or Wikipedia) revealed that the benchmark doesn’t exercise V8’s parser or the browser loading stack the way real-world code does. Moreover, the style of Octane’s JavaScript doesn’t match the idioms and patterns employed by most modern frameworks and libraries (not to mention transpiled code or newer ES2015+ language features). This means that using Octane to measure V8 performance didn’t capture important use cases for the modern web, such as loading frameworks quickly, supporting large applications with new patterns of state management, or ensuring that ES2015+ features are as fast as their ES5 equivalents.

In addition, we began to notice that JavaScript optimizations which eked out higher Octane scores often had a detrimental effect on real-world scenarios. Octane encourages aggressive inlining to minimize the overhead of function calls, but inlining strategies that are tailored to Octane have led to regressions from increased compilation costs and higher memory usage in real-world use cases. Even when an optimization may be genuinely useful in the real-world, as is the case with dynamic pretenuring, chasing higher Octane scores can result in developing overly-specific heuristics which have little effect or even degrade performance in more generic cases. We found that Octane-derived pretenuring heuristics led to performance degradations in modern frameworks such as Ember. The `instanceof` operator was another example of an optimization tailored to a narrow set of Octane-specific cases that led to significant regressions in Node.js applications.

Another problem is that over time, small bugs in Octane become a target for optimizations themselves. For example, in the Box2DWeb benchmark, taking advantage of a bug where two objects were compared using the `<` and `>=` operators gave a ~15% performance boost on Octane. Unfortunately, this optimization had no effect in the real world and complicates more general types of comparison optimizations. Octane sometimes even negatively penalizes real-world optimizations: engineers working on other VMs have noticed that Octane seems to penalize lazy parsing, a technique that helps most real websites load faster given the amount of dead code frequently found in the wild.

Beyond Octane and other synthetic benchmarks

These examples are just some of the many optimizations which increased Octane scores to the detriment of running real websites. Unfortunately, similar issues exist in other static or synthetic benchmarks, including Kraken and JetStream. Simply put, such benchmarks are insufficient methods of measuring real-world speed and create incentives for VM engineers to over-optimize narrow use cases and under-optimize generic cases, slowing down JavaScript code in the wild.

Given the plateau in scores across most JS VMs and the increasing conflict between optimizing for specific Octane benchmarks rather than implementing speedups for a broader range of real-world code, we believe that it is time to retire Octane as a recommended benchmark.

Octane enabled the JS ecosystem to make large gains in computationally-expensive JavaScript. The next frontier, however, is improving the performance of real web pages, modern libraries, frameworks, ES2015+ language features, new patterns of state management, immutable object allocation, and modulebundling. Since V8 runs in many environments, including server side in Node.js, we are also investing time in understanding real-world Node applications and measuring server-side JavaScript performance through workloads such as AcmeAir.

4 comments:

Hello,Please keep the benchmark going, here is why. We work for a company with 200+ employees, We use the benchmark to test different computers and see how their hardware compares. Latest Firefox/Chrome score is then used to triggers "that computer needs to be replaced". We found that any PC with score <10,000 needs replacement now. Score <14000 is to be replaced soon. This is the only benchmark that is consistent where you can quickly determine if hardware needs updating.(I wish I could run it across the whole domain, and get a dashboard indicating which computers are under-performing...but removing it is not what we had in mind) We would love for you to continue to offer this score. Please reconsider.Thank you!

Octane/Chromium javascript engine devs : Octane only tests some specialized parts of browsers' code and has little link with real life browsing. So we're (finally) deprecating its use for anything relative to measuring effectiveness (aka real perfs).

IT manager: Please keep it up running! We use it to benchmark the real life capabilities of the computers in our company.