Improving DataView performance in V8

DataViews are one of the two possible ways to do low-level memory accesses in JavaScript, the other one being TypedArrays. Up until now, DataViews were much less optimized than TypedArrays in V8, resulting in lower performance on tasks such as graphics-intensive workloads or when decoding/encoding binary data. The reasons for this have been mostly historical choices, like the fact that asm.js chose TypedArrays instead of DataViews, and so engines were incentivized to focus on performance of TypedArrays.

Because of the performance penalty, JavaScript developers such as the Google Maps team decided to avoid DataViews and rely on TypedArrays instead, at the cost of increased code complexity. This article explains how we brought DataView performance to match — and even surpass — equivalent TypedArray code in V8 v6.9, effectively making DataView usable for performance-critical real-world applications.

Since the introduction of ES2015, JavaScript has supported reading and writing data in raw binary buffers called ArrayBuffers. ArrayBuffers cannot be directly accessed; rather, programs must use a so-called array buffer view object that can be either a DataView or a TypedArray.

TypedArrays allow programs to access the buffer as an array of uniformly typed values, such as an Int16Array or a Float32Array.

On the other hand, DataViews allow for more fine-grained data access. They let the programmer choose the type of values read from and written to the buffer by providing specialized getters and setters for each number type, making them useful for serializing data structures.

Until recently, the DataView methods used to be implemented as built-in C++ runtime functions in V8. This is very costly, because each call would require an expensive transition from JavaScript to C++ (and back).

In order to investigate the actual performance cost incurred by this implementation, we set up a performance benchmark that compares the native DataView getter implementation with a JavaScript wrapper simulating DataView behavior. This wrapper uses an Uint8Array to read data byte by byte from the underlying buffer, and then computes the return value from those bytes. Here is, for example, the function for reading little-endian 32-bit unsigned integer values:

Our first step in improving the performance of DataView objects was to move the implementation from the C++ runtime to CodeStubAssembler (also known as CSA). CSA is a portable assembly language that allows us to write code directly in TurboFan’s machine-level intermediate representation (IR), and we use it to implement optimized parts of V8’s JavaScript standard library. Rewriting code in CSA bypasses the call to C++ completely, and also generates efficient machine code by leveraging TurboFan’s backend.

However, writing CSA code by hand is cumbersome. Control flow in CSA is expressed much like in assembly, using explicit labels and gotos, which makes the code harder to read and understand at a glance.

In order to make it easier for developers to contribute to the optimized JavaScript standard library in V8, and to improve readability and maintainability, we started designing a new language called V8 Torque, that compiles down to CSA. The goal for Torque is to abstract away the low-level details that make CSA code harder to write and maintain, while retaining the same performance profile.

Rewriting the DataView code was an excellent opportunity to start using Torque for new code, and helped provide the Torque developers with a lot of feedback about the language. This is what the DataView’s getUint32() method looks like, written in Torque:

When JavaScript code gets hot, we compile it using our TurboFan optimizing compiler, in order to generate highly-optimized machine code that runs more efficiently than interpreted bytecode.

TurboFan works by translating the incoming JavaScript code into an internal graph representation (more precisely, a “sea of nodes”). It starts with high-level nodes that match the JavaScript operations and semantics, and gradually refines them into lower and lower level nodes, until it finally generates machine code.

In particular, a function call, such as calling one of the DataView methods, is internally represented as a JSCall node, which eventually boils down to an actual function call in the generated machine code.

However, TurboFan allows us to check whether the JSCall node is actually a call to a known function, for example one of the builtin functions, and inline this node in the IR. This means that the complicated JSCall gets replaced at compile-time by a subgraph that represents the function. This allows TurboFan to optimize the inside of the function in subsequent passes as part of a broader context, instead of on its own, and most importantly to get rid of the costly function call.

Initial TurboFan DataView performance

Implementing TurboFan inlining finally allowed us to match, and even exceed, the performance of our Uint8Array wrapper, and be 8 times as fast as the former C++ implementation.

Looking at the machine code generated by TurboFan after inlining the DataView methods, there was still room for some improvement. The first implementation of those methods tried to follow the standard pretty closely, and threw errors when the spec indicates so (for example, when trying to read or write out of the bounds of the underlying ArrayBuffer).

However, the code that we write in TurboFan is meant to be optimized to be as fast as possible for the common, hot cases — it doesn’t need to support every possible edge case. By removing all the intricate handling of those errors, and just deoptimizing back to the baseline Torque implementation when we need to throw, we were able to reduce the size of the generated code by around 35%, generating a quite noticeable speedup, as well as considerably simpler TurboFan code.

Following up on this idea of being as specialized as possible in TurboFan, we also removed support for indices or offsets that are too large (outside of Smi range) inside the TurboFan-optimized code. This allowed us to get rid of handling of the float64 arithmetic that is needed for offsets that do not fit into a 32-bit value, and to avoid storing large integers on the heap.

Compared to the initial TurboFan implementation, this more than doubled the DataView benchmark score. DataViews are now up to 3 times as fast as the Uint8Array wrapper, and around 16 times as fast as our original DataView implementation!

We’ve evaluated the performance impact of the new implementation on some real-world examples, on top of our own benchmark.

DataViews are often used when decoding data encoded in binary formats from JavaScript. One such binary format is FBX, a format that is used for exchanging 3D animations. We’ve instrumented the FBX loader of the popular three.js JavaScript 3D library, and measured a 10% (around 80 ms) reduction in its execution time.

We compared the overall performance of DataViews against TypedArrays. We found that our new DataView implementation provides almost the same performance as TypedArrays when accessing data aligned in the native endianness (little-endian on Intel processors), bridging much of the performance gap and making DataViews a practical choice in V8.