Monday, February 12, 2018

TL;DR: Lazy deserialization was recently enabled by default in V8 version 6.4, reducing V8’s memory consumption by over 500 KB per browser tab on average. Read on to find out more!

Introducing V8 snapshots

But first, let’s take a step back and have a look at how V8 uses heap snapshots to speed up creation of new Isolates (which roughly correspond to a browser tab in Chrome). My colleague Yang Guo gave a good introduction on that front in his article on custom startup snapshots:

The JavaScript specification includes a lot of built-in functionality, from math functions to a full-featured regular expression engine. Every newly-created V8 context has these functions available from the start. For this to work, the global object (for example, the window object in a browser) and all the built-in functionality must be set up and initialized into V8’s heap at the time the context is created. It takes quite some time to do this from scratch.

Fortunately, V8 uses a shortcut to speed things up: just like thawing a frozen pizza for a quick dinner, we deserialize a previously-prepared snapshot directly into the heap to get an initialized context. On a regular desktop computer, this can bring the time to create a context from 40 ms down to less than 2 ms. On an average mobile phone, this could mean a difference between 270 ms and 10 ms.

To recap: snapshots are critical for startup performance, and they are deserialized to create the initial state of V8’s heap for each Isolate. The size of the snapshot thus determines the minimum size of the V8 heap, and larger snapshots translate directly into higher memory consumption for each Isolate.

A snapshot contains everything needed to fully initialize a new Isolate, including language constants (e.g., the undefined value), internal bytecode handlers used by the interpreter, built-in objects (e.g., String), and the functions installed on built-in objects (e.g., String.prototype.replace) together with their executable Code objects.

Over the past two years, the snapshot has nearly tripled in size, going from roughly 600 KB in early 2016 to over 1500 KB today. The vast majority of this increase comes from serialized Code objects, which have both increased in count (e.g., through recent additions to the JavaScript language as the language specification evolves and grows); and in size (built-ins generated by the new CodeStubAssembler pipeline ship as native code vs. the more compact bytecode or minimized JS formats).

This is bad news, since we’d like to keep memory consumption as low as possible.

Lazy deserialization

One of the major pain points was that we used to copy the entire content of the snapshot into each Isolate. Doing so was especially wasteful for built-in functions, which were all loaded unconditionally but may never have ended up being used.

This is where lazy deserialization comes in. The concept is quite simple: what if we were to only deserialize built-in functions just before they were called?

A quick investigation of some of the most popular websites showed this approach to be quite attractive: on average, only 30% of all built-in functions were used, with some sites only using 16%. This looked remarkably promising, given that most of these sites are heavy JS users and these numbers can thus be seen as a (fuzzy) lower bound of potential memory savings for the web in general.

As we began working on this direction, it turned out that lazy deserialization integrated very well with V8’s architecture and there were only a few, mostly non-invasive design changes necessary to get up and running:

Well-known positions within the snapshot. Prior to lazy deserialization, the order of objects within the serialized snapshot was irrelevant since we’d only ever deserialize the entire heap at once. Lazy deserialization must be able to deserialize any given built-in function on its own, and therefore has to know where it is located within the snapshot.

Deserialization of single objects. V8’s snapshots were initially designed for full heap deserialization, and bolting on support for single-object deserialization required dealing with a few quirks such as non-contiguous snapshot layout (serialized data for one object could be interspersed with data for other objects) and so-called backreferences (which can directly reference objects previously deserialized within the current run).

The lazy deserialization mechanism itself. At runtime, the lazy deserialization handler must be able to a) determine which code object to deserialize, b) perform the actual deserialization, and c) attach the serialized code object to all relevant functions.

Our solution to the first two points was to add a new dedicated built-ins area to the snapshot, which may only contain serialized code objects. Serialization occurs in a well-defined order and the starting offset of each Code object is kept in a dedicated section within the built-ins snapshot area. Both back-references and interspersed object data are disallowed.

Lazy built-in deserialization is handled by the aptly named DeserializeLazy built-in, which is installed on all lazy built-in functions at deserialization time. When called at runtime, it deserializes the relevant Code object and finally installs it on both the JSFunction (representing the function object) and the SharedFunctionInfo (shared between functions created from the same function literal). Each built-in function is deserialized at most once.

In addition to built-in functions, we have also implemented lazy deserialization for bytecode handlers. Bytecode handlers are code objects that contain the logic to execute each bytecode within V8’s Ignition interpreter. Unlike built-ins, they neither have an attached JSFunction nor a SharedFunctionInfo. Instead, their code objects are stored directly in the dispatch table into which the interpreter indexes when dispatching to the next bytecode handler. Lazy deserialization is similar as to built-ins: the DeserializeLazy handler determines which handler to deserialize by inspecting the bytecode array, deserializes the code object, and finally stores the deserialized handler in the dispatch table. Again, each handler is deserialized at most once.

Results

We evaluated memory savings by loading the top 1000 most popular websites using Chrome 65 on an Android device, with and without lazy deserialization.

On average, V8’s heap size decreased by 540 KB, with 25% of the tested sites saving more than 620 KB, 50% saving more than 540 KB, and 75% saving more than 420 KB.

Runtime performance (measured on standard JS benchmarks such as Speedometer, as well as a wide selection of popular websites) has remained unaffected by lazy deserialization.

Next steps

Lazy deserialization ensures that each Isolate only loads the built-in code objects that are actually used. That is already a big win, but we believe it is possible to go one step further and reduce the (built-in-related) cost of each Isolate to effectively zero.

We hope to bring you updates on this front later this year. Stay tuned!

Thursday, February 1, 2018

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.5, which is in beta until its release in coordination with Chrome 65 Stable in several weeks. V8 v6.5 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

Untrusted code mode

In response to the latest speculative side-channel attack called Spectre, V8 introduced an untrusted code mode. If you embed V8, consider leveraging this mode in case your application processes user-generated, not-trustworthy code. Please note that the mode is enabled by default, including in Chrome.

Streaming compilation for WebAssembly code

The WebAssembly API provides a special function to support streaming compilation in combination with the fetch() API:

constmodule = await WebAssembly.compileStreaming(fetch('foo.wasm'));

This API has been available since V8 v6.1 and Chrome 61, although the initial implementation didn’t actually use streaming compilation. However, with V8 v6.5 and Chrome 65 we take advantage of this API and compile WebAssembly modules already while we are still downloading the module bytes. As soon as we download all bytes of a single function, we pass the function to a background thread to compile it.

Our measurements show that with this API, the WebAssembly compilation in Chrome 65 can keep up with up to 50 Mbit/sec download speed on high-end machines. This means that if you download WebAssembly code with 50 Mbit/sec, compilation of that code finishes as soon as the download finishes.

For the graph below we measure the time it takes to download and compile a WebAssembly module with 67 MB and about 190,000 functions. We do the measurements with 25 Mbit/sec, 50 Mbit/sec, and 100 Mbit/sec download speed.

When the download time is longer than the compile time of the WebAssembly module, e.g. in the graph above with 25 Mbit/sec and 50 Mbit/sec, then WebAssembly.compileStreaming() finishes compilation almost immediately after the last bytes are downloaded.

When the download time is shorter than the compile time, then WebAssembly.compileStreaming() takes about as long as it takes to compile the WebAssembly module without downloading the module first.

Speed

We continued to work on widening the fast-path of JavaScript builtins in general, adding a mechanism to detect and prevent a ruinous situation called a “deoptimization loop.” This occurs when your optimized code deoptimizes, and there is no way to learn what went wrong. In such scenarios, TurboFan just keeps trying to optimize, finally giving up after about 30 attempts. This would happen if you did something to alter the shape of the array in the callback function of any of our second order array builtins. For example, changing the length of the array — in v6.5, we note when that happens, and stop inlining the array builtin called at that site on future optimization attempts.

We also widened the fast-path by inlining many builtins that were formerly excluded because of a side-effect between the load of the function to call and the call itself, for example a function call. And String.prototype.indexOf got a 10× performance improvement in function calls.

In V8 v6.4, we’d inlined support for Array.prototype.forEach, Array.prototype.map, and Array.prototype.filter. In V8 v6.5 we’ve added inlining support for:

Array.prototype.reduce

Array.prototype.reduceRight

Array.prototype.find

Array.prototype.findIndex

Array.prototype.some

Array.prototype.every

Furthermore, we’ve widened the fast path on all these builtins. At first we would bail out on seeing arrays with floating-point numbers, or (even more bailing out) if the arrays had “holes” in them, e.g. [3, 4.5, , 6]. Now, we handle holey floating-point arrays everywhere except in find and findIndex, where the spec requirement to convert holes into undefined throws a monkey-wrench into our efforts (for now…!).

The following image shows the improvement delta compared to V8 v6.4 in our inlined builtins, broken down into integer arrays, double arrays, and double arrays with holes. Time is in milliseconds.

V8 API

Please use git log branch-heads/6.4..branch-heads/6.5 include/v8.h to get a list of the API changes.