Tuesday, December 19, 2017

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.4, which is in beta until its release in coordination with Chrome 64 Stable in several weeks. V8 v6.4 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

Speed

This release also addresses some performance cliffs in Function.prototype.bind. For example, TurboFan now consistently inlines all monomorphic calls to bind. In addition, TurboFan also supports the bound callback pattern, meaning that instead of the following:

doSomething(function (...args) {
  callback.apply(someObj, args);
});

You can now use:

doSomething(callback.bind(someObj));

This way, the code is more readable, and you still get the same performance.

As part of V8’s on-going effort to improve the performance of array built-ins, we improved Array.prototype.slice performance ~4× by reimplementing it using the CodeStubAssembler. Additionally, calls to Array.prototype.map and Array.prototype.filter are now inlined for many cases, giving them a performance profile competitive with hand-written versions.
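To give a sense of what “competitive with hand-written versions” means, here is the kind of hand-rolled loop developers previously wrote to sidestep builtin call overhead (a sketch; mapManual is an illustrative name, not V8 code):

```javascript
// Hand-written equivalent of Array.prototype.map, the kind of loop
// developers previously wrote to avoid builtin call overhead.
function mapManual(arr, fn) {
  const result = new Array(arr.length);
  for (let i = 0; i < arr.length; i++) {
    result[i] = fn(arr[i], i, arr);
  }
  return result;
}

// With calls to map now inlined, the builtin performs comparably:
const viaBuiltin = [1, 2, 3].map((x) => x * 2);
const viaManual = mapManual([1, 2, 3], (x) => x * 2);
```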

Memory

V8’s built-in code objects and bytecode handlers are now deserialized lazily from the snapshot, which can significantly reduce memory consumed by each Isolate. Benchmarks in Chrome show savings of several hundred KB per tab when browsing common sites.

Look out for a dedicated blog post on this subject early next year.

ECMAScript language features

This V8 release includes support for two exciting new ECMAScript language features.

Thanks to Groupon, V8 now implements import.meta, which enables embedders to expose host-specific metadata about the current module. For example, Chrome 64 exposes the module URL via import.meta.url, and Chrome plans to add more properties to import.meta in the future.
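As a sketch of how import.meta.url is typically used (the URLs here are made up, and outside a module we substitute a literal base URL for import.meta.url):

```javascript
// Inside an ES module, import.meta.url holds the module’s own URL, e.g.:
//   import.meta.url === 'https://example.com/app/main.mjs'
// A common use is resolving resources relative to the current module.
// Outside a module, the same resolution can be demonstrated with an
// explicit base URL standing in for import.meta.url:
const moduleUrl = 'https://example.com/app/main.mjs';
const dataUrl = new URL('data.json', moduleUrl);
// dataUrl.href === 'https://example.com/app/data.json'
```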

To assist with locale-aware formatting of strings produced by internationalization formatters, developers can now use Intl.NumberFormat.prototype.formatToParts() to format a number into a list of tokens and their types. Thanks to Igalia for implementing this in V8!
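For example, a currency formatter can now expose its output as typed tokens (the exact part list shown assumes the en-US locale):

```javascript
const formatter = new Intl.NumberFormat('en-US', {
  style: 'currency',
  currency: 'USD',
});

const parts = formatter.formatToParts(1234.56);
// Each token is tagged with its type, e.g.:
// [ { type: 'currency', value: '$' }, { type: 'integer', value: '1' },
//   { type: 'group', value: ',' }, { type: 'integer', value: '234' },
//   { type: 'decimal', value: '.' }, { type: 'fraction', value: '56' } ]

// Recombining all values yields the plain formatted string:
const recombined = parts.map((part) => part.value).join('');
```

This makes it possible to style or post-process individual tokens (for instance, rendering the currency symbol differently) without re-parsing the formatted string.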

V8 API

Please use git log branch-heads/6.3..branch-heads/6.4 include/v8.h to get a list of the API changes.

Wednesday, December 13, 2017

What is it?

Code coverage provides information about whether, and optionally how often certain parts of an application have been executed. It’s commonly used to determine how thoroughly a test suite exercises a particular codebase.

Why is it useful?

As a JavaScript developer, you may often find yourself in a situation in which code coverage could be useful. For instance:

Interested in the quality of your test suite? Refactoring a large legacy project? Code coverage can show you exactly which parts of your codebase are covered.

Want to quickly know if a particular part of the codebase is reached? Instead of instrumenting with console.log for printf-style debugging or manually stepping through the code, code coverage can display live information about which parts of your applications have been executed.

Or maybe you’re optimizing for speed and would like to know which spots to focus on? Execution counts can point out hot functions and loops.

JavaScript code coverage in V8

Earlier this year, we added native support for JavaScript code coverage to V8. The initial release in version 5.9 provided coverage at function granularity (showing which functions have been executed), which was later extended to support coverage at block granularity in 6.2 (likewise, but for individual expressions).

V8 supports two modes of collection. Best-effort coverage reports coverage information with minimal impact on runtime performance, but might lose data on garbage-collected functions. Precise coverage ensures that no data is lost to the GC, and users can choose to receive execution counts instead of binary coverage information; but performance might be impacted by the increased overhead (see the next section for more details). Precise coverage can be collected either at function or block granularity.

Behind the scenes

As stated in the previous section, V8 supports two main modes of code coverage: best-effort and precise coverage. Read on for an overview of their implementation.

Best-effort coverage

Both best-effort and precise coverage modes heavily reuse existing V8 mechanisms, the first of which is called the invocation counter. Each time a function is called through V8’s Ignition interpreter, we increment an invocation counter on the function’s feedback vector. As the function becomes hot and tiers up through the optimizing compiler, this counter helps guide decisions about which functions to inline; and now, we also rely on it to report code coverage.

The second reused mechanism determines the source range of functions. When reporting code coverage, invocation counts need to be tied to an associated range within the source file. In the example below, we not only need to report that function f has been executed exactly once, but also that f’s source range begins at line 1 and ends at line 3.

function f() {
  console.log('Hello World');
}
f();

Again we got lucky and were able to reuse existing information within V8. Functions already knew their start and end positions within the source code due to Function.prototype.toString, which needs to know the function’s location within the source file to extract the appropriate substring.
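This is directly observable from JavaScript, since toString must return exactly the substring between the function’s start and end positions:

```javascript
function f() {
  console.log('Hello World');
}

// The engine must know f's exact source range to produce this substring:
const source = f.toString();
// source === "function f() {\n  console.log('Hello World');\n}"
```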

When collecting best-effort coverage, these two mechanisms are simply tied together: first we find all live functions by traversing the entire heap. For each function found, we report the invocation count (stored on the feedback vector, which we can reach from the function) and the source range (conveniently stored on the function itself).

Note that since invocation counts are maintained regardless of whether coverage is enabled, best-effort coverage does not introduce any runtime overhead. It also uses no dedicated data structures and therefore does not need to be explicitly enabled or disabled.

So why is this mode called best-effort, and what are its limitations? Functions that go out of scope may be freed by the garbage collector. This means that associated invocation counts are lost; in fact, we completely forget that these functions ever existed. Ergo ‘best-effort’: even though we try our best, the collected coverage information may be incomplete.

Precise coverage (function granularity)

In contrast to the best-effort mode, precise coverage guarantees that the provided coverage information is complete. To achieve this, we add all feedback vectors to V8’s root set of references once precise coverage is enabled, preventing their collection by the GC. While this ensures no information is lost, it increases memory consumption by keeping objects alive artificially.

The precise coverage mode can also provide execution counts. This adds another wrinkle to the precise coverage implementation. Recall that the invocation counter is incremented each time a function is called through V8’s interpreter, and that functions can tier up and be optimized once they become hot. But optimized functions no longer increment their invocation counter, and thus the optimizing compiler must be disabled for their reported execution count to remain accurate.

Precise coverage (block granularity)

Block-granularity coverage must report coverage that is correct down to the level of individual expressions. For example, in the following piece of code, block coverage can detect that the else branch of the conditional expression a ? b : c is never executed, while function-granularity coverage would only know that the function f (in its entirety) is covered.

function f(a) {
  return a ? b : c;
}
f(true);

You may recall from the previous sections that we already had function invocation counts and source ranges readily available within V8. Unfortunately, this was not the case for block coverage and we had to implement new mechanisms to collect both execution counts and their corresponding source ranges.

The first aspect is source ranges: assuming we have an execution count for a particular block, how can we map them to a section of the source code? For this, we need to collect relevant positions while parsing the source files. Prior to block coverage, V8 already did this to some extent. One example is the collection of function ranges due to Function.prototype.toString as described above. Another is that source positions are used to construct the backtrace for Error objects. But neither of these is sufficient to support block coverage; the former is only available for functions, while the latter only stores positions (e.g. the position of the if token for if-else statements), not source ranges.

We therefore had to extend the parser to collect source ranges. To demonstrate, consider an if-else statement:

if (cond) {
/* Then branch. */
} else {
/* Else branch. */
}

When block coverage is enabled, we collect the source range of the then and else branches and associate them with the parsed IfStatement AST node. The same is done for other relevant language constructs.

With source ranges collected during parsing, the second aspect is tracking execution counts at runtime. This is done by inserting a new dedicated IncBlockCounter bytecode at strategic positions within the generated bytecode array. At runtime, the IncBlockCounter bytecode handler simply increments the appropriate counter (reachable through the function object).

In the above example of an if-else statement, such bytecodes would be inserted at three spots: immediately prior to the body of the then branch, prior to the body of the else branch, and immediately after the if-else statement (such continuation counters are needed due to the possibility of non-local control flow within a branch).
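Conceptually, the effect is as if the source had been instrumented as follows (the counters array and its indices are purely illustrative; the real counters live in the bytecode, not in the source):

```javascript
const counters = [0, 0, 0];

function instrumented(cond) {
  if (cond) {
    counters[0]++; // IncBlockCounter: then branch
    /* Then branch. */
  } else {
    counters[1]++; // IncBlockCounter: else branch
    /* Else branch. */
  }
  counters[2]++; // IncBlockCounter: continuation after the if-else
}

instrumented(true);
instrumented(true);
instrumented(false);
// counters is now [2, 1, 3]
```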

Finally, reporting block-granularity coverage works similarly to function-granularity reporting. But in addition to invocation counts (from the feedback vector), we now also report the collection of interesting source ranges together with their block counts (stored on an auxiliary data structure that hangs off the function).

If you’d like to learn more about the technical details behind code coverage in V8, see the coverage and block coverage design documents.

Conclusion

We hope you’ve enjoyed this brief introduction to V8’s native code coverage support. Please give it a try and don’t hesitate to let us know what works for you, and what doesn’t. Say hello on Twitter (@schuay and @hashseed) or file a bug at crbug.com/v8/new.

Coverage support in V8 has been a team effort, and thanks are in order to everyone that has contributed: Benjamin Coe, Jakob Gruber, Yang Guo, Marja Hölttä, Andrey Kosyakov, Alexey Kozyatinskiy, Ross McIlroy, Ali Sheikh, Michael Starzinger. Thank you!

Wednesday, November 29, 2017

JavaScript objects in V8 are allocated on a heap managed by V8’s garbage collector. In previous blog posts we have already talked about how we reduce garbage collection pause times (more than once) and memory consumption. In this blog post we introduce the parallel Scavenger, one of the latest features of Orinoco, V8’s mostly concurrent and parallel garbage collector, and discuss design decisions and alternative approaches we implemented along the way.

V8 partitions its managed heap into generations where objects are initially allocated in the “nursery” of the young generation. Upon surviving a garbage collection, objects are copied into the intermediate generation, which is still part of the young generation. After surviving another garbage collection, these objects are moved into the old generation (see Figure 1). V8 implements two garbage collectors: one that frequently collects the young generation, and one that collects the full heap including both the young and old generation. Old-to-young generation references are roots for the young generation garbage collection. These references are recorded to provide efficient root identification and reference updates when objects are moved.

Figure 1: Generational garbage collection

Since the young generation is relatively small (up to 16MiB in V8) it fills up quickly with objects and requires frequent collections. Until M62, V8 used a Cheney semispace copying garbage collector (see below) that divides the young generation into two halves. During JavaScript execution only one half of the young generation is available for allocating objects, while the other half remains empty. During a young garbage collection, live objects are copied from one half to the other half, compacting the memory on the fly. Live objects that have already been copied once are considered part of the intermediate generation and are promoted to the old generation.

Starting with M62, V8 switched the default algorithm for collecting the young generation to a parallel Scavenger, similar to Halstead’s semispace copying collector with the difference that V8 makes use of dynamic instead of static work stealing across multiple threads. In the following we explain three algorithms: a) the single-threaded Cheney semispace copying collector, b) a parallel Mark-Evacuate scheme, and c) the parallel Scavenger.

Single-threaded Cheney’s Semispace Copy

Until M62, V8 used Cheney’s semispace copying algorithm which is well-suited for both single-core execution and a generational scheme. Before a young generation collection, both semispace halves of memory are committed and assigned proper labels: the pages containing the current set of objects are called from-space while the pages that objects are copied to are called to-space.

The Scavenger considers references in the call stack and references from the old to the young generation as roots. Figure 2 illustrates the algorithm where initially the Scavenger scans these roots and copies objects reachable in the from-space that have not yet been copied to the to-space. Objects that have already survived a garbage collection are promoted (moved) to the old generation. After root scanning and the first round of copying, the objects in the newly allocated to-space are scanned for references. Similarly, all promoted objects are scanned for new references to from-space. These three phases are interleaved on the main thread. The algorithm continues until no more new objects are reachable from either to-space or the old generation. At this point the from-space only contains unreachable objects, i.e., it only contains garbage.
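The core of Cheney’s algorithm fits in a few lines. The following model replaces the to-space scan pointer with an index into an array and omits promotion to the old generation (a simplification of the description above):

```javascript
// Minimal model of Cheney's semispace copy: objects are { children: [...] }
// and "copying" is modeled via a from-space → to-space forwarding map.
function cheneyCollect(roots) {
  const forwarding = new Map(); // from-space object → its to-space copy
  const toSpace = [];

  function copy(obj) {
    if (!forwarding.has(obj)) {
      const copied = { children: obj.children.slice() };
      forwarding.set(obj, copied);
      toSpace.push(copied); // newly copied objects still need scanning
    }
    return forwarding.get(obj);
  }

  // Scan roots first, then scan to-space objects until none remain
  // unscanned; from-space then contains only garbage.
  const newRoots = roots.map(copy);
  for (let scan = 0; scan < toSpace.length; scan++) {
    toSpace[scan].children = toSpace[scan].children.map(copy);
  }
  return newRoots;
}
```

The forwarding map plays the role of the forwarding pointers the real collector installs in from-space objects, so shared objects are copied exactly once.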

Parallel Mark-Evacuate

We experimented with a parallel Mark-Evacuate algorithm based on V8’s full Mark-Sweep-Compact collector. The main advantage is leveraging the already existing garbage collection infrastructure from the full Mark-Sweep-Compact collector. The algorithm consists of three phases: marking, copying, and updating pointers, as shown in Figure 3. To avoid sweeping pages in the young generation to maintain free lists, the young generation is still maintained using a semispace that is always kept compact by copying live objects into to-space during garbage collection. The young generation is initially marked in parallel. After marking, live objects are copied in parallel to their corresponding spaces. Work is distributed based on logical pages. Threads participating in copying keep their own local allocation buffers (LABs), which are merged upon finishing copying. After copying, the same parallelization scheme is applied for updating inter-object pointers. These three phases are performed in lockstep, i.e., while the phases themselves are performed in parallel, threads have to synchronize before continuing to the next phase.

Parallel Scavenge

The parallel Mark-Evacuate collector separates the phases of computing liveness, copying live objects, and updating pointers. An obvious optimization is to merge these phases, resulting in an algorithm that marks, copies, and updates pointers at the same time. By merging those phases we actually get the parallel Scavenger used by V8, a version similar to Halstead’s semispace collector with the difference that V8 uses dynamic work stealing and a simple load balancing mechanism for scanning the roots (see Figure 4). Like the single-threaded Cheney algorithm, the phases are: scanning for roots, copying within the young generation, promoting to the old generation, and updating pointers. We found that the majority of the root set usually consists of references from the old generation to the young generation. In our implementation, remembered sets are maintained per page, which naturally distributes the root set among garbage collection threads. Objects are then processed in parallel. Newly found objects are added to a global work list from which garbage collection threads can steal. This work list provides fast task-local storage as well as global storage for sharing work. A barrier makes sure that tasks do not prematurely terminate when the subgraph currently being processed is not suitable for work stealing (e.g. a linear chain of objects). All phases are performed in parallel and interleaved on each task, maximizing the utilization of worker tasks.
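A single-threaded model of this work distribution scheme looks as follows (a sketch only; real Scavenger tasks run on separate threads and synchronize with atomic operations):

```javascript
// Model of the Scavenger's work distribution: each task drains a local
// list, falls back to stealing from a global work list, and publishes
// newly discovered objects for other tasks to steal.
function parallelScavengeModel(roots, taskCount = 4) {
  const globalWorkList = [...roots];
  const localLists = Array.from({ length: taskCount }, () => []);
  const copied = new Set();

  let progress = true;
  while (progress) {
    progress = false;
    for (const local of localLists) {
      // Steal from the global list when the local list runs dry.
      if (local.length === 0 && globalWorkList.length > 0) {
        local.push(globalWorkList.pop());
      }
      const obj = local.pop();
      if (obj === undefined) continue; // no work for this task this round
      progress = true;
      if (copied.has(obj)) continue;   // another task got there first
      copied.add(obj);                 // model of copying to to-space
      for (const child of obj.children) {
        globalWorkList.push(child);    // share new work globally
      }
    }
  }
  return copied;
}
```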


Figure 4: Young generation parallel Scavenger in V8

Results and outcome

The Scavenger algorithm was initially designed having optimal single-core performance in mind. The world has changed since then. CPU cores are often plentiful, even on low-end mobile devices. More importantly, often these cores are actually up and running. To fully utilize these cores, one of the last sequential components of V8’s garbage collector, the Scavenger, had to be modernized.

The big advantage of a parallel Mark-Evacuate collector is that exact liveness information is available. This information can, for example, be used to avoid copying entirely by just moving and relinking pages that contain mostly live objects, which is also done by the full Mark-Sweep-Compact collector. In practice, however, this was mostly observable on synthetic benchmarks and rarely showed up on real websites. The downside of the parallel Mark-Evacuate collector is the overhead of performing three separate lockstep phases. This overhead is especially noticeable when the garbage collector is invoked on a heap with mostly dead objects, which is the case on many real-world webpages. Note that invoking garbage collections on heaps with mostly dead objects is actually the ideal scenario, as garbage collection is usually bounded by the size of live objects.

The parallel Scavenger closes this performance gap by providing performance that is close to the optimized Cheney algorithm on small or almost empty heaps while still providing a high throughput in case the heaps get larger with lots of live objects.

V8 supports, among many other platforms, Arm big.LITTLE. While offloading work to little cores benefits battery lifetime, it can lead to stalling on the main thread when work packages for little cores are too big. We observed that page-level parallelism does not necessarily load balance work on big.LITTLE for a young generation garbage collection due to the limited number of pages. The Scavenger naturally solves this issue by providing medium-grained synchronization using explicit work lists and work stealing.

Figure 5: Total young generation garbage collection time (in ms) across various websites

V8 now ships with the parallel Scavenger, which reduces the main thread young generation garbage collection total time by about 20%–50% across a large set of benchmarks (details on our perf waterfalls). Figure 5 shows a comparison of the implementations across various real-world websites, showing improvements around 55% (2×). Similar improvements can be observed in maximum and average pause time while maintaining minimum pause time. The parallel Mark-Evacuate collector scheme still has potential for optimization. Stay tuned if you want to find out what happens next.

Thursday, November 16, 2017

In this post we’d like to introduce the CodeStubAssembler (CSA), a component in V8 that has been a very useful tool in achieving some big performance wins over the last several V8 releases. The CSA also significantly improved the V8 team’s ability to quickly optimize JavaScript features at a low level with a high degree of reliability, which improved the team’s development velocity.

A brief history of builtins and hand-written assembly in V8

To understand the CSA’s role in V8, it’s important to understand a little bit of the context and history that led to its development.

V8 squeezes performance out of JavaScript using a combination of techniques. For JavaScript code that runs a long time, V8’s TurboFan optimizing compiler does a great job of speeding up the entire spectrum of ES2015+ functionality for peak performance. However, V8 also needs to execute short-running JavaScript efficiently for good baseline performance. This is especially the case for the so-called builtin functions on the pre-defined objects that are available to all JavaScript programs as defined by the ECMAScript specification.

Historically, many of these builtin functions were self-hosted, that is, they were authored by a V8 developer in JavaScript—albeit a special V8-internal dialect. To achieve good performance, these self-hosted builtins rely on the same mechanisms V8 uses to optimize user-supplied JavaScript. As with user-supplied code, the self-hosted builtins require a warm-up phase in which type feedback is gathered and they need to be compiled by the optimizing compiler.

Although this technique provides good builtin performance in some situations, it’s possible to do better. The exact semantics of the pre-defined functions on the Array.prototype are specified in exquisite detail in the spec. For important and common special cases, V8’s implementers know in advance exactly how these builtin functions should work by understanding the specification, and they use this knowledge to carefully craft custom, hand-tuned versions up front. These optimized builtins handle common cases without warm-up or the need to invoke the optimizing compiler, since by construction baseline performance is already optimal upon first invocation.

To squeeze the best performance out of hand-written built-in JavaScript functions (and from other fast-path V8 code that is also somewhat confusingly called builtins), V8 developers traditionally wrote optimized builtins in assembly language. By using assembly, the hand-written builtin functions were especially fast by, among other things, avoiding expensive calls to V8’s C++ code via trampolines and by taking advantage of V8’s custom register-based ABI that it uses internally to call JavaScript functions.

Because of the advantages of hand-written assembly, V8 accumulated literally tens of thousands of lines of hand-written assembly code for builtins over the years… per platform. All of these hand-written assembly builtins were great for improving performance, but new language features are always being standardized, and maintaining and extending this hand-written assembly was laborious and error-prone.

Enter the CodeStubAssembler

V8 developers wrestled with a dilemma for many years: is it possible to create builtins that have the advantage of hand-written assembly without also being fragile and difficult to maintain?

With the advent of TurboFan the answer to this question is finally “yes”. TurboFan’s backend uses a cross-platform intermediate representation (IR) for low-level machine operations. This low-level machine IR is input to an instruction selector, register allocator, instruction scheduler and code generator that produce very good code on all platforms. The backend also knows about many of the tricks that are used in V8’s hand-written assembly builtins—e.g. how to use and call a custom register-based ABI, how to support machine-level tail calls, and how to elide the construction of stack frames in leaf functions. That knowledge makes the TurboFan backend especially well-suited for generating fast code that integrates well with the rest of V8.

This combination of functionality made a robust and maintainable alternative to hand-written assembly builtins feasible for the first time. The team built a new V8 component—dubbed the CodeStubAssembler or CSA—that defines a portable assembly language built on top of TurboFan’s backend. The CSA adds an API to generate TurboFan machine-level IR directly without having to write and parse JavaScript or apply TurboFan’s JavaScript-specific optimizations. Although this fast-path to code generation is something that only V8 developers can use to speed up the V8 engine internally, this efficient path for generating optimized assembly code in a cross-platform way directly benefits all developers’ JavaScript code in the builtins constructed with the CSA, including the performance-critical bytecode handlers for V8’s interpreter, Ignition.

The CSA and JavaScript compilation pipelines

The CSA interface includes operations that are very low-level and familiar to anybody who has ever written assembly code. For example, it includes functionality like “load this object pointer from a given address” and “multiply these two 32-bit numbers”. The CSA has type verification at the IR level to catch many correctness bugs at compile time rather than runtime. For example, it can ensure that a V8 developer doesn’t accidentally use an object pointer that is loaded from memory as the input for a 32-bit multiplication. This kind of type verification is simply not possible with hand-written assembly stubs.

A CSA test-drive

To get a better idea of what the CSA offers, let’s go through a quick example. We’ll add a new internal builtin to V8 that returns the string length from an object if it is a String. If the input object is not a String, the builtin will return undefined.

First, we add a line to the BUILTIN_LIST_BASE macro in V8’s builtin-definitions.h file that declares the new builtin called GetStringLength and specifies that it has a single input parameter that is identified with the constant kInputObject:

TFS(GetStringLength, kInputObject)

The TFS macro declares the builtin as a TurboFan builtin using standard CodeStub linkage, which simply means that it uses the CSA to generate its code and expects parameters to be passed via registers.

The builtin’s body is then defined using the CSA:

TF_BUILTIN(GetStringLength, CodeStubAssembler) {
  Label not_string(this);

  // Fetch the incoming object using the constant we defined for
  // the first parameter.
  Node* const maybe_string = Parameter(Descriptor::kInputObject);

  // Check to see if input is a Smi (a special representation
  // of small numbers). This needs to be done before the IsString
  // check below, since IsString assumes its argument is an
  // object pointer and not a Smi. If the argument is indeed a
  // Smi, jump to the label |not_string|.
  GotoIf(TaggedIsSmi(maybe_string), &not_string);

  // Check to see if the input object is a string. If not, jump to
  // the label |not_string|.
  GotoIfNot(IsString(maybe_string), &not_string);

  // Load the length of the string (having ended up in this code
  // path because we verified it was a string above) and return it
  // using a CSA "macro" LoadStringLength.
  Return(LoadStringLength(maybe_string));

  // Define the location of the label that is the target of the
  // failed IsString check above.
  BIND(&not_string);

  // Input object isn't a string. Return the JavaScript undefined
  // constant.
  Return(UndefinedConstant());
}

Note that in the example above, there are two types of instructions. Primitive CSA instructions, like GotoIf and Return, translate directly into one or two assembly instructions; they form a fixed set of pre-defined operations roughly corresponding to the most commonly used assembly instructions you would find on one of V8’s supported chip architectures. Other instructions in the example are macro instructions, like LoadStringLength, TaggedIsSmi, and IsString, which are convenience functions that output one or more primitive or macro instructions inline. Macro instructions are used to encapsulate commonly used V8 implementation idioms for easy reuse. They can be arbitrarily long, and new macro instructions can easily be defined by V8 developers whenever needed.

After compiling V8 with the above changes, we can run mksnapshot, the tool that compiles builtins to prepare them for V8’s snapshot, with the --print-code command-line option. This option prints the generated assembly code for each builtin. If we grep for GetStringLength in the output, we get the following result on x64 (the code output is cleaned up a bit to make it more readable):

Even though our new builtin uses a non-standard (at least non-C++) calling convention, it’s possible to write test cases for it. The following code can be added to test-run-stubs.cc to test the builtin on all platforms:

For more details about using the CSA for different kinds of builtins and for further examples, see this wiki page.

A V8 developer velocity multiplier

The CSA is more than just a universal assembly language that targets multiple platforms. It enables much quicker turnaround when implementing new features compared to hand-writing code for each architecture as we used to do. It does this by providing all of the benefits of hand-written assembly while protecting developers against its most treacherous pitfalls:

With the CSA, developers can write builtin code with a cross-platform set of low-level primitives that translate directly to assembly instructions. The CSA’s instruction selector ensures that this code is optimal on all of the platforms that V8 targets without requiring V8 developers to be experts in each of those platform’s assembly languages.

The CSA’s interface has optional types to ensure that the values manipulated by the low-level generated assembly are of the type that the code author expects.

Register allocation between assembly instructions is done by the CSA automatically rather than explicitly by hand, including building stack frames and spilling values to the stack if a builtin uses more registers than are available or makes calls. This eliminates a whole class of subtle, hard-to-find bugs that plagued hand-written assembly builtins. By making the generated code less fragile, the CSA drastically reduces the time required to write correct low-level builtins.

The CSA understands ABI calling conventions—both standard C++ and internal V8 register-based ones—making it possible to easily interoperate between CSA-generated code and other parts of V8.

Since CSA code is C++, it’s easy to encapsulate common code generation patterns in macros that can be easily reused in many builtins.

Because V8 uses the CSA to generate the bytecode handlers for Ignition, it is very easy to inline the functionality of CSA-based builtins directly into the handlers to improve the interpreter’s performance.

All in all, the CSA has been a game changer for V8 development. It has significantly improved the team’s ability to optimize V8. That means we are able to optimize more of the JavaScript language faster for V8’s embedders.

Monday, November 6, 2017

JavaScript performance has always been important to the V8 team, and in this post we would like to discuss a new JavaScript Web Tooling Benchmark that we have been using recently to identify and fix some performance bottlenecks in V8. You may already be aware of V8’s strong commitment to Node.js and this benchmark extends that commitment by specifically running performance tests based on common developer tools built upon Node.js. The tools in the Web Tooling Benchmark are the same ones used by developers and designers today to build modern web sites and cloud-based applications. In continuation of our ongoing efforts to focus on real-world performance rather than artificial benchmarks, we created the benchmark using actual code that developers run every day.

The Web Tooling Benchmark suite was designed from the beginning to cover important developer tooling use cases for Node.js. Because the V8 team focuses on core JavaScript performance, we built the benchmark in a way that focuses on the JavaScript workloads and excludes measurement of Node.js-specific I/O or external interactions. This makes it possible to run the benchmark in Node.js, in all browsers, and in all major JavaScript engine shells, including ch (ChakraCore), d8 (V8), jsc (JavaScriptCore) and jsshell (SpiderMonkey). Even though the benchmark is not limited to Node.js, we are excited that the Node.js benchmarking working group is considering using the tooling benchmark as a standard for Node performance as well (nodejs/benchmarking#138).

The individual tests in the tooling benchmark cover a variety of tools that developers commonly use to build JavaScript-based applications, for example:

Based on past experience with other benchmarks like Speedometer, where tests quickly become outdated as new versions of frameworks become available, we made sure it is straightforward to update each of the tools in the benchmark to more recent versions as they are released. By basing the benchmark suite on npm infrastructure, we can easily update it to ensure that it is always testing the state of the art in JavaScript development tools. Updating a test case is just a matter of bumping the version in the package.json manifest.

We created a tracking bug and a spreadsheet to contain all the relevant information that we have collected about V8’s performance on the new benchmark up to this point. Our investigations have already yielded some interesting results. For example, we discovered that V8 was often hitting the slow path for instanceof (v8:6971), incurring a 3–4× slowdown. We also found and fixed performance bottlenecks in certain cases of property assignments of the form obj[name] = val where obj was created via Object.create(null). In these cases, V8 would fall off the fast path despite being able to utilize the fact that obj has a null prototype (v8:6985). These and other discoveries made with the help of this benchmark improve V8, not only in Node.js, but also in Chrome.
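The Object.create(null) pattern mentioned above can be sketched as follows (the property name and value are illustrative, not taken from the original report):

```javascript
// An object created via Object.create(null) has no prototype at all.
const obj = Object.create(null);

// Assignments of the form obj[name] = val on such objects used to fall
// off V8's fast path (v8:6985), even though the null prototype means no
// setter anywhere on a prototype chain could intercept the assignment.
const name = 'someKey';
obj[name] = 42;

console.log(obj[name]); // 42
```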

We not only looked into making V8 faster, but also fixed and upstreamed performance bugs in the benchmark’s tools and libraries whenever we found them. For example, we discovered a number of performance bugs in Babel where code patterns like

value = items[items.length - 1];

lead to accesses of the property "-1", because the code didn’t check whether items is empty beforehand. This code pattern causes V8 to go through a slow path due to the "-1" lookup, even though a slightly modified, equivalent version of the JavaScript is much faster. We helped to fix these issues in Babel (babel/babel#6582, babel/babel#6581 and babel/babel#6580). We also discovered and fixed a bug where Babel would access beyond the length of a string (babel/babel#6589), which triggered another slow path in V8. Additionally, we optimized out-of-bounds reads of arrays and strings in V8. We look forward to continuing to work with the community on improving the performance of this important use case, not only when run on top of V8, but also when run on other JavaScript engines like ChakraCore.
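A minimal reproduction of that pattern (variable names are illustrative):

```javascript
const items = [];

// With items empty, items.length - 1 is -1, so this line looks up the
// property "-1" — a named property access rather than an element access,
// which sends V8 down a slow path.
const last = items[items.length - 1]; // undefined

// An equivalent, guarded version avoids the negative-index lookup:
const safeLast = items.length > 0 ? items[items.length - 1] : undefined;
```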

Our strong focus on real-world performance and especially on improving popular Node.js workloads is shown by the constant improvements in V8’s score on the benchmark over the last couple of releases:

Over the last several years, the V8 team has come to recognize that no one JavaScript benchmark — even a well-intentioned, carefully crafted one — should be used as a single proxy for a JavaScript engine’s overall performance. However, we do believe that the new Web Tooling Benchmark highlights areas of JavaScript performance that are worth focusing on. Despite the name and the initial motivation, we have found that the Web Tooling Benchmark suite is not only representative of tooling workloads, but is representative of a large range of more sophisticated JavaScript applications that are not tested well by front end-focused benchmarks like Speedometer. It is by no means a replacement for Speedometer, but rather a complementary set of tests.

The best news of all is that given how the Web Tooling Benchmark is constructed around real workloads, we expect that our recent improvements in benchmark scores will translate directly into improved developer productivity through less time waiting for things to build. Many of these improvements are already available in Node.js: at the time of writing, Node 8 LTS is at V8 6.1 and Node 9 is at V8 6.2.

Wednesday, October 25, 2017

Every six weeks, we create a new branch of V8 as part of our release process. Each version is branched from V8’s Git master immediately before a Chrome Beta milestone. Today we’re pleased to announce our newest branch, V8 version 6.3, which is in beta until its release in coordination with Chrome 63 Stable in several weeks. V8 v6.3 is filled with all sorts of developer-facing goodies. This post provides a preview of some of the highlights in anticipation of the release.

Speed

Jank Busters III hit the shelves as part of the Orinoco project. Concurrent marking (70-80% of marking is done on a non-blocking thread) is shipped.

string.js has been completely ported to CodeStubAssembler. Thanks a lot to @peterwmwong for his awesome contributions! As a developer this means that builtin string functions like String#trim are a lot faster starting with 6.3.

Thursday, October 5, 2017

Introduction

Proxies have been an integral part of JavaScript since ES2015. They allow
intercepting fundamental operations on objects and customizing their behavior.
Proxies form a core part of projects like jsdom and the Comlink RPC library.
Recently, we put a lot of effort into improving the performance of proxies in
V8. This article sheds some light on general performance improvement patterns
in V8 and for proxies in particular.

Proxies are “objects used to define custom behavior for fundamental
operations (e.g. property lookup, assignment, enumeration, function invocation,
etc.)” (definition by MDN).
More info can be found in the full specification.
For example, the following code snippet adds logging to every property access on
the object:
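The snippet itself is not preserved here; a minimal version of such a logging proxy might look like this:

```javascript
const target = { message: 'hello' };

const handler = {
  // The "get" trap intercepts every property read on the proxy.
  get(target, propertyKey, receiver) {
    console.log(`get was called for: ${String(propertyKey)}`);
    return Reflect.get(target, propertyKey, receiver);
  },
};

const proxy = new Proxy(target, handler);
proxy.message; // logs: get was called for: message
```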

Constructing proxies

The first feature we'll focus on is the construction of
proxies. Our original C++ implementation here followed the ECMAScript
specification step by step, resulting in at least 4 jumps between the C++ and JS
runtimes, as shown in the following figure. We wanted to port this implementation
into the platform-agnostic CodeStubAssembler
(CSA), which is executed in the JS runtime as opposed to the C++ runtime. This
porting minimizes the number of jumps between the language runtimes. CEntryStub
and JSEntryStub represent the runtimes in the figure below. The dotted lines
represent the borders between the JS and C++ runtimes. Luckily, lots of helper
predicates were already implemented in the assembler, which made the initial
version concise and readable.

The figure below shows the execution flow for calling a Proxy with any proxy
trap (in this example apply, which is being called when the proxy
is used as a function) generated by the following sample code:
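The sample code is not preserved here; a sketch consistent with the description (trap body and function names are illustrative) could be:

```javascript
function sum(a, b) {
  return a + b;
}

const handler = {
  // The "apply" trap runs whenever the proxy is invoked as a function.
  apply(target, thisArg, args) {
    return Reflect.apply(target, thisArg, args);
  },
};

const proxiedSum = new Proxy(sum, handler);
proxiedSum(1, 2); // returns 3, routed through the apply trap
```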

After porting the trap execution to CSA all of the execution happens in the JS
runtime, reducing the number of jumps between languages from 4 to 0.

This change resulted in the following performance improvements:

Our JS performance score shows an improvement between 49% and
74%. This score roughly measures how many times the given
microbenchmark can be executed in 1000ms. For some tests the code is run
multiple times in order to get an accurate enough measurement given the timer
resolution. The code for all of the following benchmarks can be found in
our js-perf-test directory.

Call and construct traps

The next section shows the results from optimizing call and construct traps
(a.k.a. "apply"
and "construct").

The performance improvements when calling proxies are significant — up
to 500% faster! Still, the improvement for proxy construction
is quite modest, especially in cases where no actual trap is defined — only
about 25% gain. We investigated this by running the following
command with the d8
shell:

It turned out most of the time is spent in NewObject and the
functions called by it, so we started planning how to speed this up in future
releases.

Get trap

The next section describes how we optimized the other most common operations —
getting and setting properties through proxies. It turned out the get
trap is more involved than the previous cases, due to the specific behavior of
V8's inline cache. For a detailed explanation of inline caches, you can watch
this talk.

Eventually we managed to get a working port to CSA with the following results:

After landing the change, we noticed the size of the Android .apk
for Chrome had grown by ~160KB, which is more than expected for
a helper function of roughly 20 lines, but fortunately we track such statistics.
It turned out this function is called twice from another function, which is
called 3 times, from another called 4 times. The cause of the problem turned out
to be aggressive inlining. Eventually we solved the issue by turning the
inlined function into a separate code stub, saving precious KBs; the final
version increased the .apk size by only ~19KB.

Has trap

The next section shows the results from optimizing the has
trap. Although at first we thought it would be easier (and reuse most of the
code of the get trap), it turned out to have its own peculiarities.
A particularly hard-to-track-down problem was the prototype chain walking when
calling the in operator. The improvements achieved vary
between 71% and 428%. Again, the gain is more prominent in cases
where the trap is present.

Set trap

The next section talks about porting the set
trap. This time we had to differentiate between named and
indexed properties (elements).
These two main types are not part of the JS language, but are essential for V8's
efficient property storage. The initial implementation still bailed out to the
runtime for elements, which crosses the language boundaries again.
Nevertheless, we achieved improvements between 27% and 438% for
cases where the trap is set, at the cost of a decrease of up to
23% when it’s not. This performance regression is due to the
overhead of the additional check for differentiating between indexed and named
properties. For indexed properties, there is no improvement yet. Here are the
complete results:

Real-world usage

The jsdom-proxy-benchmark project compiles the ECMAScript specification using the Ecmarkup tool. As of v11.2.0, the jsdom project (which underlies Ecmarkup) uses proxies to implement the common data structures NodeList and HTMLCollection. We used this benchmark to get an overview of some more realistic usage than the synthetic micro-benchmarks, and achieved the following results (average of 100 runs):

Chai.js is a popular assertion library which makes heavy use of proxies. We’ve created a kind of real-world benchmark by running its test suite with different versions of V8, which yielded an improvement of roughly 1s out of more than 4s (average of 100 runs):

Wednesday, October 4, 2017

Roughly three months ago, I joined the V8 team (Google Munich) as an intern and
since then I’ve been working on the VM’s Deoptimizer — something
completely new to me which proved to be an interesting and challenging project.
The first part of my internship focused on improving the VM security-wise. The second part
focused on performance improvements. Namely, on the removal of a data-structure
used for the unlinking of previously deoptimized functions, which was a
performance bottleneck during garbage collection. This blog post describes this
second part of my internship. I’ll explain how V8 used to unlink deoptimized
functions, how we changed this, and what performance improvements were obtained.

Let’s (very) briefly recap the V8 pipeline for a JavaScript function: V8’s
interpreter, Ignition, collects profiling information about that function while
interpreting it. Once the function becomes hot, this information is passed to
V8’s compiler, TurboFan, which generates optimized machine code. When the
profiling information is no longer valid — for example because one of the
profiled objects gets a different type during runtime — the optimized machine
code might become invalid. In that case, V8 needs to deoptimize it.

Upon optimization, TurboFan generates a code object, i.e. the optimized machine
code, for the function under optimization. When this function is invoked the
next time, V8 follows the link to optimized code for that function and executes
it. Upon deoptimization of this function, we need to unlink the code object in
order to make sure that it won’t be executed again. How does that happen?

For example, in the following code, the function f1 will be invoked
many times (always passing an integer as argument). TurboFan then generates
machine code for that specific case.
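The code itself is not reproduced here, but a sketch consistent with the extended example later in this post (the function bodies are illustrative) might be:

```javascript
function g() {
  return (x) => x + 1;
}

const f1 = g();

// Invoke f1 many times, always with an integer argument, so that it
// becomes hot and TurboFan generates machine code specialized for
// integer inputs.
for (let i = 0; i < 10000; i++) {
  f1(i);
}
```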

Each function also has a trampoline to the interpreter — more details in these
slides
— and will keep a pointer to this trampoline in its
SharedFunctionInfo (SFI). This trampoline will be used whenever V8
needs to go back to unoptimized code. Thus, upon deoptimization, triggered by
passing an argument of a different type, for example, the Deoptimizer can simply
set the code field of the JavaScript function to this trampoline.

Although this seems simple, it forces V8 to keep weak lists of optimized
JavaScript functions. This is because it is possible to have different functions
pointing to the same optimized code object. We can extend our example as
follows, and the functions f1 and f2 both point to the
same optimized code.

const f2 = g();
f2(0);

If the function f1 is deoptimized (for example, by invoking it with an object of a different type, such as {x: 0}), we need to make sure that the invalidated code will not be executed again by invoking f2.

Thus, upon deoptimization, V8 used to iterate over all optimized JavaScript
functions and unlink those that pointed to the code object being
deoptimized. In applications with many optimized JavaScript functions, this
iteration became a performance bottleneck. Moreover, besides slowing down
deoptimization, V8 used to iterate over these lists during stop-the-world cycles
of garbage collection, making matters even worse.

In order to get an idea of the impact of such a data structure on the performance
of V8, we wrote a micro-benchmark
that stresses its usage by triggering many scavenge cycles after creating many
JavaScript functions.
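The micro-benchmark is not reproduced here; a hedged sketch of the idea (counts and names are illustrative, and real scavenge timing depends on heap configuration) could look like:

```javascript
// Each call to makeClosure allocates a fresh JSFunction, growing the
// set of functions V8 used to track in its weak lists.
function makeClosure(i) {
  return () => i;
}

const closures = [];
for (let i = 0; i < 100000; i++) {
  closures.push(makeClosure(i));
}

// Churn short-lived allocations to trigger many scavenge
// (young-generation) garbage collection cycles.
let sink;
for (let i = 0; i < 1000000; i++) {
  sink = { value: i };
}
```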

When running this benchmark, we could observe that V8 spent around 98% of its
execution time on garbage collection. We then removed this data structure, and
instead used an approach for lazy unlinking, and this was what we
observed on x64:

Although this is just a micro-benchmark that creates many JavaScript functions
and triggers many garbage collection cycles, it gives us an idea of the overhead
introduced by this data structure. Other, more realistic applications where we
saw some overhead, and which motivated this work, were the router benchmark
implemented in Node.js and the ARES-6
benchmark suite.

Lazy unlinking

Rather than unlinking optimized code from
JavaScript functions upon deoptimization, V8 postpones it for the next
invocation of such functions. When such functions are invoked, V8 checks whether
they have been deoptimized, unlinks them and then continues with their lazy
compilation. If these functions are never invoked again, then they will never be
unlinked and the deoptimized code objects will not be collected. However, since
we invalidate all the embedded fields of a code object during deoptimization,
only that code object itself is kept alive.

The commit
that removed this list of optimized JavaScript functions required changes in
several parts of the VM, but the basic idea is as follows. When assembling the
optimized code object, we check if this is the code of a JavaScript function. If
so, in its prologue, we assemble machine code to bail out if the code object has
been deoptimized. Upon deoptimization we don’t modify the deoptimized code —
code patching is gone. Thus, its bit marked_for_deoptimization is
still set when invoking the function again. TurboFan generates code to check it,
and if it is set, then V8 jumps to a new builtin,
CompileLazyDeoptimizedCode, that unlinks the deoptimized code from
the JavaScript function and then continues with lazy compilation.

In more detail, the first step is to generate instructions that load the address
of the code being currently assembled. We can do that in x64, with the following
code:
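That code is not preserved here. A hedged sketch in the style of V8’s x64 MacroAssembler (helper names, the label, and the destination register are illustrative, chosen to line up with the bit test shown below) might look like:

```cpp
// Sketch only: bind a label at the current position and load its
// RIP-relative address into rcx, giving the address of the code
// object currently being assembled.
Label current;
__ leaq(rcx, Operand(&current));
__ bind(&current);
```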

We can then test the bit and, if it is set, jump to the
CompileLazyDeoptimizedCode builtin.

// Test if the bit is set, that is, if the code is marked for deoptimization.
__ testl(Operand(rcx, offset),
         Immediate(1 << Code::kMarkedForDeoptimizationBit));
// Jump to the builtin if it is.
__ j(not_zero, /* handle to builtin code here */, RelocInfo::CODE_TARGET);

On the side of this CompileLazyDeoptimizedCode builtin, all that’s
left to do is to unlink the code field from the JavaScript function and set it
to the trampoline to the Interpreter entry. So, considering that the address of
the JavaScript function is in the register rdi, we can obtain the
pointer to the SharedFunctionInfo with:
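The load is not preserved here; a hedged sketch, assuming V8’s FieldOperand helper and the JSFunction::kSharedFunctionInfoOffset constant (offset name illustrative):

```cpp
// Sketch only: load the SharedFunctionInfo out of the JSFunction in rdi.
__ movq(rcx, FieldOperand(rdi, JSFunction::kSharedFunctionInfoOffset));
```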

This new technique is already integrated in V8 and, as we’ll discuss later,
allows for performance improvements. However, it comes with a minor
disadvantage: Before, V8 would consider unlinking only upon deoptimization. Now,
it has to do so in the activation of all optimized functions. Moreover, the
approach to check the marked_for_deoptimization bit is not as
efficient as it could be, given that we need to do some work to obtain the
address of the code object. Note that this happens when entering every optimized
function. A possible solution for this issue is to keep in a code object a
pointer to itself. Rather than doing work to find the address of the code object
whenever the function is invoked, V8 would do it only once, after its
construction.

Results

We now look at the performance gains and regressions obtained with this project.

General Improvements on x64

The following plot shows some
improvements and regressions relative to the previous commit. Note that
higher is better.

The promises benchmarks are the ones where we see greater
improvements, observing almost 33% gain for the bluebird-parallel
benchmark, and 22.40% for wikipedia. We also observed a few
regressions in some benchmarks. This is related to the issue explained above, on
checking whether the code is marked for deoptimization.

We also see improvements in the ARES-6 benchmark suite. Note that in this chart
too, higher is better. These programs used to spend a considerable amount of
time in GC-related activities. With lazy unlinking we improve performance by
1.9% overall. The most notable case is Air steadyState, where we
get an improvement of around 5.36%.

AreWeFastYet results

The performance results for the Octane
and ARES-6 benchmark suites also showed up on the AreWeFastYet tracker. We
looked at these performance results on September 5th, 2017, using the provided
default machine (macOS 10.10 64-bit, Mac Pro, shell).

Impact on Node.js

We can also see performance improvements in
the router-benchmark. The following two plots show the number of operations per
second of each tested router; thus, higher is better. We performed two
kinds of experiments with this benchmark suite. First, we ran each test in
isolation, so that we could see the performance improvement independently of
the remaining tests. Second, we ran all tests at once without restarting
the VM, thus simulating an environment where each test is integrated with other
functionality.

For the first experiment, we saw that the router and
express tests perform about twice as many operations as before,
in the same amount of time. For the second experiment, we saw even greater
improvement. In some of the cases, such as routr,
server-router and router, the benchmark performs
approximately 3.80×, 3× and 2× more operations, respectively. This happens
because V8 accumulates more optimized JavaScript functions, test after test.
Thus, whenever executing a given test, if a garbage collection cycle is
triggered, V8 has to visit the optimized functions from the current test and
from the previous ones.

Further Optimization

Now that V8 does not keep the linked list of JavaScript functions in the
context, we can remove the field next from the
JSFunction class. Although this is a simple modification, it allows
us to save the size of a pointer per function, which represents significant
savings on several web pages:

Benchmark      Kind                                Memory savings (absolute)   Memory savings (relative)
facebook.com   Average effective size              170KB                       3.7%
twitter.com    Average size of allocated objects   284KB                       1.2%
cnn.com        Average size of allocated objects   788KB                       1.53%
youtube.com    Average size of allocated objects   129KB                       0.79%

Acknowledgments

Throughout my internship, I had lots of help from several people, who were
always available to answer my many questions. Thus I would like to thank the
following people: Benedikt Meurer, Jaroslav Sevcik, and Michael Starzinger for
discussions on how the Compiler and the Deoptimizer work, Ulan Degenbaev for
helping with the Garbage Collector whenever I broke it, and Mathias Bynens,
Peter Marshall, Camillo Bruni, and Maya Lekova for proofreading this article.

Finally, this article is my last contribution as a Google intern and I would
like to take the opportunity to thank everyone in the V8 team, and especially my
host, Benedikt Meurer, for hosting me and for giving me the opportunity to work
on such an interesting project — I definitely learned a lot and enjoyed my time
at Google!