JavaScript Optimization Patterns (Part 2)

Following up on part one of this series last week, here's another (hopefully interesting) episode about optimization patterns for JavaScript, based on my background of working on the V8 engine for more than four years. This week we're going to look into an optimization called Function Context Specialization, which we introduced to V8 with TurboFan (other engines like JavaScriptCore implement similar optimizations). The name is a bit misleading: what it essentially does is allow TurboFan to constant-fold certain values when generating optimized code, and it does that by specializing the generated machine code for a function to its surrounding context (which is V8 speak for the runtime representation of a scope).

Consider the following simple code snippet:

const INCREMENT = 1;

function incr(x) {
  return x + INCREMENT;
}

Assume that we run this at <script> level in Chrome (or at the top level of the d8 shell); then we see the following bytecode generated for the function incr:
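Roughly, the interesting part of the bytecode for the body of incr looks like this (a sketch, not the actual listing; exact offsets and operands differ, and you can see the real thing with d8's --print-bytecode flag):

LdaImmutableCurrentContextSlot ...   ; load INCREMENT from the surrounding (script) context
ThrowReferenceErrorIfHole ...        ; TDZ check: throw if the slot still holds the_hole
Add ...                              ; x + INCREMENT
Return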

The interesting bit here is the access to the constant INCREMENT in script scope: it is loaded from the surrounding context via the LdaImmutableCurrentContextSlot bytecode and then immediately checked for what we call the_hole in V8, an internal marker that is used to implement the temporal dead zone for lexical scoping (see Variables and scoping in ECMAScript 6 by Axel Rauschmayer for details on this). This is a bit counter-intuitive to many developers I talk to, as the intuition is that the VM needs to do less work for const than for var, especially inside local scopes, but the reality is that, at least initially, the VM needs to do even more work because of the additional TDZ (temporal dead zone) check. This check is necessary because of the way scoping works; let's look at ex2.js:
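Roughly, ex2.js looks like this (a reconstruction based on the discussion below, not the exact original file):

function incr(x) {
  return x + INCREMENT;
}

console.log(incr(5));   // ReferenceError: INCREMENT is still in its temporal dead zone

const INCREMENT = 1;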

What happens here is that the TDZ check fails, because the assignment const INCREMENT = 1 hadn't been executed before incr was run. I have to admit that even though I've been working on the VM side of this for quite a while, I still find this behavior highly counter-intuitive, but then I also don't consider myself a very good language designer... Ok, ranting aside. Looking at the example again, it obviously works if you put the call to incr last:
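That reordered version would look like this (again a sketch):

function incr(x) {
  return x + INCREMENT;
}

const INCREMENT = 1;

console.log(incr(5));   // prints 6; INCREMENT is initialized by the time incr runs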

Performance-wise there's one very interesting (and maybe obvious) observation here: once a particular const slot in a context is assigned, it will keep that value and will never contain the_hole again (that's what const guarantees). And we use exactly this fact in TurboFan to avoid loading and checking const slot values each time.
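The optimized machine code discussed below can be generated with a small warm-up script along these lines (a sketch; the file name is made up, but --allow-natives-syntax, --print-opt-code and %OptimizeFunctionOnNextCall are standard V8 debugging facilities, e.g. d8 --allow-natives-syntax --print-opt-code incr.js):

const INCREMENT = 1;

function incr(x) {
  return x + INCREMENT;
}

incr(3);                            // warm up with integer values for x
incr(4);
%OptimizeFunctionOnNextCall(incr);  // request TurboFan optimization (needs --allow-natives-syntax)
incr(5);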

The only really interesting line here is the one at offset 26 with the instruction addl rbx,0x1, where rbx contains the integer value of the parameter x passed to the function (based on the fact that we warmed up incr with integer values for x before), and the 0x1 is the constant-folded value of the INCREMENT constant from the surrounding context. The constant-folding in this case is only valid because TurboFan knows that no one can change the value of INCREMENT anymore once it's no longer the_hole (i.e. outside the TDZ). Actually it's not TurboFan that figures this out; the Ignition interpreter forwards this information to TurboFan via the dedicated bytecode LdaImmutableCurrentContextSlot that we saw earlier. Specifically, it's the immutable bit in this bytecode that tells TurboFan that the context slot cannot change anymore once it contains a value other than the_hole. We can see the difference when we try the same example with let:
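The let variant is simply (presumably this is ex5.js, which we'll come back to later):

let INCREMENT = 1;

function incr(x) {
  return x + INCREMENT;
}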

Here we see that Ignition has to use LdaCurrentContextSlot, i.e. it cannot prove that the value of INCREMENT won't change afterwards, because every other script could just modify INCREMENT later. As such, TurboFan cannot constant-fold the value 1, but instead has to generate explicit code to load INCREMENT from the script context and check that it's not the_hole (the code between offsets 17 and 2f in the listing above does that).
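For example, a later <script> (or a later line typed into the d8 shell) is perfectly free to do this, which is why the old value of a let binding cannot be baked into the optimized code:

INCREMENT = 2;   // legal for a let binding
incr(40);        // must now return 42, not 41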

So in this sense, const is a performance feature, but only once it reaches the optimizing compiler and only if Function Context Specialization kicks in, which depends on a rather simple condition that might not be obvious: it's only enabled for the first closure of any function in a given native context (which is V8 speak for <iframe>). So what does that mean? In the examples above there was always only a single closure of incr. But let's consider this simple counter-example ex6.js:
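A reconstruction along these lines matches the description (the exact details may differ from the original ex6.js):

function makeIncr() {
  const INCREMENT = 1;
  return function incr(x) {
    return x + INCREMENT;
  };
}

const incr1 = makeIncr();   // first closure of incr
const incr2 = makeIncr();   // second closure of incr

// ...warm up and optimize both closures, e.g. via %OptimizeFunctionOnNextCall...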

It's definitely a bit artificial, but it highlights the key takeaway: there are now multiple closures for the same function incr, generated by makeIncr. Running this in d8 reveals what I just described:

Ignition sticks an LdaImmutableCurrentContextSlot bytecode in there, because it's a const context slot, but Function Context Specialization only kicks in for the first closure. The second closure gets new optimized code, which is not specialized. The reason behind this is that if you have more than one closure per function, we would like to share the code between the different closures, since it would be a waste of resources, both time and memory, to generate one code object per closure, especially if you use arrow functions with higher-order builtins like, for example,

let b = a.map(x => x + 1);

where you don't want to have the optimizing compiler run every time you execute this line just to generate a specialized code object for x => x + 1. So the rule here is simple:

You only get Function Context Specialization for the first closure of every function in any given <iframe> (native context in V8 speak).

The native context part doesn't apply to Node, as there you only have a single native context, except when you use the vm module.
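As a rough illustration (a sketch, not from the original examples), the vm module lets you run code in additional contexts, and the "first closure" rule then applies per context rather than per process:

const vm = require('vm');

const sandbox = vm.createContext({});

// Code evaluated here runs inside its own native context.
vm.runInContext(
  'const INCREMENT = 1; function incr(x) { return x + INCREMENT; } incr(1);',
  sandbox
);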

Now considering that class is like let, i.e. it's a mutable binding (again for reasons that I don't quite buy), you don't necessarily benefit from Function Context Specialization when using classes. Let's consider ex7.js:
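Based on the discussion below, ex7.js looks roughly like this (a reconstruction):

class A {
  constructor(x) {
    this.x = x;
  }
}

function makeA(x) {
  return new A(x);
}

// ...warm up makeA with a few calls and trigger optimization...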

What's interesting to see here is that the constructor for A is properly inlined into makeA in the optimized code, and we essentially just stamp out instances of A with the best possible code, except for the additional checks that we need to perform because TurboFan doesn't know that A cannot change (in fact A can change at any moment, since it's a mutable binding). So all the code between offset 17 and offset 2f loads the context slot for A and checks that it's not the_hole, and the next two lines check that it's actually the JSFunction A that we saw earlier (during warmup). As you can see, TurboFan nevertheless tries hard to generate pretty decent code. But you can help it further by using const here as well:
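One way to do that (a sketch; the exact original listing may differ) is to bind the class to a const:

const A = class {
  constructor(x) {
    this.x = x;
  }
};

function makeA(x) {
  return new A(x);
}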

This is the perfect x64 machine code for makeA: there are no redundant checks left in this code (the two checks that remain are the stack check, to ensure that V8 doesn't overflow the execution stack, and the bump pointer check, to trigger garbage collection when new space fills up).

So far the only way to get LdaImmutableCurrentContextSlot instead of LdaCurrentContextSlot was by using const. But this was because I was only demonstrating code operating on lexically bound names at script level (or top level in d8). If we go back to the simple let example in ex5.js and run that in Node 9 (or 8.2.0-rc1), we see that INCREMENT gets constant-folded despite using let:

We see that this is the ideal code. The reason for this is the CommonJS module system used by Node: every module is implicitly wrapped in a function. So ex7.js in Node corresponds roughly to the following code in Chrome or d8:
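A rough reconstruction, using Node's standard CommonJS wrapper arguments:

(function (exports, require, module, __filename, __dirname) {
  class A {
    constructor(x) {
      this.x = x;
    }
  }

  function makeA(x) {
    return new A(x);
  }

  // ...the rest of the module...
})();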

This is simplified (as I don't want to explain webpack as well here). What's interesting here is that A is local to the anonymous closure, and thus the parser can actually prove that A never changes after the initial definition, because no code outside the closure can see (and touch) the binding A. So Ignition sticks an LdaImmutableCurrentContextSlot in there and TurboFan can generate awesome code for makeA.