notes.public

The argument goes that just-in-time (JIT) compilation can be faster than ahead-of-time (AOT) compilation because JIT can specialize based on the common codepaths and runtime data. For the sake of argument, let’s ignore profile-guided optimization (PGO), which nobody likes.

The counter-argument goes that if JIT is so great, why can’t you JIT C code?

It took me a while, but I finally found the answer. It turns out that there is something lighter than JIT, and more adaptive than AOT, and it’s built into every modern CPU: branch prediction!

The argument that CPUs are optimized for C code is both true and false. It’s true that CPUs are optimized for the most common code they run, which historically has been C/C++. However, the deeper truth is that CPUs (especially x86) are optimized as much as possible, period. Those optimizations are just limited by reality (hardware constraints, generality, etc.).

For the most part, C generates logic that is static enough that the CPU’s branch predictor is enough. But in dynamically typed languages, nearly every operation needs a runtime type check, and therefore a branch to predict, which overwhelms the hardware.
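To make the difference concrete, here is a minimal sketch in C of the same addition done statically and dynamically. The `Value` layout and the names `add_static` and `add_dynamic` are made up for illustration; real runtimes use more compact encodings, but the shape of the problem is the same.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tagged value, for illustration only. */
typedef enum { TAG_INT, TAG_DOUBLE } Tag;

typedef struct {
    Tag tag;
    union { int64_t i; double d; } as;
} Value;

static Value make_int(int64_t i)   { Value v; v.tag = TAG_INT;    v.as.i = i; return v; }
static Value make_double(double d) { Value v; v.tag = TAG_DOUBLE; v.as.d = d; return v; }

/* Statically typed add: no branches at all. */
static int64_t add_static(int64_t a, int64_t b) { return a + b; }

/* Dynamically typed add: every call has to branch on the operand
   tags before it can do any arithmetic. */
static Value add_dynamic(Value a, Value b) {
    assert(a.tag == TAG_INT || a.tag == TAG_DOUBLE);
    assert(b.tag == TAG_INT || b.tag == TAG_DOUBLE);
    if (a.tag == TAG_INT && b.tag == TAG_INT)
        return make_int(a.as.i + b.as.i);
    /* Mixed or floating-point operands: promote to double. */
    double x = (a.tag == TAG_INT) ? (double)a.as.i : a.as.d;
    double y = (b.tag == TAG_INT) ? (double)b.as.i : b.as.d;
    return make_double(x + y);
}
```

Multiply those tag checks by every arithmetic operation, property access, and call in a program, and you get a lot more branches competing for the predictor’s limited state.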

The only dynamically typed language that is both commonly AOT compiled and heavily optimized that I’m aware of is Objective-C, which makes it an interesting case study. Of course, ObjC is not commonly considered that fast, even compared to C++ vtables.

The idea here is that, if you had a fast execution model for dynamically typed languages, you could make JavaScript as fast as it is under V8 without 90% of the overhead.

First, imagine an AOT-compiled JavaScript that just used libobjc (the ObjC runtime which handles dynamic dispatch, amongst other things). It’d be almost as fast as Objective-C (but slower due to the lack of inline C and different idioms) and almost as lightweight (but still more bloated, assuming you need a garbage collector).

Next, the real question: is there an even faster execution model? Based on the idea of working with the CPU branch predictor (rather than duplicating its effort) and doing a bit of extra “branch prediction in software,” I think that there is.

What if you wrote an AOT compiler that generated instructions that were designed to be changed at runtime? Basically, in pseudocode:

In this case, Class is probably a class pointer. Op is inline assembly that gets overwritten. The arguments and return value are carefully arranged so that op can be a single CPU instruction (like an add) or a function call (for complicated/custom types).
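The same idea can be approximated in portable C. This is only a sketch under some loud assumptions: a function pointer stored in a per-call-site struct stands in for the patched instruction bytes, so nothing here is actually self-modifying, and `CallSite`, `call_add`, `IntClass`, and `DoubleClass` are names invented for the example. Only the guard-then-op shape and the `change_specialization()` step come from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Class Class;
typedef struct { const Class *cls; union { int64_t i; double d; } as; } Value;
typedef Value (*Op)(Value, Value);

struct Class { const char *name; Op add; };

static Value int_add(Value a, Value b) {
    Value r = { a.cls, { .i = a.as.i + b.as.i } };
    return r;
}

static Value double_add(Value a, Value b) {
    Value r = { a.cls, { .d = a.as.d + b.as.d } };
    return r;
}

static const Class IntClass    = { "Int",    int_add };
static const Class DoubleClass = { "Double", double_add };

/* One per call site. In the real scheme these fields would be
   immediates and instructions embedded in the generated code. */
typedef struct {
    const Class *cls;        /* class this site is currently specialized for */
    Op op;                   /* stand-in for the overwritten instruction(s)  */
    int respecializations;   /* bookkeeping for the example only             */
} CallSite;

/* Slow path: re-point the site at the class we actually saw. */
static void change_specialization(CallSite *site, const Class *seen) {
    site->cls = seen;
    site->op = seen->add;
    site->respecializations++;
}

static Value call_add(CallSite *site, Value a, Value b) {
    assert(a.cls == b.cls);   /* sketch: mixed-class operands are not handled */
    if (a.cls != site->cls)   /* guard: cheap and well predicted when types are stable */
        change_specialization(site, a.cls);
    return site->op(a, b);
}
```

As long as a call site keeps seeing the same class, the guard is a single compare that the hardware predictor handles almost for free, and the slow path only runs when the types actually change.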

From this basic setup, you can do a lot of additional optimizations. For example, you can keep two versions of hot functions: one specialized and one generic. That might be useful because change_specialization() is fairly expensive (requiring two calls to mprotect(2) under W^X, although JITs have that same overhead). You can also do more sophisticated runtime profiling to decide exactly when to optimize/deoptimize like JITs do (but that has its own overhead/complexity, so it might not be worth it).
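Here is a rough sketch of the two-versions idea in C. Everything in it is an assumption made for illustration: the `Value` layout, the `HotFunction` struct, and the `MAX_MISSES` threshold are all invented, and the “specialized” version is ordinary C rather than patched machine code. The point is just the policy: try the specialized version, and after a few guard failures stop trying and stay on the generic one instead of paying for another re-specialization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TAG_INT, TAG_DOUBLE } Tag;
typedef struct { Tag tag; union { int64_t i; double d; } as; } Value;

/* Generic version: copes with any mix of tags on every iteration. */
static double sum_generic(const Value *v, size_t n) {
    double total = 0.0;
    for (size_t k = 0; k < n; k++)
        total += (v[k].tag == TAG_INT) ? (double)v[k].as.i : v[k].as.d;
    return total;
}

/* Specialized version: assumes every element is an int.
   Returns 1 on success, or 0 as soon as the assumption fails. */
static int sum_ints(const Value *v, size_t n, double *out) {
    int64_t total = 0;
    for (size_t k = 0; k < n; k++) {
        if (v[k].tag != TAG_INT)
            return 0;   /* guard failed: bail out to the caller */
        total += v[k].as.i;
    }
    *out = (double)total;
    return 1;
}

enum { MAX_MISSES = 2 };   /* hypothetical threshold, not tuned */

typedef struct { int misses; } HotFunction;

static double sum(HotFunction *f, const Value *v, size_t n) {
    assert(f != NULL);
    double out;
    if (f->misses < MAX_MISSES) {
        if (sum_ints(v, n, &out))
            return out;
        f->misses++;   /* remember that specialization did not pay off */
    }
    /* Either the guard failed or this function has been demoted. */
    return sum_generic(v, n);
}
```

A real implementation would presumably want some way to promote the function back to the specialized version, which is where the more sophisticated profiling comes in.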

Assuming all goes well, at this point you have a compiled, dynamically typed language that’s basically slightly slower than Golang (or maybe faster when you’re doing lots of dynamic dispatch). If you can replace the garbage collector with reference counting (ARC), you can eliminate the memory bloat too (but you have to deal with cycles… how much overhead does Python’s cycle collector have?).

A statically compiled JavaScript would not be much use in web browsers (unless you target WebAssembly, but that’s pretty ballsy and I don’t know if it lets you do self-modifying code), but it’d be great for Node.js and Electron.

And of course this execution model would work for any dynamic language, including Python, Objective-C, Lisp, etc.