Chris Leary

You can think of a dynamic language code as a template. This code has generality because of duck typing; the idea is roughly, "If the code can work on a value, it will, with no muss or fuss."

Some programmers find this to be an empowering form of "do what I say" — you don't need to explain everything you want your code to do in terms that even an optimizing compiler can understand.

In traditional dynamic language VMs, this template is instantiated with concrete value types at runtime. When a value flows into a bit of code, the interpreter inspects the type of the value and then runs the appropriate "instantiation of the template".

Example

Consider the function:

function bitAndAdd(a, b, c) {
return a + (b & c);
}

In its maximally generic, template-looking form it reads like this pseudo bytecode:

In the JS engine, the T-prefixed types can be any in the set {null, undefined, boolean, int32, double, object, string} — these are the engine internal value tags.

In a bytecode-based interpreter, each bytecode typically asks, "Is this thing that I'm acting on an int32? Or double? Or string? Or ...? Or user defined type?" [*] Once that's determined, it runs the appropriate sub-section of the bytecode implementation for that type. The question-asking part of the bytecode is, in effect, directing you to the appropriate template instantiation. (Note that bytecodes are kind of like mini-templates composed into the larger function template.)

When you call bitAndAdd<int32, int32, int32>, you get an instantiated code path at runtime that looks like:

This path represents all of the machine code executed in the interpreter that does useful work for this function. Imagine the bytecodes that make up the function as a bunch of enum values in an array. For each enum value in the array, the interpreter dispatches to an type-unspecialized bytecode implementation. Then, the unspecialized bytecode implementation inspects the input types and dispatches to the specialized implementation.

A decent example is the ADD opcode in JavaScript. The unspecialized version of ADD looks something like:

Specializing the template

Of course, with JIT technology, we try to be a lot smarter about making the instantiation efficient. There are a few key mechanisms:

Avoiding bytecode dispatch

The composition of mini-templates (bytecodes) into the larger program template (made out of bytecodes) causes "jump to the next bytecode in the program" cost in an interpreter loop. By creating a machine-code implementation for the sequence of bytecodes that make up a function, we naturally cause machine-code duplication, but avoid that interpretation overhead. This is a natural consequence of choosing to JIT compile a function.

If the JIT sees that you're performing a bitwise-and, and the spec says that the result of a bitwise-and is always an int32, the JIT can safely assume that fact downstream in the code; for example, in:

function byteSquare(x) {
var a = x &0xff;
var b = a * a;
return b;
}

The inputs to the a * a multiply are int32s, without question. There's no reason to instantiate the machine code for other types when you know it's going to be a int32.

Flowing conservative type information through all the code in the system tells you key interprocedural facts. For example, just by looking at your program it may be easy to see that the divmod(dividend, divisor) function is only ever called with two constant integers as arguments.

With that as the basic idea, the analysis can get a lot fancier; for example, analyses can see that you're only storing new Car() objects into the vehicle property of your Person records, and that new Car() will always have an int32 value in its mufflerCount attribute.

By considering everything[†] that the template could possibly do, we can still narrow down the possibilities in worthwhile cases. This less us instantiate less code and know more facts for use in optimization.

When a loop is determined to be important (i.e. it hits a certain execution count) a recorder notes which bytecodes execute in the interpreter and what types flow into them. This avoids both bytecode dispatch overhead and type ambiguity, creating a very highly specialized template.

However, if the types ever stop matching up with the observed bytecode execution — say, myObject.quotient used to be an int32, but when we ran the trace a subsequent time it turned out to be a double — the execution of that trace cannot continue, because it's instantiated for the wrong types downstream. Because of this possibility, a bunch of checks for type compatibility have to be inserted in the template instantiation.

It's important to note that the generated template is more than just type dependent — it can also be value depdendent! Because a trace only records a single execution path and questions like "Do I execute the if body or the else body?" can be determined by the values held by a given type.

If you think of trace recording as observing bytecode for a single run of execution, profiling is observing bytecode for n runs of execution [‡] and accumulating the notes. This eases the pain of compiling "flip floppy" code, because you don't have to see the primarily taken path the first time, you have n tries to get a feel for it.

This process lets you instantiate a template that's general enough to handle all the things you've seen over the n iterations you've observed.

While trace compilation culls out all but one path of control flow, profiling data can be used to cull out all unobserved (and presumably cold) control flow as well. This may increase the effectiveness and decrease the runtime of dataflow-analysis-based optimizations, because there's less unimportant code to consider. I'm not aware of existing profiling JITs that do this, however, which indicates to me that it may not be an overall win.

Further discussion

Dynamic language compilers try to use runtime feedback as well as things it can infer about the code to specialize the template in the best way possible and make generated machine code blazingly fast. We want to be able to mix, "things we definitely know" (from analysis) with "things we are pretty sure we know" (from runtime feedback) in order to get the maximum speedup possible.

One exciting thing about Just-In-Time is that this runtime feedback makes it at least theoretically possible to exceed the speed of a corresponding static program; however, there are signficant challenges involved. Whenever we start leaning too heavily on "things we are pretty sure we know," we need to be able to keep the world from falling apart when we're wrong.

It's actually a fairly emotional topic in compiler engineering: how do you hold the world together when everything you thought you knew is wrong?

Steel your mind for a tale of intrigue, intertwingled with a complex topic in browser security. (It's kind of all over the place, but I might spray something useful.)

Our story, like so many others, starts out with a browser user like yourself, a bottle of red wine, and a devoted young hacker from the Eastern Bloc [*] that answers to the handle "Coleslaw". [†]

Winey-and-Cheesy Corporation, the largest international wine and cheese distributor, has just blitzkrieg bopped the mainstream media over the head with a tactical PR campaign — a free case of wine and sizable wheel of Gouda for the five millionth visitor to their website. [‡]

The only problem is that Winey-and-Cheesy's massively trafficked website... has been owned.

Coleslaw is something of a wunderkind, and has, through feats of social engineering and technical prowess paralleled only by terrible movies from the mid 90s, gained the ability to insert some arbitrary, special-sauce HTML and JavaScript into that promotional page.

Coleslaw intends to perform a "zero-day attack" — this means that there's a bug in the browser that Coleslaw knows about, but that the browser vendors are unaware of. Coleslaw thinks that this bug can be used to take over the machines of some unsuspecting users who visit the promotional page, capitalizing on their maniacal love of fine dining.

The attacker's dilemma

So, to recap, Coleslaw has found a bug in the browser. Coleslaw wants to exploit that bug in order to obtain arbitrary code execution — the ability to run whatever code Coleslaw feels like on the machine that's running the vulnerable browser. The question is, how does Coleslaw get from point A, "I see that there's a bug", to point B, "I can run anything I want on the vulnerable machine"? The process of figuring this out is called exploit development.

The exploit development process is a narrative about control. Coleslaw starts off by having control over a small set of things — the JavaScript and HTML on a page that the browser visits — but wants to end up controlling everything that the user controls on the vulnerable machine. The environment that internet sites have access to is supposed to be sandboxed; i.e. internet sites are expected to have a somewhat limited and carefully selected set of things it can control. For example, web sites that you happen to stumble across shouldn't be able to delete files off of your hard drive.

Strongly-related to this narrative about control is the concept of determinism. If Coleslaw has a concrete understanding that performing some action, like calling alert from JavaScript, always results in some consequence, like creating and displaying an alert window in the browser, then Coleslaw has effectively extended the realm of control from JavaScript to triggering-the-code-in-the browser-that-displays-an-alert-dialog. Barring bugs, the realm of control is always confined to the sandbox — the set of possible actions are those that the browser vendor permits an untrusted website to take.

Not all bugs are created equal

There are lots of different kinds of bugs that browser software can have. There's a relatively tiny set of bugs that permit control flow hijacking, which are generally of interest for gaining arbitrary code execution. Successful hijacking implies that you have the ability to control the address of the instruction being executed, which is commonly referred to as pseudo-register %eip (where ip is short for instruction pointer). With full control of %eip, the attacker can point it at any executable code — possibly at executable code that they've created.

Control flow hijacking is typically accomplished through some kind of memory corruption, stemming from errors in the use of type-unsafe programming constructs in the browser. In general, the bugs of interest for control flow hijacking are:

Memory writes that can be used to clobber vtable pointer or function pointer values. The attacker may have control over the location of the memory write, the value being written, or both.

Buffer overruns that can be used to manipulate values that are ultimately used to determine code to run. The classic example of this is clobbering return addresses present on the C stack. [§]

There's also the possibility of using an attacker-controllable invalid memory read bug to cause an invalid write to happen further along in program execution. Bugs that cause segfaults are carefully evaluated by browser security teams to see if the invalid memory access being performed can be manipulated for use in control flow hijacking.

Platform-level mitigations: DEP, ASLR, and canaries

There are some nifty platform-level protections against traditional control flow hijacking techniques. They make both taking control of %eip and executing an attacker controlled code sequence more difficult.

One control-flow hijacking mitigation is stack smashing protection, which is enabled at compile time using a technique referred to as "canary values". An attacker could historically use stack buffer overruns to clobber the return address in a function frame with a target %eip value, and the ret instruction at the end of the function's machine code would return to that new (attacker controlled) address value. With this mitigation enabled, however, the compiler places a special value on the stack between local variables (where the buffer lives) and the return value. The compiler also augments the function body with pre-return function prologue code that checks the canary value on the stack against its original value. If a stack buffer overrun causes the return value to be overwritten, the canary that lives in the contiguous space between the locals and return value should indicate that things have gone horribly wrong.

Generally, we tend to think of executables as containing all their executable code as static machine-code. Other than the code that the compiler spat out as specific sections of the executable, nothing else should run over the course of the program's execution. This expectation is codified in an OS-level mitigation called Data Execution Prevention (DEP).

The goal of DEP is to prevent things which are not code from being executed as code at runtime. Your program stack, for example, is just a bunch of space for data that your C function frames can't keep in registers. There's basically no reason that a sane program would ever want to start executing the stack area of memory like it were code — if something like that were to happen, it would be better if your program just terminated, because it could be the pivotal point before an attacker like Coleslaw takes control. Program termination means loss of control for the attacker.

Trying to execute code that was not in the original binary will generally cause the program to fault. In a JIT, however, we purposefully create code at runtime, violating the all-the-code-is-in-the-binary assumption. As a result, we have to explicitly mark the machine code that we create as executable by calling to an operating system API function, like VirtualProtect or mprotect, to indicate that the data the process has created should really be executable.

DEP's close friend from acronym club is Address Space Layout Randomization (ASLR). ASLR reduces determinism in the process that the attacker is trying to exploit by randomizing the stack address, library loading addres, heap address, and PEB address, amongst other key program components. With this mitigation, hardcoded constant addresses in attacker crafted code become probabilistically unlikely to succeed at hitting their target. As an example, the start of the program stack could wind up being placed at one of 16 thousand locations!

This also means that the address of system DLLs, like the ones containing OS API functions like VirtualProtect and C library functions like system, are probabilistically unknown to the attacker. Since the browser ships linked with all ASLR-enabled DLLs, it's difficult to use linked DLL code as direct footholds in process space.

Coleslaw wants to run an attacker-controlled code payload, but DEP makes it difficult to execute that payload, since it won't be marked as executable by default.

Coleslaw wants to be able to turn the bug that relinquishes control of %eip into a reliable exploit, but ASLR makes it difficult to know where to point %eip in order to run exploit code.

I imagine that turning a crash into an exploit isn't trivial these days.

Staged shellcode payloads

The machine-code payloads that attackers create are referred to as shellcode. Shellcode is generally characterized by its size and its goal, which is usually reflected by the "stage" it's said to be running. For example, the very first shellcode to run, in computer science style, is referred to as "stage 0".

Intermediate stages of shellcode are often used to bootstrap more complex executable code sequences. The complexity involved in turning a bug into an exploit often prevents arbitrarily complex code sequences from executing immediately, so tinier code sequences are written that just delegate responsibility to a more easily formed executable payload. Constraints that apply to the code that the exploit starts running directly tend to disappear after you've gone through some amount of indirection.

Shellcode can easily embed astoundingly small code sequences called "egg hunters" to find the memory address of other attacker-controlled payloads. The egg hunters are designed to avoid faulting the application, because faults cause the attacker to lose control. They work by performing a series of fast-and-minimally-sized system calls to determine whether a virtual memory page is safe to traverse through and read to find the "egg" payload delimiter.

Once the address of a stage 1 data payload is determined, stage 0 shellcode may attempt to make that segment of memory executable. Despite ASLR, the address of the VirtualProtect function can be derived by hopping from the known TEB address to the PEB address to the DLL loader address mapping table. Once executable permissions have been added to the stage 1 shellcode, it can simply be jumped to.

Another alternative, if the stage 0 shellcode is executing out of a code space with both writable and executable permissions and sufficient available space, is to use what's called a "GetPC" shellcode sequence to determine the current value of %eip and then copy the contents of a stage 1 shellcode payload buffer into the current code space.

For some bugs it may be possible to create "common" stage 0 shellcode to bootstrap any other shellcode payload — this common shellcode is a valuable commodity for exploit toolkits.

JIT spray, deconstructed

As mentioned earlier, the JIT has to mark its own assembly buffers as executable. An attacker may look at using that fact to generate executable stage 0 shellcode, in order to bypass some of the pain inflicted by DEP. But how could you possibly use JIT compilation process to make shellcode?

JIT spraying is the process of coercing the JIT engine to write many executable pages with embedded shellcode.

Check out the first line — it's showing that the first instruction is a MOV that places the 32-bit immediate payload into the EAX register. The 32-bit immediate payload from that instruction (3C54D0D9) is exactly the immediate that was used as the left-hand-side to the long XOR sequence in the original ActionScript code.

Now, if we look at the subsequent lines, we see that the addr column, which is showing the address of instructions relative to the start of the sequence, goes up by 5 every time. That's because each instruction after the initial MOV is performing an XOR against the original value in the accumulator register, EAX, exactly as the ActionScript program described.

Each of these instructions is exactly five bytes long — each instruction has a one-byte opcode prefix, given under the op column, followed by a 32 bit immediate constant: the opcode for MOV EAX,[imm32] is 0xB8, and the opcode sequence for XOR EAX,[imm32] is 0x35.

The immediate column may look confusing at a glance, but it's actually just the little-endian equivalent of the 32-bit immediate given in the assembly: the "little end" (least significant byte) goes "in" (at the lowest memory address), which is why the byte order looks flipped around from the one given in the assembly (and in the original ActionScript program).

It may not look so sinister, but the above table is actually deceiving you! In the table, all of the instructions are the same number of bytes (5) in length. On x86 CPUs, however, instructions are actually a variable number of bytes in length: instructions can be as small as a single byte, but can get quite long: the nop instruction is just a 0x90 opcode byte with no operands, whereas the movl $0xdeadbeef, 0x12345678(%ebx,%edx,1) instruction is significantly larger. [#]

As a result, when we look at this instruction sequence "crooked" — with a one-byte skew to the address — we decode a totally different sequence of instructions. I'll show you what I mean.

If you look down the middle part of the two disassemblies, before the assembly mnemonics, you can read that the bytes are the same from left to right: the first line of the first disassemblies goes b8 d9 d0 54 3c, and the second disassembly starts on the second byte of that same sequence with d9 d0 54 3c, straddling multiple instructions. This is the magic of variable length instruction encoding: when you look at an instruction stream a little bit sideways, things can change very drastically.

Yo dawg, I heard you like X86 assembly...

It's not obvious, at first glance, just how clever this technique is.

The goal of the ActionScript code pattern is for the attacker to insert arbitrary bytes into the code stream that the JIT otherwise generates. The attacker then uses these arbitrary bytes as an alternate instruction stream. However, the attacker has to compensate for the non-attacker-controlled bytes that surround its own.

Each 32-bit immediate encoded in the ActionScript program starts with a MSB of 0x3c — that byte is little-endian encoded and placed, in memory, right before each of the 0x35s that represent the XOR EAX,[imm32] opcode.

Jumping to the 1-byte offset from the base address of the instruction stream starts us off executing 0xd9 0xd0 — a two-byte instruction that runs a no-op on the floating point unit. Both of these bytes were part of the attacker's immediate value: 0x3c54d0d9.

Effectively, the attacker is able to control 4 out of every 5 bytes per instruction in the stream. They are somewhat limited by the bytes fixed in the instruction stream, however — the MSB of each immediate is a 0x3c so that it can successfully combine with the 0x35 from the XOR EAX,[imm32] opcode to create a nop-like instruction, cmp al,0x35, that keeps the stream executing at the 1-byte offset.

It'd be ideal for the attacker if they could find a way to incorporate the 0x35 into an instruction in a useful way, instead of having to lose a byte in order to control it; however, there are lots of fun tricks that you can play to make compact instruction sequences. By making use of the stacky subset of x86 you can get a nice little MISCy program: pushes and pops are nice one-byte instructions that you can split across the semantic nops to simulate moves, and pushing 8-bit signed immediates only takes two bytes, as you can see at buf+11. Dumping your floating point coprocessor state out to the stack is a two byte sequence. Accessing the TEB is a three byte sequence. How can you not love x86?

For this particular code sequence, the attacker only has a 1/5 chance of jumping to an %eip that gives control back to the JIT program — if you land anywhere in the constant-encoded portion, the instruction sequence will be entirely different.

Outstanding issues

So now we know the basic requirements for pulling off a JIT spray attack:

Deterministic attacker control of values embedded in the instruction stream

Control of %eip

The ability to jump somewhere inside the JIT code, in order to probabilistically execute the attacker's interleaved instruction stream

But wait, how do you know where to jump?

JIT spray opens up the possibility for an attacker to create a lot of very similar code via the JIT compiler, possibly with nop sled prefixes. As a result, one approach to bypassing both DEP and ASLR is to fill enough of the address space with JIT code that you can jump to a random location and hit an attacker-controlled portion valid JIT code buffer with reasonable probability.

But this leads to further questions: What address does the attacker pick to jump to? How much code memory does the attacker spray? Creating a reliable exploit seems significantly more difficult.

Blazakis' solution

In order to create a reliable exploit — as opposed to a probabilistic one — Blazakis used the techniques of pointer inferencing and heap feng shui.

The sandbox makes it particularly tricky to figure out where things live in memory. Those kinds of details definitely aren't supposed to be exposed through the sandbox. If the attacker were able to figure out the locations of things in memory space through the sandbox, it would be considered an information leak.

Pointer inferencing is the technique that Blazakis used to accurately determine the memory location of heapified ActionScript entities in the Flash VM. The inferencing described in Blazakis' paper is cleverly based on the fact that literal integer values in the Flash virtual machines are [♠] hashed alongside of heap-object pointers. By observing the default dictionary enumeration order — the order in which keys exist in the hash table — Blazakis was able to narrow down the value of the object pointer to its exact location. [♥]

"Heap feng shui" is the process of understanding the memory allocation behaviors of the sandboxed environment that code is running in, and using that knowledge to place objects in some known locations in memory relative to each other. Blazakis noted that the ActionScript object heap expands in 16MiB increments and took into account the heuristics for executable allocations when loading ActionScript bytecode entities. Blazakis also relied on the usage of VirtualAlloc in the ActionScript memory allocator, with the knowledge that VirtualAlloc maps the first 64KiB aligned hole that's found in a linear scan through the virtual address space.

Blazakis was able to combine these techniques into reliable stage 0 shellcode execution by:

Determining the exact pointer of the first object within a 16MiB heap chunk.

Spraying just enough JIT code to place a JIT code allocation right after that 16MiB chunk.

Determining the JIT spray address to be the object address + 16MiB.

Adding a value like 0x101 to the base address to get an unaligned JIT code location, as described in the JIT spray section above.

Jumping to that resulting address.

Back to the story: the law of large numbers

So, Coleslaw intends to use a multi-step process:

Find bug that permits control flow hijacking

Perform JIT spray

Jump to probabilistic address for stage 0 shellcode

Importance of leaked information about the memory map becomes apparent here: it prevents you from doing a JIT-spray and jump-pray. However, given enough visitors, like the five million to Winey-and-Cheesy's giveaway, we have to start calculating expected values. As mitigations are added to lower the probability of success, we can see the expected value of ownage drop as well.

What will become of our anti-hero?

Will Coleslaw be able to use JIT spray to successfully exploit the browser?

Will pesky JavaScript engine hackers join forces with The League of Extraordinary Security Experts in an attempt to foil Coleslaw's plans with epic mitigations?

What will become of the wheel of Gouda? I hear that they don't keep for very long...

All this, and more, next week — same time, same channel — on Honest to a Segfault!

I figured if the PyPy folks — the folks who wrote a language system in which you can write interpreters such that those interpreters can be traced by construction — aren't sure what the JS team is up to, then some clarification couldn't hurt. Many thanks to Alex Gaynor for taking the time to chat about good topics for further discussion.

SpiderMonkey: primordial ooze

Back in the day, there was only one monkey.

The name SpiderMonkey was the very first monkey in the codebase, a name that was picked back when Brendan created the JavaScript engine implementation for Netscape 2. [*] I was told that the name was chosen because the spider monkey is the fastest of the monkeys, but I'm having trouble citing a reputable source — I feel the Internet Archive geocities page for user SpiderMonkeysRule3375 may have a tinge of bias.

Anyways, in the unimonkey era, it was clear that SpiderMonkey referred to "that chunk of code that runs JavaScript." That chunk of code had an interpreter component inside of it. Back in those days people said things like, "Duh, JavaScript is an interpreted language. Now pass me that anti-skip enabled CD-player because I just got the new Kris Kross album." Meanwhile, Chambers, Ungar, and Hölzle were getting so-called "interpreted" languages like Self compiling within 5x performance of optimized C code, [†] but that's a whole 'nother rant...

SpiderMonkey's interpreter-based development continued for many years. Then, with the release of Firefox 3.5, in a move generally regarded as epic, the JS engine folks decided to ship a trace compiler implementation to 90 million active daily users. The addition of this Just-In-Time (JIT) compilation capability was a large evolutionary step in the codebase, so in 2008 that evolutionary step in the Mozilla JS engine implementation became known as TraceMonkey.

So, to recap: the JS engine was still called SpiderMonkey, but the evolutionary step was called TraceMonkey. Think of it like a release version.

TraceMonkey: monkey two in the monkey sea

So, with the advent of TraceMonkey, the JS engine had two "execution modes": the interpreter and the tracer. Each "execution mode" has its own implementation of the JavaScript bytecode execution:

The interpreter is a piece of C++ code that loops over the bytecodes and executes them one by one.

The tracer is a piece of C++ code that observes the bytecodes being run by the interpreter and emits its own implementation of those bytecodes into the tracer backend.

The tracer uses a compiler backend called NanoJIT. The observation that the tracer does, called recording, actually builds the NanoJIT compiler's primary data structure, known in compiler jargon as its intermediate representation (IR). When the tracer is done observing, it asks NanoJIT to turn the data structure that it built into native assembly code. See the MDC docs on the Tracing JIT for more info.

Along the way, NanoJIT does some of the alphabet soup compiler optimizations on this data structure: Common Subexpression Elimination (CSE), Dead Code Elimination (DCE), and local register allocation. More recently, with efforts on the part of nnethercote, it's capable of Loop Invariant Code Motion and distinction between alias groups. As I understand it, the tracer's NanoJIT compiler pipeline configuration does an incremental forwards pass as you emit instructions into the buffer, and then performs a subsequent backwards pass that does something I've heard called "destination driven code generation". Sorry that I'm short on details, but I haven't worked on NanoJIT.

NanoJIT's IR form is called linear Static Single Assignment (SSA). [‡] The term "SSA" is a piece of compiler jargon that sounds scary and inaccessible, but it's very simple when you whittle it down. If you think of instructions as producing values instead of writing to variables, then the sources of your instructions become other instructions. It's generally easier for a backend, like NanoJIT, to analyze instructions in the IR when they avoid sticking their results in a shared space (like a variable). Unfortunately, a full treatment of SSA form and why it's useful for optimizing compilers is future entry fodder.

I believe the term "linear SSA" refers to the fact that the tracer records a strictly linear sequence of instructions known as the trace. This recording process uses the instruction-as-producing-value model in building its data structure. The IR is qualified as "linear" because, in a linear sequence of instructions, there are no "joins", like you would deal with when compiling conditionals or loop headers.

There's been talk of adding a "join" instruction to the NanoJIT IR (jargon: phi instruction), which would permit the tracer to form simple branching operations in a trace; for example, computing the max of two numbers, coercing a number value that may be an integer to a double, or inlining a simple/tiny "else" block (jargon: alternate) of a conditional. From a cursory look at the NanoJIT source I don't think that's happened yet.

JägerMonkey: brushin' the monkey-fangs with a bottle of Jack

The tracer was optimized and stabilized through Firefox 3.6. Post FF 3.6, slightly before I arrived on the scene, the JS team decided that a baseline JIT should be implemented to increase performance in situations where the tracer was suboptimal — either because the record-and-compile process was taking too long to or because of trace instability.

In 2010, the team created JägerMonkey, the next evolutionary step in the SpiderMonkey engine. JägerMonkey added a whole-method compiler with miniscule compile times.

At this point we had three execution modes:

The interpreter

Trace-JIT compiled code

Method-JIT compiled code

Conveniently, each of these modes corresponded to an evolutionary step in the engine, so the components were colloquially named:

The JägerMonkey method compiler

The TraceMonkey trace compiler

The SpiderMonkey interpreter

Even though the components are referred to as such, the JS engine as a whole is named SpiderMonkey.

The JägerMonkey compiler loops over the bytecodes of a method, performing abstract interpretation to turn the stack-based bytecode program into register-based assembly. During abstract interpretation, the compiler performs a suite of local optimizations: copy propagation, constant propagation, arithmetic expression folding, and greedy local register allocation. However, because the JägerMonkey compiler is designed, in part, for efficient speed of compilation, it doesn't build an IR and doesn't perform any "global" (cross-basic-block) optimizations. In contrast to the tracer, as a whole-method JIT it has to deal with the full gamut of control flow, so it ends up dropping most of its optimization information at join points.

The method compiler also works on the JS execution stack directly, instead of forming its own side-stack for executing on "unboxed" values, like our trace-compiled code does. The performance impact of "falling off trace" and figuring out how to restore (jargon: reify) the latent JS stack state was one of the things that the baseline JägerMonkey compiler was designed to avoid.

IonMonkey: highly focused monkey energy

The most recent evolutionary step for the engine has been dubbed "IonMonkey". The IonMonkey compiler introduces a bushel of ideas we haven't seen in the SpiderMonkey engine before. It's designed as a whole-method compiler, but IonMonkey knows quite a few tricks that the JägerMonkey compiler was never taught.

The primary purpose of IonMonkey is perform whole-method (jargon: global) compiler optimizations on code with join points (jargon: regions), like loops and conditionals, using a full-fledged IR. Generally, this makes it a more powerful, modern compiler design, incorporating some of the "classical" compiler algorithms that you have in close-to-the-metal compiler suites like LLVM and GCC, except specialized to represent and work on higher-level JS semantics with minimal overhead.

Type specialization

You can significantly optimize JavaScript programs by specializing the generated code for the types of objects that actually flow through it. The IonMonkey compiler is designed to use type information from either static type-inference analysis or runtime type profiling to specialize the code it compiles. Each of these type-information gathering approaches has its own merits.

Type data collected over several runtime code executions is a promising alternative to the single-recording trace formation that we currently use in the tracer. For code that's identified as hot, runtime type profiling collects, over the course of several executions, the set of types observed at each program point. This set of types is used to bias the generated code — the compiler assumes the types runtime profiling has observed are the ones that matter, and guards that that assumption is correct in the generated code with branch instructions.

Static type inference, on the other hand, is known to produce a (conservatively) correct set of types that can flow through each point in the program — this is what's called a sound whole-program analysis. When it's used to complement run-time type profiling, type inference is all about giving you the knowledge to eliminate guards and unnecessary fallback code.

The IonMonkey compiler is capable of consolidating guards, but some guards defy consolidation, [§] and eliminating even very predictable guards can be important in hot loops. [¶] With type inference, the compiler knows things about a point in a JS program like:

The value in a is definitely a dense array of length 42 (with no "holes") that only contains integer values.

When you knowa is full of integers and that idx is an integer, you don't have to guard that the value of a[idx] is an integer. You can also use interval analysis to avoid the length-check guard that establishes idx is less than 42.

Design space

Personally, I don't think the design space for whole-method compilation of dynamic languages is well explored at this point. Making a HotSpot analog for JS has a lot of open ended, how-will-it-work-best style questions. Here's a sampling:

How do you best represent inline caches at the IR level?

Which events should cause deoptimization to occur, and which should we guard against "up front" in the generated code?

How do you efficiently represent debug/deoptimization information to elegantly handle the more complex deoptimization scenarios?

Just how beefy will the adaptive levels get? Can you adapt JS compilation in the browser all the way into the JVM's -server flag space? [#]

How important is off-thread compilation now that there are higher compiler optimization levels?

Will there be a monkey king?

In the coming months the IonMonkey team will be answering an interesting question: do all of the existing execution modes have their place? The thing about adaptive compilation is just that... it's adaptive. Clearly, the fittest monkeys will survive.

An adaptive compiler can potentially subsume the roles of other execution modes. Yes, IonMonkey is being designed to whole-method optimize code that runtime profiling indicates is hot. However, despite the overhead of building an IR, it can potentially do quick block-local optimizations to make "baseline" generated code, like JägerMonkey did.

The IonMonkey backend can also potentially be used as the compiler for traces formed using the existing monitoring infrastructure — compilers that handle regions can work just as well on linear instruction sequences. Linear instruction sequences are generally easier to optimize, because you don't have to reconcile (jargon: meet) optimization information at join points.

The idea of a one-size-fits-all JS compiler platform is alluring, but, in my mind, it's not clear how feasible that really is. As the IonMonkey compiler comes online, the experimentation can start. The team will just have to bust out the range finders and measure the badassitude of the ion beam. We won't regress performance, so which monkeys remain standing has yet to be seen.

I suppose what I'm saying is that you can count on us to pit our vicious monkeys against each other in a deathmatch for your personal benefit.

Microprocessor architecture perspective: even though modern microprocessors have loop buffer with pre-translated microcode and full branch prediction, you can't retire any of the guard's dependent instructions until you see that the guard has been predicted correctly.

Alongside our ferocious fixing, one of our late-game performance initiatives was to get all of our polymorphic inline caches (AKA PICs) enabled on ARM devices. It was low risk and of high benefit to our Firefox for Mobile browser, whose badass-yet-cute codename is Fennec.

Melts in your mouth, not in your ARM

To recap, JägerMonkeyJM is also known as the "method compiler": it takes a method's bytecode as input and orders up the corresponding blob of machine code with some helpful information on the side. Its primary sub-components are the register tracker, which helps the compiler transform the stack-based bytecode and reuse already-allocated machine registers intelligently, and the MacroAssembler, which is the machine-code-emitting component we imported from Webkit's Nitro engine.

The MacroAssembler is the secret sauce for JägerMonkeyJM's platform independence. It's an elegantly-designed component that can be used to emit machine code for multiple target architectures: all of x86, x86-64, and ARM assembly are supported through the same C++ interface! This abstraction is the reason that we only need one implementation of the compiler for all three architectures, which has been a clear win in terms of cross-platform feature additions and maintainability.

"So", you ask, "if you've got this great MacroAssembler-thingy-thing, why didn't all the inline caches work on all the platforms to begin with?" Or, alternatively, "If all the compiler code is shared among all the platforms, why didn't all the inline caches crash on ARM?"

The answer is that some platform-specifics had crept into our compiler code!

ARM'd and ifdef-dangerous

As explained in the entry on inline caches, an inline cache is a chunk of self-modifying machine code. A machine code "template" is emitted that is later tweaked to reflect the cached result of a common value. If you're frequently accessing the nostrilCount property of Nose objects, inline caches make that fast by embedding a shortcut for that access into the machine code itself.

In the machine code "template" that we use for inline caches, we need to know where certain constants, like object type and object-property location, live as offsets into the machine code so that we can change them later, during a process called repatching. However, when our compiler says, "If this value is not 0xdeadbeef, go do something else," we wind up with different encodings on each platform.

As you may have guessed, machine-code offsets are different for each platform, which made it easier for other subtle platform-specifics to creep into the compiler as well.

To answer the question raised earlier, the MacroAssembler interface wasn't heavily relied on for the early inline cache implementations. Inline caches were first implemented for x86, and although x86 is a variable-width instruction set, all of the instruction sequences emitted from the compiler had a known instruction width and format. [*] This permitted us to use known-constant-offset values for the x86 platform inline caches. These known-constant-offsets never changed and so didn't require any space or access time overhead in side-structures. They seemed like the clear solution when x86 was the only platform to get up-and-running.

Then x86-64 (AKA x64) came along, flaunting its large register set and colorful plumage. On x64, the instruction sequence did not have a known width and format! Depending on whether the extended register set is used, things like mov instructions may require a special REX prefix byte in the instruction stream (highlighted in blue above). This led to more ifdefs — on x64 a bunch more values have to be saved in order to know where to patch our inline caches!

As a result, getting inline caches working on ARM was largely a JägerMonkey refactoring effort. Early on, we had used conditional compilation (preprocessor flags) to get inline caches running on a platform-by-platform basis, which was clearly the right decision for rapid iteration, but we decided that it was time to pay down some of our technical debt.

Paying down the debt: not quite an ARM and a leg

The MacroAssembler deals with raw machine values — you can tell it dull-sounding machine-level things like, "Move this 17 bit sign-extended immediate into the EAX register."

On the other hand, we have our own awesome-sounding value representation in the SpiderMonkey engine: on both 32-bit and 64-bit platforms every "JS value" is a 64-bit wide piece of data that contains both the type of the data and the data itself. [†] Because the compiler is manipulating these VM values all the time, when we started the JägerMonkeyJM compiler it was only natural to put the MacroAssembler in a delicious candy coating that also knew how to deal with these VM values.

The NunboxAssembler, pictured in red, [‡] is a specialized assembler with routines to deal with our "nunbox" value representation. [§] The idea of the refactoring was to candy-coat a peer of the MacroAssembler, the Repatcher, with routines that knew how to patch common inline cache constructs that the NunboxAssembler was emitting.

With the inline cache Repatcher in place, we were once again able to move all the platform-specific code out of the compiler and into a single, isolated part of the code base, hidden behind a common interface.

Routines like NunboxAssembler::emitTypeGuard, which knows how to emit a type guard regardless of the platform, are paired with routines like ICRepatcher::patchTypeGuard(newType), which knows how to patch a type guard regardless of platform. Similarly, NunboxAssembler::loadObjectProperty has a ICRepatcher::patchObjectPropertyLoad. The constructs that are generated by the NunboxAssembler are properly patched by the corresponding ICRepatcher method on a miss. It's all quite zen.

Frog ARMs

On real devices running the Fennec betas, we've seen marked improvements since Beta 3. [¶] Most notably, we've leapfrogged the stock Android 2.2 browser on the V8-V5 benchmark on both the Galaxy S and the Nexus One. Pretty graphs courtesy of Mark Finkle.

ARMn't you glad I didn't say banana?

Since I've run out of remotely-acceptable ARM malapropisms, these topics will be left to further discussion. Feel free to comment on anything that deserves further clarification!

Footnotes

For example, if you always emit a simple mov from a 32-bit register to a 32-bit register, that has a known constant length. The "variable width" part of "variable width instruction set" refers to the fact that different instructions do not generally take the same number of bytes. It does not mean that the encoding of a given instruction (like mov) with particular operands (like two 32-bit registers) is totally variable.

On the fancy Tegra 2 board I was developing on, running the SunSpider harness on the JavaScript shell with methodjit-only, this work net us a whopping 230% speedup on the V8-V4 benchmark and a 15% speedup on SunSpider 0.9.1.

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

I call these instructions fake because they do things that a hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that languages implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
return nose.isPointy;
}

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstancestakes about 8.5 cycles.

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This the basic idea behind inline caches, AKA ICs.

Also, because we emit this snippet for every property-retrieving bytecode we don't rely on global property access patterns like the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type — any other type will result is a shapeIsKnown of False and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]

Conclusion

There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest we can further discuss topics like:

Why in the name of Knuth is it okay to perform a linear search of the types in the inline cache? Can't we do better?

What are all the types of inline caches that you currently implement? What bytecodes do they correspond to? Who are the handsome fellows who implemented each of those inline caches and are they currently seeing anybody?

Why are you talking about types in a language like JavaScript, where typeof o is almost always "object"?!

What happens if the property lives on a prototype of the object, instead of the object itself? ...then it can't be a simple slot load! I feel hurt and betrayed... what other caveats haven't been discussed!?

Why might you choose to use monomorphic inline caches (MICs) instead of using PICs everywhere?

Does the TraceMonkey execution time suffer because hits in the inline cache prevent fills of the (global) property cache?

What kick-arse optimizations related to PICs have been discussed in the literature that have yet to be implemented in JaegerMonkey?

What the crap have I been talking about this whole time?

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)

Alternative interpreter designs tend to walk over something that looks more like the source text — either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.

There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.

Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.

Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (had the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research — in JS, property access PIC are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.

"Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]

The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.