Pages

Monday, June 29, 2009

I've been reading a lot of papers lately about opcode dispatch, because I've been trying hard to prepare for a possible migration to L1. As I've mentioned in other posts, L1 is going to enable a number of cool techniques and optimization possibilities that we don't really have access to right now, and make a number of currently possible optimizations more beneficial. What I want to do is understand all these possibilities and their requirements so we can design L1 specifically to provide them. The goal of this post is to go over some ideas about opcode dispatch and runcores from the literature I've been reading, to make sure everybody has a basic understanding of these concepts.

This is the start of what I hope will be an ongoing series of posts where I try to discuss theory and algorithms behind some of Parrot's systems, and theories about things we would like to add to Parrot in the future. I'll tag posts like this "ParrotTheory".

"Dispatch" is the action of moving control flow into the logic body of an opcode, based on the value of a "program counter" (PC) pointer. In microprocessor hardware terminology, the PC is a hardware register that points to the memory address of the currently executing machine code instruction. x86 aficionados will probably recognize this as the "IP" (Instruction Pointer) register: same concept, different name. Bytecode comes into the interpreter as a large block of sequential opcodes, and the current opcode is the one PC points to, so *PC. Moving to the next opcode means we increment PC, so PC++ or something similar.

Most simplistic VMs will use a switch-based core, which uses the C switch statement to dispatch:

Control flow in this snippet is a complete mess: we follow the switch, break to the bottom of the switch, and then jump back up to the top of the loop to repeat the process. A good optimizing C compiler will probably convert this switch statement into a dispatch table of jump addresses, which creates a series of indirect jumps. Hardware has trouble with indirect jumps in general because it needs to load the jump target from another address in memory, a value which can change at runtime and therefore cannot easily be predicted. I can describe this in more detail elsewhere for anybody who is interested (I'm doing a particularly poor job of it right now). We end up running the CPU at a fraction of its maximum throughput: it has to keep refilling the instruction pipeline because it can't anticipate where control flow will move next. This is bad.

A system called direct threading stores all the opcode bodies together in a single large function, similar to the switch core. However, instead of using a loop and a switch, it uses a goto instruction to jump directly to a label. Each opcode has a label, and those are usually stored in a table somewhere. So instead of the example above, we have this:

It turns out that this is a little faster, but the hardware microprocessor is still not able to properly anticipate where all the indirect jumps lead. Sometimes it can, and some architectures handle this case better than others, but it's far from perfect. Parrot has two cores that implement direct threading: the computed goto (CG) and predereferenced computed goto (PCG or CGP) cores. Because of limitations in some compilers and the C99 spec, CG and PCG are only available when Parrot is built with GCC. These can be accessed with the "-R cgoto" and "-R cgp" flags respectively.

[Update 01/07/09: As a note of clarification, switch-based cores and direct-threaded cores perform differently on different systems. Some platforms handle one better than the other. Most compilers will generate bounds-checking logic for the switch core that does not exist in the direct-threaded core. I've seen numbers showing that the switch core leads to almost a 100% branch misprediction rate on some systems, while direct threading leads to only about 50% misprediction on those same systems. I would like to see a more robust cross-platform comparison of the two.]

In these cases, it's really interesting to note that the number of machine code instructions needed to dispatch an opcode is not nearly as important to system performance as the microprocessor's ability to anticipate control flow and keep its pipeline full. Getting a speculation wrong means that the processor has to flush its pipeline and reload instructions. Some processors will even stall, and not execute anything until the new address is known. These problems, called pipeline hazards, are very, very bad for performance.

Pipeline hazards are bad for performance, but cache hazards are even worse. A cache hazard occurs when the processor attempts to access memory that is not in its cache and has to load the necessary data from the comparatively slow RAM. In terms of opcode dispatch, we run into cache hazards when the combined code size of the opcode bodies is very large and we can't fit it all into the processor cache. So what we want is a dispatch mechanism that is compact enough to behave well in the processor cache, but also makes things easier on the branch predictor. This is one of my motivations for making L1 a very small opcode set: to ensure that all the opcodes can easily fit into the processor cache without creating all sorts of cache hazards.

Inlining, a technique frequently used by JITs and modern optimizers, tries to help with branch prediction by converting a stream of opcodes with indirect branches into a stream of machine code instructions with relative branches. Instead of dispatching to the opcode body, the opcode body is copied into a running stream of machine code and executed directly by the processor. This completely eliminates pipeline hazards due to indirect opcode dispatch. However, you end up with more code to cache, because the stream of machine code may duplicate individual ops, and you spend a lot of overhead calling memcpy repeatedly on the opcode bodies. This technique increases memory footprint and can reduce cache performance.

A subroutine-threaded core stores each opcode as a separate C-level function. Each op in sequence is called and then returns back to the runcore. That's two branch instructions to dispatch each op, compared to only one for a direct-threaded core. However, recent benchmarks I have seen in Parrot show that the subroutine core actually performs faster than the direct-threaded core does. This is because modern microprocessors have lots of hardware dedicated to predicting and optimizing control flow in call/return situations, because that is one of the most common idioms in modern software. This is a nonintuitive situation where more machine code instructions actually execute faster than fewer instructions. Parrot's default "slow" core ("-R slow") and the so-called "fast" core ("-R fast") use this technique (actually, these cores aren't exactly "subroutine-threaded", but it's close). From the numbers I have seen, the fast core is the fastest in Parrot. Here's how it works, basically:

[Update 01/07/09: There is some disagreement in the literature about what "subroutine threading" really is. Some sources refer to the example I posted above as "call-threaded" code, and use the term "subroutine threading" more in the way that I am describing context threading below.]

Context threading, which is a slightly more advanced technique, combines some ideas from the subroutine-threaded and direct-threaded modes, and a little bit of JIT magic. We create what's called a Context Thread Table, which is actually a memory buffer of executable machine code. We store opcodes as functions like we do in the subroutine-threaded system, and use a simple JIT engine to create a series of calls to those functions. This PASM code:

new P0, 'foo'
add_i P0, 1
push P1, P0

Becomes this sequence in machine code:

call Parrot_op_new
call Parrot_op_add_i
call Parrot_op_push

What context threading does, in essence, is align the VM's PC pointer with the hardware IP register, maximizing the hardware's ability to predict branches. There are some complexities here involving branches at the PIR level, but they aren't insurmountable. Parrot does not currently have a context-threaded core, but I would very much like it if somebody added one. Numbers I've seen show that a good context-threaded core reduces pipeline and cache hazards by significant amounts, which has the effect of increasing performance significantly.

So that's a quick overview of some basic opcode dispatch mechanisms. I know I am leaving a few issues out, and I know I misrepresented a few topics here for brevity and clarity. Next time I think I will talk about method dispatch and some optimization techniques used for that.

3 comments:

I do find it humorous that we are now targeting the CPU features that were added to CPUs to cope with spaghetti code. Maybe if people start writing really incredibly inefficient jump tables, CPU vendors will start implementing jump-table prediction instead of branch prediction and in a decade we'll be all set :-)

I hadn't seen those slides before, but I did read the paper that they are based on. Very interesting information. Some of the terminology that they use is a little different from what I use, I think. I find that there isn't a whole hell of a lot of consistency in this area.