Friday, November 7, 2008

Porting the JIT to CLI (part 2)

In my previous post, we saw that PyPy JIT generator can produce huge
speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of PyPy JIT, to see how it
manages to do so. Note that this is a very high level overview of how the
JIT works, and applies to all backends. Then, in the third post of this
series, we will look closer at the CLI JIT backend, seeing how it works around
some .NET limitations and how the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a
JIT of our own: so, instead of passing nights writing a JIT, we pass years
coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job, (perhaps this
will be the subject of another blog post), but how the generated JIT
works.

There are values that, if known at compile-time (i.e., when the JIT compiler
runs), let the JIT to produce very efficient code. In a dynamic language,
types are the primary example: for instance, suppose you are a compiler and
you have to compile to following Python function:

def mysum(a):
return a + 1

At compile time, you don't have any knowledge about the type of the parameter:
it could be integer, float, an user defined object, etc. In this situation,
the only safe choice is to emit code which does the usual, slow, full lookup
to know how to perform the operations.

On the other hand, suppose that you knew in advance that the parameter is an
integer: this time, you could emit code that exploits this extra
knowledge, by performing directly a fast integer addition.

The idea behind PyPy JIT is that if you don't have enough knowledge to
generate efficient code, you stop compiling and wait until you know
exactly what you need. Concretely, you emit code that runs until the point
where you stopped the compilation, then it triggers a special procedure that
restarts the compiler. This time the JIT compiler knows everything
you need, because you can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces
something like this pseudo-code:

If you call mysum(41), the execution goes in the default branch of the
switch, thus calling continue_compilation: its job is to restart the JIT
compiler, which now can emit fast code because it knows the exact type of
a; then, it modifies the original mysum_compiled function, in
order to make it executing the newly generated code the next time it
encounters an integer at that point:

From now on, every time we call mysum with an integer argument, the JIT
compiler is not called anymore and the fast path is directly executed; if we
happen to call mysum with a float arguments, the switch goes again in the
default branch, and the JIT compiler is started once more to produce fast
code also for this case. What happens in practice is that compile-time and
runtime are continuously intermixed, until the switches are stable enough and
the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called flexswitch, and
it's one of the most important concept of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate
efficient code and which aren't? Unfortunately it can't, or at least our JIT
generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by
annotating the variables on which we want the JIT to stop until it knows the
actual values; this is done by using particular hints, called promote
and promote_class; variables annotated with such hints are said to be
promoted. If something is promoted, a flexswitch is used to gain
information about it, as seen in the last section.

For an example, let's look at an excerpt from main dispatch loop of the tlc
virtual machine:

This the implementation of the ADD opcode: first, it pops two values from
the stack; then, it computes the result; finally, it push the result to the
stack again. In between, both the classes of a and b have been
promoted: this means that when the JIT emits the code for b.add(a), it
knows exactly what is happening: if it sees that both are instances of the
IntObj class, it inlines the method call and emits a fast integer addition
instead.

Virtuals

The other important concept of the JIT is the presence of virtual
structures, virtual lists, and virtual dictionaries. Again, I'm not
going to explain in depth how they work, but only why they are so important for
generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really
need to do it, e.g. because they are being passed as an argument to some
external function. Instead, we store all the informations we need as local
variables; e.g., in the case of a virtual structure, we create as many local
variables as the number of its fields: if the structure escapes the local
scope, we force it to a real object, by allocating memory on the heap and
initializing it after the current value of the local variables.

This technique allows the JIT to avoid the allocation of many temporary
objects that hold intermediate results; consider for example the following
Python loop:

result = 0
for i in range(N):
result += i
return result

Without the JIT, at each iteration, a new int object is created and bound
to the result variable, while the previous one is discarded and not needed
anymore. By combining virtuals and promotion, the JIT can emit code that does
the whole computation locally, and allocates a real object only at the end,
when it escapes from the local scope because it is returned from the
function.

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs
emit multiple versions of each chunk of code: each version is specialized
and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches)
is almost always negligible compared to how much time you save by running the
fast version instead of the more-general-but-slow one. Moreover, each
specialized version knows the exact shape of the objects it's dealing with, so
they can be virtualized to make the generated code even more efficient.

At the end, the actual code generation is done by one of the JIT backends:
the backends exploit all the knowledge gathered by the previous steps to
produce highly efficient code, but this will be the subject of the next blog
post.

In my previous post, we saw that PyPy JIT generator can produce huge
speedups when applied to the tlc toy language.

In this post we will dive a bit into the internals of PyPy JIT, to see how it
manages to do so. Note that this is a very high level overview of how the
JIT works, and applies to all backends. Then, in the third post of this
series, we will look closer at the CLI JIT backend, seeing how it works around
some .NET limitations and how the generated code looks like.

PyPy JIT for dummies

As you surely know, the key idea of PyPy is that we are too lazy to write a
JIT of our own: so, instead of passing nights writing a JIT, we pass years
coding a JIT generator that writes the JIT for us :-).

I'm not going to explain how the JIT generator does its job, (perhaps this
will be the subject of another blog post), but how the generated JIT
works.

There are values that, if known at compile-time (i.e., when the JIT compiler
runs), let the JIT to produce very efficient code. In a dynamic language,
types are the primary example: for instance, suppose you are a compiler and
you have to compile to following Python function:

def mysum(a):
return a + 1

At compile time, you don't have any knowledge about the type of the parameter:
it could be integer, float, an user defined object, etc. In this situation,
the only safe choice is to emit code which does the usual, slow, full lookup
to know how to perform the operations.

On the other hand, suppose that you knew in advance that the parameter is an
integer: this time, you could emit code that exploits this extra
knowledge, by performing directly a fast integer addition.

The idea behind PyPy JIT is that if you don't have enough knowledge to
generate efficient code, you stop compiling and wait until you know
exactly what you need. Concretely, you emit code that runs until the point
where you stopped the compilation, then it triggers a special procedure that
restarts the compiler. This time the JIT compiler knows everything
you need, because you can inspect the state of the running program.

Let's see an example: the first time the JIT compiles mysum, it produces
something like this pseudo-code:

If you call mysum(41), the execution goes in the default branch of the
switch, thus calling continue_compilation: its job is to restart the JIT
compiler, which now can emit fast code because it knows the exact type of
a; then, it modifies the original mysum_compiled function, in
order to make it executing the newly generated code the next time it
encounters an integer at that point:

From now on, every time we call mysum with an integer argument, the JIT
compiler is not called anymore and the fast path is directly executed; if we
happen to call mysum with a float arguments, the switch goes again in the
default branch, and the JIT compiler is started once more to produce fast
code also for this case. What happens in practice is that compile-time and
runtime are continuously intermixed, until the switches are stable enough and
the compiler is not needed anymore.

In PyPy jargon, this kind of "growable switch" is called flexswitch, and
it's one of the most important concept of our JIT generator.

Promotion

How can the JIT generator know which values are useful to know to generate
efficient code and which aren't? Unfortunately it can't, or at least our JIT
generator is not smart enough at the moment.

To get the best from it, the developers of the VM need to instruct it by
annotating the variables on which we want the JIT to stop until it knows the
actual values; this is done by using particular hints, called promote
and promote_class; variables annotated with such hints are said to be
promoted. If something is promoted, a flexswitch is used to gain
information about it, as seen in the last section.

For an example, let's look at an excerpt from main dispatch loop of the tlc
virtual machine:

This the implementation of the ADD opcode: first, it pops two values from
the stack; then, it computes the result; finally, it push the result to the
stack again. In between, both the classes of a and b have been
promoted: this means that when the JIT emits the code for b.add(a), it
knows exactly what is happening: if it sees that both are instances of the
IntObj class, it inlines the method call and emits a fast integer addition
instead.

Virtuals

The other important concept of the JIT is the presence of virtual
structures, virtual lists, and virtual dictionaries. Again, I'm not
going to explain in depth how they work, but only why they are so important for
generating highly efficient code.

The essence of virtuals is that you don't allocate objects until you really
need to do it, e.g. because they are being passed as an argument to some
external function. Instead, we store all the informations we need as local
variables; e.g., in the case of a virtual structure, we create as many local
variables as the number of its fields: if the structure escapes the local
scope, we force it to a real object, by allocating memory on the heap and
initializing it after the current value of the local variables.

This technique allows the JIT to avoid the allocation of many temporary
objects that hold intermediate results; consider for example the following
Python loop:

result = 0
for i in range(N):
result += i
return result

Without the JIT, at each iteration, a new int object is created and bound
to the result variable, while the previous one is discarded and not needed
anymore. By combining virtuals and promotion, the JIT can emit code that does
the whole computation locally, and allocates a real object only at the end,
when it escapes from the local scope because it is returned from the
function.

Putting it all together

This is, essentially, how PyPy's generated JITs work. To summarize, our JITs
emit multiple versions of each chunk of code: each version is specialized
and optimized for one particular case.

The cost of selecting the right specialization to use (through flexswitches)
is almost always negligible compared to how much time you save by running the
fast version instead of the more-general-but-slow one. Moreover, each
specialized version knows the exact shape of the objects it's dealing with, so
they can be virtualized to make the generated code even more efficient.

At the end, the actual code generation is done by one of the JIT backends:
the backends exploit all the knowledge gathered by the previous steps to
produce highly efficient code, but this will be the subject of the next blog
post.

@bobyour guess is right, how to implement the flexswitch is backend-dependent. This is the hardest part for .NET, as the flexswitch needs to grow dynamically (i.e., you have to add more case after the .NET method has already been compiled). It will be subject of the next blog post.

It seems that an implication of the JIT way is that, by adopting a consistent habit of implementing type driven Generic Functions, the JIT could accomplish nearly all of the possible optimizations in a single pass. In other words, by definition, each type based variation of a Generic Function call can only be fired when data of that type is provided as a parameter.