Tuesday, November 4, 2008

Porting the JIT to CLI (part 1)

As the readers of this blog already know, I have been working on the CLI
JIT backend for some months: last Friday, it reached an important milestone,
as it is now able to produce huge speedups for a little dynamic language. To
find out how huge the speedup is, read on :-).

The goal of the PyPy JIT generator is to take an interpreter and, with the help
of a few annotations, automatically generate a JIT compiler for it. In this
post, we will talk about the tlc virtual machine: while tlc is just a toy
language, it contains some features that make it an interesting target for our
JIT generator.

The tlc virtual machine

tlc is executed by a stack-based, dynamically typed virtual machine (for
those who know a bit about the Python VM: does it sound familiar? :-)).

There are three types of objects: integers, nil, and cons cells (i.e.
lisp-like pairs of objects).
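This three-type object model can be sketched in plain Python along the following lines. This is an illustrative sketch, not the actual tlc source; the class and method names here are assumptions:

```python
class Obj:
    """Base class for all tlc objects."""
    def int_o(self):
        # unboxing fails on anything that is not an integer
        raise TypeError("not an integer")

class IntObj(Obj):
    def __init__(self, value):
        self.value = value
    def int_o(self):
        return self.value

class NilObj(Obj):
    pass

class ConsObj(Obj):
    """A lisp-like pair: car is the head, cdr is the tail."""
    def __init__(self, car, cdr):
        self.car = car
        self.cdr = cdr

nil = NilObj()                                       # a single shared nil
lst = ConsObj(IntObj(1), ConsObj(IntObj(2), nil))    # the list (1 2)
```

Every value the VM manipulates is one of these wrapper objects, which is exactly why the type-checking and boxing costs discussed below show up on every operation.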

As the VM is very simple, it provides only a few opcodes:

opcodes to manipulate the stack, like PUSH, POP, etc.

integer operations, like ADD, MUL, all the comparisons, etc.:
these operations can only be applied to integers;

list operations, like CONS, CAR, CDR: these operations can
only be applied to lists;

other operations, including jumps and conditional jumps.

The VM is interesting for our purposes because it has a lot of similarities
with Python (though on a smaller scale, of course):

it has to do type-checks at runtime before doing most of the operations;

every time you do an arithmetic operation, it has to unbox the operands,
do the computation, and then box the result again.

This means that even if you have a program which only uses integers, you are
paying a lot of overhead.
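To make that overhead concrete, here is roughly what an ADD opcode has to do in an interpreter of this kind. This is a hedged sketch using a hypothetical IntObj box, not the actual tlc code:

```python
class IntObj:
    def __init__(self, value):
        self.value = value

def ADD(stack):
    b = stack.pop()
    a = stack.pop()
    # 1. type-check at runtime
    if not isinstance(a, IntObj) or not isinstance(b, IntObj):
        raise TypeError("ADD expects two integers")
    # 2. unbox the operands, 3. compute, 4. box the result again
    stack.append(IntObj(a.value + b.value))

stack = [IntObj(40), IntObj(2)]
ADD(stack)   # leaves IntObj(42) on the stack
```

Of the four steps, only the actual addition is useful work; the point of the generated JIT is to remove the type-check, the unboxing and the re-boxing whenever the types are known.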

To know more about this toy VM, look at its source code: the interesting
bits are the classes used to represent objects, and the interp_eval
function, which contains the main loop of the virtual machine. As you can
see, the implementation is quite straightforward; all the hint calls you
see are the special annotations needed by the JIT generator to produce better
code.
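interp_eval is essentially a dispatch loop over the bytecode; a minimal sketch of its shape follows. This is illustrative only: the opcode set is invented, and the comment about hints describes the idea, not the real JIT annotation API:

```python
PUSH, ADD, RETURN = range(3)   # a tiny invented opcode set

def interp_eval(bytecode, args):
    stack = list(args)
    pc = 0
    while True:
        opcode = bytecode[pc]
        pc += 1
        # in the real interpreter, hint() calls around here tell the
        # JIT generator which variables (pc, bytecode) are known at
        # compile time, so it can specialize the loop per program point
        if opcode == PUSH:
            stack.append(bytecode[pc])
            pc += 1
        elif opcode == ADD:
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif opcode == RETURN:
            return stack.pop()

result = interp_eval([PUSH, 3, ADD, RETURN], [4])   # computes 4 + 3
```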

Let's JIT it!

So, the whole point is to generate a JIT compiler from it, isn't it?

First, check out a fresh copy of the oo-jit branch:

$ svn co http://codespeak.net/svn/pypy/branch/oo-jit

Then, go to the oo-jit/pypy/jit/tl directory, and compile the tlc VM
with the CLI backend and JIT enabled:

Cool, it seems that the result was computed correctly :-). As you can see from
the output, we ran the program three times:

by plain interpretation, without any jitting;

with the jit enabled: this run includes the time spent doing the
compilation itself, plus the time spent running the produced code;

again with the jit enabled, but this time the compilation has already
been done, so we are actually measuring how good the code we produced is.
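The same warmup-aware protocol applies to any JIT-style system; a small Python sketch of the idea, using `compile`/`eval` as a stand-in for the real compiler (the function names are made up):

```python
import time

def measure(f, *args):
    """Run f and return (result, elapsed seconds)."""
    t0 = time.time()
    result = f(*args)
    return result, time.time() - t0

compiled_cache = {}

def run_jitted(source, arg):
    # compile on first use, reuse the compiled code afterwards
    if source not in compiled_cache:
        compiled_cache[source] = compile(source, '<bench>', 'eval')
    return eval(compiled_cache[source], {'x': arg})

# the first jitted run pays for compilation plus execution;
# the second run measures only the quality of the produced code
r1, warmup_time = measure(run_jitted, 'x * 2', 21)
r2, steady_time = measure(run_jitted, 'x * 2', 21)
```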

So, it's time to run a benchmark: let's try to compute the factorial of a very
big number; the result will be 0, because obviously after a while we overflow,
but after all we are interested in the time spent, not in the result:
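Why 0? With fixed-width machine integers the product wraps around, and once n! has accumulated at least 32 (or 64) factors of two, all the low bits are zero. A quick Python check simulating 32-bit wrap-around (assuming 32-bit integers, which may differ from what tlc actually uses):

```python
MASK = 0xFFFFFFFF          # keep only the low 32 bits, like a machine int

def fact32(n):
    result = 1
    for i in range(2, n + 1):
        result = (result * i) & MASK
    return result

# small values are still correct...
print(fact32(10))    # -> 3628800
# ...but a "very big" factorial collapses to 0:
# 100! contains 2**97 as a factor, so its low 32 bits are all zero
print(fact32(100))   # -> 0
```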

And no, I didn't make any mistake in copying and pasting: the jitted version is
really 400 times faster than the non-jitted one!

Warning: my laptop does not seem to be well suited for benchmarks, as the
results vary a lot from run to run; I've run the benchmarks many times,
and I got speedup factors of up to 500, so your results may differ.

More benchmarks

It's also interesting to compare the result with a hand-written C#
version of the factorial, to see how good the code we produced is; to get
reasonable results, we need to compute a larger factorial, to let the code
run a bit longer:

We know that the generated code is far from optimal, but the factor of five
is probably at least partially due to the fact that Mono's own JIT is
optimized for C#-like code, and our code has a completely different shape.

All the benchmarks above were run under Linux, with Mono 1.9.1. Here are the
results for the same benchmarks, but run with Microsoft CLR (on a different
machine, so the absolute values are not comparable):

The results are even better than before; this is probably thanks to the CLR's
JIT, which does a better job than Mono's when faced with something different
from the usual C#-like code.

Conclusions (for now)

This is a very important result, because it proves that PyPy's approach to JIT
compilers can also be applied effectively to OO virtual machines; the result
is even better than I expected, because when generating code for .NET we
have much less freedom than when generating assembly code, and I had to play
some tricks to work around some .NET limitations.

Moreover, it worked on the first try :-). I tried to compile the tlc
virtual machine as soon as all the related JIT tests were passing, and
surprisingly everything worked just fine, even though it was the very first
time I was applying some features of the JIT to something bigger than a test:
I think this is yet another proof that Test Driven Development just works!

Even if this is a major milestone, the CLI JIT backend is not yet complete:
as a consequence, it can't yet be used for the full PyPy, but all the
hardest problems should be solved by now.

Since a lot of readers asked for more technical details, especially about the
JIT, I will soon try to write a second blog post explaining how the CLI
backend works internally, with a brief look at the generated code to see what
it looks like.


7 comments:

Anonymous
said...

If you are benchmarking on Linux, watch out for CPU frequency scaling. For example, on Ubuntu the ondemand governor is used by default, which runs the CPU at the lowest possible speed until there is CPU demand, at which point it switches to the fastest. The time to switch varies (e.g. sometimes it is instantaneous, other times it takes a second or two, and other times it doesn't happen at all).

- I know about cpufreq-set, but even setting the governor to performance doesn't help; the timings vary a lot between runs. If someone knows a way to run reliable benchmarks, it would be much appreciated!

- I also have plans to experiment with the JIT on the JVM: since HotSpot usually does a better job than the CLR's JIT, it's possible (even likely) that the JVM is a better platform for our purposes. Also, the experimental Da Vinci Machine contains features that could be very useful for us. Unfortunately the PyPy non-JIT JVM backend is not as advanced as the CLI one, and it lacks some features that are really needed for writing a JIT backend.

- Float operations are already (mostly) supported by our JIT backends; I bet that if you add a FloatObj to the tlc interpreter, you will see huge speedups as well. However, the real point of PyPy's approach is that, once finished, it will optimize much more than ints and floats, including features that are currently not implemented by psyco (e.g. generators).