Thursday, March 5, 2009

JIT - a bit of a look inside

The previous post about our JIT explained, from a 1000 km perspective,
how the tracing JIT would approach a language like Python.

I would like to step a bit inside and zoom in on some of its features that
are already working.
While they are probably not the most innovative, I think it's very nice to look
at the way we work with the JIT and what tools we use.

The main cool thing is that you can work on and try the JIT (including trying
it on the Python interpreter!) without even generating a single bit of
assembler. How? Let's start with something very simple. Let's take
a simple interpreter for language X.

Language X has 3 opcodes: CO_INCREASE, CO_DECREASE and CO_JUMP_BACK_3.
CO_INCREASE increases the accumulator by one, CO_DECREASE decreases
it by one, and CO_JUMP_BACK_3 jumps 3 opcodes back if the accumulator is
smaller than 100 (this is only there to keep a halting condition possible).
The interpreter for language X looks like this::

All very simple code, except the jitdriver hints, which instruct the JIT how to
behave (they are the equivalent of the ``add_to_position_key`` of the last blog
post).
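
As a rough illustration, an interpreter of this shape could be sketched as
follows. This is a minimal sketch and not the code from the test: the numeric
opcode values, the sample ``code`` list, the ``interpret`` function, the
``acc`` variable name and the no-op ``FakeJitDriver`` class are all assumptions
made here so that the example runs on plain Python without PyPy. The hint names
follow PyPy's ``jit_merge_point`` and ``can_enter_jit``, but here they do
nothing at all.

```python
# Hypothetical opcode numbering; the post does not give the actual values.
CO_INCREASE = 0
CO_DECREASE = 1
CO_JUMP_BACK_3 = 2


class FakeJitDriver(object):
    """No-op stand-in for a JIT driver (an assumption of this sketch).

    It only records which variables are green (they identify a position
    in the interpreted program) and which are red (everything else).
    """

    def __init__(self, greens, reds):
        self.greens = greens
        self.reds = reds

    def jit_merge_point(self, **live_vars):
        pass  # a real driver would mark the top of the dispatch loop here

    def can_enter_jit(self, **live_vars):
        pass  # a real driver could start tracing at this backward jump


jitdriver = FakeJitDriver(greens=['i'], reds=['acc'])


def interpret(code):
    """Run a language X program and return the final accumulator."""
    i = 0      # position in the bytecode (green)
    acc = 0    # the accumulator (red)
    while i < len(code):
        jitdriver.jit_merge_point(i=i, acc=acc)
        opcode = code[i]
        if opcode == CO_INCREASE:
            acc += 1
        elif opcode == CO_DECREASE:
            acc -= 1
        elif opcode == CO_JUMP_BACK_3:
            if acc < 100:
                i -= 3
                jitdriver.can_enter_jit(i=i, acc=acc)
                continue
        i += 1
    return acc


# Three increments followed by a conditional backward jump.
code = [CO_INCREASE, CO_INCREASE, CO_INCREASE, CO_JUMP_BACK_3]
```

Calling ``interpret(code)`` on this sample program keeps adding three to the
accumulator until it is no longer smaller than 100, then falls off the end of
the bytecode.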

Let's look at how this code is processed. This will also give a glimpse
of how we work on this code. This particular piece can be found
on a branch in pypy/jit/metainterp/test/test_loop.py
and can be run with
``./test_all.py jit/metainterp/test/test_loop.py -k test_example -s --view``
from the pypy directory. The ``-s`` option lets you see the debugging output,
while ``--view`` will show you some graphs. So, let's look at the graphs in order:

And the same picture with a bit of zoom for the first block:

This is the call graph of an interpreter loop, nothing magic so far. This is an
intermediate representation of the translation toolchain input. If you look around
you can follow how the opcodes are dispatched (with a chain of ifs) and how the
helpers are called. The next graph is very boring, because it's a slightly lower
level representation of the same thing (you exit with q or escape btw :).

When we exit the graph viewer, we can see the trace generated by interpreting
this graph with a given bytecode (the variable ``code`` in the paste above).
It's something like:

It's entering the JIT, doing some primitive operations for bytecode dispatching
and repeating the loop. Note that at the end of the interpreted loop
(not to be confused with the interpreter loop), we see int_sub [3, 3]
which resets the bytecode position to the beginning. At this point the JIT
(instructed by the can_enter_jit hint) notices that all green variables
have the same values as before (here only i),
hence we can compile the efficient loop from this point.
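
The idea of closing the loop when the green variables repeat can be sketched in
a few lines of plain Python. This is only an illustration under stated
assumptions, not how the metainterp is written: the opcode numbering, the
``record_trace`` function and the textual format of the recorded operations are
made up for this example, and the dispatching operations that show up in the
real trace are left out.

```python
# Hypothetical opcode numbering; the post does not give the actual values.
CO_INCREASE, CO_DECREASE, CO_JUMP_BACK_3 = range(3)


def record_trace(code, start_i, acc):
    """Interpret `code` starting at position `start_i` and record what is done.

    Recording stops as soon as a backward jump brings the green variable
    `i` back to `start_i`: the recorded operations then form a closed loop.
    Returns the list of recorded operations, or None if the program ends
    before the loop is closed.
    """
    trace = []
    i = start_i
    while i < len(code):
        opcode = code[i]
        if opcode == CO_INCREASE:
            acc += 1
            trace.append('int_add acc, 1')
        elif opcode == CO_DECREASE:
            acc -= 1
            trace.append('int_sub acc, 1')
        elif opcode == CO_JUMP_BACK_3:
            # The trace is only valid while this condition keeps holding.
            trace.append('guard acc < 100')
            if acc < 100:
                i -= 3
                # This is where a can_enter_jit hint would sit: compare the
                # green variable with the value it had when recording began.
                if i == start_i:
                    trace.append('jump to start')
                    return trace
                continue
        i += 1
    return None


program = [CO_INCREASE, CO_INCREASE, CO_INCREASE, CO_JUMP_BACK_3]
trace = record_trace(program, 0, 0)
```

Because ``i`` is the same at the start and at the end of the recording, it does
not need to appear in the recorded operations at all; only the accumulator does.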

The loop contains 3 additions and a check (that the accumulator is smaller
than 100), exactly the same as our interpreted program would do, but
completely without interpretation overhead!
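
Written by hand, a loop of that shape would be roughly equivalent to the
function below. The name ``residual_loop`` and the variable ``acc`` are
assumptions of this sketch, which also assumes a program made of three
CO_INCREASE opcodes followed by CO_JUMP_BACK_3.

```python
def residual_loop(acc):
    """Hand-written equivalent of the loop left once dispatching is gone."""
    while True:
        # The three CO_INCREASE opcodes, with no opcode lookup in between.
        acc += 1
        acc += 1
        acc += 1
        # The check done by CO_JUMP_BACK_3: leave the loop when it fails.
        if not acc < 100:
            return acc
```

There is no bytecode list, no position counter and no chain of ifs left here,
which is the whole point.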

As you might have noticed, there is no assembler involved so far. All of this
instruction execution is done directly, in pure Python. In fact, the
code for executing instructions is located in jit/backend/llgraph
which directly interprets instructions. This is far simpler (and easier
to debug) than x86 assembler.
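
To give a feeling for what "directly interprets instructions" means, here is a
toy executor for a looping trace. It is a sketch under stated assumptions and
not the llgraph backend: the tuple format of the operations, the ``run_trace``
function and the ``env`` dictionary are invented for this example, and the
operation names are only loosely modelled on the int_sub seen in the trace
above.

```python
def run_trace(ops, env):
    """Execute a looping trace in pure Python until a guard fails.

    `ops` is a list of tuples in a made-up format; `env` maps variable
    names to values and is updated in place.
    """
    def value(arg):
        # Strings name variables in `env`; anything else is a constant.
        return env[arg] if isinstance(arg, str) else arg

    while True:
        for op in ops:
            name = op[0]
            if name == 'int_add':
                env[op[1]] = value(op[2]) + value(op[3])
            elif name == 'int_sub':
                env[op[1]] = value(op[2]) - value(op[3])
            elif name == 'int_lt':
                env[op[1]] = value(op[2]) < value(op[3])
            elif name == 'guard_true':
                if not value(op[1]):
                    return env   # guard failed: leave the loop
            elif name == 'jump':
                break            # go back to the first operation
            else:
                raise ValueError('unknown operation: %r' % (name,))
        else:
            return env           # the trace ended without a jump


# Three additions, a check and a jump back, in the made-up format.
loop_trace = [
    ('int_add', 'acc', 'acc', 1),
    ('int_add', 'acc', 'acc', 1),
    ('int_add', 'acc', 'acc', 1),
    ('int_lt', 'cond', 'acc', 100),
    ('guard_true', 'cond'),
    ('jump',),
]
result = run_trace(loop_trace, {'acc': 0})
```

Every operation is an ordinary Python statement, so a debugger or a print in
the right place is enough to see what a trace is doing.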

And this is basically it: a very simple interpreter and a JIT for it.
Of course we actually can generate assembler for that. The other missing
piece is optimizing the generated graphs. While for this example,
by removing the interpretation overhead, we're done, with more complex
examples it's important to further optimize traces. Hopefully this and
how we actually generate assembler will be topics for the next blog posts.

Cheers,
fijal
