Saturday, June 26, 2010

"Blackhole" interpreter

Hi all,

Here are a few words about the JIT's "great speedup in compiling
time" advertised in the PyPy 1.3 release (see the
previous blog post).
The exact meaning behind these words needs a fair bit of
explanation, so here it is in case you are interested.

If you download a version of PyPy 1.3 that includes a JIT
compiler, you get an executable that could be qualified as rather
fat: it actually contains three interpreters. You have on the
one hand the regular Python interpreter. It is here because it's
not possible to JIT-compile every single piece of Python code you
try to run; only the most executed loops are JIT-compiled. They
are JIT-compiled with a tracing interpreter that operates one
level down. This is the second interpreter. This tracing step
is quite slow, but that's all right because it's only invoked on
the most executed loops (on the order of 100 to 1000 times in
total, in a run of a Python script that anyway takes seconds or
minutes to run).
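As a rough illustration of that division of labor, here is a minimal sketch (not PyPy's actual code; the class, method names and the threshold value are all hypothetical) of how a regular interpreter can count loop executions and hand only the hot ones to a tracer:

```python
HOT_THRESHOLD = 1000  # hypothetical tuning value

class LoopProfiler:
    def __init__(self):
        self.counters = {}   # loop position -> times seen
        self.compiled = {}   # loop position -> generated code

    def on_backward_jump(self, position):
        """Called each time the interpreter jumps back to a loop head."""
        if position in self.compiled:
            return self.compiled[position]   # switch to machine code
        n = self.counters.get(position, 0) + 1
        self.counters[position] = n
        if n >= HOT_THRESHOLD:
            # invoke the (slow) tracing step, but only for hot loops
            self.compiled[position] = self.trace_and_compile(position)
            return self.compiled[position]
        return None                          # keep interpreting normally

    def trace_and_compile(self, position):
        # stand-in for the tracing interpreter + assembler backend
        return "<assembler for loop at %d>" % position
```

The point of the sketch is only the cost structure: the common path is one dictionary lookup and one counter increment per loop iteration, while the expensive tracing happens a bounded number of times per run.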

So apart from the JIT compilation itself, we have two worlds in
which the execution proceeds: either by regular interpretation,
or by the execution of assembler code generated by the JIT
compiler. And of course, we need to be able to switch from one
world to the other quickly: during regular interpretation we have
to detect if we already have generated assembler for this piece
of code and if so, jump to it; and during execution of the
assembler, when a "guard" fails, i.e. when we meet a path of
execution for which we did not produce assembler, then we need to
switch back to regular interpretation (or occasionally invoke the
JIT compiler again).

Let us consider the cost of switching from one world to another.
During regular interpretation, if we detect that we already have
assembler corresponding to this Python loop, then we just jump to
it instead of interpreting the Python loop. This is fairly
cheap, as it involves just one fast extra check per Python loop.
The reverse is harder, because "guard" failures can occur at any
point in time: it is possible that the bit of assembler that we
have executed so far corresponds to running the first four and a
half Python opcodes of the loop. The guard that failed just now
is somewhere in the middle of interpreting that opcode -- say,
multiplying these two Python objects.
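To make the situation concrete, here is a hedged sketch of what a guard conceptually looks like (pure illustration in Python, not the actual generated assembler; `GuardFailure` and `compiled_multiply` are hypothetical names). The trace was recorded while both operands were ints, so the code is specialized for ints, and a guard checks that assumption:

```python
class GuardFailure(Exception):
    """Signals that execution must leave machine code mid-opcode."""
    def __init__(self, resume_info):
        self.resume_info = resume_info  # where inside the opcode we stopped

def compiled_multiply(x, y):
    # guard: both operands must still be ints, as seen during tracing
    if not (isinstance(x, int) and isinstance(y, int)):
        # we are in the *middle* of the multiplication opcode here;
        # we cannot simply jump back into the C interpreter loop
        raise GuardFailure(resume_info="mid-BINARY_MULTIPLY")
    return x * y  # the fast, specialized path
```

The awkward part is exactly what the exception stands for: when the guard fails, the machine code knows precisely where inside the opcode it stopped, but the regular interpreter has no entry point at that spot.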

It's almost impossible to just "jump" to the right place in the
code of the regular interpreter -- how do you jump inside a
regular function compiled in C, itself in a call chain, resuming
execution of the function from somewhere in the middle?

So here is the important new bit in PyPy 1.3. Previously, what
we would do is invoke the JIT compiler again in order to follow
what needs to happen between the guard failure and the real end
of the Python opcode. We would then throw away the trace
generated, as the only purpose was to finish running the current
opcode. We call this "blackhole interpretation". After the end
of the Python opcode, we can jump to the regular interpreter
easily.
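The idea of blackhole interpretation can be sketched as follows: plainly execute the recorded operations from the guard failure up to the end of the current Python opcode, with no tracing and no boxing, and then hand the resulting state back to the regular interpreter. This is an illustrative sketch only; the three-address tuple format and the operation names are assumptions, not PyPy's actual jitcode format:

```python
def blackhole_run(ops, registers):
    """ops: the recorded operations between the failing guard and the
    end of the current opcode, as (target, opname, arg1, arg2) tuples;
    registers: the live values at the point of the guard failure."""
    for target, opname, a, b in ops:
        # execute each operation directly on plain values --
        # no boxing, no trace recording
        if opname == "int_add":
            registers[target] = registers[a] + registers[b]
        elif opname == "int_mul":
            registers[target] = registers[a] * registers[b]
        else:
            raise NotImplementedError(opname)
    return registers  # state handed back to the regular interpreter
```

Once the loop finishes, we are at an opcode boundary, which is a place where the regular interpreter can safely resume.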

Doing so was straightforward, but slow when it needed to be done
very often (as is the case in some examples, but not all).
In PyPy 1.3, this blackhole interpretation step has been
redesigned as a time-critical component, and that's where the
third interpreter comes from. It is an interpreter that works
like the JIT compiler, but without the overhead of tracing (e.g.
it does not need to box all values). It was designed from the
ground up for the sole purpose of finishing the execution of the
current Python opcode. The bytecode format that it interprets is
also new, designed for that purpose, and the JIT compiler itself
(the second interpreter) was adapted to it.
The old bytecode format in PyPy 1.2 is gone
(it was more suited for the JIT compiler, but less for blackhole
interpretation).

In summary, it was a lot of changes in the most front-end-ish
parts of the JIT compiler, even though it was mostly hidden
changes. I hope that this longish blog post helped bring it a
bit more to light :-)

I fear not. The bytecode set is quite custom, made to represent RPython code, which is (roughly speaking) at the level of Java -- with a few additional instructions to guide the JIT compiler. The latest version uses a register-based machine which, when starting from the control flow graphs of RPython functions, is more convenient than a Java-like stack-based approach. It has three independent sets of registers: integers, pointers, and floating-point (pointers are distinct from integers at this level because the GC needs to track them and possibly move them). Register numbers are encoded in one byte, so there is room for 256 registers of each kind, but in practice a simple register allocation step on each graph means that no bytecode ends up using more than ~15 registers. A few parts are needed only by the JIT compiler and not by the blackhole interpreter; these are encoded "off-line" to avoid slowing down the blackhole interpreter.
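The three-register-file layout described above can be sketched like this (illustrative only; the class, the operation names and the tuple encoding are assumptions, not PyPy's actual jitcode format):

```python
class Registers:
    """Three independent register files, 256 registers each, so a
    register number always fits in one byte."""
    def __init__(self):
        self.ints = [0] * 256      # integer registers
        self.refs = [None] * 256   # pointer registers (GC-tracked)
        self.floats = [0.0] * 256  # floating-point registers

def execute(bytecode, regs):
    """Dispatch on (opcode, dest, src1, src2) tuples; the opcode name
    itself says which register file the operands live in."""
    for opcode, dest, a, b in bytecode:
        if opcode == "int_add":
            regs.ints[dest] = regs.ints[a] + regs.ints[b]
        elif opcode == "float_mul":
            regs.floats[dest] = regs.floats[a] * regs.floats[b]
        elif opcode == "ref_copy":
            regs.refs[dest] = regs.refs[a]
        else:
            raise NotImplementedError(opcode)
```

Keeping pointers in their own file is what lets the GC enumerate exactly the registers it must track, without scanning the integer or float values.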

Well, I could talk at length about all the details of the format, but in truth there is nothing very deep there :-) See the comments in http://codespeak.net/svn/pypy/trunk/pypy/jit/codewriter/codewriter.py as well as the tests like test/test_flatten.py and test/test_regalloc.py.

Only at the next iteration of the loop. However, you have to have at least ~1000 iterations before it happens.

There is a variety of tools that we use for inspecting generated loops. There is no programmable interface from Python yet, but there are some external tools.

Run: PYPYJITLOG=jit-log-opt:log pypy

and you'll get a file named log which contains all the loops. There are tools in the source checkout under pypy/jit/tool -- loopviewer.py, showstats.py and traceviewer.py -- which can help you view those loops. They'll contain debug_merge_points with info about Python opcodes (including functions and files), but a loop can span several functions. Have fun :)

JaegerMonkey is a method-by-method compiler that TraceMonkey uses *before* the tracing compiler comes into action. In PyPy, the equivalent is the normal Python interpreter, which profiles your loops to find the hot ones, with the obvious difference that JaegerMonkey is a compiler, while ours is an interpreter.

The blackhole interpreter is something that is used internally by our tracing JIT compiler, and AFAIK it has no equivalent in TraceMonkey.