Thursday, February 16, 2012

Thanks to all the people who donated to the py3k proposal, we managed to collect enough money to start working on the first step. This is a quick summary of what I did since I began working on this.
First of all, many thanks to Amaury Forgeot d'Arc, who started the py3k branch months ago, and already implemented lots of features including e.g. switching to "unicode everywhere" and the int/long unification, making my job considerably easier :-)
I started to work on the branch at the last Leysin sprint together with Romain Guillebert, where we worked on various syntactic changes such as extended tuple unpacking and keyword-only arguments. Working on such features is a good way to learn about many of the layers the PyPy Python interpreter is composed of, because often you have to touch the tokenizer, the parser, the AST builder, the compiler and finally the interpreter.
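For reference, the two features look like this in Python 3 (my own minimal examples, not tests from the branch):

```python
# Python 3 only: extended tuple unpacking and keyword-only arguments.
first, *middle, last = [1, 2, 3, 4, 5]
assert (first, middle, last) == (1, [2, 3, 4], 5)

def clamp(value, *, lo=0, hi=10):
    # lo and hi come after the bare "*", so they must be passed by keyword
    return max(lo, min(hi, value))

assert clamp(42, hi=100) == 42
assert clamp(42) == 10
```

Both constructs are syntax errors on Python 2.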
Then I worked on improving our test machinery in various ways, e.g. by optimizing the initialization phase of the object space created by tests, which considerably speeds up small test runs, and by adding the possibility to automatically run our tests against CPython 3, to ensure that we are not trying to fix a test which is meant to fail :-). I also set up our buildbot to run the py3k tests nightly, so that we can have an up-to-date overview of what is left to do.
Finally I started to look at all the tests in the interpreter/ directory, trying to untangle the mess of failing tests. Lots of tests were failing because of simple syntax errors (e.g., by using the no longer valid except Exception, e syntax or the old print statement), others for slightly more complex reasons like unicode vs bytes or the now-gone int/long distinction. Others were failing simply because they relied on new features, such as the new lexical exception handlers.
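As a minimal illustration (my own example, not one of the failing tests), the Python 3 spellings of these constructs are:

```python
# Python 3 spellings of two constructs that broke the old tests:
try:
    1 / 0
except ZeroDivisionError as exc:   # Python 2: "except ZeroDivisionError, exc:"
    message = "caught: %s" % exc
print(message)                     # print is now a function, not a statement

# Python 3's lexical exception handlers delete the bound name when the
# handler exits, so referencing "exc" here raises NameError:
try:
    exc
except NameError:
    message += " (and exc is unbound afterwards)"
```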
To give some numbers, at some point in January we had 1621 failing tests in the branch, while today we are under 1000 (to be exact: 999, which is why I've waited until today to post the status update :-)).
Before ending this blog post, I would like to thank once again all the people who donated to PyPy and made it possible for me to do this wonderful job. That's all for now, I'll post more updates soon.
cheers, Antonio


Monday, February 13, 2012

Part 4 of Comparing Partial Evaluation to Tracing

This is the fourth and final blog post in a series about comparing partial evaluation and
tracing. We've come a long way: In the first post of the series I showed an interpreter for a small flow-graph
language together with a partial evaluator for it. In the second post I showed how a tracer for
the same language works and how it relates to both execution and to partial
evaluation. The third post described an optimizer for traces.

In this final post we can compare and contrast the two different approaches of
tracing and partial evaluation by means of an example. The programs in the flow
chart language seen so far have been rather small, so I want to give an example
of a larger program: an interpreter for an extremely simple bytecode
instruction set. I will look at how the partial evaluator deals with that
interpreter, and
what the tracer does with it. The code for
that, as well as all the code of the series can be found here: http://paste.pocoo.org/show/550282/ (some small
additions have been made, such as a nicer way to print traces).

A Bytecode Interpreter

Writing programs in the flow graph language is painful, but I still want to give
an example that is a bit more interesting than the tiny ones that we've seen so
far. The example is an interpreter for the bytecode of a very trivial
register-based language. The language has four registers, one of which is an
accumulator on which all the actual operations are performed.

The opcodes of the language are:

jump_if_a, jumps to a target address when the accumulator is non-zero

mov_a_r0, mov_a_r1, mov_a_r2 move the value of the accumulator to
the respective register

mov_r0_a, mov_r1_a, mov_r2_a move the value of a register to
the accumulator

add_r0_to_a, add_r1_to_a, add_r2_to_a add the value of the
register to the accumulator

decr_a decrement the accumulator

return_a stop the program and print the accumulator

The interpreter has a main loop that reads the opcode at the current program
counter, does a (lengthy) dispatch to the right implementation via a series of if
statements and then executes the right opcode. Afterwards the next opcode is
treated in the same way.

Here is a part of the source code in the flow graph language. As pseudocode:

The bytecode_loop block is the main dispatch loop. It reads an opcode out of the
bytecode list at the program counter position, then has a long series of if
statements that compares the current opcode to the various existing opcodes.
The full code of the interpreter can be found under the link above.

The instruction set doesn't really permit hugely complex
programs, but it can be used to write a program that computes the square of a
number, as follows:
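Since the paste linked above may no longer be available, here is a rough Python transliteration of the interpreter together with one possible encoding of the squaring program (the exact opcode order in the loop body is my reconstruction, not necessarily the original):

```python
# A Python sketch of the bytecode interpreter from the flow graph language.
def interp(bytecode, a):
    r0 = r1 = r2 = 0
    pc = 0
    while True:
        opcode = bytecode[pc]
        pc += 1
        if opcode == "jump_if_a":
            target = bytecode[pc]
            pc += 1
            if a:
                pc = target
        elif opcode == "mov_a_r0":
            r0 = a
        elif opcode == "mov_a_r1":
            r1 = a
        elif opcode == "mov_a_r2":
            r2 = a
        elif opcode == "mov_r0_a":
            a = r0
        elif opcode == "mov_r1_a":
            a = r1
        elif opcode == "mov_r2_a":
            a = r2
        elif opcode == "add_r0_to_a":
            a += r0
        elif opcode == "add_r1_to_a":
            a += r1
        elif opcode == "add_r2_to_a":
            a += r2
        elif opcode == "decr_a":
            a -= 1
        elif opcode == "return_a":
            return a  # the flow-graph version prints the accumulator

# Square the accumulator (for a > 0): r2 accumulates a copies of a,
# while r0 counts down from a to 0.
SQUARE = [
    "mov_a_r0",        # 0: counter = a
    "mov_a_r1",        # 1: keep a copy of the original a
    "mov_r0_a",        # 2: loop header: a = counter
    "decr_a",          # 3: counter--
    "mov_a_r0",        # 4
    "mov_r2_a",        # 5: a = result so far
    "add_r1_to_a",     # 6: result += original a
    "mov_a_r2",        # 7
    "mov_r0_a",        # 8: a = counter
    "jump_if_a", 2,    # 9-10: loop while counter != 0
    "mov_r2_a",        # 11: a = result
    "return_a",        # 12
]
```

With this encoding, `interp(SQUARE, 5)` returns 25.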

Partially Evaluating the Bytecode Interpreter

The partial evaluator from the first blog post can be easily used to partially
evaluate the bytecode interpreter. The static input is the bytecode for
computing the square and the initial program counter value, as given above. The
dynamic input are the content of the accumulator (the number to be squared).
This can be done as follows:

The result is lots of blocks that do nothing but jump to another block, interspersed with
some blocks that contain an actual operation. I cleaned the output up manually
and got something like the following (this sort of cleanup is something a good
partial evaluation system would do itself, after partial evaluation has
occurred):

What do we see here? The partial evaluator has generated a block bytecode_loop1,
which corresponds to the initialization opcodes mov_a_r0 and mov_a_r1 together
with one iteration of the loop. Then it either jumps to a copy of the main loop
(label op_jump_if_a_jump1) or to block bytecode_loop11, which prints the result
and then stops. The residual code does exactly what the bytecode did: It
squares the accumulator, then prints it. All the uses of the bytecode and
pc variables are gone.
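In Python terms, the cleaned-up residual code computes roughly the following (a hand-written sketch with my own names; the actual residual code is in the flow graph language and prints the result rather than returning it):

```python
def square_residual(a):
    # What remains after specializing the interpreter on the squaring
    # bytecode: the initialization moves and the loop body, with all
    # bytecode dispatching gone.  Assumes a > 0, like the bytecode itself.
    r0 = a          # mov_a_r0
    r1 = a          # mov_a_r1
    r2 = 0
    while True:
        a = r0
        a -= 1      # decr_a
        r0 = a
        a = r2
        a += r1     # add_r1_to_a
        r2 = a
        a = r0
        if not a:   # jump_if_a not taken: fall through and stop
            break
    return r2       # return_a
```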

Why did the partial evaluator produce two copies of the main loop that
look the same? The reason for that is that in the second copy, the additional
static information target = 2 is known for very brief periods of time, where
target is a variable in the interpreter source that stores the jump target.
This additional static information does not have any effect on the
residual code, so the same code is uselessly generated twice. This is an
example of overspecialization.

Tracing the Interpreter

In this section we will look at what happens if we try to trace the interpreter.
The naive way of doing that yields traces that are not very useful, because they
abort after one iteration. We will look at a way of avoiding this problem. The
problems described in this section are at the core of the paper Tracing the
meta-level: PyPy's tracing JIT compiler (that paper uses a slightly more
advanced version of the bytecode interpreter as an example).

To trace the interpreter, it is useful to change the bytecode_loop block from above
to always promote the bytecode and the pc variables, because without
knowing them the trace produced is not really interesting. This is similar to
making these variables static in the partial evaluation example above:

To trace the interpreter we would start naively at the bytecode_loop label, because
that's the label in the interpreter that is jumped to most often (which a
profiler could establish easily). The following command can be used for that
(this output prints traces in a slightly more readable way than in previous blog
posts):

These traces are very short. They start with promoting the bytecode and the
pc, followed by the execution of the opcode mov_r0_a, which is the
one at position 2 in the given bytecode. Then they increment the pc and
loop back to the beginning. Looking at the optimized trace, it is clear that the
trace is essentially useless. It will run only for one iteration, because in the
second iteration the pc is 3, thus the guard_value at the beginning
will fail.

This problem can be solved by tracing more than just one iteration of the
bytecode dispatch loop, which is called meta-tracing. To get this behaviour, in
this simple example it is enough to start (and thus end) tracing at a different
label, op_jump_if_a_jump. This label is hit when the interpreter executes a
jump_if_a bytecode and the jump is taken. In a loop on the level of the
executed bytecode program there is one such jump. Thus tracing from this label,
a full loop in the bytecode program is traced, containing potentially many
iterations of the bytecode dispatch loop in the control flow graph language.

That looks better. The trace corresponds to the interpreter running all the
bytecodes in the loop of the squaring function in the example bytecode above.
The optimized code starts with
two guards (checking that the bytecode is still the one for the squaring
function, checking that the pc is 2) and then only does the operations
that actually do the computation. No bytecode dispatching is performed, thus the
interpretation overhead is fully removed, apart from the two guard_value
operations at the beginning.

Many of the assignments in the trace are superfluous, e.g. all the copying back
and forth between the registers r0, r1, r2 and the accumulator a. This
could be easily solved by an even more intelligent optimization utilizing SSA
form.
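A sketch of the kind of copy propagation that would remove these assignments (my own simplified pass and data representation, not code from the series):

```python
def remove_copies(trace):
    """Forward-propagate pure copies through a linear, single-assignment
    trace and drop them.  Ops are ('mov', dst, src) for copies and
    ('op', result, name, args) for real operations."""
    replacements = {}

    def resolve(v):
        # follow chains of copies back to the original value
        while v in replacements:
            v = replacements[v]
        return v

    residual = []
    for op in trace:
        if op[0] == 'mov':
            _, dst, src = op
            replacements[dst] = resolve(src)   # the copy itself disappears
        else:
            _, result, name, args = op
            residual.append(('op', result, name, [resolve(a) for a in args]))
    return residual
```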

Conclusion About the Interpreter

Both partial evaluation and meta-tracing can be used to transform the example
bytecode computing a square into a form that shows the essential computation
that is going on, without the interpretation overhead. The naive partial evaluator
produces lots of extra blocks that just jump around, which could be solved with
a post-processing step. The tracer by itself produces uselessly short traces,
but with a simple trick of starting the trace at a different point the results
become a lot better.

In a real meta-tracing system, the meta-tracer would need a way for the author
of the interpreter
to mark which bytecode corresponds to a backward jump. It would also need better
integration with the interpreter to start tracing automatically, as well as
cache the traces. Additionally, it would have to deal better with guards that fail a
lot, attaching new traces to the failing guards. However, all that is "just"
engineering on top of the ideas presented in this series of blog posts.

High-Level Conclusion

Some concluding high-level thoughts about the similarities of tracing and
partial evaluation: Tracing and partial evaluation try to tackle a similar
problem, that of automatically reducing the interpreter overhead, though their
approaches differ slightly.

Tracing is very close to normal evaluation, only keeping some extra information
in the process. But then, the optimizer that is used in a tracer
is again very similar in structure to a partial evaluator. The task of the
optimizer is much simpler though, because it does not need to deal with control
flow at all, just a linear list of operations.

So in a sense tracing is taking those parts of partial evaluation that work (the
"just evaluate those things that you can, and leave the others") and replacing
the parts that don't (controlling unfolding) by a much more pragmatic mechanism.
That mechanism observes actual execution runs of the program to choose control
flow paths that are typical. At the same time, the tracer's focus is on loops,
because they are where most programs spend significant amounts of time.

Another point of view is that tracing is a form of partial evaluation that
replaces the control components of a partial evaluator with an oracle (the
actual execution runs) that provides the information about which paths to look at.

Already in the quite trivial interpreter here the effects of this are visible.
The simple partial evaluator over-specializes the loop and produces two
identical versions of it. The tracer doesn't, and it
also generates only code for the loop itself, not for the initialization
opcodes.

That's it for this series. To those who made it this far, thanks for following along.
Also thanks to Samuele and Sven, who consistently gave me good feedback on the
posts before I put them here.


Friday, February 10, 2012

We're pleased to announce the 1.8 release of PyPy. As usual, this
release brings a lot of bugfixes, together with performance and memory
improvements over the 1.7 release. The main highlight of the release
is the introduction of list strategies, which makes homogeneous lists
more efficient both in terms of performance and memory. This release
also upgrades us from Python 2.7.1 compatibility to 2.7.2. Otherwise
it's "business as usual" in the sense that performance improved
roughly 10% on average since the previous release.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7. It's fast (pypy 1.8 and cpython 2.7.1 performance comparison)
due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64, Mac OS X 32/64 or
Windows 32. Windows 64 work has been stalled, we would welcome a volunteer
to handle that.

Highlights

List strategies. Now lists that contain only ints or only floats should
be as efficient as storing them in a binary-packed array. It also improves
the JIT performance in places that use such lists. There are also special
strategies for unicode and string lists.

As usual, numerous performance improvements. There are many examples
of python constructs that now should be faster; too many to list them.

Bugfixes and compatibility fixes with CPython.

Windows fixes.

NumPy effort progress; for the exact list of things that have been done,
consult the numpy status page. A tentative list of things that have
been done:

multi dimensional arrays

various sizes of dtypes

a lot of ufuncs

a lot of other minor changes

Right now the numpy module is available under both numpy and numpypy
names. However, because it's incomplete, you have to import numpypy first
before doing any imports from numpy.

New JIT hooks that allow you to hook into the JIT process from your python
program. There is a brief overview of what they offer.

Standard library upgrade from 2.7.1 to 2.7.2.

Ongoing work

As usual, there is quite a bit of ongoing work that either didn't make it to
the release or is not ready yet. Highlights include:

Since the last release there was a significant breakthrough in PyPy's
fundraising. We now have enough funds to work on first stages of numpypy
and py3k. We would like to thank again everyone who donated.

It's also probably worth noting that we're considering donations for the
Software Transactional Memory project. You can read more about our plans.
Cheers,
The PyPy Team


Wednesday, February 8, 2012

Laurence Tratt from King's College London has written a long and detailed introduction to the goals and significance of RPython over on his blog. Laurie has been implementing his Converge language in RPython in recent months. He is one of the first people outside the PyPy team to have pushed a sizeable RPython-based VM quite far, adding and tuning JIT hints. The post describes some of that work and his impressions of RPython and PyPy.

"RPython, to my mind, is an astonishing project. It has, almost single-handedly, opened up an entirely new approach to VM implementation. As my experience shows, creating a decent RPython VM is not a huge amount of work (despite some frustrations). In short: never again do new languages need come with unusably slow VMs. That the PyPy / RPython team have shown that these ideas scale up to a fast implementation of a large, real-world language (Python) is another feather in their cap."


Tuesday, February 7, 2012

Part 3 of Comparing Partial Evaluation to Tracing

This is the third blog post in a series about comparing partial evaluation and
tracing. In the first post of the series I introduced a small flow-graph
language together with an interpreter for it. Then I showed a partial evaluator
for the language. In the second post of the series I showed how a tracer for
the same language works and how it relates to both execution and to partial
evaluation. Then I added support for promotion to that tracer.

In this post I will show how to optimize the traces that are produced by the
tracer and compare the structure of the optimizer to that of partial
evaluation.

Optimizing Traces

In the last post we saw how to produce a linear trace with guards by
interpreting a control flow graph program in a special mode. A trace always ends with
a loop statement, which jumps back to the beginning. The tracer simply logs
the operations that are performed while interpreting, so the trace can contain
superfluous operations. On the other hand, the trace also captures some of the
runtime values through promotions, and some decisions made on them, which can be
exploited by optimization. An example of this is the trace produced by the
promotion example from the last post:
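The post's actual trace figure is not reproduced here; the following is an illustrative stand-in (the tuple-based trace format is an assumption), consistent with the discussion below: x is promoted via guard_value, and the subsequent operations on x, x2, and x3 become candidates for folding.

```python
# Hypothetical trace after promotion of x (format assumed for illustration):
# each operation is (result, opname, args); guards carry an extra-data slot
# that starts out as an empty list.
trace = [
    ("guard_value", "x", 5, []),   # promotion: from here on, x is 5
    ("x2", "mul", ["x", 2]),
    ("x3", "add", ["x2", 2]),
    ("i", "sub", ["i", "x3"]),
    ("guard_true", "i", []),       # loop condition still holds
    ("loop",),                     # jump back to the beginning of the trace
]
```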

After the guard_value(x, 5, ...) operation, x is known to be 5: if
it isn't 5, execution falls back to the interpreter. Therefore, operations
on x after the guard can be constant-folded. To do that sort of
constant-folding, an extra optimization step is needed. That optimization step
walks along the trace and remembers which variables are constants and what
their values are, using a partial environment. The optimizer removes operations
that have only constant arguments and leaves the others in the trace. This
process is remarkably similar to partial evaluation: some variables are known
to be constants, operations with only constant arguments are optimized away,
and the rest remains.

Just like partial evaluation! It even reuses the helper function presolve
from the partial evaluator and a partial environment PEnv. When the
arguments of an operation are known constants in the partial environment, the
operation can be executed at optimization time and removed from the trace.
Otherwise, the operation has to stay in the output trace. The result variable
(as in the partial evaluator) needs to be removed from the partial environment,
because it has just been overwritten by an unknown result.

When the variable that is being guarded is actually known to be a constant, we
can remove the guard. Note that it is not possible that the guard of that
constant fails: The tracer recorded the operation while running with real
values, therefore the guards have to succeed for values the optimizer
discovers to be constant.

guard_false is slightly different from guard_true: after the former we
know that the argument is actually 0. After guard_true we only know that
it is not equal to zero, but not which precise value it has.
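The guard handling described above can be sketched as follows (function name and tuple format are assumptions for illustration): a guard on a variable already known to be a constant is dropped, since it cannot fail; an unremovable guard stays and gets a snapshot of the partial environment attached, and guard_false additionally teaches the optimizer the precise value 0.

```python
def opt_guard_bool(kind, var, penv, out):
    if var in penv:
        return out                        # value already constant: drop guard
    out.append((kind, var, dict(penv)))   # keep guard, attach partial env
    if kind == "guard_false":
        penv[var] = 0                     # after guard_false, var is exactly 0
    return out                            # after guard_true: only var != 0 known
```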

Another point to note in the optimization of guards is that the second argument
of the guard operation, which was so far always just an empty list, is now
replaced by the partial environment PEnv. I will discuss further down why
this is needed.

Optimizing guard_value is very similar, except that it really gives precise
information about the variable involved:

This operation is the main way the optimizer gains constant variables that
it then exploits to constant-fold later operations. This is a chief
difference from partial evaluation: there, the optimizer knows the values of some
variables from the start. When optimizing traces, no variable's value is known
at the beginning. Knowledge about variables is only gained later, through
guards.
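In the same assumed style as the sketch above, guard_value either disappears (when the variable is already a known constant) or stays in the trace with the partial environment attached, and from then on the variable counts as a constant:

```python
def opt_guard_value(var, const, penv, out):
    if var in penv:
        return out                     # already a known constant: guard removed
    out.append(("guard_value", var, const, dict(penv)))
    penv[var] = const                  # var is a known constant from here on
    return out
```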

What is still missing is what happens to the loop statement. In principle, it is
turned into a loop statement again. However, a few additional operations need
to be emitted at the loop statement. The reason is that we optimized away
operations, and thus assignments, whenever the result of an operation was a
constant. That means the involved variable potentially still holds some older
value. The next iteration of the loop would continue with this older value,
which is obviously wrong. Therefore we need to emit some assignments before the
loop statement, one per entry in the partial environment:
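Continuing the illustrative sketch (function name assumed): one assignment is emitted per partial-environment entry, restoring every folded-away assignment before the loop statement closes the trace.

```python
def emit_loop_end(penv):
    # restore every folded-away assignment, then close the loop
    ops = [(var, "assign", [val]) for var, val in sorted(penv.items())]
    ops.append(("loop",))
    return ops
```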

That's all the code of the optimizer. While the basic structure is quite similar to partial evaluation,
it's a lot less complex as well. What made the partial evaluator hard was that
it needs to deal with control flow statements and with making sure that code is
reused if the same block is partially evaluated with the same constants. Here,
all these complexities go away. The tracer has already removed all control flow
and replaced it with guards and one loop operation at the end. Thus, the
optimizer can simply do one pass over the operations, removing some (with some
extra care around the loop statement).

With this machinery in place, we can optimize the trace from the promotion
example of the last post:

As intended, the operations on x after the guard_value have all been
removed. However, some additional assignments (to x, x2, and x3) have been
generated at the end as well. These assignments look superfluous, but the
optimizer does not have enough information to easily recognize this. That
can be fixed, but only at the cost of additional complexity. (A real system
would transform the trace into static single assignment form to answer such
questions.)

Resuming to the Interpreter

Why does the code above need to add the partial environment to
the guards that cannot be optimized away? The reason is related to why we needed
to generate assignments before the loop statement. The problem is that the optimizer
removes assignments to variables when it knows the values of these variables.
That means that when switching back from running the optimized trace to the
interpreter, a number of variables are not updated in the environment, making
the execution in the interpreter incorrect.

In the example above, this applies to the variables x2 and x3. When the
second guard fails, they have not been assigned in the optimized case.
Therefore, the guard lists them and their (always constant) values.

When switching back, these assignments need to be made. Thus we need to adapt the
resume_interp function from the last blog post as follows:

On resuming, the ResumeVars (a former partial environment) are simply added
back to the normal environment before going back to the interpreter.
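A minimal sketch of that step (resume_interp and ResumeVars are the post's names; the dict-based environment here is an assumption): the constants stored on the failing guard are merged back into the environment before ordinary interpretation continues.

```python
def resume_interp(env, resume_vars):
    # re-materialize variables whose assignments the optimizer folded away
    new_env = dict(env)
    new_env.update(resume_vars)
    return new_env
```

For the example above, a failing second guard would carry {"x2": 10, "x3": 12}, so those variables get their correct values back before the interpreter takes over.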

The data attached to guards about what needs to be done to resume to the
interpreter when the guard fails is often a very complex part of a tracing
system. The data can become big, yet most guards never fail. Therefore, most
real systems try hard to compress the attached data or try to share it between
subsequent guards.

Summary

In this post we have shown how to optimize traces by applying a variant of the
partial evaluation principle: Perform all the operations that have only constant
arguments, leave the others alone. However, optimizing traces is much simpler,
because no control flow is involved. All the questions about control flow have
already been solved by the tracing component.

In the next and final post of the series I will show a larger example of how
tracing and partial evaluation can be used to optimize a small bytecode
interpreter.

Wednesday, February 1, 2012

In this post I want to give an update on the status of the ARM backend for PyPy's JIT and describe some of the issues and details of the backend.

Current Status

I have been working on the ARM backend for more than a year now. It is finally in a shape where we can measure meaningful numbers and ask for feedback. Since the last post about the backend we have added support for floating point operations as well as for PyPy's framework GCs. Another area of work was keeping up with the constant improvements made on the main development branch, such as out-of-line guards, labels, etc. For about a year it has been possible to cross-translate the PyPy Python interpreter and other interpreters, such as Pyrolog, with a JIT, and to run benchmarks on ARM. Up until now, however, there remained some hard-to-track bugs that in certain cases would cause the interpreter to crash with a segmentation fault when running with the JIT on ARM. Lately it was possible to run all benchmarks without problems, but running the translation toolchain itself would still crash. During the last PyPy sprint in Leysin, Armin and I managed to fix several of these hard-to-track bugs in the ARM backend, with the result that it is now possible to run the PyPy translator on ARM itself (at least until it runs out of memory), which is a kind of litmus test for the backend and used to crash before. To be clear: we are not yet able to complete a PyPy translation on ARM, because the hardware we currently have available does not have enough memory. But up to the point where we run out of memory, the JIT does not hit any issues.

Implementation Details

The hardware requirements for running the JIT on ARM follow those of Ubuntu on ARM, which targets ARMv7 with a VFP unit running in little-endian mode. The JIT can be translated without floating point support, but there might be a few places that need to be fixed for this setting to fully work. We are targeting the ARM instruction set because, at least at the time we decided on it, it seemed to be the best choice in terms of speed, at the cost of some size overhead compared to the Thumb2 instruction set. The Thumb2 instruction set should give comparable speed with better code density, but it has a few restrictions on the number of registers available and on the use of conditional execution. Also, the implementation is a bit easier with a fixed-width instruction set, and when using the ARM instruction set we can use the full set of registers in the generated code.

The calling convention on ARM

The calling convention on ARM uses 4 of the general purpose registers to pass arguments to functions; further arguments are passed on the stack. A floating point unit is not required for ARM cores, so there are different ways of handling floats in the calling convention. The so-called soft-float calling convention is independent of the presence of a floating point unit: floating point arguments to functions are passed in the general purpose registers and on the stack. Passing floats around this way works with both software and hardware floating point implementations. But in the presence of a floating point unit it produces some overhead, because floating point numbers need to be moved from the floating point unit to the core registers to make a call, and moved back to the floating point registers by the callee. The alternative, the so-called hard-float calling convention, requires a floating point unit but has the advantage of eliminating the overhead of moving floating point values around when performing a call. Although it would be better in the long term to support the hard-float calling convention, we need to be able to interoperate with external code compiled for the operating system we are running on; for this reason, at the moment we only support the soft-float convention. We implemented and tested the backend on a BeagleBoard-xM with a Cortex-A8 processor running Ubuntu 11.04 for ARM.

Translating for ARM

The toolchain currently used to translate PyPy is based on Scratchbox2, a cross-compiling environment. Its development had stopped for a while, but it seems to have revived again. We run a 32-bit Python interpreter on the host system and perform all calls to the compiler through a Scratchbox2-based environment. A description of how to set up the cross-translation toolchain can be found here.

Results

The current results on ARM, as shown in the graph below, show that the JIT currently gives a speedup of about 3.5 times compared to CPython on ARM. The benchmarks were run on the aforementioned BeagleBoard-xM with a 1GHz ARM Cortex-A8 processor and 512MB of memory. The operating system on the board is Ubuntu 11.04 for ARM. We measured the PyPy interpreter with the JIT enabled and disabled, comparing each to CPython 2.7.1+ (r271:86832) for ARM. The graph shows the speedup or slowdown of both PyPy versions for the different benchmarks from our benchmark suite, normalized to the runtime of CPython. The data used for the graph is shown below.

The speedup is smaller than the speedup of 5.2 times we currently get on x86 on our own benchmark suite (see http://speed.pypy.org for details). There are several possible reasons for this. Comparing the results for the interpreter without the JIT on ARM and x86 suggests that the interpreter generated by PyPy, without the JIT, performs worse relative to CPython on ARM than it does on x86. It is also quite possible that the code we are generating with the JIT is not yet optimal, and some architectural constraints produce additional overhead. One of these differences is the handling of constants: most ARM instructions only support 8-bit immediate values (that can be shifted), so larger constants need to be loaded into a register, something that is not necessary on x86.
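The 8-bit immediate constraint mentioned above is ARM's "modified immediate" encoding: an 8-bit value rotated right by an even amount within the 32-bit word. A hypothetical helper (not PyPy's actual code) to check whether a constant fits could look like this:

```python
def is_arm_immediate(value):
    """True if value fits an ARMv7 modified-immediate encoding:
    an 8-bit constant rotated right by an even amount within 32 bits."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # rotating left by `rot` undoes a rotate-right-by-`rot` encoding
        v = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if v <= 0xFF:
            return True
    return False
```

Constants that fail this check, such as most 32-bit addresses, have to be materialized in a register with extra instructions, which is one source of the overhead compared to x86.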

Benchmark            PyPy JIT          PyPy no JIT
ai                   0.484439780047    3.72756749625
chaos                0.0807291691934   2.2908692212
crypto_pyaes         0.0711114832245   3.30112318509
django               0.0977743245519   2.56779947601
fannkuch             0.210423735698    2.49163632938
float                0.154275334675    2.12053281495
go                   0.330483034202    5.84628320479
html5lib             0.629264389862    3.60333138526
meteor-contest       0.984747426912    2.93838610037
nbody_modified       0.236969593082    1.40027234936
pyflate-fast         0.367447191807    2.72472422146
raytrace-simple      0.0290527461437   1.97270054339
richards             0.034575573553    3.29767342015
slowspitfire         0.786642551908    3.7397367403
spambayes            0.660324379456    3.29059863111
spectral-norm        0.063610783731    4.01788986233
spitfire             0.43617131165     2.72050579076
spitfire_cstringio   0.255538702134    1.7418593111
telco                0.102918930413    3.86388866047
twisted_iteration    0.122723986805    4.33632475491
twisted_names        2.42367797135     2.99878698076
twisted_pb           1.30991837431     4.48877805486
twisted_tcp          0.927033354055    2.8161624665
waf                  1.02059811932     1.03793427321

The next steps and call for help

Although there are probably still some remaining issues that have not surfaced yet, the JIT backend for ARM is working. Before we can merge the backend into the main development line there are some things we would like to do first; in particular, we are looking for a way to run all the PyPy tests to verify that things work on ARM before merging. Additionally there are some other long-term ideas. For all of this we are looking for people willing to help, either by contributing implementations of the open features or by helping us with hardware to test on.

The incomplete list of open topics:

We are looking for a better way to translate PyPy for ARM than the one described above. I am not sure there currently is ARM-based hardware with enough memory to translate PyPy directly on it; this would require between 1.5 and 2 GB of memory. A fully QEMU-based approach could also work, instead of Scratchbox2, which uses QEMU under the hood.

Test the JIT on different hardware.

Experiment with the JIT settings to find the optimal thresholds for ARM.

Continuous integration: We are looking for a way to run the PyPy test suite to make sure everything works as expected on ARM, here QEMU also might provide an alternative.

A long-term plan would be to port the backend to the ARMv5 ISA and improve support for systems without a floating point unit. This would require implementing the ISA, creating different code paths, and improving instruction selection depending on the target architecture.

Review the machine code the JIT generates on ARM to see if the instruction selection makes sense for ARM.
