Thursday, September 23, 2010

This year we had a record two and a half applications (one was at the
intersection of PyPy and numpy) accepted for the Google
SoC program. Since it ended a couple of weeks ago, we want to present the results that
were achieved. All three projects were completed successfully, although the degree
of success varied quite a bit.

The Numpy proposal made significant progress on making numpy compatible with
PyPy's support for CPython extension modules, but failed to bring PyPy's numpy
implementation into a usable shape (which is a somewhat ambitious goal, one
might argue). The experiments done during the project live on the
micronumpy branch.

The Fast ctypes proposal did some useful experiments on how to JIT external
calls from PyPy to C. However, the actual code as of now is not very
interesting and is quite far from providing a full ctypes replacement (or
equivalent).

Definitely the most successful proposal was a 64bit (x86_64) backend for PyPy's
JIT. It not only includes a working 64bit JIT (merged into PyPy trunk), but also
a working asmgcc for the x86_64 Linux platform, which makes it possible to run the JIT
on this architecture with our advanced garbage collectors. One can say that
x86_64 is now no longer a second-class citizen for PyPy, although it definitely
hasn't received as much testing as the x86 platform. Expect this to be a major
selling point for the next PyPy release :-)

Cheers,
fijal & the PyPy team

Wednesday, September 22, 2010

This blog post is a successor to the one about escape analysis in PyPy's
JIT. The examples from there will be continued here. This post is a bit
science-fictiony. The algorithm that PyPy currently uses is significantly more
complex and much harder to follow than the one described here. The resulting
behaviour is very similar, however, so we will use the simpler version (and we
might switch to that at some point in the actual implementation).

In the last blog post we described how escape analysis can be used to remove
many of the allocations of short-lived objects and many of the type dispatches
that are present in a non-optimized trace. In this post we will extend the
optimization to handle more cases.

To understand better what the optimization described in the last blog post
can achieve, look at the following figure:

The figure shows a trace before optimization, together with the lifetime of
various kinds of objects created in the trace. It is executed from top to
bottom. At the bottom, a jump is used to execute the same loop another time.
For clarity, the figure shows two iterations of the loop.
The loop is executed until one of the guards in the trace fails, and the
execution is aborted.

Some of the operations within this trace are new operations, which each create a
new instance of some class. These instances are used for a while, e.g. by
calling methods on them, reading and writing their fields. Some of these
instances escape, which means that they are stored in some globally accessible
place or are passed into a function.

Together with the new operations, the figure shows the lifetimes of the
created objects. Objects in category 1 live for a while, and are then just not
used any more. The creation of these objects is removed by the
optimization described in the last blog post.

Objects in category 2 live for a while and then escape. The optimization of the
last post deals with them too: the new that creates them and
the field accesses are deferred, until the point where the object escapes.

The objects in category 3 and 4 are in principle like the objects in category 1
and 2. They are created, live for a while, but are then passed as an argument
to the jump operation. In the next iteration they can either die (category
3) or escape (category 4).
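
The four categories can also be imitated at the level of ordinary Python source. The following is only an analogy, not a trace: the Box class, the escaped list (standing in for "some globally accessible place") and the function run are all hypothetical names made up for this sketch.

```python
class Box(object):
    def __init__(self, value):
        self.value = value


escaped = []  # hypothetical stand-in for a globally accessible place


def run(n):
    acc = Box(0)    # handed from one iteration to the next
    pending = None  # handed from one iteration to the next
    while n > 0:
        if pending is not None:
            # Category 4: created in the previous iteration, escapes in this one.
            escaped.append(pending)
        # Category 3: acc was created in the previous iteration, is read once
        # here and is never used again.
        total = acc.value
        # Category 1: used for a while, then simply dead.
        tmp = Box(n)
        # Category 2: used for a while, then escapes in the same iteration.
        shared = Box(tmp.value * 2)
        escaped.append(shared)
        # These two objects cross the "jump" to the next iteration.
        acc = Box(total + tmp.value)
        pending = Box(n)
        n -= 1
    return acc.value
```

Only the objects appended to escaped have to exist on the heap; tmp and every acc are candidates for removal.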

The optimization of the last post considered the passing of an object along a
jump to be equivalent to escaping. It was thus treating objects in category 3
and 4 like those in category 2.

The improved optimization described in this post will make it possible to deal
better with objects in category 3 and 4. This will have two consequences:
first, more allocations are removed from the trace (which is clearly good);
second, as a side effect, the traces will also be type-specialized.

Optimizing Across the Jump

Let's look at the final trace obtained in the last post for the example loop.
The final trace was much better than the original one, because many allocations
were removed from it. However, it also still contained allocations:

The two new BoxedIntegers stored in p15 and p10 are passed into
the next iteration of the loop. The next iteration will check that they are
indeed BoxedIntegers, read their intval fields and then not use them
any more. Thus those instances are in category 3.

In its current state the loop
allocates two BoxedIntegers at the end of every iteration, which then die
very quickly in the next iteration. In addition, the type checks at the start
of the loop are superfluous, at least after the first iteration.

The reason why we cannot optimize the remaining allocations away is that
their lifetime crosses the jump. To improve the situation, a little trick is
needed. The trace above represents a loop, i.e. the jump at the end jumps to
the beginning. Where in the loop the jump occurs is arbitrary, since the loop
can only be left via failing guards anyway. Therefore it does not change the
semantics of the loop to put the jump at another point in the trace, and we
can move the jump operation to just above the allocation of the objects that
appear in the current jump. This needs some care, because the arguments to
jump are all the currently live variables, so they need to be adapted.
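
Moving the jump can be sketched on a toy trace representation. This is a minimal illustration under stated assumptions, not PyPy's implementation: operations are (result, opname, args) tuples in SSA form, the function rotate_loop and the trailing-quote renaming scheme are invented for this sketch, and what a failing guard needs in order to resume the interpreter is ignored entirely.

```python
def rotate_loop(inputargs, ops, jump_args, split):
    """Move the jump of a loop trace to just before ops[split].

    The loop is: inputargs -> ops -> jump(jump_args). Returns the tuple
    (new_inputargs, new_ops, new_jump_args) describing the rotated loop.
    """
    prefix, suffix = ops[:split], ops[split:]

    # Variables defined before the split point ...
    defined = list(inputargs) + [res for res, _, _ in prefix if res is not None]
    # ... that are still needed after it are live across the new jump.
    used_later = set(jump_args)
    for _, _, args in suffix:
        used_later.update(args)
    live = [v for v in defined if v in used_later]

    # The old jump rebound the old input arguments to the jump arguments.
    rename = dict(zip(inputargs, jump_args))
    new_ops = list(suffix)
    for res, opname, args in prefix:
        new_args = tuple(rename.get(a, a) for a in args)
        if res is not None:
            rename[res] = res + "'"  # fresh name keeps the trace in SSA form
            res = rename[res]
        new_ops.append((res, opname, new_args))
    new_jump_args = [rename[v] for v in live]
    return live, new_ops, new_jump_args


# A shortened, hypothetical loop with one boxed variable.
ops = [
    (None, "guard_class", ("p0", "BoxedInteger")),
    ("i2", "getfield_gc", ("p0", "intval")),
    ("i3", "int_add", ("i2", -1)),
    ("p4", "new", ("BoxedInteger",)),
    (None, "setfield_gc", ("p4", "i3", "intval")),
]
new_inputargs, new_ops, new_jump_args = rotate_loop(["p0"], ops, ["p4"], 3)
```

In the rotated loop the allocation of p4 comes first and the guard and field read on it follow, so the object no longer has to survive a jump. The loop input is now the plain integer i3.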

If we do that for our example trace above, the trace looks like this:

Now the lifetime of the remaining allocations no longer crosses the jump, and
we can run our escape analysis a second time, to get the following trace:

This result is now really good. The code performs the same operations as
the original code, but using direct CPU arithmetic and no boxing, as opposed to
the original version which used dynamic dispatching and boxing.

Looking at the final trace it is also completely clear that specialization has
happened. The trace corresponds to the situation in which the trace was
originally recorded, which happened to be a loop where BoxedIntegers were
used. The resulting loop no longer refers to the BoxedInteger class at
all, but it still has the same behaviour. If the original loop had
used BoxedFloats, the final loop would use float_* operations
everywhere instead (or even be very different, if the object model had
user-defined classes).

Entering the Loop

The approach of placing the jump at some other point in the loop leads to
one additional complication that we have glossed over so far. The beginning of the
original loop corresponds to a point in the original program, namely the
while loop in the function f from the last post.

Now recall that in a VM that uses a tracing JIT, all programs start by being
interpreted. This means that when f is executed by the interpreter, it is
easy to go from the interpreter to the first version of the compiled loop.
After the jump is moved and the escape analysis optimization is applied a
second time, this is no longer easily possible. In particular, the new loop
expects two integers as input arguments, while the old one expected two
instances.

To make it possible to enter the loop directly from the interpreter, there
needs to be some additional code that enters the loop by taking as input
arguments what is available to the interpreter, i.e. two instances. This
additional code corresponds to one iteration of the loop, which is thus
peeled off:
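
The effect of peeling can be imitated by hand in plain Python. This is only an analogy for what the traces do, not JIT output: f_boxed is a hypothetical stand-in for the loop in f, and the cut-down BoxedInteger class is an assumption made so that the block runs on its own.

```python
class BoxedInteger(object):
    def __init__(self, intval):
        self.intval = intval


def f_boxed(y):
    # Hypothetical stand-in for the loop in f: every iteration allocates boxes.
    res = BoxedInteger(0)
    while y.intval > 0:
        res = BoxedInteger(res.intval + y.intval - 100)
        y = BoxedInteger(y.intval - 1)
    return res


def f_peeled(y):
    res = BoxedInteger(0)
    # Preamble: the peeled-off iteration. It accepts what the interpreter
    # has (two instances), checks their classes once and unboxes them.
    assert type(y) is BoxedInteger and type(res) is BoxedInteger
    if not y.intval > 0:
        return res
    i_res = res.intval + y.intval - 100
    i_y = y.intval - 1
    # Specialized loop: its inputs are two plain integers, no boxes are made.
    while i_y > 0:
        i_res = i_res + i_y - 100
        i_y = i_y - 1
    # Leaving the loop: the result is boxed again for the interpreter.
    return BoxedInteger(i_res)
```

Both functions compute the same result, but f_peeled allocates a constant number of boxes regardless of how many iterations run.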

Summary

The optimization described in this post can be used to optimize away
allocations in category 3 and improve allocations in category 4, by deferring
them until they are no longer avoidable. A side-effect of these optimizations
is also that the optimized loops are specialized for the types of the variables
that are used inside them.

Monday, September 13, 2010

The goal of a just-in-time compiler for a dynamic language is obviously to
improve the speed of the language over an implementation of the language that
uses interpretation. The first goal of a JIT is thus to remove the
interpretation overhead, i.e. the overhead of bytecode (or AST) dispatch and the
overhead of the interpreter's data structures, such as the operand stack. The
second important problem that any JIT for a dynamic language needs to solve is
how to deal with the overhead of boxing of primitive types and of type
dispatching. Those are problems that are usually not present in statically typed
languages.

Boxing of primitive types means that dynamic languages need to be able to handle
all objects, even integers, floats, etc. in the same way as user-defined
instances. Thus those primitive types are usually boxed, i.e. a small
heap structure that contains the actual value is allocated for them.

Type dispatching is the process of finding the concrete implementation that is
applicable to the objects at hand when performing a generic operation. An
example would be the addition of two objects: the addition needs to check what
the concrete types of the objects to be added are, and choose the implementation
that fits them.

Last year, we wrote a blog post and a paper about how PyPy's meta-JIT
approach works. These explain how the meta-tracing JIT can remove the overhead
of bytecode dispatch. In this post (and probably a followup) we want to explain
how the traces that are produced by our meta-tracing JIT are then optimized to
also remove some of the overhead more closely associated with dynamic languages,
such as boxing overhead and type dispatching. The most important technique to
achieve this is a form of escape analysis that we call virtual objects.
This is best explained via an example.

Running Example

For the purpose of this blog post, we are going to use a very simple object
model that supports only an integer and a float type. The objects support only
two operations, add, which adds two objects (promoting ints to floats in a
mixed addition) and is_positive, which returns whether the number is greater
than zero. The implementation of add uses classical Smalltalk-like
double-dispatching. These classes could be part of the implementation of a very
simple interpreter written in RPython.
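
A minimal sketch of such an object model, written from the description above. The method names add__int and add__float and the field names intval and floatval are assumptions of this sketch (only intval appears elsewhere in the text).

```python
class Base(object):
    def add(self, other):
        """Add self and other; the first dispatch is on the type of self."""
        raise NotImplementedError("abstract base")

    def add__int(self, intother):
        """Add a raw int to self; the second dispatch is on the type of other."""
        raise NotImplementedError("abstract base")

    def add__float(self, floatother):
        """Add a raw float to self."""
        raise NotImplementedError("abstract base")

    def is_positive(self):
        """Return whether the number is greater than zero."""
        raise NotImplementedError("abstract base")


class BoxedInteger(Base):
    def __init__(self, intval):
        self.intval = intval

    def add(self, other):
        return other.add__int(self.intval)

    def add__int(self, intother):
        return BoxedInteger(intother + self.intval)

    def add__float(self, floatother):
        # Mixed addition promotes the int to a float.
        return BoxedFloat(floatother + float(self.intval))

    def is_positive(self):
        return self.intval > 0


class BoxedFloat(Base):
    def __init__(self, floatval):
        self.floatval = floatval

    def add(self, other):
        return other.add__float(self.floatval)

    def add__int(self, intother):
        return BoxedFloat(float(intother) + self.floatval)

    def add__float(self, floatother):
        return BoxedFloat(floatother + self.floatval)

    def is_positive(self):
        return self.floatval > 0.0
```

Every call to add goes through two method calls and ends in one fresh heap object, which is exactly the cost discussed next.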

Using these classes to implement arithmetic shows the basic problem that a
dynamic language implementation has. All the numbers are instances of either
BoxedInteger or BoxedFloat, thus they consume space on the heap. Performing many
arithmetic operations produces lots of garbage quickly, thus putting pressure on
the garbage collector. Using double dispatching to implement the numeric tower
needs two method calls per arithmetic operation, which is costly due to the
method dispatch.

To understand the problems more directly, let us consider a simple function that
uses the object model:
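
A sketch of such a function, consistent with the description that follows (the loop runs y times and performs five allocations per iteration). The integer-only BoxedInteger here is a cut-down assumption so that the block runs on its own.

```python
class BoxedInteger(object):
    # Integer-only stand-in for the object model described above.
    def __init__(self, intval):
        self.intval = intval

    def add(self, other):
        return BoxedInteger(self.intval + other.intval)

    def is_positive(self):
        return self.intval > 0


def f(y):
    res = BoxedInteger(0)
    while y.is_positive():
        # Two constants are boxed and three additions each box their result.
        res = res.add(y).add(BoxedInteger(-100))
        y = y.add(BoxedInteger(-1))
    return res
```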

The loop iterates y times, and computes something in the process. To
understand the reason why executing this function is slow, here is the trace
that is produced by the tracing JIT when executing the function with y
being a BoxedInteger:

The trace is inefficient for a couple of reasons. One problem is that it
repeatedly and redundantly checks the class of the objects involved, using a
guard_class instruction. In addition, some new BoxedInteger instances are
constructed using the new operation, only to be used once and then forgotten
a bit later. In the next section, we will see how this can be improved upon,
using escape analysis.

Virtual Objects

The main insight to improve the code shown in the last section is that some of
the objects created in the trace using a new operation don't survive very
long and are collected by the garbage collector soon after their allocation.
Moreover, they are used only inside the loop, thus we can easily prove that
nobody else in the program stores a reference to them. The
idea for improving the code is thus to analyze which objects never escape the
loop and thus do not need to be allocated at all.

This process is called escape analysis. The escape analysis of
our tracing JIT works by using virtual objects: The trace is walked from
beginning to end and whenever a new operation is seen, the operation is
removed and a virtual object is constructed. The virtual object summarizes the
shape of the object that is allocated at this position in the original trace,
and is used by the escape analysis to improve the trace. The shape describes
where the values that would be stored in the fields of the allocated objects
come from. Whenever the optimizer sees a setfield that writes into a virtual
object, that shape summary is thus updated and the operation can be removed.
When the optimizer encounters a getfield from a virtual, the result is read
from the virtual object, and the operation is also removed.

In the example from the last section, the following operations would produce two
virtual objects, and be completely removed from the optimized trace:

The guard_class operations can be removed, because the classes of p5 and
p6 are known to be BoxedInteger. The getfield_gc operations can be removed
and i7 and i8 are just replaced by i4 and -100. Thus the only
remaining operation in the optimized trace would be:

i9 = int_add(i4, -100)

The rest of the trace is optimized similarly.
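
The walk over the trace described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not PyPy's optimizer: operations are (result, opname, args) tuples, the names Virtual and optimize are invented here, and the example trace is a hypothetical reconstruction that merely reuses the variable names from the text. Forcing is deliberately left out, so the sketch refuses traces in which a virtual object is used by any other operation.

```python
class Virtual(object):
    """Shape summary of an allocation that has not been emitted."""

    def __init__(self, cls):
        self.cls = cls
        self.fields = {}  # field name -> variable or constant stored there


def optimize(trace):
    virtuals = {}  # variable name -> Virtual
    replace = {}   # removed result variable -> value that stands in for it
    out = []
    for result, opname, args in trace:
        args = tuple(replace.get(a, a) for a in args)
        target = args[0] if args else None
        if opname == "new":
            # Do not allocate; remember the shape instead.
            virtuals[result] = Virtual(args[0])
        elif opname == "setfield_gc" and target in virtuals:
            obj, value, field = args
            virtuals[obj].fields[field] = value
        elif opname == "getfield_gc" and target in virtuals:
            obj, field = args
            replace[result] = virtuals[obj].fields[field]
        elif (opname == "guard_class" and target in virtuals
              and virtuals[target].cls == args[1]):
            pass  # the class is known, so the guard can never fail
        else:
            if any(a in virtuals for a in args):
                raise NotImplementedError("forcing is not part of this sketch")
            out.append((result, opname, args))
    return out


trace = [
    ("p5", "new", ("BoxedInteger",)),
    (None, "setfield_gc", ("p5", "i4", "intval")),
    ("p6", "new", ("BoxedInteger",)),
    (None, "setfield_gc", ("p6", -100, "intval")),
    (None, "guard_class", ("p5", "BoxedInteger")),
    ("i7", "getfield_gc", ("p5", "intval")),
    (None, "guard_class", ("p6", "BoxedInteger")),
    ("i8", "getfield_gc", ("p6", "intval")),
    ("i9", "int_add", ("i7", "i8")),
]
optimized = optimize(trace)
```

On this input the only operation left is the int_add on i4 and -100. Operations on objects that are not virtual are passed through unchanged.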

So far we have only described what happens when virtual objects are used in
operations that read and write their fields. When the virtual object is used in
any other operation, it cannot stay virtual. For example, when a virtual object
is stored in a globally accessible place, the object needs to actually be
allocated, as it will live longer than one iteration of the loop.

This is what happens at the end of the trace above, when the jump operation
is hit. The arguments of the jump are at this point virtual objects. Before the
jump is emitted, they are forced. This means that the optimizer produces code
that allocates a new object of the right type and sets its fields to the field
values that the virtual object has. This means that instead of the jump, the
following operations are emitted:

Note how the operations for creating these two instances have been moved down the
trace. It looks like for these operations we actually didn't win much, because
the objects are still allocated at the end. However, the optimization was still
worthwhile even in this case, because some operations that have been performed
on the forced virtual objects have been removed (some getfield_gc operations
and guard_class operations).
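
Forcing can be sketched in the same toy style. The functions force and emit_jump and the (class, fields) description of a virtual object are assumptions of this sketch, as is the variable i14; virtual objects nested inside other virtual objects are not handled.

```python
def force(var, cls, fields):
    """Return the operations that materialize one virtual object."""
    ops = [(var, "new", (cls,))]
    for field, value in sorted(fields.items()):
        ops.append((None, "setfield_gc", (var, value, field)))
    return ops


def emit_jump(jump_args, virtuals):
    """Force every virtual jump argument, then emit the jump itself."""
    pending = dict(virtuals)  # copy, so the caller's dict is left alone
    ops = []
    for arg in jump_args:
        if arg in pending:
            cls, fields = pending.pop(arg)  # force each object only once
            ops.extend(force(arg, cls, fields))
    ops.append((None, "jump", tuple(jump_args)))
    return ops


virtuals = {
    "p15": ("BoxedInteger", {"intval": "i14"}),
    "p10": ("BoxedInteger", {"intval": "i9"}),
}
tail = emit_jump(["p15", "p10"], virtuals)
```

The allocation and the field writes appear immediately before the jump, which is the latest point at which they can still be emitted.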

The optimized trace contains only two allocations, instead of the original five,
and only three guard_class operations, from the original seven.

Summary

In this blog post we described how simple escape analysis within the scope of
one loop works. This optimization reduces the allocation of many intermediate
data structures that become garbage quickly in an interpreter. It also removes a
lot of the type dispatching overhead. In a later post, we will explain how this
optimization can be improved further.
