Saturday, January 29, 2011

In the past few months, I have been developing as a part of my master thesis
the ARM backend for the the PyPy JIT, in the arm-backend branch. Currently, it is still work in progress: all integer and object operations are working and
the support for floating point is also under development.
ARM processors are very widely used, beeing deployed in servers, some netbooks
and mainly mobile devices such as phones and tablets. One of our goals is to be
able to run PyPy on phones, specially on Android. Currently is not yet possible
to translate and compile PyPy for Android automatically, but there has been
some work on using Android's NDK to compile PyPy's generated C code.
The JIT Backend targets the application profile of the ARMv7 instruction set
architecture which is found for example in the Cortex-A8 processors used in many Android powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and
test the backend we are using a BeagleBoard-xM which has a 1 GHz ARM
Cortex-A8 and 512 MB of RAM running the ARM port of Ubuntu 10.10.
Currently on Linux it is possible to translate and cross-compile PyPy's Python
interpreter as well as other interpreters with the ARM JIT backend enabled
using Scratchbox 2 to provide a build environment and the GNU ARM cross
compilation toolchain. So far the backend only supports the Boehm garbage
collector which does not produce the best results combined with the JIT, but we
plan to add support for the other GCs in the future, doing so should increase
the performance of PyPy on ARM.
While still debugging the last issues with the backend we already can run some
simple benchmarks on Pyrolog, a prolog interpreter written in RPython.
Even using Boehm as the GC the results look very promising. In the benchmarks
we compare Pyrolog to SWI-Prolog, a prolog interpreter written in C, which
is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.

Benchmark

SWI-Prolog in ms.

Pyrolog in ms.

Speedup

iterate

60.0

6.0

10.0

iterate_assert

130.0

6.0

21.67

iterate_call

3310.0

5.0

662.0

iterate_cut

60.0

359.0

0.16713

iterate_exception

4950.0

346.0

14.306

iterate_failure

400.0

127.0

3.1496

iterate_findall

740.0

No res.

iterate_if

140.0

6.0

23.333

The iterate_call benchmark, which constructs a predicate and calls it at
runtime, with a speedup of 662 times over SWI-Prolog is an example where the
JIT can show its strength. The Pyrolog interpreter and the JIT treat
dynamically defined predicates as static ones and can generate optimezed code
in both cases. Whereas SWI only compiles statically defined rules and has to
fall back to interpretation on dynamic ones.
For simple benchmarks running on PyPy's Python intepreter we see some speedups
over CPython, but we still need to debug the backend bit more before we can
show numbers on more complex benchmarks. So, stay tuned.

In the past few months, I have been developing as a part of my master thesis
the ARM backend for the the PyPy JIT, in the arm-backend branch. Currently, it is still work in progress: all integer and object operations are working and
the support for floating point is also under development.
ARM processors are very widely used, beeing deployed in servers, some netbooks
and mainly mobile devices such as phones and tablets. One of our goals is to be
able to run PyPy on phones, specially on Android. Currently is not yet possible
to translate and compile PyPy for Android automatically, but there has been
some work on using Android's NDK to compile PyPy's generated C code.
The JIT Backend targets the application profile of the ARMv7 instruction set
architecture which is found for example in the Cortex-A8 processors used in many Android powered devices and in Apple's A4 processors built into the latest iOS devices. To develop and
test the backend we are using a BeagleBoard-xM which has a 1 GHz ARM
Cortex-A8 and 512 MB of RAM running the ARM port of Ubuntu 10.10.
Currently on Linux it is possible to translate and cross-compile PyPy's Python
interpreter as well as other interpreters with the ARM JIT backend enabled
using Scratchbox 2 to provide a build environment and the GNU ARM cross
compilation toolchain. So far the backend only supports the Boehm garbage
collector which does not produce the best results combined with the JIT, but we
plan to add support for the other GCs in the future, doing so should increase
the performance of PyPy on ARM.
While still debugging the last issues with the backend we already can run some
simple benchmarks on Pyrolog, a prolog interpreter written in RPython.
Even using Boehm as the GC the results look very promising. In the benchmarks
we compare Pyrolog to SWI-Prolog, a prolog interpreter written in C, which
is available from the package repositories for Ubuntu's ARM port.
The benchmarks can be found in the pyrolog-bench repository.

Benchmark

SWI-Prolog in ms.

Pyrolog in ms.

Speedup

iterate

60.0

6.0

10.0

iterate_assert

130.0

6.0

21.67

iterate_call

3310.0

5.0

662.0

iterate_cut

60.0

359.0

0.16713

iterate_exception

4950.0

346.0

14.306

iterate_failure

400.0

127.0

3.1496

iterate_findall

740.0

No res.

iterate_if

140.0

6.0

23.333

The iterate_call benchmark, which constructs a predicate and calls it at
runtime, with a speedup of 662 times over SWI-Prolog is an example where the
JIT can show its strength. The Pyrolog interpreter and the JIT treat
dynamically defined predicates as static ones and can generate optimezed code
in both cases. Whereas SWI only compiles statically defined rules and has to
fall back to interpretation on dynamic ones.
For simple benchmarks running on PyPy's Python intepreter we see some speedups
over CPython, but we still need to debug the backend bit more before we can
show numbers on more complex benchmarks. So, stay tuned.

Friday, January 21, 2011

If you ever considered contributing to PyPy, but never did so far, this is a
good moment to start! :-)

Recently, we merged the fast-forward branch which brings Python 2.7
compatibility, with the plan of releasing a new version of PyPy as soon as all
tests pass.

However, at the moment there are still quite a few of failing tests because
of new 2.7 features that have not been implemented yet: many of them are easy
to fix, and doing it represents a good way to get confidence with the code
base, for those who are interested in it. Michael Foord wrote a little howto
explaining the workflow for running lib-python tests.

Thus, if you are willing to join us in the effort of having a PyPy compatible
with Python 2.7, probably the most sensible option is to come on the #PyPy IRC
channel on Freenode, so we can coordinate each other not to fix the same test
twice.

Moreover, if you are a student and are considering participating in the next
Google Summer of Code this is a good time to get into pypy. You have the
opportunity to get a good understanding of pypy for when you decide what you
would like to work on over the summer.

If you ever considered contributing to PyPy, but never did so far, this is a
good moment to start! :-)

Recently, we merged the fast-forward branch which brings Python 2.7
compatibility, with the plan of releasing a new version of PyPy as soon as all
tests pass.

However, at the moment there are still quite a few of failing tests because
of new 2.7 features that have not been implemented yet: many of them are easy
to fix, and doing it represents a good way to get confidence with the code
base, for those who are interested in it. Michael Foord wrote a little howto
explaining the workflow for running lib-python tests.

Thus, if you are willing to join us in the effort of having a PyPy compatible
with Python 2.7, probably the most sensible option is to come on the #PyPy IRC
channel on Freenode, so we can coordinate each other not to fix the same test
twice.

Moreover, if you are a student and are considering participating in the next
Google Summer of Code this is a good time to get into pypy. You have the
opportunity to get a good understanding of pypy for when you decide what you
would like to work on over the summer.

Tuesday, January 11, 2011

Recently, the jit-unroll-loops branch was merged. It implements the
idea described in
Using Escape Analysis Across Loop Boundaries for Specialization.
That post does only talk about virtuals, but the idea turned out
to be more far reaching. After the metainterpreter produces a trace,
several optimizations are applied to the trace before it is turned
into binary code. Removing allocations is only one of them. There are also
for instance

Heap optimizations that removes memory accesses by reusing results
previously read from or written to the same location.

Reusing of the results of pure operations if the same pure
operation is executed twice.

Removal of redundant guards.

...

A lot of these optimizations are in one way or another removing
operations form the trace and/or reusing previous results. All of these
optimizations could benefit from being able to operate across loop
boundaries. Not only in the sense that operations operating on loop
invariants could be moved out of the loop entirely. But also that
results produced at the end of an iteration could be reused at the
beginning of the next even if there are no loop invariants involved.

This is achieved by unrolling the trace into two iterations, and
letting the optimizer work on this two-iteration-trace.
The optimizer will now be able to optimize the second iteration more than the
first since it can reuse results from the first iteration. The
optimized version of the first iteration we call the preamble and the
optimized version of the second iteration we call the loop. The
preamble will end with a jump to the loop, while the loop will end
with a jump to itself. This means that the preamble will be executed
once for the first iteration, the loop will be executed for all following
iterations.

Sqrt example

Here is an example of a Python implementation of sqrt using a fairly
simple algorithm

defsqrt(y,n=10000):x=y/2whilen>0:n-=1x=(x+y/x)/2returnx

If it is called with sqrt(1234.0),
a fairly long trace is produced. From this trace
the optimizer creates
the
following preamble (Loop 1) and loop (Loop 0)

Looking at the preamble, it starts by making sure that it is not
currently being profiled, the guard
on i5, and that the function object have not been changed
since the trace was made, the guard on p3. Somewhat
intermixed with that, the
integer variable n is unboxed, by making sure p11
points to an integer object and reading out the integer value from
that object.
These operations are not needed in the
loop (and have been removed from it) as emitting the same guards again
would be redundant and n becomes a virtual before the
end of the preamble.

After that the two floating point variables x and y
are unboxed. Again this is only needed in the preamble. Note how the
unboxed value of y, called f23, is passed unchanged
from the preamble to the loop in arguments of the jump
to allow it to be reused. It will not become a virtual
since it is never changed within the loop.

Following that is the actual calculations performed in the loop in
form of floating point operations (since the function was called with
a float argument). These appear in both the loop
and the preamble.

Finally there are some tests checking if a signal was received
(such as when the user presses ctrl-C) and thus should execute some
signal handler or if we need to hand over to another thread. This is
implemented with a counter that is decreased once every iteration. It
will go below zero after some specific number of iterations, tunable by
sys.setcheckinterval. The counter is read from and written to
some global location where it also can be made negative by a C-level
signal handler.

Bridges

When a guard fails often enough, the meta-interpreter is started again
to produce a new trace starting at the failing guard. The tracing is
continued until a previously compiled loop is entered. This could
either be the the same loop that contains the failing guard
or some completely different loop. If it is the same loop, executing
the preamble again maybe be unnecessary.
It is preferable to end the bridge with a jump directly to
the loop. To achieve this the optimizer tries to produce short
preambles that are inlined at the end of bridges allowing
them to jump directly to the loop. Inlining is better than jumping to
a common preamble because most of the inlined short preamble can
typically be removed again by the optimizer.
Creating such a short
preamble is however not always possible. Bridges jumping to loops for which
no short preamble can be generated have to end with a jump to the
full preamble instead.

The short preamble is created by comparing the operations in the
preamble with the operations in the loop. The
operations that are in the preamble but not in the loop
are moved to the short preamble whenever it is safe to move them to
the front of the operations remaining. In other words, the full preamble
is equivalent to the short preamble followed by one iteration of the
loop.

This much has currently been implemented. To give the full picture
here, there are two more features that
hopefully will be implemented in the near future.
The first is to replace the full preamble, used by the interpreter
when it reaches a compiled loop, with the short preamble.
This is currently not done and is probably not as straight forward as
it might first seem. The problem is where to resume interpreting on a
guard failure. However, implementing that should save some
memory. Not only
because the preamble will become smaller, but mainly because the
guards will appear either in the loop or in the preamble, but not
in both (as they do now). That means there will only be a single bridge and
not potentially two copies once the guards are traced.

The sqrt example above would with a short preamble result in a trace
like this

If it is executed long enough, the last guard will be traced to form a
bridge. The trace will inherit the virtuals from its parent. This can
be used to optimize away the part of the inlined short preamble
that deals with virtuals. The resulting bridge should look
something like

Here the first paragraph comes from the traced bridge and the second
is what remains of the short preamble after optimization. The
box p4 is
not a virtual (it contains a pointer to y which is never
changed), and it is only virtuals
that the bridge inherit from it's parents. This is why the last two
operations currently cannot be removed.

Each time the short preamble is inlined, a new copy of each of the
guards in it is generated. Typically the short preamble is inlined in
several places and thus there will be several copies of each of those
guards.
If they fail often enough bridges
from them will be traced (as with all guards). But since there
typically are several copies of each guard the same bridge
will be generated in
several places. To prevent this, mini-bridges from the inlined guards
are produced already during the inlining. These mini-bridges contain
nothing but a jump to the preamble.

The mini-bridges needs the arguments of the preamble to be able
to jump to it. These arguments contain among other things, boxed
versions of the
variables x and y. Those variables are virtuals in
the loop, and have to be allocated. Currently those allocations
are placed in front of the inlined guard. Moving those allocations into
the mini-bridges is the second feature that
hopefully will be implemented in the near future.
After this feature is
implemented, the result should look something like

Multiple specialized versions

Floating point operations were generated in the trace above
because sqrt was called with a float argument. If it is
instead called with an int argument, integer operations will be generated. The
somewhat more complex situations is when both int's and float's are
used as arguments. Then the jit need to generate multiple versions of
the same loop, specialized in different ways. The details, given
below, on how this is achieved is somewhat involved. For the casual
reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float
argument (but with n small enough not to generate the
bridge). Then the trace shown above will be
generated. If sqrt is now called with an int argument, the
guard in the preamble testing that the type of the input object is float
will fail:

guard_nonnull_class(p12, 17652552, descr=<Guard10>)

It will fail every iteration, so soon enough a bridge will be
generated from this guard in the preamble. This guard will end with a
jump to the same loop, and the optimizer will try to inline
the short preamble at the end of it. This will however fail
since now there are two guards on p12. One that makes sure it
is an int and and one that makes sure it is a float. The optimizer
will detect that the second guard will always fail and mark the bridge
as invalid. Invalid loops are not passed on to the backend for
compilation.

If a loop is detected to be invalid while inlining the short preamble,
the metainterpreter will continue to trace for yet another
iteration of the loop. This new trace can be compiled as above and
will produce a new loop with a new preamble that are now specialized
for int arguments instead of float arguments. The bridge that
previously became invalid will now be tried again. This time inlining
the short preamble of the new loop instead. This will produce a set of
traces connected like this

(click for some hairy details)

The height of the boxes is this figure represents how many instructions
they contain (presuming the missing features from the previous section
are implemented). Loop 0 is specialized for floats and it's preamble have
been split into two boxes at the failing guard. Loop 2 is specialized
for ints and is larger than Loop 0. This is mainly because the integer
division in python does not map to the integer division of the
machine, but have to be implemented with several instructions (integer
division in python truncates its result towards minus
infinity, while the the machine integer division truncates towards
0). Also the height of the bridge is about the same as the height of
Loop 2. This is because it contains a full iteration of the loop.

A More Advanced Example

Let's conclude with an example that is a bit more advanced, where this unrolling
approach actually outperforms the previous approach. Consider
making a
fixed-point
implementation of the square root using 16 bit's of decimals. This can be
done using the same implementation
of sqrt but calling it with an object of a class representing
such fixed-point real numbers:

Below is a table comparing the runtime of the sqrt function above with
different argument types on different python interpreters. Pypy 1.4.1
was released before the optimizations described in this post were in place
while they are in place in the
nightly
build from January 5,
denoted pypy in the table. There are also the running time for the same
algorithms implemented in C and compiled with "gcc -O3
-march=native". Tests were executed on a 2.53GHz Intel Core2
processor with n=100000000 iterations.
Comparing the integer versions with C may be considered a
bit unfair because of the more advanced integer division operator in
python. The left part of this table shows runtimes of sqrt in
a program containing a single call to sqrt (i.e. only a single
specialized version of the loop is needed). The right part shows the
runtime of sqrt when it has been called with a different
type of argument before.

First call

Second call

float

int

Fix16

float

int

Fix16

cpython

28.18 s

22.13 s

779.04 s

28.07 s

22.21 s

767.03 s

pypy 1.4.1

1.20 s

6.49 s

11.31 s

1.20 s

6.54 s

11.23 s

pypy

1.20 s

6.44 s

6.78 s

1.19 s

6.26 s

6.79 s

gcc

1.15 s

1.82 s

1.89 s

1.15 s

1.82 s

1.89 s

For this to work in the last case, when Fix16 is the argument type in
the second type,
the trace_limit had to be increased from its default value to prevent
the metainterpreter from aborting while tracing the second version of
the loop. Also sys.setcheckinterval(1000000) were used to prevent the
bridge from being generated. With the bridge the performance of the
last case is significantly worse. Maybe because the optimizer currently
fails to generate a short preamble for it. But the slowdown
seems too big for that to be the only explanation. Below are the runtimes
numbers with checkinterval set to its default value of 100:

First call

Second call

float

int

Fix16

float

int

Fix16

cpython

28.71 s

22.09 s

781.86 s

28.28 s

21.92 s

761.59 s

pypy 1.4.1

1.21 s

6.48 s

11.22 s

1.72 s

7.58 s

12.18 s

pypy

1.21 s

6.27 s

7.22 s

1.20 s

6.29 s

90.47 s

Conclusions

Even though we are seeing speedups in a variety of different small
benchmarks, more complicated examples are not affected much by these
optimizations. It might partly be because larger examples have longer
and more complicated loops, and thus allowing optimizations to operate
across loop boundary will have a smaller relative effect. Another problem is
that with more complicated examples there will be more bridges, and bridges
are currently not handled very well (most of the time all virtuals are
forced at the end of the bridge as explained above). But moving those
forcings into the mini bridges should fix that.

Recently, the jit-unroll-loops branch was merged. It implements the
idea described in
Using Escape Analysis Across Loop Boundaries for Specialization.
That post does only talk about virtuals, but the idea turned out
to be more far reaching. After the metainterpreter produces a trace,
several optimizations are applied to the trace before it is turned
into binary code. Removing allocations is only one of them. There are also
for instance

Heap optimizations that removes memory accesses by reusing results
previously read from or written to the same location.

Reusing of the results of pure operations if the same pure
operation is executed twice.

Removal of redundant guards.

...

A lot of these optimizations are in one way or another removing
operations form the trace and/or reusing previous results. All of these
optimizations could benefit from being able to operate across loop
boundaries. Not only in the sense that operations operating on loop
invariants could be moved out of the loop entirely. But also that
results produced at the end of an iteration could be reused at the
beginning of the next even if there are no loop invariants involved.

This is achieved by unrolling the trace into two iterations, and
letting the optimizer work on this two-iteration-trace.
The optimizer will now be able to optimize the second iteration more than the
first since it can reuse results from the first iteration. The
optimized version of the first iteration we call the preamble and the
optimized version of the second iteration we call the loop. The
preamble will end with a jump to the loop, while the loop will end
with a jump to itself. This means that the preamble will be executed
once for the first iteration, the loop will be executed for all following
iterations.

Sqrt example

Here is an example of a Python implementation of sqrt using a fairly
simple algorithm

defsqrt(y,n=10000):x=y/2whilen>0:n-=1x=(x+y/x)/2returnx

If it is called with sqrt(1234.0),
a fairly long trace is produced. From this trace
the optimizer creates
the
following preamble (Loop 1) and loop (Loop 0)

Looking at the preamble, it starts by making sure that it is not
currently being profiled, the guard
on i5, and that the function object have not been changed
since the trace was made, the guard on p3. Somewhat
intermixed with that, the
integer variable n is unboxed, by making sure p11
points to an integer object and reading out the integer value from
that object.
These operations are not needed in the
loop (and have been removed from it) as emitting the same guards again
would be redundant and n becomes a virtual before the
end of the preamble.

After that the two floating point variables x and y
are unboxed. Again this is only needed in the preamble. Note how the
unboxed value of y, called f23, is passed unchanged
from the preamble to the loop in arguments of the jump
to allow it to be reused. It will not become a virtual
since it is never changed within the loop.

Following that is the actual calculations performed in the loop in
form of floating point operations (since the function was called with
a float argument). These appear in both the loop
and the preamble.

Finally there are some tests checking if a signal was received
(such as when the user presses ctrl-C) and thus should execute some
signal handler or if we need to hand over to another thread. This is
implemented with a counter that is decreased once every iteration. It
will go below zero after some specific number of iterations, tunable by
sys.setcheckinterval. The counter is read from and written to
some global location where it also can be made negative by a C-level
signal handler.

Bridges

When a guard fails often enough, the meta-interpreter is started again
to produce a new trace starting at the failing guard. The tracing is
continued until a previously compiled loop is entered. This could
either be the the same loop that contains the failing guard
or some completely different loop. If it is the same loop, executing
the preamble again maybe be unnecessary.
It is preferable to end the bridge with a jump directly to
the loop. To achieve this the optimizer tries to produce short
preambles that are inlined at the end of bridges allowing
them to jump directly to the loop. Inlining is better than jumping to
a common preamble because most of the inlined short preamble can
typically be removed again by the optimizer.
Creating such a short
preamble is however not always possible. Bridges jumping to loops for which
no short preamble can be generated have to end with a jump to the
full preamble instead.

The short preamble is created by comparing the operations in the
preamble with the operations in the loop. The
operations that are in the preamble but not in the loop
are moved to the short preamble whenever it is safe to move them to
the front of the operations remaining. In other words, the full preamble
is equivalent to the short preamble followed by one iteration of the
loop.

This much has currently been implemented. To give the full picture
here, there are two more features that
hopefully will be implemented in the near future.
The first is to replace the full preamble, used by the interpreter
when it reaches a compiled loop, with the short preamble.
This is currently not done and is probably not as straight forward as
it might first seem. The problem is where to resume interpreting on a
guard failure. However, implementing that should save some
memory. Not only
because the preamble will become smaller, but mainly because the
guards will appear either in the loop or in the preamble, but not
in both (as they do now). That means there will only be a single bridge and
not potentially two copies once the guards are traced.

The sqrt example above would with a short preamble result in a trace
like this

If it is executed long enough, the last guard will be traced to form a
bridge. The trace will inherit the virtuals from its parent. This can
be used to optimize away the part of the inlined short preamble
that deals with virtuals. The resulting bridge should look
something like

Here the first paragraph comes from the traced bridge and the second
is what remains of the short preamble after optimization. The
box p4 is
not a virtual (it contains a pointer to y which is never
changed), and it is only virtuals
that the bridge inherit from it's parents. This is why the last two
operations currently cannot be removed.

Each time the short preamble is inlined, a new copy of each of the
guards in it is generated. Typically the short preamble is inlined in
several places and thus there will be several copies of each of those
guards.
If they fail often enough bridges
from them will be traced (as with all guards). But since there
typically are several copies of each guard the same bridge
will be generated in
several places. To prevent this, mini-bridges from the inlined guards
are produced already during the inlining. These mini-bridges contain
nothing but a jump to the preamble.

The mini-bridges needs the arguments of the preamble to be able
to jump to it. These arguments contain among other things, boxed
versions of the
variables x and y. Those variables are virtuals in
the loop, and have to be allocated. Currently those allocations
are placed in front of the inlined guard. Moving those allocations into
the mini-bridges is the second feature that
hopefully will be implemented in the near future.
After this feature is
implemented, the result should look something like

Multiple specialized versions

Floating point operations were generated in the trace above
because sqrt was called with a float argument. If it is
instead called with an int argument, integer operations will be generated. The
somewhat more complex situations is when both int's and float's are
used as arguments. Then the jit need to generate multiple versions of
the same loop, specialized in different ways. The details, given
below, on how this is achieved is somewhat involved. For the casual
reader it would make perfect sense to skip to the next section here.

Consider the case when sqrt is first called with a float
argument (but with n small enough not to generate the
bridge). Then the trace shown above will be
generated. If sqrt is now called with an int argument, the
guard in the preamble testing that the type of the input object is float
will fail:

guard_nonnull_class(p12, 17652552, descr=<Guard10>)

It will fail every iteration, so soon enough a bridge will be
generated from this guard in the preamble. This guard will end with a
jump to the same loop, and the optimizer will try to inline
the short preamble at the end of it. This will however fail
since now there are two guards on p12. One that makes sure it
is an int and and one that makes sure it is a float. The optimizer
will detect that the second guard will always fail and mark the bridge
as invalid. Invalid loops are not passed on to the backend for
compilation.

If a loop is detected to be invalid while inlining the short preamble,
the metainterpreter will continue to trace for yet another
iteration of the loop. This new trace can be compiled as above and
will produce a new loop with a new preamble that are now specialized
for int arguments instead of float arguments. The bridge that
previously became invalid will now be tried again. This time inlining
the short preamble of the new loop instead. This will produce a set of
traces connected like this

(click for some hairy details)

The height of the boxes is this figure represents how many instructions
they contain (presuming the missing features from the previous section
are implemented). Loop 0 is specialized for floats and it's preamble have
been split into two boxes at the failing guard. Loop 2 is specialized
for ints and is larger than Loop 0. This is mainly because the integer
division in python does not map to the integer division of the
machine, but have to be implemented with several instructions (integer
division in python truncates its result towards minus
infinity, while the the machine integer division truncates towards
0). Also the height of the bridge is about the same as the height of
Loop 2. This is because it contains a full iteration of the loop.

A More Advanced Example

Let's conclude with an example that is a bit more advanced, where this unrolling
approach actually outperforms the previous approach. Consider
making a
fixed-point
implementation of the square root using 16 bit's of decimals. This can be
done using the same implementation
of sqrt but calling it with an object of a class representing
such fixed-point real numbers:

Below is a table comparing the runtime of the sqrt function above with
different argument types on different python interpreters. Pypy 1.4.1
was released before the optimizations described in this post were in place
while they are in place in the
nightly
build from January 5,
denoted pypy in the table. There are also the running time for the same
algorithms implemented in C and compiled with "gcc -O3
-march=native". Tests were executed on a 2.53GHz Intel Core2
processor with n=100000000 iterations.
Comparing the integer versions with C may be considered a
bit unfair because of the more advanced integer division operator in
python. The left part of this table shows runtimes of sqrt in
a program containing a single call to sqrt (i.e. only a single
specialized version of the loop is needed). The right part shows the
runtime of sqrt when it has been called with a different
type of argument before.

First call

Second call

float

int

Fix16

float

int

Fix16

cpython

28.18 s

22.13 s

779.04 s

28.07 s

22.21 s

767.03 s

pypy 1.4.1

1.20 s

6.49 s

11.31 s

1.20 s

6.54 s

11.23 s

pypy

1.20 s

6.44 s

6.78 s

1.19 s

6.26 s

6.79 s

gcc

1.15 s

1.82 s

1.89 s

1.15 s

1.82 s

1.89 s

For this to work in the last case, when Fix16 is the argument type in
the second type,
the trace_limit had to be increased from its default value to prevent
the metainterpreter from aborting while tracing the second version of
the loop. Also sys.setcheckinterval(1000000) were used to prevent the
bridge from being generated. With the bridge the performance of the
last case is significantly worse. Maybe because the optimizer currently
fails to generate a short preamble for it. But the slowdown
seems too big for that to be the only explanation. Below are the runtimes
numbers with checkinterval set to its default value of 100:

First call

Second call

float

int

Fix16

float

int

Fix16

cpython

28.71 s

22.09 s

781.86 s

28.28 s

21.92 s

761.59 s

pypy 1.4.1

1.21 s

6.48 s

11.22 s

1.72 s

7.58 s

12.18 s

pypy

1.21 s

6.27 s

7.22 s

1.20 s

6.29 s

90.47 s

Conclusions

Even though we are seeing speedups in a variety of different small
benchmarks, more complicated examples are not affected much by these
optimizations. It might partly be because larger examples have longer
and more complicated loops, and thus allowing optimizations to operate
across loop boundary will have a smaller relative effect. Another problem is
that with more complicated examples there will be more bridges, and bridges
are currently not handled very well (most of the time all virtuals are
forced at the end of the bridge as explained above). But moving those
forcings into the mini bridges should fix that.