Monday, November 30, 2009

If you have ever wanted to use CPython extension modules on PyPy,
we want to announce that there is a solution that should be compatible
with quite a few of the available modules. It is neither new nor written
by us, but it nevertheless works great with PyPy.

The trick is to use RPyC, a transparent, symmetric remote procedure
call library written in Python. The idea is to start a
CPython process that hosts the PyQt libraries
and connect to it via TCP to send RPC commands to it.
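
As a rough illustration of the idea, the client side might look like the sketch below. It assumes RPyC 3.x's classic mode (`rpyc.classic.connect` and the connection's `modules` namespace) and a classic server already running under CPython on the default port 18812. The helper name `remote_module` is made up for this example, and this is not the code that PyPy's proxy modules actually contain.

```python
def remote_module(name, host="localhost", port=18812):
    """Return a proxy for module `name` living in a remote CPython process.

    Assumes an RPyC classic server is listening on host:port.
    """
    import rpyc  # imported lazily so the sketch can be loaded without RPyC

    conn = rpyc.classic.connect(host, port)
    # Item access imports the (possibly dotted) module on the server side
    # and hands back a transparent proxy object.
    return conn.modules[name]
```

With a server running, `remote_module("PyQt4.QtGui").QApplication` would then refer to the class inside the CPython process, and every attribute access or call on it would travel over TCP.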

I tried to run PyQt applications
using it on PyPy and could get quite a bit of the functionality of these
working. Remaining problems include regular segfaults of CPython
because of PyQt-induced memory corruption and bugs because classes
like StandardButtons behave incorrectly when it comes to arithmetical operations.

RPyC had to be changed to support remote unbound __init__ methods,
shallow call by value for list and dict types (PyQt4 methods want real lists and dicts
as parameters), and callbacks to methods (all remote method objects are wrapped into
small lambda functions to ease the call for PyQt4).
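
To make two of these changes more concrete, here is a toy sketch in plain Python. It does not use RPyC and is not the patch itself: `RemoteList`, `pass_by_value` and `wrap_callback` are hypothetical names that only mimic what "shallow call by value" and "wrapping methods into lambdas" mean.

```python
class RemoteList:
    """Stand-in for a proxy to a list that lives in another process."""

    def __init__(self, items):
        self._items = items

    def __iter__(self):
        return iter(self._items)


def pass_by_value(arg):
    """Shallow copy: materialize the container, leave its elements alone."""
    if isinstance(arg, RemoteList):
        return list(arg)  # a real list, but elements may still be proxies
    return arg


def wrap_callback(method):
    """Hide a (remote) bound method behind a plain function."""
    return lambda *args, **kwargs: method(*args, **kwargs)
```

The copy is shallow on purpose: only the outermost container is turned into a real list, which is all that a method expecting a list as its parameter needs.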

If you want to try RPyC to run the PyQt application of your choice, you just
need to follow these steps. Please report your experience here in the blog
comments or on our mailing list.

Download this patch and apply it to RPyC by running
patch -p1 < rpyc-3.0.7-pyqt4-compat.patch in the RPyC directory.

Install RPyC by running python setup.py install as root.

Run the file rpyc/servers/classic_server.py using CPython.

Execute your PyQt application on PyPy.

PyPy will automatically connect to CPython and use its PyQt libraries.

Note that this scheme works with nearly every extension library. Look
at pypy/lib/sip.py to see how to add new libraries (you need to create
such a file for every proxied extension module).

Have fun with PyQt

Alexander Schremmer

Wednesday, November 18, 2009

Recently, thanks to the surprisingly helpful Unhelpful, also known as Andrew Mahone,
we have a decent, if slightly arbitrary, set of performance graphs.
It contains a couple of benchmarks already
seen on this blog as well as some taken from The Great Computer
Language Benchmarks Game. These benchmarks don't even try to represent "real applications"
as they're mostly small algorithmic benchmarks. Interpreters used:

PyPy trunk, revision 69331 with --translation-backendopt-storesink, which is
now on by default

Unladen swallow trunk, r900

CPython 2.6.2 release

Here are the graphs; the benchmarks and the runner script are available

And zoomed in for all benchmarks except binary-trees and fannkuch.

As we can see, PyPy is generally somewhere between the same speed
as CPython and 50x faster (f1int). The places where we're the same
speed as CPython are places where we know we have problems - for example generators are
not sped up by the JIT and they require some work (although not as much by far
as generators & Psyco :-). The glaring inefficiency is in the regex-dna benchmark.
This one clearly demonstrates that our regular expression engine is really,
really bad and urgently requires attention.

The cool thing here is that although these benchmarks might not represent
typical Python applications, they're not uninteresting. They show
that algorithmic code does not need to be far slower in Python than in C,
so using PyPy one need not worry about algorithmic code being dramatically
slow. As many readers would agree, that kills yet another usage of C in our
lives :-)

Cheers,
fijal

Friday, November 13, 2009

While the Düsseldorf sprint is winding down, we put our minds to the task of retelling
our accomplishments. The sprint was mostly about improving the JIT and we
managed to stick to that task (as much as we managed to stick to anything). The
sprint was mostly filled with doing many small things.

Inlining

Carl Friedrich and Samuele started the sprint trying to tame the JIT's inlining.
Until now, the JIT would try to inline everything in a loop (except other loops)
which is what most tracing JITs actually do. This works great if the resulting
trace is of reasonable length, but if not it would result in excessive memory
consumption and code cache problems in the CPU. So far we just had a limit on
the trace size, and we would abort tracing when the limit was reached. This
would happen again and again for the same loop, which is not useful at all. The
new approach introduced is to be more clever when tracing is aborted by marking
the function with the largest contribution to the trace size as non-inlinable. The
next time this loop is traced, it usually then gives a reasonably sized trace.

This gives a problem because now some functions that don't contain loops are not
inlined, which means they never get assembler code for them generated. To remedy
this problem we also make it possible to trace functions from their start (as
opposed to just tracing loops). We do that only for functions that cannot be
inlined (either because they contain loops or they were marked as
non-inlinable as described above).

The result of this is that the Python version of the telco decimal benchmark runs
to completion without having to arbitrarily increase the trace length limit.
It's also about 40% faster than running it on CPython. This is one of the first
non-tiny programs that we speed up.
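
The abort-and-retry policy described in this section can be modelled in a few lines. This is only a toy simulation, not the JIT's code: a "trace" is reduced to a list of `(function, number_of_operations)` pairs, and `trace_loop` and the limit of 10 are invented for illustration.

```python
from collections import Counter


def trace_loop(calls, non_inlinable, limit=10):
    """Simulate tracing one loop iteration.

    calls: (function_name, operation_count) pairs met while tracing.
    non_inlinable: set of function names, updated in place on abort.
    Returns the trace length, or None if tracing was aborted.
    """
    length = 0
    contributions = Counter()
    for func, size in calls:
        if func in non_inlinable:
            length += 1  # not inlined: a single residual call operation
        else:
            length += size  # inlined: all of its operations enter the trace
            contributions[func] += size
        if length > limit:
            if contributions:
                # Blame the function that contributed most to the trace.
                worst = contributions.most_common(1)[0][0]
                non_inlinable.add(worst)
            return None
    return length
```

Tracing `[("small", 2), ("huge", 20), ("small", 2)]` with an empty set aborts once and marks `huge`; tracing the same loop again then succeeds with a short trace, because `huge` now costs only one call operation.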

Reducing GC Pressure

Armin and Anto used some GC instrumentation to find places in pypy-c-jit
that allocate a lot of memory. This is an endlessly surprising exercise, as
usually we don't care too much about allocations of short-lived objects when
writing RPython, as our GCs usually deal well with those. They found a few
places where they could remove allocations, most importantly by making one of
the classes that make up traces smaller.

Optimizing Chains of Guards

Carl Friedrich and Samuele started a simple optimization on the trace level that
removes superfluous guards. A common pattern in a trace is to have stronger
and stronger guards about the same object. As an example, often there is first a
guard that an object is not None, later followed by a guard that it is exactly
of a given class and then even later that it is a precise instance of that
class. This is inefficient, as we can just check the most precise thing in the
place of the first guard, saving us guards (which take memory, as they need resume data).
Maciek, Armin and Anto later improved on that by introducing a new guard that
checks for non-nullity and a specific class in one guard, which allows us to
collapse more chains.
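
A simplified sketch of the first version of this optimization is shown below. Traces are modelled as lists of `(operation, variable)` pairs, and guard arguments such as the expected class are left out. The guard names and the `STRENGTH` table are illustrative assumptions, not a description of the real trace format.

```python
# Higher number = stronger guard (it implies all the weaker ones).
STRENGTH = {"guard_nonnull": 1, "guard_class": 2, "guard_value": 3}


def collapse_guards(trace):
    """Keep one guard per variable: the strongest, at the first guard's place."""
    strongest = {}
    for op, var in trace:
        if op in STRENGTH:
            best = strongest.get(var)
            if best is None or STRENGTH[op] > STRENGTH[best]:
                strongest[var] = op

    result = []
    guarded = set()
    for op, var in trace:
        if op not in STRENGTH:
            result.append((op, var))
        elif var not in guarded:
            guarded.add(var)
            result.append((strongest[var], var))
        # later guards on an already guarded variable are dropped
    return result
```

For example, a trace that checks `p0` for non-nullity, then for its class, then for an exact value ends up with a single value guard where the non-null guard used to be.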

Improving JIT and Exceptions

Armin and Maciek went on a multi-day quest to make the JIT and Python-level
exceptions like each other more. So far, raising and catching exceptions would
make the JIT generate code that has a certain amusement value, but is not really
fast in any way. To improve the situation, they had to dig into the exception
support in the Python interpreter, where they found various inefficiencies. They
also had to rewrite the exceptions module to be in RPython (as opposed to
just pure Python + an old hack). Another problem is that tracebacks give you
access to interpreter frames. This forces the JIT to deoptimize things, as
the JIT keeps some of the frame's content in CPU registers or on the CPU stack,
which reflective access to frames prevents.
Currently we try to improve the simple cases where the traceback is never
actually accessed. This work is not completely finished, but some cases are
already significantly faster.
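
To see why tracebacks are a problem, here is a small standard-Python example of the reflective access in question. The function names are made up; the point is only that a handler can reach the local variables of the frame that raised.

```python
import sys


def failing():
    secret = 42  # a local that a JIT would like to keep in a register
    raise ValueError("boom")


def locals_at_failure():
    """Return the local variables of the innermost frame of the traceback."""
    try:
        failing()
    except ValueError:
        tb = sys.exc_info()[2]
        while tb.tb_next is not None:  # walk to the frame that raised
            tb = tb.tb_next
        return dict(tb.tb_frame.f_locals)
```

Since any `except` block may do this, the frame has to be reconstructible whenever the traceback is actually looked at, which is why the simple case where it never is looked at is the one worth making fast first.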

Moving PyPy to use py.test 1.1

Holger worked on porting PyPy to use the newly released py.test 1.1. PyPy
still uses some very old support code in its testing infrastructure, which makes
this task a bit annoying. He also gave the other PyPy developers a demo of some
of the newer py.test features and we discussed which of them we want to start
using to improve our tests to make them shorter and clearer. One of the things
we want to do eventually is to have fewer skipped tests than now.

Using a Simple Effect Analysis for the JIT

One of the optimizations the JIT does is caching fields that are read out of
structures on the heap. This cache needs to be invalidated at some points, for
example when such a field is written to (as we don't track aliasing much).
Another case is a call in the assembler, as the target function could
arbitrarily change the heap. This of course is imprecise, since most functions
don't actually change the whole heap, and we have an analysis that finds out
which types of structs and arrays a function can mutate. During the
sprint Carl Friedrich and Samuele integrated this analysis with the JIT, to help
it invalidate caches less aggressively. Later Anto and Carl Friedrich also
ported this support to the CLI version of the JIT.
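
The following toy model shows the difference the effect information makes. It is a sketch under simplifying assumptions: the cache is keyed by field name rather than by struct and array types, and `HeapCache` and its `may_write` parameter are hypothetical names.

```python
class HeapCache:
    """Toy cache of field reads, keyed by (object id, field name)."""

    def __init__(self):
        self._values = {}

    def getfield(self, obj, field):
        key = (id(obj), field)
        if key not in self._values:
            self._values[key] = getattr(obj, field)  # the "real" heap read
        return self._values[key]

    def setfield(self, obj, field, value):
        setattr(obj, field, value)
        # No alias tracking: forget this field on every object.
        self._drop({field})
        self._values[(id(obj), field)] = value

    def call(self, may_write=None):
        """Account for a call.

        may_write is the set of fields the callee can mutate, or None if
        nothing is known about it (then everything is invalidated).
        """
        if may_write is None:
            self._values.clear()
        else:
            self._drop(may_write)

    def _drop(self, fields):
        self._values = {
            key: value
            for key, value in self._values.items()
            if key[1] not in fields
        }
```

A call known to write only `x` leaves cached reads of `y` intact, whereas a call with unknown effects empties the cache.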

Miscellaneous

Samuele (with some assistance from Carl Friedrich) set up a buildbot slave on a
Mac Mini at the University. This should let us stabilize on Mac OS X. So far
we still have a number of failing tests, but now we are in a situation to
sanely approach fixing them.

The guinea-pigs that were put into Carl Friedrich's care have been fed (which
was the most important sprint task anyway).

Samuele & Carl Friedrich

Friday, November 6, 2009

The Düsseldorf sprint starts today. Only Samuele and I are here so far, but that should change over the course of the day. We will mostly work on the JIT during this sprint, trying to make it a lot more practical. For that we need to decrease its memory requirements some more and make its inlining less aggressive. We will post more as the sprint progresses.

Tuesday, November 3, 2009

It's maybe a bit late to announce, but there will be a PyPy talk
at the Rupy conference this weekend in
Poznan. Specifically, I'll be talking mostly about PyPy's JIT and
how to use it. Unfortunately the talk is on Saturday, at 8:30 in the morning.

Sunday, November 1, 2009

This week I worked on improving the system we use for logging. Well, it was not really a "system" but rather a pile of hacks to measure timings and counts in custom ways and display them. So now, we have a system :-)

The system in question was integrated in the code for the GC and the JIT, which are two independent components as far as the source is concerned. However, we can now display a unified view. Here is for example pypy-c-jit running pystone for (only) 5000 iterations:

The top long bar represents time. The bottom shows two summaries of the total time taken by the various components, and also plays the role of a legend to understand the colors at the top. Shades of red are the GC, shades of green are the JIT.

Here is another picture, this time on pypy-c-jit running 10 iterations of richards:

We have to look more closely at various examples, but a few things immediately show up. One thing is that the GC is put under large pressure by the jit-tracing, jit-optimize and (to a lesser extent) the jit-backend components. So large in fact that the GC takes at least 60-70% of the time there. We will have to do something about it at some point. The other thing is that on richards (and it's likely generally the case), the jit-blackhole component takes a lot of time. "Blackholing" is the operation of recovering from a guard failure in the generated assembler, and falling back to the interpreter. So this is also something we will need to improve.
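
For readers curious what a unified log of this kind boils down to, here is a minimal sketch in plain Python. It is not the mechanism used inside PyPy: `SectionLog` is a made-up name, and the category strings merely echo the component names mentioned above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionLog:
    """Record (category, start, stop) for possibly nested sections."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.events = []

    @contextmanager
    def section(self, category):
        start = self._clock()
        try:
            yield
        finally:
            self.events.append((category, start, self._clock()))

    def totals(self):
        """Total time per category; nested time counts for both levels."""
        totals = defaultdict(float)
        for category, start, stop in self.events:
            totals[category] += stop - start
        return dict(totals)
```

Wrapping GC work in `log.section("gc-minor")` inside a surrounding `log.section("jit-tracing")` is enough to see how much of the tracing time is really spent collecting, which is the kind of observation made above.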
