Tuesday, November 9, 2010

A snake which bites its tail: PyPy JITting itself

We have to admit: even if we have been writing for years about the fantastic
speedups that the PyPy JIT gives, we, the PyPy developers, still don't use it
for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves
surely know that the process takes quite a long time to complete: about an hour
on super-fast hardware and even more on average computers. Unfortunately,
translate.py turned out to be a bad match for our JIT and thus ran much
slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use
of custom metaclasses, and until a few weeks ago metaclasses disabled some of
the central optimizations which make PyPy so fast. During the recent
Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and
re-enabled all the optimizations even in the presence of metaclasses.
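To make the pattern concrete, here is a minimal sketch of the kind of code we are talking about: ordinary classes whose type is a custom metaclass, used in a hot loop. The names Registry, Node, Leaf and total are invented for this illustration and are not taken from the translation toolchain, and the sketch uses Python 3 syntax for readability.

```python
class Registry(type):
    """Hypothetical metaclass that records the name of every class it creates."""
    classes = []

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.classes.append(name)
        return cls


class Node(metaclass=Registry):
    def __init__(self, value):
        self.value = value

    def double(self):
        return self.value * 2


class Leaf(Node):
    # Subclasses inherit the metaclass, so Leaf is also created by Registry.
    pass


def total(nodes):
    # Attribute lookups and method calls on instances whose type
    # has a custom metaclass: the case described in the paragraph above.
    return sum(n.double() for n in nodes)


print(Registry.classes)
print(total([Leaf(1), Node(2)]))
```

Nothing in the loop itself touches the metaclass; it only matters at class-creation time, which is why it is reasonable for the JIT to optimize such instances like any others.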

So, today we decided that it was time to benchmark PyPy against itself again.
First, we tried to translate PyPy using CPython as usual, with the following
command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and
12 GB of RAM, running a 32-bit Ubuntu):

Yes, it's not a typo: PyPy is almost two times faster than CPython!
Moreover, we can see that PyPy is faster in each of the individual steps apart
from compile_c, which consists of just a call to make to invoke gcc.
The slowdown comes from the fact that the Makefile also contains a lot of
calls to the trackgcroot.py script, which happens to perform badly on PyPy,
but we have not yet investigated why.

However, there is also a drawback: on this specific benchmark, PyPy consumes
much more memory than CPython. The reason why the command line above contains
--no-allworkingmodules is that if we include all the modules, the
translation crashes when it is 99% complete because it consumes all of the 4 GB
of memory that is addressable by a 32-bit process.
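The 4 GB figure is simply the size of a 32-bit address space. A quick back-of-the-envelope check, with variable names chosen only for this illustration (in practice the operating system may leave a process even less than this upper bound):

```python
ADDRESS_BITS = 32

# A 32-bit pointer can distinguish 2**32 different byte addresses.
addressable_bytes = 2 ** ADDRESS_BITS

# Convert to GiB (1 GiB = 2**30 bytes).
addressable_gib = addressable_bytes / 2 ** 30

print(addressable_bytes, addressable_gib)
```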

A partial explanation is that so far the assembler generated by the PyPy JIT
is immortal, and the memory allocated for it is never reclaimed. This is
clearly bad for a program like translate.py, which is divided into several
independent steps, and for which most of the code generated in each step could
safely be thrown away when the step is completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and
thus translating with all working modules is no longer an issue. This is the
time taken with CPython (note that it does not make sense to compare with the
32-bit CPython translation above, because that one does not include all the
modules):

The results are comparable with the 32-bit case: PyPy is still almost 2 times
faster than CPython. And it also shows that our 64-bit JIT backend is as good
as the 32-bit one. Again, the drawback is in the consumed memory: CPython
used 2.3 GB while PyPy took 8.3 GB.
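To put the two memory figures side by side, here is a small calculation using only the numbers quoted above (2.3 GB, 8.3 GB and the 12 GB of RAM on the machine); the variable names are made up for this sketch.

```python
cpython_gb = 2.3   # memory used by the 64-bit CPython translation
pypy_gb = 8.3      # memory used by the 64-bit PyPy translation
ram_gb = 12        # physical RAM of the benchmark machine

# How many times more memory PyPy needs than CPython on this run.
ratio = pypy_gb / cpython_gb

# Whether the PyPy run still fits in physical memory.
fits_in_ram = pypy_gb <= ram_gb

print(round(ratio, 1), fits_in_ram)
```

So on this run PyPy buys its roughly 2x speedup with about 3.6 times the memory, which still fits on this machine but would not fit in a 32-bit process.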

Overall, the results are impressive: we knew that PyPy can be good at
optimizing small benchmarks and even medium-sized programs, but as far as we
know this is the first example in which it heavily optimizes a huge, real-world
application. And, believe us, the PyPy translation toolchain is complex
enough to contain all kinds of dirty tricks and black magic that make Python
lovable and hard to optimize :-).

For reference, at some point (long ago) I tried to use Psyco to speed up translate.py on CPython, but it didn't make any difference -- I'm guessing it's because we have nested scope variables at a few critical points, which Psyco cannot optimize. Now I no longer have a need for that :-)

Very cool achievement. I'm curious, however, to know why the compile_c section is slower. I thought it was mostly waiting on external programs to run and so should have taken a similar time to CPython? Congratulations!

@cfbolz Well, but you sure can run the 64bit version with the same module list as you did for 32bit... So if running the benchmark again in the same conditions isn't a lot of work, it'd provide yet another interesting data point ;)

Yes, I think the word you wanted was "uses" instead of "leaks". The latter implies unforeseen problems and errors, the former implies that memory usage hasn't been addressed yet... Just to reiterate - PyPy currently *uses* more memory than CPython.