Wednesday, March 26, 2014

The Raspberry Pi aims to be a low-cost educational tool that anyone can use to learn about electronics and programming. Python and pygame are included in the Pi's programming toolkit. And since last year, thanks in part to sponsorship from the Raspberry Pi Foundation, PyPy also works on the Pi (read more here).

With PyPy working on the Pi, game logic written in Python stands to gain an awesome performance boost. However, the original pygame is a Python C extension. This means it performs poorly on PyPy and negates any speedup in the Python parts of the game code.

One solution to making pygame games run faster on PyPy, and eventually on the Raspberry Pi, comes in the form of pygame_cffi. pygame_cffi uses CFFI to wrap the underlying SDL library instead of a C extension. A few months ago, the Raspberry Pi Foundation sponsored a Cape Town Python User Group hackathon to build a proof-of-concept pygame using CFFI. This hackathon was a success and it produced an early working version of pygame_cffi.
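To give a feel for what wrapping a C library with CFFI looks like, here is a minimal sketch. It is not taken from pygame_cffi: to stay runnable without SDL it wraps `strlen` from the standard C library instead, and it assumes the third-party `cffi` package is installed on a POSIX system. The helper name `c_strlen` is made up for illustration.

```python
from cffi import FFI  # third-party package, assumed to be installed

ffi = FFI()

# Declare the C function we want to call, using plain C syntax.
# A wrapper for a library like SDL would declare many functions and structs here.
ffi.cdef("size_t strlen(const char *s);")

# dlopen(None) opens the standard C library on POSIX systems (assumption:
# this sketch is not expected to work on Windows).
libc = ffi.dlopen(None)


def c_strlen(data):
    """Call the C strlen() on a bytes object and return its length."""
    return libc.strlen(data)


print(c_strlen(b"pygame"))
```

The calling code stays ordinary Python, which is the part PyPy's JIT can speed up.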

So for the last 5 weeks Raspberry Pi has been funding work on pygame_cffi. The goal was a complete implementation of the core modules. We also wanted benchmarks to illuminate performance differences between pygame_cffi on PyPy and pygame on CPython. We are happy to report that those goals were met. So without further ado, here's a rundown of what works.

Current functionality

Surfaces support all the usual flags for SDL and OpenGL rendering (more about OpenGL below).

With the above-mentioned functionality in place we could get 10+ of the pygame examples to work, and a number of PyWeek games. At the time of writing, if a game doesn't work it is most likely due to an unimplemented transform or draw function. That will be remedied soon.

Performance

In terms of performance, pygame_cffi on PyPy is showing a lot of promise. On x86 it beats pygame on CPython by a significant margin in our events processing and collision detection benchmarks, while blit and fill benchmarks perform similarly. The pygame examples we checked also perform better.
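As a rough idea of the kind of work an N² collision detection benchmark exercises, here is a minimal pure-Python sketch. It is not the benchmark code behind the numbers in this post; the rectangle representation and the function names are assumptions made for illustration.

```python
import random


def rects_overlap(a, b):
    """Axis-aligned overlap test for two (x, y, width, height) tuples."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def count_collisions(rects):
    """Test every pair once: N * (N - 1) / 2 checks, so the cost grows as N squared."""
    hits = 0
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if rects_overlap(rects[i], rects[j]):
                hits += 1
    return hits


# 100 hypothetical 32x32 sprites scattered over a 640x480 area.
random.seed(0)
sprites = [(random.randint(0, 640), random.randint(0, 480), 32, 32)
           for _ in range(100)]
print(count_collisions(sprites))
```

A loop like this is almost entirely Python-level arithmetic, which is the sort of code a JIT tends to handle well.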

However, there is still work to be done to identify and eliminate bottlenecks. On the Raspberry Pi, pygame_cffi on PyPy is markedly slower than pygame on CPython (barring collision detection). The PyWeek games we tested also performed slightly worse. Fortunately there is room for improvement in various places.

[Figures: Invention & Mutable Mamba (x86); Standard pygame examples (Raspberry Pi)]

Here's a summary of some of the benchmarks. Relative speed refers to the frame rate obtained in pygame_cffi on PyPy relative to pygame on CPython.

Benchmark                                       Relative speed (PyPy speedup)
----------------------------------------------  -----------------------------
Events (x86)                                    1.41
Events (Pi)                                     0.58
N² collision detection on 100 sprites (x86)     4.14
N² collision detection on 100 sprites (Pi)      1.01
Blit 100 surfaces (x86)                         1.06
Blit 100 surfaces (Pi)                          0.60
Invention (x86)                                 0.95
Mutable Mamba (x86)                             0.72
stars example (x86)                             1.95
stars example (Pi)                              0.84
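
To make the "relative speed" column concrete, here is a small sketch of the ratio as defined above. The function name and the frame rates are hypothetical and are not measurements from these benchmarks.

```python
def relative_speed(fps_pygame_cffi_pypy, fps_pygame_cpython):
    """Frame rate of pygame_cffi on PyPy divided by that of pygame on CPython."""
    return fps_pygame_cffi_pypy / float(fps_pygame_cpython)


# Hypothetical frame rates, chosen only to illustrate the arithmetic.
speed = relative_speed(45.0, 60.0)
print(round(speed, 2))
```

A value above 1 means pygame_cffi on PyPy reached a higher frame rate; a value below 1 means it was slower.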

OpenGL

Some not-so-great news is that PyOpenGL performs poorly on PyPy since PyOpenGL uses ctypes. This translates into a nasty reduction in frame rate for games that use OpenGL surfaces. It might be worthwhile creating a CFFI-powered version of PyOpenGL as well.

Where to now?

Work on pygame_cffi is ongoing. Here are some things that are in the pipeline:

Get pygame_cffi on PyPy to a place where it is consistently faster than pygame on CPython.

Implement the remaining modules and functions, starting with draw and transform.

Improve test coverage.

Reduce the time it takes for CFFI to parse the cdef, which currently makes the initial pygame import slow.

If you want to contribute you can find pygame_cffi on GitHub.
Feel free to find us on #pypy on freenode or post issues on GitHub.

Cheers,
Rizmari Versfeld

Saturday, March 15, 2014

Here is one of the first full PyPy builds
(edit: it was r69967+, but the general list of versions is currently here)
compiled with the new StmGC-c7
library. It has no JIT so far, but it runs some small
single-threaded benchmarks by taking around 40% more time than a
corresponding non-STM, no-JIT version of PyPy. It scales --- up to two
threads only, which is the hard-coded maximum so far in the c7 code.
But the scaling looks perfect in these small benchmarks without
conflict: starting two threads each running a copy of the benchmark
takes almost exactly the same amount of total time, simply using two
cores.
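
As a back-of-the-envelope reading of these figures, the sketch below normalises the non-STM, no-JIT run time to 1.0. It assumes, for illustration only, that the non-STM build would run the two copies one after the other; all variable names are made up.

```python
# Time for one copy of the benchmark on the non-STM, no-JIT PyPy (normalised).
baseline_time = 1.0

# The STM build takes around 40% more time on one thread.
stm_overhead = 0.40
stm_time_one_copy = baseline_time * (1.0 + stm_overhead)

# Two threads, each running its own copy, take about the same wall-clock time.
stm_time_two_copies = stm_time_one_copy

# Assumption: without STM the two copies effectively run back to back.
serial_time_two_copies = 2 * baseline_time

relative_throughput = serial_time_two_copies / stm_time_two_copies
print("%.2f" % relative_throughput)
```

Under those assumptions the two-thread STM run gets through the same work in roughly 70% of the time, despite the single-thread overhead.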

Feel free to try it! It is not actually useful so far, because it is
limited to two cores and CPython is something like 2.5x faster. One of
the important next steps is to re-enable the JIT. Based on our current
understanding of the "40%" figure, we can probably reduce it with
enough effort; but also, the JIT should be able to easily produce
machine code that suffers a bit less than the interpreter from these
effects. This seems to mean that we're looking at 20%-ish slow-downs
for the future PyPy-STM-JIT.

Interesting times :-)

For reference, this is what you get by downloading the
PyPy binary linked above: a Linux 64 binary (Ubuntu 12.04) that
should behave mostly like a regular PyPy. (One main missing feature is
that destructors are never called.) It uses two cores, but obviously
only if the Python program you run is multithreaded. The only new
built-in feature is with __pypy__.thread.atomic: this gives
you a way to enforce that a block of code runs "atomically", which means
without any operation from any other thread randomly interleaved.
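
A minimal sketch of how such an atomic block might be used follows. The import only succeeds on an STM-enabled PyPy; the fallback to a re-entrant lock on other interpreters is an assumption added so the sketch runs anywhere, and it only excludes other blocks that use the same lock. The counter and function names are made up.

```python
import threading

try:
    # Only available on an STM-enabled PyPy such as the build described here.
    from __pypy__.thread import atomic
except ImportError:
    # Fallback for illustration: a lock protects these blocks from each other,
    # which is weaker than true atomicity but enough for this example.
    atomic = threading.RLock()

counter = {"value": 0}


def add_many(n):
    """Increment the shared counter n times, one atomic block per increment."""
    for _ in range(n):
        with atomic:
            counter["value"] += 1


threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["value"])
```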

If you want to translate it yourself, you need a trunk version of clang
with three patches applied. That's the number of bugs that we couldn't
find workarounds for, not the total number of bugs we found by (ab)using
the address_space feature...

Stay tuned for more!

Armin & Remi

Friday, March 7, 2014

More progress was made on the NumPy front in the past month. On the compatibility front, we now pass ~130 more tests from NumPy's suite since the end of January. Currently, we pass 2336 tests out of 3265 tests run, with many of the failures representing portions of NumPy that we don't plan to implement in the near future (object dtypes, unicode, etc). There are still some failures that do represent issues, such as special indexing cases and failures to respect subclassed ndarrays in return values, which we do plan to resolve. There are also some unimplemented components and ufuncs remaining which we hope to implement, such as nditer and mtrand. Overall, the most common array functionality should be working.

Additionally, I began to take a look at some of the loops generated by our code. One widely used loop is dot, and we were running about 5x slower than NumPy's C version. I was able to optimize the dot loop and also the general array iterator to get us to ~1.5x NumPy C time on dot operations of various sizes. Further progress in this area could be made by using CFFI to tie into BLAS libraries, when available. Also, work remains in examining traces generated for our other loops and checking for potential optimizations.
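
For readers unfamiliar with it, the dot loop is essentially a multiply-and-accumulate over two arrays. The sketch below shows the one-dimensional case in plain Python; it is not the code inside PyPy's NumPy implementation, and the function name is chosen for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    total = 0.0
    # The hot loop: one multiply and one add per element.
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Tight numeric loops of this shape are where trace-level optimizations, or a hand-off to a BLAS routine, would matter most.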

To try out PyPy + NumPy, grab a nightly PyPy and install our NumPy fork. Feel free to report comments/issues to IRC, our mailing list, or bug tracker. Thanks to the contributors to the NumPy on PyPy proposal for supporting this work.

Cheers,
Brian
