This corresponds roughly to PyPy 2.3 (not 2.3.1). It requires 64-bit
Linux. More precisely, this release is built for Ubuntu 12.04 to 14.04;
you can also rebuild it from source by getting the branch stmgc-c7. You
need clang to compile, and you need a patched version of llvm.

This version's performance can reasonably be compared with a regular PyPy, where both include the JIT.

Thanks for following the meandering progress of PyPy-STM over the past three years: we're finally getting somewhere really interesting! We cannot thank the contributors to the previous PyPy-STM money pot enough for making this possible. And, although this blog post focuses on the results from that period, I should of course remind you that we're running a second call for donations for future work, which I will briefly mention again later.

A recap of what we did to get there: around the start of the year we found a new model, a "redo-log"-based STM which uses a couple of hardware tricks to not require chasing pointers, giving it (in this context) exceptionally cheap read barriers. This idea was developed over the following months and (relatively) easily integrated with the JIT compiler. The most recent improvements on the Garbage Collection side are closing the gap with a regular PyPy (there is still a bit more to do there). There is some preliminary user documentation.

Today, the result of this is a PyPy-STM that is capable of running pure Python code on multiple threads in parallel, as we will show in the benchmarks that follow. A quick warning: this is only about pure Python code. So far we have not tried to optimize the case where most of the time is spent in external libraries, or even in manipulating "raw" memory like array.array or numpy arrays. To some extent there is no point, because CPython's approach works well for this case, i.e. releasing the GIL around the long-running operations in C. Of course it would be nice if such cases worked as well in PyPy-STM (which they do to some extent), but checking and optimizing that is future work.

As a starting point for our benchmarks, when running code that
only uses one thread, we get a slow-down factor between 1.2x and 3x
compared with a regular PyPy: at best the run time is only 20% longer;
at worst it is three times as long. This worst case has been brought
down (it used to be 10x) by recent work on "card marking", a useful GC
technique that is also present in the regular PyPy (and about which I
can't find any blog post; maybe we should write one :-) The main
remaining issue is fork(), or any function that creates subprocesses:
it works, but is very slow. To remind you of this fact, it prints a
line to stderr when used.

Now the real main part: when you run multithreaded code, it scales very nicely with two
threads, and less-than-linearly but still not badly with three or four
threads. Here is an artificial example:

    total = 0
    lst1 = ["foo"]
    for i in range(100000000):
        lst1.append(i)
        total += lst1.pop()

We run this code N times, once in each of N threads (full benchmark).
Run times, best of three:

| Number of threads | Regular PyPy (head) | PyPy-STM |
|---|---|---|
| N = 1 | real 0.92s, user+sys 0.92s | real 1.34s, user+sys 1.34s |
| N = 2 | real 1.77s, user+sys 1.74s | real 1.39s, user+sys 2.47s |
| N = 3 | real 2.57s, user+sys 2.56s | real 1.58s, user+sys 4.106s |
| N = 4 | real 3.38s, user+sys 3.38s | real 1.64s, user+sys 5.35s |

(The "real" time is the wall clock time. The "user+sys" time is the
recorded CPU time, which can be larger than the wall clock time if
multiple CPUs run in parallel. This was run on a 4x2 cores machine.
For direct comparison, avoid loops that are so trivial
that the JIT can remove all allocations from them: right now
PyPy-STM does not handle this case well. It has to force a dummy allocation
in such loops, which makes minor collections occur much more frequently.)
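
The N-thread run described above can be sketched roughly as follows. This is not the linked benchmark script, only a minimal illustration: the names `loop`, `run` and `ITERATIONS` are made up here, the iteration count is much smaller than the 100000000 used in the post, and "user+sys" is approximated with `os.times()` for the current process.

```python
import os
import threading
import time

# Hypothetical value, far smaller than the post's 100000000, so that the
# sketch finishes quickly on any interpreter.
ITERATIONS = 100000


def loop(results, index):
    # Same loop body as the artificial example in the post.
    total = 0
    lst1 = ["foo"]
    for i in range(ITERATIONS):
        lst1.append(i)
        total += lst1.pop()
    results[index] = total


def run(n_threads):
    """Run loop() once in each of n_threads threads.

    Returns (results, real, cpu): the per-thread totals, the wall-clock
    time and the user+sys CPU time of this process, both in seconds.
    """
    results = [None] * n_threads
    threads = [threading.Thread(target=loop, args=(results, k))
               for k in range(n_threads)]
    cpu_before = os.times()
    wall_before = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    real = time.time() - wall_before
    cpu_after = os.times()
    # Fields 0 and 1 of os.times() are the user and system CPU times.
    cpu = (cpu_after[0] - cpu_before[0]) + (cpu_after[1] - cpu_before[1])
    return results, real, cpu


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        _, real, cpu = run(n)
        print("N = %d  real %.2fs  user+sys %.2fs" % (n, real, cpu))
```

On an interpreter with a GIL, "real" should grow roughly in proportion to N, as in the "Regular PyPy" column; when threads really run in parallel, "user+sys" can exceed "real", as in the PyPy-STM column.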

Four threads is the limit so far: only four threads can be executed in
parallel. Similarly, the memory usage is limited to 2.5 GB of GC
objects. These two limitations are not hard to increase, but at least
increasing the memory limit requires fighting against more LLVM bugs.
(Include here snark remarks about LLVM.)

Here are some measurements from more real-world benchmarks. This time,
the amount of work is fixed and we parallelize it on T threads. The first benchmark is just running translate.py on a trunk PyPy. The last
three benchmarks are here.

| Benchmark | PyPy 2.3 | (PyPy head) | PyPy-STM, T=1 | PyPy-STM, T=2 | PyPy-STM, T=3 | PyPy-STM, T=4 |
|---|---|---|---|---|---|---|
| translate.py --no-allworkingmodules (annotation step) | 184s | (170s) | 386s (2.10x) | n/a | n/a | n/a |
| multithread-richards, 5000 iterations | 24.2s | (16.8s) | 52.5s (2.17x) | 37.4s (1.55x) | 25.9s (1.07x) | 32.7s (1.35x) |
| mandelbrot, divided in 16-18 bands | 22.9s | (18.2s) | 27.5s (1.20x) | 14.4s (0.63x) | 10.3s (0.45x) | 8.71s (0.38x) |
| btree | 2.26s | (2.00s) | 2.01s (0.89x) | 2.22s (0.98x) | 2.14s (0.95x) | 2.42s (1.07x) |
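
The factors in parentheses are consistent with dividing each PyPy-STM time by the PyPy 2.3 time (not by the "PyPy head" time). The post does not state this explicitly, so treat it as a reading of the numbers; the small check below, with the hypothetical names `timings` and `factors`, recomputes them from the figures in the table.

```python
# Times in seconds copied from the table above:
# benchmark -> (PyPy 2.3 time, [PyPy-STM times for T=1, 2, 3, 4]).
timings = {
    "translate.py": (184.0, [386.0]),
    "multithread-richards": (24.2, [52.5, 37.4, 25.9, 32.7]),
    "mandelbrot": (22.9, [27.5, 14.4, 10.3, 8.71]),
    "btree": (2.26, [2.01, 2.22, 2.14, 2.42]),
}


def factors(baseline, stm_times):
    """Return each PyPy-STM time as a multiple of the baseline time."""
    return [round(t / baseline, 2) for t in stm_times]


if __name__ == "__main__":
    for name in sorted(timings):
        baseline, stm_times = timings[name]
        shown = ", ".join("%.2fx" % f for f in factors(baseline, stm_times))
        print("%-22s %s" % (name, shown))
```

A factor below 1.00x means PyPy-STM with that many threads finished faster than PyPy 2.3 did.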

This shows various cases that can occur:

The mandelbrot example runs with minimal overhead and very good parallelization.
It divides the plane to compute into bands, and each of the T threads receives the
same number of bands.

Richards, a classical benchmark for PyPy (tweaked to run the iterations
in multiple threads), is hard to beat on regular PyPy:
we suspect that the difference is due to the fact that a lot of
paths through the loops don't allocate, triggering the issue already
explained above. Moreover, the speed of Richards was again improved
dramatically recently, in trunk.

The translation benchmark measures the time translate.py
takes to run the first phase only, "annotation" (for now it consumes too much memory
to run translate.py to the end). Moreover, the timing starts only after the large number of
subprocesses spawned at the beginning (mostly gcc). This benchmark is not parallel, but we
include it for reference here. The slow-down factor of 2.1x is still too much, but
we have some idea about the reasons: most likely it is again the Garbage Collector, which lacks the regular PyPy's
very fast small-object allocator for old objects. Also, translate.py
is an example of an application that could, with
reasonable effort, be made largely parallel in the future using atomic blocks.

Atomic blocks are also present in the btree benchmark. I'm not completely sure,
but it seems that, in this case, the atomic blocks create too many
conflicts between the threads for actual parallelization: the base time is very good,
but running more threads does not help at all.
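
To make the mandelbrot case concrete, here is a minimal sketch of splitting an image into bands and handing the same number of bands to each thread. It is not the benchmark's actual code: the image size, the iteration limit and the names `mandel_row`, `make_bands` and `render` are assumptions made for illustration. The "16-18 bands" in the table is presumably there so that the band count divides evenly by T (16 for T=1, 2, 4 and 18 for T=3).

```python
import threading

# Hypothetical, deliberately tiny sizes so the sketch runs quickly.
WIDTH, HEIGHT, MAX_ITER = 60, 36, 50


def mandel_row(y):
    """Compute the escape counts for one row of the image."""
    row = []
    for x in range(WIDTH):
        c = complex(-2.0 + 3.0 * x / WIDTH, -1.0 + 2.0 * y / HEIGHT)
        z = 0j
        n = 0
        while n < MAX_ITER and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        row.append(n)
    return row


def make_bands(height, n_bands):
    """Split range(height) into n_bands contiguous (start, stop) bands."""
    edges = [height * b // n_bands for b in range(n_bands + 1)]
    return list(zip(edges[:-1], edges[1:]))


def render(n_threads, n_bands):
    """Render the image, giving every n_threads-th band to each thread."""
    bands = make_bands(HEIGHT, n_bands)
    image = [None] * HEIGHT

    def work(my_bands):
        # Each thread writes only the rows of its own bands, so threads
        # never touch the same list slot.
        for start, stop in my_bands:
            for y in range(start, stop):
                image[y] = mandel_row(y)

    threads = [threading.Thread(target=work, args=(bands[k::n_threads],))
               for k in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image
```

Because the threads write to disjoint rows, this kind of workload has few conflicts between threads, which fits the good scaling reported for mandelbrot.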

As a summary, PyPy-STM already looks useful for running CPU-bound multithreaded
applications. We are certainly still going to fight slow-downs, but it
seems that there are cases where 2 threads are enough to outperform a regular
PyPy by a large margin. Please try it out on your own small examples!

And, at the same time, please don't attempt to retrofit threads inside
an existing large program just to benefit from PyPy-STM!
Our goal is not to send everyone down the obscure route of multithreaded
programming and its dark traps. We are finally going to shift our main
focus to phase 2 of our
research (donations welcome): how to enable a better way of writing multi-core programs.
The starting point is to fix and test atomic blocks. Then we will have to
debug common causes of conflicts and fix them or work around them; and
try to see how common frameworks like Twisted can be adapted.

Lots of work ahead, but lots of work behind too :-)

Armin (thanks Remi as well for the work).

Hi all,

PyPy-STM is now reaching a point where we can say it's good enough to be
a GIL-less Python. (We don't guarantee there are no more bugs, so please
report them :-) The first official STM release:

Yes. Sorry, it doesn't make sense to me. You need to debug with gdb, probably with an executable that has got the debugging symbols. You need to either build it yourself, or recompile the pregenerated sources from: http://cobra.cs.uni-duesseldorf.de/~buildmaster/misc/pypy-c-r72356-stm-jit-SOURCE.txz

If I try virtualenv I get:

    virtualenv stmtest -p Projekt/pypy-stm-2.3-linux64/bin/pypy
    Running virtualenv with interpreter Projekt/pypy-stm-2.3-linux64/bin/pypy
    [forking: for now, this operation can take some time]
    [forking: for now, this operation can take some time]
    New pypy executable in stmtest/bin/pypy
    [forking: for now, this operation can take some time]
    ERROR: The executable stmtest/bin/pypy is not functioning
    ERROR: It thinks sys.prefix is u'/home/ernst' (should be u'/home/ernst/stmtest')
    ERROR: virtualenv is not compatible with this system or executable

Thanks to the author of the now-deleted comments, we could track and fix a bug that only shows up on some Linux systems. If pypy-stm systematically segfaults at start-up for you too, try the "2.3-r2" release (see update in the post itself).

This is exciting! One minor bug in the actual post: you can describe slowdown / speedup in two different ways, with total time as a percentage of original time, or with time difference as a percentage of original time. You mention a 20% slowdown (clearly using the latter standard) and then a 300% slowdown, which you describe as 3x (suggesting that you use the former standard). To be consistent, you should either describe them as 120% and 300%, respectively (using the former standard), or 20% and 200%, respectively (using the latter standard).

I figured this would be easily parallelizable (low conflicts) but it doesn't seem to be the case - I don't get all my cores pegged using the STM.

bench-threadpool.py, on my machine, gives about the same time for CPython and PyPy-STM, while ordinary PyPy totally smokes them both (even with the GIL :), one order of magnitude difference (20 sec vs 2 sec).

bench-threadpool-naive will crash the STM interpreter on my system. :)

Getting away from threads, CPython will actually beat PyPy in a multi-process scenario by a factor of 2, which I found surprising. CPython does indeed use up all my cores 100% while dealing with a process pool, while PyPy won't even come close.

For the same workload, PyPy is actually faster running multithreaded with the GIL than multi-process, and fastest running with only 1 thread (expected, with the GIL only being overhead in this scenario).

This is good news. For many of my applications, an important feature in the next phase will be the optimization for [..] the built-in dictionary type, for which we would like accesses and writes using independent keys to be truly independent [..]. My applications are mostly server applications (Twisted-based and others) that store state information on sessions/transactions in a small number of dictionaries that can have hundreds or thousands of entries concurrently, and would be accessed constantly.

@Armin I've pushed a version of bench-queue with a tweakable batch size and concurrency level. Doing the work in batches of, say, 1000 does indeed make it go faster with all implementations.

I've noticed pypy-stm runs have a large variance. It's not like I'm doing scientific measurements here, but for the queue test I'm getting runtimes from ~15 sec to ~27 sec, whereas for example ordinary PyPy is in the range 4.6 sec - 4.9 sec, and CPython ~22.5 - ~24.7, again, relatively close. Again, this is just something I noticed along the way and not the result of serious benchmarking in isolation.

Ooooof. Ok, I found out what is wrong in bench-queue. The issue is pretty technical, but basically if you add "with __pypy__.thread.atomic:" in the main top-level loop in worker(), then it gets vastly faster. On my machine it beats the real-time speed of a regular pypy. See http://bpaste.net/show/450553/
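
For readers who cannot open the paste, here is one plausible shape of such a change. It is only a sketch, not the code from the paste: the `worker()` body, the `run()` helper and the squaring "work" are made up, and placing the atomic block around the work of each iteration is just one reading of the comment. On interpreters without `__pypy__.thread.atomic`, the sketch substitutes a plain `threading.RLock()` so that it still runs; that stand-in is an assumption of this example and serializes the blocks instead of running them as transactions.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

try:
    # Name taken from the comment above; expected to exist on PyPy-STM.
    import __pypy__
    atomic = __pypy__.thread.atomic
except (ImportError, AttributeError):
    # Stand-in for other interpreters (assumption of this sketch): a lock
    # gives the same "runs as one indivisible block" behaviour, but
    # without any parallelism inside the block.
    atomic = threading.RLock()


def worker(q, results):
    # Main top-level loop of the worker.
    while True:
        item = q.get()
        if item is None:        # sentinel: no more work
            break
        with atomic:
            # Hypothetical unit of work, run as one atomic block.
            results.append(item * item)


def run(n_workers, items):
    """Process items with n_workers threads and return the results."""
    q = queue.Queue()
    for item in items:
        q.put(item)
    for _ in range(n_workers):
        q.put(None)             # one sentinel per worker
    results = []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The order of `results` depends on thread scheduling, so callers that need a stable order should sort it.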

If you guys did a facelift on the website, like the one for your HippyVM, I believe the project would gain a lot of momentum. It is unfortunate but true that most company managers would visit it and think it is not industrial quality if an employee came to them saying that they should sponsor developing something in PyPy.

this is a question for the guys developing PyPy... i am completely new to Python so please bear with me.

here is what i don't understand: it seems to me that you are reinventing the wheel because doesn't the Oracle or Azul Systems JVM already provide a super performant GC and JIT? even STM is becoming available. and since Jython can run on the JVM, why do PyPy at all?

wouldn't a JVM compliant implementation of Python be more performant than PyPy or CPython?

Having a JIT in the JVM is very different from having a JIT that can understand Python. For proof, the best (and only) implementation of Python on the JVM, Jython, is running at around CPython speed (generally a bit slower). I suspect that STM is similarly not designed for the purposes to which Jython would put it and would thus perform poorly. The only part that would probably work out of the box would be the GC. A more subtle argument against starting from the JVM is that of semantic mismatch. See for example http://www.stups.uni-duesseldorf.de/mediawiki/images/5/51/Pypy.pdf