Monday, November 17, 2014

Tornado without a GIL on PyPy STM

This post is by Konstantin Lopuhin, who tried PyPy STM during the
Warsaw sprint.

Python has a GIL, right? Not quite - PyPy STM is a python implementation
without a GIL, so it can scale CPU-bound work to several cores.
PyPy STM is developed by Armin Rigo and Remi Meier,
and supported by community donations.
You can read more about it in the
docs.

Although PyPy STM is still a work in progress, in many cases it can already
run CPU-bound code faster than regular PyPy, when using multiple cores.
Here we will see how to slightly modify Tornado IO loop to use
transaction
module.
This module is described
in the docs and is really simple to use - please see an example there.
An event loop of Tornado, or any other asynchronous
web server, looks like this (with some simplifications):

We get IO events, and run handlers for all of them, these handlers can
also register new callbacks, which we run too. When using such a framework,
it is very nice to have a guaranty that all handlers are run serially,
so you do not have to put any locks. This is an ideal case for the
transaction module - it gives us guaranties that things appear
to be run serially, so in user code we do not need any locks. We just
need to change the code above to something like:

But this does not scale. The CPU load is at 101-104 %, and we handle 30 %
less request per second. The reason for the slowdown is STM overhead,
which needs to keep track of all writes and reads in order to detect conflicts.
And the reason for using only one core is, obviously, conflicts!
Fortunately, we can see what this conflicts are, if we run code like this
(here 4 is the number of cores to use):

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py
to analyse this log. It lists the most expensive conflicts:

There are only three kinds of conflicts, they are described in
stm source,
Here we see that two threads call into external function to get current time,
and we can not rollback any of them, so one of them must wait till the other
transaction finishes.
For now we can hack around this by disabling this timing - this is only
needed for internal profiling in tornado.

If we do it, we get the following results (but see caveats below):

Impl.

req/s

PyPy 2.4

14.4

CPython 2.7

3.2

PyPy-STM 1

9.3

PyPy-STM 2

16.4

PyPy-STM 3

20.4

PyPy STM 4

24.2

As we can see, in this benchmark PyPy STM using just two cores
can beat regular PyPy!
This is not linear scaling, there are still conflicts left, and this
is a very simple example but still, it works!

But its not that simple yet :)

First, these are best-case numbers after long (much longer than for regular
PyPy) warmup. Second, it can sometimes crash (although removing old pyc files
fixes it). Third, benchmark meta-parameters are also tuned.

Here we get relatively good results only when there are a lot of concurrent
clients - as a results, a lot of requests pile up, the server is not keeping
with the load, and transaction module is busy with work running this piled up
requests. If we decrease the number of concurrent clients, results get slightly worse.
Another thing we can tune is how heavy is each request - again, if we ask
primes up to a lower number, then less time is spent doing calculations,
more time is spent in tornado, and results get much worse.

Besides the time.time() conflict described above, there are a lot of others.
The bulk of time is lost in these two conflicts:

The first one is presumably calling into some C function from stdlib, and we get
the same conflict as for time.time() above, but is can be fixed on PyPy
side, as we can be sure that computing sha1 is pure.

It is easy to hack around this one too, just removing etag support, but if
we do it, performance is much worse, only slightly faster than regular PyPy,
with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates
in the transaction module, from this piece of code:

Comment by Armin: This is a conflict in the transaction module itself; ideally,
it shouldn't have any, but in order to do that we might need a little bit
of support from RPython or C code. So this is pending improvement.

Tornado modification used in this blog post is based on 3.2.dev2.
As of now, the latest version is 4.0.2, and if we
apply
the same changes to this version, then we no longer get any scaling on this benchmark,
and there are no conflicts that take any substantial time.

Comment by Armin: There are two possible reactions to a conflict. We can either
abort one of the two threads, or (depending on the circumstances) just
pause the current thread until the other one commits, after which the
thread will likely be able to continue. The tool ``print_stm_log.py``
did not report conflicts that cause pauses. It has been fixed very
recently. Chances are that on this test it would report long pauses and
point to locations that cause them.

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses,
it is interesting to see how it works on a more realistic application.

astar.py
is a simple game where several players move on a map
(represented as a list of lists of integers),
build and destroy walls, and ask server to give them
shortest paths between two points
using A-star search, adopted from ActiveState recipie.

The benchmark bench_astar.py
is simulating players, and tries to put the main load on A-star search,
but also does some wall building and destruction. There are no locks
around map modifications, as normal tornado is executing all callbacks
serially, and we can keep this guaranty with atomic blocks of PyPy STM.
This is also an example of a program that is not trivial
to scale to multiple cores with separate processes (assuming
more interesting shared state and logic).

This benchmark is very noisy due to randomness of client interactions
(also it could be not linear), so just lower and upper bounds for
number of requests are reported

Impl.

req/s

PyPy 2.4

5 .. 7

CPython 2.7

0.5 .. 0.9

PyPy-STM 1

2 .. 4

PyPy STM 4

2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse
and STM overhead is sometimes higher.
The bulk of conflicts come from the transaction module (we have seen it
above):

Although it is definitely not ready for production use, you can already try
to run things, report bugs, and see what is missing in user-facing tools
and libraries.

Benchmarks setup:

Amazon c3.xlarge (4 cores) running Ubuntu 14.04

pypy-c-r74011-stm-jit for the primes benchmark (but it has more bugs
than more recent versions), and
pypy-c-r74378-74379-stm-jit
for astar benchmark (put it inside pypy source checkout at 38c9afbd253c)

This post is by Konstantin Lopuhin, who tried PyPy STM during the
Warsaw sprint.

Python has a GIL, right? Not quite - PyPy STM is a python implementation
without a GIL, so it can scale CPU-bound work to several cores.
PyPy STM is developed by Armin Rigo and Remi Meier,
and supported by community donations.
You can read more about it in the
docs.

Although PyPy STM is still a work in progress, in many cases it can already
run CPU-bound code faster than regular PyPy, when using multiple cores.
Here we will see how to slightly modify Tornado IO loop to use
transaction
module.
This module is described
in the docs and is really simple to use - please see an example there.
An event loop of Tornado, or any other asynchronous
web server, looks like this (with some simplifications):

We get IO events, and run handlers for all of them, these handlers can
also register new callbacks, which we run too. When using such a framework,
it is very nice to have a guaranty that all handlers are run serially,
so you do not have to put any locks. This is an ideal case for the
transaction module - it gives us guaranties that things appear
to be run serially, so in user code we do not need any locks. We just
need to change the code above to something like:

But this does not scale. The CPU load is at 101-104 %, and we handle 30 %
less request per second. The reason for the slowdown is STM overhead,
which needs to keep track of all writes and reads in order to detect conflicts.
And the reason for using only one core is, obviously, conflicts!
Fortunately, we can see what this conflicts are, if we run code like this
(here 4 is the number of cores to use):

PYPYSTM=stm.log ./primes.py 4

Then we can use print_stm_log.py
to analyse this log. It lists the most expensive conflicts:

There are only three kinds of conflicts, they are described in
stm source,
Here we see that two threads call into external function to get current time,
and we can not rollback any of them, so one of them must wait till the other
transaction finishes.
For now we can hack around this by disabling this timing - this is only
needed for internal profiling in tornado.

If we do it, we get the following results (but see caveats below):

Impl.

req/s

PyPy 2.4

14.4

CPython 2.7

3.2

PyPy-STM 1

9.3

PyPy-STM 2

16.4

PyPy-STM 3

20.4

PyPy STM 4

24.2

As we can see, in this benchmark PyPy STM using just two cores
can beat regular PyPy!
This is not linear scaling, there are still conflicts left, and this
is a very simple example but still, it works!

But its not that simple yet :)

First, these are best-case numbers after long (much longer than for regular
PyPy) warmup. Second, it can sometimes crash (although removing old pyc files
fixes it). Third, benchmark meta-parameters are also tuned.

Here we get relatively good results only when there are a lot of concurrent
clients - as a results, a lot of requests pile up, the server is not keeping
with the load, and transaction module is busy with work running this piled up
requests. If we decrease the number of concurrent clients, results get slightly worse.
Another thing we can tune is how heavy is each request - again, if we ask
primes up to a lower number, then less time is spent doing calculations,
more time is spent in tornado, and results get much worse.

Besides the time.time() conflict described above, there are a lot of others.
The bulk of time is lost in these two conflicts:

The first one is presumably calling into some C function from stdlib, and we get
the same conflict as for time.time() above, but is can be fixed on PyPy
side, as we can be sure that computing sha1 is pure.

It is easy to hack around this one too, just removing etag support, but if
we do it, performance is much worse, only slightly faster than regular PyPy,
with the top conflict being:

83.066s lost in aborts, 0.000s paused (459x STM_CONTENTION_WRITE_WRITE)
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__
File "/home/arigo/hg/pypy/stmgc-c7/lib-python/2.7/_weakrefset.py", line 70, in __contains__

Comment by Armin: It is unclear why this happens so far. We'll investigate...

The second conflict (without etag tweaks) originates
in the transaction module, from this piece of code:

Comment by Armin: This is a conflict in the transaction module itself; ideally,
it shouldn't have any, but in order to do that we might need a little bit
of support from RPython or C code. So this is pending improvement.

Tornado modification used in this blog post is based on 3.2.dev2.
As of now, the latest version is 4.0.2, and if we
apply
the same changes to this version, then we no longer get any scaling on this benchmark,
and there are no conflicts that take any substantial time.

Comment by Armin: There are two possible reactions to a conflict. We can either
abort one of the two threads, or (depending on the circumstances) just
pause the current thread until the other one commits, after which the
thread will likely be able to continue. The tool ``print_stm_log.py``
did not report conflicts that cause pauses. It has been fixed very
recently. Chances are that on this test it would report long pauses and
point to locations that cause them.

Part 2: a more interesting benchmark: A-star

Although we have seen that PyPy STM is not all moonlight and roses,
it is interesting to see how it works on a more realistic application.

astar.py
is a simple game where several players move on a map
(represented as a list of lists of integers),
build and destroy walls, and ask server to give them
shortest paths between two points
using A-star search, adopted from ActiveState recipie.

The benchmark bench_astar.py
is simulating players, and tries to put the main load on A-star search,
but also does some wall building and destruction. There are no locks
around map modifications, as normal tornado is executing all callbacks
serially, and we can keep this guaranty with atomic blocks of PyPy STM.
This is also an example of a program that is not trivial
to scale to multiple cores with separate processes (assuming
more interesting shared state and logic).

This benchmark is very noisy due to randomness of client interactions
(also it could be not linear), so just lower and upper bounds for
number of requests are reported

Impl.

req/s

PyPy 2.4

5 .. 7

CPython 2.7

0.5 .. 0.9

PyPy-STM 1

2 .. 4

PyPy STM 4

2 .. 6

Clearly this is a very bad benchmark, but still we can see that scaling is worse
and STM overhead is sometimes higher.
The bulk of conflicts come from the transaction module (we have seen it
above):

Although it is definitely not ready for production use, you can already try
to run things, report bugs, and see what is missing in user-facing tools
and libraries.

Benchmarks setup:

Amazon c3.xlarge (4 cores) running Ubuntu 14.04

pypy-c-r74011-stm-jit for the primes benchmark (but it has more bugs
than more recent versions), and
pypy-c-r74378-74379-stm-jit
for astar benchmark (put it inside pypy source checkout at 38c9afbd253c)

Correct me if I misunderstood the theory of PyPy-STM, but in the A* test there's nothing that prevents a get() to read the game map while MapChangeHandler.put() is running (that is, while the system is in an incoherent status)?

Shouldn't MapChangeHandler.put() be wrapped in a exclusive write lock, and all the get() handlers be wrapped with a shared read lock?

> Clearly this is a very benchmark" - looks like you've missed a word here ;)

Oh, yes, that word is "bad" :)

> Shouldn't MapChangeHandler.put() be wrapped in a exclusive write lock, and all the get() handlers be wrapped with a shared read lock?

Here all request handlers are already wrapped inside atomic blocks, but this is hidden from us in (modified) tornado. So we do not need any locks (as in normal tornado too, because normal tornado is single threaded). If request handlers conflict, then we just loose performance, not correctness. This is one of the main points of PyPy STM: it can support multithreaded code without needing to use locks.