Tuesday, August 23, 2011

We need Software Transactional Memory

Hi all. Here is (an extract of) a short summary paper about my current position on
Software Transactional Memory as a general tool in the implementation
of Python or Python-like languages. Thanks to people on IRC for discussion on making
this blog post better (lucian, Alex Gaynor, rguillebert, timonator, Da_Blitz).
For the purpose of the present discussion, we are comparing Java with Python
when it comes to multi-threading.

The problem in complex high-level languages

Like Java, the Python language gives guarantees: it is not acceptable
for the Python virtual machine to crash due to incorrect usage of
threads. A primitive operation in Java is something like reading or
writing a field of an object; the corresponding guarantees are along the
lines of: if the program reads a field of an object, and another thread
writes to the same field of the same object, then the program will see
either the old value, or the new value, but not something else entirely,
and the virtual machine will not crash.

Higher-level languages like Python differ from Java by the fact that a
"primitive operation" is far more complex. It may for example involve
looking in several hash maps, perhaps doing updates. In general, it is
completely impossible to map every operation that must be atomic to a
single processor instruction.

Jython: fine-grained locking

This problem has been solved "explicitly" in the Jython interpreter that
runs on top of Java. The solution is explicit in the following sense:
throughout the Jython interpreter, every single operation makes careful
use of Java-level locking mechanisms. This is an application of
"fine-grained locking". For example, operations like attribute lookup,
which need to perform look-ups in a number of hash maps, are protected
by acquiring and releasing locks (in __getattribute__).

A draw-back of this solution is the attention to detail required.
If even one place misses a lock, then there is either a
bug --- and such bugs occur in cases that are increasingly rare and hard
to debug as the previous bugs are fixed --- or we just file it under "differences
from CPython". There is however the risk of
deadlock, if two threads attempt to lock the same objects in different
order.

In practice, the situation is actually not as bad as
I may paint it: the number of locks in Jython is reasonable, and allows for
all the "common cases" to work as expected.
(For the uncommon cases, see below.)

Performance-wise, the Java virtual machine itself comes with locks that
have been heavily optimized over a long period of time, so the
performance is acceptable. However if this solution were coded in C, it
would need a lot of extra work to optimize the locks manually (possibly
introducing more of the subtle bugs).

CPython: coarse-grained locking

CPython, the standard implementation of Python in C, took a different
and simpler approach: it has a single global lock, called the Global
Interpreter Lock (GIL). It uses "coarse-grained locking": the lock is
acquired and released around the whole execution of one bytecode (or
actually a small number of bytecodes, like 100). This solution is
enough to ensure that no two operations can conflict with each other,
because the two bytecodes that invoke them are themselves
serialized by the GIL. It is a solution which avoids --- unlike Jython
--- writing careful lock-acquiring code all over the interpreter. It
also offers even stronger guarantees: every bytecode runs entirely
atomically.

Nowadays, the draw-back of the GIL approach is obvious on multi-core
machines: by serializing the execution of bytecodes, starting multiple
threads does not actually let the interpreter use of more than one core.

PyPy, the Python implementation in Python, takes the same approach so
far.

Existing usage

As we have seen, we have the following situation: the existing Python
language, as CPython implements it, offers very strong guarantees about
multi-threaded usage. It is important to emphasize that most existing
multi-threaded Python programs actually rely on such strong guarantees.
This can be seen for example in a problem that takes a populated list
and does in several threads:

next_item = global_list.pop()

This implicitly relies on the fact that pop() will perform atomic
removal from the list. If two threads try to pop() from the same list
at the same time, then the two operations will occur in one order or the
other; but they will not e.g. return the same object to both threads or
mess up the internal state of the list object.

With such an example in mind, it should be clear that we do not want a
solution to the multi-core issue that involves dropping these strong
guarantees. It is ok however to lower the barrier, as Jython does; but
any Python implementation must offer some guarantees, or not offer
multi-threading at all. This includes the fact that a lot of methods on
built-in types are supposed to be atomic.

(It should be noted that not offering multi-threading at all is actually
also a (partial) solution to the problem. Recently, several "hacks"
have appeared that give a programmer more-or-less transparent access to
multiple independent processes (e.g. multiprocessing). While these provide appropriate
solutions in some context, they are not as widely applicable as
multi-threading. As a typical example, they fail to apply when the
mutiple cores need to process information that cannot be serialized at
all --- a requirement for any data exchange between several processes.)

Here is an example of how Jython's consistency is weaker than CPython's GIL.
It takes uncommon examples to show it, and the fact that it does not work
like a CPython programmer expect them to is generally considered as an
implementation detail. Consider:

Each operation is atomic in the case of CPython, but decomposed in two steps
(which can each be considered atomic) in the case of Jython: reading from the
argument, and then updating the target set. Suppose that initially
set1 = {1}, set2 = {2}, set3 = {3}. On CPython, independently on
the order in which the threads run, we will end up with at least one of the
sets being {1, 2, 3}. On Jython, it is possible that all
three sets end up as containing two items only. The example is a bit
far-fetched but should show that CPython's consistency is strictly stronger
than Jython's.

PyPy

PyPy is a Python interpreter much like CPython or Jython, but the way it
is produced is particular. It is an interpreter written in RPython, a
subset of Python, which gets turned into a complete virtual machine (as
generated C code) automatically by a step called the "translation". In
this context, the trade-offs are different from the ones in CPython and
in Jython: it is possible in PyPy, and even easy, to apply arbitrary
whole-program transformations to the interpreter at "translation-time".

With this in mind, it is possible to imagine a whole-program
transformation that would add locking on every object manipulated in
RPython by the interpreter. This would end up in a situation similar to
Jython. However, it would not automatically solve the issue of
deadlocks, which is avoided in the case of Jython by careful manual
placement of the locks. (In fact, being deadlock-free is a global
program property that cannot be automatically ensured or verified; any
change to Jython can in theory break this property, and thus introduce
subtle deadlocks. The same applies to non-atomicity.)

In fact, we can easily check that if the interpreter accesses (for
both reading and writing)
objects A and B in a bytecode of thread 1, and objects B and A (in the
opposite order) in a bytecode of thread 2 --- and moreover if you need to
have accessed the first object before you can decide that you will need
to access the second object --- then there is no way (apart from the GIL) to avoid
a deadlock while keeping the strong guarantee of atomicity. Indeed, if
both threads have progressed to the middle of the execution of their
bytecode, then A has already been mutated by thread 1 and similarly B
has already been mutated by thread 2. It is not possible to
successfully continue running the threads in that case.

Using Software Transactional Memory

Software Transactional Memory (STM) is an approach that gives a solution
to precisely the above problem. If a thread ended up in a situation
where continuing to run it would be wrong, then we can abort and
rollback. This is similar to the notion of transaction on databases.
In the above example, one or both threads would notice that they are
about to run into troubles and abort. This means more concretely that
they need to have a way to restart execution at the start of the
bytecode, with all the side-effects of what they did so far being either
cancelled or just not committed yet.

We think that this capacity to abort and rollback is the missing piece
of the puzzle of multi-threaded implementations of Python.
Actually, according to the presentation of the problem given
above, it is unavoidable that any solution that wants to offer the
same level of consistency and atomicity as CPython would involve
the capacity of aborting and rolling back --- which means precisely
that STM cannot be avoided.

Ok, but why not settle down with Jython's
approach and put careful locks left and right throughout the interpreter?
Because (1) we would have to consider every operation's atomicity and make decisions
(or steal Jython's) and document them
here;
(2) it would also be really a lot of work, to optimize these locks e.g. with the
JIT as well as the JVM does; and (3) it is not the PyPy way to require manually
tweaking your code everywhere for a feature that should be orthogonal. Point
(3) is probably the most important here: you need to redo the work for every
language you implement in PyPy.
It also implies my own point (4): it is not fun :-)

In more details, the process would work as follows. (This gives an
overview of one possible model; it is possible that a different model
will end up being better.) In every thread:

At the start of a bytecode, we start a "transaction". This means
setting up a thread-local data structure to record a log of what
occurs in the transaction.

We record in the log all objects that are read, as well as the
modifications that we would like to make.

During this time, we detect "read" inconsistencies, shown by the
object's "last-modified" timestamp being later than the start time
of the current transaction, and abort. This prevents the rest of
the code from running with inconsistent values.

If we reach the end of the bytecode without a "read" inconsistency,
then we atomically check for "write" inconsistencies. These are
inconsistencies which arise from concurrent updates to objects
in the other threads --- either our "write" objects, or our "read"
objects.

If no inconsistency is found, we "commit" the transaction by copying
the delayed writes from the log into main memory.

The points at which a transaction starts or ends are exactly the
points at which, in CPython, the Global Interpreter Lock is
respectively acquired and released. If we ignore the fact that (purely for
performance) CPython acquires and releases the GIL only every N bytecodes,
then this means:

Before any bytecode we acquire the GIL (start a transaction), and after
the bytecode we release it (ends the transaction); and

Before doing an external call to the C library or the OS we release the GIL
(ends the transaction) and afterwards re-acquire it (start the next transaction).

So in particular this model is well suited to the STM condition that we cannot
do anything in a transaction that cannot be rolled back, like --- precisely ---
system calls. Indeed, by construction, these system calls occur outside a
transaction, because in CPython they occur with the GIL released.

Performance

A large number of implementation details are still open for now.
From a user's point of view (i.e. the programmer using Python),
the most relevant one is the overall performance impact. We
cannot give precise numbers so far, and we expect the initial
performance to be abysmally bad (maybe 10x slower); however, with
successive improvements to the locking mechanism, to the global
program transformation inserting the locks, to the garbage
collector (GC), and to the Just-in-Time (JIT) compiler, we
believe that it should be possible to get a roughly reasonable
performance (up to maybe 2x slower). For example, the GC can
maintain flags on the objects to know that they did not escape
their creation thread, and do not need any logging; and the JIT
compiler can aggregate several reads or writes to an object into
one. We believe that these are the kind of optimizations that
can give back a lot of the performance lost.

The state of STM

Transactional Memory is itself a relatively old idea, originating
from a 1986 paper by Tom Knight. At first based on hardware
support, the idea of software-only transactional memory (STM) was
popularized in 1995 and has recently been the focus of intense
research.

The approach outlined above --- using STM to form the core of the
implementation of a language --- is new, as far as we know. So
far, most implementations provide STM as a library feature. It
requires explicit usage, often in the form of explicitly
declaring which objects must be protected by STM (object-based
STMs). It is only recently that native STM support has started
to appear, notably in the Clojure language.

STM is described on Wikipedia as an approach that "greatly
simplifies conceptual understanding of multithreaded programs and
helps make programs more maintainable by working in harmony with
existing high-level abstractions such as objects and modules."
We actually think that these benefits are important enough to
warrant being exposed to the Python programmer as well, instead
of being used only internally. This would give the Python
programmer a very simple interface:

with atomic:
<these operations are executed atomically>

(This is an old idea. Funny how back in 2003 people, including me, thought that this was a hack. Now I'm writing a blog post to say "it was not a hack; it's explicitly using locks that is a hack." I'm buying the idea of composability.)

Conclusion

Taking these ideas and applying them in the context of an
implementation of a complex high-level language like Python comes
with its own challanges. In this context, using PyPy makes sense
as both an experimentation platform and as a platform that is
recently gaining attention for its performance. The alternatives
are unattractive: doing it in CPython for example would mean
globally rewriting the interpreter. In PyPy instead, we write it
as a transformation that is applied systematically at translation-time.
Also, PyPy is a general platform for generating fast interpreters
for dynamic languages; the STM implementation in PyPy would work
out of the box for other language implementations as well, instead
of just for Python.

Update:

This is mostly me (Armin Rigo) ranting aloud and trying experiments;
this post should not be confused as meaning that the whole PyPy team
will now spend the next years working on it full-time.
As I said it is orthogonal to the actual Python interpreter, and it is in
any case a feature that can be turned on or off during translation; I know
that in many or most use cases, people are more interested in getting a
fast PyPy rather than one which is twice as slow but scales well.

Nothing I said is really new. For proof, see
Riley and Zilles (2006)
as well as Tabba (2010) who both experimented with Hardware Transactional Memory, turning CPython or PyPy interpreter's GIL into start/end transactions, as I describe here.

Hi all. Here is (an extract of) a short summary paper about my current position on
Software Transactional Memory as a general tool in the implementation
of Python or Python-like languages. Thanks to people on IRC for discussion on making
this blog post better (lucian, Alex Gaynor, rguillebert, timonator, Da_Blitz).
For the purpose of the present discussion, we are comparing Java with Python
when it comes to multi-threading.

The problem in complex high-level languages

Like Java, the Python language gives guarantees: it is not acceptable
for the Python virtual machine to crash due to incorrect usage of
threads. A primitive operation in Java is something like reading or
writing a field of an object; the corresponding guarantees are along the
lines of: if the program reads a field of an object, and another thread
writes to the same field of the same object, then the program will see
either the old value, or the new value, but not something else entirely,
and the virtual machine will not crash.

Higher-level languages like Python differ from Java by the fact that a
"primitive operation" is far more complex. It may for example involve
looking in several hash maps, perhaps doing updates. In general, it is
completely impossible to map every operation that must be atomic to a
single processor instruction.

Jython: fine-grained locking

This problem has been solved "explicitly" in the Jython interpreter that
runs on top of Java. The solution is explicit in the following sense:
throughout the Jython interpreter, every single operation makes careful
use of Java-level locking mechanisms. This is an application of
"fine-grained locking". For example, operations like attribute lookup,
which need to perform look-ups in a number of hash maps, are protected
by acquiring and releasing locks (in __getattribute__).

A draw-back of this solution is the attention to detail required.
If even one place misses a lock, then there is either a
bug --- and such bugs occur in cases that are increasingly rare and hard
to debug as the previous bugs are fixed --- or we just file it under "differences
from CPython". There is however the risk of
deadlock, if two threads attempt to lock the same objects in different
order.

In practice, the situation is actually not as bad as
I may paint it: the number of locks in Jython is reasonable, and allows for
all the "common cases" to work as expected.
(For the uncommon cases, see below.)

Performance-wise, the Java virtual machine itself comes with locks that
have been heavily optimized over a long period of time, so the
performance is acceptable. However if this solution were coded in C, it
would need a lot of extra work to optimize the locks manually (possibly
introducing more of the subtle bugs).

CPython: coarse-grained locking

CPython, the standard implementation of Python in C, took a different
and simpler approach: it has a single global lock, called the Global
Interpreter Lock (GIL). It uses "coarse-grained locking": the lock is
acquired and released around the whole execution of one bytecode (or
actually a small number of bytecodes, like 100). This solution is
enough to ensure that no two operations can conflict with each other,
because the two bytecodes that invoke them are themselves
serialized by the GIL. It is a solution which avoids --- unlike Jython
--- writing careful lock-acquiring code all over the interpreter. It
also offers even stronger guarantees: every bytecode runs entirely
atomically.

Nowadays, the draw-back of the GIL approach is obvious on multi-core
machines: by serializing the execution of bytecodes, starting multiple
threads does not actually let the interpreter use of more than one core.

PyPy, the Python implementation in Python, takes the same approach so
far.

Existing usage

As we have seen, we have the following situation: the existing Python
language, as CPython implements it, offers very strong guarantees about
multi-threaded usage. It is important to emphasize that most existing
multi-threaded Python programs actually rely on such strong guarantees.
This can be seen for example in a problem that takes a populated list
and does in several threads:

next_item = global_list.pop()

This implicitly relies on the fact that pop() will perform atomic
removal from the list. If two threads try to pop() from the same list
at the same time, then the two operations will occur in one order or the
other; but they will not e.g. return the same object to both threads or
mess up the internal state of the list object.

With such an example in mind, it should be clear that we do not want a
solution to the multi-core issue that involves dropping these strong
guarantees. It is ok however to lower the barrier, as Jython does; but
any Python implementation must offer some guarantees, or not offer
multi-threading at all. This includes the fact that a lot of methods on
built-in types are supposed to be atomic.

(It should be noted that not offering multi-threading at all is actually
also a (partial) solution to the problem. Recently, several "hacks"
have appeared that give a programmer more-or-less transparent access to
multiple independent processes (e.g. multiprocessing). While these provide appropriate
solutions in some context, they are not as widely applicable as
multi-threading. As a typical example, they fail to apply when the
mutiple cores need to process information that cannot be serialized at
all --- a requirement for any data exchange between several processes.)

Here is an example of how Jython's consistency is weaker than CPython's GIL.
It takes uncommon examples to show it, and the fact that it does not work
like a CPython programmer expect them to is generally considered as an
implementation detail. Consider:

Each operation is atomic in the case of CPython, but decomposed in two steps
(which can each be considered atomic) in the case of Jython: reading from the
argument, and then updating the target set. Suppose that initially
set1 = {1}, set2 = {2}, set3 = {3}. On CPython, independently on
the order in which the threads run, we will end up with at least one of the
sets being {1, 2, 3}. On Jython, it is possible that all
three sets end up as containing two items only. The example is a bit
far-fetched but should show that CPython's consistency is strictly stronger
than Jython's.

PyPy

PyPy is a Python interpreter much like CPython or Jython, but the way it
is produced is particular. It is an interpreter written in RPython, a
subset of Python, which gets turned into a complete virtual machine (as
generated C code) automatically by a step called the "translation". In
this context, the trade-offs are different from the ones in CPython and
in Jython: it is possible in PyPy, and even easy, to apply arbitrary
whole-program transformations to the interpreter at "translation-time".

With this in mind, it is possible to imagine a whole-program
transformation that would add locking on every object manipulated in
RPython by the interpreter. This would end up in a situation similar to
Jython. However, it would not automatically solve the issue of
deadlocks, which is avoided in the case of Jython by careful manual
placement of the locks. (In fact, being deadlock-free is a global
program property that cannot be automatically ensured or verified; any
change to Jython can in theory break this property, and thus introduce
subtle deadlocks. The same applies to non-atomicity.)

In fact, we can easily check that if the interpreter accesses (for
both reading and writing)
objects A and B in a bytecode of thread 1, and objects B and A (in the
opposite order) in a bytecode of thread 2 --- and moreover if you need to
have accessed the first object before you can decide that you will need
to access the second object --- then there is no way (apart from the GIL) to avoid
a deadlock while keeping the strong guarantee of atomicity. Indeed, if
both threads have progressed to the middle of the execution of their
bytecode, then A has already been mutated by thread 1 and similarly B
has already been mutated by thread 2. It is not possible to
successfully continue running the threads in that case.

Using Software Transactional Memory

Software Transactional Memory (STM) is an approach that gives a solution
to precisely the above problem. If a thread ended up in a situation
where continuing to run it would be wrong, then we can abort and
rollback. This is similar to the notion of transaction on databases.
In the above example, one or both threads would notice that they are
about to run into troubles and abort. This means more concretely that
they need to have a way to restart execution at the start of the
bytecode, with all the side-effects of what they did so far being either
cancelled or just not committed yet.

We think that this capacity to abort and rollback is the missing piece
of the puzzle of multi-threaded implementations of Python.
Actually, according to the presentation of the problem given
above, it is unavoidable that any solution that wants to offer the
same level of consistency and atomicity as CPython would involve
the capacity of aborting and rolling back --- which means precisely
that STM cannot be avoided.

Ok, but why not settle down with Jython's
approach and put careful locks left and right throughout the interpreter?
Because (1) we would have to consider every operation's atomicity and make decisions
(or steal Jython's) and document them
here;
(2) it would also be really a lot of work, to optimize these locks e.g. with the
JIT as well as the JVM does; and (3) it is not the PyPy way to require manually
tweaking your code everywhere for a feature that should be orthogonal. Point
(3) is probably the most important here: you need to redo the work for every
language you implement in PyPy.
It also implies my own point (4): it is not fun :-)

In more details, the process would work as follows. (This gives an
overview of one possible model; it is possible that a different model
will end up being better.) In every thread:

At the start of a bytecode, we start a "transaction". This means
setting up a thread-local data structure to record a log of what
occurs in the transaction.

We record in the log all objects that are read, as well as the
modifications that we would like to make.

During this time, we detect "read" inconsistencies, shown by the
object's "last-modified" timestamp being later than the start time
of the current transaction, and abort. This prevents the rest of
the code from running with inconsistent values.

If we reach the end of the bytecode without a "read" inconsistency,
then we atomically check for "write" inconsistencies. These are
inconsistencies which arise from concurrent updates to objects
in the other threads --- either our "write" objects, or our "read"
objects.

If no inconsistency is found, we "commit" the transaction by copying
the delayed writes from the log into main memory.

The points at which a transaction starts or ends are exactly the
points at which, in CPython, the Global Interpreter Lock is
respectively acquired and released. If we ignore the fact that (purely for
performance) CPython acquires and releases the GIL only every N bytecodes,
then this means:

Before any bytecode we acquire the GIL (start a transaction), and after
the bytecode we release it (ends the transaction); and

Before doing an external call to the C library or the OS we release the GIL
(ends the transaction) and afterwards re-acquire it (start the next transaction).

So in particular this model is well suited to the STM condition that we cannot
do anything in a transaction that cannot be rolled back, like --- precisely ---
system calls. Indeed, by construction, these system calls occur outside a
transaction, because in CPython they occur with the GIL released.

Performance

A large number of implementation details are still open for now.
From a user's point of view (i.e. the programmer using Python),
the most relevant one is the overall performance impact. We
cannot give precise numbers so far, and we expect the initial
performance to be abysmally bad (maybe 10x slower); however, with
successive improvements to the locking mechanism, to the global
program transformation inserting the locks, to the garbage
collector (GC), and to the Just-in-Time (JIT) compiler, we
believe that it should be possible to get a roughly reasonable
performance (up to maybe 2x slower). For example, the GC can
maintain flags on the objects to know that they did not escape
their creation thread, and do not need any logging; and the JIT
compiler can aggregate several reads or writes to an object into
one. We believe that these are the kind of optimizations that
can give back a lot of the performance lost.

The state of STM

Transactional Memory is itself a relatively old idea, originating
from a 1986 paper by Tom Knight. At first based on hardware
support, the idea of software-only transactional memory (STM) was
popularized in 1995 and has recently been the focus of intense
research.

The approach outlined above --- using STM to form the core of the
implementation of a language --- is new, as far as we know. So
far, most implementations provide STM as a library feature. It
requires explicit usage, often in the form of explicitly
declaring which objects must be protected by STM (object-based
STMs). It is only recently that native STM support has started
to appear, notably in the Clojure language.

STM is described on Wikipedia as an approach that "greatly
simplifies conceptual understanding of multithreaded programs and
helps make programs more maintainable by working in harmony with
existing high-level abstractions such as objects and modules."
We actually think that these benefits are important enough to
warrant being exposed to the Python programmer as well, instead
of being used only internally. This would give the Python
programmer a very simple interface:

with atomic:
<these operations are executed atomically>

(This is an old idea. Funny how back in 2003 people, including me, thought that this was a hack. Now I'm writing a blog post to say "it was not a hack; it's explicitly using locks that is a hack." I'm buying the idea of composability.)

Conclusion

Taking these ideas and applying them in the context of an
implementation of a complex high-level language like Python comes
with its own challanges. In this context, using PyPy makes sense
as both an experimentation platform and as a platform that is
recently gaining attention for its performance. The alternatives
are unattractive: doing it in CPython for example would mean
globally rewriting the interpreter. In PyPy instead, we write it
as a transformation that is applied systematically at translation-time.
Also, PyPy is a general platform for generating fast interpreters
for dynamic languages; the STM implementation in PyPy would work
out of the box for other language implementations as well, instead
of just for Python.

Update:

This is mostly me (Armin Rigo) ranting aloud and trying experiments;
this post should not be confused as meaning that the whole PyPy team
will now spend the next years working on it full-time.
As I said it is orthogonal to the actual Python interpreter, and it is in
any case a feature that can be turned on or off during translation; I know
that in many or most use cases, people are more interested in getting a
fast PyPy rather than one which is twice as slow but scales well.

Nothing I said is really new. For proof, see
Riley and Zilles (2006)
as well as Tabba (2010) who both experimented with Hardware Transactional Memory, turning CPython or PyPy interpreter's GIL into start/end transactions, as I describe here.

Implementing STM at a core level is certainly a nice research topic, but I wonder whether it's the best way forward for Python.

STM works well in Haskell because it has the type system to enforce several constraints. Also most data is immutable in Haskell, so threading is mostly safe by default.

Most Python objects are mutable (by default), so users have to be very careful when using multi-threading. STM gives you a nice, composable primitive to protect your critical sections, but it does not tell where your critical sections are.

You dismiss multiprocessing because of serialization issues, but what about multiprocessing within the same process? You have a VM already, so my guess would be that it wouldn't be that hard to implement software processes (a la Erlang). Sure, using message passing may lead to a fair amount of copying, but I it seems to be much easier to implement and easier to use than shared-memory concurrency + STM.

@Thomas Schilling: I don't see how having a "multiprocessing" that uses the same process, rather than different processes, makes a difference. In both cases you need to write your threading code specially and care about explicitly transferring objects via shared memory --- either to another OS thread in the same process, or to a different process altogether.

I'm with illume... look at what Apple has done with blocks. This seems like a very efficient way forward.

Separately, you are missing something about the Java-side.

For many of the data structures in Java there are atomic and non-atomic versions. That is, when you are using a data structure on a single thread, you grab the non-atomic version. This way, you don't pay for the overhead of the locking. But, when you are sharing a data structure between threads, you use the atomic version. As a by-product of history, though it is a nice by-product, you usually get the atomic version by default. That is to say, you have to go looking for trouble by explicitly asking for the non-atomic version.

By baking this into the language, you are forcing a single policy on all programs, rather than letting the programmer choose what policy is going to be best in that scenario. Either that, or they will be forced to put code guards all over the place.

To me, it seems like the language/runtime should provide the most basic of atomic operations, and the run-time library on top should provide the policy. That's the Java approach, in a nutshell. It gives the programmer flexibility and keeps the core runtime simple and easier to optimize.

Granted, you want a high-level language where the programmer doesn't make a lot of these decisions. So... looking at your own arguments... you are expecting an initial 10x performance hit relative to the current GIL-python approach, with hopes of getting it down to 2x performance... If that's the case, why not just stick with the GIL and have Python programmers take advantage of multiprocessing by creating co-operative programs using a message passing API. In some ways, it's a little more TAUP to do it that way, isn't it?

@Anonymous: this case can be handled on a case-by-case basis (e.g. special-casing "prints" to buffer), but it also has a general solution: we turn the transaction into an "inevitable" transaction, i.e. one which cannot fail.

I already have support for this in my demo code, because it is needed to handle the cases where the nesting of the C program is such that setjmp/longjmp can no longer work. The typical example is the RETURN_VALUE bytecode. It starts a transaction, returns to the caller by popping off some C frames, then ends the transaction in the caller. When we return from the C frame of the callee, in the middle of the transaction, we notice that we won't have the setjmp around any longer, so we are not allowed to abort and rollback any more.

Inevitable transactions have the property of being "a bit like" a GIL in the sense that you can only have one in total, and other transactions cannot commit before it does. In case of the RETURN_VALUE, it's a very short transaction so it shouldn't really be a problem. For the case of a user-specified "with atomic:" block, it can make all the other threads pause. Not ideal, but at least better than nothing...

@Sam Wilson: as you know, the PyPy approach is to sacrifice nothing to performance for the user, and get reasonably good (if not exactly Java-level) performance anyway :-)

I should also mention generally that for some programs that I have in mind, using a message-passing API would be a complete rewrite (if it is really possible at all), whereas "just" making them multithreaded can be done. The "translate.py" of PyPy falls into this category. It is a program that heavily use objects within objects within objects in a big non-nicely-separable "mess", and I would not dare to think about how to send parts of this object graph over a messaging API and get back localized updates.

Of course there are also other use cases where you can naturally get a model that plays nicely with message passing.

@nekto0n: nothing particular. The two threads will run the calls in parallel, just like CPython, which calls the send() function without any GIL acquired. What exactly occurs depends on the OS and not on the language.

Funny to see how Python eats itself like an Ouroboros. Wrong design decisions that made concurrency almost impossible, dirty hacks ("dirty" compared to, for example, Erlang's approach to SMP — almost linear scalability with a number of cores with 10-20% static overhead thanks to locks) that PyPy team are trying to do to solve problems introduced by Guido's ignorance, and a lot of Python "programmers" that don't understand what SMP is. Python is a ghetto, for real.

Seems like it should be possible to guarantee performance not much worse than with a GIL.

Am I right in thinking there is a locked section where changes are written to memory? The execution before this is effectively just some speculative computation to to speed up the locked section. If it turns out there's an inconsistency, just execute the locked section as you would normally. If the speculative computation is failing most of the time or is slow, switch to not doing it -- and we are back to GIL performance.

@Armin: Each in-memory process would use its own part of the heap so there would be no locking necessary except during message sending. You also don't need to have a 1-to-1 mapping of OS threads to processes. You could schedule N processes onto M OS threads (preferably chosen to match the number of CPU cores).

Of course, if you don't want a message-passing model (as you mentioned in another comment) then fine.

My argument is just that: STM is difficult to implement, difficult to make fast, and it still isn't that easy to use. A message passing model is much easier to implement and easier to use for end users. (You can still get deadlocks, but you could provide libraries for standard communication patterns which you only have to get right once, like Erlang's OTP.)

I think that there is some confusion here about what the underlying problem that you are trying to solve is.

The underlying (fundamental) problem that transactional memory as a method to replace GIL in Python is trying to solve is: automatic parallelization. That *is* hard.

Mediocre implementations of transactional memory are trivial to implement. Almost anybody can do it. Of course, the performance will be horrible.

If we stick to the idea about the underlying problem (automatic parallelization) and keep it in our minds while thinking, it is clear and utterly obvious that *any* implementation of transactional memory which is slower than serial execution is simply missing the target. The target is, obviously, to run the program faster than serial execution. Otherwise, it would be totally pointless.

Based on this reasoning, it is an *obvious* conclusion that a transactional memory implementation simply cannot be allowed to result in lower performance than serial execution of the code. Allowing lower performance would be completely irrational.

We are humans, not animals. Rationality is our distinctive feature. We have to try to follow rationality.

In light of this, saying that "It is OK for transactional memory to result in 2x slowdown" is irrational. I will write it one more time: accepting 2x slowdown is irrational.

Now, it is crucial to note that there are various kinds of performance measurements. And it is OK to slow down one performance indicator while boosting another performance indicator. For example, in web server environment, it is OK to slow down the delivery of individual web pages by a factor 1.3 - while boosting the number of requests per second by 2.3. That is *rational* and perfectly OK. Also, 3x developer productivity boost would be OK.

Following this example, if transactional memory is allowed to slow down performance of the program (compared to serial execution) by 2x, a person who follows rationally would immediately be drawn to seek for the evidence of a greater-than-2x performance boost in another area of the program.

If PyPy's transactional memory implementation would serialize calls to the Linux kernel (because it is hard to emulate them in user-space), then this alone would prevent some programs to achieve the more-than-2x performance boost. This is because it is impossible to boost program performance (in some areas, given a particular performance indicator) unless the modified program is allowed to call kernel functions out-of-order or in parallel.

-----

Note: I am *not* saying that PyPy should give up. I am just trying to note that you do not seem to know what you are doing. But I may be wrong.

Of course the sentence "It is OK for transactional memory to result in 2x slowdown" was meant "on one thread". As soon as your program uses more than 2 threads, on a more-than-2-CPUs machine, then you win.

- benchmark "iterate": This function is not making calls to other functions. The authors are independently running 16 instances of the "iterate" function on a 16-core CPU using 16 threads. The speedup in respect to unmodified CPython is 7x. The slowdown in respect to 16 CPython processes is 2.2x.

- benchmark "count": This is similar to "iterate". The speedup in respect to unmodified CPython is 4.5x. The slowdown in respect to 16 CPython processes is 3.5x.

- benchmark "pystone": This function is making calls to other functions. 16 instances of the "pystone" function on a 16-core CPU using 16 threads. The speedup in respect to unmodified CPython is 0.9x. The slowdown in respect to 16 CPython processes is 17x.

My analysis:

- iterate: The fact that N instances of this function can run in parallel without any interference can be determined easily. The algorithm to determine this is trivial. (Not to mention, the pointless loop in the function can be replaced by a NOP in a dead-code elimination pass).

- count: same as "iterate".

- pystone: It is not trivial to determine whether multiple instances can run in parallel. So, it should presumably run single-threaded.

- The article is *not* mentioning any real problem that was solved by TM in the case of "iterate", "count" or "pystone". That is logical, since the truth is that there is no real problem to solve here. The benchmark functions can be trivially run in 16 CPython Linux processes - anybody can do that (even your grandma).

My summary:

- In case of the two functions for which it *can* be trivially determined whether their instances can run in parallel, the TM approach results in a 2x-3x slowdown compared to the most basic auto-parallelization algorithm.

- In case of the function for which it *cannot* be trivially determined whether multiple instances can run in parallel, the TM approach running on 4-16 threads achieved 90% (loss of 10%) of the speed of single-threaded CPython without TM. On 1 thread, the TM approach is 2.1x slower.

Bottom line:

Call me crazy, but my conclusion from this article is that TM (at least the TM approach from the article) is not working at all.

Cool to see this happening. What's also cool is the result reported in Rossbach et al's study (http://www.neverworkintheory.org/?p=122): novices using STM did better in simple programming problems than students using traditional mechanisms, even though they thought they had done worse. "Baroque syntax" may be part of the problem; I'm sure the paper's authors would be happy to chat.

⚛, you're missing a very important bit of the paper. In it, the authors say, that the Rock hardware only holds 256 bytes of write-buffer content, while Riley and Zilles¹ determined the average write-buffer size needed for transactions to not fail prematurely would be "less than 640 bytes", which is almost three times as much as Rock offers.

Thus, the big slowdown that the pystone benchmark experiences could be caused by the shortcomings of the TM built into Rock.

I do have to agree, though, that the "benchmarks" used in the paper are not very satisfactory. However, the magical "simple parallelization algorithm" you summon in your comment would break down quite easily shortly after the complexity of the situation increases by just a bit, would it not?

¹ I only briefly glanced over the paper, so if anyone read it more thoroughly, they can feel free to correct me.

I thought Erlang successfully solved this problem years ago? And I don't think anything scales better than it. So why aren't we just copying them? Message passing, where each thread or process share absolutely nothing, is the sanest and safest way to do concurrent and multi-threaded programming. I mean, you don't even have to worry about locking! STM always seemed complicated to me.

@squeaky_pl: thanks for the link. In some way researching this is ultimately doomed: either transactional memory doesn't work, or it does and in 5 or 10 years all CPUs will have good hardware support and will be able to run existing software like CPython with minor changes. :-)

The high-level semantics that the Python VM provides through the GIL are perfect for most programs, and for most programmer's knowledge about concurrency.

What is the purpose of going after the GIL?

If it's just a performance boost on multiple cores, then an GIOL (global IO lock) implemented on the VM, as the GIL is, should be considered. The VM could run several OS threads blocking them on IO and releasing GIL.

If the purpose is to make concurrent programming easy and correct, it can be proven that it is not possible.

Yet, there are alternatives that don't alter the language or the semantics that can be explored.

Erlang-style message passing can be provided through object proxies implemented on top or beneath the VM, so the threads/processes can even run on different computers.