Tuesday, August 30, 2011

Well over a year ago, work was started on the cppyy module which lives in the
reflex-support branch.
Since then, work has progressed at a varying pace and has included a recent
sprint in Düsseldorf, last July.

Let's first take a step back and recap why we're interested in doing this,
given that it is perfectly possible to use C++ through generated bindings and
cpyext.
cppyy makes use of reflection information generated for the C++ classes of
interest, and has that reflection information available at run time.
Therefore, it is able to open up complex C++ types to the JIT in a manner
conceptually similar to the way simple types are.
This means that it is possible to get rid of a lot of the marshalling layers
when making cross-language calls, resulting in much lower call overhead than
is possible when going through the CPython API, or other methods of wrapping.

There are two problems that need to be solved: C++ language constructs need to
be presented on the Python side in a natural way; and cross-language impedance
mismatches need to be minimized, with some hints from the user if need be.
For the former, the list of mapped features has grown to a set that is
sufficient to do real work.
There is now support for:

builtin, pointer, and array types

namespaces, classes, and inner classes

global functions, global data

static/instance data members and methods

default arguments, object return by value

single and multiple (virtual) inheritance

templated classes

basic STL support and pythonizations

basic (non-global) operator mapping
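
To give a flavor of what this looks like from the Python side, here is a
short, illustrative sketch; the library, class, and member names are made up,
and it assumes that reflection info for them was generated beforehand:

import cppyy

# load the pre-generated reflection information (hypothetical library name)
cppyy.load_reflection_info("libMyClassDict.so")

# C++ entities appear under cppyy.gbl, mirroring the global namespace
MyClass = cppyy.gbl.MyClass

obj = MyClass(42)                 # constructor call, with overload resolution
print obj.get_value()             # instance method call (hypothetical method)
print cppyy.gbl.my_global_func()  # global function (hypothetical)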

The second problem is harder and will always be an ongoing process.
But one of the more important issues has been solved at the recent Düsseldorf
sprint, namely, that of reclaiming C++ objects instantiated from the Python
side by the garbage collector.

Performance has also improved, especially that of the nicer "pythonized"
interface that the user actually sees, although it still misses out on
about a factor of 2.5 in comparison to the lower-level interface (which has
gotten uglier, so you really don't want to use that).
Most of this improvement is due to restructuring so that it plays nicer with
the JIT and libffi, both of which themselves have seen improvements.

Work is currently concentrated on the back-ends: a CINT back-end is underway
and an LLVM/Clang pre-compiled headers (PCH) back-end is planned.
The latter is needed before this code can be released in the wild rather than
used only in high energy physics (HEP), as it would be easier to support.
Also, within HEP, Clang's PCH is foreseen to be the future format of
reflection information.

At the end of the Düsseldorf sprint, we tried a little code that did something
actually "useful," namely the filling of a histogram with some random values.
We did get it to work, but trying cppyy on a large class library showed
that a good warning system for things such as missing classes was sorely
needed.
That has been added since, and revisiting the histogram example later gave an
interesting result: the pypy-c run takes 1.5x the time of the compiled,
optimized C++ code.
The run was timed start to finish, including the reflection library loading
and JIT warm-up that is needed in the case of Python, but not for the compiled
C++ code.
However, in HEP, scientists run many short jobs while developing their
analysis codes, before submitting larger jobs on the GRID to run during lunch
time or overnight.
Thus, a more realistic comparison is to include the compilation time needed
for the C++ code; with that included, the Python code needs only 55% of the
time required by C++.

The choice of a programming language is often a personal one, and arguments
such as the idea that C++ is hard to use typically do not carry much weight
with the in-crowd that studies quantum field dynamics for fun.
However, getting the prompt with your analysis results back faster is a sure
winner. We hope that cppyy will soon have progressed far enough to make it
useful first to particle physicists, and later to anyone wrapping C++
libraries.

Wim Lavrijsen, Carl Friedrich Bolz, Armin Rigo


Tuesday, August 23, 2011

Hi all. Here is (an extract of) a short summary paper about my current position on
Software Transactional Memory as a general tool in the implementation
of Python or Python-like languages. Thanks to people on IRC for discussion on making
this blog post better (lucian, Alex Gaynor, rguillebert, timonator, Da_Blitz).
For the purpose of the present discussion, we are comparing Java with Python
when it comes to multi-threading.

The problem in complex high-level languages

Like Java, the Python language gives guarantees: it is not acceptable
for the Python virtual machine to crash due to incorrect usage of
threads. A primitive operation in Java is something like reading or
writing a field of an object; the corresponding guarantees are along the
lines of: if the program reads a field of an object, and another thread
writes to the same field of the same object, then the program will see
either the old value, or the new value, but not something else entirely,
and the virtual machine will not crash.

Higher-level languages like Python differ from Java by the fact that a
"primitive operation" is far more complex. It may for example involve
looking in several hash maps, perhaps doing updates. In general, it is
completely impossible to map every operation that must be atomic to a
single processor instruction.

Jython: fine-grained locking

This problem has been solved "explicitly" in the Jython interpreter that
runs on top of Java. The solution is explicit in the following sense:
throughout the Jython interpreter, every single operation makes careful
use of Java-level locking mechanisms. This is an application of
"fine-grained locking". For example, operations like attribute lookup,
which need to perform look-ups in a number of hash maps, are protected
by acquiring and releasing locks (in __getattribute__).
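
For illustration, here is a minimal sketch (not Jython's actual code, and in
Python rather than Java) of what fine-grained locking means at the level of a
single object's attribute access:

import threading

class Instance(object):
    # toy object: every primitive operation takes the object's own lock
    def __init__(self):
        self._lock = threading.Lock()
        self._attrs = {}

    def getattr(self, name):
        with self._lock:            # acquire, look up, release
            return self._attrs[name]

    def setattr(self, name, value):
        with self._lock:
            self._attrs[name] = value

A real attribute lookup must also consult the class and its bases, each with
their own locks, which is where the lock-ordering questions below come from.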

A drawback of this solution is the attention to detail required.
If even one place misses a lock, then there is either a
bug --- and such bugs occur in cases that are increasingly rare and hard
to debug as the previous bugs are fixed --- or we just file it under "differences
from CPython". There is however the risk of
deadlock, if two threads attempt to lock the same objects in different
order.

In practice, the situation is actually not as bad as
I may paint it: the number of locks in Jython is reasonable, and allows for
all the "common cases" to work as expected.
(For the uncommon cases, see below.)

Performance-wise, the Java virtual machine itself comes with locks that
have been heavily optimized over a long period of time, so the
performance is acceptable. However if this solution were coded in C, it
would need a lot of extra work to optimize the locks manually (possibly
introducing more of the subtle bugs).

CPython: coarse-grained locking

CPython, the standard implementation of Python in C, took a different
and simpler approach: it has a single global lock, called the Global
Interpreter Lock (GIL). It uses "coarse-grained locking": the lock is
acquired and released around the whole execution of one bytecode (or
actually a small number of bytecodes, like 100). This solution is
enough to ensure that no two operations can conflict with each other,
because the two bytecodes that invoke them are themselves
serialized by the GIL. It is a solution which avoids --- unlike Jython
--- writing careful lock-acquiring code all over the interpreter. It
also offers even stronger guarantees: every bytecode runs entirely
atomically.
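
This interval is in fact exposed and tunable in CPython 2.x; for example:

import sys

print sys.getcheckinterval()   # 100 by default
sys.setcheckinterval(1000)     # switch threads (release the GIL) less often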

Nowadays, the drawback of the GIL approach is obvious on multi-core
machines: by serializing the execution of bytecodes, starting multiple
threads does not actually let the interpreter make use of more than one core.

PyPy, the Python implementation in Python, takes the same approach so
far.

Existing usage

As we have seen, we have the following situation: the existing Python
language, as CPython implements it, offers very strong guarantees about
multi-threaded usage. It is important to emphasize that most existing
multi-threaded Python programs actually rely on such strong guarantees.
This can be seen for example in a program that takes a populated list
and, in several threads, does:

next_item = global_list.pop()

This implicitly relies on the fact that pop() will perform atomic
removal from the list. If two threads try to pop() from the same list
at the same time, then the two operations will occur in one order or the
other; but they will not e.g. return the same object to both threads or
mess up the internal state of the list object.
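
As a sketch of this common pattern (hypothetical code, but representative of
programs in the wild), several threads can drain one list without any
explicit lock:

import threading

global_list = list(range(1000))
results = []

def worker():
    while True:
        try:
            next_item = global_list.pop()   # relies on pop() being atomic
        except IndexError:
            break                           # the list is empty: we are done
        results.append(next_item)           # append() is atomic as well

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == list(range(1000))  # every item popped exactly once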

With such an example in mind, it should be clear that we do not want a
solution to the multi-core issue that involves dropping these strong
guarantees. It is ok however to lower the barrier, as Jython does; but
any Python implementation must offer some guarantees, or not offer
multi-threading at all. This includes the fact that a lot of methods on
built-in types are supposed to be atomic.

(It should be noted that not offering multi-threading at all is actually
also a (partial) solution to the problem. Recently, several "hacks"
have appeared that give a programmer more-or-less transparent access to
multiple independent processes (e.g. multiprocessing). While these provide appropriate
solutions in some context, they are not as widely applicable as
multi-threading. As a typical example, they fail to apply when the
multiple cores need to process information that cannot be serialized at
all --- being serializable is a requirement for any data exchange between
several processes.)

Here is an example of how Jython's consistency is weaker than CPython's GIL.
It takes uncommon examples to show it, and the fact that they do not work
the way a CPython programmer expects is generally considered an
implementation detail. Consider:
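
(reconstructed from the description below; each line runs in a different
thread)

set1.update(set2)
set2.update(set3)
set3.update(set1)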

Each operation is atomic in the case of CPython, but decomposed in two steps
(which can each be considered atomic) in the case of Jython: reading from the
argument, and then updating the target set. Suppose that initially
set1 = {1}, set2 = {2}, set3 = {3}. On CPython, independently of
the order in which the threads run, we will end up with at least one of the
sets being {1, 2, 3}. On Jython, it is possible that all
three sets end up as containing two items only. The example is a bit
far-fetched but should show that CPython's consistency is strictly stronger
than Jython's.

PyPy

PyPy is a Python interpreter much like CPython or Jython, but the way it
is produced is particular. It is an interpreter written in RPython, a
subset of Python, which gets turned into a complete virtual machine (as
generated C code) automatically by a step called the "translation". In
this context, the trade-offs are different from the ones in CPython and
in Jython: it is possible in PyPy, and even easy, to apply arbitrary
whole-program transformations to the interpreter at "translation-time".

With this in mind, it is possible to imagine a whole-program
transformation that would add locking on every object manipulated in
RPython by the interpreter. This would end up in a situation similar to
Jython. However, it would not automatically solve the issue of
deadlocks, which is avoided in the case of Jython by careful manual
placement of the locks. (In fact, being deadlock-free is a global
program property that cannot be automatically ensured or verified; any
change to Jython can in theory break this property, and thus introduce
subtle deadlocks. The same applies to non-atomicity.)

In fact, we can easily check that if the interpreter accesses (for
both reading and writing)
objects A and B in a bytecode of thread 1, and objects B and A (in the
opposite order) in a bytecode of thread 2 --- and moreover if you need to
have accessed the first object before you can decide that you will need
to access the second object --- then there is no way (apart from the GIL) to avoid
a deadlock while keeping the strong guarantee of atomicity. Indeed, if
both threads have progressed to the middle of the execution of their
bytecode, then A has already been mutated by thread 1 and similarly B
has already been mutated by thread 2. It is not possible to
successfully continue running the threads in that case.

Using Software Transactional Memory

Software Transactional Memory (STM) is an approach that gives a solution
to precisely the above problem. If a thread ended up in a situation
where continuing to run it would be wrong, then we can abort and
rollback. This is similar to the notion of transaction on databases.
In the above example, one or both threads would notice that they are
about to run into troubles and abort. This means more concretely that
they need to have a way to restart execution at the start of the
bytecode, with all the side-effects of what they did so far being either
cancelled or just not committed yet.

We think that this capacity to abort and rollback is the missing piece
of the puzzle of multi-threaded implementations of Python.
Actually, according to the presentation of the problem given
above, it is unavoidable that any solution that wants to offer the
same level of consistency and atomicity as CPython would involve
the capacity of aborting and rolling back --- which means precisely
that STM cannot be avoided.

Ok, but why not settle for Jython's
approach and put careful locks left and right throughout the interpreter?
Because (1) we would have to consider every operation's atomicity, make
decisions (or steal Jython's), and document them here;
(2) it would also be really a lot of work to optimize these locks, e.g. in the
JIT, as well as the JVM does; and (3) it is not the PyPy way to require manually
tweaking your code everywhere for a feature that should be orthogonal. Point
(3) is probably the most important here: you need to redo the work for every
language you implement in PyPy.
It also implies my own point (4): it is not fun :-)

In more detail, the process would work as follows. (This gives an
overview of one possible model; it is possible that a different model
will end up being better.) In every thread:

At the start of a bytecode, we start a "transaction". This means
setting up a thread-local data structure to record a log of what
occurs in the transaction.

We record in the log all objects that are read, as well as the
modifications that we would like to make.

During this time, we detect "read" inconsistencies, shown by the
object's "last-modified" timestamp being later than the start time
of the current transaction, and abort. This prevents the rest of
the code from running with inconsistent values.

If we reach the end of the bytecode without a "read" inconsistency,
then we atomically check for "write" inconsistencies. These are
inconsistencies which arise from concurrent updates to objects
in the other threads --- either our "write" objects, or our "read"
objects.

If no inconsistency is found, we "commit" the transaction by copying
the delayed writes from the log into main memory.
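
To make the model concrete, here is a purely illustrative sketch of such a
transaction log in Python; all names are made up, and a real implementation
would operate on low-level objects inside the interpreter, not Python-level
ones:

import threading

_commit_lock = threading.Lock()   # makes the check-and-commit step atomic
_global_time = 0                  # advanced at every successful commit

class Aborted(Exception):
    pass                          # the transaction must roll back and retry

class STMObject(object):
    def __init__(self, value):
        self.value = value
        self.last_modified = 0    # the "last-modified" timestamp

class Transaction(object):
    def __init__(self):
        self.start_time = _global_time
        self.read_set = set()     # every object read so far
        self.write_log = {}       # object -> delayed new value

    def read(self, obj):
        # detect "read" inconsistencies eagerly: the object was modified
        # after this transaction started, so our view is inconsistent
        if obj.last_modified > self.start_time:
            raise Aborted
        self.read_set.add(obj)
        return self.write_log.get(obj, obj.value)

    def write(self, obj, value):
        self.write_log[obj] = value    # invisible to others until commit

    def commit(self):
        global _global_time
        with _commit_lock:
            # atomically check for "write" inconsistencies on everything
            # we touched, then copy the delayed writes into main memory
            for obj in self.read_set | set(self.write_log):
                if obj.last_modified > self.start_time:
                    raise Aborted
            _global_time += 1
            for obj, value in self.write_log.items():
                obj.value = value
                obj.last_modified = _global_time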

The points at which a transaction starts or ends are exactly the
points at which, in CPython, the Global Interpreter Lock is
respectively acquired and released. If we ignore the fact that (purely for
performance) CPython acquires and releases the GIL only every N bytecodes,
then this means:

Before any bytecode we acquire the GIL (starting a transaction), and after
the bytecode we release it (ending the transaction); and

Before doing an external call to the C library or the OS we release the GIL
(ending the transaction) and afterwards re-acquire it (starting the next transaction).
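
In terms of the sketch above, the interpreter's main loop would then wrap
every bytecode roughly like this (execute_bytecode is hypothetical; it routes
all object accesses through the transaction):

def run_one_bytecode(frame):
    while True:
        t = Transaction()               # plays the role of acquiring the GIL
        try:
            execute_bytecode(frame, t)  # may raise Aborted on inconsistency
            t.commit()                  # plays the role of releasing the GIL
            return
        except Aborted:
            continue                    # nothing was committed; just retry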

So in particular this model is well suited to the STM condition that we cannot
do anything in a transaction that cannot be rolled back, like --- precisely ---
system calls. Indeed, by construction, these system calls occur outside a
transaction, because in CPython they occur with the GIL released.

Performance

A large number of implementation details are still open for now.
From a user's point of view (i.e. the programmer using Python),
the most relevant one is the overall performance impact. We
cannot give precise numbers so far, and we expect the initial
performance to be abysmally bad (maybe 10x slower); however, with
successive improvements to the locking mechanism, to the global
program transformation inserting the locks, to the garbage
collector (GC), and to the Just-in-Time (JIT) compiler, we
believe that it should be possible to get a roughly reasonable
performance (up to maybe 2x slower). For example, the GC can
maintain flags on the objects to know that they did not escape
their creation thread, and do not need any logging; and the JIT
compiler can aggregate several reads or writes to an object into
one. We believe that these are the kind of optimizations that
can give back a lot of the performance lost.

The state of STM

Transactional Memory is itself a relatively old idea, originating
from a 1986 paper by Tom Knight. At first based on hardware
support, the idea of software-only transactional memory (STM) was
popularized in 1995 and has recently been the focus of intense
research.

The approach outlined above --- using STM to form the core of the
implementation of a language --- is new, as far as we know. So
far, most implementations provide STM as a library feature. It
requires explicit usage, often in the form of explicitly
declaring which objects must be protected by STM (object-based
STMs). It is only recently that native STM support has started
to appear, notably in the Clojure language.

STM is described on Wikipedia as an approach that "greatly
simplifies conceptual understanding of multithreaded programs and
helps make programs more maintainable by working in harmony with
existing high-level abstractions such as objects and modules."
We actually think that these benefits are important enough to
warrant being exposed to the Python programmer as well, instead
of being used only internally. This would give the Python
programmer a very simple interface:

with atomic:
    <these operations are executed atomically>

(This is an old idea. Funny how back in 2003 people, including me, thought that this was a hack. Now I'm writing a blog post to say "it was not a hack; it's explicitly using locks that is a hack." I'm buying the idea of composability.)

Conclusion

Taking these ideas and applying them in the context of an
implementation of a complex high-level language like Python comes
with its own challenges. In this context, using PyPy makes sense
as both an experimentation platform and as a platform that is
recently gaining attention for its performance. The alternatives
are unattractive: doing it in CPython for example would mean
globally rewriting the interpreter. In PyPy instead, we write it
as a transformation that is applied systematically at translation-time.
Also, PyPy is a general platform for generating fast interpreters
for dynamic languages; the STM implementation in PyPy would work
out of the box for other language implementations as well, instead
of just for Python.

Update:

This is mostly me (Armin Rigo) ranting aloud and trying experiments;
this post should not be confused as meaning that the whole PyPy team
will now spend the next years working on it full-time.
As I said, it is orthogonal to the actual Python interpreter, and it is in
any case a feature that can be turned on or off during translation; I know
that in many or most use cases, people are more interested in getting a
fast PyPy rather than one which is twice as slow but scales well.

Nothing I said is really new. For proof, see
Riley and Zilles (2006)
as well as Tabba (2010), who both experimented with Hardware Transactional Memory, turning the CPython or PyPy interpreter's GIL into transaction starts/ends, as I describe here.


Thursday, August 18, 2011

We're pleased to announce the 1.6 release of PyPy. This release brings a lot
of bugfixes and performance improvements over 1.5, and improves support for
Windows 32bit and OS X 64bit. This version fully implements Python 2.7.1 and
has beta level support for loading CPython C extensions. You can download it
here:

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7.1. It's fast (pypy 1.6 and cpython 2.6.2 performance comparison)
due to its integrated tracing JIT compiler.

This release supports x86 machines running Linux 32/64 or Mac OS X. Windows 32
is beta (it roughly works but a lot of small issues have not been fixed so
far). Windows 64 is not yet supported.

The main topics of this release are speed and stability: on average on
our benchmark suite, PyPy 1.6 is between 20% and 30% faster than PyPy 1.5,
which was already much faster than CPython on our set of benchmarks.

The speed improvements have been made possible by optimizing many of the
layers which compose PyPy. In particular, we improved: the Garbage Collector,
the JIT warmup time, the optimizations performed by the JIT, the quality of
the generated machine code and the implementation of our Python interpreter.

Highlights

fast ctypes: calls to ctypes functions are now seen and optimized
by the JIT, and they are up to 60 times faster than in PyPy 1.5 and 10 times
faster than in CPython

improved generators(1): simple generators are now inlined into the caller's
loop, making performance up to 3.5 times faster than PyPy 1.5.

improved generators(2): thanks to other optimizations, even generators
that are not inlined are between 10% and 20% faster than PyPy 1.5.

faster warmup time for the JIT

JIT support for single floats (e.g., for array('f')); see the sketch after this list

optimized dictionaries: the internal representation of dictionaries is now
dynamically selected depending on the type of stored objects (for example,
dictionaries whose keys are all strings, or all integers), resulting in
faster code and a smaller memory footprint. Other dictionaries are also
smaller due to bugfixes.

JitViewer: this is the first official release which includes the JitViewer,
a web-based tool which helps you to see which parts of your Python code have
been compiled by the JIT, down to the assembler. The jitviewer 0.1 has
already been released and works well with PyPy 1.6.

The CPython extension module API has been improved and now supports many
more extensions. For information on which ones are supported, please refer to
our compatibility wiki.

Multibyte encoding support: this was one of the last areas in which we were
still behind CPython, but now we fully support them.

Preliminary support for NumPy: this release includes a preview of a very
fast NumPy module integrated with the PyPy JIT. Unfortunately, this does
not mean that you can expect to take an existing NumPy program and run it on
PyPy, because the module is still unfinished and supports only some of the
numpy API. However, barring some details, what works should be
blazingly fast :-)

Bugfixes: since the 1.5 release we fixed 53 bugs in our bug tracker, not
counting the numerous bugs that were found and reported through other
channels than the bug tracker.
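
Here is a small, hedged illustration of the single-float point from the list
above (the speed-up claim is the release's; this toy loop just shows the kind
of code it concerns):

from array import array

def total(values):
    s = 0.0
    for x in values:        # the JIT can now handle these single floats
        s += x
    return s

data = array('f', [0.5] * 1000)   # 'f' stores single-precision C floats
print total(data)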


Friday, August 12, 2011

Hello.

We're proud to announce the first public release of the jitviewer. As of now,
jitviewer is still a somewhat internal tool that helps you understand how your
Python source code is compiled by PyPy's JIT, all the way down to machine
code.

To install it, you need a very recent version of PyPy
(newer than 9th of August), for example one of the nightly builds:

pip install jitviewer

Note that you need to run the pip executable which belongs to PyPy, not the
globally installed one.

Have a look at the README for how to start it, or try the online demo if
you just want to play with it.

The jitviewer is a web application written with flask and jinja2. If
you have experience with web development and you want to help PyPy, don't
hesitate to contact us, there are plenty of things to improve in it :-).

What does the jitviewer really do?

At the top of the page, you will see the list of pieces of code which have
been compiled by the JIT. You will see entries for both normal loops and for
"entry bridges". This is not the right place to discuss the difference
between those, but you most probably want to look at loops, because usually
it's where most of the time is spent.

Note that for each loop, you will see the name of the function which contains
the first instruction of the loop. However, thanks to the inlining done
by the JIT, it will also contain the code of other functions.

Once you select a loop, the jitviewer shows how the JIT has compiled the
Python source code into assembler in a hierarchical way. It displays four
levels:

Python source code: only the lines shown in azure have been compiled for
this particular loop; the ones in gray have not.

Python bytecode, the one you would get by doing:

def f(a, b):
    return a + b

import dis
dis.dis(f)
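# which on CPython 2.7 prints output along these lines:
#   2           0 LOAD_FAST                0 (a)
#               3 LOAD_FAST                1 (b)
#               6 BINARY_ADD
#               7 RETURN_VALUE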

The opcodes are e.g. LOAD_FAST, LOAD_GLOBAL etc. The opcodes
which are not in bold have been completely optimized away by the JIT.

Intermediate representation of JIT code (IR). This is a combination of
operations (like integer addition, reading fields out of structures) and
guards (which check that the assumptions we made are actually true). Guards
are in red. These operations are "at the same level as C": so, for example,
+ takes two unboxed integers which can be stored in CPU registers.

Assembler: you can see it by clicking on "Show assembler" in the menu on the
right.

Sometimes you'll find that a guard fails often enough that a new piece of
assembler needs to be compiled. This is an alternative path through the
code, called a bridge. You can see bridges in the jitviewer when
there is a link next to a guard. For more information about their purpose,
look up the JIT documentation.

I'm still confused

The jitviewer is not perfect when it comes to explaining what's going on. Feel
free to pop up on IRC or send a mail to the mailing list; we'll try to explain
and/or improve the situation. Consult the contact page for details.

Cheers,
fijal & antocuni


Tuesday, August 2, 2011

String formatting is probably something you do just about every day in Python,
and never think about. It's so easy, just "%d %d" % (i, i) and you're
done. No thinking about how to size your result buffer, whether your output
has an appropriate NULL byte at the end, or any other details. A C
equivalent might be:

char x[44];
sprintf(x, "%d %d", i, i);

Note that we had to stop for a second and consider how big the numbers might
get, and overestimate the size: 44 = (20 digits for the biggest 64-bit number
+ 1 for the sign) * 2 + 1 for the space + 1 for the NUL byte. It took the
authors of this post, fijal and alex, 3 tries to get the math right on this :-)

This is fine, except that you can't even return x from this function; a
fairer comparison might be:

char *x = malloc(44 * sizeof(char));
sprintf(x, "%d %d", i, i);

x is slightly overallocated in some situations, but that's fine.

But we're not here to just discuss the implementation of string
formatting, we're here to discuss how blazing fast PyPy is at it, with
the new unroll-if-alt branch. Given the Python code:
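
(reconstructed; the original snippet is not included here, but a loop along
these lines, with the iteration count being a guess, matches the discussion
below)

def main():
    for i in xrange(10000000):
        "%d %d" % (i, i)

main()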

We ran this under PyPy, at the head of the unroll-if-alt branch, and also
compiled the stack-based C version with GCC 4.5.2 at -O4 (other optimization
levels were tested; this one produced the best performance). It took 0.85
seconds to execute under PyPy, and 1.63 seconds for the compiled binary. We
think this demonstrates the incredible potential of dynamic compilation: GCC
is unable to inline or unroll the sprintf call, because it sits inside libc.

The malloc version, which as discussed above is more comparable to the
Python, gives a result of 1.96 seconds.

Summary of performance:

Platform                  Time      relative to C

GCC (stack)               1.63s     1x
GCC (malloc)              1.96s     0.83x
CPython                   10.2s     0.16x
PyPy (unroll-if-alt)      0.85s     1.9x

Overall PyPy is almost 2x faster. This is clearly a win for dynamic
compilation over static compilation: the sprintf function lives in libc and
so cannot be specialized for the constant format string, which has to be
parsed every time it's executed. In the case of PyPy, we specialize the
assembler if we detect the left-hand string of the modulo operator to be
constant.

Cheers,
alex & fijal
