Note that many of these fixes are for our new beta version of PyPy3.5 on Windows. There may be more Unicode problems in the Windows beta version, especially concerning directory and file names with non-ASCII characters.

On macOS, we recommend you wait for the Homebrew package to prevent issues with third-party packages. For other supported platforms, our downloads are available now.
Thanks to those who reported the issues.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7 and CPython 3.5. It’s fast (PyPy and CPython 2.7.x performance comparison)
due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython
can do for them.
This PyPy 3.5 release supports:

Monday, December 25, 2017

The PyPy team is proud to release both PyPy2.7 v5.10 (an interpreter supporting
Python 2.7 syntax), and a final PyPy3.5 v5.10 (an interpreter for Python
3.5 syntax). The two releases are both based on much the same codebase, thus
the dual release.

This release is an incremental release with very few new features, the main feature being the final PyPy3.5 release that works on Linux and OS X, with beta Windows support. It also includes fixes for vmprof cooperation with greenlets.

Compared to 5.9, the 5.10 release contains mostly bugfixes and small improvements. We have big new features in the pipeline for PyPy 6.0 that did not make the release cut and should be available within the next couple of months.

As always, this release is 100% compatible with the previous one and fixes several issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

There are quite a few important changes in the pipeline that did not make it into the 5.10 release. Most important are speed improvements to cpyext (which will make numpy and pandas a bit faster) and the utf8 branch, which changes the internal representation of unicode to UTF-8 and should especially help the Python 3.5 version of PyPy.

This release concludes the Mozilla Open Source grant for having a compatible PyPy 3.5 release, and we're very grateful for that. Of course, we will continue to improve PyPy 3.5 and probably move to 3.6 during the course of 2018.

We would like to thank our donors for the continued support of the PyPy
project.

We would also like to thank our contributors and
encourage new people to join the project. PyPy has many
layers and we need help with all of them: PyPy and RPython documentation
improvements, tweaking popular modules to run on PyPy, or general help
with making RPython's JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for
CPython 2.7 and CPython 3.5. It's fast (PyPy and CPython 2.7.x performance comparison)
due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython
can do for them.

Monday, October 30, 2017

I often hear people who are happy because PyPy makes their code 2 times faster
or so. Here is a short personal story which shows PyPy can go well beyond
that.

DISCLAIMER: this is not a silver bullet or a general recipe: it worked in this particular case, and it might not work so well in other cases. But I think it is still an interesting technique. Moreover, the various steps and implementations are shown in the same order as I tried them during development, so it is a real-life example of how to proceed when optimizing for PyPy.

Some months ago I played a bit with evolutionary algorithms: the ambitious plan was to automatically evolve logic which could control a (simulated) quadcopter, i.e. a PID controller (spoiler: it doesn't fly).

The idea is to have an initial population of random creatures: at each
generation, the ones with the best fitness survive and reproduce with small,
random variations.

However, for the scope of this post, the actual task at hand is not so important, so let's jump straight to the code. To drive the quadcopter, a Creature has a run_step method which runs at each delta_t (the full code is linked in the original post).
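The snippet itself did not survive extraction, so here is a minimal sketch of the method reconstructed from the description below; the exact field layout, shapes and names are assumptions.

    import numpy as np

    class Creature(object):
        # A sketch only: the real class also carries the fitness and
        # mutation machinery, which is not relevant here.
        def __init__(self, matrix, constant):
            self.matrix = matrix        # shape (3, 2), as discussed below
            self.constant = constant    # shape (2,)
            self.state = np.zeros(1)    # one internal state variable (assumed)

        def run_step(self, inputs):
            # vector layout (assumed): [internal state, z_setpoint, z]
            vec = np.concatenate([self.state, inputs])
            out_values = np.dot(vec, self.matrix) + self.constant
            self.state = out_values[:1]   # carried over to the next step
            return out_values[1:]         # the thrust for the motors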

inputs is a numpy array containing the desired setpoint and the current
position on the Z axis;

outputs is a numpy array containing the thrust to give to the motors. To start easy, all 4 motors are constrained to have the same thrust, so that the quadcopter only travels up and down the Z axis;

self.state contains arbitrary values of unknown size which are passed from
one step to the next;

self.matrix and self.constant contain the actual logic. By putting the "right" values there, in theory we could get a perfectly tuned PID controller. These are randomly mutated between generations.

run_step is called at 100 Hz (in the virtual time frame of the simulation). At each generation, we test 500 creatures for a total of 12 virtual seconds each. So, we have a total of 600,000 executions of run_step per generation (500 creatures × 12 seconds × 100 calls per second).

At first, I simply tried to run this code on CPython, and then on PyPy. Ouch! With numpy, PyPy is ~5.5x slower than CPython. This was kind of expected: numpy is based on cpyext, which is infamously slow. (Actually, we are working on that, and on the cpyext-avoid-roundtrip branch we are already faster than CPython, but this will be the subject of another blog post.)

So, let's try to avoid cpyext. The first obvious step is to use numpypy instead of numpy (actually, there is a hack to use just the micronumpy part). Let's see if the speed improves.

The result is ~2.7 seconds on average per generation: this is 12x faster than PyPy+numpy, and more than 2x faster than the original CPython. At this point, most people would be happy and go tweeting about how great PyPy is.

In general, when talking of CPython vs PyPy, I am rarely satisfied with a 2x
speedup: I know that PyPy can do much better than this, especially if you
write code which is specifically optimized for the JIT. For a real-life
example, have a look at capnpy benchmarks, in which the PyPy version is
~15x faster than the heavily optimized CPython+Cython version (both have been
written by me, and I tried hard to write the fastest code for both
implementations).

So, let's try to do better. As usual, the first thing to do is to profile and
see where we spend most of the time. Here is the vmprof profile. We spend a
lot of time inside the internals of numpypy, and allocating tons of temporary
arrays to store the results of the various operations.

Also, let's look at the JIT traces and search for the function run: this is the loop in which we spend most of the time, and it is composed of 1796 operations. The operations emitted for the line np.dot(...) + self.constant are listed between lines 1217 and 1456. In the excerpt which calls np.dot(...), most of the ops are cheap, but at line 1232 we see a call to the RPython function descr_dot; by looking at the implementation we see that it creates a new W_NDimArray to store the result, which means it has to do a malloc().

The implementation of the + self.constant part is also interesting: contrary to the former, the call to W_NDimArray.descr_add has been inlined by the JIT, so we have a better picture of what's happening. In particular, we can see the call to __0_alloc_with_del____ which allocates the W_NDimArray for the result, and the raw_malloc which allocates the actual array. Then we have a long list of 149 simple operations which set the fields of the resulting array, construct an iterator, and finally do a call_assembler: this is the actual logic to do the addition, which was JITted independently; call_assembler is one of the operations used to do JIT-to-JIT calls.

All of this is very suboptimal: in this particular case, we know that the
shape of self.matrix is always (3, 2): so, we are doing an incredible
amount of work, including calling malloc() twice for the temporary arrays, just to
call two functions which ultimately do a total of 6 multiplications
and 6 additions. Note also that this is not a fault of the JIT: CPython+numpy
has to do the same amount of work, just hidden inside C calls.

One possible solution to this nonsense is a well known compiler optimization:
loop unrolling. From the compiler point of view, unrolling the loop is always
risky, because if the matrix is too big you might end up emitting a huge blob of code, possibly useless if the shape of the matrices changes frequently: this is the main reason why the PyPy JIT does not even try to do it in this case.

However, we know that the matrix is small, and always of the same shape. So, let's unroll the loop manually.
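The unrolled version from the post is not reproduced here; the sketch below shows the shape of the transformation, with variable names following the earlier sketch rather than the original source:

    class Creature(object):
        def run_step(self, inputs):
            # unpack everything into locals: no numpy, no temporary arrays
            k0, k1 = self.matrix[0]
            k2, k3 = self.matrix[1]
            k4, k5 = self.matrix[2]
            c0, c1 = self.constant
            s = self.state[0]
            z_sp, z = inputs
            # exactly 6 multiplications and 6 additions on plain floats
            new_state = s*k0 + z_sp*k2 + z*k4 + c0
            thrust    = s*k1 + z_sp*k3 + z*k5 + c1
            self.state[0] = new_state
            return [thrust]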

Yes, it's not an error: after a couple of generations, the time stabilizes at around ~0.07-0.08 seconds per generation. This is around 80 (eighty) times faster than the original CPython+numpy implementation, and around 35-40x faster than the naive PyPy+numpypy one.

Let's look at the trace again: it no longer contains expensive calls, and certainly no more temporary malloc()s. The core of the logic is between lines 386-416, where we can see that it does fast C-level multiplications and additions: float_mul and float_add are translated straight into mulsd and addsd x86 instructions.

As I said before, this is a very particular example, and the techniques described here do not always apply: it is not realistic to expect an 80x speedup on arbitrary code, unfortunately. However, it clearly shows the potential of PyPy when it comes to high-speed computing. And most importantly, it's not a toy benchmark designed specifically to have good performance on PyPy: it's a real-world example, albeit a small one.

You might also be interested in the talk I gave at the last EuroPython, in which I cover a similar topic: "The Joy of PyPy JIT: abstractions for free" (abstract, slides and video).

Wednesday, October 18, 2017

(Cape of) Good Hope for PyPy

Hello from the other side of the world (for most of you)!

With the excuse of coming to PyCon ZA, during the last two weeks Armin, Ronan, Antonio and sometimes Maciek had a very nice and productive sprint in Cape Town, as the pictures show :). We would like to say a big thank you to Kiwi.com, which sponsored part of the travel costs via its awesome Sourcelift program to help Open Source projects.

Armin, Anto and Ronan at Cape Point

Armin, Ronan and Anto spent most of the time hacking at cpyext, our CPython C-API compatibility layer: during the last years, the focus was to make it work and be compatible with CPython, in order to run existing libraries such as numpy and pandas. However, we never paid much attention to performance, so the net result is that with the latest released version of PyPy, C extensions generally work, but their speed ranges from "slow" to "horribly slow".

For example, some very simple microbenchmarks measure the speed of calling (empty) C functions, i.e. the time you spend to "cross the border" between RPython and C. (Note: this includes the time spent doing the loop in regular Python code.) We compared CPython, PyPy 5.8, and our newest in-progress version.
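The timing tables are in the original post; a minimal sketch of this kind of microbenchmark is shown below. The _testmod module and its empty function are illustrative stand-ins for the purpose-built C extension the real benchmarks use.

    import time

    def bench(f, n=10000000):
        # measure the cost of n calls that cross the Python/C border;
        # note that the loop itself runs as regular Python code
        start = time.time()
        for _ in range(n):
            f()
        return time.time() - start

    import _testmod  # hypothetical C extension exposing an empty function
    print('%.2f seconds' % bench(_testmod.empty))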

So yes: before the sprint, we were ~2-6x slower than CPython. Now, we are
faster than it!
To reach this result, we did various improvements, such as:

teach the JIT how to look (a bit) inside the cpyext module;

write specialized code for calling METH_NOARGS, METH_O and
METH_VARARGS functions; previously, we always used a very general and
slow logic;

implement freelists to allocate the cpyext versions of int and
tuple objects, as CPython does;

the cpyext-avoid-roundtrip branch: crossing the RPython/C border is slowish, but the real problem was (and still is, in many cases) that we often cross it many times for no good reason. So, depending on the actual API call, you might end up in C land, which calls back into RPython land, which goes to C, etc. etc. (ad libitum).

The branch tries to fix such nonsense: so far, we fixed only some cases, which are enough to speed up the benchmarks shown above. But most importantly, we now have a clear path and an actual plan to improve cpyext more and more. Ideally, we would like to reach a point at which cpyext-intensive programs run, at worst, at the same speed as CPython.

The other big topic of the sprint was Armin and Maciej doing a lot of work on the
unicode-utf8 branch: the goal of the branch is to always use UTF-8 as the
internal representation of unicode strings. The advantages are various:

decoding a UTF-8 stream is super fast, as you just need to check that the
stream is valid;

encoding to UTF-8 is almost a no-op;

UTF-8 is always a more compact representation than the currently used UCS-4. It's also almost always more compact than CPython 3.5's latin1/UCS2/UCS4 combo;

Before you ask: yes, this branch contains special logic to ensure that random
access of single unicode chars is still O(1), as it is on both CPython and the
current PyPy.
We also plan to improve the speed of decoding even more by using modern processor features, like SSE and AVX. Preliminary results show that decoding can be done 100x faster than the current setup.
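How can single-character access stay O(1) on top of a variable-width encoding? The branch's actual data structures are more elaborate, but a toy sketch of the general idea (store a checkpoint every K codepoints, then scan at most K-1 characters; Python 3 syntax) looks like this:

    K = 64  # checkpoint interval; an arbitrary choice for this sketch

    class Utf8String(object):
        def __init__(self, u):
            self._bytes = u.encode('utf-8')
            self._len = len(u)
            self._checkpoints = []  # byte offset of codepoints 0, K, 2K, ...
            offset = 0
            for i, ch in enumerate(u):
                if i % K == 0:
                    self._checkpoints.append(offset)
                offset += len(ch.encode('utf-8'))

        def __len__(self):
            return self._len

        def __getitem__(self, i):
            # jump to the nearest checkpoint, then walk at most K-1
            # characters: the cost is bounded by the constant K
            offset = self._checkpoints[i // K]
            for _ in range(i % K):
                offset += _width(self._bytes[offset])
            width = _width(self._bytes[offset])
            return self._bytes[offset:offset + width].decode('utf-8')

    def _width(first_byte):
        # length in bytes of a UTF-8 character, given its first byte
        if first_byte < 0x80:
            return 1
        if first_byte < 0xE0:
            return 2
        if first_byte < 0xF0:
            return 3
        return 4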

In summary, this was a long and profitable sprint, in which we achieved lots
of interesting results. However, what we liked even more was the privilege of
doing commits from awesome places such as the top of Table Mountain:

NumPy and Pandas now work on PyPy2.7 (together with Cython 0.27.1). Many other modules
based on C-API extensions work on PyPy as well.

Cython 0.27.1 (released very recently) supports more projects with PyPy, on both PyPy2.7 and PyPy3.5 beta. Note that version 0.27.1 is now the minimum version that supports this version of PyPy, due to some interactions with updated C-API interface code.

We optimized the JSON parser for recurring string keys, which should decrease
memory use by up to 50% and increase parsing speed by up to 15% for large JSON files
with many repeating dictionary keys (which is quite common).
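The core idea of that optimization can be sketched in a few lines: cache key strings as they are parsed, so that repeated keys share a single string object instead of each occurrence being allocated anew (the class below is illustrative, not PyPy's actual decoder):

    class KeyCache(object):
        def __init__(self):
            self._cache = {}

        def get(self, raw_key):
            # return the previously seen, equal key object if there is
            # one; repeated keys like "name" or "id" then share storage
            return self._cache.setdefault(raw_key, raw_key)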

CFFI, which is part of the PyPy release, has been updated to 1.11.1,
improving an already great package for interfacing with C. CFFI now supports
complex arguments in API mode, as well as char16_t and char32_t and has
improved support for callbacks.
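As a quick illustration of the new complex support, here is a hedged sketch of CFFI's API mode wrapping a C function that takes and returns a complex number (the module and function names are made up):

    # build_example.py: run once to compile the extension
    from cffi import FFI

    ffibuilder = FFI()
    ffibuilder.cdef("double _Complex my_conj(double _Complex);")
    ffibuilder.set_source("_complex_example", """
        #include <complex.h>
        static double _Complex my_conj(double _Complex x) {
            return conj(x);
        }
    """)

    if __name__ == '__main__':
        ffibuilder.compile(verbose=True)

After compiling, from _complex_example import lib gives access to lib.my_conj, which accepts and returns Python complex numbers.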

Issues in the C-API compatibility layer that appeared as excessive memory use were cleared up, and other incompatibilities were resolved. The C-API compatibility layer does slow down code which crosses the Python-C interface often. Fixes for some of the performance issues are in the pipeline, and we still recommend using pure Python on PyPy or interfacing via CFFI.

Please let us know if your use case is slow; we have ideas for how to make things faster but need real-world examples (not micro-benchmarks) of problematic code.

Work sponsored by a Mozilla grant continues on PyPy3.5; we continue on the path to the goal of a complete Python 3.5 implementation. Of course, the bug fixes and performance enhancements mentioned above are part of both PyPy2.7 and PyPy3.5 beta.

As always, this release fixed many other issues and bugs raised by the
growing community of PyPy users. We strongly recommend updating.

You can download the v5.9 releases here (note that we provide PyPy3.5 binaries for only Linux 64bit for now):

We would like to thank our donors and contributors, and
encourage new people to join the project. PyPy has many
layers and we need help with all of them: PyPy and RPython documentation
improvements, tweaking popular modules to run on PyPy, or general help
with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 (stdlib version 2.7.13), and CPython 3.5 (stdlib version 3.5.3). It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.

We also welcome developers of other dynamic languages to see what RPython can do for them.

Monday, August 14, 2017

The Python community has been discussing removing the Global Interpreter Lock for a long time. There have been various attempts at removing it: Jython and IronPython successfully removed it with the help of the underlying platform, while other attempts have yet to bear fruit, like gilectomy. Since our February sprint in Leysin, we have experimented with the topic of GIL removal in the PyPy project. We believe that the work done in IronPython or Jython can be reproduced with only a bit more effort in PyPy. Compared to that, removing the GIL in CPython is a much harder topic, since it also requires tackling the problem of multi-threaded reference counting. See the section below for further details.

As we announced at EuroPython, what we have so far is a GIL-less PyPy which can run very simple multi-threaded, nicely parallelized programs. At the moment, more complicated programs probably segfault. The remaining 90% (and another 90%) of the work is putting locks in strategic places so PyPy does not segfault during concurrent accesses to data structures.

Since such work would complicate the PyPy code base and our day-to-day work,
we would like to judge the interest of the community and the commercial
partners to make it happen (we are not looking for individual
donations at this point). We estimate a total cost of $50k, of which we already have backing for about 1/3 (with a possible 1/3 extra from the STM money, see below). This would give us a good shot at delivering a proof-of-concept PyPy with no GIL. If we can get a $100k contract, we will deliver a fully working PyPy interpreter with no GIL as a release, possibly separate from the default PyPy release.

People asked several questions, so I'll try to answer the technical parts
here.

What would the plan entail?

We've already done the work on the Garbage Collector to allow multi-threaded programs in RPython. "All" that is left is adding locks on mutable data structures everywhere in the PyPy codebase. Since it would significantly complicate our workflow, we require real interest in that topic, backed up by commercial contracts, in order to justify the added maintenance burden.

Why did the STM effort not work out?

STM was a research project that proved that the idea is possible. However,
the amount of user effort that is required to make programs run in a
parallelizable way is significant, and we never managed to develop tools
that would help in doing so. At the moment we're not sure if more work
spent on tooling would improve the situation or if the whole idea is really doomed.
The approach also ended up adding significant overhead on single threaded programs,
so in the end it is very easy to make your programs slower. (We have some money
left in the donation pot for STM which we are not using; according to the rules, we
could declare the STM attempt failed and channel that money towards the present
GIL removal proposal.)

Wouldn't subinterpreters be a better idea?

Python is a very mutable language - there is a ton of mutable state, and basic objects (classes, functions, ...) that are compile-time in other languages are runtime and fully mutable in Python. In the end, sharing things between subinterpreters would be restricted to basic immutable data structures, which defeats the point. Subinterpreters suffer from the same problems as multiprocessing, with no additional benefits. We believe that reducing mutability to implement subinterpreters is not viable without seriously impacting the semantics of the language (a conclusion which applies to many other approaches too).

Why is it easier to do in PyPy than CPython?

Removing the GIL in CPython has two problems:

how to guard access to mutable data structures with locks, and

what to do with reference counting that needs to be guarded.

PyPy only has the former problem; the latter doesn't exist,
due to a different garbage collector approach. Of course the first problem
is a mess too, but at least we are already half-way there. Compared to Jython
or IronPython, PyPy lacks some data structures that are provided by JVM or .NET,
which we would need to implement, hence the problem is a little harder
than on an existing multithreaded platform. However, there is good research
and we know how that problem can be solved.
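To make the reference-counting problem concrete, here is a sketch in Python-as-pseudocode of why naive refcounting breaks once the GIL is gone (the field name is illustrative):

    def incref(obj):
        # this is three steps, not one: load, add, store. Without the
        # GIL (or atomic instructions, or a lock), two threads can both
        # load the same value and one increment is lost -- the count
        # can later reach zero while the object is still in use.
        obj.refcount = obj.refcount + 1

PyPy's garbage collector traces live objects instead of counting references, so this entire class of problem simply does not exist there.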

Best regards,
Maciej Fijalkowski

Wednesday, July 26, 2017

This is a short blog post, just to announce the existence of this GitHub repository, which contains binary PyPy wheels for some selected packages. The availability of binary wheels means that you can install the packages much more quickly, without having to wait for compilation.

At the moment of writing, these packages are available:

numpy

scipy

pandas

psutil

netifaces

For now, we provide only wheels built on Ubuntu, compiled for PyPy 5.8. In particular, it is worth noting that they are not manylinux1 wheels, which means they might not work on other Linux distributions. For more information, see the explanation in the README of the above repo.

Moreover, the existence of the wheels does not guarantee that they work correctly 100% of the time. They still depend on cpyext, our C-API emulation layer, which is still work in progress, although it has become better and better during the last months. Again, the wheels are there only to save compilation time.

To install a package from the wheel repository, you invoke pip with the repository as an extra index.
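The exact index URL is given in the repo's README; the command has this shape (the placeholder is to be substituted):

    $ pip install --extra-index-url <wheel-repo-index-url> numpy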

Friday, June 9, 2017

The PyPy team is proud to release both PyPy2.7 v5.8 (an interpreter supporting
Python 2.7 syntax), and a beta-quality PyPy3.5 v5.8 (an interpreter for Python
3.5 syntax). The two releases are both based on much the same codebase, thus
the dual release. Note that PyPy3.5 supports Linux 64bit only for now.

This new PyPy2.7 release includes the upstream stdlib version 2.7.13, and
PyPy3.5 includes the upstream stdlib version 3.5.3.

We fixed critical bugs in the shadowstack rootfinder garbage collector
strategy that crashed multithreaded programs and very rarely showed up
even in single threaded programs.

We added native PyPy support to profile frames in the vmprof statistical
profiler.

The struct module functions pack* and unpack* are now much faster,
especially on raw buffers and bytearrays. Microbenchmarks show a 2x to 10x
speedup. Thanks to Gambit Research for sponsoring this work.
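For instance, code following this pattern (a plain stdlib example, not taken from the actual benchmarks) is the kind that benefits:

    import struct

    buf = bytearray(12)
    # pack directly into a preallocated bytearray, no intermediate bytes
    struct.pack_into('<iif', buf, 0, 1, 2, 3.0)
    # ... and unpack straight out of the raw buffer
    a, b, c = struct.unpack_from('<iif', buf, 0)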

Please let us know if your use case is slow; we have ideas for how to make things faster but need real-world examples (not micro-benchmarks) of problematic code.

Work sponsored by a Mozilla grant continues on PyPy3.5; numerous fixes from
CPython were ported to PyPy and PEP 489 was fully implemented. Of course the
bug fixes and performance enhancements mentioned above are part of both PyPy
2.7 and PyPy 3.5.

CFFI, which is part of the PyPy release, has been updated to an unreleased 1.10.1,
improving an already great package for interfacing with C.

Anyone using NumPy 1.13.0 must upgrade PyPy to this release, since we implemented some previously missing C-API functionality. Many other c-extension modules now work with PyPy; let us know if yours does not.

As always, this release fixed many issues and bugs raised by the
growing community of PyPy users. We strongly recommend updating.

We would like to thank our donors and contributors, and
encourage new people to join the project. PyPy has many
layers and we need help with all of them: PyPy and RPython documentation
improvements, tweaking popular modules to run on PyPy, or general help
with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and CPython 3.5. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
The PyPy 2.7 release supports:

What else is new?

There are many incremental improvements to RPython and PyPy; the complete listing is here.

Please update, and continue to help us make PyPy better.

Cheers, The PyPy team

Saturday, April 1, 2017

We are happy to announce a new release of the PyPI package vmprof. It is now able to capture native stack frames on Linux and Mac OS X to show you bottlenecks in compiled code (such as CFFI modules, Cython or C extensions). It supports PyPy and CPython versions 2.7, 3.4, 3.5 and 3.6. Special thanks to JetBrains for funding the native profiling support.

What is vmprof?

If you have already worked with vmprof you can skip the next two sections. If not, here is a short introduction:

The goal of the vmprof package is to give you more insight into your program. It is a statistical profiler. Another prominent profiler you might already have worked with is cProfile, which is bundled with the Python standard library.

vmprof's distinct feature (compared to most other profilers) is that it does not significantly slow down your program execution. The employed strategy is statistical rather than deterministic: not every function call is intercepted; instead, stack traces and memory usage are sampled at a configured rate (usually around 100 Hz). You can imagine that this creates a lot less contention than doing work before and after each function call.
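To illustrate the idea, here is a toy sampler (not vmprof's actual implementation, which does its sampling in C): on Unix, a timer signal can drive the stack captures.

    import signal
    import collections

    counts = collections.Counter()

    def _sample(signum, frame):
        # walk the Python frame stack and record one sample per tick
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        counts[tuple(stack)] += 1

    # fire roughly 100 times per second of CPU time (Unix only)
    signal.signal(signal.SIGPROF, _sample)
    signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)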

As mentioned earlier, cProfile gives you a complete profile, but it needs to intercept every function call (it is a deterministic profiler). Usually this means capturing and recording every function call, which takes a significant amount of time. The overhead vmprof adds is roughly 3-4% of your total program runtime, or even less if you reduce the sampling frequency. Indeed, it lets you sample and inspect much larger programs. If you have failed to profile a large application with cProfile, please give vmprof a shot.

vmprof.com or PyCharm

There are two major alternatives to the command-line tools shipped with vmprof:

While the command line tool is only good for quick inspections, vmprof.com and PyCharm complement each other, providing deeper insight into your program. With PyCharm you can view the per-line profiling results inside the editor. With vmprof.com you get a handy visualization of the profiling results as a flame chart and memory usage graph.

Since the PyPy Team runs and maintains the service on vmprof.com (which is by the way free and open-source), I’ll explain some more details here. On vmprof.com you can inspect the generated profile interactively instead of looking at console output. What is sent to vmprof.com? You can find details here.

Flamegraph: Accumulates and displays the most frequent codepaths. It allows you to quickly and accurately identify hot spots in your code. The flame graph below is a very short run of richards.py (Thus it shows a lot of time spent in PyPy's JIT compiler).

List all functions (optionally sorted): the equivalent of the vmprof command-line output, on the web.

Memory curve: a line plot that shows how many MBytes have been consumed over the lifetime of your program (see more info in the section below).

Native programs

The new feature introduced in vmprof 0.4.x allows you to look beyond the Python level. As you might know, Python maintains a stack of frames to save the execution state. Up to now, vmprof profiles only contained that level of information. But what if your program jumps to native code (such as calling gzip compression on a large file)? Up to now you would not see that information. Many packages make use of the CPython C API (which we discourage; please look up CFFI for a better way to call C). Have you ever known that your performance problem reaches down into native code, but been unable to profile it properly? Now you can! Let's inspect a very simple Python program to find out why it is significantly slower on Linux than on Mac:

Take two NxN random matrix objects and create a dot product. The first argument to the dot product provides the absolute value of the random matrix.
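The original script is not reproduced here; a minimal sketch matching that description (with n as in the table below) would be:

    import numpy

    n = 1000  # 5000 in run [1] below, 1000 in run [2]
    a = numpy.random.random((n, n))
    b = numpy.random.random((n, n))
    # the first argument to dot is the absolute value of the random matrix
    c = numpy.dot(numpy.abs(a), b)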

Run   Python          NumPy          OS                         n=...    Took
[1]   CPython 3.5.2   NumPy 1.12.1   Mac OS X, 10.12.3          n=5000   ~9 sec
[2]   CPython 3.6.0   NumPy 1.12.1   Linux 64, Kernel 4.9.14    n=1000   ~26 sec

Note that the Linux machine operates on a matrix 5 times smaller, yet it takes much longer. What is wrong? Is Linux slow? CPython 3.6.0? Well, no. Let's inspect [1] and [2] (shown below in that order).

[2], running on Linux, spends nearly all of the time in PyArray_MatrixProduct2; if you compare to [1] on Mac OS X, you'll see that a lot of time is spent generating the random numbers and the rest in cblas_matrixproduct.

BLAS has a very efficient implementation, so you can achieve the same on Linux if you install a BLAS library (such as OpenBLAS).

Usually you can spot the source locations that consume a lot of time; they are the first starting point for resolving performance issues.

Beyond Python programs

It is not unthinkable that the strategy can be reused for native programs. Indeed, this can already be done by creating a small CFFI wrapper around an entry point of a compiled C program. It would even work for programs compiled from other languages (e.g. C++ or Fortran). The resulting function names are the full symbol names embedded in either the executable's symbol table or extracted from the DWARF debugging information. Most of those will be compiler specific and contain some cryptic information.

Memory profiling

We thankfully received a code contribution from the company Blue Yonder. They have built a memory profiler (for Linux and Mac OS X) on top of vmprof.com that displays the memory consumption for the runtime of your process.

You can run it the following way:

$ python -m vmprof --mem --web script.py

By adding --mem, vmprof will capture memory information and display it in the dedicated view on vmprof.com. You can reach that view by clicking the 'Memory' switch in the flamegraph view.

vmprof has not reached the end of development. There are many features we could implement. But there is one feature that could be a great asset to many Python developers.

Continuous delivery of your statistical profile, or in short, profile streaming. One of the great strengths of vmprof is that it adds very little overhead, so it is not a crazy idea to run it in production.

It would require a smart way to stream the profile in the background to vmprof.com, and new visualizations to look at the much larger amount of data your Python service would produce.

If that sounds like a solid vmprof improvement, don't hesitate to get in touch with us (e.g. IRC #pypy, the pypy-dev mailing list, or comment below).

You can help!

There are some immediate things other people could help with, either by donating time or money (yes, we have occasional contributors, which is great)!

We gladly received a code contribution for the memory profiler, but there was not enough time to finish the migration completely. Sadly, it is a bit brittle right now.

We would like to spend more time on other visualizations. This should include a much better user experience on vmprof.com (like a tutorial that explains the visualizations we already have).

We are happy to announce a new release for the PyPI package vmprof.
It is now able to capture native stack frames on Linux and Mac OS X to show you bottle necks in compiled code (such as CFFI modules, Cython or C Python extensions). It supports PyPy, CPython versions 2.7, 3.4, 3.5 and 3.6. Special thanks to Jetbrains for funding the native profiling support.

What is vmprof?

If you have already worked with vmprof you can skip the next two section. If not, here is a short introduction:

The goal of vmprof package is to give you more insight into your program. It is a statistical profiler. Another prominent profiler you might already have worked with is cProfile. It is bundled with the Python standard library.

vmprof's distinct feature (from most other profilers) is that it does not significantly slow down your program execution. The employed strategy is statistical, rather than deterministic. Not every function call is intercepted, but it samples stack traces and memory usage at a configured sample rate (usually around 100hz). You can imagine that this creates a lot less contention than doing work before and after each function call.

As mentioned earlier cProfile gives you a complete profile, but it needs to intercept every function call (it is a deterministic profiler). Usually this means that you have to capture and record every function call, but this takes an significant amount time.The overhead vmprof consumes is roughly 3-4% of your total program runtime or even less if you reduce the sampling frequency. Indeed it lets you sample and inspect much larger programs. If you failed to profile a large application with cProfile, please give vmprof a shot.

vmprof.com or PyCharm

There are two major alternatives to the command-line tools shipped with vmprof:

While the command-line tool is only good for quick inspections, vmprof.com
and PyCharm complement each other, providing deeper insight into your
program. With PyCharm you can view the per-line profiling results inside
the editor. With vmprof.com you get a handy visualization of the profiling results as a flame graph and memory usage graph.

Since the PyPy Team runs and maintains the service on vmprof.com (which is by the way free and open-source), I’ll explain some more details here. On vmprof.com you can inspect the generated profile interactively instead of looking at console output. What is sent to vmprof.com? You can find details here.
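For example, uploading a profile to vmprof.com instead of reading console output is a single extra flag:

$ python -m vmprof --web script.py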

Flamegraph: Accumulates and displays the most frequent code paths. It allows you to quickly and accurately identify hot spots in your code. The flame graph below is from a very short run of richards.py (thus it shows a lot of time spent in PyPy's JIT compiler).

List all functions (optionally sorted): the equivalent of the vmprof command-line output on the web.

Memory curve: A line plot that shows how many MBytes have been consumed over the lifetime of your program (see more info in the section below).

Native programs

The new feature introduced in vmprof 0.4.x allows you to look beyond the Python level. As you might know, Python maintains a stack of frames to track the execution state. Up to now the vmprof profiles only contained that level of information. But what if your program jumps to native code (such as calling gzip compression on a large file)? Up to now you would not see that information.

Many packages make use of the CPython C API (which we discourage; please look up cffi for a better way to call C). Have you ever known that your performance problem reaches down into native code, but been unable to profile it properly? Now you can!

Let's inspect a very simple Python program to find out why a program is significantly slower on Linux than on Mac:

Take two NxN random matrices and compute their dot product, passing the absolute value of the first random matrix as the first argument to the dot product.
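A minimal reconstruction of that program (the exact script is not shown in the post, so the variable names and this exact form are illustrative):

import numpy as np

n = 5000  # the Linux run in the table below used n=1000
a = np.random.random((n, n))
b = np.random.random((n, n))
result = np.dot(np.abs(a), b)  # dot product of |a| and b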

Run  Python         NumPy         OS                       n=...   Took
[1]  CPython 3.5.2  NumPy 1.12.1  Mac OS X, 10.12.3        n=5000  ~9 sec
[2]  CPython 3.6.0  NumPy 1.12.1  Linux 64, Kernel 4.9.14  n=1000  ~26 sec

Note that the Linux machine operates on a 5 times smaller matrix, yet it takes much longer. What is wrong? Is Linux slow? CPython 3.6.0? Well, no. Let's inspect [1] and [2] (shown below in that order).

[2], which runs on Linux, spends nearly all of its time in PyArray_MatrixProduct2; if you compare it to [1] on Mac OS X, you'll see that a lot of time is spent generating the random numbers and the rest in cblas_matrixproduct.

BLAS has a very efficient implementation, so you can achieve the same on Linux if you install a BLAS implementation (such as OpenBLAS).
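To check which BLAS implementation your NumPy build links against, you can for example run:

import numpy as np
np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built with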

Usually you can spot the program source locations that take a lot of time; these are a good starting point for resolving performance issues.

Beyond Python programs

It is not unthinkable that the strategy can be reused for native programs. Indeed this can already be done by creating a small cffi wrapper around an entry point of a compiled C program. It would even work for programs compiled from other languages (e.g. C++ or Fortran). The resulting function names are the full symbol names embedded in the executable's symbol table or extracted from the DWARF debugging information. Most of those will be compiler-specific and contain some cryptic information.
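A minimal sketch of such a wrapper, assuming a hypothetical shared library libwork.so that exports int heavy_computation(int):

from cffi import FFI

ffi = FFI()
ffi.cdef("int heavy_computation(int n);")  # declare the native entry point
lib = ffi.dlopen("./libwork.so")           # hypothetical compiled C library

def entry_point():
    # calling through cffi gives vmprof a Python frame from which
    # the native stack can be sampled
    return lib.heavy_computation(1000000)

if __name__ == "__main__":
    entry_point()

The wrapper script can then be profiled like any other Python program, e.g. with python -m vmprof --web wrapper.py.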

Memory profiling
We thankfully received a code contribution from the company Blue Yonder. They have built a memory profiler (for Linux and Mac OS X) on top of vmprof.com that displays the memory consumption over the runtime of your process.

You can run it the following way:

$ python -m vmprof --mem --web script.py

By adding --mem, vmprof will capture memory information and display it in a dedicated view on vmprof.com. You can toggle that view by clicking the 'Memory' switch in the flame graph view.

vmprof has not reached the end of its development. There are many features we could implement, but one in particular could be a great asset to many Python developers:

Continuous delivery of your statistical profile, or in short, profile streaming. One of the great strengths of vmprof is that it adds very little overhead, so it is not a crazy idea to run it in production.

It would require a smart way to stream the profile in the background to vmprof.com, and new visualizations to look at the much larger amount of data your Python service produces.

If that sounds like a solid vmprof improvement, don't hesitate to get in touch with us (e.g. IRC #pypy, mailing list pypy-dev, or comment below).

You can help!

There are some immediate things other people could help with, either by donating time or money (yes, we have occasional contributors, which is great)!

We gladly received a code contribution for the memory profiler, but there was not enough time to finish the migration completely, so it is sadly a bit brittle right now.

We would like to spend more time on other visualizations. This includes providing a much better user experience on vmprof.com (like a tutorial that explains the visualizations we already have).

Tuesday, March 21, 2017

The PyPy team is proud to release both PyPy2.7 v5.7 (an interpreter supporting
Python v2.7 syntax), and a beta-quality PyPy3.5 v5.7 (an interpreter for Python
v3.5 syntax). The two releases are both based on much the same codebase, thus
the dual release. Note that PyPy3.5 only supports Linux 64bit for now.

This new PyPy2.7 release includes the upstream stdlib version 2.7.13, and PyPy3.5 (our first in the 3.5 series) includes the upstream stdlib version 3.5.3.

We continue to make incremental improvements to our C-API compatibility layer (cpyext). PyPy2 can now import and run many C-extension packages, the most notable being Numpy, Cython, and Pandas. Performance may be slower than CPython, especially for frequently-called short C functions. Please let us know if your use case is slow; we have ideas for how to make things faster but need real-world examples (not micro-benchmarks) of problematic code.

Work proceeds at a good pace on the PyPy3.5 version due to a grant from the Mozilla Foundation, hence our first 3.5.3 beta release. Thanks Mozilla !!! While we do not pass all tests yet, asyncio works, and as these benchmarks show, it already gives a nice speed bump. We also backported the f"" formatting from 3.6 (as an exception; otherwise “PyPy3.5” supports the Python 3.5 language).
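A short example of the backported syntax:

interpreter = "PyPy3.5"
version = 5.7
print(f"{interpreter} v{version} already understands f-strings")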

CFFI has been updated to 1.10, improving an already great package for interfacing with C.

We now use shadowstack as our default gcrootfinder even on Linux. The alternative, asmgcc, will be deprecated at some future point. While about 3% slower, shadowstack is much more easily maintained and debuggable. Also, the performance of shadowstack has been improved in general: this should close the speed gap between other platforms and Linux.

As always, this release fixed many issues and bugs raised by the growing community of PyPy users. We strongly recommend updating.

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on pypy, or general help with making RPython’s JIT even better.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7 and CPython 3.5. It’s fast (PyPy and CPython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
The PyPy 2.7 release supports:

What else is new?

There are many incremental improvements to RPython and PyPy, the complete listing is here.

Please update, and continue to help us make PyPy better.

Cheers, The PyPy team

Saturday, March 4, 2017

Today is the last day of our yearly sprint event in Leysin. We had lots of
ideas on how to enhance the current state of PyPy, went skiing, and
had interesting discussions around virtual machines, the Python
ecosystem, and other real-world problems.

Why don't you join us next time?

A usual PyPy sprint day goes through the following stages:

Planning Session: Tasks from previous days that have seen progress or
are completed are noted in a shared document. Everyone adds new tasks
and then assigns themselves to one or more tasks (usually in pairs). As
soon as everybody is happy with their task and has a partner to work
with, the planning session is concluded and the work can start.

Discussions: A sprint is a good occasion to discuss difficult
and important topics in person. We usually sit down in a separate area
in the sprint room and discuss until a) nobody wants to discuss anymore
or b) we have found a solution to the problem. The good thing is that usually
the outcome is b).

Lunch: For lunch we prepare sandwiches and other finger food.

Continue working until dinner, which we eat at a random restaurant in Leysin.

Goto 1 the next day, if the sprint has not ended.

Sprints are open to everybody and help newcomers get started with PyPy (we usually
pair you with a developer familiar with PyPy). They are perfect for
discussing and finding solutions to problems we currently face. If you are
eager to join next year, please don't hesitate to register around January.

Sprint Summary

Sprint goals included working on the following topics:

Work towards releasing PyPy 3.5 (it will be released soon)

CPython Extension (CPyExt) modules on PyPy

Have fun in winter sports (a side goal)

Highlights

We have spent lots of time debugging and fixing memory issues on CPyExt.
In particular, we fixed a serious memory leak where taking a memoryview
would prevent numpy arrays from ever being freed. More work is still required to ensure that our GC always releases arrays in a timely
manner.

Fruitful discussions and progress on fleshing out the details of the unicode representation in PyPy. Our current goal is to use utf-8 as the internal unicode representation and to have fast vectorized operations (indexing, checking validity, ...).

PyPy will participate in GSoC 2017 and we will try to allocate more resources to that than last year.

Profiled and thought about how to reduce the startup size of the interpreter. The starting point would be to look at the parser and reduce the number of strings kept alive.

Found a topic for a student's master thesis: correctly freeing cpyext reference cycles.

Ran lots of Python 3 code on top of PyPy3 and resolved the issues we found along the way.

Initial work on making RPython thread-safe without a GIL.

List of attendees

- Stefan Beyer

- Antonio Cuni

- Maciej Fijalkowski

- Manuel Jacob

- Ronan Lamy

- Remi Meier

- Richard Plangger

- Armin Rigo

- Robert Zaremba

We would like to thank our donors for the continued support of the PyPy
project, and we are looking forward to next year's sprint in Leysin.

The PyPy Team

Wednesday, March 1, 2017

We are almost ready to release an alpha version of PyPy 3.5. Our goal is to release it shortly after the sprint. Many modules have already been ported and it can probably run many Python 3 programs already. We are happy to receive any feedback after the next release.

To show that the heart (asyncio) of Python 3 is already working, we have prepared some benchmarks. They were done by Paweł Piotr Przeradowski @squeaky_pl for an HTTP workload on several asynchronous IO libraries, namely the relatively new asyncio and curio libraries and the battle-tested tornado, gevent and Twisted libraries. To see the benchmarks, check out https://github.com/squeaky-pl/zenchmarks; the instructions for reproducing them can be found inside README.md in the repository. Raw results can be obtained from https://github.com/squeaky-pl/zenchmarks/blob/master/results.csv.

The purpose of the presented benchmarks is to show that the upcoming PyPy release
already works with unmodified code that runs on CPython 3.5. PyPy
also manages to make that code run significantly faster.

The benchmarks consist of HTTP servers implemented on top of the mentioned
libraries. All the servers are single-threaded relying on underlying
event loops to provide concurrency. Access logging was disabled to
exclude terminal I/O from the results. The view code consists of a
lookup in a dictionary mapping ASCII letters to verses from the famous
Zen of Python. If a verse is found the view returns it, otherwise a 404
Not Found response is served. The 400 Bad Request and 500 Internal
Server Error cases are also handled.
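A minimal sketch of that view logic (the actual server implementations live in the zenchmarks repository; the names here are illustrative):

VERSES = {
    "a": "Beautiful is better than ugly.",
    "b": "Explicit is better than implicit.",
    # ... one verse of the Zen of Python per ASCII letter
}

def view(letter):
    try:
        return 200, VERSES[letter]  # 200 OK with the verse
    except KeyError:
        return 404, "Not Found"    # unknown letter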

The workload was generated with the wrk HTTP benchmarking tool. It is run with one thread opening up to 100
concurrent connections for 2 seconds and repeated 1010 times to get
consecutive measures. There is a Lua script provided
that instructs wrk to continuously send 24 different requests that hit
different execution paths (200, 404, 400) in the view code. It is also
worth noting that wrk only counts 200 responses as successful, so the actual requests-per-second throughput is higher.
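Based on that description, the wrk invocation looks roughly like this (the script name and URL are illustrative):

$ wrk -t 1 -c 100 -d 2s -s zen.lua http://127.0.0.1:8080/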

For your convenience, all the used library versions are vendored into the benchmark repository. There is also a precompiled portable version of wrk provided
that should run on any reasonably recent (10 years old or newer) Linux
x86_64 distribution. The benchmark was performed on a public cloud Scaleway x86_64 server launched in a Paris data center. The server was running
Ubuntu 16.04.01 LTS and reported Intel(R) Xeon(R) CPU D-1531 @ 2.20GHz
CPU. CPython 3.5.2 (shipped by default in Ubuntu) was benchmarked
against a pypy-c-jit-90326-88ef793308eb-linux64 snapshot of the 3.5 compatibility branch of PyPy.
