Thursday, October 29, 2015

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the PyPy Python 2.7.10 compatible interpreter with a Just-in-Time compiler. We have improved warmup time and the memory overhead of tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in RPython, and increased the functionality of numpy.
You can download the PyPy 4.0.0 release here:

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering

Since the previous release, PyPy 2.6.1, we have decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.

Vectorization

Richard Plangger began work in March and continued over a Google Summer of Code to add a vectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release; since it is so new, it is off by default. To enable vectorization in built-in JIT drivers (like numpy ufuncs), add --jit vec=1; to enable all implemented vectorization, add --jit vec_all=1.
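As a rough illustration of the latter (a sketch only; the names are made up, and whether a given loop actually vectorizes depends on what the trace looks like), --jit vec_all=1 aims at tight loops over flat, homogeneous storage:

    # Hypothetical candidate loop for --jit vec_all=1: element-wise work
    # over flat double-precision storage from the array module.
    from array import array

    def axpy(a, x, y, n):
        out = array('d', [0.0]) * n        # n zeros
        i = 0
        while i < n:
            out[i] = a * x[i] + y[i]       # the step that can be packed into SIMD
            i += 1
        return out

    x = array('d', range(1024))
    y = array('d', range(1024))
    print(list(axpy(2.0, x, y, 1024))[:5])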
Benchmarks and a summary of this work appear here.

Internal Refactoring and Warmup Time Improvement

Maciej Fijalkowski and Armin Rigo refactored internals of RPython that now allow PyPy to use guards in jitted code more efficiently. They also rewrote unrolling, leading to a warmup time improvement of around 20%. The reduction in guards also means a reduction in memory use, a savings of around 20% as well.

Numpy

Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via --jit vec=1 on the command line) and let us know what is missing for your code.
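For instance, a minimal script to try (the array sizes and operations here are only an illustration):

    # Run under PyPy with vectorization enabled, e.g.:
    #   pypy --jit vec=1 this_script.py
    import numpy as np

    a = np.arange(1000000, dtype=np.float64)
    b = np.ones(1000000, dtype=np.float64)
    c = a + 2.0 * b          # element-wise ufuncs are the natural SIMD candidates
    print(c.sum())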

CFFI

While not applicable only to PyPy, cffi is arguably our most significant contribution to the Python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved management of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolating the const and restrict keywords from cdef to C code.
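As a small taste of one of those additions, ffi.memmove(dest, src, n) copies n bytes between cdata objects and anything supporting the buffer interface:

    # Minimal ffi.memmove() example.
    import cffi

    ffi = cffi.FFI()
    buf = ffi.new("char[]", 16)
    ffi.memmove(buf, b"hello", 5)      # Python bytes -> C char array
    out = bytearray(5)
    ffi.memmove(out, buf, 5)           # C char array -> Python bytearray
    print(out)                         # bytearray(b'hello')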

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, FreeBSD), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for 64-bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

Tuesday, October 20, 2015

Hi everyone,

it took some time to catch up with the JIT refactorings merged this summer. But, (drums) we are happy to announce that:

The next release of PyPy, "PyPy 4.0.0", will ship the new auto vectorizer

The goal of this project was to increase the speed of numerical applications, both in the NumPyPy library and in arbitrary Python programs. In PyPy we have focused a lot on improvements in the 'typical Python workload', which usually involves object and string manipulations, mostly for web development. We're hoping that with this work we'll continue improving the other very important Python use case - numerics.

What it can do!

It targets numerics only. It will not execute object manipulations faster, but it is capable of enhancing common vector and matrix operations. The good news is that it is not specifically targeted at the NumPy library and the PyPy virtual machine: any interpreter written in RPython is able to make use of the vectorization. For more information about that take a look here, or consult the documentation. For the time being it is not turned on by default, so be sure to enable it by specifying --jit vec=1 before running your program.

If your language (written in RPython) contains many array/matrix operations, you can easily integrate the optimization by adding the parameter 'vec=1' to the JitDriver.
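A minimal sketch of what that integration might look like for a hypothetical interpreter loop (the keyword is spelled vectorize=True on the RPython JitDriver; check your RPython version for the exact spelling):

    # Hypothetical RPython interpreter loop opting in to vectorization.
    from rpython.rlib.jit import JitDriver

    driver = JitDriver(greens=[], reds=['i', 'n', 'a', 'b', 'out'],
                       vectorize=True)

    def vector_add(a, b, out, n):
        i = 0
        while i < n:
            driver.jit_merge_point(i=i, n=n, a=a, b=b, out=out)
            out[i] = a[i] + b[i]   # element-wise step the tracer can pack into SIMD
            i += 1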

NumPyPy Improvements

Let's take a look at the core functions of the NumPyPy library (*). The following tests show the speedup of the core functions commonly used in Python code interfacing with NumPy, on CPython with NumPy, on the PyPy 2.6.1 released several weeks ago, and on PyPy 15.11 to be released soon. timeit was used to measure the time needed to run the operation in the plot title on various vector (lower case) and square matrix (upper case) sizes displayed on the X axis. The Y axis shows the speedup compared to CPython 2.7.10, so higher is better.
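The measurement style was roughly as follows (a sketch only; the sizes and the operation are illustrative, and the actual benchmark code is linked at the bottom of this post):

    # timeit-style measurement of one operation at one size.
    import timeit

    setup = "import numpy as np; a = np.arange(1000.0); b = np.arange(1000.0)"
    t = timeit.timeit("a + b", setup=setup, number=10000)
    print("%.6f seconds for 10000 runs" % t)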

In comparison to PyPy 2.6.1, the speedup greatly improved. The hardware support really strips down the runtime of the vector and matrix operations. There is another operation we would like to highlight: the dot product. It is a very common operation in numerics, and PyPy now (given a moderately sized matrix and vector) decreases the time spent in that operation. See for yourself:

These are nice improvements in the NumPyPy library, and we reached a competitive level making use only of SSE4.1.

Future work

This is not the end of the road. The GSoC project showed that it is possible to implement this optimization in PyPy. There might be other improvements we can make to carry this further:

Check alignment at runtime to increase the memory throughput of the CPU

Support the AVX vector extension which (at least) doubles the size of the vector register

Handle each and every corner case in Python traces to enable it globally

Do not rely only on loading operations to trigger the analysis; there might be cases where combinations of floating point values could be computed in parallel

Cheers,
The PyPy Team

(*) The benchmark code can be found here; it was run using this configuration: i7-2600 CPU @ 3.40GHz (4 cores).

Friday, October 16, 2015

Hi all,

PyPy's JIT now supports the 64-bit PowerPC architecture! This is the third architecture supported, in addition to x86 (32 and 64) and ARM (32-bit only). More precisely, we support Linux running the big- and little-endian variants of ppc64. Thanks to IBM for funding this work!

The new JIT backend has been merged into "default". You should be able to translate PPC versions as usual, directly on the machines. For the foreseeable future, I will compile and distribute binary versions corresponding to the official releases (for Fedora), but of course I'd welcome it if someone else could step in and do it. Also, it is unclear yet if we will run a buildbot.

To check that the result performs well, I logged in to a ppc64le machine and ran the usual benchmark suite of PyPy (minus sqlitesynth: sqlite was not installed on that machine). I ran it twice, 12 hours apart, as an attempt to reduce the risk of other users suddenly using the machine. The machine was overall relatively quiet. Of course, this is scientifically not good enough; it is what I could come up with given the limited resources.

Here are the results, where the numbers are speed-up factors between the non-jit and the jit version of PyPy. The first column is x86-64, for reference. The second and third columns are the two ppc64le runs. All are Linux. A few benchmarks are not reported here because the runner doesn't execute them on non-jit (however, apart from sqlitesynth, they all worked).

The last line reports the geometric mean of each column. We see that the goal was reached: PyPy's JIT actually improves performance by a factor of around 9.7 to 10 times on ppc64le. By comparison, it "only" improves performance by a factor of 9.3 on Intel x86-64. I don't know why, but I'd guess it mostly means that a non-jitted PyPy performs slightly better on Intel than it does on PowerPC.
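For reference, the geometric mean of a column of speed-up factors is just the n-th root of their product; a quick sketch (the numbers below are made up for illustration, not the benchmark results):

    # Geometric mean via logs, numerically stabler than multiplying directly.
    import math

    speedups = [9.5, 12.1, 7.8, 10.4]   # hypothetical per-benchmark factors
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    print(round(geomean, 2))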

Why is that? Actually, if we do the same comparison with an ARM column too, we also get higher numbers there than on Intel. When we discovered that a few years ago, we guessed that on ARM running the whole interpreter in PyPy takes up a lot of resources, e.g. of instruction cache, which the JIT's assembler doesn't need any more after the process is warmed up. And caches are much bigger on Intel. However, PowerPC is much closer to Intel, so this argument doesn't work for PowerPC. But there are other more subtle variants of it. Notably, Intel is doing crazy things about branch prediction, which likely helps a big interpreter (both the non-JITted PyPy and CPython, and both for the interpreter's main loop itself and for the numerous indirect branches that depend on the types of the objects). Maybe the PowerPC is as good as Intel, and so this argument doesn't work either. Another one would be: on PowerPC I did notice that gcc itself is not perfect at optimization. During development of this backend, I often looked at assembler produced by gcc, and there are a number of small inefficiencies there. All these are factors that slow down the non-JITted version of PyPy, but don't influence the speed of the assembler produced just-in-time.

Anyway, this is just guessing. The fact remains that PyPy can now be used on PowerPC machines. Have fun!

A bientôt,

Armin.

Monday, October 5, 2015

Hello everyone!

This is the second part of the series of improvements in warmup time and memory consumption in the PyPy JIT. This post covers recent work on sharing guard resume data that was recently merged to trunk. It will be a part of the next official PyPy release. To understand what it does, let's start with a loop for a simple example:
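A sketch of the kind of loop we mean (the names are hypothetical; the guards discussed next are the checks the JIT inserts at the loop entrance):

    class A(object):
        def method(self, x):
            return x + 1

    def f():
        a = A()
        s = 0
        for i in range(10000):
            # On entry to the jitted loop, guards check e.g. that 'a' is
            # still an instance of A and that A.method has not been
            # replaced; each guard needs resume data describing how to
            # fall back to the interpreter if it ever fails.
            s += a.method(i)
        return s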

Those guard operations get executed at the entrance, i.e. each time we call f(). They ensure that all the optimizations done below stay valid. As long as nothing out of the ordinary happens, they only confirm that the world around us never changed. However, if e.g. someone puts new methods on class A, any of the above guards might fail. Even though that is a very unlikely case, PyPy needs to track how to recover from such a situation. Each of those points needs to keep the full state of the optimizations performed, so that we can safely deoptimize and reenter the interpreter. This is vastly wasteful, since most of those guards never fail; hence some sharing between guards has been performed.

We went a step further: when two guards are next to each other, or the operations in between them don't have side effects, we can safely redo the operations, or, simply put, resume at the previous guard. That means that every now and again we execute a few extra operations, but not storing the extra info saves quite a bit of time and memory. This is similar to the approach that LuaJIT takes, which is called sparse snapshots.
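A toy illustration of the trade-off (nothing like PyPy's actual data structures): a later guard can point at an earlier guard's snapshot plus a list of side-effect-free operations to replay on failure, instead of storing its own full snapshot.

    class Snapshot(object):
        # Full state needed to leave machine code and resume interpreting.
        def __init__(self, values):
            self.values = list(values)

    class Guard(object):
        def __init__(self, snapshot, replay_ops=()):
            # A guard either owns a full snapshot, or shares the previous
            # guard's snapshot and keeps only cheap ops to re-execute.
            self.snapshot = snapshot
            self.replay_ops = list(replay_ops)

        def resume_state(self):
            values = list(self.snapshot.values)
            for op in self.replay_ops:   # redo the side-effect-free ops
                values.append(op(values))
            return values

    snap = Snapshot([1, 2, 3])
    guard1 = Guard(snap)                                      # full snapshot
    guard2 = Guard(snap, replay_ops=[lambda v: v[0] + v[1]])  # shared + replay
    print(guard2.resume_state())                              # [1, 2, 3, 3]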

I've done some measurements on the annotation & rtyping steps of translating PyPy itself, which is a pretty memory-hungry program that compiles a fair bit. I measured, respectively:

total time the translation step took (annotating or rtyping)

time spent tracing (which excludes backend time from the total JIT time), measured at the end of rtyping

memory the GC feels responsible for after the step. The real amount of memory consumed will always be larger; the coefficient of savings is in the 1.5-2x range

Here is the table:

branch     time annotation   time rtyping   memory annotation   memory rtyping   tracing time
default    317s              454s           707M                1349M            60s
sharing    302s              430s           595M                1070M            51s
win        4.8%              5.5%           19%                 26%              17%

Obviously, translating PyPy is an extreme example - the vast majority of code out there does not have that many lines of code to be jitted. However, it's at the very least a good win for us :-)

We will continue to improve the warmup performance and keep you posted!

Cheers,
fijal
