Thursday, October 29, 2015

PyPy 4.0.0 Released - A Jit with SIMD Vectorization and More

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the PyPy python 2.7.10 compatible interpreter with a Just In Time compiler. We have improved warmup time and memory overhead used for tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in rpython, and increased functionality of numpy.
You can download the PyPy 4.0.0 release here:

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering

Since the past release, PyPy 2.6.1, we decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.

Vectorization

Richard Plangger began work in March and continued over a Google Summer of Code to add avectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run-time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release, since it is so new it is off by default. To enable the vectorization in built-in JIT drivers (like numpy ufuncs), add –jit vec=1, to enable all implemented vectorization add –jit vec_all=1
Benchmarks and a summary of this work appear here

Maciej Fijalkowski and Armin Rigo refactored internals of Rpython that now allow PyPy to more efficiently use guards in jitted code. They also rewrote unrolling, leading to a warmup time improvement of 20% or so. The reduction in guards also means a reduction in the use of memory, also a savings of around 20%.

Numpy

Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via –jit vec=1 on the command line) and let us know what is missing for your code.

CFFI

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved manangement of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolate const, restrict keywords from cdef to C code.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

Please try it out and let us know what you think. We welcome feedback, we know you are using PyPy, please tell us about it!
Cheers
The PyPy Team

PyPy 4.0.0

We’re pleased and proud to unleash PyPy 4.0.0, a major update of the PyPy python 2.7.10 compatible interpreter with a Just In Time compiler. We have improved warmup time and memory overhead used for tracing, added vectorization for numpy and general loops where possible on x86 hardware (disabled by default), refactored rough edges in rpython, and increased functionality of numpy.
You can download the PyPy 4.0.0 release here:

We would like to thank our donors for the continued support of the PyPy project.
We would also like to thank our contributors (7 new ones since PyPy 2.6.0) and encourage new people to join the project. PyPy has many layers and we need help with all of them: PyPy and RPython documentation improvements, tweaking popular modules to run on PyPy, or general help with making RPython’s JIT even better.

New Version Numbering

Since the past release, PyPy 2.6.1, we decided to update the PyPy 2.x.x versioning directly to PyPy 4.x.x, to avoid confusion with CPython 2.7 and 3.5. Note that this version of PyPy uses the stdlib and implements the syntax of CPython 2.7.10.

Vectorization

Richard Plangger began work in March and continued over a Google Summer of Code to add avectorization step to the trace optimizer. The step recognizes common constructs and emits SIMD code where possible, much as any modern compiler does. This vectorization happens while tracing running code, so it is actually easier at run-time to determine the availability of possible vectorization than it is for ahead-of-time compilers.
Availability of SIMD hardware is detected at run time, without needing to precompile various code paths into the executable.
The first version of the vectorization has been merged in this release, since it is so new it is off by default. To enable the vectorization in built-in JIT drivers (like numpy ufuncs), add –jit vec=1, to enable all implemented vectorization add –jit vec_all=1
Benchmarks and a summary of this work appear here

Maciej Fijalkowski and Armin Rigo refactored internals of Rpython that now allow PyPy to more efficiently use guards in jitted code. They also rewrote unrolling, leading to a warmup time improvement of 20% or so. The reduction in guards also means a reduction in the use of memory, also a savings of around 20%.

Numpy

Our implementation of numpy continues to improve. ndarray and the numeric dtypes are very close to feature-complete; record, string and unicode dtypes are mostly supported. We have reimplemented numpy linalg, random and fft as cffi-1.0 modules that call out to the same underlying libraries that upstream numpy uses. Please try it out, especially using the new vectorization (via –jit vec=1 on the command line) and let us know what is missing for your code.

CFFI

While not applicable only to PyPy, cffi is arguably our most significant contribution to the python ecosystem. Armin Rigo continued improving it, and PyPy reaps the benefits of cffi-1.3: improved manangement of object lifetimes, __stdcall on Win32, ffi.memmove(), and percolate const, restrict keywords from cdef to C code.

What is PyPy?

PyPy is a very compliant Python interpreter, almost a drop-in replacement for CPython 2.7. It’s fast (pypy and cpython 2.7.x performance comparison) due to its integrated tracing JIT compiler.
We also welcome developers of other dynamic languages to see what RPython can do for them.
This release supports x86 machines on most common operating systems (Linux 32/64, Mac OS X 64, Windows 32, OpenBSD, freebsd), as well as newer ARM hardware (ARMv6 or ARMv7, with VFPv3) running Linux.
We also introduce support for the 64 bit PowerPC hardware, specifically Linux running the big- and little-endian variants of ppc64.

@Gerrit, they are 2 different things. One is the option to say you are interested in the SIMD support and the other is a check if SIMD support is available in the HW if you are interested in using it. I'm sure once SIMD support has been used for some time it will eventually be enabled by default but since it is new and potential could have some unknown issues at this time you have to explicitly enable it at this time.

I keep watching the progress of PyPy with excitement. Cool things happening here. But I continue to be disappointed that it doesn't tip towards Python3. It's dead to me until that becomes the majority effort. :(

The PyPy project contains a large plurality of interests. A lot of the people working on it are volunteers. So PyPy3 will happen if people within the project become interested in that part, or if new people with that interest join the project. At the moment, this seems not happening, which we can all be sad about. However, blaming anybody with differing interest for that situation feels a bit annoying to me.

@PeteVine you posted a random executable from dropbox claiming to have pypy with x87 backend. PyPy does not have an x87 backend and this raises suspitions this was just some malware. Now if you want someone to compare one thing against some other thing, please link to sources and not random binaries so the person comparing can look themselves. Additionally you did not post a benchmark, just a link to the binary

Well, I was suggesting benchmarking the 32-bit backends to see how much difference SIMD makes - x87 means the standard fpu whereas the default uses SSE2. I know it's processor archaeology so you may have forgotten pypy even had it ;)

The ready-to-use pypy distro (built by me) was meant for anyone in possesion of a real set of benchmarks (not synthetic vector stuff) to be able to try it quickly.

And btw, you could have simply edited the dropbox link out. I'd already tested py3k using this backend and mentioned it in one of the issues on bitbucket so it's far from random.

@ all the people asking about pypy3 - you have the python 3.2 compatible pypy (py3k) at your disposal even now.

@PeteVine: to clarify, PyPy has no JIT backend emitting the old-style x87 fpu instructions. What you are posting is very likely a PyPy whose JIT doesn't support floats at all. It emits calls to already-compiled functions, like the one doing addition of float objects, instead of writing a SSE2 float addition on unboxed objects.

Instead, use the official PyPy and run it with vectorization turned on and off (as documented) on the same modern machine. This allows an apple-to-apple comparison.

Thanks for clarifying, I must have confused myself after seeing it was i486 compatible.

Are you saying the only difference between the backends I wanted to benchmark would boil down to jit-emitting performance and not actual pypy performance? (I must admit I tried this a while ago with fibonacci and there was no difference at all).

In other words, even before vectorization functionality was added, shouldn't it be possible to detect that the non-SSE2 backend is running on newer hardware and use the available SIMD? (e.g. for max. compatibility)

@PeteVine Sorry, I don't understand your questions. Why do you bring the JIT-emitting performance to the table? And why fibonacci (it's not a benchmark with floats at all)? And I don't get the last question either ("SIMD" = "vectorization").

To some people, merely dropping the word "SIMD" into a performance discussion makes them go "ooh nice" even if they don't have a clue what it is. I hope you're more knowledgeable than that and that I'm merely missing your point :-)

The last part should have been pretty clear as it was referring to the newly added –jit vec=1 so it's not me who's dropping SIMD here (shorthand for different instructions sets) as can be seen in the title of this blog post.

All this time I was merely interested in comparing the two 32-bit backends, that's all. One's using the i486/x87 instruction set regardless of any jit codes, the other is able take advantage of anything up to SSE2. The quick fibonacci test was all I did so you could have pointed me to a real set of benchmarks instead of throwing these little jabs :)

@PeteVine: ok, there is a misunderstanding somewhere, I think. Let me try to clarify: PyPy's JIT has always used non-SIMD SSE2 instructions to implement floating point operations. We have a slow mode where only x87 instructions are used, but usually don't fall back to that, and it does not make sense to compare against that mode.

What the new release experimentally added is support for SIMD SSE instructions using autoparallelization when --jit vec=1 is given. This only works if your program uses numpy arrays or other simple list processing code. For details on that (and for benchmarks) it's probably best to read Richard Plangger's blog.

But please, I'd really like to see how much of a handicap the much-maligned non-SSE2 backend incurs. Could you recommend a set of python (not purely computational) benchmarks so I can put this peevee of mine to rest/test?

Anyways, @Armin Rigo is a great educator himself judging from his patient replies in the bugtracker! So yeah, kudos to you guys!