Wednesday, May 4, 2011

Numpy in PyPy - status and roadmap

Hello.

NumPy integration is one of the single most requested features for PyPy. This
post tries to describe where we are, what we plan (or what we don't plan), and
how you can help.

Short version for the impatient: we are doing experiments which show that
PyPy+numpy can be faster and better than CPython+numpy. We have a plan for
moving forward, but at the moment we lack dedicated people or money to
tackle it.

The slightly longer version

Integrating numpy in PyPy has been my pet project on an on-and-off (mostly off)
basis over the past two years. There were some experiments, then a long
pause, and then some more experiments which are documented below.

The general idea is not to use the existing CPython module, but to
reimplement numpy in RPython (i.e. the language PyPy is implemented in), thus
letting our JIT achieve extra speedups. The really cool thing about this
approach is that numpy will automatically benefit from any general JIT
improvements, without any need for extra tweaking.

At the moment, there is a branch called numpy-exp which contains a
translatable version of a very minimal subset of numpy in a module called
micronumpy. Example benchmarks show the following:

                                 add             iterate

CPython 2.6.5 with numpy 1.3.0   0.260s (1x)     4.2s (1x)

PyPy numpy-exp @ 3a9d77b789e1    0.120s (2.2x)   0.087s (48x)

The add benchmark spends most of the time inside the + operator on
arrays (doing a + a + a + a + a), which in CPython is implemented in C.
As you can see from the table above, the PyPy version is already ~2 times
faster. (numexpr is still faster than PyPy, but we're working on it.)

The exact way array addition is implemented is worth another blog post, but in
short it lazily evaluates the expression and computes it at the end, avoiding
intermediate results. This approach scales much better than numexpr
and can lead to speeding up all the operations that you can perform on matrices.
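To make the idea concrete, here is a toy sketch in plain Python (not the actual micronumpy code; all class names are invented for illustration) of lazy elementwise addition: operators build an expression tree, and a single loop at the end computes the result without allocating any intermediate arrays.

```python
# Toy sketch of lazy evaluation, NOT the real micronumpy implementation.

class Node:
    def __add__(self, other):
        # Instead of computing immediately, record the operation.
        return Add(self, other)

class Array(Node):
    def __init__(self, data):
        self.data = list(data)
    def value(self, i):
        return self.data[i]
    def __len__(self):
        return len(self.data)

class Add(Node):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def value(self, i):
        # Evaluate both operands at index i; no temporary array exists.
        return self.left.value(i) + self.right.value(i)
    def __len__(self):
        return len(self.left)

def force(expr):
    # One pass over the whole expression tree: a + a + a + a + a is
    # computed element by element, with no intermediate results.
    return [expr.value(i) for i in range(len(expr))]

a = Array([1.0, 2.0, 3.0])
print(force(a + a + a + a + a))  # [5.0, 10.0, 15.0]
```

Eager evaluation of the same expression would allocate four temporary arrays; the lazy form is what gives the JIT a single fused loop to compile.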

The next obvious step to get even more speedups would be to extend the JIT to
use SSE operations on x86 CPUs, which should give roughly an additional 2x
speedup, as well as using multiple threads to do operations.

iterate is also interesting, but for entirely different reasons. On CPython
it spends most of the time inside a Python loop; the PyPy version is ~48 times
faster, because the JIT can optimize across the python/numpy boundary. This
shows the potential of the approach: users are not grossly penalized for
writing their loops in Python.
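The post doesn't include the benchmark source, but an iterate-style workload presumably looks something like the following hypothetical sketch: a plain Python loop touching array elements one at a time. On CPython every access boxes a float and goes through dynamic dispatch; PyPy's JIT can compile the whole loop, including the element accesses, into one piece of machine code.

```python
# Hypothetical sketch of an "iterate"-style workload (the actual
# benchmark source is not shown in the post).

def iterate_sum(a):
    total = 0.0
    for i in range(len(a)):
        # On CPython, each a[i] crosses the python/numpy boundary and
        # boxes a float; this is where the interpreter overhead lives.
        total += a[i]
    return total

print(iterate_sum([0.5] * 8))  # 4.0
```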

The drawback of this approach is that we need to reimplement numpy in RPython,
which takes time. A very rough estimate is that a useful subset (for some
definition of useful) could be implemented in somewhere between one and three
person-months.

It also seems that the result will be faster in most cases and the same speed
as the original numpy in the others. The only problem is finding dedicated
people willing to spend quite some time on this; I am willing to both mentor
and encourage such a person.

A good starting point for helping would be to look at what's already
implemented in the micronumpy module and try extending it. Adding a - operator
or support for integers would be an interesting start. Drop by #pypy on
irc.freenode.net or get in contact with the developers via some other channel
(such as the pypy-dev mailing list) if you want to help.
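As a concrete illustration of how small such a first contribution is, here is a hypothetical sketch in plain Python (not the actual RPython micronumpy source; the class name and layout are invented) of what adding an elementwise - operator next to an existing + amounts to:

```python
# Illustrative only: a minimal eager array class showing the shape of
# the suggested contribution. The real micronumpy code differs.

class MiniArray:
    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):
        # Existing elementwise addition.
        return MiniArray(x + y for x, y in zip(self.data, other.data))

    def __sub__(self, other):
        # The kind of small extension suggested above: elementwise
        # subtraction, mirroring the + implementation.
        return MiniArray(x - y for x, y in zip(self.data, other.data))

a = MiniArray([3.0, 4.0])
b = MiniArray([1.0, 0.5])
print((a - b).data)  # [2.0, 3.5]
```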

Another option would be to sponsor NumPy development. In case you're
interested, please get in touch with us or leave your email in comments.

Cheers,
fijal


Great post. (I'm another person who would like numpy on pypy.) What are the guidelines for when something should be implemented in RPython? For me personally there are a few instances where I would trade some of the dynamism of Python for speed in my own code.

@Nick the mixed approach (use cpyext and pieces in RPython) sounds maybe valuable short term, but it can burn people easily. RPython-only is way more elegant and gives you wins upfront. Since there is no one willing to invest time in the short-term approach, this sounds like a no-brainer.

@matt almost nothing should be implemented in RPython, except the interpreter itself. Writing Python should be fast enough. Numpy is a notable exception, where we want to leverage the last bits and pieces of the JIT and be really, really fast. For example, you can't really leverage SSE from the Python layer.

Are you in touch with the Numpy developers? Are they eager to "stop" using Python and move to RPython? I mean, if this work needs to be redone for each version of Numpy, we will always be lagging behind and always spending a lot of effort. On the other hand, if the Numpy devs start using the RPython version and let the pure-Python one die, then the porting effort would be much more meaningful, and I believe it will be easier to find a group of people interested in doing it (myself, maybe).

1) It doesn't sound like this path will lead to easier integration of scipy. If I'm wrong please let me know! But if I'm right, the reality is that most of the reason I care about numpy is because scipy depends on it, and I care about scipy.

2) What about the numpy refactoring effort, which is supposed to be making a better C interface for numpy that works with IronPython as well as CPython (http://lists.ironpython.com/pipermail/users-ironpython.com/2010-December/014059.html)? Why not just encourage that effort and leverage it for PyPy integration? Is there a reason it won't work for PyPy even though it works for both IronPython and CPython?

@Davide it's not Python vs RPython, it's C (which numpy is implemented in) vs RPython. No numpy users will be required to use RPython for anything.

@Gary I believe you're wrong. The idea stays the same - you can call arbitrary C code that will manipulate raw memory and do what it wants to do. The idea is to reimplement only the interface part (which uses the CPython C API) and not the C part, which will work anyway. So in the end, we hope to leverage that effort. Also, we're not Microsoft and can't pay large sums of money to do it, and having a small subset of numpy that's really fast appeals much more to me than a large effort that only gives us a numpy for pypy that's no faster than cpython's.

@Maciej: It was clear to me that numpy users shouldn't change anything, but I thought you intended to change only the Python part of Numpy, not the C part.

Now, if you plan to change the whole C sections, that's a huge job. What are your plans for dependencies like BLAS, LAPACK and the like? Would you reimplement them in RPython as well?

And regardless of the answer, my question is still valid: do you see this project as a "catch-up porting" of Numpy, with the version for CPython going on by itself? Or do you see the RPython fork becoming the mainstream Numpy? And if it's the latter, how would that perform on CPython? I think these questions are the key to the matter.

OK. As I've argued before in various pypy groups, I think one of the groups that will most strongly benefit from pypy's speed is the scientific community -- but they need numpy and scipy. So now that I know that this plan will (hopefully) allow for both of those to be used from pypy, I'm encouraged by it.

@Maciej: The parts of Scipy written in Python are for the most part not large. The main work would be in reimplementing the C code that uses Numpy's C-API, and figuring out a way to interface with Fortran code.

You say you lack sufficient resources to put in a large effort, but your answer to CPython extensions is "reimplement everything in RPython". Would it not make more sense to improve cpyext so that you get good performance out of it (maybe even JIT compatible)? This seems like a better answer than rewriting every single CPython extension and trying to keep the RPython implementation in sync.

@Joseph cpyext will always be only a semi-permanent compatibility layer. Making numpy work with cpyext is both unrewarding (hard work with obscure bugs) and significantly harder to make fast - in some places completely impossible. Yes, it doesn't make sense for all extensions; it doesn't even make sense for most. Numpy is however special, since speed is the reason for its existence. Also, frankly, when it comes down to my free time, "let's make this cool JITed code run 50x faster than CPython" beats "let's stare puzzled at this segfault".

To everybody asking why we cannot just use cpyext: I already tried it. It's not gonna happen without hacking the crap out of numpy. Additionally, it's going to be slow: Numpy is not fast for most operations, because of double-unboxing. Only vector ops are fast. JITing the operations is going to be a big win.

For those of you not believing numpy is slow, look at numexpr (http://code.google.com/p/numexpr/) which implements many of the same ideas that we are planning on implementing.

Extremely exciting! Perhaps this is a good time to document the internals of NumPy a bit better while you scour the source to reimplement it in RPython.

Perhaps this is a good fit for a Kickstarter (or similar) project? I believe this requires very talented and dedicated developers, and paying them professionally by raising money on the Internet should be possible. It's been done before.

Yes, having a couple of Kickstarter projects for PyPy would be nice. It seems the current view is "we'll wait for someone wanting a feature enough to fund it". Picking one or two known valuable features to put on Kickstarter would provide for a nice test: can you raise more money by asking for it in a targeted way?

To address some of the criticism you're receiving, it may be worth making another post clarifying the points made in the comments and elsewhere:

- numpy+cpyext has been tried and found wanting (and very hard to debug)
- no developers available that are interested in beating their heads against that particular wall
- pure C and Python components of numpy should remain largely the same
- only the Python bindings layer that uses the CPython C API needs to be reimplemented
- RPython has its own FFI which is PyPy's preferred way to interface to non-Python code (http://pypy.readthedocs.org/en/latest/rffi.html)
- cpyext is a useful tool for compatibility with relatively simple C extensions that don't stress the C API greatly, but numpy is not such an extension.

Hi maciej, I am david (we quickly met at pycon where I presented myself as a numpy guy).

I think part of the misunderstanding is around the meaning of "numpy in pypy". Rewriting an array class on top of pypy is certainly valuable, and I am in no position to tell other people what to do in their free time. But I don't think it can realistically mean people will be able to use this instead of numpy after 2-3 man-months: how will interfacing with BLAS/LAPACK work? How will interfacing with the vast amount of fortran code in scipy work?

If cpyext is indeed a dead-end, it would be valuable to know why. Personally, I would certainly be happy to fix the parts of numpy that make cpyext impractical, even if it meant it were twice as slow as on cpython. Because I could still benefit from pypy *elsewhere*, without having to rewrite all the numpy/scipy/etc... code.

@david please look above at my responses. There will still be a piece of memory you can pass to LAPACK or BLAS or something. The RPython part is about the interface only, not the C-only part. If you want to improve numpy, please separate the C-only parts from the interface parts as much as possible - using C from RPython is a no-brainer.

Maciej, let me second Nick's polite request for a more detailed post about the plan.

If even David, an actual numpy developer, can misunderstand your description, what do you expect from the unwashed masses of scipy users like me? :) Fortunately it does not take too much effort to alleviate the worries. All you have to do is explain to everyone that the plan takes into account the giant amount of C and Fortran code in numpy/scipy, and the fact that forking numpy/scipy is infeasible.

@daniel: I don't think there is a misunderstanding as much as different people wanting different things. I believe that Maciej and other pypy people are more interested in leveraging pypy and its JIT to do things which are indeed quite complicated in numpy today (avoiding temporaries, fast iterators in python, etc...). I have little doubt that pypy is a better platform than cpython for experimenting with this kind of thing.

I am more surprised by the claim that numpy is so tied to cpython internals. It certainly depends on the C API, but mostly the public API, documented as such.

Hey Maciej! This sounds absolutely awesome. I hope you can find someone to do the necessary work. I think you might need to explain a little better in a separate post where that 48x speedup comes from, and why RPython is a necessary part of it. I think I understand why, but clearly some of the commenters don't :).

Well, if the answer to "How do we make numpy available in pypy?" is "do a complicated rewrite of numpy," then I'm pretty skeptical about the pypy project. I primarily use numpy, but also scipy sometimes and Image sometimes. As a user it's most important to me that code runs. Speed is not as critical. For example if I take stddev() of an array I first want that to run, and only secondarily want it to be efficient. If there's a library that I might want to use, and neither I nor anyone else can wrap it with a reasonable amount of effort, then I don't find pypy encouraging at all - since there are lots of libraries out there, and it has been convincingly argued that Python's primary utility is its library support.

@Anonymous: While you may not be concerned with performance, a great many people are. The only way to have arbitrary numpy stuff work in theory would be CPyExt, but as we've said that's fraught with complications in that a) it won't work out of the box on something that uses as many corners of the CPython C-API as NumPy, and b) it will always be slow. Given people's desire for speed with respect to NumPy, we consider reimplementing it a reasonable course.

Alex -- I'm not saying speed is unimportant. What I'm saying is that being able to easily make existing CPython extension modules compile against numpy is very important to people. If there is a 20% slowdown or a 10% speedup of the C extension, in many cases that is no big deal. Most importantly it would put PyPy on rather equal standing with CPython. And then the JITed pure Python code might win out for efficiency, so PyPy might be a net win for many users.

On the other hand, doing research into lazy evaluation, vectorizing and loop restructuring can obviously make numpy faster, but it is more of a tangent than a help to the majority of users, who just want to run their CPython extensions at roughly the same speed under PyPy. Until people can actually run their extensions easily (which I argue is the major value that Python has), I doubt there will be much adoption of PyPy.

Say I can already add lists of floats and take their standard deviation using numpy, using the C extension library. It isn't clear to me why this should be substantially less efficient under PyPy than under CPython.

We see the same issue with Python 3.0 adoption. Personally I think it makes bad language changes, such as getting rid of the string % operator, which I use constantly, so I'd avoid it for that reason. But far more importantly, it can't run a lot of the libraries I use with comparable performance. So it's completely a no-go for me for that reason.

So I am suggesting that optimizing a single library by rewriting it seems a case of premature optimization when most libraries can't even run with PyPy.

It's a tough call, but for me most libraries run under PyPy. There are a few that don't, but I can usually work around that. Regarding numpy - no one wants a slower numpy *no matter what*. Besides, it's not clear whether making numpy behave using cpyext would take less effort than writing it from scratch - the first reasonable subset can be delivered *much* faster when doing a rewrite.

Numpy really *is* special: for all my needs, I want a small subset that performs reasonably well, not the whole thing performing poorly. It's a matter of taste, but it's also much more fun, which matters a lot when people are spending free time on it. Would you rather add the functionality you need, or fix the next obscure segfault?

@Anonymous Clarifying: We're hoping to reuse most parts of numpy (and scipy), especially those written in pure C. The "only" part requiring rewriting is the part that uses the CPython C API, which is mostly the array interface.

Maciej -- I didn't realize large parts of these libraries could be reused. So maybe once the PyPy C extension facilities are working well enough that important 3rd party libraries can be compiled, I'll just switch to PyPy for performance. It sure does sound more fun to make numpy functions compile down to heavily optimized RPython and get big speed gains. But I still maintain that users would appreciate being able to get arbitrary libraries to build in the first place; e.g. if scipy or library X depends on the numpy C interface, and that gets broken in the PyPy numpy implementation, then users won't be able to use their desired library at all. So I guess I'm just arguing that the more C extension modules that can work with numpy, the better. Since if we wanted fast but no libraries, we'd be using C :-).

Maciej (et al.), it looks like this issue isn't clear yet to people. Let's see if I can help.

Numpy is made of 3 "pieces" (it doesn't matter if they are separate pieces or mingled together, they are there): a pure python part, a pure C part, and a C-to-python "glue". All of them are very important to numpy, but the C-to-python glue is special, in that both python and C need to access the same data structures without any conversion or copy (otherwise it will be slow). I'm not sure exactly what numpy is doing for this "special glue" part, but that's the point where pypy suffers: of course pypy works just fine with pure python, and doesn't "care" at all about the C sections. So one option is to rewrite the C-to-python pieces of numpy. I'm sorry but it's still unclear to me if you also want to rewrite the C part or not (here you said kind-of-yes: http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html?showComment=1304533136864#c3499269873134208179 and here you said no: http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-and-roadmap.html?showComment=1308734098907#c2151221303214453177 so probably you should clarify better)

Now, if I understand it right, your plan is to fork numpy for this purpose (either rewriting the C-to-python glue only, or the C part also). I believe this will fail, and the reason is pretty simple: first, even before you start, you already say that you don't have the people/money/time to commit to this project. Second, maintaining a fork is a huge, huge task. You might easily introduce bugs, break features, etc - while people are expecting something that "just works" as a drop-in replacement, so even an "almost success" from a technical point of view can be a big failure for adopters if it doesn't behave. Last, but not least, numpy is a moving target, and you'll always be playing catch-up. Is this the game you want to play?

Now, I don't want to tell you what you have to do for fun, but if you want to have a chance of success, you have to change the "politics" of your plan. I trust you that technically your plan is fine, but rather than implementing it within a numpy fork (or worse: a rewrite), I suggest that you work with the numpy and/or CPython community to see if you can write a wrapper around cpyext (or whatever they are using for C-to-Python glue). This wrapper (at compile time) should either become cpyext (or whatever) if you are using CPython, or become "something else" if you are using pypy. If you persuade the numpy people to use this wrapper, you'll have the same numpy code base working as-is on CPython and pypy. Sure, you will not be exploiting the faster-than-C capabilities of pypy, but you can get there more smoothly: improving the speed one feature at a time, while the rest of the framework is still working and thus useful, thus increasing its user base and the people interested in it (some of whom may become contributors).

Instead your plan sounds like: implement one feature at a time, while the rest of the framework doesn't work and thus nobody uses it in production, let alone cares about its speed. On top of which, you'll be trying to catch up with numpy.

Here is a completely different approach taken by IronPython for Scipy+Numpy compatibility:

http://www.johndcook.com/blog/2009/03/19/ironclad-ironpytho/

It's basically a bidirectional FFI. Have a CPython and an IronPython both running, and wrap objects so that IronPython objects can be used by CPython and vice versa. This requires some platform specific binary level compatibility, in their case, DLL hacking, to allow the FFI to work in both directions.

It seems like that approach should be practical for getting large libraries such as Scipy or Numpy working in Pypy, since it's already been demonstrated to work for IronPython.

The above roadmap proposes instead speeding up the core array object by coding it in RPython.

But I wonder if these two approaches could work together. For example Numpy could be configured to use ordinary CPython array objects, or PyPy compiled RPython array objects. Then the FFI just has to take care to wrap objects appropriately that are in the "other interpreter".