Monday, June 20, 2011

Speeding up Python (NumPy, Cython, and Weave)

The high-level nature of Python makes it very easy to program, read, and reason about code. Many programmers report being more productive in Python. For example, Robert Kern once told me that "Python gets out of my way" when I asked him why he likes Python. Others express it as "Python fits your brain." My experience resonates with both of these comments.

It is not rare, however, to need to do many calculations over a lot of data. No matter how fast computers get, there will always be cases where you need the code to be as fast as you can get it. In those cases, I first reach for NumPy, which provides high-level expressions of fast low-level calculations over large arrays. With NumPy's rich slicing and broadcasting capabilities, as well as its full suite of vectorized calculation routines, I can quite often do the number crunching I need with very little effort.

Even with NumPy's fast vectorized calculations, however, there are still times when either the vectorization is too complex, or it uses too much memory. It is also sometimes just easier to express the calculation with a simple loop. For those parts of the application, there are two general approaches that work really well to get you back to compiled speeds: weave or Cython.

Weave is a sub-package of SciPy and allows you to inline arbitrary C or C++ code into an extension module that is dynamically loaded into Python and executed in-line with the rest of your Python code. The code is compiled and linked at run-time the very first time the code is executed. The compiled code is then cached on-disk and made available for immediate later use if it is called again.

Cython is an extension-module generator for Python that allows you to write Python-looking code (Python syntax with type declarations) that is then pre-compiled to an extension module for later dynamic linking into the Python run-time. Cython translates Python-looking code into "not-for-human-eyes" C code that compiles to reasonably fast machine code. Cython has been gaining a lot of momentum in recent years, as people who have never learned C can use it to get C speeds exactly where they want them, starting from working Python code. Even though I feel quite comfortable in C, my appreciation for Cython has been growing over the past few years, and I am now an avid supporter of the Cython community and like to help it whenever I can.

Recently I redid the example that Prabhu Ramachandran first created several years ago, which is reported here. This example solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function, u, where ∇²u = 0, given some fixed boundary conditions.

This code needs many iterations to converge to the correct solution, so it takes a very long time to run. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations. The pure Python solution took an estimated 560 seconds (about 9 minutes) to finish (using IPython's %timeit magic command).
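For reference, here is a minimal sketch of the kind of pure-Python update loop described above. It is not the original listing: the function name py_update, the parameters dx2 and dy2 (squared grid spacings), the list-of-lists grid, the solve driver, and the boundary condition are all assumptions made for illustration.

```python
def py_update(u, dx2, dy2):
    """One in-place sweep over the interior points of the grid u (list of lists)."""
    nx, ny = len(u), len(u[0])
    denom = 2.0 * (dx2 + dy2)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            # Weighted average of the four neighbours. Because u is updated
            # in place, u[i-1][j] and u[i][j-1] already hold this sweep's values.
            u[i][j] = ((u[i + 1][j] + u[i - 1][j]) * dy2 +
                       (u[i][j + 1] + u[i][j - 1]) * dx2) / denom


def solve(update, n=100, n_iter=8000):
    """Hypothetical driver: hold one edge at 1.0 and apply `update` n_iter times."""
    dx = dy = 1.0 / (n - 1)
    u = [[0.0] * n for _ in range(n)]
    u[0] = [1.0] * n  # assumed boundary condition: first row fixed at 1
    for _ in range(n_iter):
        update(u, dx * dx, dy * dy)
    return u
```

Calling solve(py_update) runs the full 8000 sweeps on a 100x100 grid; on a small grid such as solve(py_update, n=5, n_iter=200) it finishes almost instantly, which is handy for checking the logic.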

NumPy solution

Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution. The NumPy update code is:
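A minimal sketch of such a sliced update follows. The name num_update comes from the text; the signature and the parameters dx2 and dy2 (squared grid spacings) are assumptions, so treat this as an illustration rather than the exact original listing.

```python
import numpy as np


def num_update(u, dx2, dy2):
    """Vectorized update of every interior point of the 2-d array u, in place."""
    # The whole right-hand side is evaluated into temporary arrays before the
    # assignment, so each point is computed from the previous sweep's values.
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2.0 * (dx2 + dy2))
```

Each slice selects the interior shifted by one cell in one direction, so a single expression replaces both explicit loops.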

Using num_update as the calculation function reduced the time for 8000 iterations on a 100x100 grid to only 2.24 seconds (a 250x speed-up). Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.

Quite often it is sufficient to stop there and move on to another part of the code-base. Even though you might be able to speed up this section of code more, it may not be the critical path anymore in your over-all problem. Programmer effort should be spent where more benefit will be obtained. Occasionally, however, it is essential to speed-up even this kind of code.

Even though NumPy implements the calculations at compiled speeds, it is possible to get even faster code. This is mostly because NumPy needs to create temporary arrays to hold intermediate simple calculations in expressions like the average of adjacent cells shown above. If you were to implement such a calculation in C/C++ or Fortran, you would likely create a single loop with no intermediate temporary memory allocations and perform a more complex computation at each iteration of the loop.
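To make the point about temporaries concrete, the sketch below performs the same arithmetic as a sliced NumPy update but writes every intermediate result into two caller-supplied scratch arrays through the ufunc out= argument. The function name num_update_buffered and its signature are hypothetical, and no timing is claimed for it here; it only illustrates where the intermediate arrays come from.

```python
import numpy as np


def num_update_buffered(u, dx2, dy2, a, b):
    """Sliced update of u's interior that reuses the scratch arrays a and b.

    a and b must have shape (u.shape[0] - 2, u.shape[1] - 2) and must not
    share memory with u.
    """
    np.add(u[2:, 1:-1], u[:-2, 1:-1], out=a)  # sum of vertical neighbours
    np.add(u[1:-1, 2:], u[1:-1, :-2], out=b)  # sum of horizontal neighbours
    a *= dy2
    b *= dx2
    a += b
    a /= 2.0 * (dx2 + dy2)
    u[1:-1, 1:-1] = a  # copy the finished sweep back into the grid
```

Even in this form the data is traversed several times, once per operation, whereas a hand-written compiled loop can do all of the arithmetic for a cell in a single pass.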

In order to get an optimized version of the update function, we need a machine-code implementation that Python can call. Of course, we could do this manually by writing the inner loop in a compilable language and using Python's extension facilities. More simply, we can use Cython or Weave, which do most of the heavy lifting for us.

Cython solution

Cython is an extension-module writing language that looks a lot like Python except for optional type declarations for variables. These type declarations allow the Cython compiler to replace generic, highly dynamic Python code with specific and very fast compiled code that is then able to be loaded into the Python run-time dynamically. Here is the Cython code for the update function:

This code looks very similar to the original Python-only implementation except for the additional type-declarations. Notice that even NumPy arrays can be declared with Cython and Cython will correctly translate Python element selection into fast memory-access macros in the generated C code. When this function was used for each iteration in the inner calculation loop, the 8000 iterations on a 100x100 grid took only 1.28 seconds.

For completeness, the following shows the contents of the setup.py file that was also created in order to produce a compiled-module where the cy_update function lived.

The extension module was then built using the command: python setup.py build_ext --inplace

Weave solution

An older, but still useful, approach to speeding up code is to use weave to embed a C or C++ implementation of the algorithm directly into the Python program. Weave wraps the bit of C or C++ code that you write in a template to create, on the fly, an extension module that is compiled and then dynamically loaded into the Python run-time. Weave has a caching mechanism: a new extension module is created, compiled, and loaded only when the code string or the types of the inputs change. The first time code using weave runs, the compilation has to take place. Subsequent runs of the same code load the cached extension module and run the machine code.

The inline function takes a string of C or C++ code plus a list of variable names that will be pushed from the Python namespace into the compiled code. If an extension module has already been built for that code string and those variable types, inline loads it and executes the function; otherwise it creates, compiles, and loads a new extension module before executing the code.

Notice that weave defines special macros so that U2 allows referencing the elements of the 2-d array u using simple expressions. Weave also defines the special C-array of integers Nu to contain the shape of the u array. Similar macros (U1, U3, and U4) are defined to access the elements of u if it were a 1-, 3-, or 4-dimensional array. Although not used in this snippet of code, the C-array Su containing the strides in each dimension and the integer Du defining the number of dimensions of the array are also defined.

Using the weave_update function, 8000 iterations on a 100x100 grid took only 1.02 seconds. This was the fastest of the methods in the original version of this post. Knowing a little C and having a compiler on hand can certainly speed up critical sections of code in a big way.

Faster Cython solution (Update)

After I originally published this post, I received some great feedback in the Comments section that encouraged me to add some parameters to the Cython solution in order to get an even faster solution. I was also reminded about pyximport and given example code to make it work more easily. Basically, by adding some compiler directives that tell Cython to skip some checks at each iteration of the loop, Cython generated even faster C code. To the top of my previous Cython code, I added a few lines:

#cython: boundscheck=False
#cython: wraparound=False

I then saved this new file as _laplace.pyx, and added the following lines to the top of the Python file that was running the examples:

This provided an update function cy_update2 that resulted in the fastest implementation of all (943 ms) for 8000 iterations on a 100x100 grid.

Summary

The following table summarizes the results, which were all obtained on a 2.66 GHz Intel Core i7 MacBook Pro with 8GB of 1067 MHz DDR3 memory. The relative speed column shows each method's run time divided by the NumPy run time, so values below 1 are faster than NumPy.

Method          Time (sec)   Relative Speed
Pure Python     560          250
NumPy           2.24         1
Cython          1.28         0.57
Weave           1.02         0.45
Faster Cython   0.94         0.42

Clearly when it comes to doing a lot of heavy number crunching, Pure Python is not really an option. However, perhaps somewhat surprisingly, NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then weave is a simple and speedy solution. If, however, you are not already familiar with C then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.

Thanks for the writeup. There's also a numexpr project ( http://code.google.com/p/numexpr/ ) which claims to speed up numpy operations. It would be good if you could add it to the mix to see how far we can go with only pure Python user code.

Thanks for the suggestion about numexpr. I definitely thought about numexpr and actually did do a numexpr example --- but in this case I did not get any speed up. In fact, it was slower than the NumPy example (took 4 seconds). Now, I didn't try to investigate if any configurations to the numexpr engine would speed that up. If you have any suggestions that would be very helpful.

The reason I am asking is that my Fortran reimplementation of the *same* NumPy solution (i.e. using arrays instead of loops) is 10.6 times faster. As such (if my benchmark is correct), your conclusion that NumPy can get you "most" of the way to compiled speed would be questionable, because it would be better to simply use Fortran, with NumPy-like array programming, to get a 10x speedup with minimal effort.

But maybe there is some hidden problem somewhere (i.e. some compiler options, lapack (?), who knows).

I've just finished a 4 hour tutorial at EuroPython 2011 on High Performance Python; the slides are online: http://ep2011.europython.eu/conference/talks/experiences-making-cpu-bound-tasks-run-much-faster I covered Python, PyPy, Cython, Numpy (+cython), NumExpr, ShedSkin, multiprocessing, ParallelPython and pyCUDA for the Mandelbrot problem. I'm in the process of writing up the training into a free ebook; it'll be on my blog (http://ianozsvald.com/), all going well, within a week. Ian.

Hi Travis! I'd like to point out that PyPy is very promising in terms of massively speeding up native Python and considerably speeding up Numpy.

I ran your example with the native Python and Numpy update methods, and got the behavior you observe: the speedup is at least two orders of magnitude. Then I wrote a tiny wrapper class around Python lists to emulate 2D arrays, and ran it through PyPy 1.5. At 8000 iterations, it's roughly 2x slower than CPython+Numpy. That is an astounding improvement over native Python!

There is an effort underway to port Numpy to PyPy, but it seems not enough communication is happening between PyPy and Numpy developers. I need Numpy for my job, and I would love to see Numpy incorporate support for PyPy! (I intend to help as well.) I think PyPy has made spectacular progress recently and is the future of Python.

I wonder how well np_inline works. Given the trouble with weave from time to time, it seems like a simple alternative. I haven't used it but saw it on the scipy mailing lists:

http://pages.cs.wisc.edu/~johnl/np_inline/

I'm also curious about Ondrej's benchmarks. In the past, with the original Performance Python article the speed difference with the modified weave/pyrex/fortran was not too much.

The PyPy benchmarks are also very exciting.

I think there is merit in actually spinning this off as a small project in itself where folks can contribute code and add to the list of benchmarks. Some form of a shootout. We could simply open up a small project on github for this? What do you think?

I have created a new Github project called scipy/speed located here: https://github.com/scipy/speed

There, I included a "modular" version of Ondrej's F90 example (compile with f2py). The standard looping construct gave similar results to Cython (0.93s).

However, the "vectorized" F90 code gave the very fastest results, completing the 8000 iterations in 0.57s. This is indeed impressive. It looks like modern Fortran 90 is still the fastest way to compile vectorized expressions.

Nice post; my response is pretty late, but it may help people. You have a bug in most of your implementations, except the numpy solution. The numpy solution does something different from the rest: you can't actually do this update in place and get what you expect, because each iteration through the loop overwrites data that the next iteration expects to be there. That is, the iteration that updates U2(i,j) ruins the input for the next iteration's U2(i,j-1).

Micha, it's been a (very) long time since I did this kind of stuff and my memory is dim on the details. (And Varga's book on "Matrix Iterative Analysis" is sitting on my bookshelf at the office.) That being said, isn't the numpy version akin to a Jacobi over-relaxation (JOR) pass, while the other solutions are akin to a successive over-relaxation (SOR) pass? IFF that is the case (and my memory isn't failing me), then in fact the SOR passes *should* converge faster from a numerical analysis point of view, due to their asymptotic performance. It's true that this means the numpy and other implementations are (in effect) using different algorithms, with the numpy one being the slower performer. In other words, it may not be a bug, it may be a feature! Warning! People who are obsessed about this kind of stuff should definitely look it up in (someplace like) Varga's book, rather than trusting my imperfect memory.

I noticed the same thing -- the NumPy version does Jacobi while the others do Gauss-Seidel. You could do red-black Gauss-Seidel in the pure NumPy version, but I don't think that it is possible to do pure Gauss-Seidel with NumPy. Also, if you look at the results, you'll see that the NumPy version converges more slowly.

The timings are likely still fine, since the number of operations is the same, but the convergence will be worse for the NumPy version.
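The red-black idea mentioned above can be sketched in NumPy as follows. This is an illustration only: the function name redblack_update is hypothetical, and for simplicity it recomputes the full neighbour average for each colour, so it does roughly twice the arithmetic of a plain sliced update.

```python
import numpy as np


def redblack_update(u, dx2, dy2):
    """One red-black Gauss-Seidel sweep over the interior of u, in place."""
    nx, ny = u.shape
    i, j = np.indices((nx - 2, ny - 2))  # interior coordinates
    interior = u[1:-1, 1:-1]             # a view, so writes go into u
    for colour in (0, 1):
        mask = (i + j) % 2 == colour
        # Neighbour average from the current contents of u. On the second
        # pass this already includes the first colour's new values.
        avg = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
               (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2.0 * (dx2 + dy2))
        interior[mask] = avg[mask]
```

Every neighbour of a point has the opposite colour, so each half-sweep can be vectorized safely, and the second half-sweep uses values updated in the first, which a single sliced Jacobi update cannot do.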

I've been playing with this with the python C-API, ctypes, and f2py examples as well and f2py is the fastest.
