I'm using Python to do some Bayesian statistics. I've coded it up in Python and in Fortran 95. The Fortran code is waaay faster... like a factor of 100. I expected the Fortran to be faster, but I was really hoping that by using NumPy I could get the Python code to come close, maybe within a factor of 2. I've profiled the Python code, and it looks like the majority of the time is spent doing the following things:

scipy.stats.rvs: taking a random draw from a distribution. I do this ~19,000 times for a total time of 3.552 s

numpy.linalg.slogdet: computing the log of the determinant of a matrix. I do this ~10,000 times for a total of 2.48 s

numpy.linalg.solve: solving a linear system. I call this routine ~10,000 times for a total time of 2.557 s

In total my code runs in ~11 s, whereas my Fortran code takes 0.092 s. Is this a joke? I'm really not trying to be unrealistic in my expectations of Python, and I certainly don't expect my Python code to be as fast as Fortran... but slower by a factor of more than 100? Python's gotta be able to do better than that. Just in case you are curious, here is the full output of my profiler:

Could you post a snippet of the python code?
–
marshall.ward Mar 19 '12 at 23:26

Do you call scipy.stats.rvs and the others inside the loop as global methods, with the actual dots in the statements, or did you assign them to local variables, like rvs = scipy.stats.rvs? In newer Python versions this should be optimised, but in 2.6 I'm not sure; it might make a difference...
–
Aprillion Mar 19 '12 at 23:42

There are two factors here: the code and the coder. Without the code, it is difficult to say which one is mainly responsible for the coder's astonishment.
–
joaquin Mar 20 '12 at 0:17

Just a small comment to start: you can definitely replace the set of calls "cand = norm.rvs(..)" inside the loop over j with a single call, "cands = norm.rvs(loc=rho_draws[i-1], scale=candidate_sig2_rhos)". That should save you at least some of the time spent in .rvs().
–
sega_sai Mar 20 '12 at 3:55

Can you add a full link to code that can be run, including the data? Otherwise it's a bit useless.
–
fijal Mar 20 '12 at 10:16

6 Answers

It is hard to tell exactly what's going on with your code, but my suspicion is that you just have some data which is not (or cannot be) very vectorized.
Obviously, calling .rvs() 19,000 times in a loop is going to be far slower than a single .rvs(size=19000). See:

In [5]: %timeit x=[scipy.stats.norm().rvs() for i in range(19000)]
1 loops, best of 3: 1.23 s per loop
In [6]: %timeit x=scipy.stats.norm().rvs(size=19000)
1000 loops, best of 3: 1.67 ms per loop

So if your code or algorithm really isn't vectorized, it is entirely expected to be slower than Fortran.
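
The same point applies to the linear-algebra calls. In newer NumPy releases (1.8 and later) the numpy.linalg routines broadcast over leading dimensions, so 10,000 independent slogdet/solve calls can become one stacked call each. A minimal sketch, with made-up arrays standing in for the matrices from the question:

import numpy as np

# Hypothetical stand-ins for the ~10,000 matrices and right-hand sides.
mats = np.random.rand(10000, 5, 5) + 5.0 * np.eye(5)  # keep them well-conditioned
rhs = np.random.rand(10000, 5, 1)

# One broadcast call each, instead of 10,000 Python-level calls (NumPy >= 1.8):
signs, logdets = np.linalg.slogdet(mats)  # both results have shape (10000,)
xs = np.linalg.solve(mats, rhs)           # result has shape (10000, 5, 1)

Of course, if each matrix depends on the previous MCMC draw, this kind of batching isn't possible, which is exactly the "cannot be vectorized" situation described above.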

Check out the performance page created by the SciPy/NumPy folks. There are a number of remarkably easy extras that foster very fast code, among them: (a) using the weave module, especially its inline and blitz options; (b) using Cython to write some of your functions in C while still being able to call and use them from Python.
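
As a taste of the Cython route: a tight scalar loop can be compiled to C with very little ceremony. A minimal sketch (the file and function names here are made up for illustration):

# fastloop.pyx -- build in place with: cythonize -i fastloop.pyx
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_sq(double[:] x):
    # Sum of squares over a 1-D float64 array, executed as a plain C loop.
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total

After compiling, import fastloop and call sum_sq on a float64 NumPy array; the loop runs at C speed instead of interpreting each iteration.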

I do a lot of large-scale scientific computing work in Python for statistics, finance, and (in grad school) computer vision. The reason Python is excellent for these kinds of problems is not that my naive, first-hack code yields the fastest solution, but that in Python I can easily interface with so many other tools. I can easily issue Linux commands to other programs, easily read and parse most data files, easily interface with SQL and other database software; I have all of the R statistics libraries available, use of OpenCV commands (in much, much nicer syntax than the C++ version), and much more.

When the important part of my task was to manipulate a new dataset and get my hands dirty, feeling out the nuances of that data, Python's ease of programming, along with matplotlib, made it the better choice. Later on, when I need to scale things up, I can always use PyCUDA, Cython, or just rewrite things in C++ if high-end performance is required. Since most machines now have multiple cores, the multiprocessing module, as well as mpi4py, lets me quickly and cheaply turn annoying for-loop-style tasks into much shorter ones, without needing to migrate to C++.
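
As an example of the multiprocessing route: independent chains or replicate runs map onto a worker pool in a few lines. A minimal sketch; run_chain is a made-up placeholder for whatever one independent run looks like:

import random
from multiprocessing import Pool

def run_chain(seed):
    # Placeholder for one independent simulation/MCMC run.
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) for _ in range(100000))

if __name__ == "__main__":
    pool = Pool()                            # defaults to one worker per core
    results = pool.map(run_chain, range(8))  # 8 independent runs in parallel
    pool.close()
    pool.join()
    print(results)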

In short, the real utility of Python doesn't come from the language all by itself, but from becoming really proficient with the add-ons and extras that let you cheaply make your particular set of common problems run quickly on the data sets that matter day-to-day.

Real-time embedded communications software is going to be written in C++ for a long time to come... and the same goes for high-frequency trading strategies. But then again, professional solutions to those kinds of problems are not really what Python is meant for, and in some cases folks prefer unusual solutions for that stuff anyway.

Well, I'm not super experienced with Python, but so far I love it and want to use it for everything. The thing that baffles me here is that the speed bottlenecks seem to be calls to NumPy linear algebra routines, which I thought were basically calls to BLAS/LAPACK. That should be as fast as it gets, right? P.S. to all: thanks for your speedy and insightful comments. I really appreciate the help.
–
legoses Mar 20 '12 at 2:46

It might be worth noting that the SciPy performance page is quite a bit outdated. They don't compare identical algorithms (that makes a huge difference in my runs), and the benchmarks were run on ancient hardware (or at least hardware significantly different from my new Xeon).
–
fijal Mar 20 '12 at 11:03

I'm happy to use f2py and Cython if it will help... but in my case, what would I convert to Cython or Fortran code? The routines that are taking the longest are the calls to LAPACK/BLAS. Those should be just as fast from Python, right?
–
legoses Mar 20 '12 at 13:34

Get rid of the two for loops and two list comprehensions by replacing them with NumPy functions and constructs that operate on numpy.ndarrays. Also, do not print in between the computation; that is slow too. You can probably get a 10-50x speed increase just by following the above advice.
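
To illustrate the kind of rewrite meant here (the loop below is a made-up stand-in, not the code from the question):

import numpy as np

x = np.random.rand(100000)

# Loop style: one interpreted Python operation per element.
y_slow = [xi * xi + 2.0 * xi for xi in x]

# NumPy style: the same arithmetic as a few whole-array operations.
y_fast = x * x + 2.0 * x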

You usually shouldn't use NumPy or SciPy to compute scalar values one at a time; use 'ordinary' Python for that. Extending the example provided by @sega_sai:

In [11]: %timeit x = [normalvariate(0, 1) for i in range(190)]
1000 loops, best of 3: 274 µs per loop
In [12]: %timeit x = [scipy.stats.norm().rvs() for i in range(190)]
10 loops, best of 3: 180 ms per loop
In [13]: %timeit x = scipy.stats.norm().rvs(size=190)
1000 loops, best of 3: 987 µs per loop

It is faster if you create the frozen distribution once and bind its rvs method to a local name:

In [14]: rvs = scipy.stats.norm().rvs
In [15]: %timeit x = [rvs() for i in range(190)]
100 loops, best of 3: 3.8 ms per loop
In [16]: %timeit x = rvs(size=190)
10000 loops, best of 3: 44 µs per loop

Also note that the PyMC documentation complains about the speed of SciPy's probability distributions:

"Based on informal comparison using version 2.0, the distributions in PyMC tend to be approximately an order of magnitude faster than their counterparts in SciPy (using version 0.7)"