For the past two months or so, I’ve been slowly migrating my scientific workflow (that’s a fancy way of saying “my chaotic data hacking”) from Matlab ((R) (TM) (C)) to Python. The results are overwhelmingly positive, so I’d like to rant about it a bit. First, some background.

My work typically involves the analysis of tons of remote sensing observations contained in files of various formats (netCDF if I’m very lucky, HDF if I’m lucky, some weird non-standard binary thing if I’m not); all these files span terabytes and terabytes of hard drive space stored in racks in a big temperature-controlled room somewhere high in the sky. I ssh to a central server on which all these drives are mounted, and I usually run code there in whatever language is most convenient for analyzing the data.

Why Matlab

After a few years of this, Matlab emerged as the best solution for several reasons:

interactive sessions let you play with the data and make the analysis algorithms “evolve” (the analysis procedure is often not cast in stone and writes itself as I go along and understand the data better);

the syntax is well-suited to working with numerical arrays (i.e. vectorized code, something also present in f90 but where it sometimes gives buggy results);

powerful input/output facilities: reading netCDF and HDF is as easy as ncload file.nc or hdfread('file.hdf', 'some_variable'), without all the administrative overhead of compiled languages (memory management, static typing, etc). This is an important point: in Fortran it often takes me as long to get the I/O right as to write the actual algorithm.

For me, this means I usually get results quicker with an interpreted language like Matlab’s, even taking into account the higher speed of compiled code like Fortran. A nice side-effect is that working suddenly becomes a lot more enjoyable when I don’t have to spend so much time remembering all the Fortran idiosyncrasies, the differences between compilers (will this code work with ifort/gfortran/g95/pghpf/etc?) and which libraries to link to, fixing messy mixes of f77 and f90 syntax to give a predictable output, etc. Once I run the actual program, I know it would have been faster using a compiled language, but it would have taken me longer to get right and the coding would have been a lot less fun.

Why !Matlab

Now, the problems. Matlab is not free as in speech, meaning you often can’t see the code. Matlab is not free as in beer, meaning our institution owns a limited number of licenses, meaning that during student rush hours you often can’t even launch Matlab at all. The initial goal of Matlab was the analysis of matrices (hence the “mat”), not general arrays, which makes the code look weird in places and explains the FUCKING SEMICOLON you have to append to every instruction to prevent millions of numbers no human will ever be able to read from flashing in front of your eyes. On top of that, you cannot launch standalone Matlab scripts without fishy syntax like “matlab -nodisplay < script.m”. Because of all this (mostly the non-free part), I have been looking for a replacement for some time now. I’ve tried R, Scilab, Octave, and tons of other stuff, but every time I found the language and the plotting capabilities to be worse than Matlab’s when I was hoping for something at least comparable (I guess I was also somewhat reluctant to learn another closed-system language).

But somewhere I always hid a secret wish… to use Python. I love its syntax, focus on simplicity and readability, but it lacked by default any capability for serious number crunching, so I had been patiently waiting from the sidelines for the maturation of Python packages for scientific work. Well, the stars are now aligning.

A year ago or so I took another look at Python’s scientific stack. I liked what I saw: everything good in Matlab (see above), without the annoyances, and free. But trying to get things running I got lost in the mess of version numbers and a never-ending chain of interdependent packages, which is even more fun when you have no root access and the machine you’re using comes with Python 1.5.2 and that’s it, no chocolate for you. Basically, you have to recompile everything by hand, and make sure you don’t forget that crucial compilation flag somewhere! Unfortunately I had other things to do (like actual work), so I reluctantly let go of the idea and stuck with Matlab.

Then came SAGE

Fast-forward to 2 or 3 months ago, when I stumble upon SAGE. SAGE (apart from being an RSS aggregator for Firefox and a satellite instrument) is basically a wrapper around Python with tons of scientific packages added, all nicely pre-compiled into tasty binaries just for you by very nice people (which involves tons of work, not as simple as it sounds). These goodies come in gzipped tarballs that you dump into your $HOME. You can then launch the sage program, which handles regular Python just fine and includes all the modules I was longing for: NumPy (easy, efficient handling of huge numerical arrays with slicing and dicing), SciPy (input/output and scientific functions), Matplotlib (lots of plotting tools with lickable, anti-aliased output and a syntax almost identical to Matlab)! Even IPython is there, meaning you get a comfortable interactive experience with tab completion on files, objects, dictionaries and tons of other niceties! Since SAGE lets you install additional packages with a single command, it’s a piece of cake to add wxPython to get direct-to-screen plotting within your interactive session. Apotheose! Great success! Matlab without Matlab. AND it’s Python, meaning you’re using an actual, REAL language with object-oriented programming, introspection, dictionaries, etc. And since Python fits your brain, the first code you come up with is most likely the right one. As a bonus, SAGE is available for Linux (32/64), Mac OS X and even Windows (I think) so your code will work everywhere! Bliss.

The best part was when I realized, a few hours later, that I actually didn’t need to use the SAGE program itself… inside the SAGE directory lies a local/ folder containing all the binaries, libraries and Python packages it used. It even contains its own Python 2.5! Set the PATH, LD_LIBRARY_PATH and PYTHONPATH environment variables right and suddenly you have a perfectly consistent installation of everything that’s needed to do scientific work in Python! Other users on the same machine just need to change the same variables, and they can play too! Apotheose²! So in addition to its primary goals of providing a replacement for Mathematica/Maple/etc, SAGE, as a side-effect, provides the whole Python scientific shebang compiled and wrapped up in a nice package, for your pleasure.
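For what it’s worth, here is a minimal sketch of what “setting the variables right” could look like, written in Python rather than shell. The function name sage_environment and the install location /home/me/sage are made up for illustration, and the bin, lib and lib/python2.5/site-packages subfolders of local/ are assumptions about the layout; check what your own SAGE directory actually contains.

```python
import os


def sage_environment(sage_root, base_env=None):
    """Return a copy of the environment with SAGE's local/ tree put first.

    Hypothetical helper: the sub-directory names below are assumptions
    about the layout of SAGE's local/ folder, not documented facts.
    """
    env = dict(os.environ if base_env is None else base_env)
    local = os.path.join(sage_root, "local")
    additions = {
        "PATH": os.path.join(local, "bin"),
        "LD_LIBRARY_PATH": os.path.join(local, "lib"),
        "PYTHONPATH": os.path.join(local, "lib", "python2.5", "site-packages"),
    }
    for name, path in additions.items():
        existing = env.get(name)
        # Prepend, so SAGE's binaries and libraries win over the system ones.
        env[name] = (path + os.pathsep + existing) if existing else path
    return env


# Example with a made-up install location and a tiny starting environment.
demo = sage_environment("/home/me/sage", base_env={"PATH": "/usr/bin"})
for name in ("PATH", "LD_LIBRARY_PATH", "PYTHONPATH"):
    print(name, "=", demo[name])
```

The returned dictionary can be handed to subprocess.call(..., env=demo) to start a script under that environment; in a shell you would export the same three variables in your startup file instead.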

Since Python is pretty smart, new Python modules will then install themselves in the right place with python setup.py install. So go ahead: install Basemap, netcdf4-python, PyHDF, scipy-cluster, PyNGL, whatever you need.

(Sidenote: Other “integrated” Python distributions with a similar focus on scientific analysis are starting to pop up, like Python(x,y) or the Enthought Python Distribution. Travis Oliphant, one of the major architects of the recent NumPy restructuring, is now president of Enthought. They also host the SciPy website; you can’t get more central than that. Interesting stuff should happen there soon. They are a little too Windows-centric, though.)

Success

Migrating my last work project from Matlab to Python has been a success: all the figures in my last paper were generated in Python, they look almost exactly the same as the ones generated in Matlab (just as good or better; fonts are noticeably nicer thanks to anti-aliasing), and the code is just as compact and feels better. It seems like the only thing you could miss from Matlab is its numerous toolboxes, something which is slowly getting fixed within SciPy (I don’t actively use them so I don’t care). Adieu point-virgule!

Of course, now that I’ve been bitten by the Python bug, I’m starting to follow the NumPy, SciPy and Matplotlib mailing lists. Some great things are afoot, like the imminent NumPy 1.1 (previously 1.0.5, including shiny masked arrays, histograms and I/O), the release of Travis Oliphant’s Guide to NumPy in August 2008, lots of integration and standardization efforts between the various components, etc.

I guess the best thing is that it made me excited again about the idea of hacking stuff…

75 Responses to Bye Matlab, hello Python, thanks Sage

Actually, there has been a groundswell of Macs around the place and several new client projects on various Linux distros, so look for more of this sort of goodness to spill over into some nice tools. One of the main points of the EPD is to provide an easy-to-install common set of tools that are all cross platform.

I’ve spent the past few days trying to get a coherent installation of ETS on my 10.5 in various ways, but no luck. The Enthought egg repository is almost empty. Installation from source (svn or stable) stopped with errors. It seems getting everything to work together is pretty tough in the Python world 😉

Personally I use ruby, but I applaud you for your choice. I hope python goes strong and unifies many different concepts. Under the hood, ruby and python are very similar, so the success of one will benefit the other.

A quick speed comparison on image processing tasks showed that “vectorized” code ran at almost identical speed in NumPy and Matlab (and pretty quickly in both). “Non-vectorized” code using for loops takes about 1.5 times as long in Python.
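To make the “vectorized” versus “non-vectorized” distinction concrete, here is a small sketch on the Python side only. It is not the benchmark mentioned above: the function names, the random 100x100 “image” and the gain value are invented for illustration, and the timings will vary from machine to machine.

```python
import timeit

import numpy as np


def brighten_loop(image, gain):
    """Scale every pixel with explicit for loops (the non-vectorized style)."""
    out = np.empty_like(image)
    rows, cols = image.shape
    for i in range(rows):
        for j in range(cols):
            out[i, j] = image[i, j] * gain
    return out


def brighten_vectorized(image, gain):
    """Scale every pixel with one whole-array expression."""
    return image * gain


# A made-up test image: 100x100 random values between 0 and 1.
rng = np.random.default_rng(0)
image = rng.random((100, 100))

loop_result = brighten_loop(image, 1.5)
vec_result = brighten_vectorized(image, 1.5)
same = np.allclose(loop_result, vec_result)

loop_time = timeit.timeit(lambda: brighten_loop(image, 1.5), number=3)
vec_time = timeit.timeit(lambda: brighten_vectorized(image, 1.5), number=3)
print("same result:", same)
print("loop: %.4f s, vectorized: %.4f s" % (loop_time, vec_time))
```

Both functions compute the same array; only the place where the loop runs differs (the Python interpreter in the first case, NumPy’s compiled code in the second).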

Nice post! I’m seriously thinking about switching. The biggest problem is that unless my colleagues switch it will be hard to share code. However… luck favors the bold! Go for it!

You seem to indicate anti-aliasing as a pro for python figures. However, a better practice is to save figures in vector (e.g. eps) rather than raster (e.g. png) formats. This makes the figures in your document look nice regardless of screen or printing resolution.

Hey, very interesting post. I too have migrated from matlab to python, however my advisor uses matlab exclusively, so I can’t give it up altogether. Did you know it is actually convenient to call a matlab session and use both interpreters in parallel (with the help of mlabwrap)? This is great if you want to exchange data back and forth or use the advantages of one interpreter over the other for toolboxes or speedup. I would be interested to know more about speed comparison tests. Someone mentioned above that python takes 1.5x longer than matlab when looping; my experience was the opposite of this. Could someone post a link to that comparison?

“Someone mentioned above that python takes 1.5x longer than matlab when looping; my experience was the opposite of this. ”

As was mine – I guess it depends on what task you’re performing.

The first time I used NumPy was to convert some of my existing code from MATLAB. Now I use MATLAB mostly for computation, and not linear algebra related routines (numerical integration, and some other calculations). I couldn’t figure out a way to fully vectorize it in MATLAB – so I had a for loop (or doubly nested one – not sure).

I did somewhat of a direct conversion to SciPy/NumPy. It was 7 times faster. I think the Python interpreter is simply a lot faster than the MATLAB one.

In cases where I’ve seen others doing vectorized comparisons, the answer is usually what the poster suggested – close to identical speed.

Python has a flaw that I find irritating beyond belief, especially for technical computing. Matlab has it too, with a twist. Many “new” languages inherit too much from C, and this is one of those things that is best left behind.

What I’m talking about is the “Henry Ford” approach to array indexing—you can have any array you want as long as it starts at zero.

One of the tenets of modern language design is that languages should abstract and conform to the programmer’s problem, not forcing the programmer to adapt his/her programming style. Forcing all arrays to begin indexing from zero is bad design, in this sense.

I’ve suggested this on other lists and people have asked, apparently in full naivete, why in the world would anyone want an array that starts with any other index. Thus, one sees that Python was not intended primarily as a technical computing language, but “gets by.”

Others say, adjust your indices in your code to fix the problem. Obviously, that is the only solution (within Python). But it is error prone and tedious to do so. Others say, but having convenient indexing will slow down the program, to which the answer is, no, because after the indexing is adjusted in the user’s code, the program will run even slower than it would if the indexing were computed in some optimal fashion by the compiler.

Also, Python is unsafe with respect to array boundary checking—if a negative-number index is supplied, Python, as a deliberately-designed-in feature, adjusts the index modulo the nominal array length and indexes backwards from the far end of the array!
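The negative-index behaviour described above can be seen with a plain Python list; the variable names below are only illustrative. One nuance: the wrap-around applies only to indices from -1 down to -len(values). Anything further out raises IndexError instead of wrapping a second time.

```python
values = [10, 20, 30, 40]

# Negative indices count backwards from the end of the list.
last = values[-1]
second_last = values[-2]

# Indices below -len(values) do not wrap again; they raise IndexError.
try:
    values[-5]
    out_of_range_raised = False
except IndexError:
    out_of_range_raised = True

print(last, second_last, out_of_range_raised)  # prints: 40 30 True
```

So an off-by-one that produces -1 is silently accepted, which is the risk the comment points at, while a wildly wrong index is still caught.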

After nearly 7 years of working with MATLAB on a daily basis I have to say I hate it as a programming language and would happily move on any day. However, MATLAB comes with such a huge number of toolboxes and functions that it would be very difficult to find a replacement that would work for everyone. Personally, I use MATLAB mostly for signal processing and statistical analysis. What I use includes various filter design tools, spectral estimation routines, curve fitting routines, PDF and CDF related functions, ANOVA and multi-variable regression and sometimes wavelets. I could not find these in the SAGE documentation. Am I missing something?

There is a plethora of filter design and probability functions in SciPy (SAGE “packages” SciPy along with its other stuff and so may not document what is in SciPy very well). The ANOVA function used to be there but is gone because we couldn’t vouch for it. However, there is a new standalone stats package that may contain ANOVA functionality. Wavelets are available in a separate package but are not as feature-complete.

There is an extremely useful tool that I use with sage called the sage notebook. Anyone can test out sage before downloading the massive 2 GB+ file to their hard drive by visiting sagenb.com. I am taking a course at the University of Washington from the main developer of the notebook and find it to be a great tool. Check it out!

Oscar, there is a fairly simple way around the 0-based “array index access.” Python de-sugars the [n] syntax to a method call, namely __getitem__(n). If you override __getitem__ and __setitem__ in a class which inherits from list, you can do your “array index access” however you’d please.
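A minimal sketch of that idea, assuming you only need integer indexing: OneBasedList is a hypothetical class name, slices are deliberately left unsupported to keep the example short, and negative indices keep their usual Python meaning.

```python
class OneBasedList(list):
    """Hypothetical list subclass whose positive integer indices start at 1."""

    def _shift(self, index):
        # Translate a 1-based index into the 0-based one the built-in list expects.
        if isinstance(index, slice):
            raise TypeError("slices are not supported in this sketch")
        if index == 0:
            raise IndexError("index 0 is invalid for a 1-based list")
        return index - 1 if index > 0 else index

    def __getitem__(self, index):
        return super().__getitem__(self._shift(index))

    def __setitem__(self, index, value):
        super().__setitem__(self._shift(index), value)


temps = OneBasedList([280.0, 285.5, 290.1])
first = temps[1]   # element that a plain list would call temps[0]
temps[3] = 291.0   # overwrite the third (last) element
print(first, list(temps))
```

Note that this only changes the [n] syntax: iteration, len() and functions that expect an ordinary list are unaffected, which is both the appeal and the limit of the approach.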

About the looping speed in Sage: If speed is really necessary, you can write that particular function in Cython. This is a Python to C converter, embedded in Sage. Using it intelligently (avoiding Python wrappers for ints and floats), you effectively get a pure C program where things are suddenly 100x or more faster than a Python loop. Of course, this function can be called directly from Sage/Python, nearly without thinking about any details …

I’m working at the interface between hydrology and climate. My concern is leaving pv-wave (sort of IDL). I started using R and I am quite happy with it, but, as I’ll soon start a completely new project, I sometimes wonder whether I should use python instead of R.

Did you consider R? It is free software, it reads netcdf, it does beautiful plots, …

Hi! I would like to find a free application able to replace Matlab for my work. I don’t mind semicolons nor programming languages (well, not really: I hate *pointers!!), I only want it to do the work for me.

I use Matlab R/C/TM to fit a set of complex functions to experimental data. This involves handling complex numbers and a lot of ill-posed calculations. I’ve checked out the Sage reference manual, but I haven’t found what I need. Does anyone know any software suitable for this work? I think that Sage, Octave and Scilab don’t include complex fitting functions.

Hmmm, just started using Scilab after years of using Matlab. I got a new job but the new bosses are reluctant to invest in Matlab. Understandable, since a standalone license costs over 1350GBP without any toolboxes and the scripts I write can only be used on my computer. Scilab seems okay (and it seems maybe a bit more advanced than matlab) but indeed the graphics are a bit disappointing and user friendliness is not in their dictionary.
After reading the OP’s post I will give SAGE a try.

Excellent stuff! I am in the migration stage myself from Matlab to scientific python and will definitely be coming back for info along the way. I also work with remote sensing data and feel the pain in regard to just pushing raw data around.

My initial excitement about SAGE as a free Matlab alternative turned into disappointment on discovering that I need almost 1GB of supportive VMWare software to get SAGE to run on a Windows machine; essentially emulating Linux on Windows (methinks). Not yet convinced that the effort is worthwhile. In my book, Matlab still rules.

Ben: check out Python(x,y) and the Enthought Python distribution, then, they run natively on Windows. They probably offer pretty much the same advantages as Sage if you are looking only for a Python distribution with scientific modules included, and not the full agglomerate of specialized math software that Sage is.

Well, I also want to move to Python from MATLAB, but it is not so easy. My background is computer vision. There’s the OpenCV wrapper http://wwwx.cs.unc.edu/~gb/wp/blog/2007/02/04/python-opencv-wrapper-using-ctypes/ for Python which I will need. The problem seems to be that the datatypes of images differ among libraries. If I want to plot the image using matplotlib, I suppose I need to convert the IplImage to another datatype.

I used wxPython and it was really great. In a few days, I made a standalone media player for Windows. Learning Python (investing in Python) seems worth it because it can be used in a wider range of fields.

I have been using Python for about a year and a half now, and I love it. I am very experienced in MATLAB as well, and since I’m at a university it’s still freely available for me. I use both languages for their strengths: I still prefer MATLAB for plotting and exploring data, but if I’m going to build an application that requires any interactivity, then it’s all WxPython. The NumPy interface is just a bit more klunky than MATLAB’s, as other comments have noted, and I’ve never been satisfied with matplotlib, particularly since it’s currently incapable of plotting in 3D. But as a programming language, no contest. So, like everything else, it depends on what you’re doing with it. I explicitly tested their relative speeds at one point, and they were equal in my computer vision task at the time.

However, beware the Python OpenCV wrapper! I’ve been using OpenCV off and on since 2002 (in C++) and I’ve always found it to contain obtuse and poorly documented code. As of 2007, the Python wrapper matched neither the C library nor the documentation, in really obvious ways such as the number of arguments to a function (so, which one is no longer needed…?). I spent a month and failed to translate the camera calibration functions into Python, so I hacked and rebuilt the OpenCV library and wrapper to work for me, which is a decidedly suboptimal solution. I also wrote a “helper” utility to translate images back and forth from NumPy to OpenCV formats… element by element, the only way that worked. I’m venting these frustrations partly as a warning and partly in the vain hope that someone involved in the OpenCV project will see and care. All that said, however, I and several acquaintances have written wildly successful computer vision solutions in Python without OpenCV. SciPy+matplotlib does actually provide most of the functionality you need, but its documentation is also pretty sparse (though accurate).

Oscar has a good point, and it’s one of the reasons I like Igor Pro: its waves can be accessed by point number (0-based as I remember), but they can also be given dimensions (range and unit) and accessed that way as well. It makes many algorithms much simpler and graphing more automatic.

If we’re stuck with standard indexing, I tend to prefer R’s usage of using negative indices to exclude values. It’s odd at first, but turns out to be incredibly useful.

And speaking of R, I like Sage but have to admit that it falls far, far short of R in two key areas: 1) documentation, and 2) graphing.

Sage could really use a much better help system that allows better searching, has more complete entries, and includes things like “See also” which is often the fastest way to find something whose name you don’t quite remember.

And Sage graphing needs a lot more control over appearance, plus ease of output to PDF.

At this point, I am using Sage a fair amount, but if CAS features weren’t a part of what I’m doing, I’d prefer R. As Sage grows and matures, I’m sure it will match R in many ways, but it’s not there yet.

When I build sage from source on my ubuntu box and on a Centos system where I lack root access, the sage build of python can’t properly import numpy.

File “/home/afiten1/enyphd10/saskey/sage-3.1.2/local/lib/python2.5/site-packages/numpy/linalg/linalg.py”, line 29, in
from numpy.linalg import lapack_lite
ImportError: libf77blas.so: cannot open shared object file: No such file or directory

This is a question: Is there a good tutorial somewhere going through how to use Sage with large (100s of gigabytes) netCDF files for statistical analysis? That is, how to organize the files in directories, what sage functions to invoke, how to plot some graphs and compute some statistics on the sample.

Python:
Python is a better language for general applications and is free of charge.
Both:
Both are very easy to learn and you will create applications fast.
MATLAB:
Superior desktop tools, better documentation, has 100+ well designed toolboxes and is, from R2008a, better for OO programming.
AND in addition to that, MATLAB has Simulink.

This is great. I have been using a patchwork of programs and shell scripts calling perl scripts calling R scripts, and then plotting everything with the Generic Mapping Tools. It is flexible and free this way, but the thought of wrapping it together in a neater package is appealing, and I am thinking of migrating to scipy/matplotlib/basemap. Do you have a sample script that ingests a netCDF file and builds a plot from some of the contained data? I don’t see an example of this in the documentation.

I wasn’t aware of SAGE, though I have fiddled with SciPy over the years. Even ported a rather large Mathematica package to it back in … geez, 2002 or so. The availability of a distribution certainly makes things easier. Maintaining a distribution of scientific software can be a nightmare.

I just don’t understand: you can’t live with Matlab’s semicolon, and you are happy with Python’s indentation nazi?

About the negative indexing: I think Matlab’s ‘end’ keyword is a much more elegant approach. If python wants to be consistent with its 0-based indexing scheme, then -0 should be the last element while -1 should be the second-last.

I’ve worked with Matlab for a few years now and have been using Python for some stuff of late, so this seems promising.

I’m curious though, after downloading SAGE, how do I set the PATH, LD_LIBRARY_PATH and PYTHONPATH variables to allow access to SAGE functionality and libraries via Python? I’m on a linux OS if it matters. Thanks!

@Chris: “I’m curious though, after downloading SAGE, how do I set the PATH, LD_LIBRARY_PATH and PYTHONPATH variables to allow access to SAGE functionality and libraries via Python?” -> Just start the sage environment via “sage -sh”, that does all the magic for you.

I also work on Remote Sensing and specifically SAR. Fortunately there are GDAL python bindings which load most of the remote sensing formats I use. I would have never been able to use MATLAB with SAR since it can’t easily load CEOS and COSAR just 2 of the strange formats. Python does this seamlessly and is a great help. I am still trying to convince my supervisor to make a switch.

I started using python in the 90’s for large scale sonar signal processing. We used both MATLAB and python. We wrote nice python wrappers for the data. At the time numerical python wasn’t as well developed. Since 2000 I’ve been teaching programming using python. I like both the python(x,y) & Enthought distributions. Both are complete. They include ipython, idle, scipy, numpy, matplotlib, & PIL. It’s nice that (x,y) also includes vpython. These are all packages that intro undergraduate students find easy to use.

As for SAGE, I think it is much too specialized for pure math applications. Very quickly the documentation gets into abstract math structures like Rational Fields, etc. In the tutorial, right after plotting and functions, the next topic is Rings. “The integer {…, -1, 0, 1, 2, …}, called ZZ in Sage.”

VPython was briefly mentioned, and maybe a word about it is appropriate, given the interests of the people posting here. VPython (download from vpython.org) makes it amazingly easy to write programs that produce navigable real-time 3D animations as a side effect (!) of Python computations. It is used in both education and research.

I have previously programmed in over a dozen languages, such as FORTRAN, COBOL, C, and Java – now I am learning Python.

Python indentation is a brilliant way to not only avoid lots of curly brackets, but to ensure that the indented code is actually semantically meaningful.

In languages such as C and Java, people can indent code to make it look pretty, but careless people can mix up the indentation and cause confusion. Python’s use of indentation avoids this confusion.

As for zero-based arrays, when I was learning ALGOL68, I hated them and defined all arrays to start with 1. After becoming proficient in C, and later Java, I began to appreciate the value and reasoning behind starting arrays at zero. Python follows the lead of C and Java in starting arrays at zero, so adopting arrays starting at one would have confused programmers. In C, starting arrays at zero simplifies the semantics of manipulating memory pointers and is also more efficient (with one-based arrays, one has to be subtracted from each index to locate the position of an element in the array).

This business of the first index of an array, reminds me of an engineer who complained that BBC BASIC required radians for trig functions as he ‘had’ to complicate his code by constantly converting from degrees to radians, such as x = SIN(45*PI/180) + COS(30*PI/180) etc.!!!
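For comparison, the same expression in Python, whose math module also works in radians. math.radians is part of the standard library; the variable names are just for illustration.

```python
import math

# The engineer's expression, with the degree-to-radian conversion written by hand.
by_hand = math.sin(45 * math.pi / 180) + math.cos(30 * math.pi / 180)

# The same expression using the standard library's conversion helper.
with_helper = math.sin(math.radians(45)) + math.cos(math.radians(30))

print(by_hand, with_helper)
```

The two values agree to floating-point precision, so the conversion is a one-line convenience rather than a real complication.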

“trying to get things running I got lost in the mess of version numbers and a never-ending chain of interdependent packages, which is even more fun when you have no root access and the machine you’re using comes with Python 1.5.2 and that’s it, no chocolate for you.”

It sounds like you are white-collar, i.e. you could afford a $500 (or less) laptop and a free OS like Ubuntu. I had no problems with scipy, numpy, iPython, and I’m a chump at computers.

I believe there is also a build of Puppy Linux (also free) which has all the mathematical software pre-installed and runs off a live CD or flash drive.

I’ve been waiting in the wings for one of Octave, Yacas, Sage, and-on-and-on to get enough *oomph* (like the Ubuntu project) to become “The One, Basically Dominant Free Software” that I would then spend the energy to learn. Your post, I think, may have pushed me far enough toward Sage.