In science we often have to process large datasets. For that reason, the performance of a langage has always been the most important criteria. However, even though using compiled langages like Fortran or C is the best choice performance-wise, there are a lot of good reasons to use an interpreted langage like instead. I will give a general overview of why you could use an interpreted langage instead of your favourite compiled langage and why it's a good idea. I'll dive into more details in a second part, detailing my own setup using Python.

Interpreted langages, as opposited to compiled langage do not need a compilation step to be run. When dealing with data, you frequently encounter heterogenous data, or data from different sources that need similar yet different codes.

Classical use case

Imagine you're working on the correlation between the weather and the amount of pollution in the air. You'd have a dataset giving you the weather (rainy, sunny, …) as a function of time as well as another dataset giving you information about air pollution. Using a compiled langages, you would probably :

write a bunch of code that reads the weather dataset

compile it

run it and wait for the output and go back to 1 to correct errors, …

write a bunch of code that reads the pollution dataset

compile it

run it and wait for the output, raging because you are reloading and recomputing everything from the first dataset, and go back to 1 to correct errors, …

cheer because you've loaded everything

write a bunch of code that actually does something of your data

compile it

run it, and go back to 1, until you have your data loaded and analysed

wonder how to plot your data

Ccompiled langage become very good when the time required for steps 3, 7 and 12 is bigger than the time required for all the other points.

If you're using an interpreted langage though, you would do (or I think you should do!) something like:

load your payload of external libraries

write a bunch of code to read your weather dataset

launch it and correct your error, going back to 2

write a bunch of code to read your pollution dataset

launch it and correct your errors, going back to 4

write a bunch of code that actually does something of your data

run it and correct your errors, going back to 6

do your plots in the langage you've been using so far

This is achieved thanks to the fact that once loaded, you don't need to reload data. For small datasets, this is not of much help but when loading start taking seconds, it's a real pain to have to reload it each time. One final advantage is also that you have plotting ability and high-level data analysis all in the same place.

Python rocks

My python setup is heavily relying on the jupyter notebook http://jupyter.readthedocs.org/en/latest/ with a combination of pandas for data loading and analysis, numpy for mathematics and matplotlib for the plots. Here is an overview of the features of Jupyter I'm using. The python kernel for jupyter is called ipython.

Jupyter

Jupyter is a web application that help you manage your project. In jupyter's notebooks, you write your code in cells that you can then execute. There a lot of reasons why I think jupyter is amazing.

Jupyter magic

Jupyter is actually language independant. I'm using it as a frontend for python (using ipython), but it can embed a ridiculous amount of different languages: C, Julia, R, OCaml, Java, C#, …

Since the notebook is only a webpage, anyone who has a web browser can actually see your notebook ; it means sharing your results is as easy as one click. This is especially useful since you can do literate programming, detailing the step of your analysis in the middle of your code.

The notebooks provided by jupyter are based on a run-per-cell method. This means that you write code in a cell, then execute the cell, creating some output and eventually populating the global scope. But you can also insert "formatted text" cells, with embedded LaTeX formulae, images, movies, … Whatever a web-browser can handle!

Jupyter %magic

Each jupyter cell is independant and the way it is executed can be customized using 'magics'. This extremely powerful. For example, I had to write to find the eigenvalues of 200,000 matrices. Using jupyter, I only added '%%cython' at the beginning of my cell, turning it into Cython-compiled code! All the functions in the cell are then available to the global scope without needing to create another file, compile it, import it, …

I also use the magics to run cells on different jupyter instances, to wrap the cell in a function (to prevent it from polluting the scope), to dump data, …

%matplotlib notebook

If you add this line in a cell and execute it, any figure generated by matplotlib will be drawn in the browser, with interactive controls (zoom, move, download plot).