Imagine Pythons and… data analysis

After my colleague spoke about dragons, I am going to stay in the reptile group, and introduce some snakes into the mix. Oh, and there will be some cute Pandas later on!

First: an apology. I am aware that the name of the programming language was inspired by the fantastic Monty Python. The pun was just too good to pass up.

Second: another apology. This post will not contain many programming examples – just a short one and a rant about unnecessary obstacles.

Like many researchers in data-driven science, I count programming tools among the most important tools I use. In our group, almost everyone in the natural sciences uses them.

I am working mostly with hydro(geo)logic time series, which – for Austria – are available at ehyd.gv.at. Now, a single CSV file can of course be easily read and analyzed with Excel or a similar program such as LibreOffice Calc. But reading and analyzing 1400 files? We need something better. And even if we managed to cobble them together into one big file, we would still need a better tool.

The limits of spreadsheets

One of those better tools is the programming language Python, which, in combination with the excellent pandas and matplotlib packages, could theoretically read and plot all my 1400 time series with a very simple script (simplified here for the sake of brevity):
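A minimal sketch of what such a loop might look like – the file names, the directory, and the “level” column are invented stand-ins for the real ehyd data, and two throwaway demo files substitute for the 1400 real ones:

```python
import glob
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render plots to files, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Demo setup: write two tiny CSVs as stand-ins for the ~1400 ehyd files.
# File names, paths and column layout are assumptions for this sketch.
tmpdir = tempfile.mkdtemp()
for name in ("well_1.csv", "well_2.csv"):
    with open(os.path.join(tmpdir, name), "w") as f:
        f.write("date,level\n2020-01-01,1.2\n2020-01-02,1.3\n")

# The actual loop: read every file, plot it, and collect the series.
series = {}
for current_csv in glob.glob(os.path.join(tmpdir, "*.csv")):
    hydro_ts = pd.read_csv(current_csv, parse_dates=["date"], index_col="date")
    name = os.path.splitext(os.path.basename(current_csv))[0]
    series[name] = hydro_ts["level"]
    hydro_ts.plot(title=name)
    plt.savefig(os.path.join(tmpdir, name + ".png"))
    plt.close()

# One wide DataFrame: much easier to analyze than 1400 separate files.
combined = pd.concat(series, axis=1)
```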

in a few seconds, and then use the nine lines that are not concerned with plotting to build a much easier-to-handle DataFrame on which to base further analysis.

You might remember that I said “theoretically” above. This is mainly because of the following:

In order for the hydro_ts = pd.read_csv(current_csv… line to actually produce anything besides crashes, errors and unusable data, I need a few dozen lines of snake charming to turn a simple CSV file provided by ehyd into a CSV file that can actually be read automatically. While the German word “Lücke” (“gap”) might be a nice gesture for someone scrolling through the 20,000-plus entries of a file containing 60 years of daily data (who on earth does this?…), it poses the “interesting” challenge of handling German umlauts and teaching Python to see “Lücke” as a valid symbol for the generally accepted NaN. Once that is finally done, replacing the decimal comma with a decimal point is rather trivial.
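A sketch of the snake charming in question: pandas can in fact be taught all of this in one call, assuming a semicolon-separated, header-less file with a decimal comma. The sample data here is made up, and real ehyd files may additionally need an encoding argument (e.g. "latin-1") – an assumption, since this sketch reads from an in-memory buffer:

```python
import io

import pandas as pd

# Tiny made-up sample in an ehyd-like format: semicolon-separated,
# decimal comma, and the German word "Lücke" ("gap") marking missing values.
raw = io.StringIO(
    "01.01.1960;1,25\n"
    "02.01.1960;Lücke\n"
    "03.01.1960;2,50\n"
)

hydro_ts = pd.read_csv(
    raw,
    sep=";",                  # semicolon as the field separator
    header=None,              # the file has no header row
    names=["date", "level"],
    na_values=["Lücke"],      # teach pandas that "Lücke" means NaN
    decimal=",",              # decimal comma -> decimal point
)
hydro_ts["date"] = pd.to_datetime(hydro_ts["date"], format="%d.%m.%Y")
```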

What was not trivial, however, was finding out the hard way that some files are formatted differently from the others: they suddenly have a third column, but only for a few days in the middle of the file, which is there to explain the nature of the “Lücke” – yet comes without a header at the top of the time series.
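One way to defuse the surprise third column is to not let pandas tokenize the raw file at all, but to trim every row down to two fields first – a sketch with invented sample data (the content of the extra field is an assumption):

```python
import csv
import io

import pandas as pd

# A made-up file where a third, header-less column appears only in the
# middle, carrying a remark about the "Lücke" (assumption for this sketch).
raw = io.StringIO(
    "01.01.1960;1,25\n"
    "02.01.1960;Lücke;Messgerät ausgefallen\n"
    "03.01.1960;2,50\n"
)

# Keep only the first two fields of every row, whatever else is in there.
rows = [row[:2] for row in csv.reader(raw, delimiter=";")]
hydro_ts = pd.DataFrame(rows, columns=["date", "level"])

# Decimal comma -> point; errors="coerce" turns "Lücke" (and any other
# non-numeric leftovers) into NaN.
hydro_ts["level"] = pd.to_numeric(
    hydro_ts["level"].str.replace(",", ".", regex=False), errors="coerce"
)
```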

But this is also solvable. And after way more work than should have been necessary, there it finally is: a nice, self-written program to analyze and plot “my” data.

That is: until I add some more data, only to realize that monthly data is not always saved as Date; Data but sometimes as Startdate; Enddate; Data – and of course with no clear, machine-readable headers…
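The two monthly layouts can at least be told apart by peeking at the field count of the first row – a sketch, assuming header-less files in either Date; Data or Startdate; Enddate; Data form (the sample data and column names are invented):

```python
import csv
import io

import pandas as pd


def read_monthly(buf):
    """Read a monthly ehyd-style series whose layout is only known at runtime.

    Assumption: exactly two layouts exist, "date;value" and
    "startdate;enddate;value", and neither has a header row.
    """
    text = buf.read()
    # Peek at the first row to decide which layout we are dealing with.
    first_row = next(csv.reader(io.StringIO(text), delimiter=";"))
    if len(first_row) == 3:
        names = ["start", "end", "value"]
    else:
        names = ["date", "value"]
    return pd.read_csv(
        io.StringIO(text), sep=";", header=None, names=names, decimal=","
    )


# Made-up samples, one per layout.
two_col = io.StringIO("01.1960;1,5\n02.1960;2,5\n")
three_col = io.StringIO("01.01.1960;31.01.1960;1,5\n01.02.1960;29.02.1960;2,5\n")

monthly_a = read_monthly(two_col)
monthly_b = read_monthly(three_col)
```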

Another issue is the analysis itself. While journals generally demand that the author(s) describe their methods, it is often a very long way from “we applied this statistical method, and then that one…” to working code that actually reproduces their results, let alone to applying a new method to your own data. But that is a rant for another day…