
In the mixed environment of ScraperWiki we make use of a broad variety of tools for data analysis. Data Science at the Command Line by Jeroen Janssens covers tools available at the Linux command line for doing data analysis tasks. The book is divided thematically into chapters on Obtaining, Scrubbing, Exploring, Modeling, and Interpreting Data, with “intermezzo” chapters on parameterising shell scripts, using the Drake workflow tool, and parallelisation using GNU Parallel.

The original motivation for the book was a desire to move away from purely GUI-based approaches to data analysis (I think he means Excel and the Windows ecosystem). This is a common desire for data analysts: GUIs are very good for a quick look-see, but once you start wanting to repeat an analysis, or even a visualisation, they become more troublesome. And launching Excel just to remove a column of data seems a bit laborious. Windows does have its own command line, PowerShell, but it’s little used by data scientists. This book is about the Linux command line, and the examples are all runnable on a virtual machine populated with the tools discussed in the book.
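To illustrate that point about laboriousness: dropping a column really is a one-liner at the command line. This is my own sketch rather than an example from the book, and the file contents and column names are invented:

```shell
# Keep only the first and third columns of a small, made-up CSV.
printf 'id,secret,score\n1,x,10\n2,y,20\n' |
  cut -d, -f1,3
# prints:
# id,score
# 1,10
# 2,20
```

No application launch, no mouse, and the command can be replayed verbatim the next time the data arrives.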

The command line is at its strongest in the early steps of the data analysis process: getting data from places, carrying out relatively minor acts of tidying, and answering the question “does my data look remotely how I expect it to look?”. Janssens introduces the battle-tested tools sed, awk, and cut, which we use around the office at ScraperWiki. He also introduces jq, the JSON processor; a more recent arrival, it’s great for poking around in JSON files as commonly delivered by web APIs. An addition I hadn’t seen before was csvkit, which provides a suite of tools for processing CSV at the command line; I particularly like the look of csvstat. csvkit is a Python tool and I can imagine using it directly in Python as a library.
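As a flavour of how these tools combine, here is a small sketch of the sort of inspect-tidy-summarise sequence the book encourages. The data, filenames, and question are my own inventions, not examples from the book:

```shell
# A made-up CSV, created inline for illustration.
printf 'name,region,amount\nalice,north,10\nbob,south,25\ncarol,north,15\n' > /tmp/sales.csv

# Does the data look remotely how we expect? Peek at the header and count rows.
head -n 1 /tmp/sales.csv   # prints the header line
wc -l < /tmp/sales.csv     # prints 4 (header plus three records)

# cut picks out columns, sed drops the header row, awk filters and sums.
cut -d, -f2,3 /tmp/sales.csv |
  sed '1d' |
  awk -F, '$1 == "north" {sum += $2} END {print sum}'   # prints 25
```

Each tool does one small job, and the pipe hands the data along; this is exactly the territory where the command line beats opening a spreadsheet.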

The style of the book is to provide a stream of practical examples for different command line tools, and to illustrate their application when strung together. I must admit to finding shell commands deeply cryptic in their presentation, with chunks of options effectively looking like someone typing a strong password. Data Science at the Command Line is not an attempt to dispel the mystery of these options, more an indication that you can work great wonders on finding the right incantation.
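The classic word-frequency pipeline is a fair example of such an incantation: each command is terse to the point of opacity, yet each does exactly one small job. A sketch (the input text is mine):

```shell
# Find the most frequent word in a line of text.
printf 'the cat sat on the mat the end\n' |
  tr ' ' '\n' |    # put one word on each line
  sort |           # group identical words together
  uniq -c |        # count each group
  sort -rn |       # most frequent first
  head -n 1        # top entry: "the", which appears 3 times
```

Read left to right with the comments it is perfectly sensible; met bare on one line, `tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 1` looks like line noise until you know the trick.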

Next up is the Rio tool for using R at the command line, principally to generate plots. I suspect this is about where I part company with Janssens on his quest to use the command line for all the things. Systems like R, IPython and the IPython Notebook all offer a decent REPL (read-eval-print loop) which converts seamlessly into an actual program. I find I use these REPLs for experimentation whilst I build a library of analysis functions for the job at hand. You can write an entire analysis program using the shell, but it doesn’t mean you should!

Weka provides a nice example of smoothing the command line interface to an established package. Weka is a machine learning library written in Java; it is the code behind Data Mining: Practical Machine Learning Tools and Techniques. The edge to be smoothed is that the bare command line for Weka is somewhat involved, since it requires a whole pile of boilerplate. Janssens demonstrates nicely how to do this by automatically generating autocompletion hints for the parts of Weka which are accessible from the command line.
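The flavour of that boilerplate, and one way of hiding it, can be sketched with a small wrapper function. This is my own illustration, not the book’s mechanism: the jar location is an assumption, though `weka.classifiers.trees.J48` and the `-t` (training file) option are genuine Weka.

```shell
# Bare Weka invocations repeat the classpath and the fully qualified
# Java class name every single time, e.g.:
#   java -cp /usr/share/java/weka.jar weka.classifiers.trees.J48 -t iris.arff

# A small wrapper hides that boilerplate; the jar path is an assumed default.
WEKA_JAR="${WEKA_JAR:-/usr/share/java/weka.jar}"

weka() {
  # First argument: the class name relative to the weka package;
  # remaining arguments are passed through to the tool untouched.
  local cls=$1
  shift
  java -cp "$WEKA_JAR" "weka.$cls" "$@"
}

# Usage (requires a Weka installation and an ARFF dataset):
#   weka classifiers.trees.J48 -t iris.arff
```

Janssens’ autocompletion approach goes further than a wrapper like this, but the underlying itch is the same: nobody wants to retype the classpath and package prefix for every run.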

The book starts by pitching the command line as a substitute for GUI-driven applications, which is something I can agree with to at least some degree. It finishes by proposing the command line as a replacement for a conventional programming language, with which I can’t agree. My tendency would be to move from the command line to Python fairly rapidly, perhaps using IPython or the IPython Notebook as a stepping stone.

Data Science at the Command Line is definitely worth reading, if not following religiously. It’s a showcase for what is possible rather than a reference book on how exactly to do it.

4 Responses to “Book review: Data Science at the Command Line by Jeroen Janssens”

Thank you very much for writing this review, Ian. It’s good to hear that the command line is part of your day-to-day activities at ScraperWiki.

I was a bit surprised to read “It finishes by proposing the command line as a replacement for a conventional programming language with which I can’t agree.” If you can explain where you think I’m proposing this then I shall correct it immediately! In the meantime, please allow me to use this space to provide some context.

If there’s one thing I want readers to take away from the book, it’s that a data scientist should use whatever approach gets the job (or part of the job) done. That could mean R to do some regression, D3 to create an interactive visualization, Go to scrape a wiki, and yes, sometimes the command line. It’s a valuable skill to be able to chop your problem into subproblems, identify when you can best use which approach, and stitch everything together. On the one hand, it would be silly to think that the command line is the best approach for everything. On the other hand, while a programming language can sometimes give you much more speed, power, and flexibility, that doesn’t mean you should use that programming language for everything. For example, it’s perfectly fine to start with the command line to obtain and scrub some data and then continue with IPython Notebook in combination with pandas and seaborn to explore it. Mix and match approaches, be creative, and be practical!

Let me end this babble by saying that I still believe that being able to leverage the power of the command line, and integrate it with your data science workflow, will make you a more efficient and productive data scientist. My suggestion: start with cowsay (http://datascienceatthecommandline.com/#cowsay) and take it from there.

I drew the implication that one should use the command line in preference to a more conventional programming language from p161, in the “Be Creative” section.

I agree entirely with using a range of tools according to the task at hand. I’m intending to make more use of the command line than I currently do. I think there will always be a discussion as to what the best tool for a particular task is, and even whether a “best” tool exists, given that “best” is a qualitative judgement. For example, I might prefer to do something in Python rather than shell because I can write it in a way that might be longer but is more descriptive of what it’s doing. Someone with a better memory for command line options would come to a different judgement.