No matter how handy graphical user interfaces are, the good old command line remains a useful tool for performing various low-level data manipulation and system administration tasks. It is the fallback when you need to do something that no graphical interface lets you do. Being much more expressive and open-ended than a predefined set of controls, the command shell is the ultimate control environment for your computer.

Data science has become one of the most intensely practised computer applications, so it is no wonder that it also benefits greatly from the hands-on control approach of the command line shell. Data scientist Jeroen Janssens has had the foresight to combine the fundamental operations of data science and the most suitable command line tools into a book that collects many useful practices, tips and tricks for processing and preparing data, called “Data Science at the Command Line” (O’Reilly, 2014).

At its highest abstraction levels, data science involves using models and machine learning to extract patterns from data and extrapolate results from data sets that are often too large to fit in memory at any one time. At a lower level, it involves multiple file formats and just plain hard work to get the data into a fit shape to be analysed, and this is where the command line comes in.

There is only so much you can do with canned tools like text editors, but a world of possibilities opens up when you can chain simple commands together, forming pipelines of data where one command’s output becomes another one’s input. You can also redirect input from a file to a command, and output from a command to a file.
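As a minimal sketch of that idea (the file names and sample words are invented for illustration and are not taken from the book), the following writes a small file into the current directory, redirects it into a pipeline, and redirects the result into a second file:

```shell
# Hypothetical sample data: one word per line, with repeats.
printf 'pear\napple\npear\nfig\napple\npear\n' > words.txt

# Redirect the file into sort, pipe the sorted lines into uniq
# (which drops adjacent duplicates), and redirect the result to a new file.
sort < words.txt | uniq > unique.txt

# Show what ended up in the output file.
cat unique.txt
```

Each command does one small job; the pipe and the two redirections are what turn them into a data flow from one file to another.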

Linux and macOS installations already come with various command shells, but Janssens shows you how to use an environment called the Data Science Toolbox, which uses VirtualBox and Vagrant to set up a self-contained GNU/Linux environment with Python, R and various other tools of the trade on your local machine, without disturbing the host operating system too much.

With real-life examples, Janssens shows you how to use classic Linux command line tools like cut, grep, tr, uniq and sort to your advantage. You will also learn how to get data from the Internet, from databases and even Microsoft Excel spreadsheets, where most of the world’s operational data lies hidden from plain sight.
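To give a flavour of how these classic tools combine, here is a small sketch that counts how often each city occurs in a CSV file. The file `people.csv`, its columns and its contents are assumptions made up for this example, not data from the book:

```shell
# Hypothetical sample: name,city records invented for this sketch.
printf 'name,city\nAnn,Oslo\nBob,Turku\nCid,oslo\nDee,Oslo\n' > people.csv

# grep -v drops the header line, cut keeps the second comma-separated field,
# tr folds upper case to lower case, sort groups equal values together,
# uniq -c counts each group, and sort -rn puts the largest count first.
grep -v '^name,' people.csv \
  | cut -d, -f2 \
  | tr '[:upper:]' '[:lower:]' \
  | sort \
  | uniq -c \
  | sort -rn > city_counts.txt

cat city_counts.txt
```

With the sample data above, the result should list oslo with a count of three ahead of turku with a count of one; the exact padding in front of the counts varies between systems.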

From this book I learned completely new and interesting ways to work with CSV (Comma Separated Value) files, and it introduced me to the excellent csvkit, with its collection of power tools to cut, merge and reorder columns in CSV files, perform SQL-style queries on the lines, and grep through them.

Among other things you get information on Drake, described as “make for data” – which, if you’re familiar with the classic software development tool make (and of course you are) should whet your appetite. There is also a chapter about how to make your data pipelines run faster by parallelising them and running commands on remote machines.

Scrubbing the data is less than half the fun, but usually more than half of the work in data science. You will learn to write executable scripts in Python and R with their comprehensive data science and statistics libraries, and learn to explore your data using visualisations that consist of statistical diagrams like bar charts and box plots. So the command line is not just text: the images are generated using commands, although they are of course shown in a GUI window.

Finally, there is a chapter on modelling data using both supervised and unsupervised learning methods, which serves as a cursory introduction to machine learning, although you are referred to more comprehensive texts on the algorithms involved.

At the back of the book there is a handy reference for all the commands discussed in the book, which include many of the old UNIX stalwarts found in Linux, but also newer tools like jq for processing JSON.

If you need to do data preparation for a data science project, you owe it to yourself to become good friends with the command line, and utilise the many tools described in Janssens’ book in your daily work. Even if you don’t “automate all the things”, you will benefit from the pipeline approach to data processing.