This hands-on guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You'll learn how to combine small, yet powerful, command-line tools to quickly obtain, scrub, explore, and model your data.

Discover why the command line is an agile, scalable, and extensible technology. Even if you're already comfortable processing data with, say, Python or R, you'll greatly improve your data science workflow by also leveraging the power of the command line.

Installing the Data Science Toolbox

In both the book and the webcast we use many different command-line tools. The distribution of GNU/Linux that we assume, Ubuntu, comes with a whole bunch of command-line tools pre-installed. Moreover, Ubuntu offers many packages that contain other, relevant command-line tools. Installing these packages yourself is not too difficult. However, we also use command-line tools that are not available as packages and require a more manual, and more involved, installation. In order to acquire the necessary command-line tools without having to go through the involved installation process of each, we encourage you to install the Data Science Toolbox.

The Data Science Toolbox is a virtual environment that allows you to get started doing data science in minutes. The default version comes with commonly used software for data science, including the Python scientific stack and R together with its most popular packages. Additional software and data bundles are easily installed. These bundles can be specific to a certain book, course, or organization. You can read more about the Data Science Toolbox in general at http://datasciencetoolbox.org.

While there exists a bundle for this book, which contains all the data, scripts, and command-line tools used in this book, we have also created a pre-bundled Data Science Toolbox specifically for this book. The following five steps describe how to install it.

Step 2: Download and Install Vagrant

Similarly to Step 1, browse to the Vagrant (HashiCorp, 2014) download page (http://www.vagrantup.com/downloads.html) and download the appropriate binary. Open the binary and follow the installation instructions. If you already have Vagrant installed, please make sure that it’s version 1.5 or higher.
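One way to check the version requirement from a terminal is sketched below. It assumes GNU sort (for its -V version sort) and that vagrant --version prints the version number as the second word; version_ge is a hypothetical helper name, not part of Vagrant.

```shell
#!/usr/bin/env bash
# Succeed when version $1 is at least version $2.
# Assumes GNU sort, whose -V option sorts version strings.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

if command -v vagrant >/dev/null 2>&1; then
  # Assumption: the output looks like "Vagrant <version>"
  installed=$(vagrant --version | awk '{print $2}')
  if version_ge "$installed" "1.5"; then
    echo "Vagrant $installed is recent enough"
  else
    echo "Vagrant $installed is older than 1.5; please upgrade"
  fi
else
  echo "Vagrant is not installed"
fi
```

The comparison is done with sort -V rather than plain string comparison so that, for example, 1.10 is treated as newer than 1.5.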

Step 3: Download and Start the Data Science Toolbox

Open a terminal (known as the command prompt in Microsoft Windows). Create a directory, for example MyDataScienceToolbox, and navigate to it by typing:

$ mkdir MyDataScienceToolbox
$ cd MyDataScienceToolbox

In order to initialize the Data Science Toolbox, run the following command:

$ vagrant init data-science-toolbox/data-science-at-the-command-line

This creates a file named Vagrantfile, a configuration file that tells Vagrant how to launch the virtual machine. Most of its lines are commented out; a minimal version needs only a Vagrant.configure block whose config.vm.box setting names the box passed to vagrant init.
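To see which lines of your own Vagrantfile are actually active, you can filter out the comments and blank lines. The following is a small sketch; active_lines is a hypothetical helper name, and it assumes comments start with # at the beginning of a line (optionally indented).

```shell
# Print only the non-comment, non-blank lines of a file.
# Usage: active_lines Vagrantfile
active_lines() {
  grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# Only run against Vagrantfile when one exists in the current directory
if [ -f Vagrantfile ]; then
  active_lines Vagrantfile
fi
```

What remains is the effective configuration, which is usually only a few lines long.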

Running the following command downloads and boots the Data Science Toolbox:

$ vagrant up

If everything went well, then you now have a Data Science Toolbox running on your local machine.

If you ever see the message default: Warning: Connection timeout. Retrying... printed repeatedly, then it may be that the virtual machine is waiting for input. This may happen when the virtual machine has not been properly shut down. In order to find out what’s wrong, add the following lines to Vagrantfile before the last end statement:

config.vm.provider "virtualbox" do |vb|
  vb.gui = true
end

This will cause VirtualBox to show a screen. Once the virtual machine has booted and you have identified the problem, you can remove these lines from Vagrantfile. The username and password are both vagrant.

Step 4: Log in (on Microsoft Windows)

If you are running Microsoft Windows, you need to either run Vagrant with a graphical user interface (see above on how to set that up) or use a third-party application in order to log in to the Data Science Toolbox. We recommend PuTTY for this. Browse to the PuTTY download page (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) and download putty.exe. Run PuTTY, and enter the following values:

Host Name (or IP address): 127.0.0.1

Port: 2222

Connection type: SSH

If you want, you can save these values as a session by clicking the Save button, so that you do not need to enter these values again. Click the Open button and enter vagrant for both the username and the password.
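The same host, port, and username also work with any other SSH client. As a sketch, assuming an OpenSSH client (the ssh command) is available, the equivalent command line can be assembled like this; the variable names are chosen for illustration only.

```shell
# Connection values taken from the PuTTY settings above
host=127.0.0.1
port=2222
user=vagrant

# Build the equivalent OpenSSH command and show it
ssh_cmd="ssh -p $port $user@$host"
echo "$ssh_cmd"

# Running the printed command connects to the virtual machine,
# provided it has been started with vagrant up.
```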

Step 5: Get the Data and Scripts

Currently, the Data Science Toolbox does not contain all the data and scripts. An update will be uploaded once the book is finished. For now, in order to obtain the data and scripts, run the following:

$ cd ~/book
$ git pull

List of Command-line Tools

This is an overview of all the command-line tools discussed in the book. (Please note that, due to time constraints, the webcast will only discuss a subset.) This includes binary executables, interpreted scripts, and Bash builtins and keywords. For each command-line tool, we state, when available and appropriate, the following information:

The actual command to type at the command line.

A description.

The name of the package it belongs to.

The version used in the book.

The year that version was released.

The primary author(s).

A website to find more information.

How to install it.

How to obtain help.

An example usage.

All command-line tools listed here are included in the Data Science Toolbox for Data Science at the Command Line. See the previous sections for instructions on how to set it up. The install commands assume that you're running Ubuntu 14.04. Please note that citing open source software is not trivial, and that some information may be missing or incorrect.

cowsay

Generate an ASCII picture of a cow with a message. Particularly useful when building up a pipeline is starting to frustrate you a bit too much. Cowsay (version 3.03+dfsg1) by Tony Monroe (1999).

echo

env

Run a program in a modified environment. It’s often used to specify which interpreter should run our script. Env (version 8.21) by Richard Mlynarik and David MacKenzie (2012). http://www.gnu.org/software/coreutils.

$ sudo apt-get install coreutils
$ man env
$ #!/usr/bin/env python
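Besides its use in shebang lines, env can set a variable for a single command only. The sketch below illustrates this; GREETING is a hypothetical variable name chosen for the example.

```shell
# Make sure the example variable is not already set
unset GREETING

# Run a child command with GREETING set only for that command
out=$(env GREETING=hello sh -c 'echo "$GREETING"')
echo "child saw: $out"

# The current shell itself is unaffected
echo "current shell: ${GREETING:-unset}"
```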

export

Set export attribute for shell variables. Useful for making shell variables available to other command-line tools. Export is a Bash builtin.
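The difference between a plain shell variable and an exported one can be seen by asking a child process what it sees. This is a minimal sketch; NAME_PLAIN and NAME_EXPORTED are hypothetical variable names.

```shell
unset NAME_PLAIN NAME_EXPORTED

NAME_PLAIN=alice            # shell variable, not exported
export NAME_EXPORTED=bob    # exported, so child processes inherit it

# Ask a child shell what it can see
plain_seen=$(sh -c 'echo "${NAME_PLAIN:-unset}"')
exported_seen=$(sh -c 'echo "${NAME_EXPORTED:-unset}"')

echo "child sees NAME_PLAIN=$plain_seen, NAME_EXPORTED=$exported_seen"
```

Only the exported variable reaches the child process, which is why export is needed before other command-line tools can read a value set in the shell.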