Data Science Toolbox

As a data scientist, you don't want to waste your time installing software. Our goal is to provide a virtual environment that will enable you to start doing data science in a matter of minutes.

As a teacher, author, or organization, making sure that your students, readers, or members have the same software installed is not straightforward. This open source project will enable you to easily create custom software and data bundles for the Data Science Toolbox.

A virtual environment for data science

The Data Science Toolbox is a virtual environment based on Ubuntu Linux that is specifically suited for doing data science. Its purpose is to get you started in a matter of minutes. You can run the Data Science Toolbox either locally (using VirtualBox and Vagrant) or in the cloud (using Amazon Web Services).

We aim to offer a virtual environment that contains the software that is most commonly used for data science while keeping it as lean as possible. After a fresh install, the Data Science Toolbox contains the following software:

R, with the following packages: ggplot2, plyr, dplyr, lubridate, zoo, forecast, and sqldf.

dst, a command-line tool for installing additional bundles on the Data Science Toolbox (see next section).

Let us know if you want to see something added to the Data Science Toolbox.

Additional software and data bundles

The Data Science Toolbox has support for so-called bundles. A bundle is a collection of software or data that is specific to a certain book, course, or project. (In case you're interested, a bundle is essentially an Ansible playbook.) Once you're logged in to your Data Science Toolbox, you can install bundles with the dst command-line tool. We are currently working on a few exciting bundles that should come out shortly.

Getting started with Data Science Toolbox 0.1.5

Note: If you want to install the Data Science Toolbox for the book Data Science at the Command Line, then you should, for now, follow these instructions.

There are two ways to run the Data Science Toolbox: (1) locally using VirtualBox and Vagrant and (2) in the cloud using Amazon Web Services. Both ways result in exactly the same environment. Select the appropriate tab below for the corresponding installation steps.

Because the local version of the Data Science Toolbox runs on top of VirtualBox and Vagrant, it can be installed on Linux, Mac OS X, and Microsoft Windows.

Step 1: Download and install VirtualBox

Go to the VirtualBox download page and download the appropriate binary. Open the binary and follow the installation instructions.

Step 2: Download and install Vagrant

As in Step 1, go to the Vagrant download page and download the appropriate binary. Open the binary and follow the installation instructions.
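
Before moving on, it may be worth confirming that both tools can be found from a terminal. The following is only a sketch: check_tool is a hypothetical helper name, and it assumes that the installers put VBoxManage (the command-line front end that ships with VirtualBox) and vagrant on your PATH, which may not be the case on every system, Microsoft Windows in particular.

```shell
# Hypothetical helper: report whether a tool is on the PATH and, if so, its version.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        # Both VBoxManage and vagrant accept --version.
        echo "$1: $("$1" --version 2>/dev/null | head -n 1)"
    else
        echo "$1: not found on PATH"
    fi
}

check_tool VBoxManage
check_tool vagrant
```

If either line reports "not found on PATH", revisit the corresponding installation step before continuing.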

Step 3: Download and start the Data Science Toolbox

Open a terminal (known as the command prompt in Microsoft Windows). Create a directory, for example "MyDataScienceToolbox", and navigate to it:

$ mkdir MyDataScienceToolbox
$ cd MyDataScienceToolbox

In order to download and start the Data Science Toolbox, run the following commands:

$ vagrant init data-science-toolbox/dst
$ vagrant up

Step 4: Log in (on Mac OS X and Linux)

If you are running Mac OS X or some other UNIX-like operating system, you can log in to the Data Science Toolbox by simply running the following command in a terminal:

$ vagrant ssh

Step 4: Log in (on Microsoft Windows)

If you are running Microsoft Windows, you need to use a third-party application in order to log in to the Data Science Toolbox. We recommend PuTTY for this. Go to its download page and download putty.exe. Run putty.exe and enter the following values:

Host Name (or IP address): 127.0.0.1
Port: 2222
Connection type: SSH

(If you want, you can save these values as a session by clicking the "Save" button, so that you do not need to enter these values again.) Click the "Open" button and enter "vagrant" for both the username and the password.

Step 5: Set up IPython Notebook (optional)

If you would like to be able to run IPython Notebook on your Data Science Toolbox, invoke the following command to create a password-protected profile:

vagrant@data-science-toolbox:~$ dst setup base

(Note that vagrant@data-science-toolbox:~ indicates that this command should be run on the Data Science Toolbox.)

Step 3 created a file named Vagrantfile, which is a configuration file used by Vagrant. Open the file in your favorite text editor and add the following text somewhere around line 22:

config.vm.network "forwarded_port", guest: 8888, host: 8888

This line instructs Vagrant to open up port 8888 so that the IPython Notebook server is accessible from your browser. Restart the Data Science Toolbox and log in again so that the changes take effect:

$ vagrant reload
$ vagrant ssh
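
If the notebook server later turns out to be unreachable, one thing to check is whether the forwarding rule is really present in the Vagrantfile. The sketch below assumes it is run on the host, in the directory created in Step 3; has_port_forward is a hypothetical helper name, and the pattern only looks for an uncommented forwarded_port rule with guest port 8888.

```shell
# Hypothetical helper: succeeds if the given Vagrantfile (default: ./Vagrantfile)
# contains an uncommented forwarded_port rule for guest port 8888.
has_port_forward() {
    grep -Eq '^[[:space:]]*config\.vm\.network[[:space:]]*"forwarded_port".*guest:[[:space:]]*8888' "${1:-Vagrantfile}"
}

if has_port_forward Vagrantfile 2>/dev/null; then
    echo "port 8888 is forwarded"
else
    echo "no forwarding rule for port 8888 found"
fi
```

A line that is still commented out with a leading # does not count as a match.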

To start the IPython Notebook server, run:

vagrant@data-science-toolbox:~$ sudo ipython notebook --profile=dst

You can now access the IPython Notebook server at https://localhost:8888. Because the SSL certificate is self-signed, you may get a warning message from your browser. The image below shows how Chrome complains about this. Because you know what's on the server side, you can just click the "Proceed anyway" button.
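
The same check can be made from a terminal on the host. This is a minimal sketch that assumes curl is installed; notebook_reachable is a hypothetical helper name. The --insecure flag tells curl to skip certificate verification, which plays the same role as clicking through the browser warning for the self-signed certificate.

```shell
# Hypothetical helper: succeeds if an HTTPS server answers at the given URL.
# --insecure skips certificate verification (needed for a self-signed certificate).
notebook_reachable() {
    curl --insecure --silent --output /dev/null --max-time 5 "$1"
}

URL="${URL:-https://localhost:8888}"
if notebook_reachable "$URL"; then
    echo "server is answering at $URL"
else
    echo "no answer from $URL"
fi
```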

Step 6: Install additional software packages and bundles (optional)

It's unlikely that the Data Science Toolbox contains all the software you need for your data science project. Fortunately, you can always use apt-get and pip to install individual Ubuntu and Python packages, respectively. For example:
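
A sketch of what such an installation could look like when run on the Data Science Toolbox. The package names tree and requests are illustrative assumptions, not packages named in this guide, and sudo is assumed for a system-wide install. The run and install_examples helpers are hypothetical; with DRY_RUN=1 (the default in this sketch) the commands are only printed, so you can inspect them first.

```shell
# Sketch only: "tree" and "requests" are example package names.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_examples() {
    run sudo apt-get update              # refresh the Ubuntu package index
    run sudo apt-get install -y tree     # an individual Ubuntu package
    run sudo pip install requests        # an individual Python package
}

install_examples
```

Setting DRY_RUN=0 before running the script executes the commands instead of printing them.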

Step 2: Configure EC2 instance

In order to launch an EC2 instance, you need to be logged in to AWS. If you do not yet have an AWS account, select "I am a new user" and click the "Sign in" button.

Once you're logged in to AWS, you can select the type of EC2 instance you want to run. Only the t1.micro type is eligible for the free usage tier.

Choose your preferred instance type and press the "Next" button. You may safely ignore the settings on the next two screens.

Giving your instance a name is useful for when you are running multiple instances, but it is not required.

The settings on the next screen ("Configure Security Group") determine through which ports you can access your Data Science Toolbox. Port 22 is open by default, which allows you to log in. If you would like to be able to use IPython Notebook, you need to click the "Add rule" button and add a "custom TCP rule" for port "8888" and source "Anywhere". The result should look like the screenshot below. These settings cannot be changed once the EC2 instance is running.

You can now review your settings and click the "Launch" button. Both the "dst" version and the "ami" id shown at the top may be different.

Step 3: Create key pair

In order to log in to the EC2 instance, you need to have an AWS key pair. A screen will pop up where you can either use an existing key pair (if you already have one), or create a new one.

Give the key pair a name and press the "Download Key Pair" button. Remember the location where you save the file. If everything went well, you will see something like the following screen. Press the "View Instances" button.

You now see an overview of all your EC2 instances. It will take a few moments before your Data Science Toolbox is assigned a public DNS.

Step 4: Log in (on Mac OS X and Linux)

If you are running Mac OS X or some other UNIX-like operating system, you can log in to the Data Science Toolbox from the terminal. First, you need to make sure that the permissions on the key pair file you downloaded earlier are not too open:
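
A minimal sketch of these two steps, not the exact commands from the original instructions. The helper names restrict_key and login are hypothetical, the key file path and public DNS in the example call are placeholders for your own values, and the ubuntu user name is inferred from the prompt shown in Step 5.

```shell
# Make the key file readable by its owner only; ssh rejects private keys
# whose permissions are broader than that.
restrict_key() {
    chmod 400 "$1"
}

# Log in as the "ubuntu" user. Arguments: key file, public DNS of the instance.
login() {
    restrict_key "$1" && ssh -i "$1" "ubuntu@$2"
}

# Example call (placeholders, replace with your own values):
# login ~/Downloads/MyKeyPair.pem <public dns>
```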

Step 4: Log in (on Microsoft Windows)

Step 5: Set up IPython Notebook (optional)

If you would like to be able to run IPython Notebook on your Data Science Toolbox, invoke the following command to create a password-protected profile:

ubuntu@ip-172-31-26-198:~$ dst setup base

(Note that ubuntu@ip-172-31-26-198:~ indicates that this command should be run on the Data Science Toolbox. Your IP address may be different.) To start the IPython Notebook server, run:

ubuntu@ip-172-31-26-198:~$ sudo ipython notebook --profile=dst

You can now access the IPython Notebook server at https://<public dns>:8888. Because the SSL certificate is self-signed, you may get a warning message from your browser. The image below shows how Chrome complains about this. Because you know what's on the server side, you can just click the "Proceed anyway" button.

Step 6: Install additional software packages and bundles (optional)

It's unlikely that the Data Science Toolbox contains all the software you need for your data science project. Fortunately, you can always use apt-get and pip to install individual Ubuntu and Python packages, respectively. For example:
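
Before installing anything, it can help to check whether a package is already present. The sketch below assumes it is run on the Data Science Toolbox; apt_installed and pip_installed are hypothetical helper names, and tree and requests are example package names only.

```shell
# Hypothetical helpers; "tree" and "requests" are example names only.

# Succeeds if dpkg has a record of the Ubuntu package.
apt_installed() {
    dpkg -s "$1" >/dev/null 2>&1
}

# Succeeds if pip prints any metadata for the Python package.
pip_installed() {
    [ -n "$(pip show "$1" 2>/dev/null)" ]
}

apt_installed tree     || echo "tree is not installed; try: sudo apt-get install tree"
pip_installed requests || echo "requests is not installed; try: sudo pip install requests"
```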