One of the big issues for a data scientist is configuring the data science environment correctly. Sometimes this means installing a lot of packages, waiting for them to compile, handling obscure errors, setting everything up to work correctly... and most of the time, this is a pain. But configuring the environment correctly is necessary to reproduce the analysis and share work with others.

For these reasons, I introduced Docker in my data science workflow.

What Is Docker?

Docker is a tool that simplifies installing and running software. To explain in a very simple way (sorry, Docker gurus, for this definition), Docker creates something like a super lightweight virtual machine, called a container, that starts very quickly and contains everything we need to run our environment the right way.

With the -v option, /path_your_machine/notebook_folder/ will be mounted into the Docker container at the /Documents path.

This is useful to save the work and to keep the environment separate from the notebooks. I prefer organizing my work this way instead of creating a Docker container that contains both the environment and the notebooks.
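As a rough illustration of the mount described above, the sketch below assembles a docker run command from Python. The container name, host folder, /Documents mount point and host port 8007 come from this article; the image name my_datascience_image and the container port 8888 (Jupyter's usual default) are assumptions, so replace them with whatever your own Dockerfile defines.

```python
import shlex


def build_docker_run(host_dir, container_dir="/Documents", host_port=8007,
                     container_port=8888, name="datascience_env",
                     image="my_datascience_image"):
    """Return the argument list for a `docker run` call.

    `image` and `container_port` are hypothetical placeholders, not values
    taken from the article.
    """
    return [
        "docker", "run",
        "-d",                                   # run in the background
        "--name", name,                         # name used later by `docker stop`
        "-p", f"{host_port}:{container_port}",  # host port -> container port
        "-v", f"{host_dir}:{container_dir}",    # host folder -> container path
        image,
    ]


cmd = build_docker_run("/path_your_machine/notebook_folder/")
print(shlex.join(cmd))  # print the command so it can be reviewed before running
```

Passing the list to subprocess.run(cmd, check=True) would start the container, provided Docker is installed and the image exists.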

When the container is up, we can open the Jupyter web interface:

http://127.0.0.1:8007

and when asked for the token, we enter ‘mynotebook’, or whatever you set in your Dockerfile, and that’s all! Now we can work in our new data science environment.

Click on Documents and we have all our notebooks!

Note: Because /Documents is a folder mounted from your machine, every change made there is still on your machine after the container is stopped.

To test this environment, I used the DBSCAN example found on the scikit-learn website.
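Before running a full example, a quicker sanity check is to confirm that the packages you expect are importable inside the container. The sketch below uses only the standard library; the package list is an assumption (numpy, sklearn and matplotlib are what a DBSCAN demo would typically need), so adjust it to match your own environment.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Assumed package list; edit it to match what your Dockerfile installs.
wanted = ["numpy", "sklearn", "matplotlib"]
for name, available in check_packages(wanted).items():
    print(f"{name}: {'ok' if available else 'MISSING'}")
```

Running this in a notebook cell shows at a glance whether the environment was built as intended, without importing the heavy packages themselves.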

When our work is finished, we can stop the container with the command:

docker stop datascience_env

I think Docker is a very important tool for every developer and every data scientist who needs to deploy and share work. From my point of view, the most important innovation Docker has introduced is a way to describe how to correctly recreate an environment where my code can run (with a Dockerfile). In this way, I can reproduce, every time, the exact environment I used during my development process, and I can share the built container with everyone.