6 Easy Steps: Deploy Pivotal's Hadoop on Docker

While Hadoop is becoming more and more mainstream, many development leaders want to speed up and reduce errors in their development and deployment processes (i.e. devops) by using platforms like PaaS and lightweight runtime containers. One of the most interesting recent stats in the devops arena is that companies with high-performing devops processes can ship code 30x more frequently and complete deployments 8,000 times faster.

To this end, Docker is a new-but-rising lightweight virtualization solution (more precisely, a lightweight Linux isolation container). Basically, Docker allows you to package and configure a run-time and deploy it on Linux machines—it’s build-once-run-anywhere, isolated like a virtual machine, and runs faster and lighter than traditional VMs. Today, I will show you how two components of the Pivotal Big Data Suite—our Hadoop distribution, Pivotal HD, and our SQL interface on Hadoop, HAWQ—can be quickly and easily set up to run on a developer laptop with Docker.
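As a rough illustration of that model (assuming Docker is installed; the ubuntu image name is just an example), a throwaway container can be started and discarded in seconds:

```shell
if command -v docker >/dev/null 2>&1; then
  # Run a throwaway Ubuntu container: it starts in about a second, prints
  # a message, and is removed automatically on exit thanks to --rm.
  MSG=$(docker run --rm ubuntu echo "hello from an isolated container")
else
  # Docker is not installed on this machine; in a real environment the
  # container's output would appear instead.
  MSG="docker not installed"
fi
echo "$MSG"
```

Compare that startup time with booting a full VM, and the "light switch" analogy below starts to make sense.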

With the Docker model, we can literally turn heavyweight app environments on and off like a light switch! The steps below typically take less than 30 minutes: 1) Download and Import the Docker Images, 2) Run the Docker Containers, 3) SSH in to the Environment to Start Pivotal HD, 4) Test Hadoop’s HDFS and MapReduce, 5) Start HAWQ—SQL on Hadoop, and 6) Test HAWQ.

Hadoop on Docker—Architecture

This diagram explains the overall deployment of Pivotal HD and HAWQ across several Docker containers. Basically, the workloads run on a Hadoop master node (running the namenode and related services), several Hadoop worker nodes (running datanodes), and a HAWQ master with two segment servers.

There are a few other components worth mentioning:

tar files – These are the Docker image files. In the future, we plan to upload these to Docker’s repository so that you can pull them directly from Docker. Currently, you need to download a gzipped file from our repository.

Containers – These are the Docker containers that contain Pivotal Command Center (our Pivotal HD cluster orchestration tool) and the deployed Pivotal HD and HAWQ components. You will NOT have to install and deploy Pivotal HD. It is already built as part of the Docker files!

Other libraries – DNS and SSH servers are set up to work for the cluster.

That’s it. You don’t need any other files—the tar images contain everything you need to set this up on your own laptop or development environment.

Hadoop on Docker—Environments and Prerequisites

Currently, I run this entire environment on my development laptop, and the specs are below. It’s a decent set-up in terms of compute and memory:

Ubuntu 13.10 (on Windows 7 using VirtualBox)

2 CPUs allocated (Intel i5, 2.60GHz)

10GB of memory is allocated

In addition, I run this Hadoop on Docker environment on an Amazon Web Services Ubuntu 13.10 64-bit m2.xlarge virtual machine. If you are using Amazon, make sure that your root directory has plenty of space, since /var/lib/docker will be used for image extraction. The AMI is ubuntu-saucy-13.10-amd64-server-20140212.
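A quick way to check the available space is standard df usage (note that /var/lib/docker may not exist until Docker has run at least once, hence the fallback to the root filesystem):

```shell
# Docker extracts imported images under /var/lib/docker, which normally
# lives on the root filesystem -- check the free space there first.
df -h /var/lib/docker 2>/dev/null || df -h /
```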

In theory, this Hadoop on Docker install should work on all Linux systems, but I have only tested it on Ubuntu 13.10 64-bit. Also, make sure that no other containers are running (i.e. the docker ps command does not return any container IDs). There are some hardcoded values and limitations, which I will fix in the future.
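A pre-flight check along these lines can save some head-scratching (the docker ps usage is standard; the messages are just illustrative):

```shell
# The images rely on some hardcoded values, so start from a clean slate:
# `docker ps -q` should print nothing (no running container IDs).
if command -v docker >/dev/null 2>&1; then
  RUNNING=$(docker ps -q)
  if [ -n "$RUNNING" ]; then
    STATUS="stop these containers first: $RUNNING"
  else
    STATUS="no running containers -- good to go"
  fi
else
  STATUS="docker is not installed"
fi
echo "$STATUS"
```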

6 Simple Steps to start Hadoop with SQL on Docker

1. Download and Import the Hadoop on Docker Images

First, you are going to download the tar ball, extract it, and import the images into Docker. Remember to verify that the MD5 checksum of the download matches the published hash.
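The download, verification, and import flow can be sketched as follows. The tarball name below is a placeholder (the real name comes from Pivotal's download page); the docker commands themselves are standard:

```shell
# NOTE: the file name below is a placeholder -- substitute the actual
# download from Pivotal's repository.
TARBALL=pivotal-hd-docker.tar.gz

if [ -f "$TARBALL" ]; then
  # Verify the download against the MD5 hash published alongside it.
  md5sum "$TARBALL"

  # Extract the gzipped archive; it yields one tar file per Docker image.
  tar xzf "$TARBALL"

  # Import each extracted image into Docker's local image store.
  for img in *.tar; do
    docker load -i "$img"
  done

  # Confirm the images are now listed locally.
  docker images
else
  echo "Download $TARBALL from Pivotal's repository first."
fi
```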

# Exit the shell and look at how the data is stored in HDFS.
hadoop fs -cat /hawq_data/gpseg*/*/*/* # This shows a raw HAWQ file on HDFS

Well done! Pivotal HD and HAWQ are running on your laptop within the Docker container.

Cleaning Up Your Mess

In my opinion, cleanup is the beauty of container solutions like Docker. You can make any mess you like, and then you can just kill it—everything is gone. Of course, VMs are a good solution too, but their stop and start commands take much longer than with a container like Docker. Here is how to clean up your environment:
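A cleanup sketch using standard Docker commands (stop, rm, and optionally rmi); adapt it to your setup:

```shell
if command -v docker >/dev/null 2>&1; then
  # Stop every running container, then remove all containers (running or not).
  RUNNING=$(docker ps -q)
  [ -n "$RUNNING" ] && docker stop $RUNNING
  ALL=$(docker ps -a -q)
  [ -n "$ALL" ] && docker rm $ALL
  # Optionally reclaim disk space by removing the imported images as well:
  # docker rmi $(docker images -q)
  RESULT="environment cleaned up"
else
  RESULT="docker is not installed"
fi
echo "$RESULT"
```

After this, docker ps -a should return nothing, and you can re-import the images whenever you want a fresh cluster.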