The purpose of this post is to show how powerful and flexible Docker Swarm can be when combined with standard UNIX tools to analyze data in a distributed fashion. To do this, let’s write a simple MapReduce implementation in bash/sh that uses Docker Swarm to schedule Map jobs on nodes across the cluster.

MapReduce is usually implemented when there’s a large dataset to process. For the sake of simplicity and for reproducibility by the reader, we’re using a very small dataset composed of a few megabytes of text files.

This post is not about showing you how to write a MapReduce program. It’s also not about suggesting that MapReduce is best done in this way. Instead, this post is about making you aware that plain old UNIX tools such as sort, awk, netcat, pv, uniq, xargs, join, time, and cat, connected with pipes, can be useful for distributed data processing when running on top of a Docker Swarm cluster.

Because this is only an example, there’s a lot of work left to do to gain fault tolerance, resilience, and redundancy. A solution like the one proposed here can be useful if you happen to have a one-time use case and you don’t want to invest time in something more complicated like Hadoop. If you have a frequent use case, I recommend you use Hadoop instead.

Requirements for Our MapReduce Implementation with Docker Swarm

To reproduce the examples in this post, you’re going to need a few things:

Docker installed on your local machine

A running Swarm cluster (if you don’t have one, don’t worry: I’ll explain how to set one up quickly and easily)

Docker Machine installed on your local machine (to set up the Swarm cluster if you don’t already have one)

MapReduce is a programming paradigm with the aim of processing large datasets in a distributed way on a cluster (in our case a Swarm cluster). As the name suggests, MapReduce is composed of two fundamental steps:

Map: The master node takes a large dataset and distributes it to compute nodes to perform analysis on. Each node returns a result.

Reduce: Gather the result of each Map and aggregate them to produce the final answer.
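The two steps can be tried on a single machine before involving a cluster. The sketch below is a minimal local illustration, not the implementation used later in this post: the chunk files stand in for the pieces of a dataset that would be handed to compute nodes, and the map and reduce function names are assumptions made for the example.

```shell
#!/bin/sh
# Map: turn one chunk of text into one lowercase word per line.
map() {
    tr -cs '[:alpha:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]'
}

# Reduce: count identical keys and list the most frequent first.
reduce() {
    sort | uniq -c | sort -nr
}

# Two tiny chunks stand in for the pieces sent to compute nodes.
work=$(mktemp -d)
printf 'the cat\nthe dog\n' > "$work/chunk1"
printf 'The bird\n' > "$work/chunk2"

# Each chunk is mapped independently; the reducer only sees the merged stream.
for chunk in "$work"/chunk*; do
    map "$chunk"
done | reduce

rm -rf "$work"
```

Because each map call only needs its own chunk, the loop body is the part that can be moved onto other machines.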

Setting Up the Swarm Cluster

If you already have a Swarm cluster, you can skip this section. Just ensure that you’re connecting to the Swarm cluster when using the Docker client. For this purpose, you can inspect the DOCKER_HOST environment variable.
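A quick way to do that inspection is sketched below. The check_docker_target function name is made up for this example; it only reports what the Docker client would use and does not contact the cluster.

```shell
#!/bin/sh
# Report which endpoint the Docker client is currently configured to use.
check_docker_target() {
    if [ -n "${DOCKER_HOST:-}" ]; then
        echo "Docker client targets $DOCKER_HOST"
    else
        echo "DOCKER_HOST is unset: the client talks to the local engine"
    fi
}

check_docker_target
```

If the variable points at your Swarm master, running docker info afterwards should report the cluster rather than a single engine.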

I wrote a setup script so we can easily create a Swarm cluster on DigitalOcean. In order to use it, you need a DigitalOcean account and an API key to allow Docker Machine to manage instances for you. You can obtain the API key here.

When you are done with the API key, export it so it can be used in the setup script:
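The variable name the setup script reads is not shown here. As an assumption, the sketch below uses DIGITALOCEAN_ACCESS_TOKEN, which is the variable Docker Machine's DigitalOcean driver falls back to when no access token flag is passed; adapt the name if your script expects a different one.

```shell
# Placeholder value: paste the API key generated in your DigitalOcean account.
export DIGITALOCEAN_ACCESS_TOKEN="your-api-key-here"
```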

Configuration: Here you have two variables used to configure the entire cluster. The agents variable defines how many Swarm agents to put in the cluster, while the token variable is populated with the output of the swarm create command, which generates a Docker Hub token used by your cluster for service discovery. If you don’t like the token approach, you can use your own discovery service like Consul, ZooKeeper, or Etcd.

Creation of the Swarm master machine: This is the machine that will expose the Docker Remote API via TCP.

Creation of the Swarm agent machines: According to the configuration, a machine will be created on DigitalOcean for each specified name (agent1, agent2, and so on) and configured to join the cluster of the previously created Swarm master.

Print information about the generated cluster: When the machines are running, the script just prints information about the generated cluster and how to connect to it with the Docker client.
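The four parts described above could be laid out roughly as follows. This is a dry-run sketch rather than the setup script itself: it only prints the docker-machine commands, and the create_cluster and run helpers and the SWARM_TOKEN variable are assumptions made for illustration. The machine names manager, agent1, and agent2 follow the names used in this post.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.

# Print a command; replace the body with "$@" to really run it.
run() {
    printf '+ %s\n' "$*"
}

create_cluster() {
    # Configuration: agent names and the discovery token.
    # A real token would come from: docker run --rm swarm create
    agents="agent1 agent2"
    token=${SWARM_TOKEN:-TOKEN_PLACEHOLDER}

    # Swarm master: exposes the Docker Remote API for the whole cluster.
    run docker-machine create --driver digitalocean \
        --swarm --swarm-master --swarm-discovery "token://$token" manager

    # Swarm agents: one machine per configured name, joining the same token.
    for agent in $agents; do
        run docker-machine create --driver digitalocean \
            --swarm --swarm-discovery "token://$token" "$agent"
    done

    # Information about the generated cluster and how to connect to it.
    run docker-machine ls
    echo 'Connect with: eval "$(docker-machine env --swarm manager)"'
}

create_cluster
```

Printing first makes it easy to review what would be created, and billed, before any droplet exists.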

Now we can finally execute the create_cluster.sh script:

chmod +x create_cluster.sh
./create_cluster.sh

After a few minutes and a few lines of output, if nothing went wrong, we should see something like this:

Note that the master is the only machine with an entry in the ACTIVE column. This is because you previously ran the eval command to configure your shell.

Collecting Data for Analysis

Data analysis would be nothing without data to be analyzed. We’re going to use a few transcripts of the latest seasons of the popular British sci-fi series Doctor Who. For this purpose, I created a Gist with a few of them taken from The Doctor Who Transcripts.

Once you’ve cloned the Gist, you should end up with a who-transcripts folder containing 130 transcripts.

Since one of our requirements for this post is that data analysis be done with UNIX tools, we can use AWK for the map program.

In order to be useful for the reduce step, our map program should be able to transform a transcript like this:

[Albion hospital]
(The patients are almost within touching distance.)
DOCTOR: Go to your room.
(The patients in the ward and the child in the house stand still.)
DOCTOR: Go to your room. I mean it. I'm very, very angry with you. I am very, very cross. Go to your room!
(The child and the patients hang their heads in shame and shuffle away. The child leaves the Lloyd's house and the patients get back into bed.)
DOCTOR: I'm really glad that worked. Those would have been terrible last words.
[The Lloyd's dining room]
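A map program with this behavior might look like the sketch below. It is an assumption about the shape of such a script, not the map.awk used for the results in this post: it keeps only dialogue lines of the form NAME: text, lowercases the text, strips punctuation, and emits one name and word pair per line.

```shell
#!/bin/sh
# Write a hypothetical map program to map.awk.
cat > map.awk <<'EOF'
# Dialogue lines start with an upper-case speaker name followed by ": ".
# Scene headings in [brackets] and stage directions in (parentheses) are skipped.
/^[A-Z][A-Z ]*: / {
    colon = index($0, ":")
    name = substr($0, 1, colon - 1)
    text = tolower(substr($0, colon + 1))
    gsub(/[^a-z0-9 ]/, "", text)    # drop punctuation, so "I'm" becomes "im"
    n = split(text, words, " ")
    for (i = 1; i <= n; i++)
        print name, words[i]
}
EOF

# Try it on one line of dialogue.
printf 'DOCTOR: Go to your room.\n' | awk -f map.awk
```

Emitting one pair per line keeps the output easy to merge: results from different transcripts can simply be concatenated before the reduce step.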

Argument Checking: The sole purpose of this part is to retrieve and check the needed arguments for jobs execution.

Jobs Execution: This part consists of a for loop that iterates through the transcript files in the provided folder. On each iteration, a container is started, and the map.awk script is copied to it just before being executed. The output of the mapping is redirected to the result.txt file, which collects all mapping outputs. The for loop is controlled by the maxprocs variable, which determines the maximum number of concurrent jobs.

Container removal: Used containers should be removed during the for loop; if that doesn’t happen, they are removed after the loop ends.
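The three parts described above could be sketched as follows. This is an illustration under stated assumptions, not the scheduler used in this post: the function names and the MAP_JOB variable are hypothetical, jobs run in batches of at most maxprocs, and each job writes to its own temporary file before everything is collected, so that concurrent jobs never interleave their output.

```shell
#!/bin/sh
# Hypothetical scheduler sketch; function names and MAP_JOB are illustrative.

# Run map.awk on one transcript in a fresh Alpine container; print the result.
docker_map_job() {
    # Create (but do not start) a container whose command is the map program.
    id=$(docker create alpine awk -f /map.awk /input.txt) || return 1
    # Copy the program and the data in, then start and stream stdout back.
    docker cp map.awk "$id":/map.awk &&
        docker cp "$1" "$id":/input.txt &&
        docker start -a "$id"
    status=$?
    docker rm -f "$id" >/dev/null   # remove the used container
    return "$status"
}

# schedule DIR MAXPROCS [OUTFILE]
schedule() {
    dir=$1 maxprocs=$2 out=${3:-result.txt}
    job=${MAP_JOB:-docker_map_job}

    # Argument checking
    [ -d "$dir" ] || { echo "usage: schedule DIR MAXPROCS [OUTFILE]" >&2; return 2; }
    [ "$maxprocs" -gt 0 ] 2>/dev/null ||
        { echo "MAXPROCS must be a positive integer" >&2; return 2; }

    # Job execution: at most MAXPROCS jobs per batch, one output file per job
    parts=$(mktemp -d) || return 1
    running=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        "$job" "$f" > "$parts/$(basename "$f")" &
        running=$((running + 1))
        if [ "$running" -ge "$maxprocs" ]; then
            wait        # let the current batch finish before starting more
            running=0
        fi
    done
    wait

    # Collect the per-job outputs into a single file
    : > "$out"
    for p in "$parts"/*; do
        [ -f "$p" ] && cat "$p" >> "$out"
    done
    rm -rf "$parts"
}
```

With the Docker client pointed at the cluster, a call such as schedule who-transcripts 40 would process the folder with up to 40 concurrent jobs. Setting MAP_JOB to another function name lets the loop be tried without any cluster at all.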

The scheduler script could be simplified by running the container with the --rm option, but that would require the map.awk script to already be inside the image before running.

Since the scheduler is capable of transferring needed data to executors, we don’t need anything else, and we can run the scheduler. But before running the scheduler, we have to tell the Docker client to connect to the Swarm cluster instead of the local engine.

eval $(docker-machine env --swarm manager)

This will start the scheduler using the who-transcripts folder with 40 as the maximum number of concurrent jobs.

Great! That’s a key-value pair of the form <name> <word>. So let’s see if we can reduce this data to something useful with a UNIX command:

sort result.txt | uniq -c | sort -nr

The above reduction command sorts the file, collapses identical rows while counting how often each one occurs, and then sorts the counted rows again in reverse order so that the most common words by speaker are shown first. The first 20 lines of its output are:

9271 DOCTOR the
7728 DOCTOR you
5290 DOCTOR a
5219 DOCTOR i
4928 DOCTOR to
3959 DOCTOR it
3501 DOCTOR and
3476 DOCTOR of
2595 DOCTOR that
2457 DOCTOR in
2316 DOCTOR no
2309 DOCTOR its
2235 DOCTOR is
2150 DOCTOR what
2088 DOCTOR this
2009 DOCTOR me
1865 DOCTOR on
1681 DOCTOR not
1580 DOCTOR just
1531 DOCTOR im
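To see what each stage of the reduction contributes, the sketch below traces it on a hand-made sample rather than on result.txt. The sample function and its six lines are made up for the example, and the counts are sorted numerically with sort -nr so the ordering does not depend on how uniq pads its numbers.

```shell
#!/bin/sh
# Hand-made sample in the "<name> <word>" shape (not taken from result.txt).
sample() {
    printf '%s\n' 'DOCTOR the' 'ROSE you' 'DOCTOR you' \
                  'DOCTOR the' 'ROSE the' 'DOCTOR the'
}

echo '--- after sort: identical pairs are adjacent'
sample | sort

echo '--- after uniq -c: one line per pair, prefixed by its count'
sample | sort | uniq -c

echo '--- after sort -nr | head -n 2: highest counts first'
sample | sort | uniq -c | sort -nr | head -n 2
```

The first sort matters because uniq only merges rows that are next to each other.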

This means that the most common word said by the Doctor is “the”. Here’s the distribution graph of the 50 most used words:

If you’re interested, I generated this graph using this gnuplot script:

Conclusion

Docker Swarm is a very flexible tool, and the UNIX philosophy is more relevant than ever when performing data analysis. Here we showed how a simple task can be distributed across a cluster by mixing Swarm with a few commands. A natural next step is to move to a more maintainable approach.

A few possible improvements could be:

Use a real programming language instead of AWK and Bash scripts.

Build and push a Docker image featuring all the needed programs (instead of copying them into an Alpine container at run time).

Put the data as close as possible to where it’s being processed (in the example, we loaded the data into the cluster at runtime with the scheduler).

Last but not least: Keep in mind that if you start having frequent and more complicated use cases, Hadoop is your friend.
