Mini-Cluster Part IV : Word Count Benchmark

In this part, we will run a simple Word Count application on the cluster using Hadoop and Spark on various platforms and cluster sizes.

We will run and benchmark the same program on 5 datasets of different sizes on :

A single MinnowBoard MAX, using a simple multi-threaded Java application

A real home computer (my laptop), using the same simple Java application

MapReduce, using a cluster of 2 to 4 slaves

Spark, using a cluster of 2 to 4 slaves

Using these results, we will hopefully be able to answer the original questions of this section : is a home cluster made of such small computers worth it ? How many nodes does it take to be faster than a single node, or faster than a real computer ?

Word Count : Definition, Data Files and Choices

Definition

Word Count is a simple program which, as its name suggests, counts the number of times each word appears in a text.

In HDFS (in the case of Hadoop and Spark), the input text is naturally split into pieces called blocks. Each block is then processed line by line to count the words.
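To make the model concrete, here is a tiny, Hadoop-free Java simulation of the word count phases : map each line to words, then shuffle and reduce by summing per word. Class name, method name and tokenizing regex are illustrative, not the benchmark code.

```java
import java.util.*;
import java.util.stream.*;

// Toy simulation of the MapReduce word count phases, without Hadoop.
public class MapReduceSim {
    public static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
                // "map" phase : split each line into words
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("[^a-z']+")))
                .filter(w -> !w.isEmpty())
                // "shuffle + reduce" phase : group by word and sum, sorted like MapReduce output
                .collect(Collectors.toMap(w -> w, w -> 1, Integer::sum, TreeMap::new));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello world", "hello hadoop")));
        // {hadoop=1, hello=2, world=1}
    }
}
```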

I decided to use this program for the benchmark because it is a reference, a kind of “Hello World” program for both MapReduce and Spark. Apache uses it in the tutorials and examples on both frameworks’ websites to illustrate how they work.

Datasets

To get large quantities of text without repetition, I downloaded text files from Project Gutenberg. They provide a URL to crawl files of a given language and file type.

To download English .txt files from Project Gutenberg as a background task, I used the following commands:
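The exact commands are not shown in this excerpt. One way to do it, using Project Gutenberg’s harvest URL and wget in mirror mode (flags, log file name and URL to be adapted) :

```shell
# Mirror English .txt files in the background via Project Gutenberg's harvest page.
# -m : mirror, -H : follow links to the mirror hosts, -w 2 : wait 2 s between requests (be polite).
nohup wget -m -H -w 2 "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en" > gutenberg.log 2>&1 &
```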

Multithreaded Program for Single PC

Java Code

It uses a Producer-Consumer design pattern with a configurable number of consumers. This is well suited to this kind of program, because :

The Producer is a thread which is responsible for reading the input file and storing lines in a buffer.

Having only one thread read the input is better for an HDD. As discussed in Part I, parallel access is a weakness of an HDD because of its slow seek times.

The Consumers are threads which read the lines from the buffer and do the word counting.

The best number of consumers is variable, and depends on the CPU. This is why the number of consumers is configurable.
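The pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the author’s original code : class and method names are hypothetical, and the tokenizing regex is an assumption.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of the single-PC word count : one producer reads lines into a
// bounded buffer, N consumer threads tokenize and count.
public class WordCountPC {
    private static final String POISON = "\u0000"; // end-of-input marker, one per consumer

    public static Map<String, Integer> count(List<String> lines, int nConsumers) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);
        ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();

        // Single producer : sequential reads are kind to an HDD's slow seek times.
        Thread producer = new Thread(() -> {
            try {
                for (String line : lines) buffer.put(line);
                for (int i = 0; i < nConsumers; i++) buffer.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        // Consumers : tokenize each line and merge counts into a shared map.
        List<Thread> consumers = new ArrayList<>();
        for (int i = 0; i < nConsumers; i++) {
            Thread t = new Thread(() -> {
                try {
                    String line;
                    while (!(line = buffer.take()).equals(POISON)) {
                        for (String w : line.toLowerCase().split("[^a-z']+"))
                            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            consumers.add(t);
            t.start();
        }

        producer.start();
        try {
            producer.join();
            for (Thread t : consumers) t.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        // Alphabetical ordering, as in the article (done after timing in the original).
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or not to be", "to be fast"), 4));
        // {be=3, fast=1, not=1, or=1, to=3}
    }
}
```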

After the word counting, towards the end of the main function, the results are sorted alphabetically. I did this for debugging purposes, so that the results look like those of MapReduce, which naturally orders the words as part of its process. The program reports the elapsed time both before and after this sorting ; we will ignore the sorting time and use the “before” value.

Compilation and Execution

Example of compilation and execution on the “tiny” text file, for MinnowBoard, using 4 consumer threads :
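The actual command lines are not shown in this excerpt ; a plausible sketch, with hypothetical file and class names to adapt to your own :

```shell
# Illustrative only -- the actual class/file names from the article are not shown here.
javac WordCount.java
# args : <input file> <number of consumer threads>
java WordCount tiny.txt 4
```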

Benchmark Results

Here are the results for a single MinnowBoard MAX (Intel Atom E3825 Dual Core with 2 GB of RAM) and my Alienware laptop (Intel Core i7-3610QM with 8 GB of RAM). After several tries, I used the optimal settings for each one, which were :

Using MapReduce

Compilation and Execution

Then, assuming that the small.txt input file is located in the
hdfs:///user/hduser/small/input folder, and that you want YARN to create the output folder called “output” right next to it :


$ yarn jar wc.jar MapReduceWordCount small/input small/output

Benchmark Results

For a given dataset size, lower values are better. Hover over the bars to get the exact values.

Using Spark

Java Code

Spark can also run Python and Scala code, which is a lot more readable and faster to write than Java. But for the sake of fair benchmarking, i.e. to avoid language-related performance differences and to make sure that the same logic is applied (especially for the regex), I will stick to Java.

Here is the modified Spark class, inspired by the official Word Count example by Apache :
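The modified class itself is not reproduced in this excerpt. As a reference point, here is a minimal sketch close to Apache’s official JavaWordCount example (Spark 2.x Java API, needs a Spark installation to run) ; the author’s modified version additionally applies the memory-saving transformations he describes, which are not shown here.

```java
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;
import java.util.regex.Pattern;

public final class SparkWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);
            // Split lines into words, pair each word with 1, then sum per word.
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(SPACE.split(line)).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);
        }
    }
}
```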

Benchmark Results

Spark did not turn out to be much faster on Standalone than on YARN : only a few seconds faster on the small datasets, and about 1 minute faster on the “huge” dataset. I have included in my benchmark the YARN results for 2 to 4 slaves, plus the Standalone results for 4 slaves only, since the difference is small.

For a given dataset size, lower values are better. Hover over the bars to get the exact values.

Spark on YARN vs Spark on Standalone

In this environment and for this type of task, Spark Standalone turned out to be slightly faster than Spark on YARN. But running Spark on top of YARN has the following advantages :

The cluster’s resources are managed together with those of other YARN applications.

Spark can only scale its number of executors dynamically when running on YARN (using the spark.dynamicAllocation.* properties) ; this is not yet supported on Spark Standalone.
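For reference, dynamic allocation on YARN is typically switched on in spark-defaults.conf with properties like these (the values are illustrative, and the external shuffle service must be enabled for it to work) :

```properties
spark.dynamicAllocation.enabled       true
spark.shuffle.service.enabled         true
spark.dynamicAllocation.minExecutors  1
spark.dynamicAllocation.maxExecutors  4
```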

Summary and Conclusion

Here is a summary of all the results :

Efficiency of the Cluster

Using MapReduce :

We need at least 3 slaves to beat the single MinnowBoard (starting with datasets ≥ 10 GB), but it takes 4 slaves to see a clear improvement.

On the “huge” text file with MapReduce, the full cluster is 44% faster than the single node.

Using Spark :

2 slaves are enough to beat the single MinnowBoard.

2 slaves with Spark are faster than 4 slaves with MapReduce.

On the “huge” text file, the full cluster is 380% faster than the single node and 266% faster than MapReduce.

I couldn’t beat the Alienware laptop. Shame on me ! But at least I got quite close to it, and my cluster is 380% faster than the single MinnowBoard I started with. 🙂

Bonus : Word Count without the improvements

These results were based on my version of the WordCount application, which was adapted for devices with little RAM.

But by removing the transformations that I added (the improvements described earlier) and using the basic Word Count provided by Apache, the word dictionary no longer fits into memory on a single PC, which then has to use swap. Without these improvements, the size of the output for the “huge” dataset went from 30 MB to 500 MB, which seems to take almost 4 GB of memory in the JVM.

The MinnowBoard was so slow that I didn’t finish the benchmark (it would have taken days). The Alienware also became a lot slower towards the end, once its heap started to fill up. With this version of Word Count, the cluster, running Spark, won.

The Alienware and MinnowBoard single-PC programs had the advantage of keeping everything in memory, but we can see that their limits are easily reached.

MapReduce and Spark are built to handle these problems, by processing fragments (blocks) of the input and by using shuffle operations instead of keeping everything in the same machine’s memory.

MapReduce on 4 slaves is still the slowest, but it certainly looks stable. And Spark rocks.

Nicolas
