Understanding MapReduce via Boggle

Graph theory is a growing part of Big Data. Using graph theory, we can find relationships in networks.

MapReduce is a great platform for traversing graphs, so we can leverage the power of an Apache Hadoop cluster to efficiently run an algorithm on the graph.

One such graph problem is playing Boggle. Boggle is played by rolling a group of 16 dice. Each player’s job is to find as many words as possible spelled out by the dice. The dice are six-sided, with a single letter facing up:

A Boggle roll creates a 4×4 matrix of nodes as the dice settle into their slots. Below is an example of the nodes in a Boggle roll.

Once we get into the dice relationships, things get interesting. Each die is related to the one above, below, to the left and right and the diagonals of each side. The diagram below shows the relationships for each die in the roll.
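That adjacency rule can be sketched in plain Java. This is a simplified illustration, not the code from the GitHub project: for a die at a given row and column, every in-bounds cell whose row and column each differ by at most one, excluding the die itself, is a neighbor.

```java
import java.util.ArrayList;
import java.util.List;

public class BoggleNeighbors {
    // Returns the [row, col] pairs adjacent to (row, col) in a size x size roll
    public static List<int[]> neighbors(int row, int col, int size) {
        List<int[]> result = new ArrayList<int[]>();
        for (int r = row - 1; r <= row + 1; r++) {
            for (int c = col - 1; c <= col + 1; c++) {
                boolean inBounds = r >= 0 && r < size && c >= 0 && c < size;
                boolean isSelf = (r == row && c == col);
                if (inBounds && !isSelf) {
                    result.add(new int[] { r, c });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A corner die has 3 neighbors, an edge die 5, and an interior die 8
        System.out.println(neighbors(0, 0, 4).size()); // 3
        System.out.println(neighbors(0, 1, 4).size()); // 5
        System.out.println(neighbors(1, 1, 4).size()); // 8
    }
}
```

Corner dice have three relationships, edge dice five, and interior dice eight, which is why the graph fans out so quickly.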

Let’s look at an example of traversing this graph. In this matrix, let’s say that the letter “U” is at position [0,0]. The letter “L” directly beneath it is at position [0,1]. Let’s try to spell out a word that starts with the letter “L” at [0,1]. Looking at the relationships, we have five possible letters to be the next letter in our word. Let’s go with the letter “A” at [1,0] as shown by the blue line. Our word so far is “LA”.

Continuing on from the letter “A” at [1,0], we only have four possible letters to be the next letter. We can’t use the “L” at [0,1] because the rules of Boggle forbid reusing the same letter or node. We choose the “T” at [1,1], again shown by the blue line. Our word so far is “LAT”. Let’s skip ahead by choosing the “T” at [1,3], the “E” at [0,3], and the “R” at [1,4], which are all shown by the blue line. Our final word is “LATTER”.

Looking at this graph problem with human eyes makes it seem easy. Turning it into an algorithm that finds all of the words and scales, though, is a difficult problem. The way the graph’s relationships expand, this is a complex challenge with factorial growth. However, using Hadoop and MapReduce, we can create an efficient and scalable solution.

Coding

The full and working solution to this problem is on my GitHub account. The code includes an Eclipse project for trying out the code yourself.

Let’s take a look at the driver code for the graph traversal. Note that for brevity, all of these code snippets will have portions omitted. Please consult the code on GitHub for the unedited code.

// Traverse the graph until it is completely traversed
do {
    Job job = new Job(configuration);

    FileInputFormat.setInputPaths(job, getPath(input, iteration));
    FileOutputFormat.setOutputPath(job, getPath(input, iteration + 1));

    // Create a Map-only job
    job.setMapperClass(BoggleMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(RollGraphWritable.class);

    boolean success = job.waitForCompletion(true);

    iteration++;
} while (true); // Exit check omitted for brevity; see the code on GitHub

This driver code is going to run a single node’s worth of relationships per iteration. In the example above, the “L” going to the “A” is one iteration and the “A” going to the “T” is another iteration. The iterations continue until the entire graph is traversed or all the possible words are found.

Notice the iterations are Map-only jobs; the extra overhead of the reducer isn’t necessary. The output from one iteration is fed into the next iteration to continue to traverse the graph.
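The getPath helper isn’t shown above. A minimal sketch of how per-iteration input and output paths might be derived follows; the method name comes from the snippet, but this naming scheme is hypothetical, and the actual implementation is in the GitHub code.

```java
public class IterationPaths {
    // Hypothetical sketch: derive a per-iteration directory under a base path
    public static String getPath(String base, int iteration) {
        return base + "/iteration-" + iteration;
    }

    public static void main(String[] args) {
        // Iteration n reads from directory n and writes to directory n + 1
        System.out.println(getPath("boggle", 0)); // boggle/iteration-0
        System.out.println(getPath("boggle", 1)); // boggle/iteration-1
    }
}
```

Keeping each iteration’s output in its own directory is what lets the next Map-only job pick up exactly where the previous one left off.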

We need to create a custom writable for the value to handle the nodes in our graph. This is the RollGraphWritable class.

ArrayList<Node> nodes = new ArrayList<Node>();

boolean isFinal;

RollGraphWritable keeps track of the nodes that make up the word in an ArrayList. It also keeps track of the RollGraphWritable‘s status to see if its child or adjacent nodes have been traversed. Once the RollGraphWritable‘s children have been emitted, its isFinal flag is marked as true.
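Hadoop custom writables serialize themselves through write and readFields methods that take java.io.DataOutput and java.io.DataInput. Below is a simplified, self-contained sketch of how RollGraphWritable’s two fields might be encoded; the real class on GitHub implements Hadoop’s Writable interface, and the exact field layout here is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;

public class RollGraphSketch {
    public ArrayList<short[]> nodes = new ArrayList<short[]>(); // [row, column] pairs
    public boolean isFinal;

    // Serialize the node count, each node's row and column, then the flag
    public void write(DataOutput out) throws IOException {
        out.writeInt(nodes.size());
        for (short[] node : nodes) {
            out.writeShort(node[0]);
            out.writeShort(node[1]);
        }
        out.writeBoolean(isFinal);
    }

    // Read the fields back in exactly the order they were written
    public void readFields(DataInput in) throws IOException {
        nodes.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            nodes.add(new short[] { in.readShort(), in.readShort() });
        }
        isFinal = in.readBoolean();
    }

    public static void main(String[] args) throws IOException {
        RollGraphSketch original = new RollGraphSketch();
        original.nodes.add(new short[] { 0, 1 });
        original.nodes.add(new short[] { 1, 0 });
        original.isFinal = true;

        // Round-trip through a byte stream to verify the encoding
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        RollGraphSketch copy = new RollGraphSketch();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.nodes.size() + " " + copy.isFinal); // 2 true
    }
}
```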

The Node object just represents the place in the Boggle roll from which the individual letters are taken.

public short row;

public short column;

The Node object simply encodes the place in the matrix of the letter’s row and column position. This is just like the method I described previously when manually traversing the graph.

This mapper code chunk handles iterating over the node’s relationships with a nested for loop: the outer loop iterates over the adjacent rows, and the inner loop iterates over the adjacent columns. When a viable Node (one not outside the bounds of the matrix) is found, we check to see if that Node is already in the RollGraphWritable. This check is important because we can’t reuse the same node as we traverse the graph. If that Node isn’t in the list, the next Node is added to the list and emitted.
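Stripped of the Hadoop plumbing, the core of that mapper step might look like the sketch below. The names and path representation are illustrative, not the project’s: each call takes the current path of nodes and returns every legal one-node extension, which is what the real mapper would emit.

```java
import java.util.ArrayList;
import java.util.List;

public class TraverseStep {
    // One traversal step: extend the current path by every adjacent, unused node
    public static List<List<int[]>> extend(List<int[]> path, int size) {
        List<List<int[]>> extensions = new ArrayList<List<int[]>>();
        int[] last = path.get(path.size() - 1);
        for (int r = last[0] - 1; r <= last[0] + 1; r++) {
            for (int c = last[1] - 1; c <= last[1] + 1; c++) {
                if (r < 0 || r >= size || c < 0 || c >= size) {
                    continue; // outside the bounds of the matrix
                }
                if (contains(path, r, c)) {
                    continue; // can't reuse a node we've already spelled with
                }
                List<int[]> longer = new ArrayList<int[]>(path);
                longer.add(new int[] { r, c });
                extensions.add(longer); // the mapper would emit this path
            }
        }
        return extensions;
    }

    private static boolean contains(List<int[]> path, int row, int col) {
        for (int[] node : path) {
            if (node[0] == row && node[1] == col) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> path = new ArrayList<int[]>();
        path.add(new int[] { 0, 0 });
        // From a corner, with only the corner used, three extensions are possible
        System.out.println(extend(path, 4).size()); // 3
    }
}
```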

The driver will continue to run this mapper code until the entire graph is traversed or all possible words are found. Once this is done, the driver will start a different MapReduce job.

Job job = new Job(configuration);

job.setNumReduceTasks(1);

job.setMapperClass(BoggleWordMapper.class);

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(RollGraphWritable.class);

boolean success = job.waitForCompletion(true);

The second MapReduce job is much simpler and runs only once. It uses the built-in identity reducer, which simply emits whatever the mapper emitted. This reduce will give us all of the words that appear in the Boggle roll, in alphabetical order.

The incoming key is the entire word and the value is a RollGraphWritable that contains the nodes that make up the word.

In the mapper’s setup method, a dictionary containing all the words in the English language is loaded into a HashMap for quick access. Every word that was found by traversing the graph is passed into the mapper. If the word appears in the HashMap, it is emitted and appears in the final list of words in the Boggle roll.
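The heart of that mapper is a single dictionary membership check. Here is a self-contained sketch, using a HashSet for the lookup; the post’s mapper uses a HashMap, but either gives constant-time membership checks, and the class and method names here are illustrative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordFilter {
    private final Set<String> dictionary;

    public WordFilter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Mirrors the mapper: keep a traversed string only if it is a real word
    public boolean map(String word) {
        return dictionary.contains(word);
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("la", "late", "latter"));
        WordFilter filter = new WordFilter(dict);
        System.out.println(filter.map("latter")); // true: traversed and in the dictionary
        System.out.println(filter.map("lat"));    // false: traversed, but not a word
    }
}
```

Only the strings that pass this check reach the identity reducer and the final word list.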

In Part 2, we’ll explore optimizations to improve the performance of the job.

Jesse Anderson is a Curriculum Developer and Instructor for Cloudera University.
