Hadoop for .NET Developers: Implementing a Simple MapReduce Job

In this exercise, we will write and execute a very simple MapReduce job using C# and the .NET SDK. The purpose of this exercise is to illustrate the most basic concepts behind MapReduce.

The job we will create will operate on the integers.txt sample file loaded into the local desktop development environment in previous blog posts. You may remember that the file consists of a list of integer values from 1 through 10,000, with each integer occupying its own line.

The map function we will write will accept a line (a single integer), determine whether the value is even or odd, and emit the integer value with a key of “even” or “odd”, accordingly. The reduce function we will write will accept all of the integer values associated with a given key, i.e. “even” or “odd”, and then count and sum those values. The reduce function will then emit results in a key-value format, with “even” or “odd” again serving as the key and the associated count and sum, separated by a tab, as the value.
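Before involving Hadoop at all, it can help to see the same flow in plain C#. The following is only a local sketch of the idea, not code that uses the .NET SDK: the class name LocalEvenOddSketch and its Map and Run methods are hypothetical stand-ins, and the LINQ GroupBy call plays the role of the shuffle that MapReduce performs between the map and reduce phases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local stand-in for the job; no Hadoop types are used here.
public static class LocalEvenOddSketch
{
    // Map step: one input line in, one (key, value) pair out.
    public static KeyValuePair<string, string> Map(string inputLine)
    {
        int value = int.Parse(inputLine);
        string key = (value % 2 == 0) ? "even" : "odd";
        return new KeyValuePair<string, string>(key, value.ToString());
    }

    // Group the mapped pairs by key (the "shuffle"), then count and sum each group.
    public static SortedDictionary<string, string> Run(IEnumerable<string> lines)
    {
        var results = new SortedDictionary<string, string>();
        var groups = lines.Select(line => Map(line))
                          .GroupBy(pair => pair.Key, pair => pair.Value);
        foreach (var group in groups)
        {
            List<int> numbers = group.Select(v => int.Parse(v)).ToList();
            // Value format: count, a tab, then the sum.
            results[group.Key] = numbers.Count + "\t" + numbers.Sum();
        }
        return results;
    }

    public static void Main()
    {
        // Same shape as integers.txt: 1 through 10,000, one value per line.
        IEnumerable<string> lines = Enumerable.Range(1, 10000).Select(i => i.ToString());
        foreach (var pair in Run(lines))
        {
            Console.WriteLine(pair.Key + "\t" + pair.Value);
        }
    }
}
```

For the integers 1 through 10,000, each key receives 5,000 values; the even values sum to 25,005,000 and the odd values to 25,000,000, so those are the figures to look for when the real job finishes.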

To get started, perform the following steps:

1. Launch Visual Studio and open a new C# console application project.

The environment is now ready to write a MapReduce routine. Our next step will be to write a Mapper class:

1. In the program, navigate to the curly brace that defines the end of the Program class.

2. Add the following class to the code:

    // Requires "using Microsoft.Hadoop.MapReduce;" at the top of the file
    public class MySimpleMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            // interpret the incoming line as an integer value
            int value = int.Parse(inputLine);

            // determine whether value is even or odd
            string key = (value % 2 == 0) ? "even" : "odd";

            // output key assignment with value
            context.EmitKeyValue(key, value.ToString());
        }
    }

In this code, you will notice the Mapper class inherits from the MapperBase class. We override the Map method of that class to write our own Map method. The Map method accepts a line of data from the input file as a string, along with a reference to a MapperContext object, which is the means by which we communicate back to the MapReduce environment.

As the incoming data consists of a single integer value, we simply parse it into an int and then use a modulo calculation to determine whether the value is even or odd, writing “even” or “odd” to the key string variable based on the outcome. This key, along with the original integer value, is returned to the MapReduce environment through the context object using the EmitKeyValue function. Note that the EmitKeyValue function only accepts strings, so the integer value is converted to a string with ToString() as part of the call.
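The Map method above assumes every line parses cleanly, which holds for integers.txt. If an input file might contain blank or malformed lines, one option is to guard the parse. The helper below is a hypothetical sketch (ParityKeySketch and KeyFor are not part of the SDK) that isolates the key decision so it can be tried outside of Hadoop:

```csharp
using System;

// Hypothetical helper; shows the parity test with a guarded parse.
public static class ParityKeySketch
{
    // Returns "even" or "odd" for a line holding an integer,
    // or null when the line is blank or not an integer.
    public static string KeyFor(string inputLine)
    {
        int value;
        if (!int.TryParse(inputLine, out value))
        {
            return null; // the caller can skip the line instead of failing
        }
        // Comparing the remainder to 0 also handles negative values,
        // because -3 % 2 evaluates to -1 in C#, which is still non-zero.
        return (value % 2 == 0) ? "even" : "odd";
    }
}
```

Inside a Map method, a null result would simply mean no key-value pair is emitted for that line.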

Now we will write the reduce function:

1. In the program, navigate to the curly brace that defines the end of the Mapper class.

Our Reducer class inherits from the ReducerCombinerBase class and overrides the Reduce method. The Reduce method accepts a single key value (as a string) and an enumerable collection of its associated values. The Reduce method also receives a reference to the MapReduce environment through the context object.

The code in the Reduce method is very simple. We loop through the collection of values, incrementing a count and adding up the values as we loop. Once this is done, the original key is returned along with the count and sum values we calculated earlier. Notice that the myCount and mySum variables are not explicitly cast to strings. By concatenating these values with a tab-string delimiter, the + operator performs an implicit conversion for us.
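A reducer matching this description might look like the sketch below. Treat the details as assumptions rather than the definitive listing: the class name MySimpleReducer is hypothetical, and the ReducerCombinerBase and ReducerCombinerContext type names and the Reduce signature are what I would expect from the Microsoft.Hadoop.MapReduce namespace, so check them against the version of the SDK you have installed.

```csharp
using System.Collections.Generic;
using Microsoft.Hadoop.MapReduce; // assumed SDK namespace

// Hypothetical class name; base class and context type are assumed from the SDK.
public class MySimpleReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        // running count and sum for this key
        int myCount = 0;
        int mySum = 0;

        // values arrive as strings, so parse each one before adding it
        foreach (string value in values)
        {
            mySum += int.Parse(value);
            myCount++;
        }

        // emit the key with "count<TAB>sum"; the + operator converts the ints to strings
        context.EmitKeyValue(key, myCount + "\t" + mySum);
    }
}
```

An int is wide enough here because the largest sum for integers.txt is about 25 million; a long would be the safer choice for larger inputs.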

Now that we have our Mapper and Reducer classes defined, we can create a MapReduce job. To do this, follow these steps:

At the top of the code, we define a HadoopJobConfiguration object to hold job configuration information, such as the folder holding our input files and the folder to which output should be directed. Note that by default a MapReduce job will access all files in the input folder and delete all files in the destination folder. As a result, it is a best practice to use a folder structure and naming scheme that keeps like input files together and keeps outputs from different jobs isolated.

Next, we connect to our local Hadoop cluster. The connection is very simple and in this example we are just using the default user, hadoop, which has a null password.

Next, we execute a MapReduceJob on the cluster and identify our Mapper and Reducer classes as the sources of the Map and Reduce methods it will call. The Execute method has a few overloads which allow us to employ just a Mapper; a Mapper and a Reducer; or a Mapper, a Reducer, and a Combiner. For this simple job, we want the version that uses the Mapper and Reducer. Configuration information is passed to the Execute method and results are captured by the MapReduceResult object.

The remaining code is just for presentation purposes. We inspect the exit code for the job, post a message about it to our console, and hold the console in place until the user presses a key.
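Putting the last four paragraphs together, the Main method could be sketched roughly as follows. Only HadoopJobConfiguration, MapReduceJob, Execute, and MapReduceResult are named in the text above; the InputPath and OutputFolder properties, the Hadoop.Connect call, the IHadoop type, the http://localhost address, and the Info.ExitCode property are my assumptions about the SDK, and MySimpleReducer is a hypothetical name for the Reducer class. The folder paths are the ones used elsewhere in this post.

```csharp
using System;
using Microsoft.Hadoop.MapReduce; // assumed SDK namespace

class Program
{
    static void Main(string[] args)
    {
        // job configuration: input folder and output folder (property names assumed)
        HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
        myConfig.InputPath = "/demo/simple/in";
        myConfig.OutputFolder = "/demo/simple/out";

        // connect to the local cluster as the default hadoop user with a null password
        Uri myUri = new Uri("http://localhost"); // assumed local cluster address
        string userName = "hadoop";
        string passWord = null;
        IHadoop myCluster = Hadoop.Connect(myUri, userName, passWord);

        // run the job using the Mapper-and-Reducer form of Execute
        MapReduceResult jobResult =
            myCluster.MapReduceJob.Execute<MySimpleMapper, MySimpleReducer>(myConfig);

        // report the exit code (0 is treated as success) and wait for a key press
        int exitCode = jobResult.Info.ExitCode;
        string exitStatus = (exitCode == 0) ? "Success" : "Failure";
        Console.WriteLine();
        Console.Write("Exit Code = " + exitCode + " (" + exitStatus + ")");
        Console.ReadKey();
    }
}
```

Because the job clears the destination folder by default, pointing OutputFolder at a folder dedicated to this job keeps other results out of harm's way.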

Now it’s time to run the job:

1. From Visual Studio, click the Start button.

2. After a few seconds, the console should appear.

3. Once the job has completed, review the console to verify the job completed successfully.

Now that the job is completed, it’s time to inspect the results:

1. From the desktop, launch the Hadoop Command Prompt.

2. Inspect the contents of the /demo/simple/out folder using the ls function as described in previous posts.

3. Inspect the contents of the /demo/simple/out/part-00000 file using the cat function as described in previous posts.

In this example, MapReduce determined that a single reduce task was adequate to execute that phase of this job. For that reason, there is only one part- file in the output folder.

Sorry, but I do not recognize that exit code. Have you set up a local development environment as described in my earlier post? Also, have you downloaded the integers.txt sample and loaded it to the /demo/simple/in directory of your local cluster? Finally, does the /demo/simple/out directory exist? (It should.)

For a little more insight into what's going on, you might review your cluster's logs. First, capture the name of your job as printed in your console screen. (In the first screenshot in this post, you can see my job was named job_201309070913_0001.) Then, navigate to http://localhost:50030. Scroll down that page and you should find your job by name in the Completed Jobs section. Click the job name to open the log for that job and dig around to see if you can find the issue. Hope this helps.

Regarding your second question, the code here is only intended to be run on a local Hadoop cluster. If you are attempting to run this against an HDInsight cluster in Azure, please wait for my upcoming post, where I'll address how to run something similar against a cloud-based cluster. There are some differences in how you connect and also in how you configure your environment.

Regarding a Windows Azure account, I'm not sure what the policy is for Ghana, but please visit the Windows Azure web site as described in my post on setting up Azure clusters and see where that leads you.

@alex – This code is written using C#. As such, it requires a Windows-based Hadoop cluster to run. Microsoft HDInsight is one such cluster and Hortonworks HDP for Windows (on which HDInsight is based) is another.

If you are using Hadoop Streaming (which is what we are doing here), you have native support for line-oriented text files and JSON. For anything else, you need to work with native Java or fudge it a bit by pulling your data in within the code of your map() function.