01101011 01101111 01100100 01101001 01101110 01100111

Fun with MapReduce (part II)

In this post, I describe how to use Amazon’s S3 cloud storage to set up your environment before you run the MapReduce job you created in Part I.

Step 1
Go to your S3 console and create a bucket. Bucket names must be globally unique across all of S3 (meaning if I name my bucket “bucket-of-awesome-posts”, then you cannot create one with the same name).
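
Since a rejected name costs a round trip to the console, it can help to sanity-check a candidate name locally first. The sketch below is only a rough pre-check, assuming the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, hyphens and dots; starting and ending with a letter or digit). The class BucketNameCheck and its method are hypothetical helpers, not part of any AWS library, and the check does not cover every rule (for example, names formatted like IP addresses).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the AWS SDK.
final class BucketNameCheck {
    // Basic rules only: 3-63 characters, lowercase letters, digits,
    // hyphens and dots, starting and ending with a letter or digit.
    private static final Pattern BASIC_RULES =
            Pattern.compile("[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]");

    // Returns true if the name passes the basic character and length rules.
    static boolean looksValid(String name) {
        return name != null && BASIC_RULES.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksValid("bucket-of-awesome-posts"));
        System.out.println(looksValid("Bucket_Of_Awesome_Posts"));
    }
}
```

A name that passes this check can still be refused if someone else already owns it; only S3 itself can tell you that.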

Step 2
Create the following four folders in the bucket:
– data: this will hold the input data for your MapReduce job (e.g. the text file used by our WordCount program)
– job: this will hold the WordCount app’s jar.
– logs: AWS EMR will automatically write your program’s logs here, under a subfolder named after the MapReduce Job ID (which you can find in the EMR console)
– results: this is where your MapReduce results will go
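
To keep the four locations consistent across your code, you could build them from the bucket name in one place. This is a minimal sketch using only string concatenation; the class S3JobLayout and its method names are made up for illustration, and the s3n:// scheme is assumed because it is the one used in the Part I job.

```java
// Hypothetical helper for illustration; the names are not from Part I.
final class S3JobLayout {
    private final String bucket;

    S3JobLayout(String bucket) {
        this.bucket = bucket;
    }

    // Builds an s3n:// URI for a folder (or object key) inside the bucket.
    String uri(String key) {
        return "s3n://" + bucket + "/" + key;
    }

    String data()    { return uri("data"); }     // input for the job
    String job()     { return uri("job"); }      // the WordCount jar
    String logs()    { return uri("logs"); }     // EMR log output
    String results() { return uri("results"); }  // job output

    public static void main(String[] args) {
        S3JobLayout layout = new S3JobLayout("bucket-of-awesome-posts");
        System.out.println(layout.data());
        System.out.println(layout.job());
        System.out.println(layout.logs());
        System.out.println(layout.results());
    }
}
```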

Now recall that in the Part I post we had these two lines:

FileInputFormat.addInputPath(conf, new Path("s3n://[S3_BUCKET]/data/someFile"));
FileOutputFormat.setOutputPath(conf, new Path("s3n://[S3_BUCKET]/results"));

[S3_BUCKET] refers to the bucket you created in step 1. Notice that the input path uses the bucket’s data subfolder, which you created in step 2, and the output path uses the results subfolder. You may also specify additional subfolders under /results programmatically; they will be created automatically when the job writes its output.
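
One reason to add a subfolder under the results folder is that Hadoop refuses to start a job whose output directory already exists, so a fixed output path only works for the first run. The sketch below shows one possible approach: give each run its own timestamped subfolder. The class RunOutputPath and the "run-" prefix are my own choices for illustration, and java.time assumes Java 8 or later.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of Hadoop or the Part I code.
final class RunOutputPath {
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    // Builds a per-run output location under the bucket's results folder.
    static String forRun(String bucket, LocalDateTime startedAt) {
        return "s3n://" + bucket + "/results/run-" + STAMP.format(startedAt);
    }

    public static void main(String[] args) {
        String out = forRun("bucket-of-awesome-posts", LocalDateTime.now());
        // In the Part I job, this string would take the place of the fixed path:
        // FileOutputFormat.setOutputPath(conf, new Path(out));
        System.out.println(out);
    }
}
```

You do not need to create the run subfolder in the S3 console beforehand.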

By the way, here’s a video of the steps in this post.
This post covers the first 4 minutes 30 seconds of the video. Hopefully you’ll be better able to follow along now that you’ve read through this post at your own speed. The rest of the video is covered in Part III, but feel free to skip ahead too.