Details

This example shows how to use Hadoop Streaming to count the number of times each word occurs in a text collection. Hadoop Streaming lets you run MapReduce programs written in languages such as Python, Ruby, and PHP.

In order to run a Hadoop Streaming job with Amazon Elastic MapReduce, the program must first be uploaded to Amazon S3. This can be done with tools such as s3cmd or the Firefox plugin S3 Organizer. Luckily, this word count example has already been uploaded to Amazon S3 at the location:
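If you wanted to upload your own copy of the script with s3cmd, the command would look something like the following (my-bucket and the key prefix are placeholders for your own bucket and path):

```shell
# Upload the mapper script to your own S3 bucket
# (my-bucket/wordcount/ is a hypothetical destination)
s3cmd put wordSplitter.py s3://my-bucket/wordcount/wordSplitter.py
```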

s3://elasticmapreduce/samples/wordcount/wordSplitter.py

This can be run on Amazon Elastic MapReduce using the AWS Management Console (https://console.aws.amazon.com). Choose the Amazon Elastic MapReduce tab, click the "Create New Job Flow" button, and then choose the word count example.

You'll notice that the word count example uses the built-in reducer called aggregate. This reducer adds up the counts of words emitted by the wordSplitter map function. It knows to treat each value as a Long from the prefix the mapper attaches to each word.
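To make the prefix mechanism concrete, here is a minimal sketch of a streaming mapper in the spirit of wordSplitter.py. It is an illustrative re-creation, not the actual script: the exact tokenization and casing in the real wordSplitter.py may differ. The "LongValueSum:" prefix is what tells the aggregate reducer to sum the tab-separated counts as Longs.

```python
#!/usr/bin/env python
import re
import sys

def map_words(lines):
    """Emit one 'LongValueSum:<word>\t1' record per word.

    The LongValueSum: prefix instructs Hadoop's aggregate reducer to
    sum the values for each key as a Long.
    """
    out = []
    pattern = re.compile(r"[\w']+")
    for line in lines:
        for word in pattern.findall(line):
            out.append("LongValueSum:%s\t1" % word.lower())
    return out

if __name__ == "__main__":
    # Hadoop Streaming feeds input splits on stdin and reads
    # key/value records from stdout.
    for record in map_words(sys.stdin):
        print(record)
```

The aggregate reducer then groups these records by key and emits each word with its summed count, so no custom reducer code is needed.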

It is also possible to run this example from the command line using the Elastic MapReduce Command Line Ruby Client. (Make sure you replace my-bucket in the output parameter with the name of one of your own Amazon S3 buckets.)
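A plausible invocation with the Ruby client would look something like the following. Treat it as a sketch: the sample input path is an assumption, and my-bucket must be replaced with one of your own buckets.

```shell
# Create a streaming job flow with the word count mapper and the
# built-in aggregate reducer (input path is assumed, not confirmed)
elastic-mapreduce --create --stream \
  --mapper  s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
  --input   s3://elasticmapreduce/samples/wordcount/input \
  --reducer aggregate \
  --output  s3://my-bucket/wordcount/output
```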

Comments

Need more information

A lot of people, including me, are new to Hadoop Streaming. The word count example clarifies a lot of things, but the input data is still not available. Although the job flow is created, executes correctly, and produces output that makes sense, it is not clear what the input looks like.
Shivani