The cache-file option provides a good way to use AWS Elastic MapReduce when you have extra data (as opposed to input data, which is fed to the mapper via stdin), such as a parameter file or other side information. You can also use gzipped input by passing an extra argument so that Hadoop decompresses the data on the fly before handing it to the mapper: -jobconf stream.recordreader.compression=gzip. Here's an example of how to specify the cache file in boto:
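(A minimal sketch; the bucket, key, and script names are made up. The fragment after '#' is the local name the cached file is symlinked to in each task's working directory, so the mapper can simply open('params.txt').)

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

step = StreamingStep(
    name='streaming step with side data',
    mapper='s3n://mybucket/code/mapper.py',
    reducer='s3n://mybucket/code/reducer.py',
    input='s3n://mybucket/input/',
    output='s3n://mybucket/output/',
    cache_files=['s3n://mybucket/params/params.txt#params.txt'],
    step_args=['-jobconf', 'stream.recordreader.compression=gzip'])

conn = EmrConnection()  # credentials picked up from the environment / boto config
jobid = conn.run_jobflow(name='cache-file example',
                         log_uri='s3://mybucket/logs',
                         steps=[step])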

I am using boto to launch the instance and run the user code in the background, but I find it is not so convenient to debug and retrieve error information. Here is what I finally did: use traceback to capture the error, then use SMTP to send the error information back to my email. There must be cleverer ways to do this, ha ~
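(A minimal sketch of the trick, using only the standard library; the SMTP host, addresses, and the run_user_code placeholder are all made up for illustration.)

import smtplib
import traceback
from email.mime.text import MIMEText

def run_user_code():
    # placeholder for whatever the user-data script actually does
    raise RuntimeError('something went wrong on the instance')

def mail_error(body, to_addr='me@example.com', from_addr='instance@example.com'):
    # email the captured traceback back to myself
    msg = MIMEText(body)
    msg['Subject'] = 'EC2 user-data script failed'
    msg['From'] = from_addr
    msg['To'] = to_addr
    server = smtplib.SMTP('smtp.example.com')  # placeholder SMTP relay
    server.sendmail(from_addr, to_addr, msg.as_string())
    server.quit()

try:
    run_user_code()
except Exception:
    mail_error(traceback.format_exc())
    raise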

I am still having the issue using rpy2 in the user data sent to an EC2 instance. Everything works well when I open a terminal on an instance created from my own image, but it fails when the user data is sent via boto. I ran several tests of rpy2, such as python -m rpy2.tests, and they went through fine, but it still does not work.
Weird errors like:

Figured out that it is due to the mapper eliminating records (not outputting anything) when they have missing features. Some parts of the data can have a huge share of missing features, which causes the MapReduce status to go without updates for a long time. Basically, the error means that the task stayed in the map or reduce phase for more than the allowed time with no stdin/stdout activity.

Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.

Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive. In scenarios where the application takes a significant amount of time to process individual key/value pairs, this is crucial since the framework might assume that the task has timed-out and kill that task. Another way to avoid this is to set the configuration parameter mapred.task.timeout to a high-enough value (or even set it to zero for no time-outs).
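In Hadoop Streaming (which is what the boto jobs above use), a script cannot call the Reporter API directly, but it can emit the same information by writing specially formatted reporter: lines to stderr. A minimal Python mapper sketch:

import sys

def report_status(message):
    # Hadoop Streaming interprets stderr lines of this form as status updates,
    # which resets the task's inactivity timer.
    sys.stderr.write('reporter:status:%s\n' % message)

for n, line in enumerate(sys.stdin, 1):
    # ... process the record; it is fine to emit no output for some records ...
    if n % 10000 == 0:
        report_status('processed %d records' % n)

Counters work the same way, via lines of the form reporter:counter:<group>,<counter>,<amount>. Alternatively, pass -jobconf mapred.task.timeout=0 in the step arguments, as described above.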

Is it possible to use a customized AMI for Elastic MapReduce on AWS? Elastic MapReduce doesn't support custom AMIs at this time. The service instead has a feature called "Bootstrap Actions" that allows you to pass a reference to a script stored in Amazon S3, along with related arguments, to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265 (section "Creating a Job Flow with Bootstrap Actions")

I would like to be able to access and use HDFS directly instead of having to worry about using an S3 bucket for initial or intermediate IO. I am worried about the IO performance of S3 compared to HDFS. I have seen multiple posts that say it doesn't matter and others that say it can.

HDFS and S3 provide different benefits: HDFS has lower latency, but S3 has higher durability. For long-term storage (without compute), S3 is the cheaper option.

Would people recommend using EMR or EC2 with a Hadoop 0.20 image for doing something like this?

EMR is highly tuned to offer the best performance possible with S3.

Does the EMR setup support using the HDFS like this with custom JARs?

Definitely. Intermediate data is stored in HDFS unless you configure things otherwise. You are able to choose whether to use HDFS or S3 for your initial data.
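With boto's StreamingStep, the choice is just a matter of the URI scheme on each path. A sketch with made-up bucket and path names:

from boto.emr.step import StreamingStep

step = StreamingStep(
    name='process',
    mapper='s3n://mybucket/code/mapper.py',
    reducer='s3n://mybucket/code/reducer.py',
    input='s3n://mybucket/input/',       # initial data read from S3
    output='hdfs:///intermediate/run1')  # output kept in HDFS on the cluster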

Bootstrap Actions is a feature in Amazon Elastic MapReduce that provides users a way to run custom set-up prior to the execution of their job flow. Bootstrap Actions can be used to install software or configure instances before running your job flow.

Q: How can I use Bootstrap Actions?

You can write a Bootstrap Action script in any language already installed on the job flow instance, including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a job flow. Please refer to the "Developer's Guide": http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ for details on how to use Bootstrap Actions.
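For instance, here is roughly how a bootstrap script stored in S3 can be referenced from boto when starting a job flow (a sketch; the S3 locations are placeholders, and you would upload your own script first):

from boto.emr.connection import EmrConnection
from boto.emr.bootstrap_action import BootstrapAction

conn = EmrConnection()  # credentials from the environment / boto config

setup = BootstrapAction(
    name='install-packages',
    path='s3://mybucket/bootstrap/setup.sh',  # placeholder script location
    bootstrap_action_args=[])

jobid = conn.run_jobflow(
    name='job flow with bootstrap action',
    log_uri='s3://mybucket/logs',
    bootstrap_actions=[setup],
    steps=[])  # add your StreamingStep(s) here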

Q: How do I configure Hadoop settings for my job flow?

The Elastic MapReduce default Hadoop configuration is appropriate for most workloads. However, based on your job flow’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your job flow tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your job flow on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.
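As a sketch of the predefined Configure Hadoop bootstrap action in boto (check the Developer's Guide for the exact flags; here '-m' is assumed to set a mapred-site.xml property, raising the task timeout discussed earlier):

from boto.emr.bootstrap_action import BootstrapAction

configure_hadoop = BootstrapAction(
    name='configure-hadoop',
    path='s3://elasticmapreduce/bootstrap-actions/configure-hadoop',
    bootstrap_action_args=['-m', 'mapred.task.timeout=1800000'])
# then pass bootstrap_actions=[configure_hadoop] to run_jobflow as above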

*If using PuTTY:
Generate the key pair and save the private key (the ec2.pem file); this is important.
Download PuTTYgen, import the ec2.pem file, and select "Type of key to generate": SSH-2 RSA.
Then save the private key, load it into PuTTY, and copy the 'Public DNS' of your running instance into PuTTY's Host Name field.
One difference from Amazon Linux on EC2 (which uses 'ec2-user'): here the Connection -> Data -> Auto-login username is 'ubuntu'.

mount /dev/sdf /backups
# Edited rc.local:
# added "mount /dev/sdf /backups" at the end, which mounts the EBS volume at boot.

Linux Devices: /dev/sdf through /dev/sdp
Note: newer Linux kernels may rename your devices to /dev/xvdf through /dev/xvdp internally, even when the device name entered here (and shown in the details) is /dev/sdf through /dev/sdp.

# To update an Amazon EC2 instance, which starts out with a default set of software, and to install security updates and other pieces of software, run the following commands, in order:
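(The original list of commands is not reproduced here; as a minimal sketch for Amazon Linux, the update step is the first command below, and the install line is just an example, with placeholder package names.)

sudo yum update -y
# install additional software as needed; package names here are examples
sudo yum install -y gcc gcc-c++ python-devel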

Here's a bunch of commands for setting up the Amazon EC2 Linux environment, basically trying to install Python, R, and the most common Python packages. The resources are limited compared to setting up an Ubuntu system (where everything is much more straightforward). I still have a problem with rpy2 and hope to fix it soon. Besides that, here's something really bothering me: with Amazon Elastic MapReduce there is no direct or easy way to use such a personalized image. EMR is based on Amazon EC2 Linux, so you can't set it up like Ubuntu. It supports four Amazon EC2 instance families: Standard, High-CPU, High-Memory, and Cluster Compute. To customize each instance that will do the MapReduce job for you, you need to use a Bootstrap Action, which basically submits a bash script, like the one below, to set up the environment every time before the job starts.
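(A hypothetical sketch of such a bootstrap script: the package manager and package names depend on the EMR AMI in use, and the script would be uploaded to S3, e.g. s3://mybucket/bootstrap/setup.sh, before being referenced as a Bootstrap Action.)

#!/bin/bash
# Hypothetical bootstrap script; runs on every instance before the job flow starts.
set -e
sudo apt-get -y update
# Example packages for a Python + R setup; actual names/availability may differ.
sudo apt-get -y install python-dev r-base-core
# Building rpy2 (and any other Python packages) would follow here.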

In my original MapReduce algorithm, the mapper simply loads the JSON, creates an index as the key, and passes the record to the reducer; the reducer, coded in Python, calls R via rpy2 to build a random forest in R. In my case, to bypass this Bootstrap Action, using a Python mapper and an R script as the reducer could be an option.
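A minimal sketch of such a mapper, assuming one JSON object per input line and a hypothetical 'index' field as the key (both assumptions, not the original code):

import sys
import json

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    key = record.get('index')  # hypothetical key field
    if key is None:
        continue  # skipped record; report progress elsewhere so the task is not killed
    sys.stdout.write('%s\t%s\n' % (key, json.dumps(record)))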
