
Here’s a bunch of commands for setting up the Amazon EC2 Linux environment. Basically I’m trying to install Python, R, and the most common Python packages. The available resources are limited compared with setting up an Ubuntu system (where everything is much more straightforward). I still have a problem with rpy2 and hope to fix it soon.

Besides that, here’s something really annoying: with Amazon Elastic MapReduce there is no direct or easy way to use such a personalized image. Basically, EMR is based on Amazon EC2 Linux, so you can’t set it up like Ubuntu. It supports four Amazon EC2 instance families: Standard, High-CPU, High-Memory, and Cluster Compute Instances. To customize each instance that will run the MapReduce job for you, you need to use a Bootstrap Action, which basically submits a bash command, like the ones below, to set up the environment every time before the job starts.

In my original MapReduce algo, the mapper simply loads the JSON, creates an index as the key, and passes it to the reducer; the reducer is coded in Python, calls R via rpy2, and builds a random forest in R. In my case, to bypass the Bootstrap Action, using a Python script as the mapper and an R script as the reducer could be an option.
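As a rough illustration of the mapper side, here is a minimal sketch of a Hadoop Streaming style mapper in Python. It assumes one JSON object per input line and that the key lives in a field called "index"; that field name and the function names are made up for the example, not taken from my actual job.

```python
import json
import sys


def map_line(line, key_field="index"):
    """Turn one JSON record into a 'key<TAB>value' line, or None if the line is blank.

    key_field is a hypothetical field name; use whatever your records carry.
    """
    line = line.strip()
    if not line:
        return None
    record = json.loads(line)
    key = record[key_field]
    # Streaming reducers receive lines of the form key<TAB>value, grouped by key.
    return f"{key}\t{json.dumps(record, sort_keys=True)}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write key/value lines to stdout."""
    for line in stdin:
        out = map_line(line)
        if out is not None:
            stdout.write(out + "\n")
```

To run it as a real streaming mapper, call main() at the bottom of the script. The reducer (the R script, in the option described above) would then read the same tab-separated lines from its standard input.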

Elastic MapReduce doesn’t support custom AMIs at this time. Instead, the service has a feature called “Bootstrap Actions” that lets you pass a reference to a script stored in Amazon S3, plus related arguments, to Elastic MapReduce when creating a job flow. This script is executed on each job flow instance before the actual job flow runs. This post describes how to create bootstrap actions: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=3938&categoryID=265 (section “Creating a Job Flow with Bootstrap Actions”)
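To make the "script in S3 plus arguments" idea concrete, here is a small sketch that only builds the description of one bootstrap action as a plain Python dict. The shape is my assumption of what boto3's EMR run_job_flow call takes in its BootstrapActions list (this is not the tool from the linked post), and the bucket, script name, and arguments are hypothetical.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Describe one bootstrap action: a script in S3 plus its arguments.

    Assumed to match one entry of the BootstrapActions list accepted by
    boto3's EMR run_job_flow; nothing is sent to AWS here.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # S3 location of the setup script
            "Args": list(args),      # arguments passed to the script on each instance
        },
    }


# Hypothetical bucket and script, for illustration only.
action = bootstrap_action(
    "install-python-and-r",
    "s3://my-example-bucket/bootstrap/setup_env.sh",
    ["--with-rpy2"],
)
```

The resulting dict would go into a list passed when the job flow is created, so that every instance runs the script before the job starts.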