Here are some steps that can solve the resolution problem: create separate Mapper and Reducer classes; declare a package for the classes and use it rather than the default package (foo.Main, foo.MapClass ...); and when you're in Eclipse, try the "Extract required libraries into generated JAR" option instead of "Package required...

For passing multiple files in a streaming step, you need to use file:// to pass the steps as a JSON file. The AWS CLI shorthand syntax uses a comma as the delimiter to separate a list of arguments, so when we try to pass in parameters like "-files","s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py", the shorthand-syntax parser...
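For illustration, a hedged sketch of such a steps file; the mapper/reducer S3 paths come from the question, while the step name, input/output locations, and cluster ID are placeholders:

```json
[
  {
    "Name": "Streaming step",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files", "s3://betaestimationtest/mapper.py,s3://betaestimationtest/reducer.py",
      "-mapper", "mapper.py",
      "-reducer", "reducer.py",
      "-input", "s3://betaestimationtest/input",
      "-output", "s3://betaestimationtest/output"
    ]
  }
]
```

Passing the file this way sidesteps the shorthand parser entirely:

```sh
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps file://./steps.json
```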

The problem you are experiencing is due to Windows paths using the backslash (\), which gets treated as an escape character when the configuration is parsed. Just double it up and you should not have any more problems. Change your mrjob.conf file to:

runners:
  emr:
    aws_access_key_id: xxxxxxxxxxxxx
    aws_region: us-east-1
    aws_secret_access_key: xxxxxxxx
    ec2_key_pair: bzy
    ec2_key_pair_file: C:\\aa.pem
...

In the end, I figured it out (and it was, of course, obvious). Here's how I did it: add a bootstrap action that downloads the JARs on every node. For example, you can upload the JARs to your bucket, make them public, and then do: wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar...
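A minimal sketch of such a bootstrap script, assuming the bucket URL and JAR name from above (both placeholders):

```sh
#!/bin/bash
# Runs on every node while the cluster is starting up.
set -e
# Fetch the publicly readable JAR from the bucket into the home directory.
wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar
```

Upload the script itself to S3 and register it as a bootstrap action when creating the cluster.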

The release notes say that Hive 0.12 supports Hadoop 1.x.y and Hadoop 2.x.y, so you should be good: "15 October, 2013: release 0.12.0 available. This release works with Hadoop 0.20.x, 0.23.x.y, 1.x.y, 2.x.y" ...

A simple mistake on my part was keeping this from running: I had a random semicolon instead of a period in my aws.internal.ip.of.coordinator IP address, and looking at my configs I just didn't see it. The above code will work on an Amazon EMR multi-node cluster similar to the one above. ...

I faced a similar problem recently. From what I researched, it depends: you can get the data out of the "directory" part of an S3 key, but not the "filename" part. You can use partitions if the S3 keys are formatted properly; partitions can be queried the same way as columns. Here is...
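For illustration, a hedged HiveQL sketch of the partition approach; the table name, bucket, and key layout are hypothetical:

```sql
-- Assumes keys laid out as s3://mybucket/logs/dt=2015-01-01/<some-file>
CREATE EXTERNAL TABLE logs (line STRING)
PARTITIONED BY (dt STRING)
LOCATION 's3://mybucket/logs/';

-- Register a partition per "directory"; the dt value is recoverable,
-- the filename part of the key is not.
ALTER TABLE logs ADD PARTITION (dt='2015-01-01')
LOCATION 's3://mybucket/logs/dt=2015-01-01/';

-- The partition column can then be queried like any other column:
SELECT dt, count(*) FROM logs WHERE dt >= '2015-01-01' GROUP BY dt;
```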

Let us differentiate the different layers. There is the infrastructure layer, i.e. the (virtual) machines on which the Spark job should run. Potential options include local clusters of machines or a cluster of virtual machines rented from EC2. Especially when reading or writing a lot of data from/to S3, EC2 could be a...

There is a class boto.emr.bootstrap_action.BootstrapAction for the bootstrap action. Define it like the example below; most of the code is from the boto example page.

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

action = BootstrapAction(name="Bootstrap to add SimpleCV",
                         path="s3n://<my bucket uri>/bootstrap-simplecv.sh")

conn = boto.emr.connect_to_region('us-west-2')
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step], ...

Using roles rather than hard-coded keys is a best practice (http://docs.aws.amazon.com/general/latest/gr/aws-access-keys-best-practices.html). One example of this on EMR: the underlying Hadoop FS calls use the role assigned to the EC2 instance to generate temporary security credentials. Your application can be built to do the same (http://docs.aws.amazon.com/IAM/latest/UserGuide/roles-usingrole-ec2instance.html), such...
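A minimal sketch of that pattern with boto, assuming the code runs on an instance that has an IAM role attached (the bucket name is a placeholder):

```python
import boto

# With no keys supplied, boto falls back to the EC2 instance metadata
# service and picks up the role's temporary credentials automatically,
# so nothing is hard-coded in the application.
conn = boto.connect_s3()

bucket = conn.get_bucket('my-bucket')  # placeholder bucket name
for key in bucket.list(prefix='logs/'):
    print(key.name)
```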

Two things to check: Are you looking in the correct region? Maybe your CLI is starting the cluster in a different region from the one you're looking at in the web console. If you are using different users between the web console and the CLI, are you using the --visible-to-all-users option in...
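A quick sketch of both checks with the AWS CLI; the region value is just an example, and the create-cluster options other than --visible-to-all-users are elided:

```sh
# Does the cluster show up in the region the CLI actually used?
aws emr list-clusters --region us-east-1

# Make the cluster visible to all IAM users on the account when creating it:
aws emr create-cluster --visible-to-all-users ...
```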

Fixed. It turns out that there were multiple Scala jars on the EMR instances, and they weren't coming from my application jar: the 2.10 jar was hiding in /usr/share/aws/emr/emrfs/lib, apart from the installed location of the 2.11 binaries under /usr/share/scala. So I got rid of the 2.10 jar in...
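If you need to hunt for the same conflict, a sketch of how to locate stray Scala library jars on a node (the two directories are the ones named above):

```sh
find /usr/share/aws/emr/emrfs/lib /usr/share/scala -name 'scala-library*.jar'
```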

So, three things you can do: Use the web interface; Amazon gives you access to this as detailed here. Run the query in screen, and then if you get disconnected, just reconnect and reattach to your previous session. You can also point the logging to some file instead...
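A sketch of the screen workflow; the session and file names are placeholders:

```sh
screen -S hive-session                    # start a named session
hive -f long_query.hql > query.log 2>&1   # run the query, logging to a file
# ...if the connection drops, SSH back in and reattach:
screen -r hive-session
```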

You are getting an error because exchange is a keyword: it is used to move the data in a partition from a table to another table that has the same schema but does not already have that partition. For details, see the Hive Language Manual and HIVE-4095.
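To illustrate both sides of this, a hedged sketch; the table and column names are hypothetical:

```sql
-- What the keyword is reserved for: moving a partition's data between
-- two tables with the same schema (Hive Language Manual, HIVE-4095).
ALTER TABLE table_1 EXCHANGE PARTITION (dt='2015-01-01') WITH TABLE table_2;

-- If you really need an identifier literally named "exchange",
-- quote it with backticks instead:
SELECT `exchange` FROM rates;
```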

Once you click the ApplicationMaster link on the ResourceManager webpage, you'll be redirected to the ApplicationMaster web UI. Since EMR uses EC2 instances, each instance has two IP addresses associated with it: one used for private communication and another that is public. EMR uses private IP addresses (private DNS) to set up...
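A common workaround, sketched here with placeholder key and host names: open an SSH tunnel to the master node and browse through it as a SOCKS proxy, so the private DNS names in those links resolve from inside the cluster.

```sh
# Dynamic port forwarding to the master node's public DNS name.
ssh -i ~/mykey.pem -N -D 8157 hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# Then point the browser at localhost:8157 as a SOCKS proxy
# (e.g. via a proxy-switching extension) and reload the ApplicationMaster link.
```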

It seems you are looking for MultipleOutputFormat. There is already implementation code here (link1) and a complete explanation with example code here (link2). Just map your output file to the input filename or whatever you wish; the files will be written as "/outputfolder/part-nnnnn" for each group (the name "part" can be changed), where nnnnn is...

So in the end I solved this by simply downloading Pig 0.14 to all of the machines in the bootstrap script and overriding PIG_HOME with my Pig 0.14 location in ~/.bashrc, and it worked for me. (At least for using Pig 0.14 when I'm connected...
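A hedged sketch of that bootstrap idea; the download URL and install paths are assumptions:

```sh
#!/bin/bash
set -e
# Download and unpack Pig 0.14 on every node (assumed Apache archive URL).
wget https://archive.apache.org/dist/pig/pig-0.14.0/pig-0.14.0.tar.gz -O /tmp/pig-0.14.0.tar.gz
tar -xzf /tmp/pig-0.14.0.tar.gz -C /home/hadoop

# Point PIG_HOME at the new install so it wins over the preinstalled Pig.
echo 'export PIG_HOME=/home/hadoop/pig-0.14.0' >> ~/.bashrc
echo 'export PATH=$PIG_HOME/bin:$PATH' >> ~/.bashrc
```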

If you are using the bootstrap action from https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark, the configuration is set up for Spark on YARN, so just set the master to yarn-client or yarn-cluster. Be sure to define the number of executors along with their memory and cores. More details about Spark on YARN are at https://spark.apache.org/docs/latest/running-on-yarn.html. An addition regarding executor settings for...
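For example, a sketch of a submit command with explicit executor sizing; the numbers and application name are illustrative only:

```sh
spark-submit --master yarn-client \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_app.py
```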

I was able to get this to work using the following format in the step field of EMRActivity. Basically, I changed -d to -hiveconf, and changed the substitution in the Hive script accordingly (a variable passed with -hiveconf is referenced as ${hiveconf:varname} rather than plain ${varname}). I think this is a change made in a newer version of Hive. Below is the changed working code: ...
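For illustration, a hedged sketch of an EMRActivity step of that shape; the script path and variable name are placeholders, and the s3://elasticmapreduce/... library paths are assumptions based on the standard EMR script-runner layout:

```
s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--args,-f,s3://mybucket/query.hql,-hiveconf,INPUT=s3://mybucket/input/
```

Inside query.hql, the value would then be read as ${hiveconf:INPUT}.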