Navigation

You are viewing the documentation for an older version of boto (boto2).

Boto3, the next version of Boto, is now
stable and recommended for general use. It can be used side-by-side with
Boto in the same project, so it is easy to start using Boto3 in your existing
projects as well as new projects. Going forward, API updates and all new
feature work will be focused on Boto3.

At this point the variable conn will point to an EmrConnection object.
In this example, the AWS access key and AWS secret key are passed in to
the method explicitly. Alternatively, you can set the environment variables:

Upon creating a connection to Elastic Mapreduce you will next
want to create one or more jobflow steps. There are two types of steps, streaming
and custom jar, both of which have a class in the boto Elastic Mapreduce implementation.

Creating a streaming step that runs the AWS wordcount example, itself written in Python, can be accomplished by:

Note that this statement does not run the step, that is accomplished later when we create a jobflow.

Additional arguments of note to the streaming jobflow step are cache_files, cache_archive and step_args. The options cache_files and cache_archive enable you to use the Hadoops distributed cache to share files amongst the instances that run the step. The argument step_args allows one to pass additional arguments to Hadoop streaming, for example modifications to the Hadoop job configuration.

Note that this statement does not actually run the step, that is accomplished later when we create a jobflow. Also note that this JarStep does not include a main_class argument since the jar MANIFEST.MF has a Main-Class entry.

The method will not block for the completion of the jobflow, but will immediately return. The status of the jobflow can be determined by:

>>> status=conn.describe_jobflow(jobid)>>> status.stateu'STARTING'

One can then use this state to block for a jobflow to complete. Valid jobflow states currently defined in the AWS API are COMPLETED, FAILED, TERMINATED, RUNNING, SHUTTING_DOWN, STARTING and WAITING.

In some cases you may not have built all of the steps prior to running the jobflow. In these cases additional steps can be added to a jobflow by running:

>>> conn.add_jobflow_steps(jobid,[second_step])

If you wish to add additional steps to a running jobflow you may want to set the keep_alive parameter to True in run_jobflow so that the jobflow does not automatically terminate when the first step completes.

The run_jobflow method has a number of important parameters that are worth investigating. They include parameters to change the number and type of EC2 instances on which the jobflow is executed, set a SSH key for manual debugging and enable AWS console debugging.

By default when all the steps of a jobflow have finished or failed the jobflow terminates. However, if you set the keep_alive parameter to True or just want to halt the execution of a jobflow early you can terminate a jobflow by: