Managing YARN (MRv2) and MapReduce (MRv1)

Note: This page contains references to CDH 5 components or features that have been removed from CDH 6. These references are only applicable if you
are managing a CDH 5 cluster with Cloudera Manager 6. For more information, see Deprecated Items.

CDH supports two versions of the MapReduce computation framework: MRv1 and MRv2, which are implemented by the MapReduce (MRv1) and YARN (MRv2) services. YARN is backwards-compatible with MapReduce: all jobs that run against MapReduce also run in a YARN cluster.

The MapReduce v2 (MRv2) or YARN architecture splits the two primary responsibilities of the JobTracker — resource management and job scheduling/monitoring — into separate daemons: a
global ResourceManager and per-application ApplicationMasters. With YARN, the ResourceManager and per-host NodeManagers form the data-computation framework. The ResourceManager service effectively
replaces the functions of the JobTracker, and NodeManagers run on worker hosts instead of TaskTracker daemons. The per-application ApplicationMaster is, in effect, a framework-specific library and
negotiates resources from the ResourceManager and works with the NodeManagers to run and monitor the tasks. For details of this architecture, see Apache Hadoop NextGen MapReduce (YARN).

For information on configuring MapReduce and YARN resource management features, see Resource
Management.

Defaults and Recommendations

In a Cloudera Manager deployment of a CDH cluster, the YARN service is the default MapReduce computation framework. In CDH 5, the MapReduce 1 service is deprecated, but it remains fully supported for backward compatibility through the CDH 5 lifecycle.

For production use, Cloudera recommends running only one MapReduce framework at any given time. If development needs or another use case require switching between MapReduce and YARN, both services can be configured at the same time, but only one should run, so that the available hardware resources are fully utilized.

The Activity Monitor role collects information about activities run by the MapReduce service. If MapReduce is not being used and the reporting data is no longer required, the Activity Monitor role and database can be removed:

Do one of the following:

Select Clusters > Cloudera
Management Service.

On the Home > Status tab, in the Cloudera Management Service table, click the Cloudera Management Service link.

Once you have migrated to YARN and deleted the MapReduce service, you can remove local data from each TaskTracker host. The mapred.local.dir parameter specifies a directory on the local filesystem of each TaskTracker that holds temporary data for MapReduce. Once the service is stopped, you can remove this directory to free disk space on each host.
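As a sketch of that cleanup, assuming mapred.local.dir resolves to /mapred/local (an example value only; check the actual setting in the MapReduce service configuration or mapred-site.xml before deleting anything):

```shell
# Confirm the configured value first; /mapred/local below is only an
# assumed example path, not a guaranteed default.
grep -A 1 'mapred.local.dir' /etc/hadoop/conf/mapred-site.xml

# With the MapReduce service stopped, remove the temporary task data
# to reclaim disk space on this host.
rm -rf /mapred/local/*
```

Run this on each former TaskTracker host; the data is host-local, so freeing one host does not affect the others.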

Configuring Alternatives Priority for Services Dependent on MapReduce

The alternatives priority property determines which service (MapReduce or YARN) clients use to run MapReduce jobs. The service with the higher value of the property is used. By default, the alternatives priority is set to 91 for the MapReduce service and 92 for the YARN service, so clients use YARN.
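Under the hood this uses the standard Linux alternatives mechanism: each service's gateway registers a client-configuration alternative, and the candidate with the highest priority wins. One way to inspect the current state on a gateway host (the alternative name hadoop-conf is an assumption here and can vary by release):

```shell
# List all registered hadoop-conf candidates and their priorities;
# the entry with the highest priority is the one clients resolve to.
update-alternatives --display hadoop-conf
```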

To configure the alternatives priority:

Go to the MapReduce or YARN service.

Click the Configuration tab.

Select Scope > Gateway Default Group.

Select Category > All.

Type Alternatives in the Search box.

In the Alternatives Priority property, set the priority value.

Enter a Reason for change, and then click Save Changes to commit the changes.

Redeploy the client configuration.

Configuring MapReduce To Read/Write With Amazon Web Services

Follow these steps to configure MapReduce to read and write with AWS.

Set your hadoop.security.credential.provider.path to the path of the .jceks file in the job configuration so that the
MapReduce framework loads AWS credentials from the .jceks file in HDFS. The following example shows a Teragen MapReduce job that writes to an S3 bucket.

You can specify the variables <hdfs directory>, <file name>, <AWS access key id>, and <AWS secret
access key>. <hdfs directory> is the HDFS directory where you store the .jceks file. <file name> is the name of the .jceks file in HDFS.
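A sketch of such a job, using the placeholders above (the example jar path and the <bucket name> placeholder are assumptions; both vary by installation):

```shell
# Store the AWS credentials in a .jceks credential store in HDFS.
hadoop credential create fs.s3a.access.key \
    -provider jceks://hdfs/<hdfs directory>/<file name>.jceks \
    -value <AWS access key id>
hadoop credential create fs.s3a.secret.key \
    -provider jceks://hdfs/<hdfs directory>/<file name>.jceks \
    -value <AWS secret access key>

# Run a Teragen job, pointing the credential provider path at the
# store so the MapReduce framework can load the S3 credentials.
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    teragen \
    -Dhadoop.security.credential.provider.path=jceks://hdfs/<hdfs directory>/<file name>.jceks \
    1000 s3a://<bucket name>/teragen
```

Because the credentials live in the .jceks store rather than in job configuration or scripts, they are not exposed in plain text to other users of the cluster.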

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.