Deploying MapReduce v1 (MRv1) on a Cluster

If you use Cloudera Manager, do not use these command-line instructions.

Do not run MRv1 and YARN on the same set of nodes at the same time. This will degrade performance and may result in an unstable cluster deployment. To deploy YARN instead, see
Deploying MapReduce v2 (YARN) on a Cluster. If you have installed CDH 5 from tarballs, the default deployment is YARN.

Specify the hostname and (optionally) port of the JobTracker's RPC server, in the form <host>:<port>. The default value is local. With the
default value, the JobTracker runs on demand when you run a MapReduce job; do not try to start the JobTracker yourself in this case. If you specify a host other than local, use the hostname (for
example, mynamenode), not the IP address.
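For example, a mapred-site.xml entry might look like the following. This is a sketch: mynamenode and port 8021 are placeholder values, and the property name assumed here is the classic MRv1 mapred.job.tracker.

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>mynamenode:8021</value>
</property>
```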

Edit the mapred.local.dir property to specify the directories where the TaskTracker will store temporary data and intermediate map output files while
running MapReduce jobs. Cloudera recommends that you specify a directory on each of the JBOD mount points: /data/1/mapred/local through /data/N/mapred/local. For example:
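A minimal mapred-site.xml entry, assuming four JBOD mount points (/data/1 through /data/4; adjust the list to match your actual layout), might look like:

```xml
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
</property>
```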


Set the mapreduce.jobtracker.restart.recover property to true. This ensures that running jobs that fail because of a
system crash or hardware failure are re-run when the JobTracker restarts. A recovered job has the following properties:

It will have the same job ID as when it was submitted.

It will run under the same user as the original job.

It will write to the same output directory as the original job, overwriting any previous output.

It will show as RUNNING on the JobTracker web page after you restart the JobTracker.
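In mapred-site.xml, this setting might look like the following sketch (the property name is taken from the text above):

```xml
<property>
  <name>mapreduce.jobtracker.restart.recover</name>
  <value>true</value>
</property>
```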

Repeat for each TaskTracker.

Configure a health check script for DataNode processes.

Because a TaskTracker that has few functioning local directories will not perform well, Cloudera recommends configuring a health script that checks if the DataNode process is running (if
configured as described under Configuring DataNodes to Tolerate Local Storage Directory Failure, the DataNode will shut down
after the configured number of directory failures). The following is an example health script that exits if the DataNode process is not running:

Create and configure the mapred.system.dir directory in HDFS. Create the HDFS directory specified by the mapred.system.dir
parameter (by default, ${hadoop.tmp.dir}/mapred/system) and configure it to be owned by the mapred user.
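For example, assuming mapred.system.dir is set to /mapred/system (a placeholder value; substitute your own setting), the directory could be created and its ownership set as the HDFS superuser:

```shell
# Run on a node with an HDFS client configured; "hdfs" is the HDFS superuser.
sudo -u hdfs hadoop fs -mkdir -p /mapred/system
sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
```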

If this documentation includes code, including but not limited to, code examples, Cloudera makes this available to you under the terms of the Apache License, Version 2.0, including any required
notices. A copy of the Apache License Version 2.0 can be found here.