Running on a Cluster - Hadoop

Now that we are happy with the program running on a small test dataset, we are ready to try it on the full dataset on a Hadoop cluster. The chapter “Setting Up a Hadoop Cluster” covers how to set up a fully distributed cluster, although you can also work through this section on a pseudodistributed cluster.

Packaging

We don’t need to make any modifications to the program to run on a cluster rather than on a single machine, but we do need to package the program as a JAR file to send to the cluster. This is conveniently achieved using Ant, using a task such as this (you can find the complete build file in the example code):

If you have a single job per JAR, then you can specify the main class to run in the JAR file’s manifest. If the main class is not in the manifest, then it must be specified on the command line (as you will see shortly). Also, any dependent JAR files should be packaged in a lib subdirectory in the JAR file. (This is analogous to a Java Web application archive, or WAR file, except in that case the JAR files go in a WEB-INF/lib subdirectory in the WAR file.)

Launching a Job

To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option (we could equally have used the -fs and -jt options):

The runJob() method on JobClient launches the job and polls for progress, writing a line summarizing the map and reduce’s progress whenever either changes. Here’s the output (some lines have been removed for clarity):

The output includes more useful information. Before the job starts, its ID is printed: this is needed whenever you want to refer to the job, in logfiles for example, or when interrogating it via the hadoop job command. When the job is complete, its statistics (known as counters) are printed out. These are very useful for confirming that the job did what you expected. For example, for this job we can see that around 275 GB of input data was analyzed (“Map input bytes”), read from around 34 GB of compressed files on HDFS (“HDFS_BYTES_READ”). The input was broken into 101 gzipped files of reasonable size, so there was no problem with not being able to split them.

Job, Task, and Task Attempt IDs:

The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. So the job with this ID:

job_200904110811_0002

is the second (0002, job IDs are 1-based) job run by the jobtracker which started at 08:11 on April 11, 2009. The counter is formatted with leading zeros to make job IDs sort nicely in directory listings, for example.

However, when the counter reaches 10000 it is not reset, resulting in longer job IDs (which don’t sort so well).

Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. For example:

task_200904110811_0002_m_000003

is the fourth (000003, task IDs are 0-based) map (m) task of the job with ID job_200904110811_0002. The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order that the tasks will be executed in.

Tasks may be executed more than once, due to failure (see “Task Failure”) or speculative execution (see “Speculative Execution”), so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. For example:

attempt_200904110811_0002_m_000003_0

is the first (0, attempt IDs are 0-based) attempt at running task task_200904110811_0002_m_000003. Task attempts are allocated during the job run as needed, so their ordering represents the order that they were created for tasktrackers to run.

The final count in the task attempt ID is incremented by 1,000 if the job is restarted after the jobtracker is restarted and recovers its running jobs.
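The ID layout described in this sidebar can be explored with a small parser. The following is a minimal sketch in plain Java, not a Hadoop class: the name AttemptIdParts, its fields, and the regular expression are assumptions made for illustration, and the r type code for reduce tasks is assumed by analogy with the m shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative parser for the ID layout described above; not a Hadoop class. */
final class AttemptIdParts {
    // attempt_<jobtracker start>_<job counter>_<m or r>_<task number>_<attempt number>
    // Assumption: "r" marks a reduce task, by analogy with "m" for a map task.
    private static final Pattern ATTEMPT_ID =
            Pattern.compile("attempt_(\\d+)_(\\d{4,})_([mr])_(\\d+)_(\\d+)");

    final String jobId;
    final String taskId;
    final boolean mapTask;
    final int taskNumber;     // 0-based position of the task within the job
    final int attemptNumber;  // 0-based attempt count for the task

    private AttemptIdParts(String jobId, String taskId, boolean mapTask,
                           int taskNumber, int attemptNumber) {
        this.jobId = jobId;
        this.taskId = taskId;
        this.mapTask = mapTask;
        this.taskNumber = taskNumber;
        this.attemptNumber = attemptNumber;
    }

    /** Splits a task attempt ID into the job ID, task ID, and counters it embeds. */
    static AttemptIdParts parse(String attemptId) {
        Matcher m = ATTEMPT_ID.matcher(attemptId);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a task attempt ID: " + attemptId);
        }
        String jobSuffix = m.group(1) + "_" + m.group(2);
        String jobId = "job_" + jobSuffix;
        String taskId = "task_" + jobSuffix + "_" + m.group(3) + "_" + m.group(4);
        return new AttemptIdParts(jobId, taskId, m.group(3).equals("m"),
                Integer.parseInt(m.group(4)), Integer.parseInt(m.group(5)));
    }

    /** Zero-pads the job counter to four digits; larger counters simply grow longer. */
    static String formatJobId(String jobtrackerStart, int counter) {
        return String.format("job_%s_%04d", jobtrackerStart, counter);
    }

    public static void main(String[] args) {
        AttemptIdParts parts = parse("attempt_200904110811_0002_m_000003_0");
        System.out.println(parts.jobId + " / " + parts.taskId
                + " / attempt " + parts.attemptNumber);
        System.out.println(formatJobId("200904110811", 10000));
    }
}
```

Calling formatJobId with a counter of 10000 shows why IDs stop sorting cleanly at that point: the padding width is a minimum, so the fifth digit lengthens the ID rather than wrapping around.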

The MapReduce Web UI

Hadoop comes with a web UI for viewing information about your jobs. It is useful for following a job’s progress while it is running, as well as finding job statistics and logs after the job has completed.

The jobtracker page

A screenshot of the home page is shown in Figure . The first section of the page gives details of the Hadoop installation, such as the version number and when it was compiled, and the current state of the jobtracker (in this case, running), and when it was started.

Next is a summary of the cluster, which has measures of cluster capacity and utilization. This shows the number of maps and reduces currently running on the cluster, the total number of job submissions, the number of tasktracker nodes currently available, and the cluster’s capacity, in terms of the number of map and reduce slots available across the cluster (“Map Task Capacity” and “Reduce Task Capacity”) and the number of available slots per node, on average. The number of tasktrackers that have been blacklisted by the jobtracker is listed as well (blacklisting is discussed in “Tasktracker Failure”).

Below the summary, there is a section about the job scheduler that is running (here the default). You can click through to see job queues.

Further down, we see sections for running, (successfully) completed, and failed jobs. Each of these sections has a table of jobs, with a row per job that shows the job’s ID, owner, name (as set using JobConf’s setJobName() method, which sets the mapred.job.name property) and progress information.

Finally, at the foot of the page, there are links to the jobtracker’s logs and the jobtracker’s history: information on all the jobs that the jobtracker has run. The main page displays only 100 jobs (configurable via the mapred.jobtracker.completeuserjobs.maximum property) before consigning them to the history page. Note also that the job history is persistent, so you can find jobs here from previous runs of the jobtracker.

Job History

Job history refers to the events and configuration for a completed job. It is retained whether the job was successful or not. Job history is used to support job recovery after a jobtracker restart (see the mapred.jobtracker.restart.recover property), as well as providing interesting information for the user running a job.

Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop filesystem via the hadoop.job.history.location property. The jobtracker’s history files are kept for 30 days before being deleted by the system.

A second copy is also stored for the user in the _logs/history subdirectory of the job’s output directory. This location may be overridden by setting hadoop.job.history.user.location. By setting it to the special value none, no user job history is saved, although job history is still saved centrally. A user’s job history files are never deleted by the system.

The history log includes job, task, and attempt events, all of which are stored in a plaintext file. The history for a particular job may be viewed through the web UI, or via the command line, using hadoop job -history (which you point at the job’s output directory).

The job page

Clicking on a job ID brings you to a page for the job, illustrated below. At the top of the page is a summary of the job, with basic information such as job owner and name, and how long the job has been running for. The job file is the consolidated configuration file for the job, containing all the properties and their values that were in effect during the job run. If you are unsure of what a particular property was set to, you can click through to inspect the file.

While the job is running, you can monitor its progress on this page, which periodically updates itself. Below the summary is a table that shows the map progress and the reduce progress. “Num Tasks” shows the total number of map and reduce tasks for this job (a row for each). The other columns then show the state of these tasks: “Pending” (waiting to run), “Running,” “Complete” (successfully run), and “Killed” (tasks that have failed; this column would be more accurately labeled “Failed”). The final column shows the total number of failed and killed task attempts for all the map or reduce tasks for the job (task attempts may be marked as killed if they are a speculative execution duplicate, if the tasktracker they are running on dies, or if they are killed by a user). See “Task Failure” for background on task failure.

Further down the page, you can find completion graphs for each task that show their progress graphically. The reduce completion graph is divided into the three phases of the reduce task: copy (when the map outputs are being transferred to the reduce’s tasktracker), sort (when the reduce inputs are being merged), and reduce (when the reduce function is being run to produce the final output). The phases are described in more detail in “Shuffle and Sort”.

In the middle of the page is a table of job counters. These are dynamically updated during the job run, and provide another useful window into the job’s progress and general health. There is more information about what these counters mean in “Built-in Counters”.

Retrieving the Results

Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-00000 to part-00029 in the max-temp directory. As their names suggest, a good way to think of these “part” files is as parts of the max-temp “file.”
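To see how a key ends up in a particular part file, it can help to sketch the assignment, assuming the job uses the hash partitioning mentioned later in this section. The class PartFileNames and its methods are hypothetical, and the sketch hashes a Java String, whereas Hadoop’s own key types compute their hash codes differently, so the partition numbers printed here will not match those of a real job.

```java
/** Illustrative mapping from a key to a part file name; not Hadoop code. */
final class PartFileNames {
    /**
     * Chooses a partition from the key's hash code. Masking with
     * Integer.MAX_VALUE clears the sign bit so the result is never negative.
     */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Formats the part file name with a five-digit, zero-padded partition number. */
    static String partFileFor(String key, int numPartitions) {
        return String.format("part-%05d", partitionFor(key, numPartitions));
    }

    public static void main(String[] args) {
        // 30 partitions, matching the 30 reducers used for this job.
        for (String year : new String[] {"1949", "1950", "1951"}) {
            System.out.println(year + " -> " + partFileFor(year, 30));
        }
    }
}
```

Because the assignment depends only on the hash of the key, neighboring years need not land in neighboring part files, and no ordering holds across the files.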

If the output is large (which it isn’t in this case), then it is important to have multiple parts so that more than one reducer can work in parallel. Usually, if a file is in this partitioned form, it can still be used easily enough: as the input to another MapReduce job, for example. In some cases, you can exploit the structure of multiple partitions to do a map-side join (see “Map-Side Joins”) or a MapFile lookup (see “An application: Partitioned MapFile lookups”).

This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here, as it gets all the files in the directory specified in the source pattern and merges them into a single file on the local filesystem:

We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function). Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools, such as R, a spreadsheet, or even a relational database.
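Once the part files are on the development machine, the merge-and-sort step can also be done in a few lines of Java. This is a rough local equivalent only, under the assumption that the files have already been copied out of HDFS into an ordinary directory; the class PartFileMerger is hypothetical and does not talk to HDFS.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative local merge of reducer output files; works on the local filesystem only. */
final class PartFileMerger {
    /** Reads every part-* file in the directory and returns all lines in sorted order. */
    static List<String> mergeSorted(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        // The glob skips anything that is not a part file, such as log directories.
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir, "part-*")) {
            for (Path part : parts) {
                lines.addAll(Files.readAllLines(part));
            }
        }
        // A plain string sort orders the records by their leading key.
        Collections.sort(lines);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument names a local copy of the output directory.
        for (String line : mergeSorted(Paths.get(args[0]))) {
            System.out.println(line);
        }
    }
}
```

A string sort is enough here because every line starts with a four-digit year; keys of varying width would need a comparator that parses them first.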

Another way of retrieving the output if it is small is to use the -cat option to print the output files to the console:

On closer inspection, we see that some of the results don’t look plausible. For instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we find out what’s causing this? Is it corrupt input data or a bug in the program?

Debugging a Job

The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop. However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with a message to update the task’s status message to prompt us to look in the error log. The web UI makes this easy, as we will see.

We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition: if it turns out to be a common occurrence, then we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the record. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what’s happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in “Counters”.)

If the amount of log data you produce in the course of debugging is large, then you’ve got a couple of options. The first is to write the information to the map’s output, rather than to standard error, for analysis and aggregation by the reduce. This approach usually necessitates structural changes to your program, so start with the other techniques first. Alternatively, you can write a program (in MapReduce of course) to analyze the logs produced by your job.

We add our debugging to the mapper (version 4), as opposed to the reducer, as we want to find out what the source data causing the anomalous output looks like:

If the temperature is over 100°C (represented by 1000, since temperatures are in tenths of a degree), we print a line to standard error with the suspect line, as well as updating the map’s status message using the setStatus() method on Reporter directing us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program, we have defined a single field OVER_100 as a way to count the number of records with a temperature of over 100°C.

With this modification, we recompile the code, re-create the JAR file, then rerun the job, and while it’s running go to the tasks page.
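The three debugging actions described above (a line on standard error, a status message, and a counter increment) can be tried out without a cluster. The sketch below is not the mapper itself: StatusSink is a hypothetical stand-in for Hadoop’s Reporter, reduced to the two operations the text mentions, and the message strings and class names are illustrative assumptions.

```java
import java.util.EnumMap;
import java.util.Map;

/** Stand-in for the mapper's debugging logic; runs without Hadoop on the classpath. */
final class ImplausibleTemperatureCheck {
    /** The counter is a field of an enum type, as described in the text. */
    enum Temperature { OVER_100 }

    /** Hypothetical substitute for Reporter, limited to status and counters. */
    interface StatusSink {
        void setStatus(String status);
        void incrCounter(Temperature counter, long amount);
    }

    /** Keeps the status and counters in memory so they can be inspected locally. */
    static final class InMemorySink implements StatusSink {
        final Map<Temperature, Long> counters = new EnumMap<>(Temperature.class);
        String status = "";

        @Override
        public void setStatus(String status) {
            this.status = status;
        }

        @Override
        public void incrCounter(Temperature counter, long amount) {
            counters.merge(counter, amount, Long::sum);
        }
    }

    // 100°C expressed in tenths of a degree.
    private static final int MAX_PLAUSIBLE_TENTHS = 1000;

    /** Flags a reading over 100°C and returns true if it was flagged. */
    static boolean flagIfImplausible(String line, int airTemperatureTenths, StatusSink sink) {
        if (airTemperatureTenths <= MAX_PLAUSIBLE_TENTHS) {
            return false;
        }
        System.err.println("Temperature over 100 degrees for input: " + line);
        sink.setStatus("Detected possibly corrupt record: see logs.");
        sink.incrCounter(Temperature.OVER_100, 1);
        return true;
    }

    public static void main(String[] args) {
        InMemorySink sink = new InMemorySink();
        flagIfImplausible("<raw input record>", 1234, sink);
        System.out.println(sink.status + " " + sink.counters);
    }
}
```

The method only flags the record and leaves the decision about whether to emit it to the caller, which keeps the debugging concern separate from the question of how malformed data should eventually be handled.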

The tasks page

The job page has a number of links for looking at the tasks in a job in more detail. For example, by clicking on the “map” link, you are brought to a page that lists information for all of the map tasks on one page. You can also see just the completed tasks. The screenshot in Figure shows a portion of this page for the job run with our debugging statements. Each row in the table is a task, and it provides such information as the start and end times for each task, any errors reported back from the tasktracker, and a link to view the counters for an individual task.

The status column can be helpful for debugging, as it shows a task’s latest status message. Before a task starts, it shows its status as “initializing,” then once it starts reading records it shows the split information for the split it is reading as a filename with a byte offset and length. You can see the status we set for debugging for task task_200904110811_0003_m_000044, so let’s click through to the logs page to find the associated debug message. (Notice, too, that there is an extra counter for this task, since our user counter has a nonzero count for this task.)

The task details page

From the tasks page, you can click on any task to get more information about it. The task details page, shown in Figure , shows each task attempt. In this case, there was one task attempt, which completed successfully. The table provides further useful data, such as the node the task attempt ran on, and links to task logfiles and counters.

The “Actions” column contains links for killing a task attempt. By default, this is disabled, making the web UI a read-only interface. Set webinterface.private.actions to true to enable the actions links.

By setting webinterface.private.actions to true, you also allow anyone with access to the HDFS web interface to delete files. The dfs.web.ugi property determines the user that the HDFS web UI runs as, thus controlling which files may be viewed and deleted.

For map tasks, there is also a section showing which nodes the input split was located on.

By following one of the links to the logfiles for the successful task attempt (you can see the last 4 KB or 8 KB of each logfile, or the entire file), we can find the suspect input record that we logged (the line is wrapped and truncated to fit on the page):

This record seems to be in a different format to the others. For one thing, there are spaces in the line, which are not described in the specification. When the job has finished, we can look at the value of the counter we defined to see how many records over 100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:

The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and the counter name (the enum name). There are only three malformed records in the entire dataset of over a billion records. Throwing out bad records is standard for many big data problems, although we need to be careful in this case, since we are looking for an extreme value (the maximum temperature) rather than an aggregate measure. Still, throwing away three records is probably not going to change the result.
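The group name and counter name passed to the -counter option can both be derived from the enum constant itself. The following sketch shows one way to print them, assuming, as the text states, that the group is the fully qualified name of the enum’s class; CounterNames and its nested Temperature enum are hypothetical names chosen for illustration.

```java
/** Illustrative helper that derives a counter's group and name from an enum constant. */
final class CounterNames {
    /** Hypothetical counter enum, nested inside another class as a mapper's might be. */
    enum Temperature { OVER_100 }

    /** The group is the fully qualified (binary) name of the class declaring the enum. */
    static String groupName(Enum<?> counter) {
        return counter.getDeclaringClass().getName();
    }

    /** The counter name is simply the name of the enum constant. */
    static String counterName(Enum<?> counter) {
        return counter.name();
    }

    public static void main(String[] args) {
        System.out.println(groupName(Temperature.OVER_100)
                + " " + counterName(Temperature.OVER_100));
    }
}
```

For a nested enum, the binary class name contains a dollar sign between the outer and inner names, so the group name generally needs quoting when it is typed on a shell command line.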

Hadoop User Logs

Hadoop produces logs in various places, for various audiences. These are summarized below.

As you have seen in this section, MapReduce task logs are accessible through the web UI, which is the most convenient way to view them. You can also find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory named by the task attempt. If task JVM reuse is enabled (see “Task JVM Reuse”), then each logfile accumulates the logs for the entire JVM run, so multiple task attempts will be found in each logfile. The web UI hides this by showing only the portion that is relevant for the task attempt being viewed.

It is straightforward to write to these logfiles. Anything written to standard output, or standard error, is directed to the relevant logfile. (Of course, in Streaming, standard output is used for the map or reduce output, so it will not show up in the standard output log.)

In Java, you can write to the task’s syslog file if you wish by using the Apache Commons Logging API. The actual logging is done by log4j in this case: the relevant log4j appender is called TLA (Task Log Appender) in the log4j.properties file in Hadoop’s configuration directory.

There are some controls for managing retention and size of task logs. By default, logs are deleted after a minimum of 24 hours (set using the mapred.userlog.retain.hours property). You can also set a cap on the maximum size of each logfile using the mapred.userlog.limit.kb property, which is 0 by default, meaning there is no cap.

Handling malformed data

Capturing input data that causes a problem is valuable, as we can use it in a test to check that the mapper does the right thing:

The record that was causing the problem is of a different format to the other lines we’ve