Lots of fetch failures

Details

Added two expert-level configuration properties:
1. "mapreduce.reduce.shuffle.notify.readerror" controls whether a notification is sent to the JobTracker after every read error. If the property is false, read errors are treated similar to connection errors.
2. "mapreduce.reduce.shuffle.maxfetchfailures" specifies the maximum number of fetch failures after which the failure is reported to the JobTracker.
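A sketch of how these two properties might be set in mapred-site.xml; the values and description text here are illustrative, not authoritative defaults:

```xml
<!-- Illustrative mapred-site.xml fragment; values are examples only. -->
<property>
  <name>mapreduce.reduce.shuffle.notify.readerror</name>
  <value>false</value>
  <description>Expert: Whether the JobTracker should be notified after
  every read error. If false, read errors are treated similar to
  connection errors.</description>
</property>
<property>
  <name>mapreduce.reduce.shuffle.maxfetchfailures</name>
  <value>10</value>
  <description>Expert: Maximum number of fetch failures for a map
  output before the failure is reported to the JobTracker.</description>
</property>
```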

Description

Since we upgraded from hadoop-0.18.3 to hadoop-0.20.1, we see many more map task failures because of 'Too many fetch-failures'.

One of our jobs makes hardly any progress because 3000 reduces are not able to get the map output of 2 trailing maps (with about 80GB of output each), which are repeatedly marked as failed because the reduces cannot fetch their map output.
One difference to hadoop-0.18.3 seems to be that reduce tasks report a failed map-output fetch even after a single try when it was a read error (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not think this is a good idea, as trailing map tasks will be hit by all reduces simultaneously.

Here is a log output of a reduce task:

2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200910281903_0028_r_000000_0 copy failed: attempt_200910281903_0028_m_002781_1 from some host
2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)
2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200910281903_0028_r_000000_0: Failed fetch #1 from attempt_200910281903_0028_m_002781_1
2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200910281903_0028_m_002781_1 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker.

I also saw a few log messages which look suspicious, as if successfully fetched map output is discarded because the map has been marked as failed (because of too many fetch failures). This would make the situation even worse.

Qi Liu added a comment - 30/Oct/09 04:24

This is caused by a behavioral change in hadoop 0.20.1. In 0.18.3, each map output fetch would be retried N times (6 by default). In 0.20.1, not every map output fetch has N retries: for a particular map output, only the first fetch attempt gets N retries. If those N retries fail, each subsequent fetch attempt, even against a different node, gets only 2 retries before failing. This greatly increases the chance of hitting "too many fetch failures".
The relevant code is in src/mapred/org/apache/hadoop/mapred/ReduceTask.java, lines 2090 to 2104.
I would argue that if a subsequent map output fetch attempt is against the same mapper node, 2 retries are enough. However, if the map output is fetched from a different mapper node (basically a different map attempt), it should still get N retries.
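The retry-budget change described above can be sketched as a small simulation. This is hypothetical Python, not Hadoop code; N=6 matches the 0.18.3 default mentioned in the comment:

```python
# Hypothetical simulation of the fetch-retry budget change described
# above; these functions are NOT Hadoop code.

def retries_allowed_0_18(attempt_index: int, n: int = 6) -> int:
    # 0.18.3: every fetch attempt for a map output gets the full budget.
    return n

def retries_allowed_0_20(attempt_index: int, n: int = 6) -> int:
    # 0.20.1 (as described): only the first attempt gets n retries;
    # later attempts, even against a different node, get just 2.
    return n if attempt_index == 0 else 2

# Retry budget across 4 successive fetch attempts for one map output:
budget_0_18 = sum(retries_allowed_0_18(i) for i in range(4))  # 4 * 6 = 24
budget_0_20 = sum(retries_allowed_0_20(i) for i in range(4))  # 6 + 2 + 2 + 2 = 12
```

Under this model, the total retry budget per map output shrinks sharply after the first failed attempt, which is why "too many fetch failures" becomes much more likely in 0.20.1.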

Christian Kunz added a comment - 01/Nov/09 03:31

The algorithm in hadoop-0.20 that determines when to mark a map task as having too many fetch failures seems to have big problems with jobs that have a few outliers in map task runtimes.
The job mentioned in the description has a couple of map tasks requiring about 3 hours of runtime (1hr application execution time and 2hrs merge phase), with the overall average map runtime much smaller.
The consequence is that all reduces just wait for these few map tasks to finish, and these maps get declared 'failed' very quickly because of many socket timeout exceptions on the reduce side, although there is nothing wrong with the map tasks themselves.
The first attempt to run the job failed after 28hrs (we allow up to 10 map task failures), with only about 200 reduces successfully fetching map output each time the long-running map tasks completed; these map tasks got declared failed within less than 1 minute after completion.
We deployed an emergency fix in ReduceTask, commenting out the portion where a single read error results in a report:

if (//cr.getError().equals(CopyOutputErrorType.READ_ERROR) ||
    ((noFailedFetches >= fetchRetriesPerMap)
     && ((noFailedFetches - fetchRetriesPerMap) % 2) == 0)) {
  synchronized (ReduceTask.this) {
    taskStatus.addFetchFailedMap(mapTaskId);

and it helped somewhat.
The job finished after 6 re-executions of the long-running map tasks (fortunately, we allow a high number of map task failures), but still took 18hrs instead of an optimal less than 4hrs. About 600 reduces got map output successfully before the map tasks got declared failed, always because of 5+ socket timeout exceptions (the hadoop-0.20 advantage of retrying a fetch within a few seconds was counter-productive here).
With hadoop-0.18.3 this job needed less than 9hrs (not optimal, but we could live with it).
Take-away: the current algorithm is not general enough to cover all corner cases, in particular jobs with a handful of much longer-running map tasks. It should be improved (there should be an allowance for socket timeout exceptions when a high number of reduces fetch output from a single map task in parallel), or there should be a configuration option to override the threshold on a per-job basis.
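The condition in the snippet above can be rendered in Python to show the effect of the commented-out clause. This is a hypothetical sketch, not the actual Java from ReduceTask; the `read_error_reports` flag models whether the READ_ERROR clause is active:

```python
# Hypothetical Python rendering of the 0.20.1 reporting condition quoted
# above (not the actual Hadoop code). With read_error_reports=True (the
# stock behavior), a single read error reports the map to the
# JobTracker; Christian's emergency fix corresponds to False.

def should_report(is_read_error: bool,
                  no_failed_fetches: int,
                  fetch_retries_per_map: int,
                  read_error_reports: bool = True) -> bool:
    if read_error_reports and is_read_error:
        return True
    # Otherwise, report once the failure count reaches the per-map retry
    # limit, and then on every second failure after that.
    return (no_failed_fetches >= fetch_retries_per_map
            and (no_failed_fetches - fetch_retries_per_map) % 2 == 0)
```

Under this model, one read error is enough to report a map in stock 0.20.1, while with the clause disabled the same read error waits for the normal failure-count threshold, which matches the behavior Christian observed.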

Amareshwari Sriramadasu added a comment - 02/Nov/09 05:37

Christian, are you using the Yahoo! distribution of 0.20?
In branch 0.21, MAPREDUCE-353 makes connect and read timeouts configurable for a job. Moreover, Shuffle is simplified by MAPREDUCE-318. Essentially, HADOOP-3327 is no longer there.
Christian, making connect and read timeouts configurable should address this issue, right?

Amareshwari Sriramadasu added a comment - 08/Dec/09 08:33

> In branch 0.21, MAPREDUCE-353 makes connect and read timeout configurable for a job. Moreover, Shuffle is simplified by MAPREDUCE-318. Essentially, HADOOP-3327 is no more there.

Sorry, just verified the new shuffle code; the code for HADOOP-3327 still exists.

Amareshwari Sriramadasu added a comment - 08/Dec/09 11:17

One solution I could see is: "Provide a config option for whether to send a notification for a read error or not, and make the number of retries on a fetch failure (currently hardcoded to 10 on 0.21) configurable."
However, for the Yahoo! distribution, the config option for whether to send a notification for a read error or not would suffice.

Jothi Padmanabhan added a comment - 08/Dec/09 11:20

> Provide a config option whether to send a notification for read error or not

Just to clarify this a little more: currently a notification is sent for every read error. The config option would be to disable this behavior and treat read errors similar to connection errors.

Amareshwari Sriramadasu added a comment - 09/Dec/09 10:13

Patch adding the suggested config variables.
Tested the patch manually by simulating a read timeout and verified that with "mapreduce.reduce.shuffle.notify.readerror" set to true, the fetch failure is notified after "mapreduce.reduce.shuffle.maxfetchfailures" failures; with "mapreduce.reduce.shuffle.notify.readerror" set to false, it is notified immediately after the failure.

Amareshwari Sriramadasu added a comment - 09/Dec/09 10:30

> verified with configuration "mapreduce.reduce.shuffle.notify.readerror" set to true, the fetch failure is notified after "mapreduce.reduce.shuffle.maxfetchfailures". With "mapreduce.reduce.shuffle.notify.readerror" set to false, it is notified immediately after the failure.

Sorry, I interchanged true and false in the above statement. Verified that with "mapreduce.reduce.shuffle.notify.readerror" set to false, the fetch failure is notified after "mapreduce.reduce.shuffle.maxfetchfailures" failures; with "mapreduce.reduce.shuffle.notify.readerror" set to true, it is notified immediately after the failure.
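The corrected semantics can be sketched as a small decision function. This is a hypothetical illustration, not the actual Hadoop code; the names are illustrative:

```python
# Hypothetical sketch of the patched decision, following the corrected
# description above (not the actual Hadoop code; names are illustrative).

def report_fetch_failure(notify_readerror: bool,
                         failures_for_map: int,
                         max_fetch_failures: int) -> bool:
    # notify.readerror = true: every read error is reported to the
    # JobTracker immediately.
    if notify_readerror:
        return True
    # notify.readerror = false: read errors are treated like connection
    # errors and reported only once the per-map failure count reaches
    # mapreduce.reduce.shuffle.maxfetchfailures.
    return failures_for_map >= max_fetch_failures
```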

Jothi Padmanabhan added a comment - 09/Dec/09 10:44

Patch looks fine to me; a couple of minor nits:
1. Can we rename maxFetchFailuresBeforeReport to maxFetchFailuresBeforeReporting?
2. I think the documentation in mapred-default for mapreduce.reduce.shuffle.notify.readerror can be changed to something like: "Expert: Flag to decide whether the JobTracker should be notified on every read error or not. If the flag is false, read errors are treated similar to connection errors."

Hadoop QA added a comment - 09/Dec/09 16:05

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427455/patch-1171.txt
against trunk revision 888761.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/308/console
This message is automatically generated.

Amareshwari Sriramadasu added a comment - 10/Dec/09 04:17

Patch for the Yahoo! distribution.
test-patch result:
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
Manually verified the configuration is honored by simulating read time out for a map on a tasktracker.

Hadoop QA added a comment - 10/Dec/09 05:12

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427567/patch-1171-ydist.txt
against trunk revision 889085.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/181/console
This message is automatically generated.

Hadoop QA added a comment - 10/Dec/09 08:46

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12427572/patch-1171-2.txt
against trunk revision 889085.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/314/console
This message is automatically generated.

Amareshwari Sriramadasu added a comment - 10/Dec/09 09:00

The test failure, TestTrackerBlacklistAcrossJobs, is unrelated to the patch. The log has the following ZipException:
2009-12-10 07:51:17,307 WARN mapred.TaskTracker (TaskTracker.java:startNewTask(1887)) - Error initializing attempt_20091210075025802_0001_m_000001_0:
java.lang.RuntimeException: java.util.zip.ZipException: ZIP_Read: error reading zip file
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1600)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1408)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1352)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:574)
at org.apache.hadoop.mapred.JobConf.checkAndWarnDeprecation(JobConf.java:1874)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:392)
at org.apache.hadoop.mapred.TaskTracker.localizeJobFiles(TaskTracker.java:925)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:869)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1883)
at org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:109)
at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1848)
The same test passes on my machine.