Hadoop provides an optional mode of execution in which bad records
are detected and skipped on subsequent attempts.

This feature can be used when map/reduce tasks crash deterministically on
certain input. This happens due to bugs in the map/reduce function. The usual
course would be to fix these bugs. But sometimes this is not possible;
perhaps the bug is in third-party libraries for which the source code is
not available. In that case the task never completes, even after
multiple attempts, and all data for that task is lost.

In skipping mode, the map/reduce task keeps track at all times of the
record range that is being processed. Before giving the input to the
map/reduce function, it sends this record range to the TaskTracker.
If the task crashes, the TaskTracker knows which range was reported
last. On further attempts that range gets skipped.
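The report-then-process cycle described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not Hadoop's implementation: the `Tracker` class, the `runAttempt` and `runWithRetries` helpers, and the buggy `map` function are all hypothetical names, each reported range holds a single record, and skipping is active from the first attempt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of skip mode; none of these types are Hadoop classes. */
class SkipModeSketch {

    /** Stands in for the TaskTracker: remembers the last reported range. */
    static final class Tracker {
        private long lastStart = -1;
        private long lastEnd = -1;                      // half-open range [lastStart, lastEnd)
        private final TreeSet<Long> skipped = new TreeSet<>();

        void report(long start, long end) {
            lastStart = start;
            lastEnd = end;
        }

        /** On a crash, the last reported range is marked to be skipped. */
        void onTaskCrash() {
            for (long i = lastStart; i < lastEnd; i++) {
                skipped.add(i);
            }
        }

        boolean shouldSkip(long index) {
            return skipped.contains(index);
        }
    }

    /** Hypothetical buggy map function: fails deterministically on "bad". */
    static String map(String record) {
        if (record.equals("bad")) {
            throw new IllegalStateException("cannot process " + record);
        }
        return record.toUpperCase();
    }

    /** Runs one attempt; returns true if every non-skipped record was processed. */
    static boolean runAttempt(List<String> records, Tracker tracker, List<String> out) {
        out.clear();                                    // a new attempt starts from scratch
        for (int i = 0; i < records.size(); i++) {
            if (tracker.shouldSkip(i)) {
                continue;
            }
            tracker.report(i, i + 1);                   // report BEFORE calling user code
            try {
                out.add(map(records.get(i)));
            } catch (RuntimeException e) {
                tracker.onTaskCrash();                  // models the task dying here
                return false;
            }
        }
        return true;
    }

    /** Retries until an attempt completes or maxAttempts is exhausted. */
    static List<String> runWithRetries(List<String> records, int maxAttempts) {
        Tracker tracker = new Tracker();
        List<String> out = new ArrayList<>();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (runAttempt(records, tracker, out)) {
                return out;
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        System.out.println(runWithRetries(Arrays.asList("a", "bad", "b"), 4)); // prints [A, B]
    }
}
```

Because the range is reported before the user function runs, the tracker can still identify the failing record even though the task itself never gets the chance to say what went wrong.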

Constructor Detail

SkipBadRecords

Method Detail

getAttemptsToStartSkipping

Get the number of task attempts AFTER which skip mode
will be kicked off. When skip mode is kicked off, the
task reports the range of records which it will process
next to the TaskTracker, so that on failure the TaskTracker knows
which records are possibly the bad ones. On further executions,
those records are skipped.
Default value is 2.

Parameters:

conf - the configuration

Returns:

attemptsToStartSkipping, the number of task attempts

setAttemptsToStartSkipping

Set the number of task attempts AFTER which skip mode
will be kicked off. When skip mode is kicked off, the
task reports the range of records which it will process
next to the TaskTracker, so that on failure the TaskTracker knows
which records are possibly the bad ones. On further executions,
those records are skipped.
Default value is 2.

Parameters:

conf - the configuration

attemptsToStartSkipping - the number of task attempts

getAutoIncrMapperProcCount

Get the flag which, if set to true, causes
COUNTER_MAP_PROCESSED_RECORDS to be incremented
by MapRunner after invoking the map function. This value must be set to
false for applications which process the records asynchronously
or buffer the input records, for example streaming.
In such cases applications should increment this counter on their own.
Default value is true.

setAutoIncrMapperProcCount

Set the flag which, if set to true, causes
COUNTER_MAP_PROCESSED_RECORDS to be incremented
by MapRunner after invoking the map function. This value must be set to
false for applications which process the records asynchronously
or buffer the input records, for example streaming.
In such cases applications should increment this counter on their own.
Default value is true.
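To illustrate why a buffering application should increment the counter on its own, here is a minimal, self-contained sketch. It does not use Hadoop classes: `BufferedMapperSketch`, its `processedRecords` field (standing in for COUNTER_MAP_PROCESSED_RECORDS), and the batch size are hypothetical. The point it shows is that the call to `map` returns before the record has actually been processed, so counting per call would overstate progress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of a mapper that buffers its input; not a Hadoop class. */
class BufferedMapperSketch {

    /** Stands in for COUNTER_MAP_PROCESSED_RECORDS (assumption for this sketch). */
    final AtomicLong processedRecords = new AtomicLong();
    final List<String> output = new ArrayList<>();

    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;

    BufferedMapperSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Called once per input record; it only buffers, so nothing is processed yet. */
    void map(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Processes the buffered records and only then counts them as processed. */
    void flush() {
        for (String record : buffer) {
            output.add(record.trim());
        }
        processedRecords.addAndGet(buffer.size());
        buffer.clear();
    }

    public static void main(String[] args) {
        BufferedMapperSketch mapper = new BufferedMapperSketch(3);
        for (String r : new String[] {"a", "b", "c", "d"}) {
            mapper.map(r);
            System.out.println("after map(" + r + "): processed=" + mapper.processedRecords.get());
        }
        mapper.flush();
        System.out.println("after flush: processed=" + mapper.processedRecords.get());
    }
}
```

With a batch size of 3, the counter stays at 0 after the first two calls and jumps to 3 on the third, which is the behavior an automatic per-call increment could not reproduce.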

getAutoIncrReducerProcCount

Get the flag which, if set to true, causes
COUNTER_REDUCE_PROCESSED_GROUPS to be incremented
by the framework after invoking the reduce function. This value must be set to
false for applications which process the records asynchronously
or buffer the input records, for example streaming.
In such cases applications should increment this counter on their own.
Default value is true.

setAutoIncrReducerProcCount

Set the flag which, if set to true, causes
COUNTER_REDUCE_PROCESSED_GROUPS to be incremented
by the framework after invoking the reduce function. This value must be set to
false for applications which process the records asynchronously
or buffer the input records, for example streaming.
In such cases applications should increment this counter on their own.
Default value is true.

getMapperMaxSkipRecords

Get the number of acceptable skip records surrounding the bad record PER
bad record in the mapper. The number includes the bad record as well.
To turn the feature of detection/skipping of bad records off, set the
value to 0.
The framework tries to narrow down the skipped range by retrying
until this threshold is met OR all attempts get exhausted for this task.
Set the value to Long.MAX_VALUE to indicate that the framework need not try
to narrow down. Whatever records get skipped (this depends on the
application) are acceptable.
Default value is 0.

Parameters:

conf - the configuration

Returns:

maxSkipRecs, the acceptable number of skip records.

setMapperMaxSkipRecords

Set the number of acceptable skip records surrounding the bad record PER
bad record in the mapper. The number includes the bad record as well.
To turn the feature of detection/skipping of bad records off, set the
value to 0.
The framework tries to narrow down the skipped range by retrying
until this threshold is met OR all attempts get exhausted for this task.
Set the value to Long.MAX_VALUE to indicate that the framework need not try
to narrow down. Whatever records get skipped (this depends on the
application) are acceptable.
Default value is 0.

Parameters:

conf - the configuration

maxSkipRecs - acceptable skip records.
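The narrowing behavior described for the maximum skip records threshold can be pictured as repeated halving of the failing range. The sketch below is one plausible strategy written for illustration, not a statement of how Hadoop implements it. `NarrowSkipRangeSketch`, `Range`, `narrow`, and the `crashesOn` predicate are hypothetical names, and the sketch assumes the failing range contains exactly one bad record and that each trial costs one task attempt.

```java
import java.util.function.LongPredicate;

/** Toy bisection of a failing record range; not a Hadoop class. */
class NarrowSkipRangeSketch {

    /** Half-open record range [start, end). */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long size() {
            return end - start;
        }
    }

    /**
     * Shrinks a failing range until it holds at most maxSkipRecords records
     * or no attempts are left. A value of 0 means skipping is turned off,
     * so the caller is expected not to call this method in that case.
     */
    static Range narrow(Range failing, long maxSkipRecords, int attemptsLeft,
                        LongPredicate crashesOn) {
        if (maxSkipRecords <= 0) {
            throw new IllegalArgumentException("skipping is turned off");
        }
        Range current = failing;
        while (current.size() > maxSkipRecords && current.size() > 1 && attemptsLeft > 0) {
            long mid = current.start + current.size() / 2;
            Range firstHalf = new Range(current.start, mid);
            attemptsLeft--;                              // each trial uses one attempt
            current = crashes(firstHalf, crashesOn)
                    ? firstHalf
                    : new Range(mid, current.end);       // bad record must be in the rest
        }
        return current;
    }

    /** Models running the user function over a range: true if any record crashes it. */
    static boolean crashes(Range range, LongPredicate crashesOn) {
        for (long i = range.start; i < range.end; i++) {
            if (crashesOn.test(i)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Range r = narrow(new Range(0, 100), 1, 10, i -> i == 37);
        System.out.println("[" + r.start + ", " + r.end + ")"); // prints [37, 38)
    }
}
```

Passing Long.MAX_VALUE as the threshold makes the loop exit immediately, so the whole reported range is skipped without further trials. Running out of attempts returns whatever range has been reached so far, which may still exceed the threshold.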

getReducerMaxSkipGroups

Get the number of acceptable skip groups surrounding the bad group PER
bad group in the reducer. The number includes the bad group as well.
To turn the feature of detection/skipping of bad groups off, set the
value to 0.
The framework tries to narrow down the skipped range by retrying
until this threshold is met OR all attempts get exhausted for this task.
Set the value to Long.MAX_VALUE to indicate that the framework need not try
to narrow down. Whatever groups get skipped (this depends on the
application) are acceptable.
Default value is 0.

Parameters:

conf - the configuration

Returns:

maxSkipGrps, the acceptable number of skip groups.

setReducerMaxSkipGroups

Set the number of acceptable skip groups surrounding the bad group PER
bad group in the reducer. The number includes the bad group as well.
To turn the feature of detection/skipping of bad groups off, set the
value to 0.
The framework tries to narrow down the skipped range by retrying
until this threshold is met OR all attempts get exhausted for this task.
Set the value to Long.MAX_VALUE to indicate that the framework need not try
to narrow down. Whatever groups get skipped (this depends on the
application) are acceptable.
Default value is 0.