Description

Instead of computing the input splits as part of job submission, Hadoop could have a separate "job task type" that computes the input splits, allowing that computation to happen on the cluster.
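To make the proposal concrete, the computation that would move onto the cluster is essentially the block-aligned slicing JobSubmitter performs today. A minimal sketch, with simplified stand-in names rather than the actual Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hypothetical stand-in for an input split: a byte range within a file.
    static class Split {
        final long offset, length;
        Split(long offset, long length) { this.offset = offset; this.length = length; }
    }

    // Divide a file of the given size into splits of at most splitSize bytes,
    // mirroring FileInputFormat-style slicing. This is the work that today
    // runs in the client and that the proposal would run as a cluster task.
    static List<Split> computeSplits(long fileSize, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long offset = 0;
        while (offset < fileSize) {
            long length = Math.min(splitSize, fileSize - offset);
            splits.add(new Split(offset, length));
            offset += length;
        }
        return splits;
    }

    public static void main(String[] args) {
        // A 300-byte file with a 128-byte split size yields splits of
        // 128, 128, and 44 bytes.
        System.out.println(computeSplits(300L, 128L).size()); // 3
    }
}
```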



Ming Ma
added a comment - 27/Jun/14 23:51

Thanks, Gera. Nice work; this will be quite useful. Overall it looks good. Per offline discussion with Gera:

1. It is unclear whether there are any security-related implications, such as https://issues.apache.org/jira/browse/MAPREDUCE-5663.
2. Compatibility between a new MR client with this feature and a cluster running old MR. Since the new MR client won't compute the splits by default, the job will fail if the cluster still uses old MR; in that case, the new MR client needs to be configured to compute the splits. For the more general case where a new MR client can talk to some clusters with old MR and some with new MR, it would be nice if the client could discover whether the cluster supports this feature.


Gera Shegalov
added a comment - 27/May/14 22:00

Assuming that the TestPipeApplication failure is MAPREDUCE-5868, v05 is ready for review. The code can be further optimized to avoid reading the splits back when they are written for the first time; we can incorporate that if the approach is accepted in general. There is plenty of coverage for job submission that helped shape the patch. Since it's mere refactoring, no new functional tests are urgently needed.


Hadoop QA
added a comment - 27/May/14 10:36

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12646848/MAPREDUCE-207.v05.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app, hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core, and hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:
org.apache.hadoop.mapred.pipes.TestPipeApplication
+1 contrib tests. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4624//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4624//console
This message is automatically generated.



Gera Shegalov
added a comment - 14/May/14 21:05

Steve Loughran, thanks for your comment in MAPREDUCE-5887. Moving it here:

One test to try there is what happens when the blocksize is reported as very, very small (you can configure this in swiftfs). In the client this will cause the submitting process to OOM and fail. Presumably the same outcome in the AM is the simplest to implement - we just need to make sure that YARN recognises this as a failure and only tries a couple of times.

OOMs, like any other AM failure, are treated as an application attempt failure (yarn.resourcemanager.am.max-attempts). We've experienced such issues in production, and it is actually usually only indirectly related to splits, i.e. the job state comprising all map and reduce attempts is too big for the default MR-AM container size.

Before doing the work on moving the split calculation to the MR-AM, I was actually thinking about auto-tuning yarn.app.mapreduce.am.resource.mb and the Xmx opts in JobSubmitter. However, even if the split calculation happens in the AM, we can come up with an AM-RM RPC like "start a new attempt with the new settings".
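For readers following along, the knobs mentioned above are ordinary Hadoop configuration properties (the retry limit in yarn-site.xml, the AM container size and JVM opts in mapred-site.xml). A sketch with purely illustrative values, not recommendations:

```xml
<!-- yarn-site.xml: how many AM attempts YARN allows before failing the app -->
<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>

<!-- mapred-site.xml: MR-AM container size and JVM heap; illustrative values -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx1638m</value>
</property>
```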



Johannes Zillmann
added a comment - 04/Sep/12 10:49

Currently in our Hadoop applications we calculate the splits before we submit the job (the client then simply looks up the existing splits). We do that mainly to influence the reducer count based on the number of splits/map tasks.
If Hadoop does the splitting on the cluster (which makes sense), it would be nice to have a hook to influence the configuration!
Sometimes it also makes sense for us to decide on the map-reduce assembly after we know the splits (different join strategies for different data constellations).
Just dumping some ideas here...
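The kind of hook Johannes asks for might look like the following sketch: a heuristic that derives the reducer count from the split count once the splits are known. The name and the ratio-based heuristic are hypothetical, not a Hadoop API:

```java
public class ReducerHookSketch {
    // Hypothetical heuristic: one reducer per mapsPerReducer map tasks,
    // rounded up, with at least one reducer. A cluster-side split
    // calculation could invoke a hook like this before scheduling reducers.
    static int reducersForSplits(int numSplits, int mapsPerReducer) {
        return Math.max(1, (numSplits + mapsPerReducer - 1) / mapsPerReducer);
    }

    public static void main(String[] args) {
        System.out.println(reducersForSplits(37, 10)); // 4
        System.out.println(reducersForSplits(3, 10));  // 1
    }
}
```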


Arun C Murthy
added a comment - 26/Sep/11 09:39

As foretold, here is a trivial, preliminary patch to move the computation of input splits inside the cluster - something we've craved for a very long time, as evinced by the interest in this JIRA and the number of times it comes up on the user lists.
This is huge, because it's a significant step towards various improvements such as HTTP-based job submission.
Shameless plug for MRv2 - it took me 15 minutes on a Sunday night to get this done... glory to MRv2! :)
It needs a tad more work to get delegation tokens on the client side, but it's nearly there.


Matei Zaharia
added a comment - 04/Jul/09 02:20

I think that's almost right, Philip. It looks to me like TASK_CLEANUP tasks can be both maps and reduces. The JobTracker will launch them in a reduce slot if they are cleaning up after a reducer, so isMapTask() might return false when the task is a cleanup task. To check whether a given Task is a plain old map task or a plain old reduce task, you can use Task.isMapOrReduce().
This part of the code definitely leaves something to be desired. I believe Arun mentioned he'd look at it as part of the JobTracker refactoring in the future.
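The distinction Matei draws can be illustrated with a simplified mock; these are stand-in classes, not the actual Hadoop JobTracker code:

```java
public class TaskTypeSketch {
    // Simplified mirror of the real TaskType enum.
    enum TaskType { MAP, REDUCE, JOB_SETUP, JOB_CLEANUP, TASK_CLEANUP }

    static class Task {
        final boolean runsInMapSlot;
        final TaskType type;
        Task(boolean runsInMapSlot, TaskType type) {
            this.runsInMapSlot = runsInMapSlot;
            this.type = type;
        }
        // Slot-based check: a cleanup task launched in a reduce slot
        // reports false here even though it is not a "real" reduce.
        boolean isMapTask() { return runsInMapSlot; }
        // Type-based check: true only for plain map/reduce work,
        // never for setup or cleanup tasks.
        boolean isMapOrReduce() { return type == TaskType.MAP || type == TaskType.REDUCE; }
    }

    public static void main(String[] args) {
        Task reducerCleanup = new Task(false, TaskType.TASK_CLEANUP);
        System.out.println(reducerCleanup.isMapTask());     // false: runs in a reduce slot
        System.out.println(reducerCleanup.isMapOrReduce()); // false: it is a cleanup task
        Task plainMap = new Task(true, TaskType.MAP);
        System.out.println(plainMap.isMapOrReduce());       // true
    }
}
```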


Philip Zeyliger
added a comment - 04/Jul/09 01:01

I've been poking around here and am running into a fair amount of friction with how different task types are managed.
As far as I can tell, there are several ways that different task types are distinguished:

1. There's a TaskType enum, which contains MAP, REDUCE, JOB_SETUP, JOB_CLEANUP, and TASK_CLEANUP. This is used quite a bit.
2. TaskInProgress has isMapTask(), isJobCleanupTask(), and isJobSetupTask(). I believe a TIP can report both isMapTask() and isJobCleanupTask() on the same object, and that reduces are implied by !isMapTask().
3. Task uses a hybrid approach. There are MapTask and ReduceTask (a class hierarchy), but there are also isMapTask(), isJobSetupTask(), isTaskCleanupTask(), and isJobCleanupTask().
4. Schedulers and TaskTrackers for the most part only deal with MAP and REDUCE tasks. Really, these are "slot types", since other types of tasks can run in them. Schedulers are not aware of the "special tasks" - the JobTracker schedules them "manually" on its own.

Does this sound about right?
-- Philip


Owen O'Malley
added a comment - 15/Jun/09 17:49

This patch should reintroduce checkInputSplits into org.apache.hadoop.mapreduce.InputFormat. This method should be documented as optional. It will only be invoked if Java code is doing the submission, to detect errors in the user's job configuration, such as a missing or read-protected input directory, before the job is submitted to the cluster.
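What Owen proposes might look like the following sketch: an optional, default no-op validation hook that a format can override to fail fast before submission. The class and field names are simplified stand-ins, not the real Hadoop API:

```java
import java.io.IOException;

public class CheckSplitsSketch {
    // Hypothetical stand-in for a job configuration: just the input dir.
    static class JobConf {
        final String inputDir;
        JobConf(String inputDir) { this.inputDir = inputDir; }
    }

    abstract static class InputFormat {
        // Optional client-side validation hook, in the spirit of the
        // proposed checkInputSplits: invoked only when Java code performs
        // the submission, to surface configuration errors early.
        void checkInputSplits(JobConf conf) throws IOException { /* default: no-op */ }
    }

    static class FileInputFormat extends InputFormat {
        @Override
        void checkInputSplits(JobConf conf) throws IOException {
            // Reject a missing input directory before the job is submitted.
            if (conf.inputDir == null || !new java.io.File(conf.inputDir).exists()) {
                throw new IOException("Input path does not exist: " + conf.inputDir);
            }
        }
    }

    public static void main(String[] args) {
        try {
            new FileInputFormat().checkInputSplits(new JobConf("/no/such/dir"));
            System.out.println("ok");
        } catch (IOException e) {
            System.out.println("rejected");
        }
    }
}
```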

Amareshwari Sriramadasu
added a comment - 15/Jun/09 05:10

"Isn't it possible to do this as part of the JOB_SETUP task itself?" This can be done. We should move the creation of the setup/cleanup tasks out of JobInProgress.initTasks().


Hemanth Yamijala
added a comment - 15/Jun/09 04:59

Before we do this, I think we should resolve HADOOP-4421, at least to the extent of agreeing on a design. Adding one more task while we are trying to fix problems with the existing ones might make things a tad more difficult to manage.


Philip Zeyliger
added a comment - 14/Jun/09 23:58

The motivation behind computing the input splits on the cluster is at least two-fold:

1. It would be great to be able to submit jobs to a cluster using a simple (REST?) API, from many languages. (Similar to HADOOP-5633.) The fact that job submission does a bunch of mapreduce-internal work makes such submission very tricky. We're already seeing workflow systems (here I'm thinking of Oozie and Pig) run MR jobs simply to launch more MR jobs, while inheriting the scheduling and isolation work that the JobTracker already does.
2. Sometimes computing the input splits is, in and of itself, an operation that would do well to run in parallel across several machines. For example, splitting the inputs may require going through many files on the DFS. Moving the input split calculation onto the cluster would pave the way for this to be possible.

Implementation-wise, we already have JOB_SETUP and JOB_CLEANUP tasks, so adding a JOB_SPLIT_CALCULATION task, which could be colocated with JOB_SETUP, makes some sense.