Pig needs a more efficient DAG execution

Details

Type: Improvement

Status: Resolved

Priority: Major

Resolution: Duplicate

Affects Version/s: None

Fix Version/s: None

Component/s: None

Labels: None

Description

The current code uses Hadoop's JobControl to execute one stage at a time. The first stage includes all jobs with no dependencies, the second stage includes the jobs that depend only on jobs completed in the first stage, the third stage includes the jobs that depend on jobs from stages 1 and 2, and so on.
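As an illustration only, the staging rule described above can be sketched in a few lines of Python. This is not Pig's actual code: the function name `assign_stages` and the dictionary layout (job name mapped to the set of jobs it depends on) are assumptions made for the example.

```python
def assign_stages(deps):
    """Map each job to a 1-based stage number.

    deps is a hypothetical layout: job name -> set of job names it depends on.
    A job lands in the first stage in which all of its dependencies
    have already been placed in earlier stages.
    """
    stages = {}
    remaining = dict(deps)
    stage = 1
    while remaining:
        # Jobs whose dependencies were all assigned to earlier stages.
        ready = [job for job, parents in remaining.items()
                 if all(p in stages for p in parents)]
        if not ready:
            raise ValueError("dependency cycle detected")
        for job in ready:
            stages[job] = stage
        for job in ready:
            del remaining[job]
        stage += 1
    return stages


# Example DAG: C depends on A, D depends on both C and B.
example_deps = {"A": set(), "B": set(), "C": {"A"}, "D": {"C", "B"}}
example_stages = assign_stages(example_deps)
```

Here A and B have no dependencies and form stage 1, C forms stage 2, and D forms stage 3.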

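The problem with this simplistic approach is that each stage starts only when the previous stage is over, which means that some branches of the DAG are unnecessarily blocked. For example, a job whose only parent finishes early still has to wait for the slowest job in its parent's stage.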
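To make the blocking concrete, here is a rough Python sketch that compares finish times under the two policies. It assumes an acyclic graph and unlimited cluster capacity (every ready job starts immediately); the function names, job names, and durations are made up for the example and do not come from Pig or Hadoop.

```python
def stage_finish_times(deps, duration):
    """Finish time per job when a stage starts only after the previous one ends."""
    finish = {}
    done = set()
    clock = 0
    while len(done) < len(deps):
        # Jobs whose parents all completed in earlier stages.
        ready = [j for j in deps if j not in done and deps[j] <= done]
        if not ready:
            raise ValueError("dependency cycle detected")
        for j in ready:
            finish[j] = clock + duration[j]
        # The next stage waits for the slowest job of this stage.
        clock = max(finish[j] for j in ready)
        done.update(ready)
    return finish


def dependency_finish_times(deps, duration):
    """Finish time per job when a job starts as soon as its own parents finish."""
    finish = {}

    def visit(job):
        if job not in finish:
            start = max((visit(p) for p in deps[job]), default=0)
            finish[job] = start + duration[job]
        return finish[job]

    for job in deps:
        visit(job)
    return finish


# Two independent branches: A -> C is short, B -> D starts with a long job.
deps = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}}
duration = {"A": 1, "B": 10, "C": 1, "D": 1}

by_stage = stage_finish_times(deps, duration)
by_dependency = dependency_finish_times(deps, duration)
```

In this example C finishes at time 11 under stage-at-a-time execution, because it waits for B even though it does not depend on B, and at time 2 when only its real dependency is tracked.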

We would need to do our own DAG management to solve this issue, which would be a pretty significant undertaking. Something we should look at in the future.

Jeff Hammerbacher
added a comment - 17/Nov/10 04:50 Some work in this direction has been done by the Hive team (HIVE-549). There has also been a proposal for Pig and Hive to unify their plan execution frameworks (HIVE-1107), potentially using Oozie.

Santhosh Srinivasan
added a comment - 17/Nov/10 19:41 +1 on the proposal to move to an external workflow execution engine. However note that the use of the workflow execution engine should not be enforced but should be optional.

Jeff Hammerbacher
added a comment - 17/Nov/10 21:24 "However note that the use of the workflow execution engine should not be enforced but should be optional."
Certainly agree that we shouldn't disrupt existing users.

Arun C Murthy
added a comment - 17/Nov/10 21:54 +1 on a more efficient DAG execution engine, and for exploring common infrastructure between Pig and Hive.
It's hard to keep this in sync with HIVE-549, but I'll try.
Jeff and I came up with some requirements:
A way to serialize and exchange this DAG (e.g. Avro, JSON, XML)
A service to execute the DAG and ensure it runs to completion
Ability to modify the DAG on the fly, potentially in reaction to execution of parents of the nodes.
Maybe shared infrastructure for ability to restart the necessary components of the DAG etc.
Given the above, I do not believe Oozie is the right answer. I'd agree with Zheng ( https://issues.apache.org/jira/browse/HIVE-1107?focusedCommentId=12805351&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12805351 ) that enhancing JobControl would probably be the sweet spot; this way Pig, Hive and even Oozie can use it.
Russel Jurney has similar views against using Oozie too: https://issues.apache.org/jira/browse/HIVE-1107?focusedCommentId=12888870&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12888870
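
Two of the requirements listed in this comment, a serializable DAG and the ability to restart only the necessary components, can be sketched as follows. This is a minimal illustration under stated assumptions: the JSON layout (job name mapped to a list of parent names) and all function names are hypothetical, not a format or API used by Pig, Hive, JobControl, or Oozie.

```python
import json


def dag_to_json(deps):
    """Serialize a DAG given as job name -> set of parent job names."""
    return json.dumps({job: sorted(parents) for job, parents in deps.items()},
                      sort_keys=True)


def dag_from_json(text):
    """Rebuild the job -> set-of-parents mapping from its JSON form."""
    return {job: set(parents) for job, parents in json.loads(text).items()}


def jobs_to_rerun(deps, failed):
    """Return the failed jobs plus every job downstream of them."""
    rerun = set(failed)
    changed = True
    while changed:
        changed = False
        for job, parents in deps.items():
            # A job must be rerun if any of its parents must be rerun.
            if job not in rerun and parents & rerun:
                rerun.add(job)
                changed = True
    return rerun


# Example DAG: C depends on A, D depends on B, E depends on C and D.
dag = {"A": set(), "B": set(), "C": {"A"}, "D": {"B"}, "E": {"C", "D"}}
restored = dag_from_json(dag_to_json(dag))
to_rerun = jobs_to_rerun(restored, {"B"})
```

If B fails in this example, only B, D, and E need to run again; A and C keep their results.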