Forkjoin validation code is ridiculously slow in some cases

Details

Description

We've had a few users who have run into problems where submitting a workflow appears to hang (in the case of a subworkflow, it's similar but stuck in PREP). It turns out that if you wait long enough, it will actually go through and the workflow will run normally. The problem is that the forkjoin validation code is taking a really long time.

The attached example has a series of 20 forks where each fork has 6 actions (it's based on an actual workflow, but all of the names were changed and the actions were all replaced by simple shell actions). One of our support guys said it took 1-2 hours , but on my computer it was taking 15+ hours (I had to cancel it)
While this example doesn't have any nested forks, those can also take a long time too.

It's easy to verify that it's the forkjoin validation code that's taking so long by looking at a jstack of the Oozie server and seeing deep recursive calls to org.apache.oozie.workflow.lite.LiteWorkflowAppParser.validateForkJoin. I also noticed a lot of sitting around in calls LinkedList.contains.

I think we have 3 options:

See if we can make the existing code faster somehow. Perhaps there's a way to parallelize it? Maybe there's some redundant checking that we can identify and skip? Change some data structures? etc

See if we can write a new way to do this validation. I had originally completely rewritten this code a while ago, and we've since made a few fixes to catch edge cases and things. Perhaps it needs another rewrite?

Try to identify when it's taking a long time and at least let the user know what's happening or something. Right now, it just appears that the Oozie CLI has hung and the job doesn't show up in the Oozie server. Most users aren't going to wait more than a minute or two.

Issue Links

Activity

FYI: I've been slowly working on this and have it roughly halfway done.

We've also seen an issue due to this where a Coordinator gets stuck and you can't even kill it because the Kill command can't acquire the Coordinator's lock since it's being held while the forkjoin validation is taking forever.

Robert Kanter
added a comment - 18/Sep/15 22:41 FYI: I've been slowly working on this and have it roughly halfway done.
We've also seen an issue due to this where a Coordinator gets stuck and you can't even kill it because the Kill command can't acquire the Coordinator's lock since it's being held while the forkjoin validation is taking forever.

Robert Kanter,
Take a look at Purshotam Shah's patch in OOZIE-2345 as well. It parallelizes job submission in fork as we were hitting major slowness with that with 20+ fork jobs with 15m SLA when there was some NN slowness. We did not look into the forkjoin validation though.

Rohini Palaniswamy
added a comment - 21/Sep/15 17:21 Robert Kanter ,
Take a look at Purshotam Shah 's patch in OOZIE-2345 as well. It parallelizes job submission in fork as we were hitting major slowness with that with 20+ fork jobs with 15m SLA when there was some NN slowness. We did not look into the forkjoin validation though.

The problem is that ALL possible execution path is covered by the validateForkJoin() method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths.

Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of LinkedList.contains().

I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.

Peter Bacsko
added a comment - 11/Jul/16 14:28 - edited HI
I was looking at this problem and I think I found out why it is so slow sometimes. Here is a small graph:
A1--->A2--->A3 A7--->A8--->A9
| | | |
S--> F1 J1-->F2 J2--->E
| | | |
A4--->A5--->A6 A10-->A11-->A12
The problem is that ALL possible execution path is covered by the validateForkJoin() method. For example, the first recursion choses the path F1->A1->A2->A3->J1 and then F2->A7->A8->A9->J2. We go back to check F2->A10->A11->A12->J2. Then the calls return to F1 where the second path is chosen: F1->A4->A5->A6. However, it again re-walks both paths for F2->J2. If you have a lot of fork-join pairs, that's a massive amount of possible paths.
Eg. we have 10 ForkJoin pairs with 5 forks each, that's 500 paths it total. In the attached example, we have 20 pairs with 6 actions, that is 6^20 (3 656 158 440 062 976) paths. No wonder it doesn't finish in 15 hours. Also, at each invocation, we check if we found a cycle - that's the explanation for the lot of LinkedList.contains() .
I'm already working on a solution which separately verifies the acyclic property of the graph and then it validates FJ. It's not yet finished, but I did a test run on the example and it took <1ms to complete. So looks like it's working. I'll attach the patch as soon as it passes all unit tests.

That's a fancy ASCII diagram Peter Bacsko. I realized that there's a ton of paths before, but didn't realize there were that many (I never did the math here).

Anyway, I have a work-in-progress patch that I had been meaning to upload for quite some time, but never got around to it. I'll attach it now. Feel free to use it or steal things from it, but if you already have something, that's fine too. It's been so long since I looked at it, and this whole thing gets complicated, so I don't remember how it works or what else I was planning to do, though there are some comments.

Robert Kanter
added a comment - 11/Jul/16 23:36 That's a fancy ASCII diagram Peter Bacsko . I realized that there's a ton of paths before, but didn't realize there were that many (I never did the math here).
Anyway, I have a work-in-progress patch that I had been meaning to upload for quite some time, but never got around to it. I'll attach it now. Feel free to use it or steal things from it, but if you already have something, that's fine too. It's been so long since I looked at it, and this whole thing gets complicated, so I don't remember how it works or what else I was planning to do, though there are some comments.

Robert Kanter thanks for the patch. After taking a quick look, it roughly has the same approach as mine - the TODOs are useful to think about, I'm not sure we really validate all edge cases. For example, I can't find a test case in TestLiteWorkflow which validates loop detection (error E707). So at least one additional test is necessary, but maybe there are some more.

Peter Bacsko
added a comment - 12/Jul/16 16:32 Robert Kanter thanks for the patch. After taking a quick look, it roughly has the same approach as mine - the TODOs are useful to think about, I'm not sure we really validate all edge cases. For example, I can't find a test case in TestLiteWorkflow which validates loop detection (error E707). So at least one additional test is necessary, but maybe there are some more.

I extracted all validation stuff from LiteWorkflowAppParser to a separate class. Acyclic validation was already there but I changed it slightly (with this approach we can actually print the path that leads to the loop - I stole this from the existing validateForkJoin)

I think we need extra validation for transitions, eg. we need to make sure that a node (other than End and Kill) points to someting. I don't think this is checked right now. Or is it checked on schema level?

Add acyclic test (at least one)

Add test with a big fork-join WF, like the one which is attached. Probably should run on a separate thread just in case it doesn't terminate (I've seen this in Robert's patch)

Peter Bacsko
added a comment - 13/Jul/16 14:16 I submitted my patch for review.
Couple of things:
I extracted all validation stuff from LiteWorkflowAppParser to a separate class. Acyclic validation was already there but I changed it slightly (with this approach we can actually print the path that leads to the loop - I stole this from the existing validateForkJoin)
I think we need extra validation for transitions, eg. we need to make sure that a node (other than End and Kill) points to someting. I don't think this is checked right now. Or is it checked on schema level?
Add acyclic test (at least one)
Add test with a big fork-join WF, like the one which is attached. Probably should run on a separate thread just in case it doesn't terminate (I've seen this in Robert's patch)
We might think about other edge cases and tests

+1 PATCH_APPLIES+1 CLEAN+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 2 testcase(s)+1 RAT
. +1 the patch does not seem to introduce new RAT warnings+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files+1 TESTS
. Tests run: 1787+1 DISTRO
. +1 distro tarball builds with the patch

+1 PATCH_APPLIES+1 CLEAN+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 4 testcase(s)+1 RAT
. +1 the patch does not seem to introduce new RAT warnings+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files-1 TESTS
. Tests run: 1790
. Tests failed: 8
. Tests errors: 0

+1 PATCH_APPLIES+1 CLEAN+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 4 testcase(s)+1 RAT
. +1 the patch does not seem to introduce new RAT warnings+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings-1 COMPILE
. -1 HEAD does not compile
. -1 patch does not compile
. +1 the patch does not seem to introduce new javac warnings+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files-1 TESTS - patch does not compile, cannot run testcases+1 DISTRO
. +1 distro tarball builds with the patch

Hadoop QA
added a comment - 28/Jul/16 03:34 Testing JIRA OOZIE-1978
Cleaning local git workspace
----------------------------
+1 PATCH_APPLIES
+1 CLEAN
+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 4 testcase(s)
+1 RAT
. +1 the patch does not seem to introduce new RAT warnings
+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings
-1 COMPILE
. -1 HEAD does not compile
. -1 patch does not compile
. +1 the patch does not seem to introduce new javac warnings
+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files
-1 TESTS - patch does not compile, cannot run testcases
+1 DISTRO
. +1 distro tarball builds with the patch
----------------------------
-1 Overall result, please check the reported -1(s)
The full output of the test-patch run is available at
. https://builds.apache.org/job/oozie-trunk-precommit-build/3081/

+1 PATCH_APPLIES+1 CLEAN+1 RAW_PATCH_ANALYSIS
. +1 the patch does not introduce any @author tags
. +1 the patch does not introduce any tabs
. +1 the patch does not introduce any trailing spaces
. +1 the patch does not introduce any line longer than 132
. +1 the patch does adds/modifies 3 testcase(s)+1 RAT
. +1 the patch does not seem to introduce new RAT warnings+1 JAVADOC
. +1 the patch does not seem to introduce new Javadoc warnings+1 COMPILE
. +1 HEAD compiles
. +1 patch compiles
. +1 the patch does not seem to introduce new javac warnings+1 BACKWARDS_COMPATIBILITY
. +1 the patch does not change any JPA Entity/Colum/Basic/Lob/Transient annotations
. +1 the patch does not modify JPA files-1 TESTS
. Tests run: 1800
. Tests failed: 1
. Tests errors: 0

Robert Kanter
added a comment - 16/Sep/16 01:38 +1 LGTM
Though given the complexity of this patch, it would be good to have at least one other person review it (e.g. Purshotam Shah , Rohini Palaniswamy , Satish Subhashrao Saley , Jaydeep Vishwakarma , Abhishek Bafna ).