Description

Many dependent projects like HBase, Hive, Pig, etc, depend on MiniMRCluster for writing tests. Many users do as well. MiniMRCluster, however, exposes MR implementation details like the existence of TaskTrackers, JobTrackers, etc, since it was used by MR1 for testing the server implementations as well.

This JIRA is to create a new interface which could be implemented either by MR1 or MR2 that exposes only the client-side portions of the MR framework. Ideally it would be "recompile-compatible" with MiniMRCluster for most applications, and the MR1 implementation could be backported to 20x branch. Thus, dependent projects like HBase could migrate to this implementation and test against both MR1 and MR2. We can also use this to port over the current functional tests that use only the client-side features of MiniMRCluster.

Ahmed Radwan
added a comment - 18/Nov/11 10:38 > should MiniMRCluster be deprecated in 23/trunk in favour of MiniMRClientClusterFactory.
Agree, attaching a small amendment to deprecate MiniMRCluster on trunk, so it is clear to use MiniMRClientClusterFactory instead.

Siddharth Seth
added a comment - 17/Nov/11 21:43 As part of this, should MiniMRCluster be deprecated in 23/trunk in favour of MiniMRClientClusterFactory, or an additional note in MiniMRCluster pointing people to MiniMRClientClusterFactory.

> Should we call it something other than MiniMRCluster to prevent confusion?

Yes. The patch for 0.20 does not change MiniMRCluster, and for 23/trunk it provides an implementation of MiniMRCluster for the operations that are possible when running with YARN (the jobtracker/tasktracker-specific methods will throw UnsupportedOperationException). In addition, a new interface called MiniMRClientCluster is introduced (in 20/23/trunk) that downstream projects can start using.

Tom White
added a comment - 17/Nov/11 20:55 > Should we call it something other than MiniMRCluster to prevent confusion?
Yes. The patch for 0.20 does not change MiniMRCluster, and for 23/trunk it provides an implementation of MiniMRCluster for the operations that are possible when running with YARN (the jobtracker/tasktracker-specific methods will throw UnsupportedOperationException). In addition, a new interface called MiniMRClientCluster is introduced (in 20/23/trunk) that downstream projects can start using.

Alejandro Abdelnur
added a comment - 17/Nov/11 20:42 MiniMRCluster is on purpose to allow downstream projects to continue using the same API for testing for all their legacy testcases with none/minimal disruption.
For new stuff folks should use MiniMRClientCluster via MiniMRClientClusterFactory.
I'm planning to commit this now, are we OK?

Hadoop QA
added a comment - 14/Nov/11 06:42 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12503565/MAPREDUCE-3169-0.20-security_rev2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 8 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1299//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1299//console
This message is automatically generated.

Hadoop QA
added a comment - 14/Nov/11 06:33 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12503564/MAPREDUCE-3169-trunk_rev3.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 10 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1298//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1298//console
This message is automatically generated.

> Why do some operations in MiniMRCluster do nothing, while others throw a UnsupportedOperationException?

This was a subjective assessment, so methods that are tightly coupled with MR1 will throw an UnsupportedOperationException so they'll noisily fail. Other methods are just unimplemented as the corresponding functionality is abscent in MR2. I can change all of them to throw an UnsupportedOperationException but wanted to be as flexible as possible with old code.

Ahmed Radwan
added a comment - 14/Nov/11 06:02 Thanks Tom!
Please see updated patch incorporating your comments.
> Why do some operations in MiniMRCluster do nothing, while others throw a UnsupportedOperationException?
This was a subjective assessment, so methods that are tightly coupled with MR1 will throw an UnsupportedOperationException so they'll noisily fail. Other methods are just unimplemented as the corresponding functionality is abscent in MR2. I can change all of them to throw an UnsupportedOperationException but wanted to be as flexible as possible with old code.
> Please add class-level javadoc.
Added to all new classes.
> MiniMRClientClusterFactory should use Class<?>, not a raw Class.
Change incorporated in the new patch.
> MiniMRYarnClusterAdapter might be better named MiniMRClientYarnCluster.
The name used is "MiniMRClientCluster". The MiniMRYarnClusterAdapter name is used by the adapter for MiniMRYarnCluster providing the MiniMRClientCluster interface.

Tom White
added a comment - 08/Nov/11 17:34 Looks good to me.
Why do some operations in MiniMRCluster do nothing, while others throw a UnsupportedOperationException?
Please add class-level javadoc.
MiniMRClientClusterFactory should use Class<?>, not a raw Class.
MiniMRYarnClusterAdapter might be better named MiniMRClientYarnCluster.

Hadoop QA
added a comment - 08/Nov/11 00:43 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12502804/MAPREDUCE-3169-trunk_rev2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 10 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1265//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1265//console
This message is automatically generated.

For organization purpose, please note that the actual changes made to specific tests for the purpose of migration are tacked in MAPREDUCE-3369. This current ticket is only tracking general interfaces changes.

Ahmed Radwan
added a comment - 07/Nov/11 21:05 For organization purpose, please note that the actual changes made to specific tests for the purpose of migration are tacked in MAPREDUCE-3369 . This current ticket is only tracking general interfaces changes.

A revised patch for trunk. The patch mainly adds a mock class MiniMRCluster that wraps the new interfaces, so old tests can run with minimal changes. There are also some minor changes to the patch based on issues faced during tests migration.

Ahmed Radwan
added a comment - 07/Nov/11 21:02 A revised patch for trunk. The patch mainly adds a mock class MiniMRCluster that wraps the new interfaces, so old tests can run with minimal changes. There are also some minor changes to the patch based on issues faced during tests migration.

Tom White
added a comment - 01/Nov/11 16:50 Another nice benefit of this approach (either of Todd's 1 or 2) is that we could write a LocalJobRunner implementation of MiniMRCluster and get increased test coverage (e.g. by writing a parameterized test superclass, see http://junit.sourceforge.net/javadoc/org/junit/runners/Parameterized.html ). This would be a follow up JIRA.

I'm suggesting that MiniMRCluster becomes a wrapper of MiniMRClientCluster. The constructors would be kept not break old tests, only propagating configs that make sense to MiniMRClientCluster.

In addition MiniMRCluster API should be either trimmed of MR1 specific methods or make them throw UnsupportedOperationException.

While this would break some tests, most likely will be MR1 tests only. And, if it is the case for other projects, then those projects will have to make some changes to accomodate MR2. (although I'd believe this would be minimal or NIL).

Alejandro Abdelnur
added a comment - 28/Oct/11 22:48 I'm suggesting that MiniMRCluster becomes a wrapper of MiniMRClientCluster. The constructors would be kept not break old tests, only propagating configs that make sense to MiniMRClientCluster.
In addition MiniMRCluster API should be either trimmed of MR1 specific methods or make them throw UnsupportedOperationException.
While this would break some tests, most likely will be MR1 tests only. And, if it is the case for other projects, then those projects will have to make some changes to accomodate MR2. (although I'd believe this would be minimal or NIL).

I don't think we can feasibly update MiniMRCluster, since its constructors have lots of MR1-isms. If we break the constructor APIs, then we can't have upstream projects (eg HBase) compile against 20 and 23 interchangeably.

Just to make sure we're on the same page, here are the requirements as I see it:

1. Upstream projects must be able to start/stop a mini-cluster suitable for testing MR jobs in a close-to-real-life environment.
2. When building against 0.20 they should start MR1 clusters, when building against 0.23 they should start YARN/MR2 clusters.
3. Upstream projects should not be expected to move to any API that isn't supported by 0.20 (lots of people will continue to run 0.20 for at least a year if not longer)
4. Existing MR1 tests that don't depend on MR1 specifics should be updated to run against MR2 in trunk/23.

Are we all agreed on the above?

The open questions are:
1. Should we repurpose the existing classes or create a new class?
2. Do we need to do changes to 0.20.x series to create a compatible API?

I'm not sure exactly what you mean by "Re-implement old MiniMRCluster to be a facade on MiniMRClientCluster keeping only meaningful methods" – but it seems like you're suggesting API changes on MiniMRCluster in trunk. That would break requirement #3 above.

Todd Lipcon
added a comment - 28/Oct/11 21:50 I don't think we can feasibly update MiniMRCluster, since its constructors have lots of MR1-isms. If we break the constructor APIs, then we can't have upstream projects (eg HBase) compile against 20 and 23 interchangeably.
Just to make sure we're on the same page, here are the requirements as I see it:
1. Upstream projects must be able to start/stop a mini-cluster suitable for testing MR jobs in a close-to-real-life environment.
2. When building against 0.20 they should start MR1 clusters, when building against 0.23 they should start YARN/MR2 clusters.
3. Upstream projects should not be expected to move to any API that isn't supported by 0.20 (lots of people will continue to run 0.20 for at least a year if not longer)
4. Existing MR1 tests that don't depend on MR1 specifics should be updated to run against MR2 in trunk/23.
Are we all agreed on the above?
The open questions are:
1. Should we repurpose the existing classes or create a new class?
2. Do we need to do changes to 0.20.x series to create a compatible API?
I'm not sure exactly what you mean by "Re-implement old MiniMRCluster to be a facade on MiniMRClientCluster keeping only meaningful methods" – but it seems like you're suggesting API changes on MiniMRCluster in trunk. That would break requirement #3 above.

Re-implement old MiniMRCluster to be a facade on MiniMRClientCluster keeping only meaningful methods (h-mr-p/h-mr-regression-test)

Use MiniMRCluster for all (70 odd) testcases that use it for regression purposes (the current 'ant tests'). For the rest (20 odd) testcases that use MR1 specific stuff, we just delete those tests.

MiniMRCluster would in h-mr-p/h-mr-regression-test module, other projects (pig, hive, oozie, etc) would use it from there. It would not be avail in MR2 modules (to force all new stuff to use the new MiniMRClientCluster)

Alejandro Abdelnur
added a comment - 28/Oct/11 20:06 How about the following approach?
Don't do anything pre 0.23
For 0.23/trunk:
Have new MiniMRClientCluster (h-mr-p/h-mr-c/h-mr-c-jc)
Re-implement old MiniMRCluster to be a facade on MiniMRClientCluster keeping only meaningful methods (h-mr-p/h-mr-regression-test)
Use MiniMRCluster for all (70 odd) testcases that use it for regression purposes (the current 'ant tests'). For the rest (20 odd) testcases that use MR1 specific stuff, we just delete those tests.
MiniMRCluster would in h-mr-p/h-mr-regression-test module, other projects (pig, hive, oozie, etc) would use it from there. It would not be avail in MR2 modules (to force all new stuff to use the new MiniMRClientCluster)

We could go either route - the original proposal just means we need to change 100+ tests in Hadoop (in at least two branches) at the same time. I just checked and HBase, Pig and Hive use MiniMRCluster in only a few places, so the change there is manageable.

Tom White
added a comment - 27/Oct/11 23:07 We could go either route - the original proposal just means we need to change 100+ tests in Hadoop (in at least two branches) at the same time. I just checked and HBase, Pig and Hive use MiniMRCluster in only a few places, so the change there is manageable.

Yea, all of the MR-isms in the constructor is what worries me. I don't really want to maintain an API long-term with JT details, etc. Making this API a blessed part of 0.23 just means we have to hold on to it longer, no?

Todd Lipcon
added a comment - 27/Oct/11 22:37 Yea, all of the MR-isms in the constructor is what worries me. I don't really want to maintain an API long-term with JT details, etc. Making this API a blessed part of 0.23 just means we have to hold on to it longer, no?

Right, but in the migration path I'm suggesting, if they run against 0.20 they use the jobtracker implementation of MiniMRCluster, and if they run against 0.23 they use the YARN implementation. There are some MR1-isms in those constructors (e.g. jobTrackerPort) that we would keep, but nothing that I can see that would cause a problem. Does that make sense?

Tom White
added a comment - 27/Oct/11 22:19 Right, but in the migration path I'm suggesting, if they run against 0.20 they use the jobtracker implementation of MiniMRCluster, and if they run against 0.23 they use the YARN implementation. There are some MR1-isms in those constructors (e.g. jobTrackerPort) that we would keep, but nothing that I can see that would cause a problem. Does that make sense?

The issue is that HBase, Pig, and Hive obtain MiniMRClusters directly from a constructor, rather than any kind of factory method. So I think renaming it is impossible without causing a breaking change for all of those – even if they only use the "client operations".

Todd Lipcon
added a comment - 27/Oct/11 22:09 The issue is that HBase, Pig, and Hive obtain MiniMRClusters directly from a constructor, rather than any kind of factory method. So I think renaming it is impossible without causing a breaking change for all of those – even if they only use the "client operations".

I was thinking about a way of implementing this which is less invasive to tests that use MiniMRCluster by keeping its client interface the same.

In 0.20 (and 0.22) we would have a MiniMR1Cluster class that is a copy of the existing MiniMRCluster, then we would remove the non-client-side methods from MiniMRCluster. (Test code is not a part of the published API so it is acceptable to change it incompatibly, and in fact this change would minimize the amount of changes needed by tests.) The few tests that use the framework-specific features of MiniMRCluster would be changed to use MiniMR1Cluster, the others would not need changing.

In 0.23 and trunk we would have the same cutdown MiniMRCluster and the tests that use it would be moved to the mvn tree.

HBase, Pig, and Hive would not need to change unless they use non-client-side methods (I don't know to what extent they do).

Tom White
added a comment - 27/Oct/11 22:03 I was thinking about a way of implementing this which is less invasive to tests that use MiniMRCluster by keeping its client interface the same.
In 0.20 (and 0.22) we would have a MiniMR1Cluster class that is a copy of the existing MiniMRCluster, then we would remove the non-client-side methods from MiniMRCluster. (Test code is not a part of the published API so it is acceptable to change it incompatibly, and in fact this change would minimize the amount of changes needed by tests.) The few tests that use the framework-specific features of MiniMRCluster would be changed to use MiniMR1Cluster, the others would not need changing.
In 0.23 and trunk we would have the same cutdown MiniMRCluster and the tests that use it would be moved to the mvn tree.
HBase, Pig, and Hive would not need to change unless they use non-client-side methods (I don't know to what extent they do).

Sure, I wanted first to get feedback on this core change, before going forward and changing a lot of dependent tests (mainly client-side, input/output formats, etc.) that can be updated to use the new interface.

Ahmed Radwan
added a comment - 27/Oct/11 21:56 Sure, I wanted first to get feedback on this core change, before going forward and changing a lot of dependent tests (mainly client-side, input/output formats, etc.) that can be updated to use the new interface.

There are lots of tests that use MiniMRCluster - these need to be ported to use the new interface and moved into the mvn tree. Those which test JobTracker and TaskTracker internals shouldn't be updated, of course, but there are many client-only tests that serve as regression tests for MR which we need to be running. I'd like to see this done as a part of this work since going through this exercise will likely inform the MiniMRClientCluster API.

Tom White
added a comment - 27/Oct/11 21:27 Looks good.
There are lots of tests that use MiniMRCluster - these need to be ported to use the new interface and moved into the mvn tree. Those which test JobTracker and TaskTracker internals shouldn't be updated, of course, but there are many client-only tests that serve as regression tests for MR which we need to be running. I'd like to see this done as a part of this work since going through this exercise will likely inform the MiniMRClientCluster API.

I am attaching two patches; one for trunk and the other for 0.20-security.

I have added a new test case for both patches. It is a basic MR test for checking job submission and basic counters. The test classes in both patches are identical, but it'll use MiniMRYarnCluster on truck and MiniMRCluster on 0.20-security.

Ahmed Radwan
added a comment - 27/Oct/11 21:08 I am attaching two patches; one for trunk and the other for 0.20-security.
I have added a new test case for both patches. It is a basic MR test for checking job submission and basic counters. The test classes in both patches are identical, but it'll use MiniMRYarnCluster on truck and MiniMRCluster on 0.20-security.