It seems we're back at having a successful build being the exception rather than the rule.
In this case it was some(?) timeout (all tests ran and passed), but many previous runs had at least one test failing.

Flaky tests are useless. They do not add confidence to a run, and worse, they add noise, which requires us to manually filter the good from the bad runs by looking at the results.

There was talk about separating the flaky tests from the good ones.
Short term I propose to disable or remove every test that failed more than once in the last 10 runs.

This is getting quite frustrating.
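A rough sketch of how the "failed more than once in the last 10 runs" rule could be evaluated mechanically. Everything here is an assumption for illustration: the class name FlakyTestFinder, the idea that each run's failures are available as a list of test names, and the sample data in main (which run each failure happened in is made up). It is not how Jenkins or the HBase build reports failures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper, not part of HBase or Jenkins. */
final class FlakyTestFinder {
    private FlakyTestFinder() {}

    /**
     * Returns the tests that failed in more than one of the last `window` runs.
     * failuresPerRun is ordered oldest to newest; each element lists the tests
     * that failed in that run.
     */
    static Set<String> failedMoreThanOnce(List<List<String>> failuresPerRun, int window) {
        int from = Math.max(0, failuresPerRun.size() - window);
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> run : failuresPerRun.subList(from, failuresPerRun.size())) {
            // Count a test at most once per run, even if it is listed twice.
            for (String test : new HashSet<>(run)) {
                counts.merge(test, 1, Integer::sum);
            }
        }
        Set<String> flaky = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                flaky.add(e.getKey());
            }
        }
        return flaky;
    }

    public static void main(String[] args) {
        // Illustrative history only: the assignment of failures to runs is invented.
        List<List<String>> history = Arrays.asList(
            Arrays.asList("TestSplitLogManager.testUnassignedTimeout"),
            Collections.<String>emptyList(),
            Arrays.asList("TestSplitLogManager.testUnassignedTimeout",
                          "TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit"),
            Arrays.asList("TestSplitLogManager.testMultipleResubmits"),
            Arrays.asList("TestSplitLogManager.testUnassignedTimeout",
                          "TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit"));
        for (String test : failedMoreThanOnce(history, 10)) {
            System.out.println("candidate for disabling: " + test);
        }
    }
}
```

Counting each test once per run keeps a test that is reported twice in one run from being flagged on a single bad build.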

Agreed. Looking at TestSplitLogManager.testUnassignedTimeout.
It has some pretty tight timeouts (500ms), which is likely to be a problem on slow (or overloaded) build machines.
I'm doubling the timeouts.

Anyway, the current run just got past these two tests, let's hope there are no other failures.
Then we can tackle these tests in 0.94.4 and 0.96.

I "fixed" the top three failing tests (mostly just race conditions with bad timeouts). "Fixed" is in quotes, because I just looked at the point where the tests failed and made a better guess about how long to wait. Waiting some fixed amount of time is almost always bad in tests (unless it is a long wait as a safety guard to ensure the test will eventually end), but that was the fastest avenue to get them to pass.

Will also look at the other tests. I think we should get to 50% pass rate of the jenkins build soon, and then to at least 80% pass rate.
This might mean disabling some of the bad tests.

-- Lars
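For reference, the usual alternative to a fixed sleep is to poll for the condition and keep the long timeout only as the safety guard. Below is a minimal sketch of that idea, assuming Java 8 or later; PollingWait and waitFor are hypothetical names, not an existing HBase test utility, and the condition in main is only a stand-in for a real check such as "the task became unassigned".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper, not an existing HBase test utility. */
final class PollingWait {
    private PollingWait() {}

    /**
     * Polls the condition every intervalMs until it holds or timeoutMs elapses.
     * Returns true as soon as the condition holds; returns false on timeout or
     * if the waiting thread is interrupted.
     */
    static boolean waitFor(long timeoutMs, long intervalMs, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: becomes true roughly 50 ms after start.
        boolean met = waitFor(5_000, 10, () -> System.currentTimeMillis() - start >= 50);
        System.out.println("condition met: " + met);
    }
}
```

On a fast machine the wait returns almost immediately, and on an overloaded build machine it can use the whole timeout, so the timeout can be generous without slowing down the normal case.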


> Here're the test that failed recently without a fix:
>
> TestSplitLogManager.testUnassignedTimeout x 3
> TestSplitLogManager.testMultipleResubmits
> TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit x 2
> TestSplitTransactionOnCluster.testMasterRestartWhenSplittingIsPartial
> TestSplitTransactionOnCluster.testShouldThrowIOExceptionIfStoreFileSizeIsEmptyAndSHouldSuccessfullyExecuteRollback
> TestCatalogTrackerOnCluster.testBadOriginalRootLocation
> TestDistributedLogSplitting.testDelayedDeleteOnFailure
> TestScannerTimeout.test3686a
> TestReplication.testVerifyRepJob
> TestReplication.queueFailover
> TestFromClientSideWithCoprocessor.testPoolBehavior
> TestColumnSeeking.testDuplicateVersions
>
> Based on that at least TestSplitLogManager.testUnassignedTimeout should get the axe (or be investigated)
>
> -- Lars
>
> ----- Original Message -----
> From: Jimmy Xiang <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Cc:
> Sent: Wednesday, November 14, 2012 12:12 PM
> Subject: Re: Build failed in Jenkins: HBase-0.94 #589
>
> I agree. +1
>
> We can keep a list of flaky tests so that we can fix them later on.
>
> Thanks,
> Jimmy
>
> On Wed, Nov 14, 2012 at 11:55 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> > Sigh.
> >
> > It seems we're back at having a successful build being the exception rather than the rule.
> > In this case it was some(?) timeout (all tests ran and passed), but many previous runs had at least one test failing.
> >
> > Flaky tests are useless. They do not add confidence to a run, and worse they add noise, which requires us to manually filter the good from the bad runs but looking at the results.
> >
> > There was talk about separating the flaky tests from the good ones.
> > Short term I propose to disable or remove every test that failed more than once in the last 10 runs.
> >
> > This is getting quite frustrating.
> >
> > -- Lars
> >
> > ----- Original Message -----
> > From: Apache Jenkins Server <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Cc:
> > Sent: Wednesday, November 14, 2012 11:13 AM
> > Subject: Build failed in Jenkins: HBase-0.94 #589
> >
> > See <https://builds.apache.org/job/HBase-0.94/589/>

> I "fixed" the top three failing tests (mostly just race conditions with bad timeouts). "Fixed" is in quotes, because just looked at the point where the tests failed and made a better guess about how long to wait. Waiting some fixed amount of time is almost always bad in tests (unless it is a long wait as a safety guard to ensure the test will eventually end), but that was the fastest avenue to get them to pass.
>
> Will also look at the other tests. I think we should get to 50% pass rate of the jenkins build soon, and then to at least 80% pass rate.
> This might mean disabling some of the bad tests.
>
> -- Lars
>
> ________________________________
> From: lars hofhansl <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Wednesday, November 14, 2012 12:40 PM
> Subject: Re: Build failed in Jenkins: HBase-0.94 #589
>
> Agreed. Looking at TestSplitLogManager.testUnassignedTimeout.
> It has some pretty tight timeouts (500ms), which is likely to be a problem on slow (or overloaded) build machines.
> I'm doubling the timeouts.
>
> Anyway, the current run just got past these too tests, let's hope there're no other failures.
> Then we can tackle these tests in 0.94.4 and 0.96.
>
> -- Lars
>
> ----- Original Message -----
> From: Ted Yu <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
> Cc:
> Sent: Wednesday, November 14, 2012 12:30 PM
> Subject: Re: Build failed in Jenkins: HBase-0.94 #589
>
> I noticed the test failures in TestSplitTransactionOnCluster
>
> 0.94.3 has fix for region splitting issue. I think we should pay a little attention fixing TestSplitTransactionOnCluster so that it passes more often.
>
> Cheers
>
> On Wed, Nov 14, 2012 at 12:18 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > Here're the test that failed recently without a fix:
> >
> > TestSplitLogManager.testUnassignedTimeout x 3
> > TestSplitLogManager.testMultipleResubmits
> > TestSplitTransactionOnCluster.testShutdownFixupWhenDaughterHasSplit x 2
> > TestSplitTransactionOnCluster.testMasterRestartWhenSplittingIsPartial
> > TestSplitTransactionOnCluster.testShouldThrowIOExceptionIfStoreFileSizeIsEmptyAndSHouldSuccessfullyExecuteRollback
> > TestCatalogTrackerOnCluster.testBadOriginalRootLocation
> > TestDistributedLogSplitting.testDelayedDeleteOnFailure
> > TestScannerTimeout.test3686a
> > TestReplication.testVerifyRepJob
> > TestReplication.queueFailover
> > TestFromClientSideWithCoprocessor.testPoolBehavior
> > TestColumnSeeking.testDuplicateVersions
> >
> > Based on that at least TestSplitLogManager.testUnassignedTimeout should get the axe (or be investigated)
> >
> > -- Lars

------------------------------------------
Started by an SCM change
Building remotely on ubuntu3 in workspace <https://builds.apache.org/job/HBase-0.94/ws/>
Cleaning local Directory trunk
Checking out https://svn.apache.org/repos/asf/hbase/branches/0.94
ERROR: Failed to check out https://svn.apache.org/repos/asf/hbase/branches/0.94
org.tmatesoft.svn.core.SVNException: svn: OPTIONS /repos/asf/hbase/branches/0.94 failed
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:298)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:283)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:271)
    at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.exchangeCapabilities(DAVConnection.java:533)
    at org.tmatesoft.svn.core.internal.io.dav.DAVConnection.open(DAVConnection.java:98)
    at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.openConnection(DAVRepository.java:1011)
    at org.tmatesoft.svn.core.internal.io.dav.DAVRepository.getLatestRevision(DAVRepository.java:180)
    at org.tmatesoft.svn.core.wc.SVNBasicClient.getRevisionNumber(SVNBasicClient.java:482)
    at org.tmatesoft.svn.core.wc.SVNBasicClient.getLocations(SVNBasicClient.java:876)
    at org.tmatesoft.svn.core.wc.SVNBasicClient.createRepository(SVNBasicClient.java:534)
    at org.tmatesoft.svn.core.wc.SVNUpdateClient.doCheckout(SVNUpdateClient.java:901)
    at hudson.scm.subversion.CheckoutUpdater$1.perform(CheckoutUpdater.java:84)
    at hudson.scm.subversion.WorkspaceUpdater$UpdateTask.delegateTo(WorkspaceUpdater.java:136)
    at hudson.scm.SubversionSCM$CheckOutTask.perform(SubversionSCM.java:788)
    at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:769)
    at hudson.scm.SubversionSCM$CheckOutTask.invoke(SubversionSCM.java:753)
    at hudson.FilePath$FileCallableWrapper.call(FilePath.java:2309)
    at hudson.remoting.UserRequest.perform(UserRequest.java:118)
    at hudson.remoting.UserRequest.perform(UserRequest.java:48)
    at hudson.remoting.Request$2.run(Request.java:326)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: org.tmatesoft.svn.core.SVNErrorMessage: svn: OPTIONS /repos/asf/hbase/branches/0.94 failed
    at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:200)
    at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:146)
    at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:89)
    ... 26 more
Caused by: org.tmatesoft.svn.core.SVNException: svn: OPTIONS request failed on '/repos/asf/hbase/branches/0.94'
svn: timed out waiting for server
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:64)
    at org.tmatesoft.svn.core.internal.wc.SVNErrorManager.error(SVNErrorManager.java:51)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection._request(HTTPConnection.java:662)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection.request(HTTPConnection.java:292)
    ... 25 more
Caused by: org.tmatesoft.svn.core.SVNErrorMessage: svn: OPTIONS request failed on '/repos/asf/hbase/branches/0.94'
    at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:200)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection._request(HTTPConnection.java:660)
    ... 26 more
Caused by: org.tmatesoft.svn.core.SVNErrorMessage: svn: timed out waiting for server
    at org.tmatesoft.svn.core.SVNErrorMessage.create(SVNErrorMessage.java:200)
    at org.tmatesoft.svn.core.internal.io.dav.http.HTTPConnection._request(HTTPConnection.java:412)
    ... 26 more
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:327)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:193)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:180)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
    at java.net.Socket.connect(Socket.java:546)
    at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:584)
    at org.tmatesoft.svn.core.internal.util.SVNSocketConnection.run(SVNSocketConnection.java:57)
    ... 1 more
FATAL: null
java.lang.NullPointerException
    at java.util.ArrayList.addAll(ArrayList.java:530)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
    at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
    at hudson.model.AbstractProject.checkout(AbstractProject.java:1325)
    at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:682)
    at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
    at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:587)
    at hudson.model.Run.execute(Run.java:1543)
    at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
    at hudson.model.ResourceController.execute(ResourceController.java:88)
    at hudson.model.Executor.run(Executor.java:236)

This is a bunch of failures. It is a reverse of the trend where we were heading toward zero. Build #39 is running now. Let's see what it says. Will start backing out the above changes to see which brought on the above.

It's interesting that this build happened on Ubuntu 3.
As I wrote a few days back, the vast majority of all successful 0.94 test runs happened on Ubuntu 3.
When you look at the trunk build, the *only* successful runs happened on Ubuntu 3.

I noticed that integration tests ran as part of the build... one of them has 30-minutes runtime by default and so it failed.
Did anything change recently w/the setup? I thought the tests didn't run as part of the build before.

> I noticed that integration tests ran as part of the build... one of them has 30-minutes runtime by default and so it failed.
> Did anything change recently w/the setup? I thought the tests didn't run as part of the build before.

Hmm..... They should not be running as part of regular build.

I don't know who added them actually. They're executed in 0.95 as well I think: they are executed with mvn install.
They seem to need ~1 hour to run and failed 2 of the last 10 builds (so they fail 33% of the time when they are executed).

For now, on trunk, I removed them by setting a dummy value to it.test.

On the long term, the best option I see is to set another build, depending on trunk, to execute them. This would save some time on the main build, make the main build consistent with the precommit one, while still having the ITests running often.

On Wed, Apr 10, 2013 at 8:35 AM, Stack <[EMAIL PROTECTED]> wrote:

> On Tue, Apr 9, 2013 at 5:01 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:
>
> > I noticed that integration tests ran as part of the build... one of them has 30-minutes runtime by default and so it failed.
> > Did anything change recently w/the setup? I thought the tests didn't run as part of the build before.
>
> Hmm..... They should not be running as part of regular build.
>
> [INFO] HBase - Integration Tests ......................... FAILURE [35:19.292s]
>
> Will look in morning. Maybe nkeywal has some input? (He just came online).
>
> St.Ack

> I don't know who added them actually. They're executed in 0.95 as well I think: they are executed with mvn install.
> They seem to need ~1 hour to run and failed 2 of the last 10 builds (so they fail 33% of the time when they are executed).
>
> For now, on trunk, I removed them by setting a dummy value to it.test.

Thanks N. I did the same in 0.95. I filed https://issues.apache.org/jira/browse/HBASE-8319 to figure how to turn them off again so they are not inline w/ general unit test suite.

> On the long term, the best option I see is to set another build, depending on trunk, to execute them. This would save some time on the main build, make the main build consistent with the precommit one, while still having the ITests running often

Yes.

I think it.test is needed to package them properly.
What causes them to be executed in the first place? Do we now have a verify step? That wasn't the case before.

On Wed, Apr 10, 2013 at 9:38 AM, Stack <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 10, 2013 at 12:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
>
> > I don't know who added them actually. They're executed in 0.95 as well I think: they are executed with mvn install.
> > They seem to need ~1 hour to run and failed 2 of the last 10 builds (so they fail 33% of the time when they are executed).
> >
> > For now, on trunk, I removed them by setting a dummy value to it.test.
>
> Thanks N. I did the same in 0.95. I filed https://issues.apache.org/jira/browse/HBASE-8319 to figure how to turn them off again so they are not inline w/ general unit test suite.
>
> > On the long term, the best option I see is to set another build, depending on trunk, to execute them. This would save some time on the main build, make the main build consistent with the precommit one, while still having the ITests running often
>
> Yes.
>
> St.Ack

When you do install, it implies verify.
And using 'mvn install' makes sense as it is more or less needed when using maven multimodules.
What we could do is something like "mvn clean install -DskipTests && mvn test"

On Wed, Apr 10, 2013 at 8:43 PM, Sergey Shelukhin <[EMAIL PROTECTED]> wrote:

> I think it.test is needed to package them properly.
> What causes them to be executed in the first place? Do we now have verify step? Wasn't the case before.
>
> On Wed, Apr 10, 2013 at 9:38 AM, Stack <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Apr 10, 2013 at 12:16 AM, Nicolas Liochon <[EMAIL PROTECTED]> wrote:
> >
> > > I don't know who added them actually. They're executed in 0.95 as well I think: they are executed with mvn install.
> > > They seem to need ~1 hour to run and failed 2 of the last 10 builds (so they fail 33% of the time when they are executed).
> > >
> > > For now, on trunk, I removed them by setting a dummy value to it.test.
> >
> > Thanks N. I did the same in 0.95. I filed https://issues.apache.org/jira/browse/HBASE-8319 to figure how to turn them off again so they are not inline w/ general unit test suite.
> >
> > > On the long term, the best option I see is to set another build, depending on trunk, to execute them. This would save some time on the main build, make the main build consistent with the precommit one, while still having the ITests running often
> >
> > Yes.
> >
> > St.Ack


> When you do install, it implies verify.
> And using 'mvn install' makes sense as it more or less needed when using maven multimodules.
> What we could do is something like "mvn clean install -DskipTests && mvn test"

There is also -DskipIntegrationTests. Let me try it and update doc accordingly (I did not realize that mvn verify was implied by mvn install -- ugh... that immutable, ineluctable lifecycle of mvns must have been made for mickey mouse projects....).

St.Ack


Yeah -- I don't have anything definitive yet but I believe I had a case locally where a MR2 MRAppMaster didn't shutdown; yarn children were left around and the input format scan tests and TestMulti* MR2 tests tended to fail.

My point is the test was passing before but now it is failing, which implies a commit changed something.

On Thu, May 9, 2013 at 5:18 PM, Ted Yu <[EMAIL PROTECTED]> wrote:

> It was one possible cause for the test failure.
>
> We are in search of the best solution for that problem.
>
> On May 9, 2013, at 2:10 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
>
> > Are you sure you have the correct JIRA? That has not been committed yet.
> >
> > On Thu, May 9, 2013 at 5:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> >> HBASE-8365
> >
> > --
> > Best regards,
> >
> > - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)

I've found that some tests hang and don't show up in the maven output at all in a single run, so I needed to do a compare with a good run instead. For the hadoop2 builds I've been sending emails about, I took a known good build (a hadoop1 version of the same build), did a grep/sort, and then compared it against the failed run's grep/sort to build a list of missing tests.
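The grep/sort/compare step could be automated along these lines. This is only a sketch under assumptions: MissingTests is a made-up class, the logs are passed in as lists of lines, and it assumes each test class shows up in the console output as a line ending in "Running <class name>" (which is how Surefire usually reports it, but check your own logs). The class names in main are placeholders.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for diffing the test lists of two build logs. */
final class MissingTests {
    // Assumption: a test class is announced by a line ending in "Running <name>".
    private static final Pattern RUNNING = Pattern.compile("Running (\\S+)\\s*$");

    private MissingTests() {}

    /** Collects the sorted set of test class names announced in a log (the "grep/sort"). */
    static SortedSet<String> testsIn(List<String> logLines) {
        SortedSet<String> names = new TreeSet<>();
        for (String line : logLines) {
            Matcher m = RUNNING.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    /** Returns the tests that appear in the known-good log but not in the suspect log. */
    static SortedSet<String> missing(List<String> goodLog, List<String> suspectLog) {
        SortedSet<String> result = testsIn(goodLog);
        result.removeAll(testsIn(suspectLog));
        return result;
    }

    public static void main(String[] args) {
        // Placeholder log lines; real input would be read from the two console logs.
        List<String> good = Arrays.asList(
            "Running example.TestFoo",
            "Tests run: 3, Failures: 0, Errors: 0, Skipped: 0",
            "Running example.TestBar");
        List<String> suspect = Arrays.asList(
            "Running example.TestFoo",
            "Tests run: 3, Failures: 0, Errors: 0, Skipped: 0");
        for (String name : missing(good, suspect)) {
            System.out.println("missing from suspect run: " + name);
        }
    }
}
```

A test that hangs before it is announced, or a module that never starts, shows up in the result, which is the case a plain pass/fail summary misses.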

I need to add this to apurtell's jenkins run to automatically report this.

We just ended a long string of successful builds. What in the recent commits do we think messed it up? Here is what was committed and the tests that failed. I'm going to see how next run goes and if it fails too, will revert changes starting w/ what looks the most suspect.

Yeah. I saw that. Does anybody have an idea why this is happening?
I looked through the pom and other changes, but did not see any change that could cause this. Since this happens only on the EC2 machines, maybe M2_REPO was changed or something.
I'll check...