I'm thinking the visibility looks to be the cause because there are no other reason to fail the test. (I couldn't reproduce the error locally, so I couldn't confirmed that the visibility is the real cause.)

The patch is safe to apply.

The test failures are unrelated because the patch changes TestGangliaMetrics only.

Akira Ajisaka
added a comment - 18/Dec/15 03:06 +1, thanks Masatake.
I'm thinking the visibility looks to be the cause because there are no other reason to fail the test. (I couldn't reproduce the error locally, so I couldn't confirmed that the visibility is the real cause.)
The patch is safe to apply.
The test failures are unrelated because the patch changes TestGangliaMetrics only.

Akira Ajisaka
added a comment - 18/Dec/15 04:19 I've committed this to trunk, branch-2, branch-2.8, and branch-2.7. Thanks Masatake Iwasaki for the contribution.
Please reopen this issue if the test is still failing.

java.lang.AssertionError: Missing metrics: test.s1rec.Xxx
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.checkMetrics(TestGangliaMetrics.java:161)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.testGangliaMetrics2(TestGangliaMetrics.java:139)

Masatake Iwasaki
added a comment - 12/Jan/16 16:23 I saw this test failure again today. The fix seemed to be not enough.
java.lang.AssertionError: Missing metrics: test.s1rec.Xxx
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.checkMetrics(TestGangliaMetrics.java:161)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.testGangliaMetrics2(TestGangliaMetrics.java:139)

The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

+1

mvninstall

8m 14s

trunk passed

+1

compile

8m 56s

trunk passed with JDK v1.8.0_66

+1

compile

9m 34s

trunk passed with JDK v1.7.0_91

+1

checkstyle

0m 17s

trunk passed

+1

mvnsite

1m 10s

trunk passed

+1

mvneclipse

0m 14s

trunk passed

+1

findbugs

1m 56s

trunk passed

+1

javadoc

0m 59s

trunk passed with JDK v1.8.0_66

+1

javadoc

1m 10s

trunk passed with JDK v1.7.0_91

+1

mvninstall

1m 39s

the patch passed

+1

compile

8m 55s

the patch passed with JDK v1.8.0_66

+1

javac

8m 55s

the patch passed

+1

compile

9m 28s

the patch passed with JDK v1.7.0_91

+1

javac

9m 28s

the patch passed

-1

checkstyle

0m 18s

Patch generated 1 new checkstyle issues in hadoop-common-project/hadoop-common (total was 23, now 23).

+1

mvnsite

1m 9s

the patch passed

+1

mvneclipse

0m 15s

the patch passed

+1

whitespace

0m 0s

Patch has no whitespace issues.

-1

findbugs

2m 11s

hadoop-common-project/hadoop-common introduced 1 new FindBugs issues.

+1

javadoc

0m 58s

the patch passed with JDK v1.8.0_66

+1

javadoc

1m 9s

the patch passed with JDK v1.7.0_91

-1

unit

8m 1s

hadoop-common in the patch failed with JDK v1.8.0_66.

+1

unit

8m 23s

hadoop-common in the patch passed with JDK v1.7.0_91.

+1

asflicense

0m 25s

Patch does not generate ASF License warnings.

76m 44s

Reason

Tests

FindBugs

module:hadoop-common-project/hadoop-common

Inconsistent synchronization of org.apache.hadoop.metrics2.sink.ganglia.AbstractGangliaSink.datagramSocket; locked 66% of time Unsynchronized access at AbstractGangliaSink.java:66% of time Unsynchronized access at AbstractGangliaSink.java:[line 279]

The first patch and addendum was based on wrong assumption.. Sorry about that.

We need to make sure that MockDatagramSocket#getCapturedSend in test thread happens after the last call of MockDatagramSocket#send in sink thread. Given the synchronization between MetricsSystemImpl#publishMetricsNow and MetricsSink#putMetrics, making the getCapturedSend and send should be sufficient.

Replacing ArrayList with CopyOnWriteArrayList is unnecessary and misleading. It should be reverted in the next patch.

Masatake Iwasaki
added a comment - 13/Jan/16 08:29 The first patch and addendum was based on wrong assumption.. Sorry about that.
We need to make sure that MockDatagramSocket#getCapturedSend in test thread happens after the last call of MockDatagramSocket#send in sink thread. Given the synchronization between MetricsSystemImpl#publishMetricsNow and MetricsSink#putMetrics , making the getCapturedSend and send should be sufficient.
Replacing ArrayList with CopyOnWriteArrayList is unnecessary and misleading. It should be reverted in the next patch.

Masatake Iwasaki
added a comment - 13/Jan/16 08:40 attaching addendum.02.
I need to run the TestGangliaMetrics repeatedly for more long time to veryfy the fix since the failure is rare on my environment.

java.lang.AssertionError: Mismatch in record count: expected:<6> but was:<27>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.checkMetrics(TestGangliaMetrics.java:164)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.testGangliaMetrics2(TestGangliaMetrics.java:138)

Akira Ajisaka
added a comment - 14/Jan/16 07:17 Hi Masatake Iwasaki , thank you for continuing the work.
I have one question. Is the test failure related to your addendum patch?
https://builds.apache.org/job/PreCommit-HADOOP-Build/8399/testReport/org.apache.hadoop.metrics2.impl/TestGangliaMetrics/testGangliaMetrics2/
java.lang.AssertionError: Mismatch in record count: expected:<6> but was:<27>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.checkMetrics(TestGangliaMetrics.java:164)
at org.apache.hadoop.metrics2.impl.TestGangliaMetrics.testGangliaMetrics2(TestGangliaMetrics.java:138)

Hmm... The submitted patch is addressing the issue that expected result is not found. Mismatch in record count: expected:<6> but was:<27> means it got results more than expected. I think it is another problem but maybe related. I'm digging and will address the issue too here.

Masatake Iwasaki
added a comment - 14/Jan/16 08:22 Hmm... The submitted patch is addressing the issue that expected result is not found. Mismatch in record count: expected:<6> but was:<27> means it got results more than expected. I think it is another problem but maybe related. I'm digging and will address the issue too here.

If metrics is published by the timer in MetricsSystemImpl after sinks are registered and before metrics system is stopped, we get metrics more than expected, though I could not reproduce this without adding artificial delay on my environment.

We can avoid the situation by setting long publishing interval. Since we manually publish metrics by calling publishMetricsNow, we don't need periodic publishing. The configuration to set the interval must be *.period rather than default.period.

Masatake Iwasaki
added a comment - 21/Jun/16 09:16 If metrics is published by the timer in MetricsSystemImpl after sinks are registered and before metrics system is stopped, we get metrics more than expected, though I could not reproduce this without adding artificial delay on my environment.
// register the sinks
ms.register( "gsink30" , "gsink30 desc" , gsink30);
ms.register( "gsink31" , "gsink31 desc" , gsink31);
ms.publishMetricsNow(); // publish the metrics
ms.stop();
We can avoid the situation by setting long publishing interval. Since we manually publish metrics by calling publishMetricsNow , we don't need periodic publishing. The configuration to set the interval must be *.period rather than default.period .
I will upload a patch addressing this.

Steve Loughran
added a comment - 28/Jun/16 11:53 I'm still seeing failures here, such as in HADOOP-13323
If we can't stabilise this test, I think we should just cull it. It fails so often it is ignored, so if a regression did show up, we wouldn't catch it

Thanks for reporting this, Steve Loughran. The test log of HADOOP-13323 indicates there is a race with TestMetricsSystemImpl. The TestGangliaMetrics#testGangliaMetrics2 sets *.period to 120 but 8 was used.

Masatake Iwasaki
added a comment - 28/Jun/16 13:50 Thanks for reporting this, Steve Loughran . The test log of HADOOP-13323 indicates there is a race with TestMetricsSystemImpl . The TestGangliaMetrics#testGangliaMetrics2 sets *.period to 120 but 8 was used.
2016-06-27 15:21:31,480 INFO impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(375)) - Scheduled snapshot period at 8 second(s).
I will upload additional patch or open another issue if I need major refactoring of the test.

Vinod Kumar Vavilapalli
added a comment - 20/Jul/16 17:27 I'm about to release 2.7.3, so this can't be in open state unless the original patch gets reverted. Instead, let's open a new JIRA for more fixes.

Mingliang Liu
added a comment - 28/Jul/16 21:13 Was this resolved? Seeing this in recent run https://builds.apache.org/job/PreCommit-HADOOP-Build/10106/testReport/org.apache.hadoop.metrics2.impl/TestGangliaMetrics/testGangliaMetrics2/ Thanks.