Description

The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.

Hadoop QA
added a comment - 27/Mar/12 05:56 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12520062/hadoop-8220.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.ha.TestFailoverController
org.apache.hadoop.ha.TestZKFailoverController
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/789//console
This message is automatically generated.

The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.

How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?
What is the purpose of adding the sleep? Could you please elaborate?

Bikas Saha
added a comment - 27/Mar/12 09:17 Test missing for changes in ActiveStandbyElector.
Do you really want to commit the logs added to ActiveStandbyTestUtil?
The ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes a NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.
How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?
What is the purpose of adding the sleep? Could you please elaborate?

I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the "integration test", but you're right that adding to the unit tests makes sense too.

How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?

The NPE occurred in the previous code because we had the following sequence:

createNode succeeded

called ZKFC becomeActive() callback

becomeActive() throws exception

ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior)

quitElection() nulled out zkClient

ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out.

The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.

What is the purpose of adding the sleep? Could you please elaborate?

Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.

Todd Lipcon
added a comment - 27/Mar/12 18:03 I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via the "integration test", but you're right that adding to the unit tests makes sense too.
How does NPE occur when the elector makes sure the client is recreated upon rejoining the election? Which zkClient are you talking about?
The NPE occurred in the previous code because we had the following sequence:
createNode succeeded
called ZKFC becomeActive() callback
becomeActive() throws exception
ZKFC had a catch() clause which called quitElection () (it turned out this wasn't the right behavior)
quitElection() nulled out zkClient
ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been nulled out.
The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.
What is the purpose of adding the sleep? Could you please elaborate?
Without the sleep, it will do a tight loop retrying to become active. This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.
I will add a comment to this effect in the code.

1. becomeActive() could have two return values at the appClient (true or false). For example, NN might decide that it cannot become active since it does not have access to the resources to become active. We would need to handle these two return values. If the return value is false, FC will give up the ephemeral znode for one "timeout" iteration to allow other NN's FC to take over the znode.

2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

As you commented on the other jira, it might be useful to create a seperate branch for automatic failover. There will be lots of corner cases to deal with.

Hari Mankude
added a comment - 27/Mar/12 18:15 Todd,
Couple of comments on the design rather than the code changes
1. becomeActive() could have two return values at the appClient (true or false). For example, NN might decide that it cannot become active since it does not have access to the resources to become active. We would need to handle these two return values. If the return value is false, FC will give up the ephemeral znode for one "timeout" iteration to allow other NN's FC to take over the znode.
2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.
As you commented on the other jira, it might be useful to create a seperate branch for automatic failover. There will be lots of corner cases to deal with.

The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.

Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?

This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.

I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?
I do not have any data on the trade off between 1) letting the last active become active again with log spew 2) letting another standby become active by making the last active sleep. But for arguments sake I would prefer 1). IMO having continuity in the active node would reduce the overheads of client/datanode failover etc.

becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

Hari, this seems similar to the alternative proposed in HADOOP-8205 about trying to make sure that the transition to active is short.

Bikas Saha
added a comment - 27/Mar/12 18:53 The new behavior avoids this, since the error handling patch is in ActiveStandbyElector itself. This makes it easier to get the right semantics.
Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?
This generates a lot of log spew and has little actual benefit. If instead we retry only once a second, then (a) the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will get a chance to try to become active.
I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?
I do not have any data on the trade off between 1) letting the last active become active again with log spew 2) letting another standby become active by making the last active sleep. But for arguments sake I would prefer 1). IMO having continuity in the active node would reduce the overheads of client/datanode failover etc.
becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.
Hari, this seems similar to the alternative proposed in HADOOP-8205 about trying to make sure that the transition to active is short.

Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?

You got it. Exceptions are generally preferred for cases like this – since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception)

I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?

Yep. In my test I saw that the standby would retry in a tight loop like that:

Succeed in getting lock

Call becomeActive()

drop ZK session (lock disappears)

reconnect to ZK

Goto 1

I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple.

Todd Lipcon
added a comment - 27/Mar/12 22:11 Ah. Now I get it. The elector should be robust against client code (ZKFC in this case). I like Hari's proposal of using a return value to inform about fail/success of becoming active. I am not that familiar with standard practices in Java - are return values preferred or exceptions?
You got it. Exceptions are generally preferred for cases like this – since we have to handle the error condition regardless of whether it's a usual error or whether it was something like a NPE or other truly exceptional condition. So even with a boolean return type, we'd need a try/catch clause. Does that make sense? (I also had originally made it return boolean but then changed it to an exception)
I did not understand where the tight loop is? Do you mean (Elector gets lock<->ZKFC fails to becomes active)?
Yep. In my test I saw that the standby would retry in a tight loop like that:
Succeed in getting lock
Call becomeActive()
drop ZK session (lock disappears)
reconnect to ZK
Goto 1
I simply inserted a sleep between dropping the connection and reconnecting. This gives the old active a better chance to become active again (or if there is a third node in the future, it would have a chance to take the lock). In the future we may want to add some random jitter and exponential backoff, but at this point let's keep it simple.

Otherwise I dont see why lock disappears.
If yes, then this might be ok, since by design we are deciding to sleep and let someone else take a shot at becoming active because we are having trouble doing so. Could you please add this in comments so that the sleeping is explained.

Now that I am looking at the patch in a less sleep state I think the following might be a better flow because the actions on failure and success are in one place

if (becomeActive())

{
monitorActiveStatus();
}

else

{
LOG.warn("Failed to become active. Rejoin election to try again but sleep before that to let someone else try.");
reJoinElection(SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE);
}

This puts the fail/success handling of becomeActive() in 1 place and avoids the calling of becomeActive() have a side-effect of also calling rejoinElection() on failure.
What do you think?

Bikas Saha
added a comment - 27/Mar/12 22:41 Do you mean?
1. Succeed in getting lock
2. Call becomeActive()
3. ZKFC fails to become active. Call quitElection.
4. drop ZK session (lock disappears)
5. reconnect to ZK
6. Goto 1
Otherwise I dont see why lock disappears.
If yes, then this might be ok, since by design we are deciding to sleep and let someone else take a shot at becoming active because we are having trouble doing so. Could you please add this in comments so that the sleeping is explained.
Now that I am looking at the patch in a less sleep state I think the following might be a better flow because the actions on failure and success are in one place
if (becomeActive())
{
monitorActiveStatus();
}
else
{
LOG.warn("Failed to become active. Rejoin election to try again but sleep before that to let someone else try.");
reJoinElection(SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE);
}
This puts the fail/success handling of becomeActive() in 1 place and avoids the calling of becomeActive() have a side-effect of also calling rejoinElection() on failure.
What do you think?

Yep, your updated description of the "tight loop" is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario.

New draft of the patch attached.

Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive)

Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above)

Changed the sleep calls to be wrapped in a sleepFor(ms) function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason.

In response to a couple comments above that got lost in the discussion:

2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.

This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The transitionToActive RPC itself should have a timeout attached – it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this.

Do you really want to commit the logs added to ActiveStandbyTestUtil?

Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout.

Todd Lipcon
added a comment - 28/Mar/12 23:06 Yep, your updated description of the "tight loop" is exactly right. Sorry, I didn't note the fact that becomeActive() throws an exception in this scenario.
New draft of the patch attached.
Added a true unit test for the new changes, in addition to the functional test from the prior revision (TestActiveStandbyElector#testFailToBecomeActive)
Change the control flow so that the success and error cases are kept near each other (suggested by Bikas above)
Changed the sleep calls to be wrapped in a sleepFor(ms) function, so it's easy to disable the sleeping behavior in the unit tests. Otherwise the tests ran longer for no good reason.
In response to a couple comments above that got lost in the discussion:
2. becomeActive() should be protected by a timeout also. If NN is taking far too long to return, FC should declare failure and give up the lock. Otherwise, it is a deadlock.
This is really difficult to do reliably, since there's no good way to 'cancel' the callback. The transitionToActive RPC itself should have a timeout attached – it's much more straightforward to do that than to try to make ActiveStandbyElector guard against arbitrary code running too long in the callback. I added a note to the javadoc indicating this.
Do you really want to commit the logs added to ActiveStandbyTestUtil?
Yes, I found that when I had a test failure due to timeout, it was difficult to debug, since I couldn't easily tell which node had the lock at the time the test timed out. I rate-limited the logging to only two per second, so it shouldn't make the logs too noisy, while retaining the advantage of seeing what's going on better if there is a timeout.

Hadoop QA
added a comment - 28/Mar/12 23:27 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12520334/hadoop-8220.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.fs.viewfs.TestViewFsTrash
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/793//console
This message is automatically generated.

Bikas Saha
added a comment - 29/Mar/12 07:19 I dont think this comment is correct
// The failure to become active is already logged above.
There is no log above. Actually you might just want to move the log above and remove the comment altogether.
+1

Hadoop QA
added a comment - 29/Mar/12 23:20 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12520520/hadoop-8220.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.fs.viewfs.TestViewFsTrash
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//testReport/
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/798//console
This message is automatically generated.

Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive()

Todd Lipcon
added a comment - 30/Mar/12 00:49 Actually moving the error handling code to the call site (instead of inside becomeActive()) introduced a bug, since we call becomeActive() from another spot as well, in the StatCallback. So we need to have similar code there, or move the error handling back up into becomeActive()

Todd Lipcon
added a comment - 30/Mar/12 05:51 New version addresses the above issue, handling failure to become active in the StatCallback as well.
I also added a new unit test for this case.
As part of the stress tests in HADOOP-8228 , I also have added a test case which covers this codepath with a real ZK server (verified with Clover)

Hadoop QA
added a comment - 30/Mar/12 06:05 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12520563/hadoop-8220.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/801//console
This message is automatically generated.

Patch looks pretty good to me, Todd. A few little comments. +1 once these are addressed:

Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?

There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"

Might want to do a s/System.currentTimeMillis/Util.now/g

Any reason we shouldn't make LOG_INTERVAL_MS configurable?

Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.

Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.

Aaron T. Myers
added a comment - 30/Mar/12 19:04 Patch looks pretty good to me, Todd. A few little comments. +1 once these are addressed:
Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?
There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"
Might want to do a s/System.currentTimeMillis/Util.now/g
Any reason we shouldn't make LOG_INTERVAL_MS configurable?
Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.
Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.

Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?

Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change.

There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"

Changed to consistently use "reJoin" to match the previously existing code.

Might want to do a s/System.currentTimeMillis/Util.now/g

The Util class is in HDFS, but this code is in common. We don't seem to have an equivalent in common.

Any reason we shouldn't make LOG_INTERVAL_MS configurable?

It's just test code, so seemed unnecessary.

Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.

Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.

Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining.

Todd Lipcon
added a comment - 30/Mar/12 20:27 Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?
Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of the parameters should be changed to be configured via Configuration, but I didn't want to make this into a bigger scoped change.
There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"
Changed to consistently use "reJoin" to match the previously existing code.
Might want to do a s/System.currentTimeMillis/Util.now/g
The Util class is in HDFS, but this code is in common. We don't seem to have an equivalent in common.
Any reason we shouldn't make LOG_INTERVAL_MS configurable?
It's just test code, so seemed unnecessary.
Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise. Maybe even add a comment saying why it's not static.
Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive, maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.
Done. I made the overridden method keep a tally of number of slept millis, and added asserts to the tests to make sure it slept for some time when rejoining.

Hadoop QA
added a comment - 30/Mar/12 20:35 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12520652/hadoop-8220.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HADOOP-Build/805//console
This message is automatically generated.