Description

When a client connection is closed due to a socket error, rather than by the client calling close explicitly, it would be nice to let the session associated with that client time out faster than the negotiated session timeout. This would enable a ZooKeeper ensemble that is acting as a dynamic discovery provider to remove ephemeral nodes for crashed clients quickly, while still allowing a longer heartbeat-based timeout for Java clients that need to do long stop-the-world GC.

I propose doing this by setting the timeout associated with the crashed session to "minSessionTimeout".
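As a rough sketch of the idea (a hypothetical, simplified tracker — not ZooKeeper's actual SessionTracker API; class and method names here are invented for illustration): a session normally carries its negotiated timeout, and a socket-error close re-touches it with minSessionTimeout so the leader expires it sooner.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: each session maps to an expiry deadline; a
// socket-error close re-touches the session with minSessionTimeout.
public class FastExpireSketch {
    static final int NEGOTIATED_TIMEOUT_MS = 30_000;
    static final int MIN_SESSION_TIMEOUT_MS = 4_000;

    final Map<Long, Long> expiryBySession = new HashMap<>();

    // Normal heartbeat path: push the deadline out by the given timeout.
    void touch(long sessionId, long nowMs, int timeoutMs) {
        expiryBySession.put(sessionId, nowMs + timeoutMs);
    }

    // Called when the server sees a socket error (not an explicit close):
    // shrink the session's remaining lifetime to minSessionTimeout.
    void touchAndClose(long sessionId, long nowMs) {
        touch(sessionId, nowMs, MIN_SESSION_TIMEOUT_MS);
    }

    boolean isExpired(long sessionId, long nowMs) {
        Long deadline = expiryBySession.get(sessionId);
        return deadline != null && nowMs > deadline;
    }
}
```

A crashed client's session then expires after minSessionTimeout instead of the full negotiated timeout, while connected clients keep heartbeating and stay alive.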

Benjamin Reed
added a comment - 08/Nov/10 19:23 how do you deal with the following race condition:
1) the client is connected to follower1
2) the client has problems talking to follower1, so it closes the connection
3) the client connects to follower2
4) follower1 detects the closed connection and sets the connection timeout to min
5) the client is idle for min timeout and the leader expires the connection
the race condition is steps 3) and 4). if follower1 doesn't detect the dead connection fast enough, it can improperly set the timeout.

Camille Fournier
added a comment - 08/Nov/10 19:37 That is a valid concern. For the system I am implementing, I would rather aggressively time out connections I believe to be closed with the risk of occasionally hitting this particular edge case (my client can automatically re-connect and re-establish its ephemeral data if necessary), but it's worth thinking about whether it is possible to avoid.
I realize the extreme end of this argument is just to set the session timeout lower and let the gc-ing clients re-establish their state but I want the best of both worlds.

Camille Fournier
added a comment - 08/Nov/10 22:49 After some thought one approach to fix this might be to have the leader send the cnxn info through the session touch call in the case of PING, and only allow the timeout for a session to be lowered if the requester is the current owner of that session. It feels like a hack (you probably wouldn't want to force a valid "owner" to be checked for each touch) but I think it would solve that particular race condition.
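A minimal sketch of that owner check (class and method names are illustrative, not ZooKeeper's actual code): a touch may only lower a session's timeout when the requester is the session's current owner, so a stale server cannot fast-expire a session that has already migrated.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical owner-gated tracker: lowering a timeout requires being the
// session's current owner; raising it (a normal heartbeat) is always allowed.
public class OwnerCheckedTracker {
    static class Session { String owner; int timeoutMs; }

    final Map<Long, Session> sessions = new HashMap<>();

    void create(long id, String owner, int timeoutMs) {
        Session s = new Session();
        s.owner = owner;
        s.timeoutMs = timeoutMs;
        sessions.put(id, s);
    }

    // Returns true if the touch was applied.
    boolean touch(long id, String requester, int timeoutMs) {
        Session s = sessions.get(id);
        if (s == null) return false;
        // Only consult the owner when the touch tries to shorten the timeout,
        // so normal heartbeats stay cheap.
        if (timeoutMs < s.timeoutMs && !requester.equals(s.owner)) {
            return false; // stale server trying to fast-expire a migrated session
        }
        s.timeoutMs = timeoutMs;
        return true;
    }
}
```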

Hadoop QA
added a comment - 09/Nov/10 08:10 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12459059/ZOOKEEPER-922.patch
against trunk revision 1032882.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/2//console
This message is automatically generated.

Patrick Hunt
added a comment - 09/Nov/10 17:17 Hi Camille, the patch has to be created from the topmost directory ("trunk") for Hudson to apply it correctly; please see:
http://wiki.apache.org/hadoop/ZooKeeper/HowToContribute
(basically checkout trunk, make changes, do "svn diff" at the toplevel)
Thanks!

Camille Fournier
added a comment - 09/Nov/10 19:35 I wasn't really expecting this patch to be applied by the build, since it is just an illustration of a possible solution for the problem (and has no unit tests or anything). Do you still want to run it through the build given that?

Hadoop QA
added a comment - 09/Nov/10 19:36 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12459059/ZOOKEEPER-922.patch
against trunk revision 1033155.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: https://hudson.apache.org/hudson/job/PreCommit-ZOOKEEPER-Build/7//console
This message is automatically generated.

Patrick Hunt
added a comment - 10/Nov/10 01:32 @camille NP, although it makes it easier for us (reviewers) if all the patches are consistent. For future reference then. Thanks.
ps. you might get more insightful review if you post to apache's new reviewboard server: https://reviews.apache.org
Regards.

Benjamin Reed
added a comment - 10/Nov/10 23:21 the problem with your corner case is that you can end up with a leader who thinks it is still the leader, but zookeeper thinks the leader is dead and allows another leader to take over.
there may be a way to do this reliably, but we need to vet the design first.

Flavio Junqueira
added a comment - 16/Nov/10 15:27 Hi! I'm confused by this proposal. What happens if the client disconnects from one server and moves to another? Or do you want to be able to disable that feature as well?

Camille Fournier
added a comment - 16/Nov/10 15:36 If the client connects to another server within whatever window minSessionTimeout is set to, it should heartbeat the session and the session shouldn't get timed out. Otherwise, the session will be expired and the client will get the standard expired-session scenario.
If you are working in a setup where you think unexpected disconnects will largely be due to clients crashing and you want ephemeral data aggressively removed in that scenario, then with this design you set minSessionTimeout to a low value and allow ZK to quickly time out those sessions. If you are working in a setup where unexpected disconnects are more likely to be due to network problems, or you want to give data a longer time to survive, you have the option of setting the timeout to a higher value (ideally the same as the negotiated session timeout, though that might require a code change to match negotiation), which should give the same behavior as now.
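For illustration, that tradeoff might look like this in zoo.cfg (values are made up; minSessionTimeout and maxSessionTimeout are existing server options, while using minSessionTimeout for fast expiry of error-closed connections is what this issue proposes):

```
# Illustrative zoo.cfg fragment, not a recommended production config.
tickTime=2000
# Crash-heavy environment: error-closed sessions would expire in ~4s
# under the proposal, while connected clients can still negotiate up
# to 40s to ride out long GC pauses.
minSessionTimeout=4000
maxSessionTimeout=40000
```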

Flavio Junqueira
added a comment - 16/Nov/10 15:57 I think I understand your motivation, but I'm not sure it will work the way you expect it to. I'm afraid that you might end up getting lots of false positives due to delays introduced by the environment (e.g., jvm gc). Let me clarify one thing first: when you refer to clients crashing, are you thinking about the jvm crashing or the whole machine becoming unavailable? Basically my question is whether you really expect connections to be cleanly closed or not.

Camille Fournier
added a comment - 16/Nov/10 16:47 This only changes the timeout in the case of a detected exception that closes the socket (see the patch). The purpose, in fact, is to enable environments with machines that may have long GC pause times to have long max session timeouts, while still clearing the ephemeral nodes of crashed clients more quickly.
The only crash I am intending to deal with here is the crash that causes an exception closing the socket on the server side. This can also be caused by a switch failure, but in my environment it is much much much much more likely to be caused by the client process crashing. I don't expect to be able to perfectly deal with all cases of client crash, because there are some that don't cause the socket to close and that can't be differentiated from a client doing a long full GC.
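To make the trigger concrete, here is a hedged sketch of that server-side distinction (the exception type and handler names are invented for illustration; the real logic lives in NIOServerCnxn:doIO): only an unexpected socket error takes the fast-expire path, while an explicit client close does not.

```java
import java.io.IOException;

// Illustrative only: classify a disconnect as a deliberate close versus a
// socket error that should trigger the fast-expire touch.
public class DisconnectHandler {
    // Hypothetical marker for the client having sent an explicit close request.
    static class ClientCloseException extends IOException {}

    static String onDisconnect(IOException cause) {
        if (cause instanceof ClientCloseException) {
            // Clean close: the session was ended deliberately, so there is
            // nothing to fast-expire.
            return "close-session";
        }
        // Socket error (likely client crash): lower the session timeout to
        // minSessionTimeout before closing the connection.
        return "touch-and-close";
    }
}
```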

Benjamin Reed
added a comment - 16/Nov/10 20:20 if we had a foolproof way to tell that a client is down, we could do this fast expire. the methods you are proposing are not foolproof and will lead to problems exactly when you most want them not to.
the timeout interactions you are talking about are problematic. it's really hard to get them right.
one way that i can see this working is to not allow clients to reconnect to other servers. in that case a socket reset would indicate an expired session. is this acceptable to you?

Camille Fournier
added a comment - 16/Nov/10 20:45 I'm interested in hearing the problems that you believe it would lead to in more detail. To me, this feels like a reasonable compromise solution to a tough problem. If the problem you foresee is a client and server getting disconnected from each other but both staying alive, and this causing weirdness leading to a session expiration for the client on reconnecting to another server, for my particular scenario that is fine. I have a wrapped ZK client that is highly tolerant to all sorts of failures and has no problem resetting its state. I realize that may not be acceptable for other users, and I would not propose this solution without either community agreement that this risk, if well-documented, is ok, or a fix for that problem. But I don't know what other problems you are seeing and while I might be able to solve them if you help me see what they are, I can't do anything on vague suppositions of problematic circumstances. Don't get me wrong, I'm not married to this solution, but I am interested in some solution if possible.
It seems to me that not allowing clients to reconnect to other servers causes a host of other problems and is a worse solution for people that would not want this fast expiration forced on them. In what scenarios can a client not reconnect to another server? All? Obviously that won't fly because even I would not want to have all of my sessions expire in the case of an ensemble member dying and clients failing over. If we only want to do this where my code is doing the "touchAndClose" (ie, when the server the client was connected to sees a failure-based disconnect), then we see exactly the same potential problem outlined above where the client could still be alive but have a switch go down and disconnect it from the server. Now it tries to fail over and its session is always dead. I'm not convinced off the bat that that is any better than letting it try to fail over and risking a potential session timeout race, which I think could possibly be fixed by associating the client session with the server currently maintaining it (already done but not passed through on ticks).
What did you mean in the earlier comment about this causing leadership election issues? Does this actually interact with that at all? This is the kind of thing I could use guidance on. Or we can let this whole idea drop, but it does seem that more people than me are interested so might be worth hashing it out.

Flavio Junqueira
added a comment - 17/Nov/10 11:23 Hi Camille, say a client disconnects from server A and reconnects to server B, same session. Server A believes the session should be expired because it received an exception. Server B believes the session should stay alive, since the client just reconnected. What should we do in this case? Kill the session or not?
Our suggestion is to have an option that enables fast expiration and disables clients moving sessions to other servers. We are certainly not proposing to remove the second functionality from ZooKeeper altogether.

Camille Fournier
added a comment - 17/Nov/10 15:01 From my point of view, a solution that enables faster expiration but disables clients moving sessions to other servers is not one I would use. I am not willing to take the massive hit of restarting a possibly huge number of sessions in the case of a single node failure. I expect the case where a disconnect happens and the client is actually still alive to be vanishingly rare. My clients will die all the time; my ensemble members might die occasionally; and if a switch dies, there are much bigger problems than some overaggressive session expiration.
So, Server A has a connection with the client. The switch between client and A dies and both see an error disconnect. Possible operations (in some order) after this point:
A sends a ping on that session with a lower session timeout
Client connects to B, which will touch the session table with the negotiated session timeout
Client starts heartbeating
Scenarios:
1) If A sends the ping with the lower session timeout, and the client cannot connect to B before the session expires, the session is expired and no harm no foul in my opinion. Sessions expiring due to lag on failover are a possibility that anyone using ZK should be defensively programming against.
2) Due to a lag on A's part, it did not send the timeout-lowering ping until after the client had connected to B. Client's session timeout is set lower until it heartbeats to B and B pings the leader.
Or, the client might not respond to the heartbeat in this sensitive interval, causing its session to be expired. This could quite possibly be solved by actually checking that a ping is coming from the current owner of a session when it is trying to set the timeout lower than the current timeout. The session tracker has the current owner stored. I wouldn't want to have to check this on every ping, but it's quite easy to add back the logic that checks whether the new timeout is lower than the existing timeout, and then checks whether the pinger is the current owner. That might require code changes we don't want to do, but it seems possible. Alternatively, the session just unexpectedly times out. I'm writing defensive code against all possible failures of the ZK, so a session timeout is not a huge deal to me.
3) A pings the leader during the client connection negotiation with B. I suspect there are several possible interleavings here. I would also expect that again the worst case should be that the client sees a session expired error. This is the area to dig into more carefully. If there is an interleaving that could leave the session open forever, or cause ensemble instability, that would probably be a deal-breaker.

Benjamin Reed
added a comment - 18/Nov/10 20:49 camille, i also think disabling moving sessions is not a good idea or very useful, but it seems to be the only way to have sensible semantics.
may i suggest that we take this discussion a bit higher? i think there are fundamental assumptions that you are making that i'm questioning. can you write up a high-level design and state your assumptions? i can't quite see how the math works out between the client-server timeouts, connect timeouts, and lower session timeout. i'm also not clear on how much you are relying on a connection reset for the failure detection.

Ok, here's a recap of what the problem is, what the boundaries of the problem are from my point of view, and what the current solution proposed above lacks. If the boundaries of the problem and solution are unacceptable from the POV of the rest of the community, then I guess we're at an impasse. So please take some time to read:

Problem description: When a client crashes, its ephemeral nodes need to wait until the negotiated session timeout has passed before they will be removed by the leader.

We would like for clients that have crashed to have their ephemeral nodes removed quickly. The duration of visibility of "stale" ephemeral nodes (that is, those that were created by now-dead clients) is directly correlated to the window of time in which the system is in an incorrect state, for the purposes of our use case (dynamic discovery).

Without changing the code at all, we could simulate this by lowering the session timeout for all nodes. However, that would cause a different kind of possible inconsistent system state. For one, if there are clients that are doing long full-pause Garbage Collection, their sessions will time out despite the fact that they are actually still alive (a very real likelihood in our working environment). In another case, if one of the ensemble members dies and clients have to fail over, a very short session timeout could also result in prematurely-killed sessions for otherwise live clients. We would like to be able to detect likely cases of client crash and clean up their sessions quickly, while having a longer session timeout for clients we believe to be connected.

We are willing to tolerate both a small number of false positives (believing clients crashed when they are alive) as well as a small number of false negatives (believing clients alive, and waiting for the full session timeout before removing them, when they have crashed). Given the nature of systems and networks, it is impossible to tell 100% of the time whether a client is truly alive or dead (a switch could crash, the client could GC, etc), and the occasional missed guess is acceptable so long as the system otherwise retains the general coherence and correctness guarantees.

Any solution to this problem must retain the ability for client sessions to migrate between ensemble members in the case where the client sees a disconnection from the ZK cluster due to the ensemble member crashing .

Current system fundamentals:

The only way that a server can "see" a client crash is through an error that causes the socket to close and throw an exception (NIOServerCnxn:doIO). If a client crashes without this socket closing (say, by having the network cable to that server pulled), the server will not see a socket close and will have to time out normally. This is an acceptable edge condition from our point of view.

Additionally, it is possible that a server will "see" a client crash when in fact the socket was closed unexpectedly on both ends, due to a scenario like a network switch failure. This would result in a false positive crash detection by the zk server, and possibly result in the client's session being timed out before the client has a chance to fail over to a different server. This is also an acceptable edge condition from our point of view.

The session timeouts are controlled by the SessionTracker, which is maintained by the current leader. That tracker table is updated every time the leader receives a record of pings from its followers. Sessions are associated with an "owner", which is the current ensemble member thought to be maintaining the session, however, the "owner" is not checked in the case of a ping.

Proposal:

When we see an exception on the socket resulting in a socket close, we lower the timeout for the session associated with that connection. If the client does not reconnect within this shortened window, the session is timed out and ephemeral state is removed.

The simplest version of the change can be seen in the first patch submitted to ZOOKEEPER-922. This change does the following:

In NIOServerCnxn:doIO, when an exception is caught that is not from the client explicitly calling close, instead of just closing the connection, we "touch" the SessionTracker with a timeout set by the user (minSessionTimeout), then close the connection.

This results in one of two workflows. Followers will insert the sessionId and sessionTimeout in their touchTable, which will be sent to the Leader on the next ping. The Leader will then call SessionTrackerImpl.touchSession. In the case of the leader being the one to touchAndClose, it will directly call SessionTrackerImpl.touchSession, which has been modified to allow the session to have its expiration time set lower as well as higher.

These changes have been verified to produce a functional (not necessarily bug-free) implementation of the desired spec.

Possible issues:

1. If a client and a server each see a socket disconnection due to a network switch failure, the client will have a shorter time window in which to fail over to a different server before its session is timed out. This is fine with me, but since the shorter timeout is configurable, for those users for whom this risk is not worth the benefit, setting the minSessionTimeout to be the same as the negotiated session timeout will mitigate this problem. Therefore I'm not going to attempt to fix this.

2. If a client and a server both see disconnections but the client manages to fail over and migrate its session before the original server sends its session tracker update with the reduced session timeout, the client could potentially have its session timed out if it does not heartbeat in the reduced timeout window, despite having failed over. This reduced timeout window would only last until the new zk ensemble member re-pinged for that client, but there is a window of vulnerability. This could be fixed before making this change.

Are we all on the same page so far with this? All I want to do is enable fast failing for those who want it, if they are willing to accept the possibility that certain network failures could cause over-aggressive session timeout for clients that are not actually dead.

Camille Fournier
added a comment - 10/Jan/11 22:31 Ok, here's a recap of what the problem is, what the boundaries of the problem are from my point of view, and what the current solution proposed above lacks. If the boundaries of the problem and solution are unacceptable from the POV of the rest of the community, then I guess we're at an impasse. So please take some time to read:
Problem description: When a client crashes, its ephemeral nodes need to wait until the negotiated session timeout has passed before they will be removed by the leader.
We would like for clients that have crashed to have their ephemeral nodes removed quickly. The duration of visibility of "stale" ephemeral nodes (that is, those that were created by now-dead clients) is directly correlated to the window of time in which the system is in an incorrect state, for the purposes of our use case (dynamic discovery).
Without changing the code at all, we could simulate this by lowering the session timeout for all nodes. However, that would cause a different kind of possible inconsistent system state. For one, if there are clients that are doing long full-pause Garbage Collection, their sessions will time out despite the fact that they are actually still alive (a very real likelihood in our working environment). In another case, if one of the ensemble members dies and clients have to fail over, a very short session timeout could also result in prematurely-killed sessions for otherwise live clients. We would like to be able to detect likely cases of client crash and clean up their sessions quickly, while having a longer session timeout for clients we believe to be connected.
We are willing to tolerate both a small number of false positives (believing clients crashed when they are alive) and a small number of false negatives (believing clients alive, and waiting for the full session timeout before removing them, when they have crashed). Given the nature of systems and networks, it is impossible to tell 100% of the time whether a client is truly alive or dead (a switch could crash, the client could GC, etc), and the occasional missed guess is acceptable so long as the system otherwise retains its general coherence and correctness guarantees.
Any solution to this problem must retain the ability for client sessions to migrate between ensemble members in the case where the client sees a disconnection from the ZK cluster due to the ensemble member crashing.
Current system fundamentals:
The only way that a server can "see" a client crash is through an error that causes the socket to close and throw an exception (NIOServerCnxn:doIO). If a client crashes without this socket closing (say, by having the network cable to that server pulled), the server will not see a socket close and will have to time out normally. This is an acceptable edge condition from our point of view.
Additionally, it is possible that a server will "see" a client crash when in fact the socket was closed unexpectedly on both ends, due to a scenario like a network switch failure. This would result in a false positive crash detection by the zk server, and possibly result in the client's session being timed out before the client has a chance to fail over to a different server. This is also an acceptable edge condition from our point of view.
The session timeouts are controlled by the SessionTracker, which is maintained by the current leader. That tracker table is updated every time the leader receives a record of pings from its followers. Sessions are associated with an "owner", the current ensemble member thought to be maintaining the session; however, the "owner" is not checked in the case of a ping.
Proposal:
When we see an exception on the socket resulting in a socket close, we lower the timeout for the session associated with that connection. If the client does not reconnect within this shortened window, the session is timed out and ephemeral state is removed.
The simplest version of the change can be seen in the first patch submitted to ZOOKEEPER-922. This change does the following:
In NIOServerCnxn:doIO, when an exception is caught that is not from the client explicitly calling close, instead of just closing the connection, we "touch" the SessionTracker with a timeout set by the user (minSessionTimeout), then close the connection.
This results in one of two workflows. Followers will insert the sessionId and sessionTimeout in their touchTable, which will be sent to the Leader on the next ping. The Leader will then call SessionTrackerImpl.touchSession. In the case of the leader being the one to touchAndClose, it will directly call SessionTrackerImpl.touchSession, which has been modified to allow the session to have its expiration time set lower as well as higher.
These changes have been verified to produce a functional (not necessarily bug-free) implementation of the desired spec.
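The touch-on-error flow described above can be illustrated with a toy session tracker. This is a hedged sketch, not the actual ZOOKEEPER-922 patch: the class and method names here (`SessionTracker`, `touch`, `isExpired`) are simplified stand-ins for the real NIOServerCnxn/SessionTrackerImpl code paths, and the key behavioral change is that a touch is allowed to move the expiry window down as well as up.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposal: on a socket error the server re-touches the
// session with minSessionTimeout, shrinking its expiry window; a normal
// client ping touches it with the full negotiated timeout again.
public class TouchOnErrorSketch {
    static final int MIN_SESSION_TIMEOUT = 2_000;      // ms, illustrative value

    static class SessionTracker {
        final Map<Long, Long> expiry = new HashMap<>(); // sessionId -> expire time

        // Unlike the stock tracker, this allows the expiry time to move
        // DOWN as well as up, which is the core of the proposed change.
        void touch(long sessionId, int timeout, long now) {
            expiry.put(sessionId, now + timeout);
        }

        boolean isExpired(long sessionId, long now) {
            Long e = expiry.get(sessionId);
            return e != null && now >= e;
        }
    }

    public static void main(String[] args) {
        SessionTracker tracker = new SessionTracker();
        long sessionId = 0x1L;
        long now = 0;
        int negotiated = 30_000;                        // 30s negotiated timeout

        tracker.touch(sessionId, negotiated, now);      // normal ping

        // Socket error observed for this connection: shrink the window.
        tracker.touch(sessionId, MIN_SESSION_TIMEOUT, now);

        // 5s later with no reconnect: session is expired early...
        System.out.println(tracker.isExpired(sessionId, now + 5_000)); // true

        // ...but a reconnect (ping) within the window restores the full timeout.
        tracker.touch(sessionId, negotiated, now + 1_000);
        System.out.println(tracker.isExpired(sessionId, now + 5_000)); // false
    }
}
```

In the real patch the "touch" on error travels either through the follower's touchTable to the leader on the next ping, or directly into SessionTrackerImpl.touchSession when the leader itself holds the connection.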
Possible issues:
1. If a client and a server each see a socket disconnection due to a network switch failure, the client will have a shorter time window in which to fail over to a different server before its session is timed out. This is fine with me, but since the shorter timeout is configurable, for those users for whom this risk is not worth the benefit, setting the minSessionTimeout to be the same as the negotiated session timeout will mitigate this problem. Therefore I'm not going to attempt to fix this.
2. If a client and a server both see disconnections but the client manages to fail over and migrate its session before the original server sends its session tracker update with the reduced session timeout, the client could potentially have its session timed out if it does not heartbeat in the reduced timeout window, despite having failed over. This reduced timeout window would only last until the new zk ensemble member re-pinged for that client, but there is a window of vulnerability. This could be fixed before making this change.
Are we all on the same page so far with this? All I want to do is enable fast failing for those who want it, if they are willing to accept the possibility that certain network failures could cause over-aggressive session timeout for clients that are not actually dead.

Benjamin Reed
added a comment - 13/Jan/11 15:57 camille is it fair to characterize this proposal as a way to mark a client as failed faster, when compelling evidence (the socket exception) indicates that the client has failed?
i think it is a good idea. there are a couple of things that i see as problematic:
1. it really only covers process failures. if the machine fails, you will not get a connection reset.
2. you only get the connection reset if you try to send something to the machine, so unless a watch triggers or there was a request outstanding that completes after the process fails, you will not get a reset.
i was looking a bit closer at the false detections: a client needs to move to a new server, so it closes a connection and then connects to a new server. i think if we made the new connection before the close, the suspicion from the old server would be ignored.
it would be cool to get this feature if we can address the above. i was thinking that perhaps we could do something to address this with ZOOKEEPER-702. have you looked at that issue?
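The "connect before close" idea here amounts to an ownership check: a suspicion from a server that no longer owns the session should be ignored. A minimal sketch of that guard, with illustrative names rather than the real SessionTrackerImpl API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the suggestion above: track a session "owner" (the server
// currently holding the connection) and discard a shortened-timeout
// suspicion arriving from a server that has since lost ownership.
public class OwnerCheckSketch {
    static class Tracker {
        final Map<Long, String> owner = new HashMap<>(); // sessionId -> serverId
        final Map<Long, Long> expiry = new HashMap<>();  // sessionId -> expire time

        void touch(long sid, String serverId, int timeout, long now) {
            owner.put(sid, serverId);                    // pings update ownership
            expiry.put(sid, now + timeout);
        }

        // A suspicion (socket error) only shrinks the expiry window if the
        // suspecting server still owns the session.
        boolean suspect(long sid, String serverId, int minTimeout, long now) {
            if (!serverId.equals(owner.get(sid))) {
                return false;                            // stale suspicion: ignored
            }
            expiry.put(sid, now + minTimeout);
            return true;
        }
    }

    public static void main(String[] args) {
        Tracker t = new Tracker();
        long sid = 0x2L;

        t.touch(sid, "follower1", 30_000, 0);    // client connected to follower1
        t.touch(sid, "follower2", 30_000, 100);  // client migrates to follower2

        // follower1 finally notices the dead socket; ownership has moved on.
        boolean applied = t.suspect(sid, "follower1", 2_000, 200);
        System.out.println(applied);             // false
        System.out.println(t.expiry.get(sid));   // 30100: full window preserved
    }
}
```

This only closes the race if the migration touch reaches the leader before the old server's suspicion does, which is exactly the ordering the "new connection before the close" suggestion tries to guarantee.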

Camille Fournier
added a comment - 03/Feb/11 21:08 Sorry for the delay in responding.
Yes, "a way to mark a client as failed faster when compelling evidence indicates that the client has failed" is exactly right.
When you say:
"you only get the connection reset if you try to send something to the machine, so unless a watch triggers or there was a request outstanding that completes after the process fails, you will not get a reset."
Do you mean in the case where the machine fails?
In general, I don't mind if we don't fail fast in the cases of a machine failing or a network cable being unplugged. It is more important to me to catch the majority of failure cases (process failure) than the rarer cases of machine failure.
Presuming that we would detect a machine failure by sending a reverse ping of some sort, my biggest concern would be additional network traffic. Anything more complex than pinging every minSessionTimeout or so would probably turn this into a major undertaking.
Haven't looked at 702, will take a glance.
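The reverse-ping idea raised above could look something like the following. This is purely an illustrative sketch of one probe round (the probe transport is faked, and nothing here is a proposed ZooKeeper API); the point is that probing roughly once per minSessionTimeout keeps the extra traffic bounded at one probe per client per window:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a server-side reverse ping: probe each client roughly once per
// minSessionTimeout and mark sessions whose probe fails as suspects for the
// shortened timeout. Everything here is illustrative only.
public class ReversePingSketch {
    interface Probe { boolean ping(long sessionId); }

    static final int MIN_SESSION_TIMEOUT = 2_000;    // ms, illustrative value

    // One probe round: returns the sessions that failed the reverse ping,
    // mapped to the shortened timeout that would be applied to them.
    static Map<Long, Integer> probeRound(Iterable<Long> sessions, Probe probe) {
        Map<Long, Integer> suspects = new ConcurrentHashMap<>();
        for (long sid : sessions) {
            if (!probe.ping(sid)) {
                // Would translate into a touch(sid, MIN_SESSION_TIMEOUT).
                suspects.put(sid, MIN_SESSION_TIMEOUT);
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        // Session 1 answers its reverse ping; session 2's machine is gone.
        Probe probe = sid -> sid == 1L;
        Map<Long, Integer> suspects = probeRound(List.of(1L, 2L), probe);
        System.out.println(suspects.containsKey(1L)); // false
        System.out.println(suspects.containsKey(2L)); // true
    }
}
```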

Camille Fournier
added a comment - 03/Feb/11 23:55 Looked at 702. It is an interesting thing but I'm not convinced it will cleanly account for the two extremes I am concerned with: on the one hand, failing fast when we KNOW we have failed (socket exception), and on the other hand, handling random and potentially very long periods with no ping due to long-running GCs. If the error tolerance always ends up bumping to the longest GC a zk client is seen to perform, I'm not going to get the fast failure guarantees that I really want.

Camille Fournier
added a comment - 09/Feb/11 23:05 Thinking more about detecting machine/network failures:
In the case of a network partition that separates my client from my ZK cluster, a heartbeat from the quorum member to the client would detect a failure and invoke the faster timeout. Do we want that?