the client can reconnect to the servers if they die only to renew the session. If the session is invalid this client in invalid.

originally we allowed a client to reconnect and create a new session and foudn that our users were getting more and more confused with such kind of sematincs and found that the above sematics made things easier for them. So I think we should not reconnect and try creating a new session with the same client. does that make sense?

Mahadev konar
added a comment - 22/Jul/08 22:53 james,
here is what the current sematics is —
there is one on one mapping between a client and a session
if a session expires the client becomes invalid
the client can reconnect to the servers if they die only to renew the session. If the session is invalid this client in invalid.
originally we allowed a client to reconnect and create a new session and foudn that our users were getting more and more confused with such kind of sematincs and found that the above sematics made things easier for them. So I think we should not reconnect and try creating a new session with the same client. does that make sense?

So what am I meant to do if I get a SessionExpiredException? I want to reconnect with a new session; as the current connection is totally useless and invalid.

Right now the only option I can see is to recreate from scratch the entire ZooKeeper object?

If so that means we've gotta introduce a ZooKeeperProxy thingy which wraps the ZooKeeper object and just lets it be recreated if a SessionExpiredException occurs. Seems kinda unnecessary when really a ZooKeeper instance is capable of easily recreating the connection to the ZK server when the session is expired.

Maybe the issue is the phrase reconnect() - maybe a method called recreateNewSession() is better? We could also document it that this is only to be called if your client becomes invalid because the ZK server has expired the session?

james strachan
added a comment - 23/Jul/08 12:15 I'm kinda confused by that
So what am I meant to do if I get a SessionExpiredException? I want to reconnect with a new session; as the current connection is totally useless and invalid.
Right now the only option I can see is to recreate from scratch the entire ZooKeeper object?
If so that means we've gotta introduce a ZooKeeperProxy thingy which wraps the ZooKeeper object and just lets it be recreated if a SessionExpiredException occurs. Seems kinda unnecessary when really a ZooKeeper instance is capable of easily recreating the connection to the ZK server when the session is expired.
Maybe the issue is the phrase reconnect() - maybe a method called recreateNewSession() is better? We could also document it that this is only to be called if your client becomes invalid because the ZK server has expired the session?

james strachan
added a comment - 23/Jul/08 14:30 the ZooKeeper will reconnect by default if the socket fails right? So I'm only really talking about reconectingWithNewSession() in those rare cases that the ZK server times out the session

The problem with the reconnect method is that it is easy to use but very hard to use correctly. In general, we try to avoid giving programmers methods that make it easy to hang themselves. We provide plenty of rope, and they are free to make a noose; we just don't provide a tieTheKnot method.

The basic problem has to do with clean up that happens when a session dies. Obviously ephemeral nodes will go away. Our watch reestablishment patches could be updated to make watches reestablish on reconnect, but that would be very confusing to have ephemerals disappear but watches get reestablished.

To ZooKeeper an expired session is a dead client, so by making the ZooKeeper service handle die as well we have clear semantics.

As an historical note, session expiration wasn't always a fatal thing. We used to delivered the session expiration event and then get a new session. The earliest problem our users ran into was with multi-threaded applications. If there are multiple threads going, the watcher object will get the session expired event; the other threads may still be trying to use the ZooKeeper object. If you use the same ZooKeeper object to reconnect with, you can easily end up with a thread that uses the object thinking that it is running in the context of an earlier session and really mess things up. The explicit reconnect alleviates this slightly, but you would still need to do synchronization around the ZooKeeper object and it would be very difficult to get working when you are using different libraries using the same ZooKeeper object.

You are working on a lock class; you can imagine someone else does a barrier class. Who is in charge of doing the reconnect() and deciding when it is safe?

Testing applications that use this would also be a pain given the number of race conditions involved.

So the safest thing to do is have the lock object become invalid when it's ZooKeeper object becomes invalid. An expired session is a really bad thing, we can't hide it nicely.

Benjamin Reed
added a comment - 23/Jul/08 15:20 The problem with the reconnect method is that it is easy to use but very hard to use correctly. In general, we try to avoid giving programmers methods that make it easy to hang themselves. We provide plenty of rope, and they are free to make a noose; we just don't provide a tieTheKnot method.
The basic problem has to do with clean up that happens when a session dies. Obviously ephemeral nodes will go away. Our watch reestablishment patches could be updated to make watches reestablish on reconnect, but that would be very confusing to have ephemerals disappear but watches get reestablished.
To ZooKeeper an expired session is a dead client, so by making the ZooKeeper service handle die as well we have clear semantics.
As an historical note, session expiration wasn't always a fatal thing. We used to delivered the session expiration event and then get a new session. The earliest problem our users ran into was with multi-threaded applications. If there are multiple threads going, the watcher object will get the session expired event; the other threads may still be trying to use the ZooKeeper object. If you use the same ZooKeeper object to reconnect with, you can easily end up with a thread that uses the object thinking that it is running in the context of an earlier session and really mess things up. The explicit reconnect alleviates this slightly, but you would still need to do synchronization around the ZooKeeper object and it would be very difficult to get working when you are using different libraries using the same ZooKeeper object.
You are working on a lock class; you can imagine someone else does a barrier class. Who is in charge of doing the reconnect() and deciding when it is safe?
Testing applications that use this would also be a pain given the number of race conditions involved.
So the safest thing to do is have the lock object become invalid when it's ZooKeeper object becomes invalid. An expired session is a really bad thing, we can't hide it nicely.

So an Elect Leader or Write Lock protocol has to deal with expired sessions and create new sessions; at some point someone has to recreate something. You can pass the buck and say we're not gonna allow the ZooKeeper to reconnect. Then say we're not allowed to have the WriteLock reconnect, then the next and next layer of the onion. But eventually there's gonna be something somewhere that recreates a session

For now I'll work on the assumption we're gonna have to have an object which is a wrapper around a ZooKeeper so that it can handle reconnections by just discarding one ZooKeeper instance and creating another. This object could be shared across Protocols (we might wanna reuse one connection with ZK to make multiple locks for example).

james strachan
added a comment - 23/Jul/08 15:38 I hear you
So an Elect Leader or Write Lock protocol has to deal with expired sessions and create new sessions; at some point someone has to recreate something. You can pass the buck and say we're not gonna allow the ZooKeeper to reconnect. Then say we're not allowed to have the WriteLock reconnect, then the next and next layer of the onion. But eventually there's gonna be something somewhere that recreates a session
For now I'll work on the assumption we're gonna have to have an object which is a wrapper around a ZooKeeper so that it can handle reconnections by just discarding one ZooKeeper instance and creating another. This object could be shared across Protocols (we might wanna reuse one connection with ZK to make multiple locks for example).

james strachan
added a comment - 23/Jul/08 16:20 You can mark this issue as RESOLVED/WILL_NOT_FIX if you like now - I've implemented a ZooKeeperFacade to wrap up the reconnectWithNewSession() logic for ZOOKEEPER-78

This has been a really interesting discussion, appreciate everyone digging in. Before closing this it would be great if we could capture this somewhere. James isn't going to be the only one encountering these issues.

Ben I'll assign to you, can you document this on the ASF wiki, then close? Thanks.

Patrick Hunt
added a comment - 23/Jul/08 17:15 This has been a really interesting discussion, appreciate everyone digging in. Before closing this it would be great if we could capture this somewhere. James isn't going to be the only one encountering these issues.
Ben I'll assign to you, can you document this on the ASF wiki, then close? Thanks.

Patrick Hunt
added a comment - 24/Jul/08 23:06 If I read the comments correctly the attached patch seems to be moot. Also the last comment references a SVN repository, please attach the patch as a "svn diff" and resubmit for review.