Description

When a connection to a ZooKeeper server fails, all pending requests
return an error. Ideally, the requests should instead be resubmitted when
the client reestablishes a connection to ZooKeeper.

For read requests, it's no big deal to just reissue the request. For update
requests, ZooKeeper must be able to detect whether the request has already
been processed and, if so, return the result of the previous execution;
otherwise, it should process the request.
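
A minimal sketch of the update-request behaviour described above, assuming each request can be identified by a session id and a per-session request id. The class name `ReplayCache` and its `submit` method are hypothetical and are not part of ZooKeeper; the sketch only illustrates "return the previous result if already processed, otherwise process".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical sketch, not ZooKeeper code: remembers the result of each
 * update, keyed by (sessionId, requestId), so a resubmitted update is
 * answered from memory instead of being executed a second time.
 */
final class ReplayCache {
    private final Map<String, String> results = new HashMap<>();

    /**
     * Runs the update the first time a (sessionId, requestId) pair is seen.
     * On a resubmission of the same pair, returns the stored result without
     * running the update again.
     */
    synchronized String submit(long sessionId, int requestId, Supplier<String> update) {
        String key = sessionId + ":" + requestId;
        if (results.containsKey(key)) {
            return results.get(key);      // already processed: replay the earlier result
        }
        String result = update.get();     // first time: actually process the request
        results.put(key, result);
        return result;
    }
}
```

A real server would also need to bound this table and persist it across restarts; the sketch ignores both.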

james strachan
added a comment - 24/Jul/08 08:46 BTW this discussion came up recently on the dev lists too...
http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-dev/200807.mbox/%3cec6e67fd0807180945vd72ac6axcfd0851789fb6e5c@mail.gmail.com%3e
To be able to retry operations on connection close (or due to session expiration), there is a patch in https://issues.apache.org/jira/browse/ZOOKEEPER-78
which adds a ZooKeeperFacade for dealing with reconnecting on session expiration, and some helper methods in ProtocolSupport for retrying synchronous operations or blocks of code in light of connection failures.
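
For illustration only, a client-side retry helper of the kind mentioned above might look roughly like this. It is a sketch under assumptions, not the ProtocolSupport code from the ZOOKEEPER-78 patch: `RetryHelper`, `retry`, and the stand-in `ConnectionLostException` are hypothetical names, and the linear backoff is an arbitrary choice.

```java
import java.util.concurrent.Callable;

/** Stand-in for a connection-loss error; hypothetical, not the ZooKeeper client's exception type. */
class ConnectionLostException extends Exception {
    ConnectionLostException(String message) {
        super(message);
    }
}

/** Hypothetical helper that retries a synchronous operation after connection loss. */
final class RetryHelper {
    private RetryHelper() {}

    /**
     * Calls the operation up to maxAttempts times. Only ConnectionLostException
     * triggers a retry; any other exception propagates immediately. If every
     * attempt fails, the last ConnectionLostException is rethrown.
     */
    static <T> T retry(Callable<T> operation, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        ConnectionLostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (ConnectionLostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis * attempt);   // simple linear backoff before retrying
                }
            }
        }
        throw last;
    }
}
```

A loop like this is safe for reads. For updates it can apply the same change twice if the first attempt succeeded on the server before the connection dropped, which is the problem this issue asks the server to solve.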

james strachan
added a comment - 24/Jul/08 09:37 BTW you can see the code for ProtocolSupport and ZooKeeperFacade as I've checked in the patch for ZOOKEEPER-78 into a temporary sandbox area, details here

Benjamin Reed
added a comment - 29/May/09 00:48 It turns out that all the information needed to do this is split between server and client. The server pushes all updates through the atomic broadcast, even errors. So if the client resends its pending requests to the server when it reconnects, the server should be able to either replay the responses or execute the requests. This would eliminate the annoying-to-deal-with CONNECTIONLOSS error.

Mahadev konar
added a comment - 30/Oct/09 23:11 Ted, due to some laziness on my side, I haven't made much progress on this. I expect to make good progress next week and hope to post a patch within a week or two.

Ted Dunning
added a comment - 30/Oct/09 23:17
I wouldn't call it laziness. At most distraction.
But a lot of ZK users will breathe a sigh of relief when this fix gets deployed!
Thanks for your efforts on this.

Mahadev konar
added a comment - 27/Jan/10 02:12 Sorry folks, I had been working on this jira for some time and got sidetracked by other issues. I will upload a proposal for this jira, including what users should expect from it, in a while. Please feel free to take a look and comment. I have been making some progress on this and will try to get it in soon.
thanks

Mahadev konar
added a comment - 27/Jan/10 23:27 Here is the design document for zookeeper-22. I realised that the scope of zookeeper-22 is much bigger than I had anticipated. It involves extensive changes to the leader, to connect requests from the client, and to the clean up scripts.
Version checking needs to be introduced with this patch so that old clients work with new servers and old servers work with new clients, keeping everything backwards compatible.
Given the scope of this jira, I will be creating sub jiras and marking them as part of this jira, since a humongous patch would be hard to get in given that it would touch all critical parts of zookeeper.

qing yan
added a comment - 01/Feb/10 02:29 One suggestion: can we make auto retry an option rather than mandatory?
My concern is what happens if the client wants to abort the operation after receiving a CONNECTION LOSS event:
it needs to either kill the thread or issue an explicit undo operation afterwards, which is kind of awkward...

qing yan
added a comment - 01/Feb/10 06:22 A CONNECTION LOSS event refers to the situation where the connection to the ZK cluster (a quorum of ZK nodes) is lost and the application needs to enter
safe mode, while a broken connection to a particular node and failover to another node can be handled transparently.