Description
When a ZooKeeper server loses contact with over half of the other servers in an ensemble ('loses a quorum'), it stops responding to client requests because it cannot guarantee that writes will get processed correctly. For some applications, it would be beneficial if a server still responded to read requests when the quorum is lost, but caused an error condition when a write request was attempted.

This project would implement a 'read-only' mode for ZooKeeper servers (maybe only for Observers) that allows read requests to be served as long as the client can contact a server.

This is a great project for getting really hands-on with the internals of ZooKeeper. You must be comfortable with Java and networking; otherwise you'll have a hard time coming up to speed.

Activity

Dave Wright
added a comment - 08/May/10 14:26 This is a great idea, but I'm afraid there is a fairly fundamental problem with the concept.
What you want is this: if enough nodes go down that no quorum can be formed at all, the remaining nodes go into read-only mode.
The problem is that if a partition occurs (say, a single server loses contact with the rest of the cluster) but a quorum still exists, we want clients that were connected to the partitioned server to reconnect to a server in the majority. The current design allows for this by denying connections to minority nodes, forcing clients to hunt for the majority. If we allow servers in the minority to keep or accept connections, clients will end up in read-only mode when they could simply have reconnected to the majority.
It may be possible to accomplish the desired outcome with some client-side and connection protocol changes. Specifically, the client's connection request would carry a flag that says "allow read-only connections". If the flag is false, a server in the minority closes the connection, allowing the client to hunt for a server in the majority. Once a client has gone through all the servers in its list (and found that none are in the majority), it could flip the flag to true and connect to any running server in read-only mode. There is still the question of how to get back out of read-only mode (e.g. should the client keep hunting in the background for a majority, or just wait until the server it is connected to re-forms a quorum?).
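The two-pass hunt described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ZooKeeper code: `ReadOnlyHunt`, `Server`, `Session`, `accepts` and `connect` are hypothetical names, and each server is reduced to two booleans (reachable, in the majority).

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; none of these names are ZooKeeper APIs. */
class ReadOnlyHunt {

    /** A server as the client sees it: reachable or not, in the majority or not. */
    static final class Server {
        final String name;
        final boolean reachable;
        final boolean inQuorum;

        Server(String name, boolean reachable, boolean inQuorum) {
            this.name = name;
            this.reachable = reachable;
            this.inQuorum = inQuorum;
        }

        /** Server-side rule: a minority server keeps the connection only if the flag is set. */
        boolean accepts(boolean allowReadOnly) {
            return reachable && (inQuorum || allowReadOnly);
        }
    }

    /** Result of a successful connection attempt. */
    static final class Session {
        final String server;
        final boolean readOnly;

        Session(String server, boolean readOnly) {
            this.server = server;
            this.readOnly = readOnly;
        }
    }

    /** Try every server with the flag off; only if none accepts, retry with the flag on. */
    static Optional<Session> connect(List<Server> servers) {
        // Pass 1: flag is false, so only servers in the majority accept.
        for (Server s : servers) {
            if (s.accepts(false)) {
                return Optional.of(new Session(s.name, false));
            }
        }
        // Pass 2: no majority server was found, so settle for any reachable server.
        for (Server s : servers) {
            if (s.accepts(true)) {
                return Optional.of(new Session(s.name, true));
            }
        }
        return Optional.empty();
    }
}
```

With this ordering, a client attached to a partitioned server still ends up on a majority server whenever one is reachable. The sketch leaves the open question untouched: it does not model how a read-only session returns to normal mode.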

Sergey Doroshenko
added a comment - 08/May/10 15:20 Dave, thanks for the feedback.
Did you check http://wiki.apache.org/hadoop/ZooKeeper/GSoCReadOnlyMode ?
The approach described there is similar to what you've proposed: make the server distinguish between read-only and regular clients.
However, I was thinking that a read-only client should go into read-only mode as soon as the server it's connected to is partitioned, without trying to reconnect to the majority. But your idea that the client should try all servers first is definitely the better option.
Also, I think the current behavior of the ZooKeeper client should remain unchanged.
I mean, there should be either a new class for the read-only client, or new functionality in the current client that is explicitly enabled, say, by a flag passed to the constructor. The idea is not to break code for current users.
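As a rough illustration of the constructor-flag option, here is a minimal sketch. Everything in it is an assumption made for the example: `SketchClient`, `ReadOnlyModeException` and `onQuorumLost` are hypothetical names, not the real client API, and an in-memory map stands in for server state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client sketch; not the real ZooKeeper client class. */
class SketchClient {

    /** Hypothetical error raised when a write is attempted in read-only mode. */
    static final class ReadOnlyModeException extends RuntimeException {
        ReadOnlyModeException(String path) {
            super("write to " + path + " rejected: session is read-only");
        }
    }

    private final boolean allowReadOnly;
    private boolean readOnly;
    // Stand-in for server-side data, so the sketch runs without a server.
    private final Map<String, String> data = new HashMap<>();

    /** Existing-style constructor: behavior is unchanged, read-only is never allowed. */
    SketchClient() {
        this(false);
    }

    /** New constructor: the caller opts in to read-only mode explicitly. */
    SketchClient(boolean allowReadOnly) {
        this.allowReadOnly = allowReadOnly;
    }

    /**
     * Simulates the connected server losing its quorum.
     * Returns true if the session survives (in read-only mode),
     * false if it must be dropped, as it would be for a client that did not opt in.
     */
    boolean onQuorumLost() {
        readOnly = allowReadOnly;
        return allowReadOnly;
    }

    /** Reads are served in both modes. */
    String read(String path) {
        return data.get(path);
    }

    /** Writes fail with an error while the session is read-only. */
    void write(String path, String value) {
        if (readOnly) {
            throw new ReadOnlyModeException(path);
        }
        data.put(path, value);
    }
}
```

Code that uses the no-argument constructor sees the same behavior as before, which is the point of making the flag opt-in.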

Sergey Doroshenko
added a comment - 28/May/10 15:55 I have updated the wiki page to describe a new (quite simple and elegant) approach to implementing the server-side part of read-only mode.
I already discussed this with Henry yesterday.
Take a look if you're interested in the details, and let me know if you have any thoughts about it.