When I kill 1 node (for example : disconnecting, or stopping, or pausing the Ubuntu VM) I have the following behavior:

When I execute this program:
1- I have an exception saying that 1 node is down : Expected behavior (even if we could avoid a long stack trace)
{code}
2012-11-18 08:14:55.830 WARN com.couchbase.client.vbucket.ConfigurationProviderHTTP: Connection problems with URI http://192.168.0.108:8091/pools ...skipping
java.net.ConnectException: Host is down
{code}

Note that it is only happening with PersistTo.TWO
if I use PersistTo.MASTER or PersistTo.ONE : the program is executed with no error and no stop
if I use PersistTo.THREE ( or more) : the program is executed, no stop with the expected observe message : ( Error while storing : Requested persistence to 3 node(s), but only 2 are available.
)

I do believe that's actually expected behavior, but let's talk through it to get your opinion.

We have a couple of options in the state of unexpected failure. one is we try our hardest to get the operation requested of us done and we rely on timeouts to keep from blocking forever. The second is that we keep tabs on our connections, and if the connection is down, we fail operations immediately so as to not have the application code waiting for something that may or may not succeed.

Had you gone in and removed the second node (click 'remove' and 'rebalance'), then the client should have done something similar to when you requested three nodes. The failure you describe above is unexpected. Further, the client library doesn't really know if it's temporary or permanent.

Finally, I do want to note, and I think this is well documented, that many things with Observe protocol under them end in timeouts. This is not the only one. Generally speaking, application code should be ready to do *something* in the case of a timeout.

Matt Ingenthron
added a comment - 20/Nov/12 9:54 AM I do believe that's actually expected behavior, but let's talk through it to get your opinion.
We have a couple of options in the state of unexpected failure. one is we try our hardest to get the operation requested of us done and we rely on timeouts to keep from blocking forever. The second is that we keep tabs on our connections, and if the connection is down, we fail operations immediately so as to not have the application code waiting for something that may or may not succeed.
Had you gone in and removed the second node (click 'remove' and 'rebalance'), then the client should have done something similar to when you requested three nodes. The failure you describe above is unexpected. Further, the client library doesn't really know if it's temporary or permanent.
Finally, I do want to note, and I think this is well documented, that many things with Observe protocol under them end in timeouts. This is not the only one. Generally speaking, application code should be ready to do *something* in the case of a timeout.

Tug explained this further. The PersistTo.THREE check must be happening after doing some operations, which is a bit late considering this operation can never succeed. The failure should be the same with a cluster that has a down node as it is with a cluster that just doesn't have a primary and to replica locations.

Matt Ingenthron
added a comment - 20/Nov/12 10:06 AM Tug explained this further. The PersistTo.THREE check must be happening after doing some operations, which is a bit late considering this operation can never succeed. The failure should be the same with a cluster that has a down node as it is with a cluster that just doesn't have a primary and to replica locations.

The way Rags wrote this code originally was to do the set and then the observe. The observe part is the part that does all of the checking so the set will actually go through an then you will get the error. Similarly there is no checking for downed nodes and I don't think we actually have the ability to do this at the moment, but I may be wrong.

On another note, one other thing I thing is wrong is returning an OperationFuture from all of the observe functions, but it isn't actually an asynchronous function.

Mike Wiederhold
added a comment - 21/Nov/12 12:52 PM The way Rags wrote this code originally was to do the set and then the observe. The observe part is the part that does all of the checking so the set will actually go through an then you will get the error. Similarly there is no checking for downed nodes and I don't think we actually have the ability to do this at the moment, but I may be wrong.
On another note, one other thing I thing is wrong is returning an OperationFuture from all of the observe functions, but it isn't actually an asynchronous function.