Hi,
We have an application with two code paths, one of which uses a secondary
index query and one of which doesn't. While testing node-down scenarios
in our cluster we got a result which surprised (and concerned) me, and I
wanted to find out if the behavior we observed is expected.
Background:
- 6 nodes in the cluster (in order: A, B, C, E, F and G)
- RF = 3
- All operations at QUORUM
- Operation 1: Read by row key followed by write
- Operation 2: Read by secondary index, followed by write
While running a mixed workload of operations 1 and 2, we got the following
results:

Scenario               Result
--------------------   ----------------------
All nodes up           All operations succeed
One node down          All operations succeed
Nodes A and E down     All operations succeed
Nodes A and B down     Operation 1: ~33% fail
                       Operation 2: All fail
Nodes A and C down     Operation 1: ~17% fail
                       Operation 2: All fail

We had expected (perhaps incorrectly) that the secondary index reads would
fail in proportion to the portion of the ring that was unable to reach
quorum, just as the row key reads did. For both operation types the
underlying failure was an UnavailableException.
The same pattern repeated for the other scenarios we tried. The row key
operations failed at the expected ratios, given the portion of the ring
that was unable to meet quorum because of nodes down, while all the
secondary index reads failed as soon as 2 out of any 3 adjacent nodes were
down.
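
For reference, the ratios we expected for the row key reads can be
reproduced with a small sketch. It is only an illustration and rests on
assumptions not stated above: each node owns an equal share of the ring,
and each range is replicated on the RF consecutive nodes in ring order.
The function names are made up for this example.

```python
RING = ["A", "B", "C", "E", "F", "G"]  # node order as listed in the background
RF = 3


def replica_sets(ring, rf):
    """Assumed placement: each range lives on rf consecutive nodes in ring order."""
    n = len(ring)
    return [[ring[(i + k) % n] for k in range(rf)] for i in range(n)]


def fraction_failing_quorum(down, ring=RING, rf=RF):
    """Fraction of (assumed equal-sized) ranges with fewer than a quorum of live replicas."""
    quorum = rf // 2 + 1  # 2 of 3 replicas for RF = 3
    sets = replica_sets(ring, rf)
    failing = sum(
        1 for replicas in sets
        if sum(node not in down for node in replicas) < quorum
    )
    return failing / len(sets)


if __name__ == "__main__":
    for down in [set(), {"A"}, {"A", "E"}, {"A", "B"}, {"A", "C"}]:
        print(sorted(down), f"{fraction_failing_quorum(down):.0%}")
```

Under those assumptions this gives 0% for the first three scenarios, 33%
for A and B down, and 17% for A and C down, which matches what we saw for
Operation 1 but not for Operation 2.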
Is this expected behavior? Is it documented anywhere? I didn't find it
with a quick search.
The operation doing the secondary index query is an important one for our app,
and we'd really prefer that it degrade gracefully in the face of cluster
failures. My plan at this point is to do that query at ConsistencyLevel.ONE
(and accept the increased risk of inconsistency). Will that work?
Thanks in advance,
Jim