I have observed one problem with an inconsistent ring that is
superficially similar (the node thinks it's up but its peers disagree) and
noted the details in CASSANDRA-6082. However, it does not sound like either
the symptoms or the resolution match what you describe.
If you have not already, running `nodetool gossipinfo` might give you
more clues than `status`.
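
If it helps, here is a rough Python sketch for comparing the gossip state
seen by two nodes once you have saved the gossipinfo output from each. It
assumes the output is an unindented endpoint line (for example `/10.0.0.1`)
followed by indented `KEY:value` lines; I have not checked that layout
against 1.1.12, so adjust the parsing if yours differs. The function names
are my own, not part of any Cassandra tooling.

```python
def parse_gossipinfo(text):
    """Parse gossipinfo-style text into {endpoint: {key: value}}.

    Assumed layout (not verified for every Cassandra version):
    an unindented endpoint line, then indented KEY:value lines.
    """
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Endpoint header, e.g. "/10.0.0.1" or "host/10.0.0.1"
            current = line.strip().rsplit("/", 1)[-1]
            nodes[current] = {}
        elif current is not None and ":" in line:
            # Split on the first colon only; values may contain colons
            key, _, value = line.strip().partition(":")
            nodes[current][key] = value
    return nodes


def diff_views(view_a, view_b, key="STATUS"):
    """Return {endpoint: (value_a, value_b)} where the two views disagree."""
    diffs = {}
    for endpoint in sorted(set(view_a) | set(view_b)):
        a = view_a.get(endpoint, {}).get(key)
        b = view_b.get(endpoint, {}).get(key)
        if a != b:
            diffs[endpoint] = (a, b)
    return diffs
```

Feed it the saved output from the misbehaving node and from a healthy one,
then call `diff_views` on the two parsed results; any endpoint listed is one
where the two nodes disagree on the chosen key.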
On 09/13/2013 10:48 AM, Dave Cowen wrote:
> Hi, all -
>
> We've been running Cassandra 1.1.12 in production since February, and have
> experienced a vexing problem with an arbitrary node "falling out of" or
> separating from the ring on occasion.
>
> When a node "falls out" of the ring, running nodetool ring on the
> misbehaving node shows that the misbehaving node believes that it is Up, but
> that the rest of the ring is Down, and the rest of the ring has question
> marks listed for load. nodetool ring on any of the other nodes, however,
> shows the misbehaving node as Down but everything else as Up.
>
> Shutting down and restarting the misbehaving node does not result in
> changed behavior. We can only get the misbehaving node to rejoin the ring
> by shutting it down, running nodetool removetoken
> and nodetool removetoken force elsewhere in the ring. After the node's
> token has been removed from the ring, it will rejoin and behave normally
> when it is restarted.
>
> This is not a frequent occurrence - we can go months between
> incidents. It most commonly occurs when a different node is brought down
> and then back up, but it can happen spontaneously. This is also not
> associated with a network connectivity event; we've seen no interruption in
> the nodes being able to communicate over the network. As above, it's also
> not isolated to a single node; we've seen this behavior on multiple nodes.
>
> This has occurred with both the identical seeds specified in cassandra.yaml
> on each node, and also when we remove the node from its own seed list (so
> any seed won't try to auto-bootstrap from itself). Seeds have always been
> up and available.
>
> Has anyone else seen similar behavior? For obvious reasons, we hate seeing
> one of the nodes suddenly "fall out" and require intervention when we flap
> another node, or for no reason at all.
>
> Thanks,
>
> Dave
>