Full disk can result in being marked down

Description

We had a node fill up the disk under one of two data directories. The result was that the node stopped making progress. The problem appears to be this (I'll update with more details as we find them):

When new tasks are put onto most queues in Cassandra, if there isn't a thread in the pool to handle the task immediately, the task is run in the caller's thread (org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor:69 sets the caller-runs policy). The queue in question here is the queue that manages flushes, which is enqueued to from various places in our code (and therefore likely from multiple threads). Assuming that the full disk meant that no threads doing flushing could make progress (it appears that way), eventually any thread that calls the flush code would become stalled.
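The caller-runs behavior described above is easy to reproduce in isolation. Here is a minimal sketch (not Cassandra's actual executor; the class name, pool size, and queue capacity are invented for illustration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class CallerRunsDemo {
    // Saturate a one-thread pool whose worker is stalled (as if blocked on a
    // full disk), then submit one more task and report which thread ran it.
    public static String runRejectedTask() throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        CountDownLatch stall = new CountDownLatch(1);
        // Simulate a flush thread stuck on a full disk: it never finishes.
        pool.execute(() -> { try { stall.await(); } catch (InterruptedException e) { } });
        pool.execute(() -> { });   // fills the (capacity-1) queue

        AtomicReference<String> ranOn = new AtomicReference<>();
        // This task is rejected by the saturated pool, so CallerRunsPolicy
        // runs it here, in the submitting thread -- which would block the
        // submitter if the task itself blocked on the full disk.
        pool.execute(() -> ranOn.set(Thread.currentThread().getName()));

        stall.countDown();
        pool.shutdown();
        return ranOn.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rejected task ran on: " + runRejectedTask());
    }
}
```

Because the rejected task runs on the submitting thread, a pool whose only worker is blocked effectively recruits, and then stalls, every thread that tries to enqueue a flush.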

Assuming our analysis is right (and we're still looking into it) we need to make a change. Here's a proposal so far:

SHORT TERM:

Change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled.

LONG TERM:

It appears that there are n threads for the n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory.
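The long-term proposal above could look roughly like this. This is a hypothetical sketch (the class and method names are invented, not the actual Cassandra change): one single-threaded flush executor per data directory, so a stalled disk only stalls its own thread.

```java
import java.io.File;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerDirectoryFlushers {
    private final Map<File, ExecutorService> flushers = new HashMap<>();

    public PerDirectoryFlushers(List<File> dataDirectories) {
        // One dedicated flush thread per data directory.
        for (File dir : dataDirectories)
            flushers.put(dir, Executors.newSingleThreadExecutor());
    }

    // Flush work for a directory is queued on that directory's own thread,
    // so a full disk can stall at most its own executor.
    public Future<?> submitFlush(File dir, Runnable flushTask) {
        return flushers.get(dir).submit(flushTask);
    }

    public void shutdown() {
        flushers.values().forEach(ExecutorService::shutdown);
    }
}
```

With this shape, a flush that hangs on a full directory leaves flushes to the other directories unaffected.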

Jonathan Ellis
added a comment - 28/May/12 20:22

Created https://issues.apache.org/jira/browse/CASSANDRA-4292 to pursue the thread-per-disk idea. CASSANDRA-2116 and CASSANDRA-2118 address the issue of what to do when disks error out.

Jonathan Ellis
added a comment - 27/Dec/10 15:57

Updating for the past 10 months' worth of changes:

> Our node that hit this condition is essentially dead (it's not gossiping or accepting any writes or reads, but is still alive).

This is basically fixed now that flow control is implemented (CASSANDRA-685) and refined (CASSANDRA-1358).

> It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory

At least until we cap sstable size (CASSANDRA-1608?), one data volume is going to be the recommended configuration, so this is low priority.

> if a disk fills up, we stop trying to write to it
> if we're about to write more data to a disk than space available, we don't try and write to that disk

These two Cassandra has always done on compaction; less sure about flush. The nice thing about writes is that erroring out is almost identical to being completely down for ConsistencyLevel purposes.

> we balance data relatively evenly between disks

Also low priority given the above.

> if a disk is misbehaving for a period of time, we stop using it and assume that data is lost (potentially notify an operator as well)

This is the biggest problem right now: if a disk/volume goes down, the rest of the node (in particular gossip) will keep functioning, so other nodes will continue trying to read from it. Short term, the best fix for this is to provide timeout information to the dynamic snitch (CASSANDRA-1905) so it can route around such nodes.
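The "don't write more data to a disk than space available" property discussed in this thread can be sketched with `java.io.File.getUsableSpace`. This is a hypothetical helper (the class and method names are invented), not Cassandra's actual allocation code:

```java
import java.io.File;
import java.util.List;

public class DiskPicker {
    // Pick a data directory with enough usable space for the write,
    // preferring the emptiest qualifying disk; null means no disk can
    // hold it and the write should not be attempted at all.
    public static File pickDirectory(List<File> dataDirectories, long bytesNeeded) {
        File best = null;
        for (File dir : dataDirectories) {
            long free = dir.getUsableSpace();   // returns 0 for nonexistent paths
            if (free >= bytesNeeded && (best == null || free > best.getUsableSpace()))
                best = dir;
        }
        return best;
    }
}
```

Preferring the disk with the most free space also nudges toward the "balance data relatively evenly between disks" property, though a real implementation would need to account for in-flight flushes and compactions that haven't hit the filesystem yet.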

Ryan King
added a comment - 19/Feb/10 22:29

> > change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled
>
> disagree. you can only do this by uncapping the collection, which is a cure worse than the disease. (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush)

I'll have to think more about this. I agree that we need to not let the queues grow in an unbounded way, but our current setup is arguably worse: all threads can be consumed by one queue, and some of them will wait indefinitely on conditions.

We need to decide which kind of failure we want here. Our node that hit this condition is essentially dead (it's not gossiping or accepting any writes or reads, but is still alive).

> > It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory
>
> yes, this would be my preferred design. should be straightforward code to write, just hasn't been done yet.

I agree. We'll take a look at this early next week.

> > Perhaps we could use the failure detector on disks?
>
> Not sure what this looks like but I agree our story here needs a lot of improvement.

I'm not entirely sure what this looks like either, but here are the properties I'd like Cassandra to have:

if a disk fills up, we stop trying to write to it

if we're about to write more data to a disk than space available, we don't try and write to that disk

we balance data relatively evenly between disks

if a disk is misbehaving for a period of time, we stop using it and assume that data is lost (potentially notify an operator as well)

> Short term my recommendation is to run w/ data files on a single raid0 unless you're sure you'll never get near the filling up point.

This is probably the best advice for new clusters. Unfortunately we can't easily implement this right now.
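The "failure detector on disks" idea is left open in the thread; one possible interpretation is consecutive-error blacklisting, sketched below. The class, method names, and threshold are all invented for illustration:

```java
import java.io.File;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class DiskFailureDetector {
    private static final int MAX_CONSECUTIVE_ERRORS = 3; // assumed threshold
    private final Map<File, AtomicInteger> errors = new ConcurrentHashMap<>();

    // Called when an I/O operation against this directory fails.
    public void recordError(File dir) {
        errors.computeIfAbsent(dir, d -> new AtomicInteger()).incrementAndGet();
    }

    // A successful operation resets the streak -- we only blacklist a disk
    // that is failing persistently, not one with a single transient error.
    public void recordSuccess(File dir) {
        errors.computeIfAbsent(dir, d -> new AtomicInteger()).set(0);
    }

    // A directory past the error threshold is skipped for new writes;
    // at that point an operator notification would also be appropriate.
    public boolean isUsable(File dir) {
        AtomicInteger n = errors.get(dir);
        return n == null || n.get() < MAX_CONSECUTIVE_ERRORS;
    }
}
```

A real detector would likely also consider operation latency (a dying disk often slows down before it errors out), which is what routing timeout information into the dynamic snitch would capture from the cluster's perspective.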

Jonathan Ellis
added a comment - 19/Feb/10 04:00

> change the ThreadPoolExecutor policy to not be caller-runs. This will let other threads make progress in the event that one pool is stalled

disagree. you can only do this by uncapping the collection, which is a cure worse than the disease. (you go back to being able to make a node GC storm to death really really easily when you give it more data than it can flush)

> It appears that there are n threads for n data directories that we flush to, but they're not dedicated to a data directory. We should have a thread per data directory and have that thread dedicated to that directory

yes, this would be my preferred design. should be straightforward code to write, just hasn't been done yet.

> Perhaps we could use the failure detector on disks?

Not sure what this looks like but I agree our story here needs a lot of improvement.

Short term my recommendation is to run w/ data files on a single raid0 unless you're sure you'll never get near the filling-up point.