It is easy to fill up a Ceph cluster (FileStore) by running "rados bench write".

Assuming the full and nearfull failsafe ratios have not been changed from their defaults, the expected behavior of such a test is that the cluster will fill up to 96, 97, perhaps 98%, but no further.

On one cluster, however, it is possible to fill OSDs to 100%, with disastrous consequences. This cluster has 24 OSDs, all on 1TB spinners with external journals on SSDs. The journal partitions are abnormally large (87 GiB).

There is a configuration parameter called osd_failsafe_nearfull_ratio which defaults to 0.90. When the filestore disk usage ratio reaches this point, the OSD state is changed to "near full". The conditional used to determine whether osd_failsafe_nearfull_ratio has been exceeded does not take the journal size into account.
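As a rough illustration, here is a hedged sketch of such a check. The state names mirror the cur_state values discussed below, but the function, its structure, and the assumed 0.97 full default are my inventions, not the actual Ceph code:

    #include <cstdint>

    // Hypothetical sketch, not the actual Ceph implementation.
    enum class FullState { NONE, NEAR, FULL };

    constexpr double NEARFULL_RATIO = 0.90;  // osd_failsafe_nearfull_ratio default
    constexpr double FULL_RATIO     = 0.97;  // assumed osd_failsafe_full_ratio default

    FullState check_failsafe(uint64_t kb_used, uint64_t kb_total) {
      // The ratio is computed from filestore usage alone; whatever is still
      // sitting in the journal (and will land on this disk at the next
      // flush) is not counted.
      double ratio = static_cast<double>(kb_used) / static_cast<double>(kb_total);
      if (ratio >= FULL_RATIO)
        return FullState::FULL;
      if (ratio >= NEARFULL_RATIO)
        return FullState::NEAR;
      return FullState::NONE;
    }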

So, here is what might be happening:

1. The journal is periodically flushed to the underlying filestore.
2. The OSD stats (including "cur_state", which can be "FULL", "NEAR", or "NONE") are updated only before and after the journal flush operation, not during it.
3. When cur_state is "NEAR" or "FULL", the journal flush operation is careful not to fill up the disk, but if it is "NONE", it writes blindly for maximum performance (a toy model of the resulting overshoot follows below).
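To see how far a single blind flush can overshoot, here is a self-contained toy model. The numbers are illustrative, not measurements from this cluster, apart from the 87 GiB journal size:

    #include <cstdio>

    int main() {
      // Toy numbers: a ~1 TB filestore with usage just under the 0.90
      // nearfull ratio, plus the 87 GiB journal from this cluster.
      double total_gib   = 1000.0;
      double used_gib    = 890.0;   // 89%: cur_state would still be "NONE"
      double journal_gib = 87.0;

      double before = used_gib / total_gib;
      double after  = (used_gib + journal_gib) / total_gib;

      std::printf("before flush: %.1f%% used (state NONE, flush writes blindly)\n",
                  100 * before);
      std::printf("after flush:  %.1f%% used (already past a 0.97 full failsafe)\n",
                  100 * after);
      return 0;
    }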

Hence Kefu's suggested fix (see comments below), which is to assume the worst case (full journal) when checking whether the nearfull failsafe ratio has been reached, as part of updating the OSD stats.

The command used to fill up the cluster is "sudo rados -p rbd bench 60000 write --no-cleanup", which starts 16 concurrent threads, each writing 4MB objects to the cluster ("Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 60000 seconds or 0 objects").

When I run the rados bench command on this cluster, the OSDs fill up to 97% before the cluster realizes that something is amiss (first appearance of "[ERR] : OSD full dropping all updates" in the log). Even after this message appears, the usage ratio rises further, to 98%. After that, nothing more is written to the disk.

Closer analysis of the "heartbeat: osd_stat" messages shows the OSD filling up at a linear rate until it hits just over 95% full. At that point the behavior is strange: for a few heartbeats the usage level actually goes down (!).

Interestingly, in this test run, cur_state is never set to "FULL" even though the journal partitions are twice as big as in the first test. Also, it isn't changed to "NEAR" until the usage ratio reaches 92%. Even after that, usage continues to rise to 95% before levelling off.

The cluster fills up linearly until "9870 MB used, 358 MB avail, 10228 MB total" (96.49%) and does not move up (or down) from there until I Ctrl-C the "rados bench" command and start it again a few minutes later. After restarting it, the heartbeat reports a lower usage figure (96.19%).

Yeah, if the journal is abnormally large, then when we sync/flush the journal to the filestore, there is a chance that the utilization percentage of the filestore could reach osd_failsafe_full_ratio before we notice it. Maybe we should set a watermark that takes the worst case into consideration, where the journal is full when we sync the filestore.

In other words, we should be more conservative when evaluating NEAR_FULL, instead of using the filestore utilization alone.
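A minimal sketch of what that more conservative check might look like, reusing the hypothetical names and defaults from the earlier sketch (these are my assumptions, not Kefu's actual patch):

    // Worst-case variant of the earlier check_failsafe() sketch: count the
    // whole journal partition as if it had already been flushed to disk.
    FullState check_failsafe_conservative(uint64_t kb_used, uint64_t kb_total,
                                          uint64_t journal_kb) {
      double ratio = static_cast<double>(kb_used + journal_kb) /
                     static_cast<double>(kb_total);
      if (ratio >= FULL_RATIO)
        return FullState::FULL;
      if (ratio >= NEARFULL_RATIO)
        return FullState::NEAR;
      return FullState::NONE;
    }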

@Kefu, first of all, thank you very much for looking at this. Keep in mind that the only way I have been able to trigger the bug so far is with an irregularly sized journal (number of sectors not evenly divisible by 2048).

In other words:

- when the journal size (in sectors) is evenly divisible by 2048, the "write-to-full" test works as expected: the cluster fills up to 95% and the usage ratio does not rise any further;

- when the journal size is not evenly divisible by 2048, the buggy behavior is seen: the cluster fills to 98%, possibly even to 100% (see the alignment check after this list).
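For scale: assuming 512-byte sectors (my assumption about this hardware), 2048 sectors is exactly 1 MiB, so the distinction above is whether the journal size is 1 MiB-aligned. A trivial check:

    #include <cstdint>
    #include <cstdio>

    // Assuming 512-byte sectors, 2048 sectors = 1 MiB, so "evenly divisible
    // by 2048" is the same as the journal size being 1 MiB-aligned.
    bool is_1mib_aligned(uint64_t sectors) { return sectors % 2048 == 0; }

    int main() {
      uint64_t aligned   = 182452224;  // exactly 87 GiB = 87 * 1024 * 2048 sectors
      uint64_t unaligned = 182452225;  // 87 GiB plus one extra sector
      std::printf("%llu -> %s\n", (unsigned long long)aligned,
                  is_1mib_aligned(aligned) ? "aligned" : "not aligned");
      std::printf("%llu -> %s\n", (unsigned long long)unaligned,
                  is_1mib_aligned(unaligned) ? "aligned" : "not aligned");
      return 0;
    }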

That, combined with the fact that the journal is abnormally large (and full) in both cases, would indicate that the bug is more subtle, right? Still, I will try your patch and report back.