Category: libcrush

The most common cause of unevenness is when only a few thousands PGs, or less, are mapped. This is not enough samples and the variations can be as high as 25%. For instance, when there are two OSDs with the same weight, distributing randomly four PGs among them may lead to one OSD having three PGs and the other only one. This problem would be resolved by having at least thousands of PGs per OSD, but that is not recommended because it would require too many resources.

The other cause of uneven distribution is conditional probability. For a two-replica pool, PGs are mapped to OSDs that must be different: the second OSD is chosen at random, on the condition that it is not the same as the first OSD. When all OSDs have the same probability, this bias is not significant. But when OSDs have different weights it makes a difference. For instance, given nine OSDs with weight 1 and one OSD with weight 5, the smaller OSDs will be overfilled (from 7% to 10%) and the bigger OSD will be ~15% underfilled.

The proposed algorithm fixes both cases by producing new weights that can be used as a weight set in Luminous clusters.

For a given pool the input parameters are:

pool size

numeric id

number of PGs

root of the CRUSH rule (take step)

the CRUSH rule itself

for size in [1,pool size]
copy all weights to the size - 1 weight set
recursively walk the root
for each bucket at a given level in the hierarchy
repeat until the difference with the expected distribution is small
map all PGs in the pool, with size instead of pool size
in the size - 1 weight set
lower the weight of the most overfilled child
increase the weight of the most underfilled child

It is common to change the size of a pool in a Ceph cluster. When increasing the size from 2 to 3, the user expects the existing objects to stay where they are, with new objects being created to provide an additional replica. To preserve this property while optimizing the weights, there needs to be a different weight set for each possible size. This is what the outer loop (for size in [1,pool size]) does.

If the distribution is not as expected at the highest level of the hierarchy, there is no way to fix that at the lowest levels. For instance if a host receives 100 more PGs that it should, the OSDs it contains will inevitably be overfilled. This is why the optimization proceeds from the top of the hierarchy.

When a bucket is given precisely the expected number of PGs and fails to distribute them evenly among its children, the children’s weights can be modified to get closer to the ideal distribution. Increasing the weight of the most underfilled item will capture PGs from the other buckets. And decreasing the weight of the most overfilled will push PGs out of it. A simulation is run to determine precisely which PGs will be distributed to which item because there is no known mathematical formula to calculate that. This is why all PGs are mapped to determine which items are over- or underfilled.

meaning Placement Groups (PGs) will place replicas on different hosts so the cluster can sustain the failure of any host without losing data. The missing replicas are then restored from the surviving replicas (via a process called “backfilling”) and placed on the remaining hosts.

To make sure there is enough space in the cluster to cope with the failure of a host, a certain percentage of free space in the cluster is reserved (from the beginning) and never used. This percentage needs to be adjusted to take into account the most overfull OSD in case the PG distribution is not even.

For instance, in a cluster with five hosts containing two identical disks each, reserving 20% plus 6.45% to account for the most overfull OSD displayed below should be enough:

However this uneven distribution will be different when a host is removed, because that causes a change in the most overfull OSD – and the new most overfull OSD may even be worse (more overfull) than the previous one. In our example cluster, device9 was the most (6.45%) overfull. If the host containing device8 and device9 is removed, device5 becomes the most overfull OSD, and it is worse (8.98%):

It would therefore be better to reserve 28.98% instead of 26.45% to make sure the cluster does not become too full after a host failure. To help with that, the crush analyze command was modified to display the worst case scenario for each bucket type in the crushmap:

In a Ceph cluster with a single pool of 1024 Placement Groups (PGs), the PG distribution among devices will not be as expected. (see Predicting Ceph PG placement for details about this uneven distribution).

In the following, the difference between the expected number of PGs on a given host or device and the actual number of PGs can be as high as 25%. For instance, device4 has a weight of 1 and is expected to get 128 PGs but it only gets 96. And host2 (containing device4) has 83 less PGs than expected.

For each host or device, we now have two weights. First, there is the target weight which is used to calculate the expected number of PGs. For instance, host0 should get 1024 * (target weight host0 = 2 / (target weight host0 = 2 + target weight host1 = 2 + target weight host3 = 4)) PGs, that is 1024 * ( 2 / ( 2 + 2 + 4 ) ) = 256. The second weight is weight set, which is used to choose where a PGs is mapped.

The problem is that the crushmap format only allows one weight per host or device. Although it would be possible to store the target weight outside of the crushmap, it would be inconvenient and difficult to understand. Instead, the crushmap syntax was extended to allow additional weights (via weight set) to be associated with each item (i.e. disk or host in the example below).

The CRUSH function maps Ceph placement groups (PGs) and objects to OSDs. It is used extensively in Ceph clients and daemons as well as in the Linux kernel modules and its CPU cost should be reduced to the minimum.

It is common to define the Ceph CRUSH map so that PGs use OSDs on different hosts. The hierarchy can be as simple as 100 hosts, each containing 7 OSDs. When determining which OSDs belong to a given PG, in this scenario the CRUSH function will hash each of the 100 hosts to the PG, compare the results, and select the OSDs that score the highest. That is 100 + 7 calls to the hash function.

If the hierarchy is modified to have 20 racks, each containing 5 hosts with 7 OSDs per host, we only have 20 + 5 + 7 calls to the hash function. The first 20 calls are to select the rack, the next 5 to select the host in the rack and the last 7 to select the OSD within the host.

In other words, this hierarchical subdivision of the CRUSH map reduces the number of calls to the hash function from 107 to 32. It does not have any influence on how the PGs are mapped to OSDs but it saves CPU.