
The CRUSH algorithm
determines how to store and retrieve data by computing data storage locations.
CRUSH empowers Ceph clients to communicate with OSDs directly rather than
through a centralized server or broker. With an algorithmically determined
method of storing and retrieving data, Ceph avoids a single point of failure, a
performance bottleneck, and a physical limit to its scalability.

CRUSH maps contain a list of OSDs, a list of
‘buckets’ for aggregating the devices into physical locations, and a list of
rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By
reflecting the underlying physical organization of the installation, CRUSH can
model—and thereby address—potential sources of correlated device failures.
Typical sources include physical proximity, a shared power source, and a shared
network. By encoding this information into the cluster map, CRUSH placement
policies can separate object replicas across different failure domains while
still maintaining the desired distribution. For example, to address the
possibility of concurrent failures, it may be desirable to ensure that data
replicas are on devices using different shelves, racks, power supplies,
controllers, and/or physical locations.

When you deploy OSDs, they are automatically placed within the CRUSH map under a
host node named after the hostname of the host on which they run. This,
combined with the default CRUSH failure domain, ensures that replicas or erasure
code shards are separated across hosts, so that a single host failure will not
affect availability. For larger clusters, however, administrators should
carefully consider their choice of failure domain. Separating replicas across
racks, for example, is common for mid- to large-sized clusters.

The location of an OSD in terms of the CRUSH map’s hierarchy is
referred to as a CRUSH location. This location specifier takes the
form of a list of key=value pairs describing a position. For
example, if an OSD is in a particular row, rack, chassis and host, and
is part of the ‘default’ CRUSH tree (this is the case for the vast
majority of clusters), its CRUSH location could be described as:

root=default row=a rack=a2 chassis=a2a host=a2a1

Note:

The order of the keys does not matter.

The key name (left of =) must be a valid CRUSH type. By default
these include root, datacenter, room, row, pod, pdu, rack, chassis and host,
but those types can be customized to be anything appropriate by modifying
the CRUSH map.

Not all keys need to be specified. For example, by default, Ceph
automatically sets a ceph-osd daemon’s location to be
root=default host=HOSTNAME (based on the output of hostname -s).

The CRUSH location for an OSD is normally expressed via the crush location
config option being set in the ceph.conf file. Each time the OSD starts,
it verifies that it is in the correct location in the CRUSH map and, if it is
not, it moves itself. To disable this automatic CRUSH map management, add the
following to your configuration file in the [osd] section:
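The relevant option is osd crush update on start; a minimal fragment might look like:

```
[osd]
osd crush update on start = false
```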

A customized location hook can be used to generate a more complete
crush location on startup. The crush location is based on, in order
of preference:

A crush location option in ceph.conf.

A default of root=default host=HOSTNAME, where the hostname is
taken from the hostname -s command.

This is not useful by itself, as the OSD itself has the exact same
behavior. However, a script can be written to provide additional
location fields (for example, the rack or datacenter), and then the
hook enabled via the config option:

crush location hook = /path/to/customized-ceph-crush-location

This hook is passed several arguments (below) and should output a single line
to stdout with the CRUSH location description:

--cluster CLUSTER --id ID --type TYPE

where the cluster name is typically ‘ceph’, the id is the daemon
identifier (e.g., the OSD number), and the daemon type is osd,
mds, or similar.

For example, a simple hook that additionally specifies a rack location
based on a hypothetical file /etc/rack might be:
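One possible sketch of such a hook, assuming the rack name is stored in /etc/rack (a hypothetical file, as noted above); the CRUSH_RACK_FILE override is purely for illustration and testing:

```shell
#!/bin/sh
# Hypothetical CRUSH location hook (illustrative sketch).
# Ceph invokes the hook with --cluster, --id, and --type arguments;
# this simple version ignores them and prints exactly one line to stdout.
rack_file="${CRUSH_RACK_FILE:-/etc/rack}"   # assumed rack-name file
if [ -r "$rack_file" ]; then
    # Add a rack field read from the file.
    echo "host=$(hostname -s) rack=$(cat "$rack_file") root=default"
else
    # Fall back to the default location when no rack file exists.
    echo "host=$(hostname -s) root=default"
fi
```

The hook would then be installed at the path given by the crush location hook option and made executable.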

The CRUSH map consists of, loosely speaking, a hierarchy describing
the physical topology of the cluster, and a set of rules defining
policy about how we place data on those devices. The hierarchy has
devices (ceph-osd daemons) at the leaves, and internal nodes
corresponding to other physical features or groupings: hosts, racks,
rows, datacenters, and so on. The rules describe how replicas are
placed in terms of that hierarchy (e.g., ‘three replicas in different
racks’).

Devices are individual ceph-osd daemons that can store data. You
will normally have one defined here for each OSD daemon in your
cluster. Devices are identified by an id (a non-negative integer) and
a name, normally osd.N where N is the device id.

Devices may also have a device class associated with them (e.g.,
hdd or ssd), allowing them to be conveniently targeted by a
CRUSH rule.

A bucket is the CRUSH term for internal nodes in the hierarchy: hosts,
racks, rows, etc. The CRUSH map defines a series of types that are
used to describe these nodes. By default, these types include:

osd (or device)

host

chassis

rack

row

pdu

pod

room

datacenter

region

root

Most clusters make use of only a handful of these types, and others
can be defined as needed.

The hierarchy is built with devices (normally type osd) at the
leaves, interior nodes with non-device types, and a root node of type
root. For example,
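A minimal illustrative fragment of such a hierarchy, as it might appear in a decompiled CRUSH map (the names and ids here are hypothetical):

```
host node1 {
    id -2                      # bucket ids are negative
    alg straw2
    hash 0                     # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
root default {
    id -1
    alg straw2
    hash 0
    item node1 weight 2.000
}
```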

Each node (device or bucket) in the hierarchy has a weight
associated with it, indicating the relative proportion of the total
data that device or hierarchy subtree should store. Weights are set
at the leaves, indicating the size of the device, and automatically
sum up the tree from there, such that the weight of the default node
will be the total of all devices contained beneath it. Normally
weights are in units of terabytes (TB).

You can get a simple view of the CRUSH hierarchy for your cluster,
including the weights, with:
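The command in question is:

```
ceph osd tree
```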

Rules define policy about how data is distributed across the devices
in the hierarchy.

CRUSH rules define placement and replication strategies or
distribution policies that allow you to specify exactly how CRUSH
places object replicas. For example, you might create a rule selecting
a pair of targets for 2-way mirroring, another rule for selecting
three targets in two different data centers for 3-way mirroring, and
yet another rule for erasure coding over six storage devices. For a
detailed discussion of CRUSH rules, refer to CRUSH - Controlled,
Scalable, Decentralized Placement of Replicated Data, and more
specifically to Section 3.2.

In almost all cases, CRUSH rules can be created via the CLI by
specifying the pool type they will be used for (replicated or
erasure coded), the failure domain, and optionally a device class.
In rare cases rules must be written by hand by manually editing the
CRUSH map.
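For example, a replicated rule might be created with (the rule name fast here is hypothetical):

```
ceph osd crush rule create-replicated fast default host ssd
```

where the arguments are the rule name, the root node, the failure domain type, and an optional device class.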

Device classes are implemented by creating a “shadow” CRUSH hierarchy
for each device class in use that contains only devices of that class.
Rules can then distribute data over the shadow hierarchy. One nice
thing about this approach is that it is fully backward compatible with
old Ceph clients. You can view the CRUSH hierarchy with shadow items
with:
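That is:

```
ceph osd crush tree --show-shadow
```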

A weight set is an alternative set of weights to use when
calculating data placement. The normal weights associated with each
device in the CRUSH map are set based on the device size and indicate
how much data we should be storing where. However, because CRUSH is
based on a pseudorandom placement process, there is always some
variation from this ideal distribution, the same way that rolling a
dice sixty times will not result in rolling exactly 10 ones and 10
sixes. Weight sets allow the cluster to do a numerical optimization
based on the specifics of your cluster (hierarchy, pools, etc.) to achieve
a balanced distribution.

There are two types of weight sets supported:

A compat weight set is a single alternative set of weights for
each device and node in the cluster. This is not well-suited for
correcting for all anomalies (for example, placement groups for
different pools may be different sizes and have different load
levels, but will be mostly treated the same by the balancer).
However, compat weight sets have the huge advantage that they are
backward compatible with previous versions of Ceph, which means
that even though weight sets were first introduced in Luminous
v12.2.z, older clients (e.g., firefly) can still connect to the
cluster when a compat weight set is being used to balance data.

A per-pool weight set is more flexible in that it allows
placement to be optimized for each data pool. Additionally,
weights can be adjusted for each position of placement, allowing
the optimizer to correct for a subtle skew of data toward devices
with small weights relative to their peers (an effect that is
usually only apparent in very large clusters but which can cause
balancing problems).

When weight sets are in use, the weights associated with each node in
the hierarchy are visible as a separate column (labeled either
(compat) or the pool name) in the output of the command:

ceph osd crush tree

When both compat and per-pool weight sets are in use, data
placement for a particular pool will use its own per-pool weight set
if present. If not, it will use the compat weight set if present. If
neither are present, it will use the normal CRUSH weights.

Although weight sets can be set up and manipulated by hand, it is
recommended that the balancer module be enabled to do so
automatically.

Per-pool weight sets require that all servers and daemons
run Luminous v12.2.z or later.
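The parameters documented below belong to the weight set creation commands, which take the form:

```
ceph osd crush weight-set create-compat
ceph osd crush weight-set create {pool-name} {mode}
```

The first creates a compat weight set; the second creates a per-pool weight set.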

Where:

pool-name

Description: The name of a RADOS pool.
Type: String
Required: Yes
Example: rbd

mode

Description: Either flat or positional. A flat weight set
has a single weight for each device or bucket. A
positional weight set has a potentially different
weight for each position in the resulting placement
mapping. For example, if a pool has a replica count of
3, then a positional weight set will have three weights
for each device and bucket.

For a replicated pool, the primary decision when creating the CRUSH
rule is what the failure domain is going to be. For example, if a
failure domain of host is selected, then CRUSH will ensure that
each replica of the data is stored on a different host. If rack
is selected, then each replica will be stored in a different rack.
What failure domain you choose primarily depends on the size of your
cluster and how your hierarchy is structured.

Normally, the entire cluster hierarchy is nested beneath a root node
named default. If you have customized your hierarchy, you may
want to create a rule nested at some other node in the hierarchy. It
doesn’t matter what type is associated with that node (it doesn’t have
to be a root node).

It is also possible to create a rule that restricts data placement to
a specific class of device. By default, Ceph OSDs automatically
classify themselves as either hdd or ssd, depending on the
underlying type of device being used. These classes can also be
customized.
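For example, a device’s class can be changed by removing the old class and setting a new one (the class name nvme and the OSD id here are illustrative):

```
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0
```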

For an erasure-coded pool, the same basic decisions need to be made as
with a replicated pool: what is the failure domain, what node in the
hierarchy will data be placed under (usually default), and will
placement be restricted to a specific device class. Erasure code
pools are created a bit differently, however, because they need to be
constructed carefully based on the erasure code being used. For this reason,
you must include this information in the erasure code profile. A CRUSH
rule will then be created from that either explicitly or automatically when
the profile is used to create a pool.

The erasure code profiles can be listed with:

ceph osd erasure-code-profile ls

An existing profile can be viewed with:

ceph osd erasure-code-profile get {profile-name}

Normally profiles should never be modified; instead, a new profile
should be created and used when creating a new pool or creating a new
rule for an existing pool.
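For example, a new profile might be created with (the profile name and parameter values here are illustrative):

```
ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack
```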

An erasure code profile consists of a set of key=value pairs. Most of
these control the behavior of the erasure code that is encoding data
in the pool. Those that begin with crush-, however, affect the
CRUSH rule that is created.

The erasure code profile properties of interest are:

crush-root: the name of the CRUSH node under which to place data [default: default].

crush-failure-domain: the CRUSH bucket type used to separate erasure code shards [default: host].

crush-device-class: the device class on which to place data, such as ssd or hdd [default: none, meaning all devices are used].

Over time, we have made (and continue to make) improvements to the
CRUSH algorithm used to calculate the placement of data. In order to
support the change in behavior, we have introduced a series of tunable
options that control whether the legacy or improved variation of the
algorithm is used.

In order to use newer tunables, both clients and servers must support
the new version of CRUSH. For this reason, we have created
profiles that are named after the Ceph version in which they were
introduced. For example, the firefly tunables are first supported
in the firefly release, and will not work with older (e.g., dumpling)
clients. Once a given set of tunables are changed from the legacy
default behavior, the ceph-mon and ceph-osd will prevent older
clients who do not support the new CRUSH features from connecting to
the cluster.

For hierarchies with a small number of devices in the leaf buckets,
some PGs map to fewer than the desired number of replicas. This
commonly happens for hierarchies with “host” nodes with a small
number (1-3) of OSDs nested beneath each one.

For large clusters, a small percentage of PGs map to fewer than
the desired number of OSDs. This is more prevalent when there are
several layers in the hierarchy (e.g., row, rack, host, osd).

When some OSDs are marked out, the data tends to get redistributed
to nearby OSDs instead of across the entire hierarchy.

The new tunables are:

choose_local_tries: Number of local retries. Legacy value is
2, optimal value is 0.

choose_local_fallback_tries: Legacy value is 5, optimal value
is 0.

choose_total_tries: Total number of attempts to choose an item.
The legacy value was 19, but subsequent testing indicates that a value
of 50 is more appropriate for typical clusters. For extremely large
clusters, a larger value might be necessary.

chooseleaf_descend_once: Whether a recursive chooseleaf attempt
will retry, or only try once and allow the original placement to
retry. Legacy default is 0, optimal value is 1.

Migration impact:

Moving from argonaut to bobtail tunables triggers a moderate amount
of data movement. Use caution on a cluster that is already
populated with data.

The firefly tunable profile fixes a problem
with the chooseleaf CRUSH rule behavior that tends to result in PG
mappings with too few results when too many OSDs have been marked out.

The new tunable is:

chooseleaf_vary_r: Whether a recursive chooseleaf attempt will
start with a non-zero value of r, based on how many attempts the
parent has already made. Legacy default is 0, but with this value
CRUSH is sometimes unable to find a mapping. The optimal value (in
terms of computational cost and correctness) is 1.

Migration impact:

For existing clusters that have lots of existing data, changing
from 0 to 1 will cause a lot of data to move; a value of 4 or 5
will allow CRUSH to find a valid mapping but will make less data
move.

There were some problems with the internal weights calculated and
stored in the CRUSH map for straw buckets. Specifically, when
there were items with a CRUSH weight of 0, or a mix of different and
duplicated weights, CRUSH would distribute data incorrectly (i.e.,
not in proportion to the weights).

The new tunable is:

straw_calc_version: A value of 0 preserves the old, broken
internal weight calculation; a value of 1 fixes the behavior.

Migration impact:

Moving to straw_calc_version 1 and then adjusting a straw bucket
(by adding, removing, or reweighting an item, or by using the
reweight-all command) can trigger a small to moderate amount of
data movement if the cluster has hit one of the problematic
conditions.

This tunable option is special because it has no impact on the
kernel version required on the client side.

The hammer tunable profile does not affect the
mapping of existing CRUSH maps simply by changing the profile. However:

There is a new bucket type (straw2) supported. The new
straw2 bucket type fixes several limitations in the original
straw bucket. Specifically, the old straw buckets would
change some mappings that should not have changed when a weight was
adjusted, while straw2 achieves the original goal of only
changing mappings to or from the bucket item whose weight has
changed.

straw2 is the default for any newly created buckets.

Migration impact:

Changing a bucket type from straw to straw2 will result in
a reasonably small amount of data movement, depending on how much
the bucket item weights vary from each other. When the weights are
all the same no data will move, and when item weights vary
significantly there will be more movement.

The jewel tunable profile improves the
overall behavior of CRUSH such that significantly fewer mappings
change when an OSD is marked out of the cluster.

The new tunable is:

chooseleaf_stable: Whether a recursive chooseleaf attempt will
use a better value for an inner loop that greatly reduces the number
of mapping changes when an OSD is marked out. The legacy value is 0,
while the new value of 1 uses the new approach.

Migration impact:

Changing this value on an existing cluster will result in a very
large amount of data movement as almost every PG mapping is likely
to change.

Starting with version v0.74, Ceph will issue a health warning if the
current CRUSH tunables don’t include all the optimal values from the
default profile (see below for the meaning of the default profile).
To make this warning go away, you have two options:

Adjust the tunables on the existing cluster. Note that this will
result in some data movement (possibly as much as 10%). This is the
preferred route, but should be taken with care on a production cluster
where the data movement may affect performance. You can enable optimal
tunables with:

ceph osd crush tunables optimal

If things go poorly (e.g., too much load) and not very much
progress has been made, or there is a client compatibility problem
(old kernel cephfs or rbd clients, or pre-bobtail librados
clients), you can switch back with:

ceph osd crush tunables legacy

You can make the warning go away without making any changes to CRUSH by
adding the following option to your ceph.conf [mon] section:

mon warn on legacy crush tunables = false

For the change to take effect, you will need to restart the monitors, or
apply the option to running monitors with:
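A sketch of the latter, via injectargs (exact quoting may vary by shell):

```
ceph tell mon.\* injectargs --no-mon-warn-on-legacy-crush-tunables
```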

Adjusting these values will result in the shift of some PGs between
storage nodes. If the Ceph cluster is already storing a lot of
data, be prepared for some fraction of the data to move.

The ceph-osd and ceph-mon daemons will start requiring the
feature bits of new connections as soon as they get
the updated map. However, already-connected clients are
effectively grandfathered in, and will misbehave if they do not
support the new feature.

If the CRUSH tunables are set to non-legacy values and then later
changed back to the default values, ceph-osd daemons will not be
required to support the feature. However, the OSD peering process
requires examining and understanding old maps. Therefore, you
should not run old versions of the ceph-osd daemon
if the cluster has previously used non-legacy CRUSH values, even if
the latest version of the map has been switched back to using the
legacy defaults.

The simplest way to adjust the CRUSH tunables is by changing to a known
profile. Those are:

legacy: the legacy behavior from argonaut and earlier.

argonaut: the legacy values supported by the original argonaut release

bobtail: the values supported by the bobtail release

firefly: the values supported by the firefly release

hammer: the values supported by the hammer release

jewel: the values supported by the jewel release

optimal: the best (i.e., optimal) values of the current version of Ceph

default: the default values of a new cluster installed from
scratch. These values, which depend on the current version of Ceph,
are hard coded and are generally a mix of optimal and legacy values.
These values generally match the optimal profile of the previous
LTS release, or the most recent release for which we expect most
users to have up-to-date clients.

When a Ceph Client reads or writes data, it always contacts the primary OSD in
the acting set. In the acting set [2, 3, 4], for example, osd.2 is the
primary. Sometimes an
OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
a slow disk or a slow controller). To prevent performance bottlenecks
(especially on read operations) while maximizing utilization of your hardware,
you can set a Ceph OSD’s primary affinity so that CRUSH is less likely to use
the OSD as a primary in an acting set.

ceph osd primary-affinity <osd-id> <weight>

Primary affinity is 1 by default (i.e., an OSD may act as a primary). You
may set an OSD’s primary affinity to a value in the range 0-1, where 0 means
that the OSD may NOT be used as a primary and 1 means that the OSD may be
used as a primary. When the weight is less than 1, it is less likely that
CRUSH will select the Ceph OSD Daemon to act as a primary.
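For example, to make a hypothetical osd.2 less likely to be chosen as primary:

```
ceph osd primary-affinity osd.2 0.5
```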