.. title: Ceph CRUSH map with multiple storage tiers
.. slug: ceph-crush-map-with-multiple-storage-tiers
.. date: 2018-01-24 16:40:01 UTC+01:00
.. tags: ceph
.. category:
.. link:
.. description:
.. type: text

At work, we're running a virtualization server that has two kinds of storage built in: an array of fast SAS disks, and another of
slow-but-huge SATA disks. We're running OSDs on both of them, and I wanted to distinguish between them when creating RBD images, so that
I could choose the performance characteristics of each pool. This post may already be outdated by now (January 2018): recent Ceph
releases introduced a device "class" field in the CRUSH map that addresses the same problem. However, here's what we're currently running.
.. TEASER_END

We currently have two pools::

    pool 2 'rbd-rmvh-sas' replicated size 2 min_size 1 crush_rule 1 [...]
    pool 4 'rbd-rmvh-sata' replicated size 2 min_size 1 crush_rule 2 [...]

``rbd-rmvh-sas`` is the one on the SAS disks; ``rbd-rmvh-sata`` resides on the SATA disks. This placement is achieved by carefully
positioning the nodes in the CRUSH map, combined with two purpose-built CRUSH rules that place replicas accordingly.

Our Ceph CRUSH tree (as printed by ``ceph osd tree``) currently looks like this::

    ID  CLASS WEIGHT   TYPE NAME                  STATUS REWEIGHT PRI-AFF
    -10       10.00000 root sata
    -11       10.00000     rack rmvh-sata
     -8        5.00000         host rmvh002-sata
      4   hdd  5.00000             osd.4              up  1.00000 1.00000
    -12        5.00000         host rmvh003-sata
      6   hdd  5.00000             osd.6              up  1.00000 1.00000
     -1        6.00000 root default
     -7        6.00000     rack rmvh-sas
     -5        3.00000         host rmvh002
      3   hdd  3.00000             osd.3              up  1.00000 1.00000
     -9        3.00000         host rmvh003
      5   hdd  3.00000             osd.5              up  1.00000 1.00000

Note that each OSD resides on a RAID array, not just a single disk. RAID controllers have caches, caches hide latency, and we hate latency,
so we love caches; hence, RAID. This means we only have two OSDs per node.

The trick is having a second hierarchy of buckets in the CRUSH map: one for each kind of storage. If we were to add SSDs into the picture,
we'd create another hierarchy with an ``-ssd`` suffix. Unfortunately, if you try to do this with the default settings in ``ceph.conf``, you'll
find that on startup, OSDs move themselves into the ``host=$HOSTNAME`` bucket of the node where they're running. This is usually what you
want, but not in this scenario: we want the OSDs in the ``-sata`` hosts to stay there and not move into the SAS hierarchy upon every restart,
which would likely cause them to discard their data and replicate other data that's not meant for them.
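
If the second hierarchy doesn't exist yet, it can be built at runtime with the ``ceph osd crush`` family of commands. A minimal sketch,
using the bucket names and weights from the tree above (the second host works the same way)::

    # Create the new root, rack and host buckets:
    ceph osd crush add-bucket sata root
    ceph osd crush add-bucket rmvh-sata rack
    ceph osd crush add-bucket rmvh002-sata host

    # Wire them up into a hierarchy:
    ceph osd crush move rmvh-sata root=sata
    ceph osd crush move rmvh002-sata rack=rmvh-sata

    # Place the OSD into the new host bucket, with its CRUSH weight:
    ceph osd crush set osd.4 5.0 host=rmvh002-sata
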

So, for this kind of setup, you'll want to have the following option in ``ceph.conf``::

    [global]
    osd crush update on start = false

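
You can check whether a running OSD actually picked up the option through its admin socket (the OSD id here is just an example)::

    ceph daemon osd.4 config get osd_crush_update_on_start
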
Now, all that's left to do is create a ruleset that only chooses OSDs from the SAS hierarchy::

    rule rmvh-sas-ruleset {
        id 1
        type replicated
        min_size 1
        max_size 10
        step take rmvh-sas
        step chooseleaf firstn 0 type host
        step emit
    }

And another one for the SATA hierarchy::

    rule rmvh-sata-ruleset {
        id 2
        type replicated
        min_size 1
        max_size 10
        step take rmvh-sata
        step chooseleaf firstn 0 type host
        step emit
    }

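One way to get these rules into the cluster is to export the CRUSH map, decompile it with ``crushtool``, edit it, and inject it back::

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # add the rules shown above to crushmap.txt, then recompile:
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new
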
And create pools that use those rulesets. Done!
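
For example (the placement-group counts here are made up; pick values that fit your cluster)::

    ceph osd pool create rbd-rmvh-sas 128 128 replicated rmvh-sas-ruleset
    ceph osd pool create rbd-rmvh-sata 128 128 replicated rmvh-sata-ruleset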