Managing Datacenter Changes with a Ceph Deployment

Marek Glinski

August 02, 2017 13:48

If you use a Ceph deployment for storage in your cloud and are planning maintenance or upgrade activities in your datacenter, take care when performing operations that make Ceph nodes temporarily unavailable.

To perform a maintenance task, you may need to interrupt power or network connections, causing at least one Ceph node to go offline. Avoid taking more than one Ceph node offline at a time, especially if you have 10 or fewer nodes in total. Never allow three Ceph nodes to be unavailable at the same time; doing so makes data loss in the Ceph pool highly likely.
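The guidance above can be expressed as a simple check to run before maintenance. The following is a minimal sketch, not a Metacloud or Ceph tool: the function name `assess_maintenance_plan`, its parameters, and the result labels are hypothetical. The thresholds come from the paragraph above, and treating more than three offline nodes the same as three is an assumption.

```python
def assess_maintenance_plan(total_nodes: int, nodes_offline: int) -> str:
    """Classify a maintenance plan using the thresholds in this article.

    Hypothetical helper for illustration; not part of Ceph or Metacloud.
    """
    if total_nodes < 1 or not 0 <= nodes_offline <= total_nodes:
        raise ValueError("node counts are out of range")
    if nodes_offline >= 3:
        # Three nodes offline at once makes data loss highly likely.
        # Assumption: more than three is treated the same way.
        return "unsafe"
    if nodes_offline == 2:
        # More than one node offline should be avoided,
        # especially with 10 or fewer nodes in total.
        return "high-risk" if total_nodes <= 10 else "avoid"
    # Zero or one node offline at a time.
    return "ok"


if __name__ == "__main__":
    for offline in (1, 2, 3):
        print(offline, assess_maintenance_plan(total_nodes=8, nodes_offline=offline))
```

A check like this is only a planning aid. It does not replace coordinating the work with Metacloud Support.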

Metacloud Hypervisors (MHVs) serve as nodes in a Ceph cluster, with multiple disks and object storage daemons (OSDs) running on each MHV for redundancy. Data objects are mapped to logical constructs called placement groups (PGs), which are grouped into pools so that data replication scales well. When an MHV goes offline and its OSDs become unavailable, Ceph automatically remaps the PGs and rebalances data across the remaining OSDs.

If too many OSDs become unavailable at the same time, PGs become degraded, which can result in data loss. Because Ceph continues to run in a degraded state, the problem can occur without any immediately noticeable disruption to clients accessing the cluster.
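Because a degraded cluster can look healthy to clients, it helps to check explicitly which hosts have OSDs that are not up. The following is a minimal sketch under stated assumptions: it expects input shaped like the JSON from `ceph osd tree --format json` (a `nodes` list whose entries carry `id`, `name`, `type`, `children` for hosts, and `status` for OSDs). That shape is an assumption here and should be verified against your Ceph release. The function name and the host names in the sample are hypothetical, and whether you can run Ceph commands yourself depends on your deployment.

```python
from __future__ import annotations


def hosts_with_down_osds(osd_tree: dict) -> list[str]:
    """Return the sorted names of hosts that have at least one OSD not 'up'.

    Assumes the dict mirrors `ceph osd tree --format json` output
    (assumed shape; verify against your Ceph release).
    """
    nodes = {node["id"]: node for node in osd_tree.get("nodes", [])}
    affected = []
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        children = [nodes[c] for c in node.get("children", []) if c in nodes]
        if any(c.get("type") == "osd" and c.get("status") != "up" for c in children):
            affected.append(node["name"])
    return sorted(affected)


# Hypothetical sample data for illustration only.
sample_tree = {
    "nodes": [
        {"id": -2, "name": "mhv-01", "type": "host", "children": [0, 1]},
        {"id": -3, "name": "mhv-02", "type": "host", "children": [2, 3]},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "up"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "down"},
        {"id": 3, "name": "osd.3", "type": "osd", "status": "up"},
    ]
}

if __name__ == "__main__":
    print(hosts_with_down_osds(sample_tree))
```

If a host already appears in this list, taking another node offline for maintenance adds to the number of unavailable OSDs, which is the situation described above.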

Inform Metacloud Support when you plan any maintenance work in your datacenter. The team can work with you to control disruptions to Ceph nodes. To contact the team, submit a P4 request on this site.