Ceph is a distributed storage system that provides data access as
files, objects and blocks. As part of this project, we’re interested in
integrating Ceph’s block device (RBD) directly with QEMU/KVM.

Primary components/daemons of Ceph:
- Monitor - Serves as the authentication point for clients.
- Metadata - Stores all the filesystem metadata (not configured here, as
it is not required for RBD).
- OSD - Object storage devices. One daemon runs for each drive/location.

Currently, Ganeti supports RBD volumes on a pre-configured Ceph cluster.
This is enabled through RBD disk templates. These templates provide
access to RBD volumes through the RBD Linux driver. The volumes are
mapped to the host as local block devices, which are then attached to
the instances. This method incurs additional overhead. We plan to
resolve it by using QEMU’s RBD driver to enable direct access to RBD
volumes for KVM instances.

Also, Ganeti currently uses RBD volumes on a pre-configured Ceph cluster.
Allowing configuration of Ceph nodes through Ganeti would be a good
addition to its prime features.

A new disk param, access, is introduced. It’s added at the
cluster/node-group level to simplify the prototype implementation.
It specifies the access method as either userspace or kernelspace,
and it’s accessible to StartInstance() in hv_kvm.py. The device
path, rbd:<pool>/<vol_name>, is generated by RADOSBlockDevice
and is added to the params dictionary as kvm_dev_path.
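
The path construction described above can be sketched as follows. This
is a minimal illustration only: the helper names rbd_kvm_dev_path and
add_kvm_dev_path are hypothetical and are not taken from Ganeti's code;
only the rbd:<pool>/<vol_name> format and the kvm_dev_path key come from
this design.

```python
def rbd_kvm_dev_path(pool, vol_name):
    """Format the device path in the rbd:<pool>/<vol_name> form.

    Hypothetical helper, for illustration only.
    """
    return "rbd:%s/%s" % (pool, vol_name)


def add_kvm_dev_path(params, pool, vol_name):
    """Return a copy of params with kvm_dev_path filled in.

    The input dictionary is left untouched so callers can keep
    using the original disk parameters.
    """
    updated = dict(params)
    updated["kvm_dev_path"] = rbd_kvm_dev_path(pool, vol_name)
    return updated


# Example with made-up pool and volume names.
disk_params = add_kvm_dev_path({"access": "userspace"}, "rbd", "vol1")
```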

This approach ensures that no disk-template-specific changes are
required in hv_kvm.py, allowing easy integration of other distributed
storage systems (like Gluster).
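
On the hypervisor side, the selection could look roughly like the sketch
below. The function name select_drive_path, the constant names, and the
default of kernelspace when no access param is set are assumptions made
for this illustration. The point is that the hypervisor only looks at
access and kvm_dev_path, never at the disk template.

```python
ACCESS_USERSPACE = "userspace"
ACCESS_KERNELSPACE = "kernelspace"


def select_drive_path(disk_params, local_dev_path):
    """Pick the path KVM should open for a disk.

    Hypothetical helper: disk_params is the params dictionary and
    local_dev_path is the locally mapped block device.
    """
    # Assumption: a missing access param means the existing
    # kernelspace behaviour.
    access = disk_params.get("access", ACCESS_KERNELSPACE)
    if access == ACCESS_USERSPACE:
        # Template-agnostic: whatever the block device class stored.
        return disk_params["kvm_dev_path"]
    if access == ACCESS_KERNELSPACE:
        return local_dev_path
    raise ValueError("Unknown access mode: %r" % (access,))
```

Another storage backend would only need to store its own path under
kvm_dev_path; this function would not change.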

Note that the RBD volume is mapped as a local block device as before.
The local mapping won’t be used during instance operation in the
userspace access mode, but can be used by administrators and OS
scripts.

This document proposes configuring a distributed storage
pool (Ceph or Gluster) through Ganeti. Currently, this design document
focuses on configuring a Ceph cluster. A prerequisite of this setup
is the installation of Ceph packages on all the concerned nodes.

At Ganeti cluster init, the user will set distributed-storage-specific
options, which will be stored at the cluster level. The storage cluster
will be initialized using gnt-storage. For the prototype, only a
single storage pool/node-group is configured.

The following steps take place when a node-group is initialized as a
storage cluster.

Check for an existing Ceph cluster through the /etc/ceph/ceph.conf file
on each node.
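
A minimal sketch of the per-node check, assuming that the presence of
the configuration file is taken as the sign of an existing cluster. The
function name has_existing_ceph_config is hypothetical, and how the
check is dispatched to each node (for example through an RPC) is not
shown.

```python
import os

CEPH_CONF_PATH = "/etc/ceph/ceph.conf"


def has_existing_ceph_config(conf_path=CEPH_CONF_PATH):
    """Return True if a Ceph configuration file exists on this node.

    Only the presence of a regular file is tested; the contents
    are not parsed or validated.
    """
    return os.path.isfile(conf_path)
```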

Ensure that no other node-group is configured as a distributed storage
cluster, and configure Ceph on the specified node-group. If there is no
node in the node-group, it’ll only be marked as distributed storage
enabled and no action will be taken.
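
The single-storage-group rule above can be sketched as a small
validation step. The function names and the plain lists used to track
groups and nodes are assumptions for illustration; they do not reflect
Ganeti's configuration objects.

```python
def check_storage_group_init(group_name, storage_groups):
    """Raise ValueError if another node-group is storage-enabled."""
    others = sorted(set(storage_groups) - set([group_name]))
    if others:
        raise ValueError(
            "Distributed storage already configured on: %s"
            % ", ".join(others))


def init_storage_group(group_name, group_nodes, storage_groups):
    """Return (storage-enabled groups, nodes needing Ceph setup).

    Hypothetical helper. An empty group is only marked as enabled,
    so the list of nodes to configure is empty in that case.
    """
    check_storage_group_init(group_name, storage_groups)
    updated = sorted(set(storage_groups) | set([group_name]))
    return updated, list(group_nodes)
```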

$ gnt-group assign-nodes <group> <node>

This command ensures that the node is offline if the specified
node-group is distributed storage capable. Ceph configuration on the
newly assigned node is not performed at this step.

$ gnt-node --offline

If the node is part of the storage node-group, an offline call will
stop/remove the Ceph daemons.

$ gnt-node add --readd

If the node is now part of the storage node-group, issue an init
distributed storage RPC to the respective node. This step is required
after assigning a node to the storage-enabled node-group.

$ gnt-node remove

A warning will be issued stating that the node is part of distributed
storage and must be marked offline before removal.

Due to a loopback bug in Ceph, one may run into daemon hang issues
while performing writes to RBD volumes through block device mapping.
This bug applies only when the RBD volume is stored on an OSD
running on the local node. To mitigate this issue, we can
create storage pools on different node-groups and access RBD
volumes on different pools. See:
http://tracker.ceph.com/issues/3076
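
As a rough illustration of this mitigation, the sketch below selects
pools hosted on a node-group other than the one an instance's node
belongs to. The function name and the pool-to-group mapping are
hypothetical; this design does not define how pools are tracked.

```python
def pick_remote_pools(node_group, pool_to_group):
    """Return the pools whose OSDs live on another node-group.

    pool_to_group maps a pool name to the node-group hosting it
    (a hypothetical bookkeeping structure for this sketch).
    """
    return sorted(pool for pool, group in pool_to_group.items()
                  if group != node_group)
```

A node in group1 would then map volumes only from the returned pools,
so that writes never reach an OSD running on the local node.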