Bug Description

Current behavior:
When running several ceph-osd processes, some may be slower to start on boot, which causes the unit's config-changed hook to mistakenly consider those OSDs down.

Expected behavior:
The config-changed hook should wait until no ceph-osd daemons are still starting or respawning.

Actual result:
If the osd-reformat option is set, this behaviour causes some OSDs to be re-osdized during the boot process, without waiting for the ceph-osd daemon to finish starting. By disabling the config option, eventually all OSDs come up, but we lose the advantages of having osd-reformat set.

Steps to reproduce:
Just (clean) reboot the node. It does not happen every time, and not always with the same OSD.

Urgh that's pretty ugly - I think we'll need to record which OSD block devices have already been initialised on a specific unit, and exclude those from any future osd init scans/runs that might run on a config-changed hook execution.

The currently committed scheme to keep track of which osd-devices were already formatted is likely not a sufficient safeguard and needs further improvement. The reasons are:

(1) Device letters change (e.g. when adding a new disk that comes up earlier in the device list order). This is very common on production hardware, not an edge case. Devices are also not guaranteed to come up in the same order. It would be easy for an existing OSD to get pushed off onto a device name that is not listed or not previously initialised, and accidentally get re-initialised.

(2) New OSDs are added to the host but manually with ceph commands (in a way this is similar to the above). This has happened on production deployments when the add-disk action command was broken. However, even if you use the add-disk action, the device may not be listed in the osd-devices config option if the action was taken before the charm stored the list of previously initialised osd-devices (the action uses osdize(), so new deployments would populate it - but old deployments won't). Even if we added code to populate this value on existing clusters from the osd-devices config option, OSDs that are in use but not in the config option won't be added to the previously-initialised list, and thus are still vulnerable to being reformatted.

(3) This is slightly outside the charms' use case, but it is something I can still see people doing, and it would be ideal to avoid data loss from it: moving an existing OSD from one machine to another to recover it. This is explicitly supported and generally recommended as a perfectly reasonable action within the Ceph community.

In all of the above situations, even with the improved commit, data loss could still occur. Plus a fix is still required for already deployed clusters.

An improvement on this situation would be to have the stored osd-devices list record the OSD FSID (stored in the OSDPATH/fsid file), the device ID path (e.g. /dev/disk/by-id), or similar, instead of the block device path. However, that won't help with the OSD movement scenario or the manual ceph-disk initialise scenario.
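As a minimal sketch of the FSID-based idea, assuming the standard OSDPATH/fsid layout described above (read_osd_fsid is a hypothetical helper, not existing charm code):

```python
import os

def read_osd_fsid(osd_path):
    """Return the per-OSD fsid stored in OSDPATH/fsid, or None if absent.

    This only works once the OSD data directory is mounted, since the
    fsid file lives inside it.
    """
    fsid_file = os.path.join(osd_path, 'fsid')
    if not os.path.isfile(fsid_file):
        return None
    with open(fsid_file) as f:
        return f.read().strip()
```

The stored list would then hold fsids rather than device names, so a device letter change alone would not cause a mismatch.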

Otherwise, one way we could potentially handle this better, and which would automatically cover already-deployed OSDs: Ceph also stores the cluster "fsid", which is unique to the current Ceph cluster, on the OSD in the OSDPATH/ceph_fsid file. This would be a much better way to determine whether the OSD is part of the current cluster or not. I am unsure whether there is an easy way to detect this value without first mounting the OSD, though.
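A sketch of that check, assuming the OSD is already mounted (osd_belongs_to_cluster is a hypothetical guard, not existing charm code):

```python
import os

def osd_belongs_to_cluster(osd_path, cluster_fsid):
    """Compare OSDPATH/ceph_fsid against the running cluster's fsid.

    Returns True only when the on-disk cluster fsid matches, i.e. the
    device already belongs to this cluster and must not be reformatted.
    As noted above, this requires the OSD data directory to be mounted.
    """
    ceph_fsid_file = os.path.join(osd_path, 'ceph_fsid')
    if not os.path.isfile(ceph_fsid_file):
        return False
    with open(ceph_fsid_file) as f:
        return f.read().strip() == cluster_fsid
```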

Lastly, as a minor note: one place where storing the OSD UUID is likely to go wrong in future is bluestore conversions, which require reformatting the OSD and would thus give it a new UUID. That workflow may need to be considered. BlueStore and Luminous also format and partition the disk differently (in particular, LVM may be used instead of GPT in some cases), and this may change or influence any code designed to detect this information.

Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

On Tue, May 29, 2018 at 05:57:56AM -0000, Trent Lloyd wrote:
> The currently committed scheme to keep track of which osd-devices were
> already formatted is likely not a sufficient safe guard and needs
> further improvement. Reasons are
>
> (1) Device letters change (i.e. when adding a new disk, but it comes up
> earlier in the device list order). This is very common on production
> hardware, not a side case. Devices are also not guaranteed to come up in
> the same order. It would be easy for an existing OSD to get pushed off
> into a device name that is not listed or previously initialised and
> accidentally get re-initialised.

If a user is concerned about this, they should use /dev/disk/{by-path,by-id} paths in the osd-devices option. I don't believe we have any restriction on the usage of those paths; we only check for the '/dev/' prefix to determine whether the argument is a directory or a disk device.
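A sketch of how a stable path interacts with that prefix check (canonical_device is a hypothetical illustration; the actual charm code may differ):

```python
import os

def canonical_device(configured_path):
    """Resolve a configured osd-devices entry to its canonical device node.

    The only restriction described above is the '/dev/' prefix check used
    to distinguish a block device from a directory-based OSD. Stable
    /dev/disk/by-id and /dev/disk/by-path symlinks pass that check and
    resolve to the real device via realpath, so they survive device
    letter reordering.
    """
    if not configured_path.startswith('/dev/'):
        # treated as a directory-based OSD rather than a block device
        return None
    return os.path.realpath(configured_path)
```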

>
> (2) New OSDs are added to the host but manually with ceph commands (in a
> way this is similar to the above). This has happened on production
> deployments when the add-disk action command was broken.

We could add an action to mark a device as osdized, to allow an operator to merge the charm's view with what the system is actually running.

> However even if you use the add-disk action, the device may not be
> listed in the osd- devices config option if the action was taken
> before the charm stored the list of previously initialised osd-devices
> (the action uses osdize(), so new deployments would populate it -
> but old deployments won't).

As I mentioned in comment #3, the upgrade-charm hook needs to populate the osd-devices list.

> Even if we added code to populate this value on existing clusters from
> the osd-devices config option, OSDs that are in use but not in the
> config option won't be added to the previously initialised list and
> thus are still vulnerable to being reformatted.

osd-reformat is not an option someone should have set in this case, and in conjunction with the action I suggested previously, a user could get to a state where the charm is in sync with the running system.

>
> (3) This is slightly out of use case for the charms, but something I can
> still see people doing and it would be ideal to avoid data loss from is
> also moving an existing OSD from one machine to another to recover it,
> which is something explicitly supported and generally recommended as a
> perfectly reasonable action within the Ceph community.

In this case the user should have osd-reformat set to false (or "").

[...]
> Having said all of that, I think that this option should be removed and instead replaced with a juju action to one-shot reformat OSDs.

This has been discussed, but the lack of support in bundles for running actions at the end of a deployment makes it a non-option.

We'll be designing and implementing an action-driven approach to handle cases when a formatted device is presented for reformat. The charm will require operator decision and input in those cases.

This brings a long-standing issue with fsid re-use back to the forefront. We will need to evaluate the usefulness of explicit fsid charm config in the current world, where the charms now handle that automatically. The issue with fsid re-use is that a formatted block device may contain ceph artifacts from a prior deployment; because the fsid was statically configured, it matches the current deployment, and ceph tries (unsuccessfully) to re-use the device without formatting it.

I just noticed this bug - https://bugs.launchpad.net/charm-ceph-osd/+bug/1746118 - which is similar to, but not the same as, this one. I believe they can/should both leverage the same fix, with a slight adjustment. The fix for 1746118 does a udevadm settle after having formatted and started OSDs, to ensure they don't get reformatted on a subsequent hook fire before all the new OSDs have been mounted. The trouble is that the settle is done only after the formatting; we should really be doing it before as well (maybe only if osd-reformat=True), so as to ensure that there are no devices still waiting to be mounted before the code checks for unmounted devices.
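The proposed ordering can be sketched as follows (prepare_osds and osdize are stand-ins for the real charm code, not its actual API; the run parameter exists only so the udevadm invocation can be stubbed):

```python
import subprocess

def udev_settle(run=subprocess.check_call):
    # Wait for udev to finish processing queued device events.
    run(['udevadm', 'settle', '--timeout', '30'])

def prepare_osds(devices, osdize, osd_reformat=False,
                 run=subprocess.check_call):
    """Initialise OSD devices, settling udev before and after.

    The fix for LP#1746118 settles only after formatting; the suggestion
    above is to also settle before (at least when osd-reformat is set),
    so that no device is still waiting to be mounted when the code scans
    for unmounted devices.
    """
    if osd_reformat:
        udev_settle(run)   # settle BEFORE scanning/formatting
    for dev in devices:
        osdize(dev)
    udev_settle(run)       # existing settle AFTER formatting/starting
```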