Associated revisions

The default udevadm settle timeout of 120 seconds may be exceeded when the disk is very slow, which can happen in cloud environments. Increase it to 600 seconds instead.

The partprobe command may fail for the same reason but it does not have a timeout parameter. Instead, try a few times before failing.

The udevadm settle calls guarding partprobe are not strictly necessary because partprobe already does the same. However, partprobe does not provide a way to control the timeout. Running one udevadm settle after another is going to be a noop most of the time and will not add any delay. It matters when the udevadm settle run by partprobe fails with a timeout, because partprobe silently ignores the failure.
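The settle / partprobe / settle sequence described in the two commit messages above could be sketched roughly as follows. This is a minimal illustration, not the actual ceph-disk code: the function name, the number of tries and the delay between tries are assumptions (the commit only says "a few times"), and the run and sleep parameters exist only so the logic can be exercised without touching a real disk.

```python
import subprocess
import time

SETTLE_TIMEOUT = 600   # seconds, instead of the udevadm settle default of 120
PARTPROBE_TRIES = 5    # assumption: the commit message only says "a few times"
RETRY_DELAY = 60       # assumption: pause between attempts, in seconds


def update_partition_table(dev, run=subprocess.check_call, sleep=time.sleep):
    """Ask the kernel to re-read the partition table of dev (hypothetical helper)."""
    settle = ['udevadm', 'settle', '--timeout={}'.format(SETTLE_TIMEOUT)]
    run(settle)  # let pending udev events drain before touching the device
    for attempt in range(1, PARTPROBE_TRIES + 1):
        try:
            run(['partprobe', dev])
            break
        except subprocess.CalledProcessError:
            if attempt == PARTPROBE_TRIES:
                raise  # give up after the last attempt
            sleep(RETRY_DELAY)
    # partprobe ignores a timeout of its own settle, so settle once more
    # with a timeout we control
    run(settle)
```

Calling update_partition_table('/dev/vdb') would then raise subprocess.CalledProcessError only if every partprobe attempt failed or one of the explicit udevadm settle calls timed out.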

sgdisk -i 1 /dev/vdb opens /dev/vdb in write mode which indirectly triggers a BLKRRPART ioctl from udev (version 214 and up) when the device is closed (see below for the udev release note). The implementation of this ioctl by the kernel (even old kernels) removes all partitions and adds them again (similar to what partprobe does explicitly).

The side effects of partitions disappearing while ceph-disk is running are devastating.

sgdisk is replaced by blkid which only opens the device in read mode and will not trigger this unexpected behavior.
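As a rough illustration of the replacement, partition details can be read by probing the partition device with blkid and parsing its KEY=value output. The helper names below are hypothetical and this is only a sketch of the idea, not the code from the commit. It assumes blkid is called with -p (low-level probing) and -o udev (KEY=value output).

```python
import subprocess


def parse_blkid_udev(output):
    """Turn KEY=value lines, as printed by `blkid -o udev`, into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition('=')
        if sep:  # skip lines that carry no KEY=value pair
            info[key.strip()] = value.strip()
    return info


def partition_info(part):
    """Probe a partition such as /dev/vdb1 with blkid (hypothetical helper).

    blkid only reads from the device, so closing it does not make
    udev rescan the partition table.
    """
    out = subprocess.check_output(['blkid', '-o', 'udev', '-p', part])
    return parse_blkid_udev(out.decode('utf-8'))
```

With this, partition_info('/dev/vdb1').get('ID_PART_ENTRY_TYPE') would give the partition type GUID when blkid reports one, which is the kind of information sgdisk -i 1 /dev/vdb was used for.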

The problem does not show on Ubuntu 14.04 because it is running udev < 214 but shows on CentOS 7 which is running udev > 214.

As an experimental feature, udev now tries to lock the disk device node (flock(LOCK_SH|LOCK_NB)) while it executes events for the disk or any of its partitions. Applications like partitioning programs can lock the disk device node (flock(LOCK_EX)) and claim temporary device ownership that way; udev will entirely skip all event handling for this disk and its partitions. If the disk was opened for writing, the close will trigger a partition table rescan in udev's "watch" facility, and if needed synthesize "change" events for the disk and all its partitions. This is now unconditionally enabled, and if it turns out to cause major problems, we might turn it on only for specific devices, or might need to disable it entirely. Device Mapper devices are excluded from this logic.
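For reference, the locking protocol described in the release note can be exercised from Python with fcntl.flock. This is a sketch only: the context manager name is hypothetical and nothing in this ticket says ceph-disk takes such a lock. The device is opened read-only so that closing it does not itself trigger the rescan described above.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def exclusive_device_lock(path):
    """Hold flock(LOCK_EX) on path for the duration of the with block."""
    fd = os.open(path, os.O_RDONLY)  # read-only: close() triggers no rescan
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield fd
    finally:
        os.close(fd)  # closing the descriptor also releases the lock
```

While a block such as "with exclusive_device_lock('/dev/vdb'):" is active, udev's own flock(LOCK_SH|LOCK_NB) attempt fails, so according to the release note it skips event handling for that disk and its partitions.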

udevadm settle [options]
Watches the udev event queue, and exits if all current events are handled.
--timeout=seconds
Maximum number of seconds to wait for the event queue to become empty. The default value is 120 seconds. A value of 0 will check if the queue is empty and always return immediately.

@Matthew: using partx is appealing (and that's what ceph-disk used in previous versions of Ceph). The reason partprobe was preferred is that its behavior is easier to predict and debug.

In the past that perceived stability / robustness was a matter of opinion and debate. Now that we have ceph-disk integration tests, we can actually verify whether it's true by running them over and over. These integration tests are only two months old and it's too early to draw a conclusion. But I'm hopeful that we'll eventually have enough tests involving border cases to try substituting partx for partprobe and verify it is at least as good.

So, in a nutshell, I'm not saying partx should not be used, nor that partprobe is definitely more stable. But I'm currently working with partprobe to make a more robust ceph-disk that works everywhere.

I tried just running sgdisk outside of ceph-deploy, which was successful, including running partprobe: http://pastebin.com/VbQii1gS

Note: I had to adjust the partition name as bash was trying to look for a file named 'journal' since there was a space there.

@Matthew the log is very useful, thank you. It would be great if you could repeat the same operation without ceph-deploy so that there are no interfering partx -a launched by ceph-deploy while ceph-disk is running. If it turns out to fail as well, it will narrow down the source of errors.

Note that I think partx -a looks like it succeeds because it is non-blocking (i.e. it signals the kernel about the partition change but does not wait for a confirmation in the same way partprobe does). In your case (i.e. only adding different disks with colocated journals) it works fine because there is no chance of a race.

The above suggests that trying partprobe in a loop three times is not always enough. The first time it works but the second time it fails, presumably because the kernel needs more time to complete the update of the partition table. The report that partx -a succeeds supports that explanation.

In the following we see that partprobe fails immediately the first time. Since it happens right after a udevadm settle on a machine that is not doing anything except running the test, it is unlikely to be because a udev event is in flight. Waiting 60 seconds resolves the error, which suggests an asynchronous event related to kernel partition tables was in progress and finished.