Our KVM hosting platform has evolved considerably over the six years it’s been in operation, and we’re always looking at ways we can improve it. One important aspect of this process of continual improvement, and one I am heavily involved in, is the testing of software upgrades before they are rolled out. This post describes a recent problem encountered during this testing, the analysis that led to discovering its cause, and how we have fixed it. Strap yourself in, this might get technical.

The bug’s first sightings

Until now, we have built most of our KVM hosts on Red Hat Enterprise Linux 6 — it’s fast, stable, and supported for a long time. Since the release of RHEL 7 a year ago, we have been looking at using it as well, perhaps even to eventually replace all our existing RHEL 6 hypervisors.

Of course, a big change like this can’t be made without a huge amount of testing. One set of tests is to check that “live migration” of virtual machines works correctly, both between RHEL 7 hypervisors and from RHEL 6 to RHEL 7 and back again.

Live migration is a rather complex affair. Before I describe live migration, however, I ought to explain a bit about how KVM works. KVM is itself just a Linux kernel module. It provides access to the underlying hardware’s virtualization extensions, which allows guests to run at near-native speeds without emulation. However, we need to provide our guests with a set of “virtual hardware” — things like a certain number of virtual CPUs, some RAM, some disk space, and any virtual network connections the guest might need. This virtual hardware is provided by software called QEMU.

When live migrating a guest, it is QEMU that performs all the heavy lifting:

1. QEMU synchronizes any non-shared storage for the guest (the synchronization is maintained for the duration of the migration).

2. QEMU synchronizes the virtual RAM for the guest across the two hypervisors (again for the duration of the migration). But remember, this is a live migration, which means the guest could be continually changing the contents of RAM and disk, so…

3. QEMU waits for the amount of “out-of-sync” data to fall below a certain threshold, at which point it pauses the guest (i.e. it turns off the in-kernel KVM component for the guest).

4. QEMU synchronizes the remaining out-of-sync data, then resumes the guest on the new hypervisor.

Since the guest is only paused while synchronizing a small amount of out-of-sync RAM (and an even smaller amount of disk), we can limit the impact of the migration upon the guest’s operation. We’ve tuned things so that most migrations can be performed with the guest paused for no longer than a second.

So this is where our testing encountered a problem. We had successfully tested live migrations between RHEL 7 hypervisors, as well as from those running RHEL 6 to those running RHEL 7. But when we tried to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 one, something went wrong: the guest remained paused after the migration! What could be the problem?

Some initial diagnosis

The first step in diagnosing any problem is to gather as much information as you can. We have a log file for each of our QEMU processes. Looking at the log file for the QEMU process “receiving” the live migration (i.e. on the target hypervisor) I found this:

What appears to have happened here is that the entire migration process worked correctly up to the point at which the QEMU process needed to resume the guest… but when it actually tried to resume the guest, it failed to start properly. QEMU dumps out the guest’s CPU registers when this occurs. “Hardware error 0x80000021” is unfortunately a rather generic error code — it simply means “invalid guest state”. But what could be wrong with the guest state? It was running just a moment ago on the other hypervisor; if live migration is supposed to copy every part of the guest state intact, how did the migration make it invalid?

Given that all of our other migration tests were passing, what I needed to do was compare this “bad” migration with one of the “good” ones. In particular, I wanted to get the very same register dump out of a “good” migration, so that I could compare it with this “bad” migration’s register dump.

QEMU itself does not seem to have the ability to do this (after all, if a migration is successful, why would you need a register dump?), which meant I would have to change the way QEMU works. Rather than patching the QEMU software then and there, I found it easiest to modify its behaviour through GDB. By attaching a debugger to the QEMU process, I could have it stop at just the right moment, dump out the guest’s CPU registers, then continue on as if nothing had occurred:
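A GDB session along these lines does the trick; note that the breakpoint and structure names below are illustrative only and vary between QEMU versions:

```
$ gdb -p $(pidof qemu-kvm)        # attach to the destination QEMU
(gdb) break kvm_cpu_exec          # stop as a vCPU is about to enter the guest
(gdb) continue
...                               # migration completes, breakpoint fires
(gdb) print /x ((X86CPU *) cpu)->env.segs    # dump the segment registers
(gdb) detach
```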

Those fields at the end contained different values in the “bad” and “good” migrations. Could they be the cause of the “invalid guest state”?

Memory segmentation

To understand what’s going on here, we need to know a bit about how x86 memory segmentation works. Once upon a time, this was really simple: a 16-bit CS (code segment), DS (data segment) or SS (stack segment) register was simply shifted by 4 bits and added to a 16-bit offset in order to form a 20-bit absolute address.

With the introduction of “protected mode”, segmentation became more sophisticated. Each segment register is associated with four fields: a “selector” used to look up the segment, a “base” address at which the segment starts, a “limit” giving the segment’s size, and a set of “flags” to keep track of things like whether the segment can be written to, whether the segment is actually present in physical RAM, and so on.

These are the four fields you can see in the segment registers shown above.

But hang on… this guest wasn’t running in “protected mode”. It was a 64-bit guest running a 64-bit operating system; it was running in what’s called “long mode”, and for the most part long mode doesn’t have segmentation. The particular values in the segment registers listed above are mostly irrelevant, because the CPU isn’t actively using those registers.

So at this point I knew that the segment registers had different flags in the “bad” migration than they did in the “good” migration. But if the registers weren’t being used, why would the flags matter?

“Unusable” memory segments

It took a fair bit of trawling through QEMU and kernel source code and Intel’s copious documentation before I found the answer. It turns out that there is a hidden flag, not visible in these register dumps, indicating whether a particular segment is “usable” or not. The usable flags are not part of the register dumps because they’re not really part of a guest’s CPU state; instead, they’re used by a hypervisor to tell the host CPU which of a guest’s segment registers should be loaded when a guest is started — and most importantly, this includes the times a guest is resumed immediately following a migration.

Next up, I needed to see how KVM and QEMU dealt with these “unusable” segments. So long as each register’s “unusable” flag is included in the migration, then the complete guest state should be recoverable after a migration.

Interestingly, it seems that QEMU does not track the “unusable” flag for each segment. The two functions responsible for translating between KVM’s and QEMU’s representations of these segment registers (get_seg and set_seg) would throw away the “unusable” flag when retrieving a register from the kernel, and always clear it when loading the register back into the kernel. How could this ever have worked correctly?

This was finally answered when I looked at the kernel versions involved:

On the RHEL 6 kernel, when retrieving a guest’s segment registers the kernel would automatically clear the flags for a segment if the segment was marked “unusable”. When loading the guest’s segment registers again, it would treat a segment with a cleared set of flags as if it were “unusable”, even if QEMU had not said so.

On the RHEL 7 kernel, however, the kernel would not touch the flags at all when they were retrieved. On loading the segment registers again, it would treat a segment as “unusable” only if QEMU said so, or if one specific flag — the “segment is present” flag — were not set.

Although these kernels behave differently, both work correctly so long as you stick to one kernel version across a migration. But if you try to migrate a guest from a RHEL 7 hypervisor to a RHEL 6 one, the flags of an “unusable” segment aren’t cleared by the source kernel, QEMU discards the “unusable” flag during the migration, and the RHEL 6 kernel on the target has no way of knowing that the register should be marked unusable. The result is that the guest tries to use an invalid segment register, and the hardware raises an “invalid guest state” error. Bingo — that’s exactly what we’d seen!

The fix

The fix turned out to be quite simple: have QEMU clear the flags of any segment register that is marked unusable, and ensure that segment registers whose “present” flag is clear are also marked unusable when loading them into the kernel:

With both of these changes in place, a migration would work even if we were migrating to or from an “old” version of QEMU without the fix. Moreover, it would mean we could get the fix rolled out without having to change the kernels involved.

At present we are still testing these changes; however, we look forward to working with the upstream QEMU developers to have them added to the mainline version of QEMU.

In writing this blog post I’ve skipped over many of the dead ends I ran into while solving this problem. While the fix ended up being reasonably straightforward (well, as much as can be expected when you’re dealing with kernels and hypervisors), it was a fun and educational journey getting there.
