Stack backtraces from the mind of Garrett. Symbolic debugger not included.

Thursday, July 30, 2009

Fast Reboot & Panic

I noticed that Sherry Moore just posted a blog entry about Fast Reboot.

I wanted to take a few moments to mention a few things, that I think folks should understand.

First off, the feature (fast reboot) is really useful -- for manually initiated reboots to perform administration (such as to reboot after installing new kernel bits or a critical patch), its wonderful to skip past the various hardware related initialization, and can really help with downtime costs associated with administrative maintenance tasks like patching.

This is especially true for systems with lots of peripheral buses (SCSI, Infiniband, etc.) that take a long time for low-level BIOS to probe and test. In such situations, BIOS initialization can consume several minutes. Reducing this to a few seconds is a compelling idea.

However, there are some gotchas that in my opinion people should be aware of when using the variant of this that gets used on panic().

During a panic situation, all bets are off about kernel or hardware state. (This is why the code in the kernel called panic() after all -- it has deemed it unsafe to proceed.)

So, the nice safe quiesce(9e) entry points are not guaranteed to be called. Hopefully they will be, but not necessarily.

Some drivers may panic when they find hardware in a state that is beyond their ability to recover. So quiesce(9e) may be functionally unable to put the system back into a sane state.

Some hardware simply can't restore properly without a low level PCI reset, which the current fast reboot code skips.

If hardware is not quiesce(9e)'d properly, on the reboot, the new kernel can wind up in a situation where a device might be randomly scribbling (via DMA) to physical memory (this can lead to arbitrary data corruption of either kernel or user pages of memory), or might wind up with a stuck interrupt (which may exhibit as a hard hang of the machine).

Note that the above situations are not theoretical. I have hit these problems, involving various different bits of hardware ... certain framebuffers that require low level initialization, certain Ethernet parts that don't have a functional software reset mechanism, and a certain WiFi controller that can leave interrupts stuck.

However, all the situations described above are also quite unlikely to occur. They probably occur in fewer than 1% of all potential panic scenarios.

The upshot of this is that I would most definitely not use fast reboot on any machine that is in production or which has critical data. Do you want to Its a wonderful feature for kernel developers who trash their systems all the time and are accustomed to taking risks -- for such uses the shortened reboot time on a panic is a net win compared to the potential risks (which is virtually nil -- if a system I'm testing hard hangs or crashes a second time I can always just power cycle it), but in a production environment the expectations are different.

In such an environment, you really don't expect to see panic() occur (if it does, you're already on the path of a bug!), but when it does, you want to be 100% certain that you confine the potential damage and get back to a known safe (and good) state. This is why we panic() and reboot, after all. Eliding those time consuming steps of low-level initialization might at first sound like an attractive way to get to higher uptimes, but if you analyze the situation carefully, its a potentially riskier proposition that could (admittedly unlikely) cause much greater downtime.

Now that you know the concerns, you are of course free to make your own assessment.

If you do want to turn off fast reboot on panic (and I do recommend that you leave regular fast reboot on, as far as I can see there is no downside to making an administratively requested reboot go faster), then you can just use the following commands (which are taken from Sherry's blog posting):

(Note that fast reboot on panic is enabled by default on OpenSolaris since build 112. However, since nobody should have deployed a system with this on in production -- we've had only development releases since then -- there is probably no urgency to go and immediately change your systems. Of course, that situation might be different if you're reading this blog post at some point in the future.)