Oracle Blog

Nils Nieuwejaar's Thoughts, Complaints, and Advice.

Sunday Oct 21, 2007

Handling an xVM Panic

The first project I took on when joining the xVM team was figuring out
how to react when the xVM hypervisor panics. When Solaris panics, we are
almost always able to capture a core dump representing the entire kernel
state when the panic occurred. When the hypervisor panics, it dumps some
cryptic information to the screen and resets the machine.

Obviously, our preference would be to capture as much information on a
hypervisor panic as we do on a Solaris panic. There are two significant
difficulties with this. First, xVM has no device drivers of its own, so
xVM isn't capable of writing any of its own state to disk. Second, in the
xVM model the hypervisor and the dom0 OS work closely together to deliver
the full virtualization experience, so even capturing the xVM state would
only give us half the information we need.

The solution I adopted was to modify the xVM hypervisor to pass control
to Solaris in the event of a panic, allowing Solaris to read all the xVM
state, interpret the xVM hypervisor as a standard kernel module, and
execute the standard Solaris panic code that we have spent many years
hardening. Exactly how this works is pretty cool (for some particularly
geeky definition of "cool"), but is much more complicated than I'm prepared
to handle in a blog post. For the highlights, see the slides attached to
the entry below.

If we just take it as read that a Solaris dom0 can properly capture an
xVM core dump, I'd like to walk through one fairly simple example of using
a dump to examine the state of both Solaris and xVM to diagnose a bug that
one of our QA guys hit a few months back.

Interpreting an xVM core dump

The test machine is in a lab somewhere in China. It panicked one night,
giving us the following information:
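
(What follows is a reconstruction of that output - the fault address and
vcpu pointer are the real values used in the analysis below; the rest is
representative:)

FATAL PAGE FAULT
[error_code=0002]
Faulting linear address: c5a08fd4
vcpu: ffbe1100

syncing file systems... done
dumping to /dev/dsk/c0t0d0s1, offset 215547904, content: kernel
100% done: 42591 pages dumped, compression ratio 4.21, dump succeeded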

There's not a whole lot there to go on. If necessary, I'm sure we could
have made a start at debugging the problem with only these few lines.
Fortunately, that wasn't necessary. As the last two lines show, we were
able to capture a core dump for postmortem examination.

Since this was a test machine under active use, it was subject to
crashing on a regular basis, so trying to examine the core dump on that
machine was destined to be an exercise in futility. The QA engineer
uploaded the entire dump to a more stable machine sitting in a lab in
California, where I was able to debug it at my leisure.

The first step is to load the core dump into mdb, the Solaris
modular debugger. In the session below (reconstructed here, with a
representative build string and raw addresses), we can see that the
machine on which the debugger is running is called fsh-trout and
the machine that generated the dump was called goldcloth. We can
also see exactly which Solaris build the machine was running.
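
fsh-trout# mdb unix.0 vmcore.0
Loading modules: [ unix krtld genunix specfs xpv xdb ... ]
> ::status
debugging crash dump vmcore.0 (32-bit) from goldcloth
operating system: 5.11 onnv-gate:2007-10-01 (i86pc)
panic message:
Fatal pagefault in hypervisor, addr=c5a08fd4, eip=f4c76e9d
ffbf3e4c ffbf3ecc ffbf3fc8 c5a0901c c5a09128 c5a091ec
c5a092c8 c5a09358 c5a09578 c5a096a4 c5a09f58 c5a0a854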

The last few lines represent the bottom of the panicking thread's
execution stack. If you're a glutton for punishment, you can manually
translate those addresses into symbol names. I, on the other hand, am
lazy and prefer to let the machine do that kind of grunt work. To show
the stack in human-readable form:
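
(The listing is reconstructed and lightly trimmed: the symbols and the
frame at ffbf3fc8 are the ones the analysis depends on; the offsets and
most of the addresses are illustrative.)

> $C
ffbf3e4c xpv`panic+0x43()
ffbf3ecc xpv`do_page_fault+0x1ff()
ffbf3fc8 0xc5a0901c()
c5a0901c page_ctr_sub+0x65()
c5a09128 page_get_mnode_freelist+0x14a()
c5a091ec page_get_freelist+0xc4()
c5a092c8 page_create_va+0x2d9()
c5a09300 segkmem_page_create+0x8d()
c5a09358 segkmem_xalloc+0x90()
c5a09384 segkmem_alloc+0x8d()
c5a09578 vmem_xalloc+0x3ba()
c5a095c0 vmem_alloc+0x159()
c5a09610 kmem_slab_create+0x67()
c5a09640 kmem_slab_alloc+0x34()
c5a09678 kmem_cache_alloc+0x1b2()
c5a096a4 kmem_alloc+0x48()
c5a09f58 xdb`xdb_setup_node+0x4e()
c5a0a854 xdb`xdb_open_device+0x12a()
[ ... ]
c5a0ad6c taskq_thread+0x50()
c5a0adb8 thread_start+8()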

In this stacktrace, we are looking at a mixture of Solaris and xVM
hypervisor code. The symbols prefixed by xpv` represent symbols
in the xVM portion of the address space. Those of you familiar with
mdb will recognize this as the notation for a kernel module.
Even though the machine originally booted with xVM living in an address
space completely unknown to the kernel, in the core dump we have made it
look like just another kernel module. This allows our existing debugging
tools to work on xVM dumps without any modification at all.

One unfortunate limitation of mdb at the moment is that it
doesn't understand xVM's notation for exception frames. xVM represents
these frames on the stack by inverting the frame pointer. Since
mdb isn't aware of this convention, it simply prints the first
value of the frame out as a hex number. In the example above, that occurs
in the frame at 0xffbf3fc8.

Since I suspect the cause of the panic is to be found in the exception
frame, I have to decode it by hand:

> ffbf3fc8,4/X
0xffbf3fc8: c5a0901c 0 e0002 f4c76e9d

The third slot in the frame contains the type of exception
(0xe, which is a page fault), and the fourth slot in the frame
contains the instruction pointer when the exception occurred. How do I
know this? By reading the xVM exception handling code. Now you know, and
you don't have to read the code yourself. You're welcome.
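
(mdb will happily translate the instruction pointer for us; the exact
disassembly shown is representative:)

> f4c76e9d/i
page_ctr_sub_internal:          pushl  %ebp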

The very first instruction of the Solaris routine
page_ctr_sub_internal() caused xVM to take a page fault. When you
see a stack operation trigger a page fault, it almost always means you've
blown your stack - that is, you've used all the space allocated for the
stack and are now accessing some address beyond the stack limit.

We want to confirm that this is the cause of the panic, and then try to
understand why it happened. To do so, we have to go back to the panic
message to find the address that caused the fault (which on x86 can be
found in the %cr2 register) and the CPU on which it happened.
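
(The relevant lines of the reconstructed panic output again:)

Faulting linear address: c5a08fd4
vcpu: ffbe1100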

OK, the faulting address is easy enough: 0xc5a08fd4. The CPU
is a little bit trickier. Since this was an xVM panic, we know the xVM
vcpu that caught the exception, but we really want to know the
Solaris CPU. To figure that out, we have to wander through a few levels of
indirection.

First we translate from the vcpu pointer to the VCPU id.

> 0xffbe1100::print struct vcpu vcpu_id
vcpu_id = 0

The VCPU id (0) is the same as the Solaris CPU id, so we look it up in
the cpuinfo table:
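
(Output reconstructed; the THREAD column is the one we care about:)

> ::cpuinfo
 ID ADDR     FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD   PROC
  0 fec1d7b8  1b    0    0  60   no    no t-0    c5a0ade0 sched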

As a bonus, this table also tells us which thread was running on the CPU
when it panicked: 0xc5a0ade0. Since our hypothesis is that this
thread blew its stack, let's take a look at the base address of the stack:

> c5a0ade0::print kthread_t t_stkbase
t_stkbase = 0xc5a09000
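
(mdb will even do the subtraction for us - =D prints the value of an
expression in decimal:)

> c5a09000-c5a08fd4=D
                44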

The bottom of the stack is at 0xc5a09000 and we just tried to
access 0xc5a08fd4 - which is 44 bytes below the bottom of the
stack. So we were right about the cause of the panic. To see what ate up
all of the stack space, let's look at the stack again:
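
(Trimmed to the interesting frames, with the same illustrative
addresses as above:)

> $C
[ ... ]
c5a096a4 kmem_alloc+0x48()
c5a09f58 xdb`xdb_setup_node+0x4e()
c5a0a854 xdb`xdb_open_device+0x12a()
[ ... ]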

The leftmost column contains the frame pointer for each frame in the
stack. Each time that number increases by a large amount, it means the
calling routine allocated a lot of memory on the stack. In this example,
both xdb_open_device() and xdb_setup_node() are allocating over 1KB on the
stack.

By inspecting these routines we could see that each was allocating
MAXPATHLEN-sized arrays on the stack. That's not large for a userspace
application, but it's pretty big for the kernel. By replacing these stack
allocations with calls to kmem_alloc(), we were able to reduce the
stack consumption, which fixed the panic.
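
The shape of the fix looks something like this - a sketch rather than
the actual xdb change, with a made-up routine name standing in for the
real ones:

#include <sys/param.h>	/* MAXPATHLEN */
#include <sys/kmem.h>	/* kmem_alloc(), kmem_free() */

/* Illustrative stand-in for the xdb routines that were fixed. */
static void
xdb_fix_sketch(void)
{
	/*
	 * Before the fix, a local "char path[MAXPATHLEN];" consumed
	 * over 1KB of kernel stack the moment the routine was entered.
	 * Borrow the buffer from the kernel heap instead:
	 */
	char *path = kmem_alloc(MAXPATHLEN, KM_SLEEP);

	/* ... build and use the path as before ... */

	kmem_free(path, MAXPATHLEN);
}

The cost is a heap allocation that can block, and the obligation to
free the buffer on every return path, but it takes the big buffers off
the kernel stack entirely.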

Without the ability to capture a full dump of both xVM and Solaris, this
trivial bug would have been much harder to detect and fix. The only
information we would have had was the fact that we took a page fault, and
the address on which we faulted. If we were lucky, we might also have
gotten a hex stack dump - not a human-readable stack like we see above, but
a raw block of a few hundred hex numbers.