If you have some minor programming experience, you'd probably suspect
that the terms in the parentheses after SLIST_FOREACH are variables.
You'd be right. Each of those variables has a range of acceptable
values. Someone familiar with the code would recognize the legitimate
values; by printing out what each variable contains, we can jump-start
the debugging process. You can tell gdb to print a variable's
contents with the p command, giving the variable name as an argument.
Let's look at the middle variable, vp.

(kgdb) p vp
$2 = (struct vnode *) 0xe0b0bd00
(kgdb)

The (struct vnode *) bit tells us that this is a pointer
to a data structure. You can show its contents by putting an asterisk
in front of the variable name, like so:

For those of you who are learning C, this is an excellent example of
how it's easier to hand around a pointer than the object it
references. An interested developer can dig through this and see
what's going on. Let's take a look at the first variable, vq, and try
to get similar information from it.

(kgdb) p vq
$4 = (struct vnode *) 0x0
(kgdb)

This isn't exactly a problem, but we're stuck. A pointer equal to 0x0
is a null pointer. There are many legitimate reasons for having a
null pointer, but there isn't anything in it for us to view. Feel
free to try, however; you really can't hurt anything through gdb.

(kgdb) p *vq
Cannot access memory at address 0x0.
(kgdb)

You've probably heard the words "null pointer" in close proximity to
the word "panic." Without digging into the kernel code, you can't
assume that this is the case. (In this particular panic, the null
pointer is perfectly legitimate; the kernel panicked trying to decide
what value to assign to this newly allocated pointer.)

You could try digging a little further into the data to see what's
going on. The second variable in our panic (vp->v_rdev->si_hlist)
actually goes on a bit; let's take a look a little deeper into it.

(kgdb) p vp->v_rdev
There is no member named v_rdev.
(kgdb)

Normally, this would work, and if you've used gdb before, you might
think that it's wrong. In this case it's correct, however. v_rdev is
a convenience macro. Only people who have read the kernel source code
would know that. v_rdev actually expands to v_un.vu_spec.vu_specinfo.
You couldn't be expected to know that, but don't be surprised if a
developer asks you to type something different from what actually
appears in the trace. To actually view vp->v_rdev, you would type:

You should be able to recognize the null pointer here, but that's
about it.

In this particular case, your extra digging would produce the answer
for a developer very quickly. This tidbit in the contents of the vp structure identifies the problem almost immediately.

v_type = VBAD

This is a vnode that isn't currently used, and it shouldn't even be in
this part of the system. A developer would jump directly on that and
try to learn why the system is trying to set a new vnode to a bogus value.

By gathering what information you could before sending out an email,
you would short-circuit a round or two of email. Anyone who's used
email support in a crisis knows just how valuable this is! Without
being a kernel hacker, you can't know which tidbit of knowledge is most
important, so you need to include everything you can learn.

I got this particular kernel dump from Dag-Erling Smorgrav. His
comment on this was that he "could fix vcount() to return 0 for
invalid vnodes--it wouldn't, strictly speaking, be incorrect--but
the real bug is somewhere else, and 'fixing' vcount() would just
hide it." This is the correct attitude to have on this sort of
problem; BSD users expect bugs to be found, not painted over. This
means, however, that you can expect your developer to come back to you
with requests for further information, and probably more things to
type into gdb. He might even ask you to send the kernel.debug and
core file.

You should be aware that the vmcore file contains everything in your
system's memory at the time of the panic. This can include all sorts
of security-impacting information. Someone could conceivably use this
information to break into your system. A developer might write you
and ask for a copy of the files. There are all sorts of legitimate
reasons for this; it makes debugging easier and can save countless
rounds of email. Carefully consider the potential consequences of someone
having this information. If you don't recognize the
person who asks, or if you don't trust him, don't send the files!

If the panic is reproducible, however, you can cold-boot the system to
single-user mode and trigger the panic immediately. If the system
never starts any programs that contain confidential information, and
nobody types any passwords into the system, the dump cannot contain
that information. Reproducing a panic in single-user mode, hence,
generates a "clean" core file. Just enter boot -s at the loader
prompt to bring the system to a command prompt, then do the minimal
setup necessary to prepare a dump and panic the system.

The first line tells the system where to put its dump. You'll want to
put your correct swap partition name here. The second line mounts the
filesystem's read-only, so you won't have to fsck after your panic.
(Since you know the crash is coming, why make yourself fsck?)
Finally, you run the command that triggers the panic. You might need
some additional commands, depending on your local setup, but this
should get most people up and running.

As a final treat, here's a debugging session from the same panic, and
the same kernel, but without debugging symbols. Compare it to the
initial where output above.

That's it. There are no hints here about where the panic happened,
just the function names that happened. An extraordinarily experienced
hacker might happen to recognize a place in the kernel where the exact
system calls take place, in exactly this order. If the kernel
developer is really, really interested in the problem, he could get
some information out of it like this.

The p/x command means "print in hexadecimal." Here, we've learned
roughly how far into vcount() the problem happened. If the developer
has a similar kernel built with similar source code, they can do this:

That's it. There's no way to get the bad vnode information out. The
developer is left on his own, poking through the code to see if he can
figure out the problem through sheer, dogged determination. And in any
event, it's very unlikely that any developer capable of working on a
problem will have the exact setup that you have on a panicking system.
While many of them would be happy to set up such a system in exchange
for lavish amounts of hard currency, it's a bit much to expect for
free.

Congratulations! You're now as prepared for a crash as any
non-kernel developer can be. Proper preparation can make your life
easier, and preparing for the worst is one of the best ways to sleep
uninterrupted at night.

(The author wishes to gratefully acknowledge the assistance of
Dag-Erling Smorgrav in preparing this article.)