Soft lockups

Some kernel bugs do not trigger oopses, but simply freeze the machine. They can be caused by deadlocks or livelocks, among other things. In most cases (unless some stupid bug causes an interrupt handler to spin on a lock), these soft lockups will not prevent delivery of interrupts.

If interrupt delivery is still possible, the machine will react to pings, and keyboard input will be echoed on the text console. However, the machine will still be mostly unresponsive, because processes no longer make any progress.

A good way to test for this is to hit the NumLock or CapsLock key: if the keyboard LEDs toggle, interrupts are still being delivered, and you are most likely looking at a soft lockup, such as a deadlock, somewhere.

Also, a 2.4 kernel will start flashing the NumLock LED all by itself if it hangs. 2.6 doesn't have this feature yet.

Hard lockups

A hard lockup may occur as well; these are usually due to hardware problems (or excessive abuse of the hardware by some poorly written driver). If this happens, you're in trouble. You can stop reading now, and we wish you good luck in debugging this.

Gathering the necessary information

Users have a tendency to blame software problems on the part of the system that seems the most complex or mysterious from their point of view, which in most cases is the kernel.

That doesn't mean they are necessarily wrong. But it often means that their bug reports are inaccurate, or omit important details.

So when you're tasked with debugging a Linux kernel bug (or what someone thinks is a Linux kernel bug), there are a number of questions about the symptoms that you should ask first:

Hangs versus crashes

Is the machine hanging? Did it crash? Did it just appear to have crashed?

Even experienced users will not always remember to check the syslog for oops messages, especially if the kernel just behaves "slightly strange" without crashing and burning spectacularly. You can save yourself a lot of work if you ask them to check the syslog for oops messages nevertheless.

As described above, if the kernel oopsed while the user was running X, it may appear to them as if the machine were hanging.

On a 2.4 kernel, there is one indication that will tell you that the kernel panicked, even if you're in X: the NumLock LED will start flashing. 2.6 doesn't have this feature yet.

In this case, it always helps to reproduce the problem after switching to the text console. If the machine isn't hung hard, the kernel will at least accept keyboard input. On the console, it will also be possible to capture additional information on the crash. As a minimum, you should tell the kernel to display the oops message on the console as well, using

# klogconsole -r0 -l8

If the oops isn't written to the syslog (e.g. when the oops occurs inside an interrupt), capturing the output with a digital camera may still help (but please make sure that any images you attach to a bug report don't exceed 512k).

Alternatively, one can try to capture the oops via a serial console.
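As a sketch (the device name and baud rate below are assumptions that depend on your hardware), a serial console is typically set up by adding parameters to the kernel command line and attaching a terminal program on a second machine:

```shell
# Kernel command line additions (in the boot loader configuration):
#   console=ttyS0,115200 console=tty0
# This sends kernel messages to the first serial port at 115200 baud,
# while keeping the normal console on tty0.

# On the machine at the other end of the serial cable, capture the
# output with a terminal program, e.g.:
screen /dev/ttyS0 115200
```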

In addition, you may want to enable the sysrq key and capture some sysrq information, as described in section "Capturing sysrq information" below.

Eliminate well-known problems

There are certain classes of problems that are common as dirt. Be aware of these problem areas and try to get them out of the equation early on.

Item Number One on the list of annoying issues is probably ACPI:

Most of the time when a user reports a problem with a machine not booting properly, or hardware not getting set up correctly, this is caused by bad ACPI BIOS tables.

Try to boot with acpi=off to turn off ACPI entirely.
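For example (a sketch; the exact syntax depends on your boot loader, and the kernel and root device paths shown are only placeholders):

```shell
# At the GRUB boot prompt, append acpi=off to the kernel line:
#   kernel /boot/vmlinuz root=/dev/hda2 acpi=off
# If the machine then boots normally, the ACPI BIOS tables are the
# prime suspect.
```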

[list of all ACPI related kernel command line variables goes here]

Other Very Common Problems?

Eliminate non-essential variables

User bug reports often describe fairly specific scenarios, such as "I am using a USB disk with reiserfs on it exported via NFS while listening to my mp3s and all of a sudden the machine crashes".

This is a nice and accurate report, but it involves almost every subsystem the kernel has (block device layer, VM, VFS, network, sound, ...).

To help you narrow down the problem, here are a number of things you can try:

Does the problem exist with older/newer kernel versions as well?

If you take component X out of the equation, can the bug still be reproduced?

If you exchange component X for another, equivalent component (e.g. replacing reiserfs with ext3), does the problem persist?

If the problem suggests memory corruption or random hardware failure: can the problem be reproduced on a different machine?

Especially on large machines, random memory corruption can be caused by bad RAM. To diagnose bad RAM, boot from the installation CD, select memtest86 and let it run for 24 hours.

Your worst enemy is the desktop. Any kernel messages printed to the console while the X server is running will not show up on the screen. The X server has a way of capturing messages printed to /dev/console and displaying them to you, but if the bug is bad enough to prevent syslogd and klogd from writing the oops to the syslog, the chances of X actually being able to display anything useful to you are very small.

So if you're able to reproduce the problem in some way, the first thing you should do is switch to a text console and raise the console log level:

klogconsole -r0 -l8

This raises the kernel's console log level so that anything it sends to syslogd is displayed on the virtual console as well. This includes any kernel oopses; if you trigger the kernel bug now, you will at least get a screenful of oops information.

Note: if you're doing this frequently, please refer to the section on serial consoles below - this is really the preferred method, but as a first stab, just being able to read the oops is a major win. To learn how to read an oops, please refer to the file oops-reading.

Using ksymoops

A kernel oops usually includes a dump of the current processor state, including registers, the instruction pointer, and a function call back trace. For this to be of any use to the kernel developer, these addresses must be mapped to function and/or variable names, if possible.

Current SUSE kernels support a feature called "kallsyms" where the running kernel includes a symbol table of itself, which allows it to resolve the addresses automatically when printing an oops.

Older kernels do not have this feature, so the oops printed will contain just the raw addresses, which need to be converted by a user space application.

This is what ksymoops is for: you can feed ksymoops a raw oops on standard input, and given the right symbol information, it will provide you with a cooked version of that oops, with all addresses mapped to symbols, plus a disassembly listing of the hex instructions. It should not be used when kallsyms is turned on (which is the case for openSUSE kernels).

The crux of the matter is providing ksymoops with the right symbol information. This information is usually taken from the vmlinux image and the System.map file in /boot, which must exactly match the version of the kernel that generated the oops. Therefore, it is usually a good idea to run ksymoops on the machine where the crash happened.

If it is to resolve module symbols properly, ksymoops also needs the list of kernel modules and their locations in memory. A good way of providing this is to copy the file /proc/modules immediately before or after the oops occurs, and specify this copy on the ksymoops command line using the -l option:

# ksymoops -l /tmp/proc-modules-copy < /tmp/my-oops
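Put together, a minimal session might look like this (a sketch; the file names under /tmp are assumptions, and the vmlinux and System.map paths must exactly match the kernel that produced the oops):

```shell
# While the system is still in the failing state, save the module list
cp /proc/modules /tmp/proc-modules-copy

# Later, feed the captured oops text to ksymoops, pointing it at the
# matching symbol information (paths shown are examples)
ksymoops -v /boot/vmlinux -m /boot/System.map \
         -l /tmp/proc-modules-copy < /tmp/my-oops
```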

Fortunately, much of this work is already done automatically when the oops is captured by syslogd, because klogd will do all the symbol translations for you. This has the great advantage that it always uses the correct symbol list.

Of course, oopses captured via the serial console will not have their addresses massaged by klogd, so you will have to run ksymoops manually in this case.

Using sysrq

sysrq means "system request". This is the name for a bunch of magic key combinations that will tell the kernel to display various types of internal information, sync the file systems, or kill tasks. Since this is somewhat security sensitive (especially the task killing part), the sysrq keyboard commands are disabled by default.

One way to enable sysrq is to execute the following command at the shell prompt:

echo 1 > /proc/sys/kernel/sysrq

In addition, you may want to edit /etc/sysconfig/sysctl and change the variable ENABLE_SYSRQ to "yes". This will ensure that sysrq is enabled after reboot.

To use sysrq, you need to press a "magic" key combination plus a command key. The magic key combination depends on the hardware platform, but on most platforms it's ALT-SysRq (on some keyboards, the SysRq key is labelled "PrtScr" or "Print"; it's usually located to the right of the function keys).

Most sysrq keys will cause the kernel to report status information to the console. In the default configuration, a SUSE system has all kernel generated output redirected to tty10, so you need to switch to console 10 or redirect the kernel console to a different tty using klogconsole.

The most helpful command key is "h", which displays a short help text:

0-8 These keys change the console log level to the indicated
level. 8 will display everything on the console, 1 will be
critical messages only, and 0 turns console logging off entirely.
M Display current memory statistics
P Display current processor registers, instruction counter, call
trace and list of loaded modules. This is essentially the process
related information that would get printed as part of an oops.
T Shows a listing of all tasks, including the back trace of their
current kernel stack. Beware, this list can be very long.
U Try to re-mount all currently mounted file systems read-only.
E Send a TERM signal to all processes except init.
I Send a KILL signal to all processes except init.

There are a number of other sysrq keys; a complete list is available
from Documentation/sysrq.txt in the kernel source.

It is also possible to trigger sysrq commands from the command line,
which is very useful if you do not have keyboard access (e.g. when
debugging a problem remotely). In this case, simply echo the letter
to /proc/sysrq-trigger and read back the information from dmesg or
the syslog files:
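For example (a sketch; this must be run as root, with sysrq enabled as described above, and "m" is just one of the command keys listed earlier):

```shell
# Trigger the "m" (memory statistics) sysrq command without a keyboard
echo m > /proc/sysrq-trigger

# Read the resulting output back from the kernel ring buffer
dmesg | tail -n 30
```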