Debugging a Hung Operating System

Has your entire Windows machine ever frozen or hung? I am not referring to a single window hanging, but to the kind of hang where even the mouse and keyboard do not respond.

Each case is different, but most often I find these hangs to be the result of code that does not give up control while the CPU is running at or above the IRQL of the thread dispatcher (DISPATCH_LEVEL). That includes drivers responding to hardware interrupts and DPCs completing I/O for drivers.

When this happens, the operating system cannot switch to another thread to allow other processing to occur. One consequence is that the threads responsible for painting the user interface are unable to run, so it seems as if your input is being ignored. In fact, it is very likely that your keyboard and mouse input is being recognized by the respective drivers, and the response to that input is simply delayed until the correct thread can service it.

There is a way to debug these hangs, but it requires some advance preparation. Furthermore, if you have a USB keyboard, it requires a hotfix or a modern version of Windows, such as Windows 7. To debug these hangs, you should enable the CrashOnCtrlScroll feature described in this article.
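As a rough illustration of that preparation, the sketch below builds the text of a .reg file that sets CrashOnCtrlScroll to 1 under the Parameters key of the PS/2 keyboard driver (i8042prt) and the USB keyboard driver (kbdhid). The function name and the "ps2"/"usb" labels are made up for this example, and it only produces text. Saving the file, importing it, and rebooting are left to you.

```python
# Keyboard driver services that read the CrashOnCtrlScroll value.
# The short labels on the left are hypothetical names used only in this sketch.
KEYBOARD_SERVICES = {
    "ps2": "i8042prt",  # PS/2 keyboards
    "usb": "kbdhid",    # USB keyboards
}


def crash_on_ctrl_scroll_reg(kinds=("ps2", "usb")):
    """Return the text of a .reg file that sets CrashOnCtrlScroll=1."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for kind in kinds:
        service = KEYBOARD_SERVICES[kind]
        lines.append(
            "[HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet\\Services\\"
            f"{service}\\Parameters]"
        )
        lines.append('"CrashOnCtrlScroll"=dword:00000001')
        lines.append("")  # blank line between keys
    return "\r\n".join(lines)
```

Calling crash_on_ctrl_scroll_reg(("usb",)) limits the output to the USB keyboard key. If you write the result to disk, use an encoding that regedit accepts (UTF-16 is the usual choice for version 5.00 files).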

After enabling this feature and rebooting, holding the right CTRL key while pressing SCROLL LOCK twice will manually crash the system. If you have enabled proper dump collection for system crashes, a dump will be generated, and it can be debugged after the system is brought back up.

The caveat is that this feature will not work if the hang occurs at an IRQL equal to or higher than the one at which the keyboard driver's interrupt is serviced, because the keystrokes are never processed. However, this rarely happens, and most users should find that this strategy works. In fact, this strategy can also be used in a limited manner to debug a system that has trouble during the latter stages of a boot cycle, even in cases where the mouse is still responsive but the system fails to make it to the shell.
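The IRQL rules behind this caveat can be written down as a toy model. This is only a sketch of the reasoning for a single CPU, not how the kernel is implemented. The numeric levels follow the x64 numbering, and KEYBOARD_DIRQL is a hypothetical value, since real device IRQLs are assigned by the system.

```python
PASSIVE_LEVEL = 0
APC_LEVEL = 1
DISPATCH_LEVEL = 2
HIGH_LEVEL = 15  # x64 numbering; x86 uses a different range

# Hypothetical device IRQL for the keyboard interrupt (assumption for this sketch).
KEYBOARD_DIRQL = 8


def dispatcher_can_run(current_irql):
    """Thread switches happen only while the CPU is below DISPATCH_LEVEL."""
    return current_irql < DISPATCH_LEVEL


def interrupt_can_preempt(current_irql, interrupt_irql):
    """An interrupt is serviced only if its IRQL is above the CPU's current IRQL."""
    return interrupt_irql > current_irql


def manual_crash_possible(hang_irql, keyboard_irql=KEYBOARD_DIRQL):
    """The keyboard-initiated crash needs the keyboard interrupt to get through."""
    return interrupt_can_preempt(hang_irql, keyboard_irql)
```

In this model a hang at DISPATCH_LEVEL blocks the dispatcher (dispatcher_can_run returns False) but still lets the keyboard interrupt in (manual_crash_possible returns True), which is the common case. A hang at HIGH_LEVEL blocks both.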

On Windows 8, we now have a DPC watchdog timeout enabled by default (bugcheck 0x133, DPC_WATCHDOG_VIOLATION) to make it easier to diagnose and capture data when a CPU is kept at or above DISPATCH_LEVEL for too long. On client systems, the default timeout is 10 seconds for a single DPC, and 30 seconds for the cumulative time a CPU can be kept at IRQL >= DISPATCH_LEVEL. The most common cause is a driver raising IRQL and not lowering it again, which Driver Verifier can catch.
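To make the two limits concrete, here is a small arithmetic sketch. It takes the 10-second and 30-second client defaults from the paragraph above and reports which limit a back-to-back run of DPCs would exceed first. The function name and return format are invented for this example, and the real watchdog is driven by the kernel's own timers, so treat this as an illustration of the two thresholds only.

```python
SINGLE_DPC_LIMIT_S = 10.0  # per-DPC limit quoted above (client default)
CUMULATIVE_LIMIT_S = 30.0  # cumulative limit at IRQL >= DISPATCH_LEVEL


def watchdog_verdict(dpc_durations_s,
                     single_limit=SINGLE_DPC_LIMIT_S,
                     cumulative_limit=CUMULATIVE_LIMIT_S):
    """Return ("single" | "cumulative", index) for the first limit exceeded, else None.

    dpc_durations_s lists how long each DPC ran, in seconds, assuming the CPU
    never dropped below DISPATCH_LEVEL between them.
    """
    total = 0.0
    for index, duration in enumerate(dpc_durations_s):
        if duration > single_limit:
            return ("single", index)
        total += duration
        if total > cumulative_limit:
            return ("cumulative", index)
    return None
```

For example, four consecutive 9-second DPCs never break the single-DPC limit, but the fourth pushes the running total to 36 seconds, so watchdog_verdict([9, 9, 9, 9]) returns ("cumulative", 3).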