Unkillable Processes

Have you ever terminated an application only to see in your favorite task manager (Process Explorer, of course) that the process still exists? Or have you tried logging out or shutting down only to have the logoff or shutdown stall indefinitely for no apparent reason? These scenarios are usually the result of buggy device drivers that don’t properly handle the cancellation of outstanding I/O requests.

Over the last few years I’ve developed a tool called Notmyfault that demonstrates a number of common device driver bugs, including accessing freed memory, overrunning buffers, and leaking memory. The crashes generated by Notmyfault are featured in the crash analysis chapter of the Windows Internals book I coauthored with Dave Solomon. I’ve recently added a new error selection, Hang Irp, to show the effects of drivers that don’t cancel I/O requests.

When you run Notmyfault and select the Hang Irp bug, Notmyfault sends an I/O request to its helper driver, Myfault.sys, that Myfault.sys never completes. The names of the executable and driver reinforce the fact that user-mode code can never directly cause a Windows crash: Notmyfault relies on the Myfault driver to do the dirty work. The Notmyfault thread that issues the request never continues executing because it ends up stuck in the kernel waiting for the I/O request to complete. However, because Notmyfault issues the request from a second thread, the UI remains responsive and you can trigger other bugs, issue more hanging IRPs, or try to terminate the process.

Terminating Notmyfault reveals the effect of a hung IRP. Even after you close the Notmyfault window, the Notmyfault process still shows in Process Explorer’s process list. Logging off and back in, even into a different account, does not cause the zombied process to exit. So what’s going on under the hood? If you’ve configured Process Explorer to take advantage of Microsoft’s symbol support (steps for doing so are documented in Process Explorer’s help file), you can view the stack of the hung thread by double-clicking on the Notmyfault process, navigating to the Threads tab of the resulting Process Properties dialog, and double-clicking on the thread:

A stack reflects a history of subroutine invocation and reads top to bottom from most to least recent. The stack above indicates that Notmyfault called DeviceIoControl, which called ZwDeviceIoControlFile. ZwDeviceIoControlFile transitioned into kernel mode (the frames that are prefixed with “ntkrnlpa.exe”), where the kernel’s system call dispatcher executed NtDeviceIoControlFile. Since the I/O request was synchronous, the I/O Manager waits for the driver at which the I/O is targeted to complete the request.

When a process terminates, the Process Manager performs process rundown, which includes terminating all the threads in the process, closing handles to opened system resources (e.g. files and registry keys), and tearing down the address space of the process. When the Process Manager sees that a terminating thread has outstanding I/O requests, it informs the drivers processing the requests that the requests should be cancelled. You can see that in the stack as the call to IopCancelAlertedRequest. Because the completion of an I/O request requires access to the address space of the owning thread’s process, the system can’t finish tearing down a process until all its I/O requests have been completed or cancelled. If a driver does neither, the I/O Manager has no choice but to wait indefinitely, which you can see in the stack as the call to KeWaitForSingleObject.
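The rundown logic described above can be sketched as a toy user-mode model in C. This is only an illustration under stated assumptions, not kernel code: Request, ReqState, well_behaved_cancel, rundown_requests, and can_finish_teardown are hypothetical names invented for the sketch, and none of them are part of the Windows driver API. A NULL cancel routine stands in for a driver that ignores cancellation, and where the real system would block, the model simply counts the requests that remain pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types are hypothetical, not the real kernel IRP. */
typedef enum { REQ_PENDING, REQ_COMPLETED, REQ_CANCELLED } ReqState;

typedef struct Request Request;
typedef void (*CancelRoutine)(Request *req);

struct Request {
    const char   *driver;  /* name of the driver holding the request */
    ReqState      state;
    CancelRoutine cancel;  /* NULL models a driver that ignores cancellation */
};

/* A well-behaved driver finishes the request as cancelled when asked. */
void well_behaved_cancel(Request *req)
{
    req->state = REQ_CANCELLED;
}

/*
 * Model of process rundown: ask the owner of every pending request to
 * cancel it, then count the requests that are still pending. The real
 * system would wait on those indefinitely; the model returns the count
 * instead of blocking.
 */
size_t rundown_requests(Request *reqs, size_t count)
{
    size_t still_pending = 0;

    assert(reqs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (reqs[i].state == REQ_PENDING && reqs[i].cancel != NULL)
            reqs[i].cancel(&reqs[i]);
        if (reqs[i].state == REQ_PENDING)
            still_pending++;
    }
    return still_pending;
}

/* The process can be torn down only once nothing is left pending. */
bool can_finish_teardown(Request *reqs, size_t count)
{
    return rundown_requests(reqs, count) == 0;
}
```

Given one request held by a driver that supplies well_behaved_cancel and one held by a driver with no cancel routine, rundown_requests returns 1 and can_finish_teardown returns false. In this model that is the zombie-process case: a single uncooperative driver is enough to keep the process from going away.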

If you run across this type of problem in the real world, you’ll need to run a kernel debugger to look at the outstanding I/O requests of any hung threads and determine the driver that owns them. If the system is hung, you need to debug it from a second computer running a kernel debugger. Since the system as a whole isn’t hung when you create a hung thread with Notmyfault, you can use local kernel debugging with LiveKd or, if you’re running Windows XP or higher, the local kernel debugging support built into the Debugging Tools for Windows. If you’ve never used a kernel debugger, the easiest approach is to download the Debugging Tools for Windows and then run LiveKd from the directory in which you install the tools.

The first kernel debugger command to execute is one to look at the hung process and its threads. Look at the IRP List area, which is a list of outstanding I/O requests, of any threads that are listed. Here’s the command to dump the hung process and partial output that includes the IRP list for the Notmyfault thread:

The output reports that \Driver\Myfault, the internal name of the Myfault driver, owns the IRP and is therefore the driver that’s guilty of not completing the I/O and not responding to the system’s cancellation request. The error regarding missing symbols for myfault.sys is expected since Microsoft only stores symbols for its own drivers and components.
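The manual step of scanning each thread’s IRP list for the owning driver amounts to a simple walk over two nested lists. The C sketch below models that walk with hypothetical stand-in types (ToyIrp, ToyThread) and a hypothetical helper (find_outstanding_irp_owner). It assumes each thread’s list contains only requests that are still outstanding, and it is not how the debugger itself is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for what the debugger displays. */
typedef struct {
    const char *driver;     /* driver that currently owns the request */
} ToyIrp;

typedef struct {
    const ToyIrp *irps;      /* the thread's list of outstanding requests */
    size_t        irp_count; /* 0 when the thread has no outstanding I/O */
} ToyThread;

/*
 * Walk the threads of a process and return the driver that owns the
 * first outstanding request found, or NULL if no thread has any.
 */
const char *find_outstanding_irp_owner(const ToyThread *threads,
                                       size_t thread_count)
{
    assert(threads != NULL || thread_count == 0);
    for (size_t t = 0; t < thread_count; t++) {
        if (threads[t].irp_count > 0)
            return threads[t].irps[0].driver;
    }
    return NULL;
}
```

For a process whose only thread with outstanding I/O has a request owned by \Driver\Myfault, the helper returns that name. A NULL result would mean that no thread in the process is waiting on a driver.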

The reason that the Notmyfault bug does not result in logoff or shutdown hangs is that the system doesn’t care if user applications really terminate during either of those activities. As long as the TerminateProcess API returns success, which it does for such zombie processes, the system is happy. However, if Explorer or one of the core system processes gets into a zombie state the system will be effectively hung.