[Status Update: As of 22 March 2005, this bug was fixed in Windows "Longhorn." Thanks to the ms dev who took this bug and fixed it!]

Be grateful to the editors of The NT Insider that you are about to be told about a bug that is (1) quite likely to exist in your code, (2) exists in a lot of "working" code, (3) has been a bug since NT 4.0, and (4) is a bug that several (unnamed) knowledgeable people refused to believe is a bug.

I installed my well-tested driver on a Toshiba laptop running XP Home. It won't reduce the suspense one whit to tell you that the bug had nothing to do with it being a Toshiba laptop, or running some proprietary Toshiba software, or anything to do with XP Home.

As with so many other kernel bugs, it was time to connect up a 1394 cable, crank up WinDbg, put in a breakpoint at DriverEntry wrapped in a __try/__except block (so that I could run the driver with and without the kernel debugger running), and then step through the code to see where it crashed.

Oh dastardly expletive! This was one of those wretched Heisenbugs. (A Heisenbug is, of course, a bug that goes away when you put in debugging statements or invoke a debugger.)

The crashes kept moving around. It was never in my code. It was always elsewhere so that I couldn't even tell what was happening or how much of my code had run before the crash occurred.

What I needed was some sort of minimally-invasive technique to track this bug. Frustratingly, the Toshiba never crashed when the debugger ran, so the debugger didn't help. I put in some KeBugCheck()s to see how much of my code was being executed. That helped some, but not enough.

For me, one of the many frustrations about kernel mode crashes is that coaxing Windows to produce a memory.dmp file is more art than science. I still can't reliably produce one. Sometimes a crash won't write a memory dump to pagefile.sys; sometimes you'll see it dumping memory in the BSOD. Sometimes the dump is written, but the next reboot on the same partition doesn't turn it into a memory.dmp. Sometimes it does and sometimes it doesn't. Don't ask me why.

On those rare occasions when a memory.dmp was actually produced, I used a file viewer to look for the memory trace. Okay; the crash showed up some time after I entered AddDevice. I added a lot of debugging code to validate subscripts and other parameters. Unfortunately, everything checked out--but we were still crashing when we booted without a debugger.

Time to break out a bigger hammer: I turned on all the error checking in Driver Verifier. The symptoms were the same. Everything worked if the debugger was running, and the machine crashed randomly without it.

I kept breaking out bigger hammers. Henry Gabryjelski from Microsoft suggested that I try

verifier.exe /flags 0xFB /all

With Henry's suggestion, I also turned on "special pools" handling via gflags. Oh God, the machine ran so s-l--o---w----l------y.
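The 0xFB mask passed to verifier.exe above is a bit field, so it helps to see which checks it actually selects. The sketch below is a user-mode illustration only. It assumes the XP-era option bits as listed in Microsoft's Driver Verifier documentation (they are not taken from this article), and the names verifier_option, verifier_option_enabled and describe_verifier_flags are made up for the example.

```c
#include <stddef.h>
#include <stdio.h>

/* One Driver Verifier option bit and its documented name.
 * Assumption: XP-era bit assignments from the Driver Verifier docs. */
struct verifier_option {
    unsigned int bit;
    const char *name;
};

static const struct verifier_option VERIFIER_OPTIONS[] = {
    { 0x01u, "special pool" },
    { 0x02u, "force IRQL checking" },
    { 0x04u, "low resources simulation" },
    { 0x08u, "pool tracking" },
    { 0x10u, "I/O verification" },
    { 0x20u, "deadlock detection" },
    { 0x40u, "enhanced I/O verification" },
    { 0x80u, "DMA checking" },
};

#define VERIFIER_OPTION_COUNT \
    (sizeof VERIFIER_OPTIONS / sizeof VERIFIER_OPTIONS[0])

/* Returns nonzero if the given option bit is set in the flags mask. */
int verifier_option_enabled(unsigned int flags, unsigned int bit)
{
    return (flags & bit) != 0;
}

/* Prints each enabled option and returns how many were enabled. */
unsigned int describe_verifier_flags(unsigned int flags)
{
    unsigned int enabled = 0;
    for (size_t i = 0; i < VERIFIER_OPTION_COUNT; i++) {
        if (verifier_option_enabled(flags, VERIFIER_OPTIONS[i].bit)) {
            printf("0x%02X  %s\n",
                   VERIFIER_OPTIONS[i].bit, VERIFIER_OPTIONS[i].name);
            enabled++;
        }
    }
    return enabled;
}
```

Under those assumptions, describe_verifier_flags(0xFB) lists seven options. The only one left out is low resources simulation (0x04), which would inject allocation failures of its own and muddy an already confusing crash.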

With all of the debugging tracing turned on I finally started seeing the crashes at a consistent point in my code. The crash was no longer in AddDevice but very early in my DriverEntry code. Specifically, it was a page fault while in a call to MmGetSystemRoutineAddress. I called this routine in order to get the version of the OS.

This was bizarre. In an attempt to isolate this error, all my code referenced non-paged memory and ran in non-paged pool. How could there possibly have been a page fault?

I was faced with a real stumper. We see from the statement before the call to MmGetSystemRoutineAddress that the current IRQL is 0 (PASSIVE_LEVEL). Yet the crash is an IRQL_NOT_LESS_OR_EQUAL. Huh?

Even more disturbing are two more facts. First, the instruction at the crash is:

mov edi,edi

This is a nop that only uses registers. How can it possibly crash? Second, the mov instruction is the first instruction in MmGetSystemRoutineAddress! How in all that is holy to programmers can this be happening?

Gentle reader, I'm not being fair with you. The people reading my many, many posts on the mailing list on this subject had the benefit of portions of a crash dump. I'll point out the most relevant lines of the dump:

efl=00010046
CURRENT_IRQL: ff

I'd been seeing the "CURRENT_IRQL: ff" for several weeks. It had bothered me but no one seemed to be able to tell me how the CURRENT_IRQL could be 0xFF unless I'd walked over some memory that didn't belong to me. You see, the IRQL is stored in a section of kernel memory known as the Processor Control Region (PCR). (You can display the PCR by using the !pcr WinDbg command extension.)

So let's review the conundrum. First, we're running at passive level. Second, the instruction that is faulting only uses registers and is, in fact, a nop. Third, the instruction that is failing is the first instruction in MmGetSystemRoutineAddress.

You have all the clues that I did. I couldn't figure it out but OSR's very own Tony Mason did. His answer was:

I had to go check, but it turns out that the interrupt enable bit is bit 9 (Volume 1, Page 3-15). That corresponds to 0x200 hex. This bit is NOT set, and thus on crash this exhibits by showing an "irql" of 0xff. That is consistent with what I've always observed.

The documentation clearly says: This routine can only be called at IRQL = PASSIVE_LEVEL.

Interrupts are disabled. This is the *equivalent* of running at IRQL HIGH_LEVEL, without that nasty TPR programming. The page in question is probably marked as "in transition" (use "!pte 805bfa33") which means the data contents are really in memory, but this causes a hardware page fault anyway. The debugger "helps" you by showing you the contents of that memory location, even though it generated a page fault.

The use of a two-byte NOP code is odd, but a red herring. The reason that an innocuous instruction like this causes a fault is because the *instruction* is at fault.

Now you just need to figure out why interrupts are disabled.

Later that same week, Microsoft's Jake Oshins added a rather comprehensive explanation of the IRQL 0xFF situation. You can find this description in the Hector's Memos section of OSR Online.

So the answer to the question is that the first instruction in MmGetSystemRoutineAddress is paged out, plus interrupts are turned off. In another message to the mailing list, Tony traced through the OS code and found that, indeed, when interrupts are turned off the IRQL is set to 0xFF.
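Tony's arithmetic is easy to check against the efl value from the dump. The snippet below is a minimal user-mode sketch, not driver code; the names EFLAGS_IF and interrupts_enabled are invented for the illustration. It tests bit 9 (0x200), the interrupt enable flag, in a saved EFLAGS value.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt enable flag (IF): bit 9 of the x86 EFLAGS register. */
#define EFLAGS_IF (UINT32_C(1) << 9)   /* 0x200 */

/* True if the saved EFLAGS value shows maskable interrupts enabled. */
bool interrupts_enabled(uint32_t eflags)
{
    return (eflags & EFLAGS_IF) != 0;
}
```

For the value in the dump, interrupts_enabled(0x00010046) is false, because 0x00010046 & 0x200 is 0. Had interrupts been on, the same flags would have read 0x00010246.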

Tony said it with so much understatement: Now you just need to figure out why interrupts are disabled.

Indeed.

It turns out that finding the source of why interrupts were turned off was considerably easier than I'd anticipated. What I did was write the totally non

It's highly system specific, it's WAAAAAY too easy to THINK you know what you're doing ("it worked on DOS") but get an unexpected result (as in INT 1 versus INT 3), and it won't even compile in the x64 cross compiler.

It has NEVER been acceptable to use an INT 1 or an INT 3 or any other such convention in a Windows NT driver. Really.

With the CrashIfInterruptsAreDisabled() code in place I was able to sprinkle defensive tests throughout my code.

At the beginning of driver entry I had:

#if DBG // {
    CrashIfInterruptsAreDisabled();
    MyBreakPoint();
    CrashIfInterruptsAreDisabled();
#endif // }

Where MyBreakPoint() is defined as:

// Call this if there is a constructor in the function. Will
// avoid an SEH/constructor error if using C++ in driver code
// (error C4509) that needs a destructor.
void
MyBreakPoint()
{
    __try { DbgBreakPoint(); } __except(EXCEPTION_CONTINUE_EXECUTION) {}
}

I looked at the static crash dump. I saw that it was the second call to CrashIfInterruptsAreDisabled() that actually crashed my Toshiba laptop.

Weird. I wondered if there was some strange anti-hacking software on the Toshiba. The easiest way to see if this was Toshiba-specific was to run this code on another machine. This time I tried XP Professional on a desktop machine running an AMD processor. The Toshiba was running an Intel processor.

Oh dear god! It crashed on XP Professional, too! I sent the code to Jamey Kirby and he tested it and saw the same result.

Here's the enormous surprise: The following code turns off interrupts if there is no debugger running and does not turn them off if there is a debugger running!

__try { DbgBreakPoint(); }
__except(EXCEPTION_CONTINUE_EXECUTION) {}

Jamey points out that the Intel documentation states that INT 1, INT 3 and BOUNDS exceptions disable interrupts.

The above--seemingly trivial--code segment (which I and many, many other developers have been using for several years without a lick of trouble) was the source of the problem. It was the first statement in my device driver and it caused havoc millions of instructions later. Even more surprisingly, no one else has analyzed this problem and sounded the alarm bells in the developer community.

Jamey Kirby wonders how many seemingly spurious BSODs are the result of this "innocuous" piece of code.

The entire point of the above code is to break into a debugger if one is there but to continue execution if there is no debugger. This is the ultimate Heisenbug because the mere act of attempting to invoke the debugger causes a really nasty change in the state of the machine, from interrupts enabled to interrupts disabled.

All of my problems instantly became clear. All of those spurious crashes that moved around as I changed code; the machine just hanging; all of the unexplained Bug Check 0xA: IRQL_NOT_LESS_OR_EQUALs and God-knows-what else. Oh my God, interrupts were disabled!

We're now left with another problem: How can one write code so that a debugger is safely tripped when a debugger is present but that is harmless when there isn't a debugger present?

The first solution--a very bad solution--is to write assembly code to fix the problem.

No. Nyet. Bad. Don't do that. You're tracking mud across my clean floor. Get the idea? Writing any assembly code in any production driver is a big no-no. It's worse than acquiring a spin lock and holding on to it until you wake up in the morning. Very naughty.

The second solution is for Microsoft to fix the problem. The DDK does indeed say that you can trap the exception raised by DbgBreakPoint when no debugger is attached. See the sidebar entitled "Wazzup With This Bug?" (and OSR Online) for the current status of this problem at Microsoft.

So, does that mean there's in fact no solution to this problem? Well, finally, today, as I write this, I found a buggy and partial answer in the DDK documentation, in the section entitled Debugging

Ignoring for the moment the fact that it doesn't work, why is this only a partial solution? Because, according to the DDK documentation, "this global variable can only be used in Microsoft Windows XP and later," and because "If a kernel debugger was recently attached or removed, the value of KD_DEBUGGER_NOT_PRESENT may not reflect the new state."

As to whether this solution works correctly: If you paste the code above into your driver and start up Windows without a kernel debugger running, you will get a BSOD. The bug check code is 0x7E (SYSTEM_THREAD_EXCEPTION_NOT_HANDLED) with parameter 1 being 0x80000003 (STATUS_BREAKPOINT).

Two caveats: This will only work on Windows XP and later systems. There is also the problem of a debugger being attached and then later detached. For that you'll need KdRefreshDebuggerNotPresent(). Alas, KdRefreshDebuggerNotPresent() is only available starting with Windows Server 2003.

Epilogue

After working for two years on this device driver--and after five weeks of hellish work to resolve what I hope is the last unresolved issue--Jamey told me that he thinks he has a different and better way to accomplish what we need... without needing a device driver.

I just love this profession.

Ralph Shnelvar (CEO) and Jamey Kirby (CTO) are with Information LLC, a startup developing next-generation backup software.

Ralph can be reached at RalphS@InformationLLC.com.

Wazzup With This Bug?

When Ralph and Jamey contacted us here at OSR about this bug, we couldn't believe what they were telling us was true. But it didn't take long to convince us.

We made a few careful enquiries among our friends in Redmond, and one of the Windows devs agreed to look at the problem. We filed a bug in Microsoft's internal bug reporting system. And that was the status of this problem as we went to press with this issue of The NT Insider.

Check out the memo from Hector titled Don't __try to Catch The DbgBreakPoint(...) Exception, then keep an eye on it here at OSR Online for any updates on the status of this bug.