What Should Never Happen... Did

Hi, this is Bob Golding. I want to share an interesting hardware issue I ran into. Hardware problems can be tricky to isolate, and this one is a good example of how to trace code execution. The machine executed the filler “int 3” instructions generated by the compiler. Execution should never reach these filler instructions, so we needed to determine how the instruction pointer got there.

What was the issue?

The issue was a bug check 8E (unhandled kernel-mode exception). The exception was a breakpoint exception (80000003, STATUS_BREAKPOINT), raised because a filler INT 3 instruction was executed.

1: kd> .bugcheck

Bugcheck code 0000008E

Arguments 80000003 8082e3e0 f78aec38 00000000
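As a rough aid for reading the output above, here is a minimal Python sketch that labels the four arguments of bug check 0x8E. The field names follow the documented parameter meanings for this bug check; the helper name label_bugcheck_8e and the small status table are only illustrative assumptions, not part of any debugger API.

```python
# Parameter meanings for bug check 0x8E (KERNEL_MODE_EXCEPTION_NOT_HANDLED).
BUGCHECK_8E_FIELDS = (
    "exception code",
    "address where the exception occurred",
    "trap frame address",
    "reserved",
)

# A couple of well-known NTSTATUS values; illustrative, not exhaustive.
KNOWN_STATUS = {
    0x80000003: "STATUS_BREAKPOINT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}

def label_bugcheck_8e(args):
    """Pair each of the four bug check arguments with its meaning."""
    if len(args) != 4:
        raise ValueError("bug check 0x8E takes exactly four arguments")
    return dict(zip(BUGCHECK_8E_FIELDS, args))

# Arguments copied from the .bugcheck output above.
labeled = label_bugcheck_8e([0x80000003, 0x8082E3E0, 0xF78AEC38, 0x00000000])
for name, value in labeled.items():
    print(f"{name:38s} {value:08x}")
print(KNOWN_STATUS.get(labeled["exception code"], "unknown status"))
```

The second argument is the address of the instruction that raised the exception, and the third is the address of the trap frame to inspect next.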

Below is the trap frame (the trap frame address is the third argument of the bug check). Note that the actual trapping instruction is at 8082e3e0; the instruction pointer saved in the trap frame has already been incremented past the INT 3 by the time the trap is generated. The correct EIP is reported in the bug check arguments.

Now, we need to find the execution path that caused the machine to execute this INT 3. There are several places to look for clues. The first place to look is the stack. If a “call” instruction was executed, its return address will have been pushed on the stack. This way we can try to determine whether we arrived at this bad instruction pointer from a call or from a ret instruction.

Using the value from esp in the above trap frame, let’s dump the stack.

1: kd> dps f78aecac

f78aecac  80a5d8fc hal!HalpIpiHandler+0xcc <<< Interesting?

f78aecb0  f78aecc0

f78aecb4  00000000

f78aecb8  00000002

f78aecbc  000000e1

f78aecc0  f78aed50

f78aecc4  f75d9ca2

f78aecc8  badb0d00

f78aeccc  000086a8
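Stack slots that the debugger resolves to a symbol are the usual candidates for return addresses. The following is a minimal sketch, assuming the dps text has been saved with whitespace between the slot address and the stored value; the function name return_address_candidates is hypothetical. A symbolized slot is only a candidate: it still has to be confirmed by checking that a call instruction precedes that address.

```python
import re

# dps output from the dump above: slot address, stored value, optional symbol.
DPS_TEXT = """\
f78aecac  80a5d8fc hal!HalpIpiHandler+0xcc
f78aecb0  f78aecc0
f78aecb4  00000000
f78aecb8  00000002
f78aecbc  000000e1
f78aecc0  f78aed50
f78aecc4  f75d9ca2
f78aecc8  badb0d00
f78aeccc  000086a8
"""

# Two 8-digit hex columns, then an optional symbol such as module!Function+0xNN.
LINE_RE = re.compile(r"^([0-9a-f]{8})\s+([0-9a-f]{8})(?:\s+(\S+))?\s*$")

def return_address_candidates(dps_text):
    """Return (slot, value, symbol) for every slot the debugger symbolized."""
    hits = []
    for line in dps_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m and m.group(3):
            hits.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return hits

candidates = return_address_candidates(DPS_TEXT)
for slot, value, symbol in candidates:
    print(f"{slot:08x}: {value:08x} {symbol}")
```

In this dump only the top slot carries a symbol, which is why hal!HalpIpiHandler+0xcc stands out.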

Looking at the stack dump, we see that there may have been a call from HalpIpiHandler. Let’s dump the code leading up to hal!HalpIpiHandler+0xcc to see what it did.

In the above assembly, we can see that there is a call made using a pointer in the import table. Now, let’s have a look at that pointer.

1: kd> dd 80a5b020 l 1

80a5b020 8082e3e4

The pointer is very close to the instruction we trapped on. Is this a coincidence?

It looks like, due to an effective address calculation failure, the machine started executing at 8082e3e0 instead of 8082e3e4. Somewhere in the data path, the processor executing this instruction stream dropped bit three (the bit with value 4), turning a 4 into a 0.

1: kd> ?0y0100

Evaluate expression: 4 = 00000000`00000004

1: kd> ?0y0000

Evaluate expression: 0 = 00000000`00000000
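The same comparison can be done on the full addresses by XORing the intended call target with the address that was actually executed. This is a small sketch; the helper name differing_bits is an assumption made for illustration. Bit positions are counted from zero here, so the bit the text calls "bit three" (third from the right, value 4) is reported as bit 2.

```python
def differing_bits(expected, actual):
    """Return the zero-based positions of the bits that differ."""
    diff = expected ^ actual
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]

expected = 0x8082E3E4  # pointer stored at 80a5b020 (the intended call target)
actual = 0x8082E3E0    # address of the INT 3 that was executed

bits = differing_bits(expected, actual)
print(f"xor = {expected ^ actual:#x}, differing bits = {bits}")
# prints: xor = 0x4, differing bits = [2]
```

A result with exactly one differing bit is what makes a hardware bit flip a plausible explanation rather than a corrupted or overwritten pointer.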

What does all of this mean?

There is some circumstantial evidence here that the machine was in the IPI (inter-processor interrupt) handler. The IPI handler is used in multiprocessor systems so that one processor may interrupt another. So, how can we further prove this is where we were? Let’s try to match the trap frame registers to the assembly from HalpIpiHandler before it calls KiIpiServiceRoutine.

Let's view the IPI state using the debugger command !ipi. From this output we can see that processor 1 is the receiver of a cross interrupt from processor 0. This is consistent with the data we found on the stack.

1: kd> !ipi

IPI State for Processor 0

As a sender, awaiting packet completion from processor 1.

TargetSet             2  PacketBarrier         e  IpiFrozen      2 [Frozen]

IpiFrame       8089a570  SignalDone     00000000  RequestSummary 0

Packet State     Active  WorkerRoutine  nt!KiFlushTargetMultipleTb

Parameter[0]   00000000  Parameter[1]   80899f10  Parameter[2]   80899f04

IPI State for Processor 1

As a receiver, the following unhandled requests are pending: [Packet] [DPC]

TargetSet             0  PacketBarrier         0  IpiFrozen      0 [Running]

IpiFrame       b5ad7be8  SignalDone     ffdff120  RequestSummary 2 [DPC]

Packet State      Stale  WorkerRoutine  nt!KiFlushTargetMultipleTb

Parameter[0]   00000000  Parameter[1]   b5ad7950  Parameter[2]   b5ad7948

What went wrong?

Based on the evidence in this dump, it appears that the call instruction transferred execution to the wrong address. The machine ended up executing at address 8082e3e0 instead of 8082e3e4, a single-bit difference. This same bit was flipped in several crashes from this machine, so all the evidence pointed to a faulty processor. After we replaced the processor we were running on when we bugchecked, the issue did not occur again.

Hardware can sometimes cause pretty specific failures, such as the flipped bit we see here. To determine that this failure was a hardware issue, we had to reconstruct the execution path and trace how we ended up at the failing instruction. We were able to match the register contents to what they would have been before the call to KiIpiServiceRoutine. This demonstrated that the call should have been made to KiIpiServiceRoutine, but it unexpectedly went to the wrong address.

Thanks a lot for this; it reminds me of some older posts on this blog about hardware errors and how code execution changes. Could you please post something like this for the x64 architecture in the future (if you run into such a problem, of course)?

[Hi Miroslav. Thank you for your feedback. This type of issue (a bit flip) would look similar in x64. We do have some articles coming up that are x64.]