Saturday, February 22, 2014

I've been working on a 0x101 BSOD (located here), and I thought I'd go ahead and blog about it, even though it's not officially solved just yet. It's still interesting nonetheless, and I believe it makes good content for a post.

CLOCK_WATCHDOG_TIMEOUT (101)

This indicates that an expected clock interrupt on a secondary
processor, in a multi-processor system, was not received within the
allocated interval.

So there's the basic definition of this particular bug check. Let's get into the debugging now.

--------------------

BugCheck 101, {19, 0, fffff880009b2180, 4}

^^ 0x19 clock ticks is the timeout interval.

fffff880009b2180 is the PRCB address of the hung processor, let's keep this address in mind.

For reference, I did not do !prcb 0 through 7. That would have been very tedious. Instead, you can use !running -it. The "i" argument causes it to display idle processors too, and "t" displays the stack trace for the thread running on each processor. If we run that extension, it shows this is an 8-core box.

Hint: At times, the 4th parameter of the bug check will show you the responsible processor. For example, in our 0x101 here, it was correct, as the 4th parameter was 4.

Hint #2: You can also generally tell the number of cores on the box by checking the bugcheck string - BUGCHECK_STR: CLOCK_WATCHDOG_TIMEOUT_8_PROC

As this matches the 3rd parameter of the bug check, processor #4 is the responsible processor. With the information we have thus far, we know that processor #4 went 0x19 clock ticks without responding, therefore the system crashed. Before we go further, what is a clock tick? A clock interrupt is a form of interrupt that involves counting the cycles of the processor cores; it effectively runs a clock across the processors to keep them all in sync. A clock interrupt is sent out to all processors, and then they must report in. When one doesn't report in, you crash.

--------------------

Let's now look at the stacks of the different processors to see what the threads were involved in:

We can use knL and go through the grueling method of obtaining the trap frame manually, but we don't like having to put in more work, so let's use kv instead on Processor 0:

^^ Disassembling the first few instructions reveals a jump (jmp) back up into the nt!KxFlushEntireTb function. It appears that at the time of the bug check, the thread was executing a pause (a CPU delay hint) in a loop, waiting for a release.

So, what's the summary so far? Processor #0 was running the thread that raised the bug check itself, and it must have been interrupted by a clock interrupt in order to trigger the CLOCK_WATCHDOG_TIMEOUT bug check.

--------------------

Let's take a look into Processor #1's call stack like we did Processor #0:

^^ So it seems that we have the intelppm!MWaitIdle function. I've done some research and can't find documentation on it, although intelppm.sys is related to the processor and, I believe, its power configuration, power states, etc. (the name suggests the MWAIT instruction, which processors use to wait in a low-power idle state). Assuming "idle" implies what I believe it does, this may indicate that processor #1 was idle at the time of the crash, waiting for something.

^^ We have a zeroed stack + registers, so this will be problematic. Usually this occurs on the problem processor because the IRQL is too high, OR the processor was too hung at the time of the crash to report its information, etc. We will need to get the raw stack.

^^ Okay, so from that raw stack, we can see quite a few DirectX kernel & MMS calls, as well as nVidia driver calls. This is good news, as this may be our problem (it gives us a good start as far as troubleshooting goes). I'd like to note that there were many more calls than this, and that the raw stack went on for a very, very long time. I am just cutting it down to a small sample for blogging purposes.

Friday, February 21, 2014

There are generally two types of 0x9F bug checks that you'll see most in the wild:

1. 4th parameter containing the blocked IRP address.

-- In the case of #1, if !analyze -v doesn't show anything, and the stack is useless, you'd run !irp 123addresshere123 and it would mark the culprit driver with a '>'.

2. 1st parameter = 0x4, which implies that a power IRP has failed to synchronize with the PnP Manager.

-- In the case of #2, it's a bit different, so let's get into a recent debugging I did. I learned most of this from documentation provided by my good friend x BlueRobot, and other parts from viewing various developer forums.

We have an IRP list, and a little bit of a more informative stack text. We can see a few partmgr routines (the Partition Manager system driver).

** I've gone through a few 0x9F's in which dumping the !thread didn't provide an IRP list address. I'm not sure why this would be the case; my only guess is that not enough information was available at the time of the crash to obtain one.

What the issue was - What seemed to be occurring was that the DPC may have been looping by acquiring a spinlock at DPC level, cancelling the timer, and then finally releasing the spinlock again. This went on in a loop over and over, apparently caused by IntelliMemory. Removal of IntelliMemory solved the crashes.

I've of course solved many *133 bug checks in the past; however, this was the first time I was supplied a kernel dump for one, and I was finally knowledgeable enough, thanks to reading, experience, and documentation, to debug it in depth, and I successfully solved it at the same time.

I have also supplied Harry (x BlueRobot) this kernel dump, as we both wanted to write and learn about 0x133's in depth, so he has gone ahead and written a tutorial as well for it (which you can see here). Harry goes into pretty nice detail about DPCs, which is something I won't be doing here in my tutorial. I'll instead be focusing on what caused it and how it was solved.

--------------------

I've thankfully done most of the analysis in the thread I solved the crash in, however, I don't want to get lazy and will of course go into detail wherever I can. Let's get started:

DPC_WATCHDOG_VIOLATION (133)
The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL or above.
Arguments:
Arg1: 0000000000000000, A single DPC or ISR exceeded its time allotment. The offending component can usually be identified with a stack trace.
Arg2: 0000000000000501, The DPC time count (in ticks).
Arg3: 0000000000000500, The DPC time allotment (in ticks).

Here we have the basic bug check information. First off, the DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with 0x133 with parameter 1 of the bug check set to 0. In this case, the system's time limit for a single DPC will be in parameter 3, with the number of ticks taken by this DPC in parameter 2. Second, if the system cumulatively spends too long at or above DISPATCH_LEVEL, it will stop with parameter 1 set to 1.

1. The driver called the CancelSendsTimerDpc routine. I do not know exactly what this routine does; however, it's certainly something regarding a timer on and/or for a DPC (Deferred Procedure Call). According to Harry, the driver may be using a custom DPC associated with a timer object.

2. The driver then calls the KeSetTimer routine which sets the absolute or relative interval at which a timer object is to be set to a signaled state and, optionally, supplies a CustomTimerDpc routine to be executed when that interval expires.

3. The driver then calls the CancelSendsTimerDpc routine again. As far as I know, what should be happening here is that the CustomTimerDpc routine should be called, but CancelSendsTimerDpc may be stuck in a loop.

Overall, what seems to be occurring is that the DPC may be looping: acquiring a spinlock at DPC level, cancelling the timer, and then finally releasing the spinlock again. This happens over and over; therefore, we have a loop.

1. NDIS (Network Driver Interface Specification) routine calls. The Network Driver Interface Specification (NDIS) is an application programming interface (API) for network interface cards (NICs). The NDIS forms the Logical Link Control (LLC) sublayer, which is the upper sublayer of the OSI data link layer (layer 2). Therefore, the NDIS acts as the interface between the Media Access Control (MAC) sublayer, which is the lower sublayer of the data link layer, and the network layer (layer 3).

The NDIS is a library of functions often referred to as a "wrapper" that hides the underlying complexity of the NIC hardware and serves as a standard interface for level 3 network protocol drivers and hardware level MAC drivers. Another common LLC is the Open Data-Link Interface (ODI).

2. dxgkrnl.sys - DirectX Kernel.

--------------------

So, with all of this said, we know that something is causing usb80236.sys to call into a loop, and it may be anything that's working with and/or possibly interfering with Windows' networking, or DirectX. We'll need to do some detective work to determine what is causing this, as it's a system driver and is being faulted by something else. At this point, since we're at quite the wall, I recommended enabling Driver Verifier so we could see what's going on. The user enabled DV, and sure enough, they had a 0xC4 crash! Let's take a look:

DRIVER_VERIFIER_DETECTED_VIOLATION (c4)
A device driver attempting to corrupt the system has been caught. This is
because the driver was specified in the registry as being suspect (by the
administrator) and the kernel has enabled substantial checking of this driver.
If the driver attempts to corrupt the system, bugchecks 0xC4, 0xC1 and 0xA will
be among the most commonly seen crashes.
Arguments:
Arg1: 0000000000001011, Invariant MDL buffer contents for Read Irp were modified during dispatch or buffer backed by dummy pages.
Arg2: fffffa8006219060, Device object to which the Read IRP was issued.
Arg3: fffff980098a8c60, The address of the IRP.
Arg4: fffff8801a5b3000, System-Space Virtual Address for the buffer that the MDL describes.

Here we have the basic bug check info, with the 2nd/3rd parameter highlighted as they will be useful later on. Let's go ahead and take a look at the call stack first:

As we can see, we have many different file system related routines (Ntfs, FLTMGR, etc). Why? Well, as we move up the stack, we eventually see three intmsd.sys calls. This is the IntelliMemory Storage Filter Driver from Condusiv Technologies.

That's why we're seeing so many file system and storage related routines being called. After this was found, I recommended disabling and/or preferably uninstalling IntelliMemory. After uninstalling IntelliMemory, the crashes ceased. Why?

First off, IntelliMemory™ is an intelligent data caching technology that provides faster access to frequently used files. IntelliMemory is supposed to improve latency and throughput by reducing disk I/O requests, as active files are predictively cached within the server to preempt round trips between VMs and network storage.

Remember how we saw various network-related routines, etc., during the 0x133 debugging? Well, it's because IntelliMemory was the driver causing the loop.

This isn't really a tutorial, or how to debug, but just some information that came as quite a surprise to me that I thought I'd share!

Right, so a user shared their dump and I began to look, and it was the 0xDEADDEAD bug check. I had never actually seen this bug check outside of driver development, crash and/or hang troubleshooting, learning, etc., because those are generally the only things it's used for. However, today I learned that a driver can call KeBugCheckEx and pass it the code.

So here we have the basic bug check information. As far as I know, the parameters have no meaning for this bug check; there's no 'if parameter 1 = 3, it means x'. We can even see that WinDbg's description of the bug check is 'The user manually initiated this crash dump'.

From the call stack we can see that 0xdeaddead called into NETwNs64+0xa6fcf (Intel(R) Wireless WiFi Link 5000 Series Adapter Driver for Windows 7). Also, what's interesting: in the RetAddr portion of the call stack, we can see deaddead mentioned. I am going to assume this indicates that NETwNs64.sys went ahead and called nt!KeBugCheckEx and passed it the code. That's what gave us the 0xDEADDEAD bug check.

Wednesday, February 12, 2014

This will be my first among many debugging tutorials (aside from older ones)! I very much want to get back into writing tutorials for a few reasons, but the main one is that they are very fun, and I obviously learn more and more every day! Another thing about tutorials is that they are all over the web on various blogs, forums, etc., but they're written in many different styles; some contain more info, different methods of explaining, and so on. My goal with everything regarding debugging has always been, and will always be, to explain as much as my personal knowledge permits, and to do it in such a way that anyone who doesn't know how to debug can learn by reading and then performing it hands-on by themselves.

--------------------

Let's get started! We're going to start off with the *D1 bug check, but more specifically when NETIO.sys is the labeled fault of the crash. I've been debugging online on various forums for a little over two years now, and in the past few months to a year, I have seen a huge increase in NETIO.sys *D1's. I am going to tell you right now that NETIO.sys *D1 bug checks are caused 100% of the time, from what I have seen (and I have debugged and solved MANY NETIO.sys *D1's), by one of the following:

1. Network drivers themselves; whether they need to be updated, reinstalled due to corruption, rolled back due to a bug in the latest version, etc.

2. 3rd party antivirus or firewall software conflicting with Windows' networking.

(99% of the time #2 is the cause, and rarely have I seen #1, but it's of course possible.)

Right, so with all of this said, what's NETIO.sys? NETIO.sys is Microsoft Windows' Network I/O Subsystem.

First of all, Input and Output (I/O) is actually extremely in-depth and will not be explained in this blog post. If you would, however, like to read about it and learn (which I highly recommend), read the following from the MSDN website.

With this said, the basic definition (per msdn) for the *D1 bug check is the following:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
This indicates that a kernel-mode driver attempted to access pageable memory at a process IRQL that was too high.
A driver tried to access an address that is pageable (or that is completely invalid) while the IRQL was too high. This bug check is usually caused by drivers that have used improper addresses.

So, this is a fairly standard explanation for a person who understands how Windows' memory manager works. If you don't, however, you can kinda sorta get the gist of it, but at the same time it may not really mean much to you. Let's go into detail on the memory manager subsystem, because we're all about learning!

Windows' memory manager runs at IRQL 0 (PASSIVE_LEVEL), which is the level that normal threads run at. If, for example, a driver attempts to access memory that is not currently in RAM (paged out), this will cause an exception (thrown by the processor). When this exception happens, Windows' memory manager will catch it, fetch the memory from the hard disk, and then the processor will return to the driver that attempted to access this memory, which was paged out but is now resident.

Alright, great, so why do we get this bug check? *D1 occurs when a driver running at a higher IRQL attempts to access paged-out memory. This is not good (clearly), because when the driver attempts to access paged-out memory at IRQL n (I use n because there are different levels, but 2 is the most common, so from this point on I will use 2), Windows' memory manager would have to page the memory in, and paging in runs at IRQL 0. A page fault cannot be serviced at IRQL 2, so Windows' memory manager will bug check the system, as a deadlock would otherwise occur.

This can occur not only when a driver at a higher IRQL attempts to access paged-out memory, but also when a driver attempts to access an invalid memory address.

--------------------

Now that we have all of that said, let's move onto an example crash dump (just a random *D1 NETIO.sys dump from a user that I managed to dig up):

Right away we can see that the 2nd parameter (argument) of the *D1 bug check itself is 0000000000000002 (2), as I mentioned earlier. There are various other ways to display the parameters of a bug check.

We can see from the stack that we just have Windows' usual error handling and fault tolerance bug check related routines. No driver calls, etc. Very dead stack. Let's go ahead and refer to the FBID (failure bucket ID):

FAILURE_BUCKET_ID: X64_0xD1_NETIO!RtlCopyBufferToMdl+1f

We can see the fault of the crash is NETIO.sys calling into the RtlCopyBufferToMdl routine. I am not entirely sure what this routine does; however, just from knowing the acronyms...

Rtl = Run-Time Library.
Mdl = Memory Descriptor List.

I can imagine there's some sort of buffer being copied from a run-time library routine into an MDL. So, what does this mean to us? Well, nothing really. It's a minidump without very much information. All we know is that something is conflicting with NETIO.sys. Let's go ahead and take a look at the loaded modules list (Debug > Modules). Now, in NETIO.sys dumps, you are going to want to check for popular antivirus drivers. I would list them here, but there are so many; I think I'll add them over time. I will just go ahead and let you know that this specific dump contained ggc.sys, which is a driver in relation to Quick Heal AntiVirus.

So, there's ggc.sys. At this point, I recommended removal of Quick Heal and explained that it was likely causing network-related conflicts, which in turn crashed the system. After Quick Heal was removed, the crashes stopped.

--------------------

-- Today when I wake up I will add a list of antiviruses and firewalls that I have seen cause this bug check.