Today's post is a guest post from Eric Roman. He wrote a slightly
more expletive-laden version of this post inside Google and I asked
him if I could post it here. It serves as a good illustration of how
deploying software on more than a hundred million different users'
computers is a nearly-biological fight.

Understanding stability of Windows applications is really hard.

Recently, @apatrick has concluded a heroic investigation into one of
Chrome's most mysterious top crashes on Windows. He just committed a
"fix" for it on the canary channel, which seems to be working!

This debugging journey is pretty epic, so I'm giving a blow-by-blow
summary of it below. If you want to jump straight to the conclusion,
see r96807 for the spoiler (you can also read the
comments in bug 81449 for the full novella).

Prelude

Our story begins a year ago, when we first began tracking crashes in a
generic RunnableFunction<>::Run() location. These crashes had
established themselves as a top browser crash for Windows Chrome. At
the time, Huan and I looked into the issue but couldn't make heads or
tails of it (bug 54307). The basic format of the crash looked
something like this:
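
(Reconstructed for illustration; real dumps varied in the exact
addresses and frames.)

    Crash, EIP = 0x00000000

    0x00000000
    0x0012f8a4   (a stack address)
    chrome.dll!RunnableFunction<...>::Run()
    chrome.dll!MessageLoop::RunTask()
    chrome.dll!MessageLoop::DoWork()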

Above we can see that the crash is due to jumping to an instruction
pointer of 0. And judging by the top frames, it looks like there was
some stack corruption at work (notice how the first frame's alleged
return address is actually a stack location).

The callstack itself isn't terribly helpful though, since we can't
tell what code was actually running prior to the crash (it was gobbled
up by the stack corruption).

Moreover, the source location of RunnableFunction<>::Run() doesn't
help narrow things in the slightest, since pretty much all of Chrome's
code runs through this path (Chrome relies heavily on message passing
to post asynchronous tasks to another thread's message loop).

Really, all we know at this point is that "some task" got posted to
"some thread", and then crashed at "some point" after running this
task.

There is one interesting piece of information that can be inferred
from the minidumps: based on the crashed thread's index, it is likely
the crashing task was running on Chrome's "Child process launcher"
thread. In fact, subsequent instrumentation
(r58786) confirmed that all of these crashes were
happening on the child process launcher thread. Since there is very
little code that legitimately runs on this thread, we did a full
code-flow analysis of all the paths that could post tasks. But we did
not discover any problematic codepaths (I was hoping to stumble across
something bad like a use-after-free).

Without any extra leads (and with a temporary dip in the crash's
frequency lowering its priority), the mystery bug got pushed onto the
back burner.

It lay dormant for the next 9 months, waiting for a new champion to
take up arms.

Part II

In May, Al Patrick (an innocent bystander) is assigned the bug on
suspicion that it is a regression. (At this point the callstack has
morphed a bit due to various optimization ambiguities, but it is
essentially the same bug I had failed to solve earlier).

Our new hero starts off by adding some instrumentation trying to see
if a bad function pointer is ever being directly passed to a runnable
function (r85359). This doesn't turn up anything
salient.

Next he instruments posted tasks to retain the location where they got
posted from (r85991). This change is absolutely
brilliant! I have benefited from it many times since it was
introduced, using it to help debug crash dumps.

The instrumentation is cheap yet effective: whenever posting a task,
the FROM_HERE macro (which was already being used in debug builds to
save filename/line numbers) now also saves the instruction pointer into the
PendingTask. Later when the task is de-queued by the target thread's
message loop, this same instruction pointer (i.e. the birthplace of
the task) is pushed onto the stack prior to calling the task's virtual
function. That way, should a crash happen later, you can poach the
address of the birthplace off the stack during postmortem dump
analysis!
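
Here is a hypothetical, simplified sketch of the technique (the real
r85991 change differs in detail):

    // Hypothetical, simplified sketch -- not the actual Chrome code.
    #include <intrin.h>

    struct Task { virtual void Run() = 0; };

    // noinline, so that _ReturnAddress() yields the address of the call
    // site where FROM_HERE was written: the task's birthplace.
    __declspec(noinline) const void* GetProgramCounter() {
      return _ReturnAddress();
    }
    #define FROM_HERE GetProgramCounter()

    // What PostTask(FROM_HERE, task) enqueues.
    struct PendingTask {
      Task* task;
      const void* birth_program_counter;
    };

    // Called by the target thread's message loop when a task is de-queued.
    void RunPendingTask(const PendingTask& pending) {
      // Park the birthplace in a stack slot; "volatile" stops the compiler
      // from eliding it, so postmortem analysis can poach it off the stack.
      const void* volatile posted_from = pending.birth_program_counter;
      (void)posted_from;
      pending.task->Run();
    }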

This instrumentation reveals that the problem tasks were posted by
ChildProcessLauncher::Context::Terminate(), suggesting that the task
itself was a runnable function on
ChildProcessLauncher::Context::TerminateInternal().

So far so good, but we still have no idea why it is crashing.

Next, our hero instruments the base runnable function/method tasks in
Chrome to detect use-after-frees (r86447), as well
as other memory molestation (by preserving the value of the function
pointer in the minidump). This instrumentation reveals that not only
was the task alive and well at the time it was run, but the function
pointer was also untouched. This is a major breakthrough in the
investigation, since it tells us conclusively that whatever craziness
has happened, it occurred while executing
ChildProcessLauncher::Context::TerminateInternal().
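
For flavor, here is a hypothetical sketch of this style of
use-after-free check (the real r86447 change differs):

    // Hypothetical sketch -- not the real r86447 code. Stamp the pointer
    // on destruction and copy it to the stack before the call, so both a
    // use-after-free and an overwritten function pointer are visible in
    // the minidump.
    typedef void (*VoidFunction)();

    struct Task { virtual void Run() = 0; virtual ~Task() {} };

    class RunnableFunction : public Task {
     public:
      explicit RunnableFunction(VoidFunction function)
          : function_(function) {}
      virtual ~RunnableFunction() {
        // Distinctive marker: if Run() later sees this value, the task was
        // invoked after being deleted.
        function_ = reinterpret_cast<VoidFunction>(0xDEADBEEF);
      }
      virtual void Run() {
        // The stack copy survives into the minidump even if the heap
        // object gets stomped between here and the crash.
        VoidFunction volatile function = function_;
        function();
      }
     private:
      VoidFunction function_;
    };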

This is where things start to get weird. Looking at TerminateInternal
(the function that is blowing up) there is hardly any code:
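
(Reconstructed for illustration; the exact code at the time differed
in detail.)

    static void TerminateInternal(base::ProcessHandle handle) {
      // Kill the child and release our handle to it. Both calls compile
      // to indirect calls through the import address table, landing in
      // kernel32!TerminateProcess and kernel32!CloseHandle.
      ::TerminateProcess(handle, 0 /* exit code; actual value differed */);
      ::CloseHandle(handle);
    }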

So basically all we are doing is killing the process by calling a
handful of win32 API functions. Hmmm.

Al theorizes that someone may be hooking one of the winapi calls
(perhaps kernel32!TerminateProcess), and that whatever code it is
running in response to that call is responsible for the stack
corruption. For instance, if the hooker used the wrong calling
convention (not stdcall), that could corrupt our stack upon
return!
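
To make the theory concrete (this is purely illustrative; we never
recovered the hooker's actual code):

    // Hypothetical illustration of the theory. On x86, CloseHandle is
    // __stdcall: the *callee* pops its HANDLE argument before returning.
    #include <windows.h>

    typedef BOOL (WINAPI  *StdcallHook)(HANDLE);  // the real convention
    typedef BOOL (__cdecl *CdeclHook)(HANDLE);    // mismatched convention

    // If the import slot is patched to point at a __cdecl function, nobody
    // pops the argument: the callee doesn't (it's __cdecl), and our
    // compiled call site doesn't either (it expects __stdcall). esp is now
    // off by 4, and a later "ret" can pop a stack address -- or a 0 -- as
    // the return address, matching the crash signature above.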

So how is this guy "hooking" the API call?

There are a lot of different ways you could intercept Windows API
calls, and I am definitely not expert enough to explain them all. You could
for instance do things like directly patch the code in the system DLL
(in user land, or even on disk). Or re-write the binary to substitute
all the target function calls with your new one. But definitely the
simplest and most intuitive way to hook is to just patch the import
address table. (You can see how that works in the code above -- our
compiled code doesn't call directly into kernel32, but rather it jumps
to the address listed in the import table... that is the
address you would be patching if you wanted to intercept the call).
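
For concreteness, here is a minimal sketch of import address table
patching (illustrative only, with no error handling, and no relation
to whatever the actual hooker does):

    // Minimal IAT-patching sketch, for illustration only. Walks this
    // module's import descriptors for the given DLL and swaps the IAT
    // slot for one function, returning the original pointer.
    #include <windows.h>
    #include <string.h>

    void* PatchIAT(HMODULE module, const char* dll, const char* func,
                   void* replacement) {
      BYTE* base = reinterpret_cast<BYTE*>(module);
      IMAGE_NT_HEADERS* nt = reinterpret_cast<IMAGE_NT_HEADERS*>(
          base + reinterpret_cast<IMAGE_DOS_HEADER*>(base)->e_lfanew);
      IMAGE_DATA_DIRECTORY dir =
          nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT];
      IMAGE_IMPORT_DESCRIPTOR* imp =
          reinterpret_cast<IMAGE_IMPORT_DESCRIPTOR*>(base +
                                                     dir.VirtualAddress);
      for (; imp->Name; ++imp) {
        if (_stricmp(reinterpret_cast<char*>(base + imp->Name), dll) != 0)
          continue;
        // Parallel arrays: names (import name table) and addrs (the IAT).
        IMAGE_THUNK_DATA* names = reinterpret_cast<IMAGE_THUNK_DATA*>(
            base + imp->OriginalFirstThunk);
        IMAGE_THUNK_DATA* addrs = reinterpret_cast<IMAGE_THUNK_DATA*>(
            base + imp->FirstThunk);
        for (; names->u1.AddressOfData; ++names, ++addrs) {
          if (IMAGE_SNAP_BY_ORDINAL(names->u1.Ordinal))
            continue;
          IMAGE_IMPORT_BY_NAME* by_name =
              reinterpret_cast<IMAGE_IMPORT_BY_NAME*>(
                  base + names->u1.AddressOfData);
          if (strcmp(by_name->Name, func) != 0)
            continue;
          // Found the slot every compiled "call [__imp__func]" reads from.
          DWORD old_protect;
          VirtualProtect(&addrs->u1.Function, sizeof(addrs->u1.Function),
                         PAGE_READWRITE, &old_protect);
          void* original = reinterpret_cast<void*>(addrs->u1.Function);
          addrs->u1.Function = reinterpret_cast<ULONG_PTR>(replacement);
          VirtualProtect(&addrs->u1.Function, sizeof(addrs->u1.Function),
                         old_protect, &old_protect);
          return original;
        }
      }
      return NULL;
    }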

To this end, Al adds yet more instrumentation, this time to try and
detect if TerminateProcess is being hooked via the import address
table (r96266). Unfortunately this instrumentation
doesn't reveal any smoking gun yet.
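
For illustration, one cheap form such a check could take (simplified;
the actual r96266 instrumentation differs):

    // Simplified illustration -- not the real r96266 code. Assumes
    // kernel32's own export table is untouched.
    #include <windows.h>

    bool TerminateProcessLooksHooked() {
      FARPROC real = GetProcAddress(GetModuleHandleA("kernel32.dll"),
                                    "TerminateProcess");
      // With dllimport declarations, &TerminateProcess reads this module's
      // import-address-table slot -- i.e. whatever a hooker patched in.
      return reinterpret_cast<FARPROC>(&TerminateProcess) != real;
    }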

Ricardo makes a good observation -- a hook on CloseHandle() is
perhaps more likely than a hook on TerminateProcess, based on the
position of the 0 that appears on the stack (bug
81449). Plus, there is probably more value
in hooking CloseHandle than TerminateProcess.

Finally, Al commits a changelist that bypasses the address table
altogether for both CloseHandle and TerminateProcess, and instead
calls the underlying implementation in ntdll.dll directly:
r96807. This is not quite as efficient, but
that's not a huge deal given how infrequently these functions are called.
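
Sketched, the workaround has roughly this shape (the actual r96807
change differs in detail):

    // Sketch of the workaround's shape -- not the actual r96807 code.
    // Resolve the ntdll implementations at runtime, so the calls never go
    // through our module's (possibly patched) import address table.
    #include <windows.h>

    typedef LONG (WINAPI *NtTerminateProcessPtr)(HANDLE process,
                                                 LONG exit_status);
    typedef LONG (WINAPI *NtClosePtr)(HANDLE handle);

    BOOL TerminateProcessBypassingIAT(HANDLE process, UINT exit_code) {
      static NtTerminateProcessPtr nt_terminate =
          reinterpret_cast<NtTerminateProcessPtr>(GetProcAddress(
              GetModuleHandleA("ntdll.dll"), "NtTerminateProcess"));
      // NTSTATUS success codes are non-negative.
      return nt_terminate && nt_terminate(process, exit_code) >= 0;
    }

    BOOL CloseHandleBypassingIAT(HANDLE handle) {
      static NtClosePtr nt_close = reinterpret_cast<NtClosePtr>(
          GetProcAddress(GetModuleHandleA("ntdll.dll"), "NtClose"));
      return nt_close && nt_close(handle) >= 0;
    }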

This workaround appears to have thwarted the bad hooker, and so far
there hasn't been a single crash of this sort in the Windows Chrome
canary!

It remains unclear which of the two was being hooked or why. The
workaround is a pretty ridiculous thing to have to do, but the bypass
could shave off as much as 10% of our Windows browser crashes (yes,
this crash really was that high in some releases)! It is likely the
hooker is malware -- we could copy a fragment of the hooked code into
our minidump to try and learn more.

Ideally we want to alert the user about these sorts of problems,
rather than papering over them with workarounds (since they indicate a
real problem with the user's underlying system). But we don't have a good
mechanism to do that yet (see bug 72239 for
some proposals).

In summary, kudos to @apatrick for his excellent and persistent
debugging investigation over the past three months.

Also, this shows how powerful Chrome's canary channel is, since it
allows you to do this style of experimental debugging with very quick
turnarounds (mere days).