Sunday, February 10, 2013

Debugging a Resource Leak in WebKit

From time to time a really interesting bug comes up. Here's one that I investigated recently.

There have been multiple reports that in Chrome, pages using WebGL were holding on to GPU memory too long. The symptom was that as the page was reloaded, the WebGL canvas "de-rezzed", so that it got blockier and blockier as the page was refreshed. The code is actually designed to do this in certain circumstances. Chrome limits the amount of video memory it will dedicate to WebGL canvases' back buffers, and will allocate the canvas at lower resolution if necessary. However, it was clear that this limit was being hit too often. When navigating away from a page, the canvas and WebGL context should have been promptly garbage collected, and this wasn't happening in Chrome because of how the generational garbage collector in V8 worked. Recently, one fix was committed for this problem, which basically forced the JavaScript garbage collector to run after navigating away from a page containing WebGL. This mitigated the problem, but didn't really fix it.

My colleague Brandon Jones has been working on a better fix for this problem, which involves forcibly shutting down the WebGL context when navigating away from the page, rather than relying on the JavaScript garbage collector to clean up the Canvas element and its context. Since there had been multiple bug reports by people writing real-world applications, we wanted to create a regression test. One application in particular, Chaos to Perfection, triggered the problem repeatably, even with the hack in place to trigger the garbage collector. Brandon contacted the authors of the demo, studied the code, and created a reduced test case which demonstrated, incredibly, that creation of a Web Worker was causing the problem. The worker didn't even need to do anything; if it was just created and immediately exited, WebGL contexts started leaking. There was no good explanation for this behavior, and Brandon immediately filed another bug, which is the subject of this post.

For background, many of the C++ classes in WebKit implement a reference counting protocol through a few template base classes (RefCounted, ThreadSafeRefCounted, TreeShared) and smart pointer classes (RefPtr, PassRefPtr). All DOM nodes in WebKit are reference counted. Reference counting these objects simplifies their memory management. Unfortunately, one "leak" of a reference can keep an entire graph of objects alive.

Given that a particular object (the WebGLRenderingContext) was leaking, the question was: who was causing it to leak? Who was holding on to a stray reference either to this object, or to another object which pointed to it? It would be great if, for any reference-counted C++ object, we could ask the question, "who is holding a reference to this object?" Unfortunately, there is no way to answer this question after the fact. Once the reference has been taken (the ref() method has been called on the target object), the information about who took the reference is lost. WebKit's infrastructure can not be easily modified to walk backward from a given reference-counted object to the objects which have incremented its reference count. Virtual machines for garbage collected languages like JavaScript inherently keep track of this information, and provide tools to figure out why any given object is being kept alive. With WebKit's reference counting implementation, all that can be done is to watch the ref/deref calls as they happen, and try to figure out based on the absence of a deref call which object actually leaked.

The JavaScript engine adds another wrinkle to the reference counting mechanism. JavaScript can be thought of as a layer on top of the entire C++ object graph representing the DOM. When a given C++ object is exposed to the JavaScript engine, a "wrapper object" is created, and the C++ object's reference count is incremented. When the JavaScript wrapper object is garbage collected, the C++ object's reference count is decremented. JavaScript increments the reference count of the underlying C++ object at most once. Initially, it seemed that the JavaScript engine might somehow be holding on to the WebGL context longer than it should be. However, setting a breakpoint in CanvasRenderingContext::deref() showed that the wrapper object was being properly garbage collected, and that V8WebGLRenderingContext::derefObject was being called.

More debugging showed that the WebGLRenderingContext wasn't being destroyed because the HTMLCanvasElement which owned it had a non-null parent when the WebGL context was deref'ed for the last time. When the worker wasn't started, the HTMLCanvasElement didn't have a parent at that time, and it was destroyed, cleaning up the WebGL context along with it. The question then became: who was keeping the HTMLCanvasElement alive? Fortunately, the test case's DOM was simple, and a little logging proved that the HTMLDocument owning the canvas was never being destroyed if a worker was started. In other words, the entire DOM of the web page was leaking if a worker was ever used, even after the worker exited.

The next question was: who was keeping the HTMLDocument alive? This was a much harder question to answer. One might think to just set breakpoints in HTMLDocument::ref and deref, grab call stacks at each, and try to match them up. Unfortunately, one does not simply "set breakpoints in HTMLDocument::ref and deref".

HTMLDocument inherits its ref and deref methods through a template base class called TreeShared, and ref and deref are non-virtual. Therefore it isn't possible to simply override them in HTMLDocument. Because TreeShared is a template class, though, one might think that calling code might use RefPtr<HTMLDocument>, calling HTMLDocument::ref and deref directly rather than Node's ref and deref methods, and therefore that there might be some debugging information allowing a breakpoint to be set. However, attempting to set one in gdb resulted in the following output:

(There were 32 variants of this ref() call in total.) This would not work.

In order to be able to set a breakpoint when ref() was called against an HTMLDocument, I added a flag to the base class TreeShared to track whether the instance was actually an HTMLDocument, and set that flag in HTMLDocument's constructor. (In hindsight, ref() and deref() could have simply been made virtual in TreeShared, overridden in HTMLDocument, and a breakpoint set there. At the time, I had been concerned about changing the calling convention of these methods, but I hadn't realized that call sites which called ref() and deref() could not parameterize over completely unrelated types like TreeShared and RefCounted. For this reason, it would have been perfectly fine, and less code, to make TreeShared's ref and deref methods virtual.)

After making these code changes, I found that a lot of calls were being made to HTMLDocument::ref(). Examples included HTML parsing (WebCore::DocumentWriter::begin and WebCore::V8DOMWindowShell::updateDocument), WebCore::LiveNodeListBase::LiveNodeListBase (called from WebKit::WebDocument::forms and autofill::PasswordAutofillManager::SendPasswordForms), parsing scripts, sending events, and others. There were too many calls being made to see a difference between when a worker was being started and when it wasn't.

To try to whittle down the ref() calls to only the essential ones, I added a mechanism to prevent my new code from firing during the next call to ref(). This helped, but only marginally. The problem was that nested calls were made to ref() and deref(), and it was infeasible to instrument all of the code involved. Instead, I added a stack-allocated helper class to suppress the calling of the new code in nested fashion. This worked much better. (Think about this – it's basically an "anti-RefPtr".) Tediously, I examined all of the calls to HTMLDocument::ref in the debugger and suppressed those apparently unrelated to debugging the task at hand. This involved sprinkling the suppression calls in various places such as document parsing, script execution, and event dispatch.

After reducing the calls to ref() to a manageable number for the given WebGL-related test case, I examined the call stacks of each in gdb during a reload of the test case, both with and without the creation of a worker, hoping that there would be a difference when the worker was started, and that that would indicate the leak. Unfortunately, the calls were exactly the same. I had clearly suppressed too many of the calls to ref(), and missed the essential one causing the leak.

This was disappointing. A significant amount of time had already been invested to this point, involving many full rebuilds of WebKit and Chromium, and it wasn't clear which of the code changes so far had prevented catching the bug. Instead, I focused on calls to HTMLDocument::deref(), hoping that one of them would provide a clue about the leak. Luckily, after examining only a few call stacks, the following showed up:

Finally, this was a worker-related call to ref/deref. Looking at the related code, I found that WebCore's WorkerMessagingProxy class was being instantiated from Chromium-specific code: specifically, a class called WebWorkerClientImpl. Setting an appropriate breakpoint, I found that the stack trace constructing the WebWorkerClientImpl was as follows:

The reason this hadn't shown up in the previous stack traces of calls to HTMLDocument::ref was that I had added one key suppression which was too eager: suppressing calls to HTMLDocument::ref from within execution of script nodes. The worker was naturally being instantiated from JavaScript during execution of a script node, and by suppressing all calls to ref and deref from these script nodes, I'd accidentally hidden the cause of the resource leak. In hindsight, this should have been obvious; since the leak was happening when a worker was started, I should have been looking specifically for worker-related objects being construted which referred to the document.

Why was the leak happening? After examining the code, the answer was simple. Chromium-specific code wrapped the WebCore class WorkerContextProxy in order to implement both dedicated (in-process) and shared (out-of-process) workers. The WebWorkerClientImpl was a thin wrapper object which delegated most of its methods to the WorkerContextProxy instance. Unfortunately, the WorkerContextProxy managed its own lifetime. When both the main thread and the worker thread had exited, the WorkerContextProxy would delete itself. It did not send out any notifications when this occurred, so the Chromium-specific WebWorkerClientImpl wrapper object would live forever. The WebWorkerClientImpl contained a RefPtr to the HTMLDocument in the form of a RefPtr<WebCore::ScriptExecutionContext> to prevent the document from being destroyed before the worker it had started. This was the cause of the leak.

How was the leak fixed? The fix involved splitting up Chromium's implementations of dedicated and shared workers and a moderate amount of surgery to the class hierarchy to avoid creating a wrapper object for WorkerContextProxy in the case of dedicated workers. Refer to the WebKit bug for the patch implementing the final fix, in addition to the patch used to debug the problem.

I hope that this post provided some insight into techniques used to track down a moderately tricky resource leak in WebKit. If you have comments on how this diagnosis was done or suggestions on how it could have been done more efficiently, please leave them in the comments or send email.

5 comments:

Nice story! It really reminds us that sometimes being a programmer can really be a pain in the ***!!! But we still love it lol! masochists :P

One year ago I would probably have used a similar process to the one you used. But for that you're pretty much dependent on having a known simple offender use case and a more or less concrete idea of where to look. If you don't have that then this method is way too much time consuming and you'll easily go crazy! Too many re-compiles, debugs and it can get too messy at times if there's too much nesting.

A few months back a work college of mine showed me how to do logging the right way and how it can save you tons of hours!

1)Create a protocol of data to be considered important (timestamp, threadid, memory references, stack, context, etc) and create stubs for custom fields. This is the second most important step (see below).2)Insert the data logging mechanism in the base templates with private methods (important for override and it can also be very useful for memory trace and debugging) to log everything! At this step there's no need to worry about data amount or filters - we need it raw :D3)This can all be ignored at compile time with the correct usage of preprocessor compiler statements. 4)Use a proper tool to query the pile of data. This is actually the most important and tricky part. I was lucky enough that the company where I worked had an amazing proprietary tool tuned for our kind of programs that made the process of finding the problems amazingly fast!

In the end a proper old school debugging is still needed but at least now you don't need to go completely blind-folded. And you can also use the private logging functions to make the usual "hacky" stuff to help you narrow the last bits of the problem which I always prefer to maintain the code execution as "normal" as possible.

Of course that my process can't be applied/useful for all kinds of bugs/leaks tracking but for me it has been making wonders ;)

Thanks for your comment. Agree that when debugging large unknown problems, the best approach is to gather as much data as possible, and then sort through it. In this case, there was no primitive available to either print or grab a stack trace programmatically. That operation had to be done from the debugger. This made it impossible to log information related to the leak. It might have been possible to instrument the sites which incremented and decremented reference counts, but that would have not been enough information; the entire call stack was necessary to pinpoint which code had leaked the reference.

Hi, I am a newbie in WebKit & WebGL. I was wondering if I can get the call flow of WebGL inside Webkit. Which essential functions, files are involved in the call flow when we create a WebGL sample application targeting Webkit. Any ideas are most welcome!! Please help me out guys. I am totally freaked out as I have no idea from where to start and how to end!!