The Hardest Bugs in The World – Part One

Some bugs in software are extremely hard to track down. You keep trying to “kill” them and failing, for a long time. Much like the guys in this video.

The most difficult bugs in the world can take months to track down and isolate. While not always deadly, sometimes even when the bug's location is known (like George Bush in Washington DC), the fix itself can be extremely difficult.

Bugs are hard to solve when they are non-deterministic, non-reproducible, or outside your control. They happen only in the production system, only under load, only in certain places, or just on the customer's desktop.

Here are a few examples of elusive software bugs, their characteristics, and how to turn them into butterflies of code :).

Bugs in other people's code

We just found a bug in the way Firefox loads plug-ins. It appears that Firefox caches a plug-in DLL so it does not have to load it many times. While the feature is useful, there is a flaw in the caching mechanism that causes the Java plug-in to load instead of our plug-in for our object.

Why was it hard to find? The bug only happens under very specific conditions, and not all the time. It only happens on Firefox 3.6.3. It only happens when the plug-in is embedded with the HTML OBJECT tag and not the “SRC” attribute. It only happens when Java and our plug-in are loaded in the same page, in a certain order…

What did we do? We (which means Leeor) compiled Firefox and ran the source code under a debugger. Thank god for Open Source. If we had hit the same issue in IE, we would not have had much of a chance to solve it. We submitted the bug to Firefox, but since we needed a solution right away, we tracked down the root cause ourselves and found a workaround.

Memory leaks

Why was it hard to find? Memory leaks are extremely hard to find because they tend to be non-deterministic. When the memory of the process just keeps growing, it is relatively easy to find the problem's source. But in many runtimes the memory management and garbage collection have become so sophisticated that it is not trivial to know whether there is a leak at all. Memory goes up and down, or just stays still.

In theory, memory management problems were supposed to go away in Java, C#, and Python. While most of them have, the ones that remain are the hardest to solve. For a while we kept hunting an “Out Of Memory” problem in IronPython. In the good (?) old days of C++ we could have used a memory profiler to locate our lost memory chunks. In IronPython this is next to impossible, since the .Net objects are so mangled that it is not possible to correlate them back to the original language objects; therefore, traditional tools like Quantify have little value.
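In regular CPython (unlike IronPython), the standard library's tracemalloc module can at least point at the allocation sites that grow between two snapshots. A minimal sketch, with a deliberately leaky function standing in for real application code:

```python
import tracemalloc

def leaky(store):
    # Simulates a leak: objects keep accumulating in a long-lived container.
    store.append(bytearray(100_000))

tracemalloc.start()
store = []
before = tracemalloc.take_snapshot()
for _ in range(50):
    leaky(store)
after = tracemalloc.take_snapshot()

# Compare snapshots and show the allocation sites that grew the most;
# the bytearray line above should dominate the diff.
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)
```

This only narrows the hunt to a file and line, but that is often the hard part.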

What did we do? Sometimes the best resolution is to write a custom memory management library. We used this trick at Check Point when we needed to debug memory leaks in the kernel, where no standard tool works. This approach works especially well when the infrastructure is written from the ground up.
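The core idea of such a library can be sketched in a few lines of Python: route every allocation and release through one layer that keeps live counts per call site, so a leak shows up as a counter that never returns to zero. This is a toy illustration of the technique, not the Check Point kernel code; the call-site labels are made up:

```python
import threading
from collections import Counter

class TrackingAllocator:
    """Toy custom-allocation layer: every alloc and free goes through
    here, so the number of live objects per call site is always known."""

    def __init__(self):
        self._lock = threading.Lock()
        self._live = Counter()

    def alloc(self, site, size):
        with self._lock:
            self._live[site] += 1
        return bytearray(size)

    def free(self, site, obj):
        with self._lock:
            self._live[site] -= 1

    def report(self):
        # Any site with live objects left at shutdown is a leak suspect.
        return {site: n for site, n in self._live.items() if n > 0}

allocator = TrackingAllocator()
buf = allocator.alloc("parser.c:120", 4096)   # allocated and freed correctly
leak = allocator.alloc("cache.c:88", 4096)    # never freed
allocator.free("parser.c:120", buf)
print(allocator.report())  # {'cache.c:88': 1}
```

The same pattern works in C by wrapping malloc/free, which is essentially what a custom kernel memory library does.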

A similar approach can be used in Python, but the performance cost is too high to run it in production. Unfortunately, the memory leaks only happen in production…

If the problem can be reproduced with unit tests, life is a bit better. One innovative idea that Idan came up with is binary search over the code. Since we moved to Git, we can now change the past retroactively. In other words, we perform a binary search over the history to track down the commit in which the memory leak started: we “pretend” the unit test was written in the past and run it against old revisions. Using binary search, we can locate the exact commit in which the problem arose.
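Git can drive this search automatically with `git bisect run <your-test>`; the logic underneath is the plain binary search below. The commit list and the `has_leak` predicate are invented for illustration; in practice the predicate would check out the commit and run the unit test:

```python
def first_bad_commit(commits, has_leak):
    """Return the first commit (commits ordered oldest to newest) for
    which has_leak(commit) is True, assuming the leak, once introduced,
    stays present in every later commit."""
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if has_leak(commits[mid]):
            hi = mid        # leak already present: look earlier
        else:
            lo = mid + 1    # still clean: leak was introduced later
    return commits[lo]

# Hypothetical history in which the leak slipped in at commit "d4".
history = ["a1", "b2", "c3", "d4", "e5", "f6"]
print(first_bad_commit(history, lambda c: c >= "d4"))  # d4
```

With n commits, this takes only about log2(n) test runs, which is what makes “pretending the test was written in the past” practical.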

Another option is to look for the usual suspects – unmanaged code. In one case we used the NetApp SDK, written in C++, from our .Net code. It took three iterations to resolve all the memory leaks caused by their library. Pretty much a trial-and-error process.
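When the suspect is a native library, it helps to funnel every unmanaged resource through a wrapper that releases it deterministically, instead of trusting finalizers or the GC. A hedged Python/ctypes sketch of the pattern, with libc's malloc/free standing in for a vendor SDK:

```python
import ctypes
import ctypes.util

# Load the C runtime; malloc returns a pointer, so restype must be set.
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.free.argtypes = [ctypes.c_void_p]

class NativeBuffer:
    """Context manager guaranteeing the unmanaged allocation is
    released exactly once, even if the body raises."""

    def __init__(self, size):
        self._ptr = libc.malloc(size)
        if not self._ptr:
            raise MemoryError(size)

    def __enter__(self):
        return self._ptr

    def __exit__(self, *exc):
        libc.free(self._ptr)
        self._ptr = None
        return False  # never swallow exceptions

with NativeBuffer(1024) as ptr:
    print("got native pointer:", bool(ptr))
```

The .Net equivalent is wrapping the SDK's handles in IDisposable and `using` blocks, which turns “did we free everything?” into a local, reviewable question.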

3 Responses to “The Hardest Bugs in The World – Part One”

Yes, some bugs can be very hard to track down, but almost all of them can be tracked down, if not otherwise, then with a low-level (kernel) debugger. IMHO the two nastiest bugs are concurrency bugs (which you have almost zero chance of catching with a debugger and if you add some diagnostics code, they might very well not reproduce) and bugs on client systems caused by third party software you don’t control or know about (like an entry in the hosts file which prevents one of your javascript files from loading, or some window-manager tweaker which puts your modal dialog box behind the application window).

Memory leaks are quite straightforward (again, IMHO). And there is a simple pragmatic solution: just give it more memory and restart the servers from time to time. We do this with our HA financial servers. While we try very hard not to create buggy code :-), we realize that we (or the underlying libraries, for that matter) are not perfect, so we restart the servers daily. The servers run in an active-inactive pair configuration, so the restarting is seamless. Where possible, we also try to run some basic profiling in production, so that we have “real” data.

Interesting post.
Indeed, it’s very, very hard to find the root cause of bugs.

One of the ways to find the bugs, when you work in parallel with other people, is to find out which file versions have been merged, manually or automatically. These merges can lead to compilation errors (the easy case) and run-time errors (the hard case).
One of our solutions can find this automatically and report it to the users. They love it and tell us it can save hours of work.