Software Needs Seatbelts and Airbags

(This post is a draft version of an article slated to appear in ACM Queue.)

Finding and fixing bugs in deployed software is difficult and time-consuming: here are some alternatives.

Death, Taxes, and Bugs.

Like death and taxes, buggy code is an unfortunate fact of life. Nearly every program ships with known bugs, and probably all of them end up with bugs that are only discovered post-deployment. There are many reasons for this sad state of affairs.

Unsafe Languages.

One problem is that many applications are written in memory-unsafe languages. Variants of C, including C++ and Objective-C, are especially vulnerable to memory errors like buffer overflows and dangling pointers (use-after-free bugs). These bugs, which can lead to crashes, erroneous execution, and security vulnerabilities, are notoriously challenging to repair.

Safe Languages: No Panacea.

Writing new applications in memory-safe languages like Java instead of C/C++ would go some way towards mitigating these problems. For example, because Java uses garbage collection, Java programs are not susceptible to use-after-free bugs; similarly, because Java always performs bounds-checking, Java applications cannot suffer memory corruption due to buffer overflows.

That said, safe languages are no cure-all. Java programs still suffer from buffer overflows and null pointer dereferences, though they throw an exception as soon as they happen, unlike their C-based counterparts. The common recourse to these exceptions is to abort execution and print a stack trace (even to a web page!). Java is also just as vulnerable as any other language to concurrency errors like race conditions and deadlocks.

There are both practical and technical reasons not to use safe languages. First, it is generally not feasible to rewrite existing code because of the cost and time involved, not to mention the risk of introducing new bugs. Second, languages that rely on garbage collection are not a good fit for programs that need high performance or which make extensive use of available physical memory, since garbage collection always requires some extra memory [5]. These include OS-level services, database managers, search engines, and physics-based games.

Are Tools the Answer?

While tools can help, they too cannot catch all bugs. Static analyzers have made enormous strides in recent years, but many bugs remain out of reach. Rather than swamp developers with false positive reports, most modern static analyzers report far fewer bugs than they could. In other words, they trade false negatives (failing to report real bugs) for lower false positive rates. That makes these tools more usable, but also means that they will fail to report real bugs. Dawson Engler and his colleagues made exactly this choice for Coverity’s “unsound” static analyzer (see the Communications of the ACM article on their experiences: A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World [4].

Testing is Good but Not Enough.

The state of the art in testing tools has also advanced dramatically in the last decade. Randomized fuzz testing can be combined with static analysis to drive tests to explore paths that lead to failure. These tools are now in the mainstream: for example, Microsoft’s Driver Verifier can test device driver code for a wide variety of problems, and now includes randomized concurrency stress testing.

But as Dijkstra famously remarked, “Program testing can be used to show the presence of bugs, but never to show their absence!” At some point, testing will fail to turn up new bugs, which will unfortunately be discovered only once the software has shipped.

Fixing Bugs: Risky (and Slow) Business.

Finding the bugs is only the first step. Once a bug is found, whether by inspection, testing, or analysis, fixing it remains a challenge. Any bug fix must be undertaken with extreme care, since any new code runs the risk of introducing yet more bugs. Developers must construct and carefully test a patch to ensure that it fixes the bug without introducing any new ones. This process can be costly and time-consuming. For example, according to Symantec, the average time between the discovery of a remotely exploitable memory error and the release of a patch for enterprise applications is 28 days [11].

Cut Bait and Ship.

At some point, it simply stops making economic sense to fix certain bugs. Tracking their source is often difficult and time consuming, even when the full memory state and all inputs to the program are available. Obviously, show-stopper bugs must be fixed. For other bugs, the benefits of fixing them may be outweighed by the risks of creating new bugs and the costs in programmer time and in delayed deployment.

Debugging at a Distance.

Once that faulty software has been deployed, the problem of chasing down and repairing bugs becomes exponentially more difficult. Users rarely provide detailed bug reports that allow developers to reproduce the problem.

For deployed software on desktops or mobile devices, getting enough information to find a bug can be difficult. Sending an entire core file is generally impractical, especially over a mobile connection. Typically, the best one can hope for is some logging messages and a minidump consisting of a stack trace and information about thread contexts.

Even this limited information can provide valuable clues. If a particular function appears on many stack traces observed during crashes, then that function is a likely culprit. Microsoft Windows includes an application debugger (formerly “Watson”, now “Windows Error Reporting”) that is used to perform this sort of triage not only for Microsoft but also for third-party applications via Microsoft’s Winqual program. Google also has made available a cross-platform tool called Breakpad that can be used to provide similar services for any application.

However, for many bugs, the kind of information that these tools provide is of limited value. For example, memory corruption errors often trigger failures millions of instructions past the point of the actual error, making stack traces useless. The same is generally true for null dereference exceptions, where the error often happens long after the null pointer was stored.

Captain’s Log: Not Enough Information.

On servers, the situation is somewhat improved. Server applications typically generate log messages, which may contain clues to why a program failed. Unfortunately, log files can be unmanageably large. Poring over logs and trying to correlate them to the source code can be extremely time-consuming. Even worse, that work may yield no useful results because the logs are incomplete; that is, the logs simply may not provide enough information to narrow down the source of a particular error because there were not enough or the right kind of log messages. Recent work at Illinois and UC San Diego may lead to tools that address some of these problems; SherLog [12] automates the process of tracing back bugs from log messages to buggy source code paths, and LogEnhancer [13] automatically extends log messages with information to help post-crash debugging. More information on logging appears in a recent Queue article, Advances and Challenges in Log Analysis [10].

God Doesn’t Play Dice, but Your Computer Does.

Despite these advances, finding bugs has actually become harder than ever. Back when many programs were sequential it was already challenging to find bugs, but now the situation has become far worse. Multithreaded programs, asynchrony, and multiple cores are now a fact of life. Every execution of these non-deterministic programs is completely different from the last because of different timing of events and thread interleavings. This situation makes reproducing bugs impossible even with a complete log of all input events—something that would be too expensive to record in practice anyway.

Bumpers, Seatbelts and Airbags.

Let’s shift gears for a moment to talk about cars (we’ll get back to talking about software in a minute). As an analogy for the situation we find ourselves in, consider when cars first entered onto the scene. For years, safety was an after-thought at best. When designing new cars, the primary considerations were aesthetics and high performance (think tailfins and V-8 engines).

Eventually, traffic fatalities led legislators and car manufacturers to take safety into account. Seatbelts became required standard equipment in US cars in the late 1960s, bumpers in the 1970s, and airbags in the 1980s. Modern cars incorporate a wide range of safety features, including laminated windshields, crumple zones, and anti-lock braking systems. It is now practically unthinkable that anyone would ship a car without these essential safety features.

However, we routinely ship software with no safety measures of any kind. We are in a position similar to that of the automobile industry of the 1950s, delivering software with lots of horsepower and tailfins, complete with decorative spikes on the steering column to make sure that the user will suffer if their application crashes.

Drunk-Driving Through a Minefield.

The potent cocktail of manual memory management mixed with unchecked memory accesses makes C and C++ applications susceptible to a wide range of memory errors. These errors can cause programs to crash or produce incorrect results. Attackers are also frequently able to exploit these memory errors to gain unauthorized access to systems. Since the vast majority of objects accessed by applications are on the heap, heap-related errors are especially serious.

Numerous memory errors happen when programs incorrectly free objects. Dangling pointers arise when a heap object is freed while it is still live, leading to use-after-free bugs. Invalid frees happen when a program deallocates an object that was never returned by the allocator by inadvertently freeing a stack object or an address in the middle of a heap object. Double frees are when a heap object is deallocated multiple times without an intervening allocation. This error may at first glance seem innocuous but, in many cases, leads to heap corruption or program termination.

Other memory errors have to do with the use of allocated objects. When an object is allocated with a size that is not large enough, an out-of-bound error can occur when the memory address to be read or written lies outside the object. Out-of-bound writes are also known as buffer overflows. Uninitialized reads happen when a program reads memory that has never been initialized; in many cases, uninitialized memory contains data from previously-allocated objects.

Airbags for Your Applications.

Given that we know we will be shipping software with bugs and that the terrain is dangerous, it might make sense to equip it with seatbelts and airbags. What we’d like is to have both resilience and prompt corrective action for any problem that surfaces in our deployed applications.

Let’s focus on C/C++/Objective-C applications—the lion’s share of applications running on servers, desktops, and mobile platforms—and memory errors, the number one headache for applications written in these languages. Safety-equipped memory allocators can play a crucial role in helping to protect your software against crashes.

The Garbage Collection Safety Net.

Dealing with the first class of errors—those that happen because of the misuse of free or delete—can be remedied directly by using garbage collection. Garbage collection works by only reclaiming objects that it allocated, eliminating invalid frees. It only reclaims objects once there is no way to reach those objects anymore by traversing pointers from the “roots”: the globals and the stack. That eliminates dangling pointer errors, since by definition there can’t be any pointers around to reclaimed objects. Since it naturally only reclaims these objects once, a garbage collector also eliminates double frees.

While C and C++ were not designed with garbage collection in mind, it is possible to plug in a “conservative” garbage collector and entirely prevent free-related errors. The word “conservative” here means that because the garbage collector doesn’t necessarily know what values are pointers (since we are in C-land), it conservatively assumes that if a value looks like a pointer (it is in the right range and properly aligned), and it acts like a pointer (it only points to valid objects), then it may be a pointer.

Slipping Through the Net.

While garbage collectors eliminate free-related errors, they cannot help prevent the second class of memory errors: those that have to do with the misuse of allocated objects such as buffer overflows.

Runtime systems that can find buffer overflows often impose staggeringly high overheads, making them not particularly suitable for deployed code. Tools like Valgrind’s MemCheck are incredibly comprehensive and useful, but are heavyweight by design and slow execution by orders of magnitude [7].

Compiler-based approaches can reduce overhead substantially by avoiding unnecessary checks, though they entail recompiling all of an application’s code, including libraries. Google has recently made available AddressSanitizer, a combination of compiler and runtime technology that can find a number of bugs, including overflows and use-after-free bugs. While it is much faster than Valgrind, its overhead remains relatively high (around 75%), making it primarily useful for testing.

Your Program Has Encountered An Error. Goodbye, Cruel World.

All of these approaches are based on the idea that the best thing to do upon encountering an error is to abort immediately. This fail-stop behavior is certainly desirable in testing. However, it is not usually what your users want. Most application programs are not safety-critical systems, and aborting them in midstream can be an unpleasant experience for users.

Suppose you have been working on a Microsoft Word document for hours (and for some mysterious reason, auto-save has not been turned on). If Microsoft Word suddenly discovers that some error has occurred, what should it do? It could just pop up the window indicating that something terrible has happened and would you like it to send a note home to Redmond. That might be the best thing to do from a debugging perspective, but most people would prefer that Word do its damndest to save the current document rather than fall on its sword if it discovers a possible error. In short, users generally would prefer that their applications be fault tolerant whenever possible.

Bohr versus Heisenberg.

In fact, the exact behavior users do not want is for an error to happen consistently and repeatably. In his classic 1985 article “Why do computers stop and what can be done about it”, Jim Gray drew a distinction between two kinds of bugs. The first kind are bugs that behave predictably and repeatably—that is, ones that occur every time that the program encounters the same inputs and goes through the same sequence of steps. These are Bohr bugs, by analogy with the classical atomic model where electrons circle around the nucleus in planetary-like orbits. Bohr bugs are great when debugging a program, since it makes it easier to reproduce the bug and find its root cause.

The second kind of bugs are Heisenbugs, meant to connote the inherit uncertainty in quantum mechanics, which are unpredictable and cannot be reliably reproduced. The most common Heisenbugs these days are concurrency errors, a.k.a. race conditions, which depend on the order and timing of scheduling events to appear. Heisenbugs are also often sensitive to the observer effect; attempts to find the bug by inserting debugging code or running in a debugger often disrupt the sequence of events that led to the bug, making it go away.

Jim Gray makes the point that while Bohr bugs are great for debugging, what users want are Heisenbugs. Why? Because a Bohr bug is a showstopper for the user: every time the user does the same thing, they will encounter the same bug. But with Heisenbugs, the bugs often go away when you run the program again. If a program crashes, and the problem is a Heisenbug, then running the program again is likely to work. This is a perfect match for the way users already behave on the Web. If they go to a web page and it fails to respond, they just click “refresh” and that usually solves the problem.

So one way we can make life better for users is to convert Bohr bugs into Heisenbugs, if we can figure out how to do that.

Defensive Driving with DieHard.

My graduate students at the University of Massachusetts Amherst and I, in collaboration with my colleague Ben Zorn at Microsoft Research, have been working for the past few years on ways to protect programs from bugs. The first fruit of that research is a system called DieHard that makes memory errors less likely to impact users. DieHard eliminates some errors entirely and converts the others into (rare) Heisenbugs.

To explain how DieHard works, let’s go back to the car analogy. One way to make it less likely for cars to crash into each other is for them to be spaced further apart, providing adequate braking distance in case something goes wrong. DieHard provides this “defensive driving” by taking over all memory management operations and allocating objects in a space larger than required.

This de facto padding increases the odds that a small overflow will end up in un- allocated space where it can do no harm. However, DieHard doesn’t just add a fixed amount of padding between objects. That would provide great protection against overflows that are small enough, and zero protection against the others. In other words, those overflows would still be Bohr bugs.

Instead, DieHard provides probabilistic memory safety by randomly allocating objects on the heap. DieHard adaptively sizes its heap be a bit larger than the maximum needed by the application; the default is 1/3 [2, 3]. DieHard allocates memory from increasingly large chunks that we call miniheaps.

By randomly allocating objects across all the miniheaps (see this diagram for a detailed view), DieHard makes many memory overflows benign, with a probability that naturally declines as the overflow increases in size and as the heap becomes full. The effect is that, in most cases when running with DieHard, a small overflow is likely to have no effect.

DieHard’s random allocation approach also reduces the likelihood of the free-related errors that garbage collection addresses. DieHard uses bitmaps, stored outside the heap, to track allocated memory. A bit set to ’1’ indicates that a given block is in use, and ’0’ that it is available.

This use of bitmaps to manage memory eliminates the risk of double frees, since resetting a bit to zero twice is the same as resetting in once. Keeping the heap metadata separate from the data in the heap makes it impossible to inadvertently corrupt the heap itself.

Most importantly, DieHard drastically reduces the risk of dangling pointer errors, which effectively go away. If the heap has one million freed objects, the chances that you will immediately reuse one that was just freed is literally one in a million. Contrast this with most allocators, which immediately reclaim freed objects. With DieHard, even after 10,000 reallocations, there is still a 99% chance that the dangled object will not be reused.

Because it performs its allocation in (amortized) constant time, DieHard can provide added safety with very little additional cost in performance. For example, using it in a browser results in no perceivable performance impact.

Tolerating Faults FTW with FTH.

At Microsoft Research, tests with a variant of DieHard resolved about 30% of all bugs in the Microsoft Office database, while having no perceivable impact on performance. Beginning with Windows 7, Microsoft Windows now ships with a Fault-Tolerant Heap (FTH) that was directly inspired by DieHard. 8 Normally, applications use the default heap, but after a program crashes more than a certain number of times, the Fault-Tolerant Heap takes over. Like DieHard, the Fault-Tolerant Heap manages heap metadata separately from the heap. It also adds padding and delays allocations, though it does not provide DieHard’s probabilistic fault tolerance because it does not randomize allocations or deallocations. The Fault-Tolerant Heap approach is especially attractive because it acts like an airbag: effectively invisible and cost-free when everything is fine, but providing protection when they need it.

Exterminating the Bugs.

Tolerating bugs is one way to improve the effective quality of deployed software. It would be even better if somehow the software could not only tolerate faults but also correct them. A follow-on to DieHard, called Exterminator, does exactly that [8, 9]. Exterminator uses a version of DieHard extended to detect errors, and uses statistical inference to compute what kind of error happened and where the error occurred. Exterminator not only can send this information back to programmers for them to repair the software, but it also automatically corrects the errors via runtime patches. For example, if it detects that a certain object was responsible for a buffer overflow of 8 bytes, it will always allocate such objects (distinguished by their call site and size) with an 8-byte pad. Exterminator can learn from the results of multiple runs or multiple users, so it could be used to proactively push out patches to prevent other users from experiencing errors it has already detected elsewhere.

The Future: Safer, Self-Repairing Software.

My group and others (notably Martin Rinard at MIT, Vikram Adve at Illinois, Yuanyuan Zhou at UC-San Diego, Shan Lu at Wisconsin, and Luis Ceze and Dan Grossman at Washington) have made great strides in building safety systems for other classes of errors. We have recently published work on systems that prevent concurrency errors, some of which we can eliminate automatically. Grace is a runtime system that eliminates concurrency errors for concurrent programs that use “fork-join” parallelism. It hijacks the threads library, converting threads to processes “under the hood”, and uses virtual memory mapping and protection to enforce behavior that gives the illusion of a sequential execution, even on a multicore processor [1]. Dthreads (“Deterministic Threads”) is a full replacement for the POSIX threads library that enforces deterministic execution for multithreaded code [6]. In other words, a multithreaded program running with dthreads never has races; every execution with the same inputs generates the same outputs.

We look forward to a day in the not too distant future when such safer runtime systems are the norm. Just as we can now barely imagine cars without their myriad of safety features, we are finally adopting a similar philosophy for software. Buggy software is inevitable, and when possible, we should deploy safety systems that reduce their impact on users.