Microsoft's Secret Bug Squasher

Simson Garfinkel
11.10.05

It turns out that a good portion of all those Windows crashes over the years are not caused by the operating system itself, but by buggy device drivers -- low-level pieces of code that allow the operating system to communicate with external devices like the computer's keyboard, hard drive, screens and network cards.

Because device drivers run deep within the operating system, they are hard to write and hard to debug. And when they fail, they can take down the whole computer. "If they go bad, the whole OS can go bad," says Tom Ball, a scientist at Microsoft Research.

But in a little-noticed project percolating in Redmond, the world's biggest single producer of software bugs is pushing the envelope on an anti-bug technology that promises to make the Windows operating system a whole lot more reliable, and may eventually raise the bar for dependable software throughout the industry.

Microsoft has developed a tool called the Static Driver Verifier, or SDV, that uses "model checking" to analyze the source code for Windows device drivers and check whether the code that the programmer wrote matches a mathematical model of what a Windows device driver should actually do. If the driver doesn't match the model, the SDV warns that the driver might contain a bug.
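To give a feel for the idea, here is a minimal sketch in Python. It is not the SDV's actual algorithm or rule language. The locking rule, the toy control-flow graphs and every name in it (RULE, check, BUGGY, FIXED) are assumptions made up for illustration. The sketch walks every reachable combination of program location and rule state, and reports the first path on which the toy driver breaks the rule.

```python
from collections import deque

# Hypothetical rule: a lock must not be acquired twice in a row,
# and must not still be held when the driver routine returns.
RULE = {
    ("unlocked", "acquire"): "locked",
    ("locked", "release"): "unlocked",
}

def check(cfg, entry, exits):
    """Explore every (node, rule_state) pair reachable from entry.

    cfg maps node -> list of (action, next_node); action is "acquire",
    "release", or None for code the rule does not care about.
    Returns a list describing a violating path, or None if none exists.
    """
    start = (entry, "unlocked")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (node, state), trace = queue.popleft()
        if node in exits and state == "locked":
            return trace + ["return with lock held"]
        for action, nxt in cfg.get(node, []):
            if action is None:
                new_state = state
            else:
                new_state = RULE.get((state, action))
                if new_state is None:
                    return trace + [f"{action} while {state}"]
            key = (nxt, new_state)
            if key not in seen:
                seen.add(key)
                queue.append((key, trace + ([action] if action else [])))
    return None

# Toy driver routine: the error branch returns without releasing the lock.
BUGGY = {
    "start": [("acquire", "work")],
    "work": [(None, "ok"), (None, "error")],
    "ok": [("release", "done")],
    "error": [(None, "done")],
}
# Same routine with the error branch repaired.
FIXED = dict(BUGGY, error=[("release", "done")])

print(check(BUGGY, "start", {"done"}))
print(check(FIXED, "start", {"done"}))
```

The first call reports the path that returns with the lock still held. The second finds nothing to report. A real driver has far too many states to enumerate this naively, which is why a production tool has to work on a simplified abstraction of the code.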

It's a deceptively simple-sounding breakthrough that encapsulates some remarkable software engineering theory. When Bill Gates announced that the technology was under development at the 2002 Windows Engineering Conference, he called it "the holy grail of computer science" -- a description that does not overstate the tool's significance. "Now in some very key areas -- for example, driver verification -- we're building tools that can do actual proof about the software and how it works in order to guarantee the reliability," said Gates.

Three years later, the technology has moved out of Microsoft Research, where it was developed as part of a project called Slam, and into developers' hands. And while the SDV can't prove that a program will execute correctly, it can find many errors that previously snuck through the development process and went on to vex customers.

"It finds bugs via static analysis (compile-time analysis) of the code rather than run-time analysis," says Ball, who led the Slam project with Sriram Rajamani. A typical driver can be checked in a few minutes, but some complex drivers can take as long as 20 minutes to analyze. "Most of the bugs it finds are real, but because it is a static analysis, it can report false bugs."
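One way a static analysis can report a false bug is sketched below in Python. This is an illustration under assumed names, not a description of how the SDV reasons. The hypothetical routine acquires a lock in one branch and releases it in a later branch, and both branches test the same flag. An analysis that treats the two branches as independent choices sees a path that can never actually run.

```python
from itertools import product

# Hypothetical routine with two branches guarded by the same flag:
#   if flag: acquire()
#   ...
#   if flag: release()
def lock_held_at_return(takes_first, takes_second):
    held = False
    if takes_first:
        held = True    # acquire
    if takes_second:
        held = False   # release
    return held

# Coarse static view: treat the two branches as independent choices.
abstract_paths = list(product([False, True], repeat=2))
warnings = [p for p in abstract_paths if lock_held_at_return(*p)]

# Real executions: both branches test the same flag, so they always agree.
feasible = [(flag, flag) for flag in (False, True)]
real_bugs = [p for p in feasible if lock_held_at_return(*p)]

print(warnings)
print(real_bugs)
```

The coarse view flags the path that takes the first branch and skips the second. That path is impossible in practice, so the warning is a false bug. Tracking the flag's value removes the warning, at the cost of more analysis work.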

Drivers are usually written by hardware makers -- companies that invariably have less experience writing code than Microsoft. Because drivers have a limited set of operations that they perform -- such as moving a packet from a computer's ethernet card to the machine's memory -- they are amenable to formal analysis, says George Avrunin, a professor at the University of Massachusetts who is familiar with Slam.

"A lot of the problems with drivers seem to have to do with things like whether the right interrupts are masked before a certain operation, etc., and this kind of checking can detect errors of that sort," Avrunin says. "Moreover, although the driver code can be quite complicated, the basic structure of what it's doing is usually fairly simple -- and there's no concurrency -- so it's possible to do this kind of automated abstraction."

Despite its success with device drivers, model checking isn't a panacea. The technique can't work for complicated programs like Microsoft Word or Adobe Acrobat, because it's too hard to create a formal specification of what these programs should actually be doing.

Instead, for large-scale applications, some experts are turning to special program checkers that scan source code looking for common mistakes.

Prexis from Ounce Labs is one such automated code analyzer. "We don't just look for bugs," says Jack Danahy, the company's founder and CTO. Prexis also looks for design errors that lead to security problems -- for example, a program that sends sensitive information over a network without first encrypting that information.
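A toy version of that kind of check can be sketched in a few lines of Python using the standard ast module. Nothing here reflects how Prexis works internally. The function names read_password, encrypt and send are hypothetical stand-ins for whatever a real program would call, and the scanner only understands simple top-level statements.

```python
import ast

# Hypothetical program under inspection; read_password, encrypt and send
# are made-up names used only for this illustration.
SOURCE = '''
secret = read_password()
send(secret)
safe = encrypt(secret)
send(safe)
'''

def find_plaintext_sends(source):
    """Return (line, variable) pairs where sensitive data reaches send()."""
    sensitive = set()
    findings = []
    for stmt in ast.parse(source).body:
        # name = some_call(...): track whether name now holds sensitive data
        if (isinstance(stmt, ast.Assign)
                and isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)):
            target = stmt.targets[0]
            if isinstance(target, ast.Name):
                if stmt.value.func.id == "read_password":
                    sensitive.add(target.id)
                else:
                    sensitive.discard(target.id)
        # send(name) where name is still marked sensitive: report it
        elif (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)
                and stmt.value.func.id == "send"):
            for arg in stmt.value.args:
                if isinstance(arg, ast.Name) and arg.id in sensitive:
                    findings.append((stmt.lineno, arg.id))
    return findings

print(find_plaintext_sends(SOURCE))
```

The scanner flags the first send, which passes the password along unencrypted, and stays quiet about the second. The output is a short list of suspect lines rather than a verdict on the whole program.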

Automated tools are especially useful when companies outsource programming to outside vendors, says Danahy: "Most of the security risks that people face are not just bugs -- they're decisions that programmers made."

What programs like Prexis do is find the code that those decisions produced and bring it to the attention of managers and quality-assurance professionals. These reports are much easier to audit than the code itself. And if the reports can be audited, there is a chance -- however small -- that the programmers might actually be forced to fix their bugs before customers are bitten.