eXplode: Effective Model Checking of Real Systems

We are developing an approach called in-situ model checking to thoroughly check general systems software in a lightweight manner. The approach is easy enough to use that we have applied it to more than 20 widely used, well-tested systems and found nearly a hundred serious errors. Some of these errors can cause unrecoverable loss of an entire file system. We are currently collaborating with researchers at Microsoft Research to apply the eXplode approach to distributed systems.

Our goal is to detect corner-case bugs in real systems effectively. The main technique we use is model checking. This formal verification technique systematically enumerates the possible execution paths of a system by starting from an initial state and repeatedly performing all possible actions on this state and its successors. This state-space exploration makes rare actions such as machine crashes and network failures appear as often as common ones, thereby quickly driving a system into corner cases where subtle bugs surface.
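
The exploration loop described above can be illustrated with a minimal sketch. It is not eXplode's implementation: the `explore` function, the toy write-back cache, and its deliberately buggy `fsync` are all hypothetical, chosen only to show how treating a crash as one more action exposes a bug that ordinary operation hides.

```python
from collections import deque

def explore(initial, actions, invariant):
    """Breadth-first exploration of every state reachable from `initial`.

    actions(state) yields (label, next_state) pairs; invariant(state) returns
    True when the state is acceptable. Returns the action trace leading to the
    first violating state found, or None if no reachable state violates it.
    """
    seen = {initial}
    queue = deque([(initial, [])])
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for label, nxt in actions(state):
            if nxt not in seen:          # skip states already visited
                seen.add(nxt)
                queue.append((nxt, trace + [label]))
    return None

# Hypothetical toy system: state = (cache, disk, promised), where `promised`
# is the newest write the system has told the user is durable.
MAX_WRITES = 3  # bound that keeps the toy state space finite

def cache_actions(state, buggy=True):
    cache, disk, promised = state
    if cache < MAX_WRITES:
        yield "write", (cache + 1, disk, promised)
    # Assumed bug: fsync persists at most one pending write but promises all.
    flushed = min(cache, disk + 1) if buggy else cache
    yield "fsync", (cache, flushed, cache)
    # A crash discards volatile state; the cache is reloaded from disk.
    yield "crash", (disk, disk, promised)

def no_promised_data_lost(state):
    cache, _disk, promised = state
    return cache >= promised

buggy_trace = explore((0, 0, 0), cache_actions, no_promised_data_lost)
fixed_trace = explore((0, 0, 0),
                      lambda s: cache_actions(s, buggy=False),
                      no_promised_data_lost)
```

In this toy model `buggy_trace` is `["write", "write", "fsync", "crash"]`: the violation only becomes visible once the crash action is taken, and the breadth-first order returns a shortest such trace. With the corrected `fsync`, `fixed_trace` is `None`.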

However, naive application of model checking to large systems is prohibitive because of the huge cost required to write an abstract specification of the checked system. Recent work has developed implementation-level model checkers that check code directly, but these checkers still require an invasive, heavyweight port of the checked system to run inside these model checkers.

Instead of taking the heavyweight route of creating a simulated environment in which to run a system, we created an in-situ checking architecture that interlaces the mechanism for comprehensive checking within the checked system. This architecture enables users to check live systems, drastically reducing the overhead and the invasive infrastructure needed. For example, in eXplode (our storage system checker), the checking infrastructure is reduced to a single device driver that can be dynamically loaded into a running OS kernel. These drivers are fairly small and easy to write; the Linux and FreeBSD drivers of eXplode are each fewer than two thousand lines of code. Once such a driver is loaded, eXplode can readily check any storage system that runs inside or above the OS kernel. Compared with the weeks or months needed to check a system using the old approaches, eXplode requires only minutes, a reduction of several orders of magnitude.

Using this approach, we have found numerous serious errors in 20 widely used, well-tested systems. For example, we found data-loss errors that can vaporize all user data in 17 storage systems, including three version control systems, a database, Linux NFS, ten local file systems, a software RAID, and a popular commercial virtual machine.

Implementation-level software model checking explores the state space of a system implementation directly to find potential software defects without requiring any specification or modeling. Despite early successes, the effectiveness of this approach remains severely constrained due to poor scalability caused by state-space explosion. DeMeter makes software model checking more practical with the following contributions: (i) proposing dynamic interface reduction, a new state-space reduction technique, (ii) introducing a framework that enables dynamic interface reduction in an existing model checker with a reasonable amount of effort, and (iii) providing the framework with a distributed runtime engine that supports parallel distributed model checking.

We have integrated DeMeter into two existing model checkers, MaceMC and MoDist, each involving changes of around 1,000 lines of code. Compared to the original MaceMC and MoDist model checkers, our experiments have shown state-space reduction from a factor of five to up to five orders of magnitude in representative distributed applications such as Paxos, Berkeley DB, Chord, and Pastry. As a result, when applied to a deployed Paxos implementation, which has been running in production data centers for years to manage tens of thousands of machines, DeMeter manages to explore completely a logically meaningful state space that covers both phases of the Paxos protocol, offering higher assurance of software reliability that was not possible before.

MoDist is the first model checker designed for transparently checking unmodified distributed systems running on unmodified operating systems. It achieves this transparency via a novel architecture: a thin interposition layer exposes all actions in a distributed system and a centralized, OS-independent model checking engine explores these actions systematically. We made MoDist practical through three techniques: an execution engine to simulate consistent, deterministic executions and failures; a virtual clock mechanism to avoid false positives and false negatives; and a state exploration framework to incorporate heuristics for efficient error detection.
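
As a rough illustration of what a centralized engine exploring intercepted actions systematically might look like, the toy sketch below enumerates every message delivery order for a set of replicas and reports the first schedule in which they diverge. It is not MoDist's engine or interface: `run_replica`, `find_divergence`, and the last-writer-wins replica are hypothetical, and real checkers also interleave failures and timers.

```python
from itertools import permutations, product

def run_replica(deliveries):
    """Hypothetical replica: the last delivered message wins."""
    value = None
    for msg in deliveries:
        value = msg
    return value

def find_divergence(messages, replicas=2):
    """Try every per-replica delivery order; return the first schedule
    (one delivery order per replica) whose final states differ, else None."""
    orders = list(permutations(messages))
    for schedule in product(orders, repeat=replicas):
        finals = [run_replica(order) for order in schedule]
        if len(set(finals)) > 1:
            return schedule
    return None

diverging = find_divergence(("x=1", "x=2"))
single = find_divergence(("x=1",))
```

Here `diverging` is `(("x=1", "x=2"), ("x=2", "x=1"))`: the two replicas receive the same messages in opposite orders and end with different values, while a single message (`single`) cannot diverge and yields `None`.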

We implemented MoDist on Windows and applied it to three well-tested distributed systems: Berkeley DB, a widely used open source database; MPS, a deployed Paxos implementation; and PacificA, a primary-backup replication protocol implementation. MoDist found 35 bugs in total. Most importantly, it found protocol-level bugs (i.e., flaws in the core distributed protocols) in every system checked: 10 in total, including 2 in Berkeley DB, 2 in MPS, and 6 in PacificA.

Storage systems such as file systems, databases, and RAID systems have a simple, basic contract: you give them data, they do not lose or corrupt it. Often they store the only copy, making its irrevocable loss almost arbitrarily bad. Unfortunately, their code is exceptionally hard to get right, since it must correctly recover from any crash at any program point, no matter how their state was smeared across volatile and persistent memory.
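
To make the phrase "smeared across volatile and persistent memory" concrete, here is a small sketch that enumerates the disk images a crash could leave behind when some writes are still buffered in memory. It is a simplification under stated assumptions, not a description of eXplode's mechanism: the names `crash_images` and `consistent` are hypothetical, and it assumes any subset of the buffered writes may have reached the disk.

```python
from itertools import combinations

def crash_images(disk, pending):
    """Yield every disk image a crash could leave behind.

    disk:    dict mapping block name -> contents already on stable storage.
    pending: list of (block, contents) writes still buffered in memory.
    Assumption: any subset of the pending writes may have been persisted,
    applied in their original order.
    """
    for k in range(len(pending) + 1):
        for chosen in combinations(range(len(pending)), k):
            image = dict(disk)
            for i in chosen:
                block, contents = pending[i]
                image[block] = contents
            yield image

def consistent(image):
    """Toy recovery check: a directory entry must point at a written inode."""
    target = image["dir"]
    return target is None or image[target] is not None

# Hypothetical file creation: one directory-entry write and one inode write.
disk = {"dir": None, "inode7": None}
pending = [("dir", "inode7"), ("inode7", "file data")]

images = list(crash_images(disk, pending))
bad = [img for img in images if not consistent(img)]
```

With two buffered writes there are four possible images, and exactly one of them is inconsistent: the directory entry reached the disk but the inode it names did not. With n buffered writes this model yields 2^n images, which is why hand-written crash tests quickly become impractical.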

This paper describes eXplode, a system that makes it easy to systematically check real storage systems for errors. It takes user-written, potentially system-specific checkers and uses them to drive a storage system into tricky corner cases, including crash recovery errors. eXplode uses a novel adaptation of ideas from model checking, a comprehensive, heavyweight formal verification technique, that makes its checking more systematic (and hopefully more effective) than a pure testing approach while being just as lightweight.

eXplode is effective. It found serious bugs in a broad range of real storage systems (without requiring source code): three version control systems, Berkeley DB, an NFS implementation, ten file systems, a RAID system, and the popular VMware GSX virtual machine. We found bugs in every system we checked, 36 bugs in total, typically with little effort.

This paper shows how to use model checking to find serious errors in file systems. Model checking is a formal verification technique tuned for finding corner-case errors by comprehensively exploring the state spaces defined by a system. File systems have two dynamics that make them attractive for such an approach. First, their errors are some of the most serious, since they can destroy persistent data and lead to unrecoverable corruption. Second, traditional testing needs an impractical, exponential number of test cases to check that the system will recover if it crashes at any point during execution. Model checking employs a variety of state-reducing techniques that allow it to explore such vast state spaces efficiently.

We built a system, FiSC, for model checking file systems. We applied it to four widely-used, heavily-tested file systems: ext3, JFS, ReiserFS and XFS. We found serious bugs in all of them, 33 in total. Most have led to patches within a day of diagnosis. For each file system, FiSC found demonstrable events leading to the unrecoverable destruction of metadata and entire directories, including the file system root directory “/”.

File systems, RAID systems, and applications that care about data consistency, among others, assure data integrity by carefully forcing valuable data to stable storage. Unfortunately, verifying that a system recovers from a crash to a valid state at any program counter is very difficult. Previous techniques for finding data integrity bugs have been heavyweight, requiring extensive effort for each OS and file system to be checked. We demonstrate a lightweight, flexible, easy-to-apply technique by developing a tool called eXplode and show how we used it to find 25 serious bugs in eight Linux file systems, Linux software RAID 5, Linux NFS, and three version control systems.

This paper shows how to use model checking to find serious errors in file systems. Model checking is a formal verification technique tuned for finding corner-case errors by comprehensively exploring the state spaces defined by a system. File systems have two dynamics that make them attractive for such an approach. First, their errors are some of the most serious, since they can destroy persistent data and lead to unrecoverable corruption. Second, traditional testing needs an impractical, exponential number of test cases to check that the system will recover if it crashes at any point during execution. Model checking employs a variety of state-reducing techniques that allow it to explore such vast state spaces efficiently.

We built a system, FiSC, for model checking file systems. We applied it to three widely-used, heavily-tested file systems: ext3, JFS, and ReiserFS. We found serious bugs in all of them, 32 in total. Most have led to patches within a day of diagnosis. For each file system, FiSC found demonstrable events leading to the unrecoverable destruction of metadata and entire directories, including the file system root directory “/”.

Download eXplode

There are three ways to get eXplode: download the eXplode source code locally, download a virtual machine image with eXplode and eXplode-patched kernels already compiled, or get the source code from SourceForge. You can also browse the version history of eXplode at SourceForge.

For instructions on how to compile, build, and use eXplode, please refer to the README file in the top-level directory after you uncompress the eXplode tarball.

The eXplode distribution contains a generic model checker that can be applied to other systems. For details, please refer to the README file under directory mcl.

Real systems are difficult to get right because they must correctly handle a practically infinite number of rare events. For example, file systems must correctly recover from all possible crashes; distributed systems must ensure consistency and liveness despite a large variety of rare events, such as machine crashes, network partitions, message delays, and message loss. The complexity of handling these rare events often leads to corner-case errors that are difficult to test for and, once detected in the field, impossible to reproduce.