Crash-Only Software

Abstract

Crash-only programs crash safely and recover quickly. There is
only one way to stop such software -- by crashing it -- and only
one way to bring it up -- by initiating recovery. Crash-only
systems are built from crash-only components, and the use of
transparent component-level retries hides intra-system component
crashes from end users. In this paper we advocate a crash-only
design for Internet systems, showing that it can lead to more
reliable, predictable code and faster, more effective recovery. We
present ideas on how to build such crash-only Internet services,
taking successful techniques to their logical extreme.

1. Occam's Razor and the Restart Potpourri

There are many reasons to restart software, and many ways to do it.
Studies have shown that a main source of downtime in large scale
software systems is caused by intermittent or transient bugs [12,20,19,1]. Most non-embedded systems have a
variety of ways to stop; for example, an operating system can shut
down cleanly, panic, hang, crash, lose power, etc.

When shutting down programs cleanly, unavailability consists of the
time to shut down and the time to come back up; when crash-rebooting,
unavailability consists only of the time to recover. Ironically,
shutting down and reinitializing can sometimes take longer than
recovering from a crash. Table 1
illustrates a casual comparison of reboot times; no important data was
lost in either of the experiments.

In earlier work, we used recursive micro-reboots to improve the
availability of a soft-state system that was trivially
crash-safe [3]. In this paper we
advocate a crash-only design (i.e., crash safety + fast
recovery) for Internet systems, a class distinguished by the following
properties: large scale, stringent high availability requirements,
built from many heterogenous components, accessed over standard
request-reply protocols such as HTTP, serving workloads that consist
of large numbers of relatively short tasks that frame state updates,
and subjected to rapid and perpetual evolution. We restrict our
attention to single installations that reside inside one data center
and do not span administrative domains.

In high level terms, a crash-only system is defined by the equations
stop=crash and start=recover. In the rest of the paper,
we describe the benefits of the crash-only design approach by analogy
to physics, describe the internal properties of components in a
crash-only system, the architectural properties governing the
interaction of components, and a restart/retry architecture that
exploits crash-only design, including our work to date on a prototype
using J2EE.

2. Why Crash-Only Design ?

Mature engineering disciplines rely on macroscopic descriptive
physical laws to build and understand the behavior of physical
systems. These sets of laws, such as Newtonian mechanics, capture in
simple form an observed physical invariant. Software, however, is an
abstraction with no physical embodiment, so it obeys no physical laws.
Computer scientists have tried to use prescriptive rules, such
as formal models and invariant proofs, to reason about software.
These rules, however, are often formulated relative to an abstract
model of the software that does not completely describe the behavior
of the running system (which includes hardware, an operating system,
runtime libraries, etc.). As a result, the prescriptive models do not
provide a complete description of how the implementation behaves in
practice, because many physically possible states of the complete
system do not correspond to any state in the abstract model.

With the crash-only property, we are trying to impose, from outside
the software system, macroscopic behavior that coerces the system into
a simpler, more predictable universe with fewer states and simpler
invariants. Each crash-only component has a single idempotent
``power-off switch'' and a single idempotent ``power-on switch''; the
switches for larger systems are built by wiring together their
subsystems' switches in ways described by section 3. A component's power-off switch
implementation is entirely external to the component, thus not
invoking any of the component's code and not relying on correct
internal behavior of the component. Examples of such switches include
kill -9 sent to a UNIX process, or turning off the
physical, or virtual, machine that is running some software inside it.

Keeping the power-off switch mechanism external to components makes
it a high confidence ``component crasher.'' Consequently, every
component in the system must be prepared to suddenly be deactivated.
Power-off and power-on switches provide a very small repertoire of
high-confidence, simple behaviors, leading to a small state space. Of
course, the ``virtual shutdown'' of a virtual machine, even if invoked
with kill -9, has a much larger state space than the
physical power switch on the workstation, but it is still vastly
simpler than the state space of a typical program hosted in the VM,
and it does not vary for different hosted programs. Indeed, the fact
that virtual machines are relatively small and simple compared to the
programs they host has been successfully invoked as an argument for
using VMs for inter-application isolation [26].

Recovery code deals with exceptional situations, and must run
flawlessly. Unfortunately, exceptional situations are difficult to
handle, occur seldom, and are not trivial to simulate during
development; this often leads to unreliable recovery code. In
crash-only systems, however, recovery code is exercised every time the
system starts up, which should ultimately improve its reliability.
This is particularly relevant given that the rate at which we reduce
the number of bugs per Klines of code lags behind the rate at which
the number of Klines per system increases, with the net result being
that the number of bugs in an evolving system increases with
time [7]. More bugs mean more
failures, and systems that fail more often will need to recover more
often.

Many of the benefits resulting from a crash-only design have been
previously obtained in the data storage/retrieval world with the
introduction of transactions. Our approach aims for a similar effect
on the failure properties of Internet systems---crash-only design is
in many ways a generalization of the transaction model. It is
important to note that Internet applications do not have to use
transactions in order to be crash-only; in fact, ACID semantics can
sometimes lead to overkill. For example, session data accumulates
information at the server over a series of user service requests, for
use in subsequent operations. It is mostly
single-reader/single-writer, thus not requiring ordering and
concurrency control. The richness of a query language like SQL is
unnecessary, and session state usually does not persist beyond a few
minutes. These observations are leveraged by SSM [18], a crash-only hashtable-like session state
store.

A crash-only system makes it affordable to transform every detected
failure into component-level crashes; this leads to a simple fault
model, and components only need to know how to recover from one type
of failure. For example, [21] forced
all unknown faults into node crashes, allowing the authors to improve
the availability of a clusterized web server. Existing literature
often assumes unrealistic fault models (e.g., that failures occur
according to well-behaved tractable distributions); a crash-only
design enables aggressive enforcement of such desirable fault models,
thus increasing the impact of prior work. If we state invariants
about the system's failure behavior and make such behavior
predictable, we are effectively coercing reality into a small universe
governed by well-understood laws.

Moreover, a system in which crash-recovery is cheap allows us to
micro-reboot suspect components before they fail. By aggressively
employing software rejuvenation [16] --
rebooting in order to stave off failure induced by resource exhaustion
-- we can prevent failures altogether. The system can immediately
trigger component-level rejuvenation whenever it notices fail-stutter
behavior [2], a trough in offered
workload, or based on predictive mathematical models of software
aging [10].

Finally, if we admit that most failures can be recovered by
micro-rebooting, crashing every suspicious component could shorten the
fault detection and diagnosis time---a period that sometimes lasts
longer than repair itself.

In section4 we will show how
crash-only components can be glued together into a robust Internet
system based on a restart/retry architecture; in the rest of this
section we describe in more detail the properties of crash-only
systems. Some relate to intra-component state management, while
others relate to inter-component interactions. We recognize that some
of these sacrifice performance, but we strongly believe the time has
come for robustness to reclaim its status as a first-class citizen.

3.1 Intra-Component Properties

In today's Internet applications there are a small number of types of
state: transactional persistent state, single-reader/single-writer
persistent state (e.g., user profiles, that almost never see
concurrent access), expendable persistent state (server-side
information that could be sacrificed for the sake of correctness or
performance, such as clickstream data and access logs), session state
(e.g., the result set of a previous search, subject to refinement),
soft state (state that can be reconstructed at any time based on other
data sources), and volatile state. While differentiated mostly by
guaranteed lifetime, the requirements for these categories of state
lead to qualitatively different implementations.

All important non-volatile state is managed by dedicated
state stores, leaving applications with just program
logic. Specialized state stores (e.g., relational and object-oriented
databases, file system appliances, distributed data
structures [14], non-transactional
hashtables [15], session state
stores [18], etc.) are much better
suited to manage state than code written by developers with minimal
training in systems programming. Applications become stateless
clients of the state stores, which allows them to have simpler and
faster recovery routines. A popular example of such separation can be
found in three-tier Internet architectures, where the middle tier is
largely stateless and relies on backend databases to store data.

These state stores must also be crash-only, otherwise the problem
has just moved down one level. Many commercial off-the-shelf state
stores available today are crash-safe (i.e., they can be crashed
without loss of data), such as databases and the various
network-attached storage devices, but most are not crash-only, because
they recover slowly. Many products, however, offer tuning knobs that
permit the administrator to trade performance for improved recovery
time, such as making checkpoints more often in the Oracle
DBMS [17]. An example of a pure
crash-only state store is the Postgres database system [25], which uses non-overwriting storage and
maintains all data in a single append-only log. Although it trades
away some performance, Postgres achieves practically instantaneous
recovery, because it only needs to mark the uncommitted transactions
as aborted.

The abstractions and guarantees provided by state stores must be
congruent with application requirements. This means that the state
abstraction exported by the state store is not too powerful (e.g.,
offering a SQL interface with ACID semantics for storing and
retrieving simple key-value tuples) and not too weak. A state
abstraction that is too weak will require client components to do some
amount of state management, such as implementing a customer record
abstraction over an offered file system interface. Good state
abstractions allow applications to operate at their ``natural''
semantic level. Offering the weakest state guarantees that satisfy
the application allows us to exploit application semantics and build
simpler, faster, more reliable state stores.

For example, Berkeley DB [22] is a
storage system supporting B+tree, hash, and record abstractions. It
can be accessed through four different interfaces, ranging from no
concurrency control/no transactions/no disaster recovery to a
multi-user, transactional API with logging, fine-grained locking, and
support for data replication. Applications can use the abstraction
that is right for their purposes and the underlying state store
optimizes its operation to fit those requirements. Workload
characteristics can also be leveraged by state stores; e.g., expecting
a read-mostly workload allows a state store to utilize write-through
caching, which can significantly improve recovery time and
performance.

We do not advocate that every application have its own set of state
stores. Instead, we believe Internet systems will standardize on a
small number of state store types: ACID stores (e.g., databases for
customer and transaction data), non-transactional persistent stores
(e.g., DeStor [15], a crash-only system
specialized in handling non-transactional persistent data, like user
profiles), session state stores (e.g., SSM [18] for shopping carts), simple read-only
stores (e.g., file system appliances for static HTML and images), and
soft state stores (e.g., web caches). If we think carefully about the
state abstractions required by each application component and use
suitable state stores, we can make these components crash-only.

3.2 Inter-Component Properties

Subsystems that crash on their own, or that are explicitly
crash-rebooted for recovery, will temporarily become unavailable to
serve requests. For a crash-only system to gracefully tolerate such
behavior, we need to decouple components from each other, from the
resources they use, and from the requests they process.

Components have externally enforced boundaries
that provide strong fault containment. The desired isolation can be
achieved with virtual machines, isolation kernels [26], task-based intra-JVM isolation [24,8], OS processes, etc. Indeed, the Denali
isolation kernel is designed for such encapsulation. Web hosting
service providers often use multiple virtual machines on one physical
machine to offer their clients individual web servers they can
administer at will, without affecting other customers. The boundaries
between components delineate distinct, individually recoverable stages
in the processing of requests.

All interactions between components have a
timeout. This includes explicit communication as well as
RPC: if no response is received to a call within the allotted
timeframe, the caller assumes the callee has failed and reports it to
a recovery agent [5], which can
crash-restart the callee if appropriate. Crash-restarting helps
ensure the called component is in a known state; this is safe because
the component is crash-safe and crash-restart is idempotent. Timeouts
provide an orthogonal mechanism for turning all non-Byzantine
failures, both at the component level and at the network level, into
fail-stop events (i.e., the failed entity either provides results or
is stopped), even though the components are not necessarily fail-stop.
Such behavior is easier to accomodate, and containment of faults is
improved.

Crash-only components recover quickly, thus recovery is very cheap.
Under such circumstances, it becomes acceptable for the recovery
manager to crash-recover suspect components even when it lacks the
certainty that those components have indeed failed; the downtime risk
of letting the components run may be higher than crash-rebooting
healthy components.

All resources are leased, rather than
permanently allocated, to ensure that resources are not coupled to the
components using them. Resources include many types of persistent
state, such as account profiles for a free e-mail provider: every time
the user logs in, a 6-month lease is renewed; when the lease expires,
all associated data can be purged from the system. It also includes
CPU resources: if a computation is unable to renew its execution
lease, it is terminated by a high confidence watchdog [9]. For example, in PHP, a server-side
scripting language used for writing dynamic web pages, runaway scripts
are killed and an error is returned to the web browser.
Leases [11] give us the ability to
reason about the conditions that hold true of the system's resources
after a lease expires. Infinite timeouts or leases are not
acceptable; the maximum-allowed timeout and lease are specified in an
application-global policy. This way it is less likely that the system
will hang or become blocked.

Requests are entirely self-describing, by
making the state and context needed for their processing explicit.
This allows a fresh instance of a rebooted component to pick up a
request and continue from where the previous instance left off.
Requests also carry information on whether they are idempotent, along
with a time-to-live; both idempotency and TTL information can
initially be set at the system boundary, such as in the web tier. For
example, the TTL may be determined by load or service level
agreements, and idempotency flags can be based on application-specific
information (which can be derived, for instance, from URL substrings
that determine the type of request). Many interesting operations in
an Internet service are idempotent, or can easily be made idempotent
by keeping track of sequence numbers or by wrapping requests in
transactions; some large Internet services have already found it
practical to do so [23]. Over the
course of its lifetime, a request will split into multiple
sub-operations, which may rejoin, in much the same way nested
transactions do. Recovering from a failed idempotent sub-operation
entails simply reissuing it; for non-idempotent operations, the system
can either roll them back, apply compensating operations, or tolerate
the inconsistency resulting from a retry. Such transparent recovery
of the request stream can hide intra-system component failures from
the end user.

In Figure 1 we show a simple
restart/retry example, in which a request to view a shopping cart
splits inside the application server into one subrequest to a stateful
session EJB (Enterprise JavaBean) that communicates with a session
state store and another subrequest to a stateless session EJB. Should
the state store become unavailable, the application server either
receives a
RetryAfter exception or times out, at which time it can decide
whether to resubmit the request to SSM or not. Within each of the
subsystems shown in Figure 1, we can
imagine each subrequest further splitting into finer grain subrequests
submitted to the respective subsystems' components. We have
implemented crash-restarting of EJBs in a J2EE application server; an
EJB-level micro-reboot takes less than a second [5].

Timeout-based failure detection is supplemented with traditional
heartbeats and progress counters. The counters---compact
representations of a component's processing progress---are usually
placed at state stores and in messaging facilities, where they can map
state access and messaging activity into per-component progress. Many
existing performance monitors can be transformed into progress
monitors by augmenting them with request origin information.
Components themselves can also implement progress counters that more
accurately reflect application semantics, but they are less
trustworthy, because they are inside the components.

The dynamics of loosely coupled systems can sometimes be
surprising. For example, resubmitting requests to a component that is
recovering can overload it and make it fail again; for this reason,
the RetryAfter exceptions provide an estimated time-to-recover.
This estimated value can be used to spread out request resubmissions,
by varying the reported time-to-recover estimate across different
requestors. A maximum limit on the number of retries is specified in
the application-global policy, along with the lease durations and
communication timeouts. These numbers can be dynamically estimated
based on historical information collected by a recovery
manager [5], or simply captured in a
static description of each component, similar to deployment
descriptors for EJBs. In the absence of such hints, a simple load
balancing algorithm or exponential backoff can be used.

To prevent reboot cycles and other unstable conditions during
recovery, it is possible to quiesce the system when a set of
components is being crash-rebooted. This can be done at the
communication/RPC layer, or for the system as a whole. In our
prototype, we use a stall proxy [5] in
front of the web tier to keep new requests from entering the system
during the recovery process. Since Internet workloads are typically
made of short running requests, the stall proxy transforms brief
system unavailability into temporarily higher latency for clients. We
are exploring modifications to the Java RMI layer that would allow
finer grain request stalling.

5. Discussion

Building crash-only systems is not easy; the key to widespread
adoption of our approach will require employing the right
architectural models and having the right tools. With the recent
success of component-based architectures (e.g., J2EE and .Net), and
the emergence of the application server as an operating system for
Internet applications, it is possible to provide many of the
crash-only properties in the platform itself. This would allow all
applications running on that platform to take advantage of the effort
and become crash-only.

We are applying the principles described here to an open-source Java 2
Enterprise Edition (J2EE) application server. We are separating the
individual J2EE services (naming, directory lookup, messaging, etc.)
into well-isolated components, implementing requests as
self-describing continuations, modifying the RMI layer to allow for
timeout-based operation, modifying the EJB containers to implement
lease-based resource allocation, and integrating non-transactional
state stores like DeStor and SSM. A first step in this direction is
described in [5].

We are focusing initially on applications whose workloads can be
characterized as relatively short-running tasks that frame state
updates. Substantially all Internet services fit this description, in
part because the nature of HTTP has forced designers into this mold.
As enterprise services and applications (e.g., workflow, customer
management) become web-enabled, they adopt similar architectures. We
expect there are many applications outside this domain that could not
easily be cast this way, and for which deriving a crash-only design
would be impractical or infeasible.

In order for the restart/retry architecture to be highly available
and correct, most requests it serves must be idempotent. This
requirement might be inappropriate for some applications. Our
proposal does not handle Byzantine failures or data errors, but such
behavior can be turned into fail-stop behavior using well-known
orthogonal mechanisms, such as triple modular redundancy [13] or clever state replication [6].

In today's Internet systems, fast recovery is obtained by
overprovisioning and counting on rapid failure detection to trigger
failover. Such failover can sometimes successfully mask hours-long
recovery times, but often detecting failures end-to-end takes longer
than expected. Crash-only software is complementary to this approach
and can help alleviate some of the complex and expensive management
requirements for highly redundant hardware, because faster recovering
software means less redundancy is required. In addition, a crash-only
system can reintegrate recovered components faster, as well as better
accommodate removed, added, or upgraded components.

We expect throughput to suffer in crash-only systems, but this concern
is secondary to the high availability and predictability we expect in
exchange. The first program written in a high-level language was
certainly slower than its hand-coded assembly counterpart, yet it set
the stage for software of a scale, functionality and robustness that
had previously been unthinkable. These benefits drove compiler
writers to significantly optimize the performance of programs written
in high-level languages, making it hard to imagine today how we could
program otherwise. We expect the benefits of crash-only software to
similarly drive efforts that will erase, over time, the potential
performance loss of such designs.

6. Conclusion

By using a crash-only approach to building software, we expect to
obtain better reliability and higher availability in Internet systems.
Application fault models can be simplified through the application of
externally-enforced ``crash-only laws,'' thus encouraging simpler
recovery routines which have higher chances of being correct. Writing
crash-only components may be harder, but their simple failure behavior
can make the assembly of such components into large systems easier.

The promise of a simple fault model makes stating invariants on
failure behavior possible. A system whose component-level and
system-level invariants can be enforced through crash-rebooting is
more predictable, making recovery management more robust. It is our
belief that applications and services with high availability
requirements can significantly benefit from these properties.

Once we surround a crash-only system with a suitable recovery
infrastructure, we obtain a recursively restartable system [4]. Transparent recovery based on
component-level micro-reboots enables restart/retry architectures to
hide intra-system failure from the end users, thus improving the
perceived reliability of the service. We find it encouraging that our
initial prototype [5] was able to
complete 78% more client requests under faultload than a
non-crash-only version of the system that did not employ micro-reboots
for recovery.