Professor

Department of Computer Science

Current Projects

In my research projects we don't just build prototype systems
and write papers; we build production-quality software systems
that are used by other people for their daily work. In most
cases we release the software in open-source form. This approach
is unusual in academia, but it allows us to do much better
research: designing for production use forces us to think about
important issues that could be ignored otherwise, and measurements
of usage allow us to evaluate our ideas more thoroughly.
Furthermore, this approach allows students to develop
a higher level of system-building maturity than would be
possible otherwise.

RAMCloud (2009- )

The RAMCloud project is creating a new class of storage, based
entirely in DRAM,
that is 2-3 orders of magnitude faster than existing storage systems.
If successful, it will enable new applications that manipulate
large-scale datasets much more intensively than has ever been possible
before. In
addition, we think RAMCloud, or something like it, will become
the primary storage system for cloud computing environments such as
Amazon's AWS and Microsoft's Azure.

The role of DRAM in storage systems has been increasing rapidly
in recent years, driven by the needs of large-scale Web applications.
These applications manipulate very large datasets with an intensity
that cannot be satisfied by disks alone. As a result, applications
are keeping more and more of their data in DRAM. For example,
large-scale caching systems such as memcached are being widely
used (in 2009 Facebook used a total of 150 TB of DRAM in memcached
and other caches for a database containing 200 TB of disk storage), and
the major Web search engines now keep their search indexes
entirely in DRAM.

Although DRAM's role is increasing, it still tends to be used in
limited or specialized ways. In most cases DRAM is just a cache
for some other storage system such as a database; in other cases
(such as search indexes) DRAM is managed in an application-specific
fashion. It is difficult for developers to use DRAM
effectively in their applications; for example, the application
must manage consistency between caches and the backing storage.
In addition, cache misses and backing store overheads make it
difficult to capture DRAM's full performance potential.
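
To make that burden concrete, the sketch below shows the kind of cache-aside code that falls on the application when DRAM is used only as a cache. The Cache and Database types here are hypothetical in-process stand-ins (not memcached's or any real database's API); the point is that the application itself must keep the two copies of the data consistent.

    // Minimal sketch of the cache-aside pattern that applications implement by
    // hand when DRAM is used only as a cache (hypothetical in-process stand-ins
    // for the cache and the backing database, not any real product's API).
    #include <iostream>
    #include <optional>
    #include <string>
    #include <unordered_map>

    struct Cache {                        // stand-in for a memcached-like DRAM cache
        std::unordered_map<std::string, std::string> data;
        std::optional<std::string> get(const std::string& key) {
            auto it = data.find(key);
            if (it == data.end()) return std::nullopt;
            return it->second;
        }
        void set(const std::string& key, const std::string& value) { data[key] = value; }
        void remove(const std::string& key) { data.erase(key); }
    };

    struct Database {                     // stand-in for the durable backing store
        std::unordered_map<std::string, std::string> data;
        std::string read(const std::string& key) { return data[key]; }
        void write(const std::string& key, const std::string& value) { data[key] = value; }
    };

    // Read path: check the cache first; on a miss, pay the backing-store cost
    // and then populate the cache.
    std::string lookup(Cache& cache, Database& db, const std::string& key) {
        if (auto hit = cache.get(key)) return *hit;
        std::string value = db.read(key);
        cache.set(key, value);
        return value;
    }

    // Write path: the application must update the backing store *and* invalidate
    // the cache; forgetting either step leaves readers with stale data.
    void update(Cache& cache, Database& db,
                const std::string& key, const std::string& value) {
        db.write(key, value);
        cache.remove(key);
    }

    int main() {
        Cache cache;
        Database db;
        update(cache, db, "user:42", "Alice");
        std::cout << lookup(cache, db, "user:42") << "\n";   // miss, then cached
        std::cout << lookup(cache, db, "user:42") << "\n";   // hit
    }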

Our goal for RAMCloud is to create a general-purpose storage system
that makes it easy for developers to harness the full performance
potential of large-scale DRAM storage. It keeps all data in DRAM
all the time, so there are no cache misses. RAMCloud
storage is durable and available, so developers need not manage a
separate backing store. RAMCloud is designed to scale
to thousands of servers and hundreds of terabytes of data
while providing uniform low-latency access from any machine within
a large datacenter.
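
As a rough illustration of the resulting programming model, a RAMCloud table is accessed much like a remote key-value store. The client interface below is a hypothetical sketch for illustration only (the real RAMCloud C++ client API differs in names and details), and it is backed by an in-process map purely so the sketch runs; in RAMCloud each call would be a remote operation taking a few microseconds. Unlike the cache-aside sketch above, there is only one copy of the data for the application to manage.

    // Hypothetical sketch of a RAMCloud-style key-value client (illustrative
    // only; the real RAMCloud C++ client API differs in names and details).
    // The application reads and writes durable, DRAM-resident objects directly,
    // with no separate cache or backing store to keep consistent. The "cluster"
    // is mocked with an in-process map so the sketch compiles and runs.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <unordered_map>

    class KeyValueCluster {
    public:
        explicit KeyValueCluster(const std::string& coordinatorLocator) {}  // hypothetical locator

        uint64_t createTable(const std::string& name) {
            auto [it, inserted] = tableIds_.emplace(name, nextTableId_);
            if (inserted) ++nextTableId_;
            return it->second;
        }

        void write(uint64_t tableId, const std::string& key, const std::string& value) {
            tables_[tableId][key] = value;       // in RAMCloud: replicated and durable
        }

        std::string read(uint64_t tableId, const std::string& key) {
            return tables_[tableId][key];        // in RAMCloud: always served from DRAM
        }

    private:
        uint64_t nextTableId_ = 1;
        std::unordered_map<std::string, uint64_t> tableIds_;
        std::unordered_map<uint64_t, std::map<std::string, std::string>> tables_;
    };

    int main() {
        KeyValueCluster cluster("coordinator-host:12246");     // hypothetical address
        uint64_t users = cluster.createTable("users");
        cluster.write(users, "user:42", "Alice");
        std::cout << cluster.read(users, "user:42") << "\n";   // prints "Alice"
    }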

In January 2014 we officially tagged RAMCloud version 1.0, which
means the system has reached a point of maturity where it can
support real applications (in particular the crash recovery
mechanisms now work reliably both for crashes of storage masters
and for crashes of the cluster coordinator). On our 80-node test
cluster we are able to perform
remote reads of 100-byte objects in about 5 microseconds and writes
in about 16 microseconds. An individual server can process about
700,000 small read requests per second.

The RAMCloud project is still young, so there are many interesting
research issues left to explore, such as the following:

Data model: RAMCloud currently supports a very simple data model
(a key-value store); we would like to see whether we can provide higher-level
features such as secondary indexes and multi-object transactions
without sacrificing the scalability or performance of the system.

Consistency: we believe that RAMCloud can provide strong consistency
(serializability) without sacrificing performance, but there are several
interesting problems to solve in order to achieve that.

Cluster management: what are the right mechanisms and policies for
reorganizing RAMCloud data in response to changes in the amount of data
and the access patterns?

Network protocols: we don't think that TCP is the right protocol for
achieving the highest performance within a datacenter, so an interesting
research project is to investigate what the ideal protocol would be.

Multi-tenancy: how can we support multiple independent, perhaps mutually
hostile, applications sharing the same RAMCloud storage system within a large
datacenter? This introduces issues of access control and, potentially,
of performance isolation.

Multiple datacenters: our current design for RAMCloud focuses on a
single datacenter, but some applications will require redundancy across
datacenters in order to protect against datacenter failures. An interesting
question is whether we can provide that level of redundancy without
dramatically impacting the performance of the system.

Related links:

Log-Structured
Memory: describes how RAMCloud manages in-memory storage using
an approach similar to that of log-structured filesystems. This
allows RAMCloud to use memory at 80-90% utilization without major
performance degradation. This paper appeared in the USENIX FAST
Conference in February 2014.

Fast Recovery in
RAMCloud: describes RAMCloud's mechanism for recovering crashed
servers in 1-2 seconds. This paper appeared in the ACM Symposium on
Operating Systems Principles in October 2011.

The Case for RAMCloud:
a position paper that discusses the motivation for RAMCloud,
the new kinds of applications it may enable,
and some of the research issues that will have to be addressed
to create a working system. This paper appeared in
Communications of the ACM in July 2011.

The RAMCloud Wiki: used by project members to share design
documents, miscellaneous notes, and links to related materials.

Previous Projects

The following sections describe some projects on which
I have worked in the past.
I am no longer involved with these projects, though
several of them are still active as open-source projects
or commercial products. The years listed for each project represent
the period when I was involved.

Fiz (2008-2011)

The Fiz project explored new frameworks for highly interactive Web
applications. The goal of the project was to develop higher-level
reusable components that simplify application development by hiding
inside the components many of the complexities that bedevil Web
developers (such as security issues or using Ajax to enhance
interactivity).

Related links:

Integrating Long Polling with an
MVC Web Framework. This paper appeared in the 2011 USENIX
Conference on Web Application Development; it shows how the
architecture of an MVC Web development framework can be extended
to make it easy for Web applications to push updates automatically
from servers out to browsers.

Managing State for
Ajax-Driven Web Components. This paper appeared in
the 2010 USENIX Conference on Web Application Development;
it discusses the state-management issues that arise when
trying to use a component-based approach for Web applications
using Ajax, and evaluates two alternative solutions to the problem.

ElectricCommander (2005-2007, Electric Cloud)

ElectricCommander is the second major product for Electric Cloud.
It addresses the problem of managing software development
processes such as nightly builds and automated test suites. Most
organizations have home-grown software for these tasks, which
is hard to manage and scales poorly as the organization grows.
ElectricCommander provides a general-purpose Web-based platform
for managing these processes. Developers use a Web interface
to describe each job as a collection of shell commands that
run serially or in parallel on one or more machines.
The ElectricCommander server manages the execution of these
commands according to schedules defined by the developers.
It also provides a variety of reporting tools and manages a
collection of server machines to allow concurrent execution of
multiple jobs.

ElectricAccelerator (2002-2005, Electric Cloud)

ElectricAccelerator is Electric Cloud's first product; it
accelerates software builds based on the make
program by running steps concurrently on a cluster of
server machines. Previous attempts at concurrent builds
have had limited success because most Makefiles
do not have adequate dependency information. As a result,
the build tool cannot safely identify steps that can execute
concurrently, and attempts to use large-scale
parallelism produce flaky or broken builds. ElectricAccelerator
solves this problem using kernel-level file system drivers to
record file accesses during the build. From this information it
can construct a perfect view of dependencies, even if the Makefiles
are incorrect. As a result, ElectricAccelerator can safely utilize
dozens of machines in a single build, producing speedups of 10-20x.

Tcl/Tk (1988-2000, U.C. Berkeley, Sun, Scriptics)

Tcl is an embeddable scripting language: it is a simple
interpreted language implemented
as a library package that can be incorporated into a variety of
applications. Furthermore, Tcl is extensible, so additional functions
(such as those provided by the enclosing application) can easily be
added to those already provided by the Tcl interpreter. Tk is a GUI
toolkit that is implemented as a Tcl extension. I initially built
Tcl and Tk as hobby projects and didn't think that anyone besides me would
care about them. However, they were widely adopted because they solved
two problems.
First, the combination of Tcl and Tk made it much easier to create
graphical user interfaces than previous frameworks such as Motif
(in addition, Tcl and Tk were eventually ported from Unix to the Macintosh
and Windows, making Tcl/Tk applications highly portable). Second,
Tcl made it easy to include powerful command languages in a variety
of applications ranging from design tools to embedded systems.
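
To illustrate what "embeddable" and "extensible" mean in practice, here is a small C++ program that uses Tcl's C library API to create an interpreter, register an application-defined command, and evaluate a script. It is a minimal sketch (written against the Tcl 8.x API, with error handling abbreviated); a real application would register its own command set and link against the Tcl library, e.g. with something like g++ embed.cpp -ltcl.

    // Minimal sketch of embedding Tcl in a host application and extending it
    // with an application-defined command. Error handling is abbreviated.
    #include <tcl.h>
    #include <cstdio>

    // An application-supplied command: "add x y" returns x + y.
    static int AddCmd(ClientData, Tcl_Interp *interp, int objc, Tcl_Obj *const objv[]) {
        if (objc != 3) {
            Tcl_WrongNumArgs(interp, 1, objv, "x y");
            return TCL_ERROR;
        }
        int x, y;
        if (Tcl_GetIntFromObj(interp, objv[1], &x) != TCL_OK ||
            Tcl_GetIntFromObj(interp, objv[2], &y) != TCL_OK) {
            return TCL_ERROR;
        }
        Tcl_SetObjResult(interp, Tcl_NewIntObj(x + y));
        return TCL_OK;
    }

    int main() {
        Tcl_Interp *interp = Tcl_CreateInterp();          // the embedded interpreter
        Tcl_CreateObjCommand(interp, "add", AddCmd, nullptr, nullptr);

        // Scripts can now mix built-in Tcl commands with the application's own.
        if (Tcl_Eval(interp, "set sum [add 2 3]") == TCL_OK) {
            std::printf("sum = %s\n", Tcl_GetStringResult(interp));
        }
        Tcl_DeleteInterp(interp);
        return 0;
    }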

Log-Structured File Systems (1988-1994, U.C. Berkeley)

A log-structured file system (LFS) is one where all information is
written to disk sequentially in a log-like structure, thereby
speeding up both file writing and crash recovery. The log is the
only structure on disk; it contains index information so files can
be read back from the log efficiently. LFS was initially motivated
by the RAID project, because random writes are expensive in RAID
disk arrays. However, the LFS approach has also found use in
other settings, such as flash memory, where wear-leveling is an
important problem. Mendel Rosenblum created the first LFS implementation
as part of the Sprite project; Ken Shirriff added RAID support to
LFS in the Sawmill project, and John Hartman extended the LFS
ideas into the world of cluster file systems with Zebra.
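
As a highly simplified illustration of the log-structured idea (not Sprite LFS itself, which adds segments, a cleaner, checkpoints, and crash recovery), the sketch below appends every update sequentially to a single log and keeps an in-memory index from keys to log positions, so reads remain efficient even though nothing is ever overwritten in place.

    // Toy illustration of the log-structured idea: all writes append to a single
    // sequential log, and an in-memory index maps each key to the position of its
    // most recent entry so reads stay fast. A real LFS also reclaims obsolete
    // entries and rebuilds the index during crash recovery.
    #include <iostream>
    #include <string>
    #include <unordered_map>
    #include <vector>

    class ToyLog {
    public:
        // Append a new version of (key, value); older versions become garbage
        // that a real system's cleaner would eventually reclaim.
        void write(const std::string& key, const std::string& value) {
            index_[key] = log_.size();        // remember where the latest version lives
            log_.push_back({key, value});     // strictly sequential "disk" write
        }

        // Read by following the index straight to the latest entry in the log.
        bool read(const std::string& key, std::string* value) const {
            auto it = index_.find(key);
            if (it == index_.end()) return false;
            *value = log_[it->second].value;
            return true;
        }

    private:
        struct Entry { std::string key, value; };
        std::vector<Entry> log_;                              // the append-only log
        std::unordered_map<std::string, size_t> index_;       // key -> newest entry
    };

    int main() {
        ToyLog log;
        log.write("fileA", "version 1");
        log.write("fileA", "version 2");   // supersedes the earlier entry
        std::string v;
        if (log.read("fileA", &v)) std::cout << v << "\n";   // prints "version 2"
    }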

Sprite (1984-1994, U.C. Berkeley)

Sprite was a Unix-like network operating system built at U.C. Berkeley
and used for day-to-day computing by about 100 students and staff for
more than five years. The overall goal of the project was to create a
single system image, meaning that a collection of workstations
would appear to users the same as a single time-shared system, except
with much better performance. Among its more notable features were
support for diskless workstations, large client-level file caches
with guaranteed consistency, and a transparent process migration
mechanism. We built a parallel version of make
called pmake that took advantage of process migration
to run builds across multiple machines, providing 3-5x speedups
(pmake was the inspiration for the ElectricAccelerator
product at Electric Cloud).

VLSI Design Tools (1980-1986, U.C. Berkeley)

When I arrived at Berkeley in 1980 the Mead-Conway VLSI revolution
was in full bloom, enabling university researchers to create large-scale
integrated circuits such as the Berkeley RISC chips. However, there
were few tools for chip designers to use. In this project my students
and I created a series of design tools, starting with a simple layout
editor called Caesar. We quickly replaced Caesar with a more powerful layout editor called Magic. Magic made two contributions: first, it
implemented a number of novel interactive features for developers, such as
incremental design-rule checking, which notified
developers of design rule violations immediately while they edited,
rather than waiting for a slow offline batch run. Second, Magic
incorporated algorithms for operating on hierarchical layout structures,
which were much more efficient than previous approaches that
"flattened" the layout into a non-hierarchical form. Magic took advantage
of a new data structure called corner stitching, which permitted
efficient implementation of a variety of geometric operations. We
also created a switch-level timing analysis program called Crystal.
We made free source-level releases of these tools and many others
in what became known as the Berkeley VLSI Tools Distributions; these
represented one of the earliest open-source software releases.
Magic continued to evolve after the end of the Berkeley project;
as of 2005 it was still widely used.
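
To give a flavor of the corner-stitching representation mentioned above: the layout plane is covered by non-overlapping rectangular tiles (including explicit tiles for empty space), and each tile carries four pointers, two at its top-right corner and two at its bottom-left corner, linking it to adjacent tiles. The record below is an illustrative sketch with hypothetical field names, not Magic's actual data structure.

    // Illustrative tile record for corner stitching (field names are hypothetical;
    // Magic's implementation differs in details). The four corner stitches let
    // algorithms walk to neighboring tiles without any global search, which makes
    // operations such as neighbor finding and area enumeration efficient.
    struct Tile {
        int xlo, ylo;        // bottom-left corner of the tile
        int xhi, yhi;        // top-right corner of the tile
        int layer;           // which mask layer this tile represents (or empty space)

        // Stitches at the top-right corner:
        Tile* up;            // a neighboring tile adjacent along the top edge
        Tile* right;         // a neighboring tile adjacent along the right edge

        // Stitches at the bottom-left corner:
        Tile* down;          // a neighboring tile adjacent along the bottom edge
        Tile* left;          // a neighboring tile adjacent along the left edge
    };

Point location, for instance, proceeds by repeatedly following stitches toward the target coordinate, so its cost depends on the number of tiles crossed rather than on the size of the whole layout.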

Cm* and Medusa (1975-1980, Carnegie Mellon University)

As a graduate student at Carnegie Mellon University I worked on the
Cm* project, which created the first large-scale NUMA
multiprocessor. Cm* consisted of 50 LSI-11 processors connected
by a network of special-purpose processors called Kmaps that
provided shared virtual memory. I worked initially on the design
and implementation of the Kmap hardware, then shifted to the software
side and led the Medusa project, which was one of two operating systems
created for Cm*. The goal of the Medusa project was to understand how to
structure an operating system to run on the NUMA hardware; among
other things, Medusa was the first system to implement coscheduling
(now called "gang scheduling").