About Me

My name is Marc Brooker. I've been writing code, reading code, and living vicariously through computers for as long as I can remember. I like to build things that work. I also dabble in brewing, cooking and skiing.

I'm currently an engineer at Amazon Web Services (AWS) in Seattle, where I lead engineering on AWS Lambda and our others serverless products. Before that, I worked on EC2 and EBS.
All opinions are my own.

Other Media

Some risks of coordinating only sometimes

Sometimes-coordinating systems have dangerous emergent behaviors

A classic cloud architecture is built of small clusters of nodes (typically one to nine1), with coordination used inside
each cluster to provide availability, durability and integrity in the face of node failures. Coordination between
clusters is avoided, making it easier to scale the system while meeting tight availability and latency requirements. In
reality, however, systems sometimes do need to coordinate between clusters, or clusters need to coordinate with a
central controller. Some of these circumstances are operational, such as around adding or removing capacity. Others are
triggered by the application, where the need to present a client API which appears consistent requires either the system itself, or a layer above it, to coordinate across otherwise-uncoordinated clusters.

The costs and risks of re-introducing coordination to handle API requests or provide strong client guarantees are well
explored in the literature. Unfortunately, other aspects of sometimes-coordinated systems do not get as much attention,
and many designs are not robust in cases where coordination is required for large-scale operations. Results like CAP and CALM2 provide clear tools for thinking through when coordination must occur, but offer little help in understanding the dynamic behavior of the system when it does occur.

One example of this problem is reacting to correlated failures. At scale, uncorrelated node failures happen all the
time. Designing to handle them is straightforward, as the code and design is continuously validated in production.
Large-scale correlated failures also happen, triggered by power and network failures, offered load, software bugs,
operator mistakes, and all manner of unlikely events. If systems are designed to coordinate during failure handling,
either as a mesh or by falling back to a controller, these correlated failures bring sudden bursts of coordination and
traffic. These correlated failures are rare, so the way the system reacts to them is typically untested at the scale at
which it is currently operating when they do happen. This increases time-to-recovery, and sometimes requires that
drastic action is taken to recover the system. Overloaded controllers, suddenly called upon to operate at thousands of
times their usual traffic, are a common cause of long time-to-recovery outages in large-scale cloud systems.

A related issue is the work that each individual cluster needs to perform during recovery or even scale-up. In practice,
it is difficult to ensure that real-world systems have both the capacity required to run, and spare capacity for
recovery. As soon as a system can’t do both kinds of work, it runs the risk of entering a mode where it is too
overloaded to scale up. The causes of failure here are both technical (load measurement is difficult, especially in
systems with rich APIs), economic (failure headroom is used very seldom, making it an attractive target to be optimized
away), and social (people tend to be poor at planning for relatively rare events).

Another risk of sometimes-coordination is changing quality of results. It’s well known how difficult it is to program
against APIs which offer inconsistent consistency, but this problem goes beyond just API behavior. A common design for
distributed workload schedulers and placement systems is to avoid coordination on the scheduling path (which may be
latency and performance critical), and instead distribute or discover stale information about the overall state of the
system. In steady state, when staleness is approximately constant, the output of these systems is predictable. During
failures, however, staleness may increase substantially, leading the system to making worse choices. This may increase
churn and stress on capacity, further altering the workload characteristics and pushing the system outside its comfort
zone.

The underlying cause of each of these issues is that the worst-case behavior of these systems may diverge significantly
from their average-case behavior, and that many of these systems are bistable with a stable state in normal operation,
and a stable state at “overloaded”. Within AWS, we are starting to settle on some patterns that help constrain the
behavior of systems in the worst case. One approach is to design systems that do a constant amount of coordination,
independent of the offered workload or environmental factors. This is expensive, with the constant work frequently going to waste, but worth it for resilience. Another emerging approach is designing explicitly for blast radius, strongly limiting the ability of systems to coordinate or communicate beyond some limited radius. We also design for static stability, the ability for systems to continue to operate as best they can when they aren’t able to coordinate.

More work is needed in this space, both in understanding how to build systems which strongly avoid congestive collapse
during all kinds of failures, and in building tools to characterize and test the behavior of real-world systems.
Distributed systems and control theory are natural partners.

Footnotes:

Cluster sizing is a super interesting topic in it's own right. Nine seems arbitrary here, but isn't: for the most durable consensus systems, because when spread across three datacenters allows one datacenter failure (losing 3) and one host failure while still having a healthy majority. Chain replicated and erasure coded systems will obviously choose differently, as will anything with read replicas, or cost, latency or other constraints.