The book starts with a story about a time Margaret Hamilton brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.

On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.

This sounds familiar – I’ve lost track of the number of dev post-mortems that have the same basic structure.

This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it’s not obvious, asides from me are in italics.

Chapter 1: Introduction

Everything in this chapter is covered in much more detail later.

Two approaches to hiring people to manage system stability:

Traditional approach: sysadmins

Assemble existing components and deploy to produce a service

Respond to events and updates as they occur

Grow team to absorb increased work as service grows

Pros

Easy to implement because it’s standard

Large talent pool to hire from

Lots of available software

Cons

Manual intervention for change management and event handling causes size of team to scale with load on system

Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which in turn causes a similarly pathological response from devs, who reclassify “launches” as “incremental updates”, “flag flips”, etc.

Google’s approach: SREs

Have software engineers do operations

Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 – L3 networking or UNIX system internals).

Career progress comparable to dev career track

Results

SREs would be bored by doing tasks by hand

Have the skillset necessary to automate tasks

Do the same work as an operations team, but with automation instead of manual labor

To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs

Upper bound. Actual amount of ops work is expected to be much lower

Pros

Cheaper to scale

Circumvents devs/ops split

Cons

Hard to hire for

May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)

I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.

Tenets of SRE

Ensuring a durable focus on engineering

50% ops cap means that extra ops work is redirected to product teams on overflow

Provides feedback mechanism to product teams as well as keeps load down

Target max 2 events per 8-12 hour on-call shift

Postmortems for all serious incidents, even if they didn’t trigger a page

Blameless postmortems

2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?

How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, everyone will claim to do them. Sort of like having good testing and deployment practices. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have hair-raising stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.

Move fast without breaking SLO

Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort

Set a goal that acknowledges the trade-off and leaves an error budget

Error budget can be spent on anything: launching features, etc.

Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors

Goal of SRE team isn’t “zero outages” – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity

It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.

I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.
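
A back-of-the-envelope sketch of the error budget arithmetic (my own illustration, not from the book): a 99.9% availability target over a 30-day window leaves roughly 43 minutes of allowable downtime, and launches, experiments, and outages all draw from that same pool.

```python
# Rough error-budget arithmetic for an availability SLO (illustrative only).
SLO = 0.999                      # availability target, e.g. "three 9s"
WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window

error_budget_minutes = (1 - SLO) * WINDOW_MINUTES
print(f"Allowed downtime per window: {error_budget_minutes:.1f} minutes")  # ~43.2

# If incidents so far this window add up to 25 minutes of downtime:
consumed = 25
remaining = error_budget_minutes - consumed
print(f"Remaining budget: {remaining:.1f} minutes "
      f"({remaining / error_budget_minutes:.0%} of budget left)")
```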

Monitoring

Monitoring should never require a human to interpret any part of the alerting domain

Three valid kinds of monitoring output

Alerts: human needs to take action immediately

Tickets: human needs to take action eventually

Logging: no action needed

Note that, for example, graphs are a type of log
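
A trivial sketch to make the three-way split concrete (the names and shape are mine, not the book's): the only decision monitoring should ever hand a human is "act now", "act eventually", or "nothing to do".

```python
from enum import Enum

class Output(Enum):
    ALERT = "page a human now"         # human must act immediately
    TICKET = "file a ticket"           # human must act eventually
    LOG = "record for later analysis"  # no action needed (graphs included)

def route(needs_action: bool, urgent: bool) -> Output:
    # The classification should be mechanical; no human should have to
    # interpret raw signals to decide whether something is urgent.
    if needs_action and urgent:
        return Output.ALERT
    if needs_action:
        return Output.TICKET
    return Output.LOG
```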

Emergency Response

Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery)

For evaluating responses, we care about MTTR

Humans add latency

Systems that don’t require humans to respond will have higher availability due to lower MTTR
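
The standard steady-state approximation behind this (my gloss; the notes don't spell it out): availability ≈ MTTF / (MTTF + MTTR), so for a fixed MTTF, anything that shrinks MTTR – like taking humans out of the response path – buys availability directly.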

Having a “playbook” produces 3x lower MTTR

Having hero generalists who can respond to everything works, but having playbooks works better

Identifying risk tolerance of infrastructure services

Target availability

Some teams use Bigtable as a backing store for offline analysis – they care more about throughput than reliability

Too expensive to meet all needs generically

Ex: Bigtable instance

Low-latency Bigtable user wants low queue depth

Throughput oriented Bigtable user wants moderate to high queue depth

Success and failure are diametrically opposed in these two cases!

Cost

Partition infra and offer different levels of service

In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)

Motivation for error budgets

No notes on this because I already believe all of this. Maybe go back and re-read this if involved in debate about this.

Chapter 4: Service level objectives

Note: skipping notes on terminology section.

Ex: Chubby planned outages

Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google

Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby

Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down

What do you and your users care about?

Too many indicators: hard to pay attention

Too few indicators: might ignore important behavior

Different classes of services should have different indicators

User-facing: availability, latency, throughput

Storage: latency, availability, durability

Big data: throughput, end-to-end latency

All systems care about correctness

Collecting indicators

Can often do naturally from server, but client-side metrics sometimes needed.

Aggregation

Use distributions and not averages

User studies show that people usually prefer slower average with better tail latency

Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc.

Can have exceptions, but having reasonable defaults makes things easier
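
A tiny sketch of why distributions beat averages (toy numbers of mine, not from the book): a handful of very slow requests barely moves the mean but is immediately visible in the tail.

```python
import statistics

# 99 fast requests and one very slow one (latencies in ms) -- toy numbers
latencies = [10] * 99 + [2000]

cuts = statistics.quantiles(latencies, n=100)   # percentile cut points
mean, p50, p99 = statistics.mean(latencies), cuts[49], cuts[98]

print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  p99={p99:.1f}ms")
# The ~30ms mean looks healthy; the ~2s p99 makes the outlier obvious.
```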

Choosing targets

Don’t pick target based on current performance

Current performance may require heroic effort

Keep it simple

Avoid absolutes

Unreasonable to talk about “infinite” scale or “always” available

Minimize number of SLOs

Perfection can wait

Can always redefine SLOs over time

SLOs set expectations

Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)

Don’t overachieve

See Chubby example, above

Another example is making sure that the system isn’t too fast under light loads

Chapter 5: Eliminating toil

Carla Geisser: “If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.”

Def: Toil

Not just “work I don’t want to do”

Manual

Repetitive

Automatable

Tactical

No enduring value

O(n) with service growth

Surveys find ~33% toil on average

Numbers can be as low as 0% and as high as 80%

Toil > 50% is a sign that the manager should spread toil load more evenly

Is toil always bad?

Predictable and repetitive tasks can be calming

Can produce a sense of accomplishment, can be low-risk / low-stress activities

Section on why toil is bad. Skipping notetaking for that section.

Chapter 6: Monitoring distributed systems

Why monitor?

Analyze long-term trends

Compare over time or do experiments

Alerting

Building dashboards

Debugging

As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.

Setting reasonable expectations

Monitoring is non-trivial

10-12 person SRE team typically has 1-2 people building and maintaining monitoring

Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra

This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and considering the infrequency of on-call rotations there’s no reason to expect that this should randomly balance out over the course of a year or two.

What went well: dependent teams escalated issues immediately, were able to restore access

What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env

Almost all externally facing systems depend on this and become unavailable

Many internal systems also have a dependency and become unavailable

Alerts start firing within seconds

Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering

What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)

What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage

Process-induced emergency

No notes on process-induced example.

Chapter 14: Managing incidents

This is an area where we seem to actually be pretty good. No notes on this chapter.

Chapter 15: Postmortem culture: learning from failure

No notes on this chapter.

Chapter 16: Tracking outages

Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.

No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.

Chapter 19: Load balancing at the frontend

No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. For more info on lower-level details, there’s the Maglev paper.

Chapter 20: Load balancing in the datacenter

Flow control

Need to avoid unhealthy tasks

Naive flow control for unhealthy tasks

Track number of requests to a backend

Treat backend as unhealthy when threshold is reached

Cons: generally terrible

Health-based flow control

Backend task can be in one of three states: {healthy, refusing connections, lame duck}

Lame duck state can still take connections, but sends backpressure request to all clients

Lame duck state simplifies clean shutdown
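
A minimal sketch of the three-state idea – the state names come from the notes above, but the client-side handling is my guess at the shape, not the actual RPC framework behavior:

```python
from enum import Enum, auto

class BackendState(Enum):
    HEALTHY = auto()     # accepting and serving new requests
    REFUSING = auto()    # refusing connections outright
    LAME_DUCK = auto()   # still listening, but asking clients to stop sending work

def client_should_send(state: BackendState) -> bool:
    # Clients keep their connections to lame ducks open (which is what makes
    # clean shutdown easy: in-flight requests can drain) but send no new work.
    return state is BackendState.HEALTHY
```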

Def: subsetting: limiting pool of backend tasks that a client task can interact with

Clients in RPC system maintain pool of connections to backends

Using pool reduces latency compared to doing setup/teardown when needed

Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)

Choosing the correct subset

Typ: 20-100, chosen based on workload

Subset selection: random

Bad utilization

Subset selection: round robin

Order is permuted; each round has its own permutation
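
The chapter sketches a deterministic subsetting algorithm along these lines; this is a reconstruction from memory, so treat it as a sketch rather than the book's exact code. Clients are grouped into rounds, each round shuffles the backend list using the round number as a seed, and each client takes a contiguous slice of that round's permutation, which spreads connections evenly across backends.

```python
import random

def deterministic_subset(backends: list, client_id: int, subset_size: int) -> list:
    # Assumes len(backends) is a multiple of subset_size; handling the
    # remainder is elided to keep the sketch short.
    subsets_per_round = len(backends) // subset_size
    round_id = client_id // subsets_per_round
    subset_id = client_id % subsets_per_round

    shuffled = backends[:]
    random.Random(round_id).shuffle(shuffled)   # same seed => same order for the whole round

    start = subset_id * subset_size
    return shuffled[start:start + subset_size]
```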

Load balancing

Subset selection is for connection balancing, but we still need to balance load

Load balancing: round robin

In practice, observe 2x difference between most loaded and least loaded

In practice, most expensive request can be 1000x more expensive than cheapest request

In addition, there’s random unpredictable variation in requests

Load balancing: least-loaded round robin

Exactly what it sounds like: round-robin among least loaded backends

Load appears to be measured in terms of connection count; may not always be the best metric

This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients

Must have good health checks – bad backends may fail requests very quickly and therefore appear to have low load
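
A sketch of per-client least-loaded selection, assuming active-request count as the load signal (exactly the imperfect metric the notes above call out):

```python
import collections

class LeastLoadedPicker:
    """Pick the backend in this client's subset with the fewest active requests."""

    def __init__(self, subset: list):
        self.active = collections.Counter({backend: 0 for backend in subset})

    def pick(self):
        # Per-client view only: a backend can still be swamped by other clients,
        # and a backend failing requests quickly will look deceptively idle.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def done(self, backend):
        self.active[backend] -= 1
```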

Load balancing: weighted round robin

Same as above, but weight with other factors

I wonder what Heroku meant when they responded to Rap Genius by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.

Chapter 21: Handling overload

Even with “good” load balancing, systems will become overloaded

Typical strategy is to serve degraded responses, but under very high load that may not be possible

Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone

These generally change slowly, but can change rapidly (e.g., because of a single checkin)

Better solution: measure directly available resources

CPU utilization is usually a good signal for provisioning

With GC, memory pressure turns into CPU utilization

With other systems, can provision other resources such that CPU is likely to be limiting factor

In cases where over-provisioning CPU is too expensive, take other resources into account

How much does it cost to generally over-provision CPU like that?

Client-side throttling

Backends start rejecting requests when customer hits quota

Requests still use resources, even when rejected – without throttling, backends can spend most of their resources on rejecting requests
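
A sketch of the adaptive client-side throttling idea as I understand it from the chapter: each client tracks how many requests it has attempted vs. how many the backend actually accepted, and starts rejecting locally – before a request ever leaves the client – as the gap grows. The constant K and the exact formula here are from memory, so check them against the book before relying on them.

```python
import random

class AdaptiveThrottle:
    """Client-side throttling: start rejecting locally when the backend mostly rejects us."""

    def __init__(self, k: float = 2.0):
        self.k = k          # how far requests may exceed accepts before we throttle
        self.requests = 0   # requests this client attempted (over some rolling window)
        self.accepts = 0    # requests the backend actually accepted

    def allow(self) -> bool:
        # Local rejection probability grows as accepts lag behind requests.
        reject_prob = max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))
        return random.random() >= reject_prob

    def record(self, accepted: bool) -> None:
        self.requests += 1
        if accepted:
            self.accepts += 1
```

A real implementation would keep these counters separately for each criticality level (see the notes below).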

Criticality

Seems to be priority but with a different name?

First-class notion in RPC system

Client-side throttling keeps separate stats for each level of criticality

By default, criticality is propagated through subsequent RPCs

Handling overloaded errors

Shed load to other DCs if DC is overloaded

Shed load to other backends if DC is ok but some backends are overloaded

Clients retry when they get an overloaded response

Per-request retry budget (3)

Per-client retry budget (10%)

Failed retries from client cause “overloaded; don’t retry” response to be returned upstream

Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.
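
A sketch of how the two retry budgets compose, using the numbers from the notes above (at most 3 attempts per request, retries capped at ~10% of a client's request volume); the status strings are stand-ins for real RPC status codes:

```python
class RetryBudget:
    """Per-client retry budget: retries may be at most ~10% of total requests."""

    def __init__(self, ratio: float = 0.10):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def can_retry(self) -> bool:
        return self.retries < self.ratio * self.requests

def call_with_retries(send, budget: RetryBudget, max_attempts: int = 3) -> str:
    budget.requests += 1
    for attempt in range(max_attempts):
        status = send()
        if status == "ok":
            return status
        if status == "overloaded; don't retry":
            # A layer below us already exhausted its retries; don't amplify.
            return status
        if attempt + 1 < max_attempts and budget.can_retry():
            budget.retries += 1
            continue
        break
    # Our own retries failed: tell *our* callers not to retry either.
    return "overloaded; don't retry"
```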

Chapter 22: Addressing cascading failures

Typical failure scenarios?

Server overload

Ex: have two servers

One gets overloaded, failing

Other one now gets all traffic and also fails

Resource exhaustion

CPU/memory/threads/file descriptors/etc.

Ex: dependencies among resources

1) Java frontend has poorly tuned GC params

2) Frontend runs out of CPU due to GC

3) CPU exhaustion slows down requests

4) Increased queue depth uses more RAM

5) Fixed memory allocation for entire frontend means that less memory is available for caching

6) Lower hit rate

7) More requests into backend

8) Backend runs out of CPU or threads

9) Health checks fail, starting cascading failure

Difficult to determine cause during outage

Note: policies that avoid servers that serve errors can make things worse

Fewer backends are available, which then get too many requests, which makes them unavailable too

Using deadlines instead of timeouts is great. We should really be more systematic about this.

Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.
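
A sketch of deadlines instead of timeouts, assuming the absolute deadline travels with the request (e.g., in RPC metadata): every hop checks how much time the overall request has left instead of applying its own round-number timeout, so work the caller has already given up on gets dropped instead of turning into a zombie request.

```python
import time

class DeadlineExceeded(Exception):
    pass

def remaining(deadline: float) -> float:
    """Seconds left before the absolute deadline (monotonic clock within one process)."""
    return deadline - time.monotonic()

def handle(request, deadline: float, backend_call):
    # Drop work that can't possibly finish in time rather than queueing it.
    if remaining(deadline) <= 0:
        raise DeadlineExceeded("caller has already given up on this request")
    # Pass the *same* deadline downstream (real RPC systems usually propagate
    # the remaining time), so the whole tree of calls shares one budget
    # instead of each hop picking its own round-number timeout.
    return backend_call(request, deadline=deadline)

# The caller sets the deadline once, at the edge:
#   deadline = time.monotonic() + 0.8   # 800 ms end-to-end budget
#   handle(req, deadline, backend_call)
```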

Try to avoid intra-layer communication

Simpler, avoids possible cascading failure paths

Testing for cascading failures

Load test components!

Load testing both reveals the breaking point and ferrets out components that will totally fall over under load

Restart servers (only if that would help – e.g., in GC death spiral or deadlock)

Drop traffic – drastic, last resort

Enter degraded mode – requires having built this into service previously

Eliminate batch load

Eliminate bad traffic

Chapter 23: Distributed consensus for reliability

How do we agree on questions like…

Which process is the leader of a group of processes?

What is the set of processes in a group?

Has a message been successfully committed to a distributed queue?

Does a process hold a particular lease?

What’s the value in a datastore for a particular key?

Ex1: split-brain

Service has replicated file servers in different racks

Must avoid writing simultaneously to both file servers in a set to avoid data corruption

Each pair of file servers has one leader & one follower

Servers monitor each other via heartbeats

If one server can’t contact the other, it issues a STONITH (Shoot The Other Node In The Head) command

But what happens if the network is slow or packets get dropped?

What happens if both servers issue STONITH?

This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.

Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!

Ex2: failover requires human intervention

A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC

External health checks decide if the primary should failover to its secondary

If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”

This increases operational load

Problems are correlated and this is relatively likely to run into problems when people are busy with other issues

If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system

Chapter 26: Data integrity

99.99% good bytes in a 2GB file means 200K corrupt bytes. Probably not ok for most apps

Backup is non-trivial

May have mixture of transactional and non-transactional backup and restore

Different versions of business logic might be live at once

If services are independently versioned, there may be many combinations of versions live at once

Replicas aren’t sufficient – replicas may sync corruption

Study of 19 data recovery efforts at Google

Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs

Hardest cases were low-grade corruption discovered weeks to months later

Defense in depth

First layer: soft deletion

Users should be able to delete their data

But that means that users will be able to accidentally delete their data

Also, account hijacking, etc.

Accidental deletion can also happen due to bugs

Soft deletion delays actual deletion for some period of time
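
A sketch of the first layer: mark records deleted instead of destroying them, and only purge after a grace period. The field names and the 30-day window are illustrative, not Google's.

```python
import datetime

PURGE_AFTER = datetime.timedelta(days=30)   # grace period before real deletion

def soft_delete(record: dict) -> None:
    record["deleted_at"] = datetime.datetime.now(datetime.timezone.utc)

def restore(record: dict) -> None:
    # A user (or support) can undo an accidental or malicious deletion
    # any time inside the grace period.
    record["deleted_at"] = None

def purge_eligible(record: dict) -> bool:
    deleted_at = record.get("deleted_at")
    if deleted_at is None:
        return False
    return datetime.datetime.now(datetime.timezone.utc) - deleted_at > PURGE_AFTER
```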

Second layer: backups

Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go

Want backups to go back forever, since corruption can go unnoticed for months (or longer)

But changes to code and schema can make recovery of older backups expensive

Google usually has a 30 to 90 day window, depending on the service

Third layer: early detection

Out-of-band integrity checks

Hard to do this right!

Correct changes can cause checkers to fail

But loosening checks can cause failures to get missed

No notes on the two interesting case studies covered.

Chapter 27: Reliable product launches at scale

No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see Appendix E, the launch coordination checklist.