“Understanding Slowness”

Time to fix an outage correlates directly with how broad the
engineer’s focus is.

Senior Engineers are more cognizant of the steaming pile that is
their architecture

Map #1 – High-level map

Architectural Components

Connectedness

Data flow

Map #2 – Low-level map

Component versions

Component languages

OS/NICs/HBAs

Location

Switches/Routers/FW

Connected Service details

2 types of useful SREs:

Spanning several boundaries (deep, not wide)

Spanning all boundaries (wide, not deep)

Who’s On First?

Establish who is responsible for each component in each context

Establish who is responsible when that person fails (upward)

Establish who is responsible when that person needs help (upward & downward)

GAME DAY EXERCISES mitigate these challenges
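
One way to make the "who's on first" answers checkable before a game day is to write them down as data. The sketch below is only an illustration: the people, components, contexts, and the escalation_chain helper are all hypothetical names, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Person:
    name: str
    escalates_to: Optional[str] = None  # upward: who takes over if this person fails
    helpers: tuple = ()                 # upward & downward: who to pull in for help

# Hypothetical roster; every name here is made up for illustration.
PEOPLE = {
    "dana": Person("dana", escalates_to="lee", helpers=("sam",)),
    "sam": Person("sam", escalates_to="dana"),
    "lee": Person("lee", escalates_to="cto", helpers=("dana",)),
    "cto": Person("cto"),
}

# Responsibility is per component AND per context.
RESPONSIBLE = {
    ("database", "production"): "dana",
    ("database", "staging"): "sam",
    ("load-balancer", "production"): "lee",
}

def escalation_chain(component, context):
    """Return the upward chain of owners, or [] if nobody owns the pair."""
    chain, seen = [], set()
    current = RESPONSIBLE.get((component, context))
    while current is not None and current not in seen:  # guard against cycles
        seen.add(current)
        chain.append(current)
        current = PEOPLE[current].escalates_to
    return chain

print(escalation_chain("database", "production"))  # ['dana', 'lee', 'cto']
print(escalation_chain("cache", "production"))     # [] -> an ownership gap
```

An empty chain is the interesting result: it is exactly the kind of gap a game day exercise is meant to surface before a real outage does.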

Expectations

Set expectations for breakages and slowdowns

What you build will break; understanding under what stress it
breaks is your job as an engineer

(Choosing which of these you need to know is part of the
challenge in small/all companies)

0 Tech Loyalty

Construct a solution from parts

Parts are replaceable

(Different parts from different providers will have different
tolerances, which can be good for your infrastructure. Ex. if one
provider fails at tolerance X, the other may well fail at a
different point.)

Logistics matter (when things are broken [or slow])

Observability

Tool parity

Safety harnesses (you can change production code in a defined/protected scope)

Latency

You must subdue it

First, you must understand it

Histograms over Aggregations

Averages are for chumps

Reducing many observations (S) to a few values (N) is the definition of “lossy”

AKA “you don’t know shit”
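
To make the "lossy" point concrete, here is a minimal sketch with made-up latency numbers and arbitrary bucket edges (both are assumptions for illustration). It reduces the same 100 observations once to a single average and once to a small histogram.

```python
from statistics import mean

def histogram(samples, edges):
    """Count samples per bucket; bucket i holds edges[i] <= x < edges[i + 1]."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical latencies in milliseconds: 90 fast requests, 10 very slow ones.
latencies_ms = [10] * 90 + [1000] * 10

avg = mean(latencies_ms)                 # S=100 observations reduced to N=1 value
edges = [0, 50, 200, 500, 2000]          # bucket boundaries, chosen arbitrarily
counts = histogram(latencies_ms, edges)  # S=100 observations reduced to N=4 values

print(avg)     # 109
print(counts)  # [90, 0, 0, 10]
```

The average lands at 109 ms, a latency that no request in this data set actually experienced. The histogram is still lossy, but it keeps the shape: most requests are fast and one in ten is terrible.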

Quantiles

Time series histograms are a lot of information to digest

Moving quantiles can often provide much more insight

Min & Max are the most valuable quantiles if you only have
2; they bound reality for you
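
A moving quantile can be sketched as a sliding window that is re-sorted on every new sample. The code below assumes a simple nearest-rank quantile definition and a tiny made-up sample stream; real monitoring systems typically use approximate sketches instead of sorting, so treat this as an illustration of the idea only.

```python
import math
from collections import deque

def quantile(sorted_vals, q):
    """Nearest-rank quantile of a sorted list; q=0.0 is the min, q=1.0 the max."""
    if not sorted_vals:
        raise ValueError("no samples")
    rank = max(1, math.ceil(q * len(sorted_vals)))
    return sorted_vals[rank - 1]

def moving_quantiles(samples, window, qs=(0.0, 0.5, 0.99, 1.0)):
    """Yield one tuple of quantiles for each full sliding window over samples."""
    buf = deque(maxlen=window)  # the oldest sample falls out automatically
    for x in samples:
        buf.append(x)
        if len(buf) == window:
            ordered = sorted(buf)
            yield tuple(quantile(ordered, q) for q in qs)

# Hypothetical latency stream in milliseconds with one outlier.
stream_ms = [10, 12, 11, 500, 13, 12]
rows = list(moving_quantiles(stream_ms, window=4))

for lo, median, p99, hi in rows:
    print(lo, median, p99, hi)
# 10 11 500 500
# 11 12 500 500
# 11 12 500 500
```

Even in this toy stream the median barely moves while the max sits at 500 ms for as long as the outlier stays in the window, which is why min and max together tell you the range reality is confined to.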