SRE - Drupal Day Aveiro 2016

Site Reliability Engineering enables agility and stability.
SREs use software engineering to automate themselves out of the job.
My advice, if you want to implement this change in your company, is to start with action items, alter your training and hiring, implement error budgets, hold blameless postmortems, and reduce toil.

6.
7 ricardo.amaro@acquia.com
➔ Term coined at Google in 2003,
➔ when Ben Treynor was hired to run
“production” and ended up “applying
software engineering to an operations
function”.
Site Reliability Engineering

9.
“Reliability is the most fundamental feature of any product.”
Ben Treynor, Google’s VP for 24/7 Operations
Site Reliability Engineering

10.
DevOps & S.R.E.
DevOps is a practice, which was coined around
2008, that encompasses automation of manual
tasks, continuous integration and continuous
delivery. It applies to a wide audience of
companies whereas SRE might be considered a
subset of DevOps that possesses additional skill
sets.
Source:
https://en.wikipedia.org/wiki/Site_reliability_engineering

13.
➔ Hire only coders
➔ Have a Service Level Objective (SLO) for your service
➔ Measure and report performance against SLOs
➔ Use Error Budgets and gate launches on them
➔ Have a Common staffing pool for SRE and DEV
➔ Excess Ops work overflows to DEV team
➔ Cap SRE operational load at 50% and share 5% of the ops work with the DEV team
➔ On-call teams of at least 8 people at one location, or 6 at each of two locations, for each product
➔ Maximum of 2 events per on-call shift
➔ Postmortem for every event
➔ Postmortems are BLAMELESS and focus on process and technology, not people
How to achieve S.R.E.
Treynor’s Action items

18.
Then came the web...
● Software as a Service
● Platform as a Service
● Cloud computing
● ...
➔ Operations overhead not on the customer side
➔ Features could now be delivered faster
➔ Customer feedback important for product improvements
[Diagram: Product Development, Ship Features, Operations, Users]

21.
An Old Solution to Toil
● Scale with bodies
In the old operations model, you throw
people at a reliability problem and keep
pushing (sometimes for a year or more)
until the problem either goes away or
blows up in your face.

22.
● Cap Ops Workload
As your business grows, you need to
reduce manual labor in order to
continue delivering features. Put a 50%
cap on Ops work and leave most of the
SRE team's time for writing code and
reducing toil.
[Chart: "As your business grows, workload trends to infinity"; x-axis: time, y-axis: customers/traffic]
Workload/Toil over time

23.
Google’s example
➔ Keep operational work (i.e., toil) below 50% of each SRE's time
➔ More than 50% of each SRE's time is spent on:
◆ engineering project work to reduce toil
◆ adding service features that improve reliability, performance, utilization
➔ Improves career planning for the SRE
➔ Improves morale in the organization
➔ An SRE team can easily devolve into an Ops team if the 50% target is broken.
Why less Toil is Better
S.R.E. - A modern solution

24.
S.R.E. - A modern solution
DEV + OPS
➔ The conflict between Dev and Ops is not inevitable
➔ The solution is: Error Budgets!
➔ Everyone agrees on an Error Budget (as we will explain next)
➔ SRE only prevents releases or launches if the Error Budget is exceeded.

26.
What is an Error Budget?
The business or the product establishes Service Level Objectives (SLOs) for the system, based on
Service Level Indicators (SLIs) such as error rate, availability or latency...
Example: A 99.9% availability SLO means that the service can be unavailable 0.1% of the time; that 0.1% is the error budget.
Error Budget
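The arithmetic behind the 99.9% example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names and the 30-day measurement window are assumptions added here, since the slides do not state a window.

```python
def error_budget(slo: float) -> float:
    """Fraction of time the service may be unavailable while still meeting the SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime over a window.

    The 30-day default window is an assumption for illustration.
    """
    minutes_in_window = window_days * 24 * 60
    return error_budget(slo) * minutes_in_window


budget = error_budget(0.999)                                # the 0.1% from the example
downtime = allowed_downtime_minutes(0.999, window_days=30)  # minutes per 30 days

print(f"Error budget: {budget:.2%}")
print(f"Allowed downtime in 30 days: {downtime:.1f} minutes")
```

Under these assumptions, a 99.9% availability SLO leaves roughly 43 minutes of downtime per 30-day window.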

27.
➔ 100% is the wrong reliability target for basically everything.
➔ Set an SLO that acknowledges the trade-off and leaves an error budget
➔ Error budget can be spent on anything: launching features, etc.
➔ Error budget allows for discussion about how phased rollouts and 1%
experiments can maintain tolerable levels of errors.
➔ Goal of the SRE team isn’t “zero outages”; SRE and product devs are
incentive-aligned to spend the error budget to get maximum feature
velocity.
➔ Out of budget? No problem. Do more testing between releases.
How to obtain and use the Error Budget?
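One way to picture "gating launches on the error budget" is a small check that compares failures so far against what the SLO allows. This is a hedged sketch only: `BudgetStatus`, `may_launch`, and the request-based accounting are hypothetical names and choices made for illustration, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    """Hypothetical snapshot of an SLO period, counted in requests."""
    slo: float             # e.g. 0.999 for a 99.9% success-rate SLO
    total_requests: int    # requests served so far in the period
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The error budget expressed as a number of requests.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining(self) -> float:
        # Budget left to spend on launches, experiments, and so on.
        return self.allowed_failures - self.failed_requests


def may_launch(status: BudgetStatus) -> bool:
    """Launches proceed unless the error budget has been exceeded."""
    return status.failed_requests <= status.allowed_failures


healthy = BudgetStatus(slo=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = BudgetStatus(slo=0.999, total_requests=1_000_000, failed_requests=1_200)

print(may_launch(healthy), round(healthy.remaining))      # budget left: launch
print(may_launch(exhausted), round(exhausted.remaining))  # over budget: hold and test more
```

The point of the design is that the decision comes from the numbers alone: while budget remains, it can be spent on feature velocity; once it is exceeded, releases pause until reliability recovers.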

28.
➔ This gives developers an incentive to value stability (not just change).
➔ And it gives SREs a control that lets them permit change (not just stability).
➔ It forces decisions based on metrics: not politics or feelings, just data.
Error Budget
A Self-regulating mechanism

30.
➔ Development and SRE teams share a
single staffing pool
◆ If all is reliable, Devs are
rewarded with teammates
◆ If Ops is overloaded, SREs are
contracted to support code
How are Development & Operations
teams organized?
Now tell me… Why should I hire you?

31.
➔ SREs are developer/sys-admin hybrids
◆ They perform more Dev work as
things become stable
Development & Operations
Systems, code…
Are you able to cook also?

32.
➔ SREs can only spend up to 50% of their time on ops work
➔ If operational load exceeds 50%, the ops work overflows to Dev
➔ Allow them to move to other projects
Development & Operations
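The 50% cap and the overflow rule can be expressed as simple arithmetic. The sketch below is illustrative only: `split_ops_work` and the hour figures are hypothetical, and measuring the load in hours per period is an assumption made here.

```python
def split_ops_work(ops_hours: float, sre_hours: float, cap: float = 0.5) -> tuple[float, float]:
    """Split operational work between SRE and Dev.

    ops_hours: total ops work arriving in the period
    sre_hours: total working hours the SRE team has in the period
    cap:       maximum share of SRE time that may go to ops work (50% per the slides)
    Returns (hours handled by SRE, hours overflowing to Dev).
    """
    sre_ops = min(ops_hours, cap * sre_hours)
    overflow = ops_hours - sre_ops
    return sre_ops, overflow


# Hypothetical period: 400 SRE hours available, 260 hours of ops work arriving.
sre_ops, overflow_to_dev = split_ops_work(ops_hours=260, sre_hours=400)

print(f"SRE handles {sre_ops:.0f} h of ops work")
print(f"{overflow_to_dev:.0f} h overflow to the Dev team")
```

With these made-up numbers, SRE takes 200 hours and the remaining 60 hours go to the developers, which is the feedback signal the cap is meant to create.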

34.
➔ An engineer can only react with urgency a few
times a day before they get fatigued.
➔ Every page should be actionable.
➔ Every page response should require intelligence.
➔ Pages should be about a new problem or an
event that hasn’t been seen before.
Pager fatigue
A serious problem to be addressed
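The paging criteria above can be read as a filter applied before anyone is woken up. The following is a minimal sketch under stated assumptions: the `Alert` fields and the `should_page` function are hypothetical and do not correspond to any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert record; the fields mirror the criteria in the bullets."""
    name: str
    actionable: bool             # is there something a human can do right now?
    needs_human_judgement: bool  # does the response require intelligence?
    seen_before: bool            # is this a known, already-understood event?


def should_page(alert: Alert) -> bool:
    """Page only for actionable, novel problems that need a human."""
    return alert.actionable and alert.needs_human_judgement and not alert.seen_before


alerts = [
    Alert("disk-80-percent", actionable=False, needs_human_judgement=False, seen_before=True),
    Alert("nightly-restart", actionable=True, needs_human_judgement=False, seen_before=True),
    Alert("new-replication-lag", actionable=True, needs_human_judgement=True, seen_before=False),
]

pages = [a.name for a in alerts if should_page(a)]
print(pages)
```

Alerts that fail the filter are candidates for automation or a ticket queue rather than a page, which keeps the number of urgent interruptions per shift low.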

38.
➔ Document written for ALL significant incidents
➔ Non-paged incidents are even more valuable: they reveal monitoring gaps
➔ Explain what happened in detail
➔ Find all root causes of the event
➔ Assign actions to correct the problem or improve how it is addressed next time
What are Postmortems?

39.
➔ Use a blame-free postmortem culture, with the goal
of exposing faults
◆ Apply engineering to fix these faults,
◆ not just to avoid or minimize them.
Postmortems Are Blameless!