@mattstratton
THE DATA
50,000 RESPONDERS RECEIVING A TOTAL OF 760 MILLION NOTIFICATIONS
▸ 60 million notifications during dinner hours
▸ 82 million notifications during evening hours
▸ 250 million notifications during sleeping hours
▸ 122 million notifications on weekends
▸ A total of 750,000 nights with sleep-interrupting notifications
▸ A total of 330,000 weekend days with interrupting notifications
PagerDuty commissioned a study of more than 10,000 companies in over 100 different segments.

@mattstratton
LET’S HAVE SOME DATA
THE MOST MEANINGFUL METRICS ON ATTRITION ARE
▸ Number of days where a responder’s work and life are interrupted
▸ Number of days when a responder is woken overnight
▸ Number of weekend days interrupted by notifications
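The three metrics above can be derived from raw notification timestamps. The following is a minimal sketch, not the method used in the study: the function name `interruption_days`, the sleeping-hours window (22:00 to 06:00), and the choice to count any day with at least one notification as "interrupted" are all assumptions made for illustration. Timestamps are assumed to be in the responder's local time.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable

# Assumed sleeping-hours window; adjust to your own definition.
SLEEP_START_HOUR = 22  # 22:00 local time
SLEEP_END_HOUR = 6     # 06:00 local time


def interruption_days(notifications: Iterable[datetime]) -> Dict[str, int]:
    """Count distinct interrupted days, woken nights, and interrupted weekend days
    for a single responder, given that responder's notification timestamps."""
    interrupted_days = set()
    nights_woken = set()
    weekend_days = set()

    for ts in notifications:
        day = ts.date()
        # Assumption: any notification counts as an interruption of that day.
        interrupted_days.add(day)

        # A night is labelled by the calendar date on which it starts, so a
        # 02:00 page belongs to the night that began the previous evening.
        if ts.hour >= SLEEP_START_HOUR:
            nights_woken.add(day)
        elif ts.hour < SLEEP_END_HOUR:
            nights_woken.add(day - timedelta(days=1))

        # weekday() returns 5 for Saturday and 6 for Sunday.
        if ts.weekday() >= 5:
            weekend_days.add(day)

    return {
        "days_interrupted": len(interrupted_days),
        "nights_woken": len(nights_woken),
        "weekend_days_interrupted": len(weekend_days),
    }
```

Counting distinct days rather than raw notifications is the point of the design: ten pages in one night and one page in one night both cost the responder a night of sleep.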

@mattstratton
“EXAMPLES OF MEMES ARE TUNES, IDEAS, CATCH-PHRASES, CLOTHES FASHIONS, WAYS OF MAKING POTS OR OF BUILDING ARCHES. JUST AS GENES PROPAGATE THEMSELVES IN THE GENE POOL BY LEAPING FROM BODY TO BODY, SO MEMES PROPAGATE THEMSELVES IN THE MEME POOL BY LEAPING FROM BRAIN TO BRAIN VIA IMITATION.”
Richard Dawkins

@mattstratton
SNOW CRASH
▸ In the book, “Snow Crash” itself is a neural-linguistic virus.
▸ The bad guys figure out how to unlock it, and it spreads from hacker to hacker like a meme.
▸ Plus, lots of swordplay.
“IDEOLOGY IS A VIRUS.”
NEAL STEPHENSON
Remember, memes are another way of evolving across generations. This happens in the world of Snow Crash, but it can happen in your organization as well.

@mattstratton
WHAT IF YOU ARE THE SUPREME LEADER?
▸ “Command and control” doesn’t work
▸ Use measurement for good, not for evil
▸ Avoid “executive swoop”

@mattstratton
REVIEW. REVIEW. REVIEW
A CULTURE OF LEARNING
▸ In a generative, performance-oriented organization, “failure leads to inquiry.”
▸ Don’t take my word for it. Ask Ron Westrum.
▸ You can also ask Dr. Nicole Forsgren (@nicolefv).
http://bit.ly/2KpzKKW
If we don’t treat every outage or alert as something to learn from or something to improve, we run the risk of the Normalization of Deviance effect. In this case, we start to treat alerts or degradations as acceptable. Our standards suffer. We let things slip through the cracks.

@mattstratton
REVIEW. REVIEW. REVIEW
NORMALIZATION OF DEVIANCE
▸ The gradual process through which unacceptable practice or standards become acceptable. As the deviant behavior is repeated without catastrophic results, it becomes the social norm for the organization.
▸ This happened to NASA. Twice.
▸ In our case, we start to treat alerts or degradations as acceptable.
http://bit.ly/2Ihj1wV

@mattstratton
QUESTION METRICS
WHY ARE WE USING THESE NUMBERS?
▸ What data drives your incident process?
▸ Are your metrics tied to business outcomes?
▸ Correlation doesn’t always equal causation
Let’s make sure that we are setting the proper expectations. We don’t want to expect five 9s of reliability just because “well, five is better than four.” Why do you need five? Have you tied your metrics to a business outcome?

Likewise, your speed metrics shouldn’t be “faster than last month.” And beware of inaccurate extrapolation. You might have data suggesting that if your page load time increases by a second, conversion drops by 50 percent. But that doesn’t mean that if you reduce load time by a second, conversion will increase by 50 percent. Correlation doesn’t always equal causation, and the same numbers don’t move the dials in both directions.
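To make the "why five?" question concrete, it helps to translate a number of nines into the downtime it actually permits. This is a small illustrative calculation, not something from the talk; the function name `allowed_downtime_minutes` and the 365-day year are assumptions.

```python
MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(nines: int, period_days: float = 365.0) -> float:
    """Downtime budget, in minutes, for an availability target of N nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    # Five nines means 99.999% available, i.e. 0.00001 of the period unavailable.
    unavailable_fraction = 10.0 ** (-nines)
    return period_days * MINUTES_PER_DAY * unavailable_fraction


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes of downtime per year")
```

Four nines allows roughly 52.56 minutes of downtime per year and five nines roughly 5.26. Whether that extra 47 minutes is worth the cost of achieving it is exactly the business-outcome question the slide asks.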

@mattstratton
THE MORE RESILIENTLY THE SYSTEM IS DESIGNED, THE MORE LIKELY IT IS TO CAUSE A NEGATIVE BUSINESS IMPACT
Stratton’s Law of Catastrophic Predestination
KEEP IT SIMPLE
At the heart of every complex resilient system is the hubris that someone believed they could predict everything that could go wrong. Fate, and the internet, laughs.

@mattstratton
COMMUNICATE.
TALK TO PEOPLE
▸ Who are your customers? What are their expectations?
▸ Whose customer are you? Can you help them out?
▸ What are the perceptions of your team?
Ask how the on-call responder is feeling during stand-ups. Give them the opportunity to mention that they might be burning out.

@mattstratton
INCIDENT COMMAND
LEARN TO TAKE COMMAND
Volunteer to help as an incident commander. (What’s that? Maybe we should have them!)

@mattstratton
MAKE IT NICE ON THE BRIDGE
DURING A CALL
▸ Have clearly defined roles
▸ Avoid the bystander effect
▸ Rally fast, disband faster
▸ Don’t litigate severity
▸ Have a clear mechanism for making decisions
You want to get all the right people on the call as soon as you need to… but you also want to get them OFF the call as soon as possible.

@mattstratton
SHARE ALL TESTS
SHARING IS CARING

@mattstratton
SHARE ALL TESTS
TESTS ARE FOR SWE AND SRE BOTH
▸ All functional tests used in preproduction should have a corresponding monitor in production
▸ All monitoring functionality in production should have corresponding tests in the build/release process
▸ Monitoring is testing with a time dimension.
There should be full parity between preproduction and production.
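One way to get that parity is to write each check once and run it in both places. The sketch below is illustrative only: `check_health`, `run_as_test`, `run_as_monitor`, and the `{"status": "ok"}` payload are hypothetical names and shapes, and a real monitor would report to your alerting system rather than return a list.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

Fetch = Callable[[], dict]
Check = Callable[[Fetch], bool]


def check_health(fetch: Fetch) -> bool:
    """Hypothetical functional check: the service reports an 'ok' status."""
    payload = fetch()
    return payload.get("status") == "ok"


def run_as_test(check: Check, fetch: Fetch) -> None:
    """Preproduction use: run the check once and fail the build if it fails."""
    assert check(fetch), f"{check.__name__} failed"


@dataclass
class Observation:
    timestamp: float
    passed: bool


def run_as_monitor(
    check: Check,
    fetch: Fetch,
    runs: int,
    interval_seconds: float,
    clock: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Observation]:
    """Production use: run the same check repeatedly, recording when it ran."""
    observations: List[Observation] = []
    for i in range(runs):
        observations.append(Observation(timestamp=clock(), passed=check(fetch)))
        if i < runs - 1:
            sleep(interval_seconds)
    return observations
```

The only difference between the two runners is the timestamp and the loop, which is the "time dimension" from the slide. In the build pipeline `fetch` would call the preproduction environment, and in production it would call the live service.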

@mattstratton
DO ONE NICE THING
EVERY SPRINT

@mattstratton
HELP YOUR RESPONDERS IN EACH AND EVERY SPRINT
▸ In each sprint/work unit, add value to your responders
▸ Even if it’s not on a card
▸ You rebel, you.

@mattstratton
ADDING VALUE
SOME EXAMPLES
These might seem obvious, but if they’re so obvious, I assume you’ve done them already?

@mattstratton
ADDING VALUE
▸ If you use feature flags, add a description field to the configuration
▸ If you use runbooks, ensure they are up to date every time you cut a release. If you don’t do this, abandon the runbook altogether (an incorrect runbook is considered harmful)
▸ SIMPLIFY, MAN!
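As one possible shape for the feature-flag suggestion, the sketch below makes the description mandatory so that a responder never has to guess what a flag does at 3 a.m. Everything here is hypothetical: the `FeatureFlag` class, the `FLAGS` registry, and the example flag are invented for illustration and do not correspond to any particular feature-flag product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureFlag:
    name: str
    enabled: bool
    description: str  # what the flag controls, written for whoever is on call

    def __post_init__(self) -> None:
        # Refuse to define a flag that a responder would have to reverse-engineer.
        if not self.description.strip():
            raise ValueError(f"feature flag {self.name!r} needs a description")


# Hypothetical flag registry keyed by flag name.
FLAGS = {
    flag.name: flag
    for flag in (
        FeatureFlag(
            name="new_checkout_flow",
            enabled=False,
            description=(
                "Routes checkout through the rewritten payment service; "
                "disable if payment errors spike."
            ),
        ),
    )
}


def describe(flag_name: str) -> str:
    """Return a one-line summary a responder can read during an incident."""
    flag = FLAGS[flag_name]
    state = "ON" if flag.enabled else "OFF"
    return f"{flag.name} [{state}]: {flag.description}"
```

Enforcing the description at definition time, rather than in review, means the check cannot be skipped when a release is rushed.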

Richard Dawkins described memes as being a form of cultural propagation, which is a way for people to transmit social memories and cultural ideas to each other. Not unlike the way that DNA and life will spread from location to location, a meme idea will also travel from mind to mind.

Getting your organization to take a step back and look at how ops affects people (awareness of alert fatigue, burnout risk, proactive/reactive approaches) can be a tough challenge.

In this talk, I will discuss how the very DNA of an organization can evolve through the use of actionable communications from all levels - management, strategy, and practitioners. The “virus” of humane ops will infect your organization, providing a more sustainable approach to on-call, incident resolution, post-mortems, and more. There also will be copious references to the Neal Stephenson classic novel, Snow Crash.

After this talk, you will have ideas of practical approaches to effect change in your organization, regardless of your level of influence. While not every group will use the same “viruses”, you will take away a good understanding of where to get started as Patient Zero.

Resources

The following resources were mentioned during the presentation or are useful additional information.