Tag Archives: CRE

By Will Tipton, Site Reliability Engineer and Alex Bramley, Customer Reliability Engineer

In past posts, we’ve discussed the importance of creating an explicit policy document describing how to escalate SLO violations, and given a real-world example of a document from an SRE team at Google. This final post is an exercise in hypotheticals, to provide some scenarios that exercise the policy and illustrate edge cases. The following scenarios all assume a "three nines" availability SLO for a service that burns half its error budget on background errors, i.e., our error budget is 0.1% errors, and serving 0.05% errors is "normal."

First, let's recap the policy thresholds:

Threshold 1: Automated alerts notify SRE of an at-risk SLO

Threshold 2: SREs conclude they need help to defend SLO and escalate to devs

Threshold 3: The 30-day error budget is exhausted and the root cause has not been found; SRE blocks releases and asks for more support from the dev team

Threshold 4: The 90-day error budget is exhausted and the root cause has not been found; SRE escalates to executive leadership to commandeer more engineering time for reliability work

With that refresher in mind, let’s dig in to these SLO violations.

Scenario 1: A short but severe outage is quickly root-caused to a dependency problem

Scenario: A bad push of a critical dependency causes a 50% outage for an hour as the team responsible for the dependency rolls back the bad release. The error rate returns to previous levels when the rollback completes, and the team identifies the commit that is the root cause and reverts it. The team responsible for the dependency writes a postmortem to which SREs for the service contribute some AIs to prevent recurrence.

Error budget: Assuming three nines, the service burned 70% of a 30-day error budget during this outage alone. It has exceeded the 7-day budget, and, given background errors, the 30-day budget as well. Escalation: SRE was alerted to deal with the impact to production (Threshold 1). If SRE judges the class of issue (bad config or binary push) to be sufficiently rare or adequately mitigated, then the service is "brought back into SLO", and the policy requires no other escalation. This is by design—the policy errs on the side of development velocity, and blocking releases solely because the 30-day error budget is exhausted goes against that.

If this is a recurring issue (e.g., occurring two weeks in a row, burning most of a quarter's error budget) or is otherwise judged likely to recur, then it’s time to escalate to Threshold 2 or 3.

Scenario 2: A short but severe outage has no obvious root cause

Scenario: The service has a 50% outage for an hour, cause unknown.

Error budget: Same as previous scenario: the service has exceeded both its 7-day and 30-day error budgets.

Escalation: SRE is alerted to deal with the impact to production (Threshold 1). SRE escalates quickly to the dev team. They may request new metrics to provide additional visibility of the problem, install fallback alerting, and the SRE and dev oncall prioritize investigating the issue for the next week (Threshold 2).

If the root cause continues to evade understanding, SRE pauses feature releases after the first week until the outage passes out of the 30-day SLO measurement window. More of the SRE and dev teams are pulled from their project work to debug or try to reproduce the outage—this is their number-one priority until the service is back within SLO or they find the root cause. As the investigation continues, SRE and dev teams shift towards mitigation, work-arounds and building defense-in-depth. Ideally, by the time the outage passes over the SLO horizon, SRE is confident that any recurrence will be easier to root-cause and will not have the same impact. In this situation, they can justify resuming release pushes even without having identified a concrete root cause.

Scenario: A regression makes it into prod at time T, and the service begins serving 0.15% errors. The root cause of the regression eludes both SRE and developers for weeks; they attempt multiple fixes but don’t reduce the impact.

Error budget: If left unresolved, this burns 1.5 months' error budget per month.

Escalation: The SRE oncall is notified via ticket at around T+5 days, when the 7-day budget is burned. SRE triggers Threshold 2 and escalates to the dev team at about T+7 days. Because of some correlations with increased CPU utilization, the SRE and dev oncall hypothesize that the root cause of the errors is a performance regression. The dev oncall finds some simple optimizations and submits them; these get into production at T+16 days as part of the normal release process. The fixes don't resolve the problem, but now it is impractical to roll back two weeks of daily code releases for the affected service.

At T+20 days, the service exceeds its 30-day error budget, triggering Threshold 3. SRE stops releases and escalates for more dev engagement. The dev team agrees to assemble a working party of two devs and one SRE, whose priority is to root-cause and remedy the problem. With no good correlation between code changes and the regression, they start doing in-depth performance profiling and removing more bottlenecks. All the fixes are aggregated into a weekly patch on top of the current production release. The efficiency of the service increases noticeably, but it still serves an elevated rate of errors.

At T+60 days, the service expends its 90-day error budget, triggering Threshold 4. SRE escalates to executive leadership to ask for a significant quantity of development effort to address the ongoing lack of reliability. The dev team does a deep dive and finds some architectural problems, eventually reimplementing some features to close out an edge-case interaction between server and client. Error rates revert to their previous state and blocked features roll out in batches once releases begin again.

Scenario 4: Recurring, transient SLI excursions

Error budget: If the errors don’t threaten the long term SLO, it's entirely SRE's responsibility to ensure the alert is properly tuned. So, for the sake of this scenario, assume that the error spikes don’t threaten the SLO over any window longer than a few hours.

Escalation: SLO-based alerting initially notifies SREs, and the escalation path proceeds as in the slow-burn case. In particular, the issue should not be considered "resolved" simply because the SLI returns to normal between releases, since the bar for bringing a service back into SLO is set higher than that. SRE may tune the alert so that a longer or faster burn of error budget triggers a page, and may decide to investigate the issue as part of their ongoing work on the service’s reliability.

Summary

It's important to “war-game” any escalation policy with hypothetical scenarios to make sure there are no unexpected edge cases and to check that the wording is clear. Circulate draft policies widely and address concerns that are raised in appendices, but don't clutter the policy itself with justifications of the chosen inflection points. Expect to have an extended discussion if your peers find your proposals contentious. Remember that the example shared here is just that—a real-life SRE team looking to meet a high availability target would likely structure their escalation policy quite differently.

In an earlier blog post, we discussed the spectrum of engineering effort between reliability and feature development and the importance of describing when and how an organization should dedicate engineering time towards the reliability of a service that is out of SLO. In this post, we show a lightly-edited SLO escalation policy and associated rationales from a Google SRE team to illustrate the trade-offs that particular teams make to maintain a high development velocity.

This SRE team works with large teams of developers focused on different areas of the serving stack, which comprises around ten high-traffic services and a dozen or so smaller ones, all with SRE support. The team has shards in Europe and America, each covering 12 hours of a follow-the-sun on-call rotation. The supported services have both coarse top-level SLOs representing desired user experience and finer-grained SLOs representing the availability requirements of stack components; crucially the SRE team can route pages to dev teams at the granularity of an individual SLO, making "revoking support" for an SLO both cheap and quick. Alerting is configured to page when the service has burned nine hours of error budget within an hour, and file a ticket when it has burned one week of error budget over the previous week.

It's important to note that this policy is just an example, and probably a poor one if your SRE team supports a service with availability targets of 99.99% or higher. The industry that this Google team operates in is highly competitive and moves quickly, making feature iteration speed and time-to-market more important than maintaining high levels of availability.

Escalation policy preamble

Before getting into the specifics of the escalation policy, it's important to consider the following broad points.

The intent of an escalation policy is not to be completely proscriptive; SREs are expected to make judgement calls as to appropriate responses to situations they face. Instead, this document establishes reasonable thresholds for specific actions to take place, with the intent of reducing the likely range of responses and achieving a measure of consistency. It's structured as a series of thresholds that, when crossed, trigger the redirection of more engineering effort towards addressing an SLO violation.

Furthermore, SRE must focus on fixing the class of issue before declaring an incident resolved. This is a higher bar than fixing the issue itself. For example, if a bad flag flip causes a severe outage, reverting the flag flip is insufficient to bring the service back into SLO. SRE must instead ensure that flag flips in general are extremely unlikely to threaten the SLO in the future, with staged rollouts, automated rollbacks on push failures, and versioned configuration to tie flags to binary versions.

For the following four thresholds in the escalation policy, "bringing a service back into SLO" means:

finding the root cause and fixing the relevant class of issue, or

automating remediation such that ongoing manual intervention is no longer necessary, or

simply waiting one week, if the class of issue is extremely unlikely to recur with frequency and severity sufficient to threaten the SLO in the future

In other words, a plan for manual remediation is not sufficient to consider the service back within SLO. Bear in mind that you usually need to understand the root cause of a violation to conclude that it's unlikely to recur or to automate remediation.

Escalation policy thresholds

Threshold 1 - wherein SRE are notified that an SLO is potentially impacted

SRE will maintain alerting so as to be notified of danger to supported SLOs. Upon being notified, SRE will investigate and attempt to find and address the root cause. SRE will consider taking mitigating actions, including redirecting traffic at the load balancers and rolling back binary or configuration pushes. SRE on-call engineers will notify the dev team about the SLO impact and keep them updated as necessary, but no action on their part is required at this point.

Threshold 2 - wherein SRE escalates to the developers

If,

SRE have concluded they cannot bring the service into SLO without help, and

Conditions for the previous threshold are met for at least one week, and

The service has not been brought back into SLO, and

The 30-day error budget is exhausted

Then during the following week,

Only cherry-picked fixes for diagnosed root causes may be pushed to production

SRE may escalate to their leadership and dev management to request that members of the dev team prioritize finding and fixing the root cause over any non-emergency work

Daily updates may be made to an "escalations" mailing list (used to broadcast information about outages to a wide audience, including executive leadership).

When the service is brought back into SLO,

Normal binary releases resume

SRE creates a postmortem

Team members may re-prioritize normal project work

Threshold 4 - wherein SRE may escalate or revoke support

If,

Conditions for the previous threshold are met for at least one week, and

The service has not been brought back into SLO, and

The 90-day error budget is exhausted or the dev team is unwilling to pause feature work to improve reliability

Then,

SRE may escalate to executive leadership to commandeer more people dedicated to fixing the problem

SRE may revoke support for the SLO or the service, and re-direct or disable relevant alerting

On escalation and incident response

SREs are first responders, and there's an expectation that they'll make a reasonable effort to bring the service back within SLO before escalating to developers. As such, threshold 1 applies when the SRE team is notified about a violation, despite the one-week ticket alert indicating the seven-day budget is already exhausted. SRE should wait no longer than one week from the initial violation notification before escalating to developers, but they may exercise their own judgement as to whether escalation is appropriate before this point.

Every time SRE escalates, it’s important to ask developers whether the availability goals still represent the desired balance between reliability and development velocity. This gives them the choice between preserving availability goals by rolling back a new feature and temporarily relaxing them to preserve the availability of that feature for users if the latter is the desired user experience. For repeated violations of the same SLO in a short time window, you probably don't need to ask the question over and over again, though that's a strong signal that further escalation is necessary. It's also OK to insist that developers take back the pager for the service until they're willing to restore the previously-agreed availability targets—if they want to run a less reliable service temporarily so that a business-critical feature remains available while they work on its reliability, they can also shoulder the burden of its failures.

On blocking releases

Blocking releases is an appropriate course of action for three main reasons:

Commonly, the largest source of burnt error budget at steady state is the release push. If you’ve already burned all your budget, not pushing new releases lowers the steady-state burn rate, bringing the service back into SLO more quickly

It eliminates the risk of further unexpected SLO violations due to bugs in new code. This is also why any fixes for diagnosed root causes must be patched into the current release, rather than rolling forward to a new release

While blocking releases is not intended as a punitive measure, it does directly impact release velocity, which the dev org cares about deeply. As such, tying SLO violations to reduced velocity aligns the incentives of both organizations. SRE wants the service to stay within SLO, the dev org wants to build new features quickly. This way, either both happen or neither do.

SRE should prefer to unblock feature releases sooner rather than later, once the root cause(s) of a violation has been found and fixed. Giving our dev teams the benefit of the doubt that there will be no further service degradation before the SLO is in compliance over a 30-day window strikes a more acceptable balance between reliability and velocity. This is effectively "borrowing" future error budget to unblock the release before the service is compliant, with the expectation that it will be within a reasonable timeframe. Absent any push-related outages, new features should increase user happiness with the service, repaying some of the unhappiness caused by the SLO violation.

SRE may choose not to unblock releases if pre-violation error-budget burn rates were close to the SLO threshold. In this case, there's less future budget to borrow, thus the risk of further violations is higher and the time until the service is SLO compliant will be significantly longer if releases are allowed to continue.

Summary

We hope that the above example gives you some ideas about how to make trade-offs between reliability and development velocity for a service where the latter is a key business priority. The main concessions to velocity are that SRE doesn’t immediately block releases when an SLO is violated, and provides a mechanism for them to resume before the SLO has returned to compliance with the informed consent of SRE. In the final post of the series, we'll take these policy thresholds out for a spin with some hypothetical scenarios.

Previous episodes of CRE life lessons have talked in detail about the importance of quantifying a service's availability and using SLOs to manage the competing priorities of features-focused development teams ("devs") versus a reliability-focused SRE team. Good SLOs can help reduce organizational friction and maintain development velocity without sacrificing reliability. But what should happen when SLOs are violated?

In this blogpost, we discuss why you should create a policy on how SREs and devs respond to SLO violations, and provide some ideas for the structure and components of that policy. Future posts will go over an example taken from an SRE team here at Google, and work through some scenarios that put that policy into action.

Features or reliability?

In the ideal world (assuming spherical SREs in a vacuum), an SLO represents the dividing line between two binary states: developing new features when there's error budget to spare, and improving service reliability when there isn't. Most real engineering organizations will instead vary their effort on a spectrum between these two extremes as business priorities dictate. Even when a service is operating well within its SLOs, choosing to do some proactive reliability work may reduce the risk of future outages, improve efficiency and provide cost savings; conversely it's rare to find an organization that completely drops all in-flight feature development as soon as an SLO is violated.

Describing key inflection points from that spectrum in a policy document is an important part of the relationship between an SRE team and the dev teams with whom they partner. This ensures that all parts of the organization have roughly the same understanding around what is expected of them when responding to (soon to be) violated SLOs, and – most importantly – that the consequences of not responding are clearly communicated to all parties. The exact choice of inflection points and consequences will be specific to the organization and its business priorities.

Inflection points

Having a strong culture of blameless postmortems and fixing root causes should eventually mean that most SLO violations are unique – informally, “we are in the business of novel outages.” It follows that the response to each violation will also be unique; making judgement calls around these is a part of an SREs job when responding to the violation. But a large variance in the range of possible responses results in inconsistency of outcomes, people trying to game the system and uncertainty for the engineering organization.

For the purposes of an escalation policy, we recommend that SLO violations be grouped into a few buckets of increasing severity based on the cumulative impact of the violation over time (i.e., how much error budget has been burned over what time horizon), with clearly defined boundaries for moving from one bucket to another. It's useful to have some business justification for why violations are grouped as they are, but this should be in an appendix to the main policy to keep the policy itself clear.

It's a good idea to tie at least some of the bucket boundaries to any SLO-based alerting you have. For example, you may choose to page SREs to investigate when 10% of the weekly error budget has been burned in the past hour; this is an example of an inflection point tied to a consequence. It forms the boundary between buckets we might informally title "not enough error budget burned to notify anyone immediately" and "someone needs to investigate this right now before the service is out of its long-term SLO." We'll examine more concrete examples in our next post, where we look at a policy from an SRE team within Google.

Consequences

The consequences of a violation are the meat of the policy. They describe actions that will be taken to bring the service back into SLO, whether this is by root causing and fixing the relevant class of issue, automating any stop-gap mitigation tasks or by reducing the near-term risk of further deterioration. Again, the choice of consequence for a given threshold is going to be specific to the organization defining the policy, but there are several broad areas into which these fall. This list is not exhaustive!

Notify someone of potential or actual SLO violation

The most common consequence of any potential or actual SLO violation is that your monitoring systems tells a human that they need to investigate and take remedial action. For a mature, SRE-supported service, this will normally be in the form of a page to the oncall when a large quantity of error budget has been burned over a short window, or a ticket when there’s an elevated burn rate over a longer time horizon. It's not a bad idea for that page to also create a ticket in which you can record debugging details, use as a centralized communication point and reference when escalating a serious violation.

The relevant dev team should also be notified. It's OK for this to be a manual process; the SRE team can add value by filtering and aggregating violations and providing meaningful context. But ideally a small group of senior people in the dev team should be made aware of actual violations in an automated fashion (e.g., by CCing them on any tickets), so that they're not surprised by escalations and can chime in if they have pertinent information.

Escalate the violation to the relevant dev team

The key difference between notification and escalation is the expectation of action on the part of the dev team. Many serious SLO violations require close cooperation between SREs and developers to find the root cause and prevent recurrence. Escalation is not an admission of defeat. SREs should escalate as soon as they're reasonably sure that input from the dev team will meaningfully reduce the time to resolution. The policy should set an upper bound on the length of time an SLO violation (or near miss) can persist without escalation.

Escalation does not signify the end of SRE’s involvement with an SLO violation. The policy should describe the responsibilities of each team and a lower bound on the amount of engineering time they should divert towards investigating the violation and fixing the root cause. It will probably be useful to describe multiple levels of escalation, up to and including getting executive-level support to commandeer the engineering time of the entire dev team until the service is reliable.

Mitigate risk of service changes causing further impact to SLOs

Since a service in violation of its SLO is by definition making users unhappy, day-to-day operations that may increase the rate at which error budget is burned should be slowed or stopped completely. Usually, this means restricting the rate of binary releases and experiments, or stopping them completely until the service is again within SLO. This is where the policy needs to ensure all parties (SRE, development, QA/testing, product and execs) are on the same page. For some engineering organizations, the idea that SLO violations will impact their development and release velocity may be difficult to accept. Reaching a documented agreement on how and when releases will be blocked – and what fraction of engineers will be dedicated to reliability work when this occurs – is a key goal.

Revoke support for the service

If a service is shown to be incapable of meeting its agreed-upon SLOs over an extended time period, and the dev team responsible for that service is unwilling to commit to engineering improvements to its reliability, then SRE teams at Google have the option of handing back the responsibility for running that service in production. This is unlikely to be the consequence of a single SLO violation, rather the combination of multiple serious outages over an extended period of time, where postmortem AIs have been assigned to the dev team but not prioritized or completed.

This has worked well at Google, because it changes the incentives behind any conversation around engineering for reliability. Any dev team that neglects the reliability of a service knows that they will bear the consequences of that neglect. By definition, revoking SRE support for a service is a last resort, but stating the conditions that must be met for it to happen makes it a matter of policy, not an idle threat. Why should SRE care about service reliability if the dev team doesn't?

Summary

Hopefully this post has helped you think about the trade-off between engineering for reliability and features, and how responding to SLO violations moves the needle towards reliability. In our next post, we'll present an escalation policy from one of Google's SRE teams, to show the choices they made to help the dev teams they partner with maintain a high development velocity.

The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:

You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions.

How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.

Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.

You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.
So, without further ado, we present to you the most-read stories of 2017.

Open, hybrid development

When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

Google innovation

In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful, holiday season. And be sure to rest up and visit us again Next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

The end of the year is a time for reflection . . . and making lists. As 2017 comes to a close, we thought we’d review some of the most memorable Google Cloud Platform (GCP) product announcements, white papers and how-tos, as judged by popularity with our readership.

As we pulled the data for this post, some definite themes emerged about your interests when it comes to GCP:

You love to hear about advanced infrastructure: CPUs, GPUs, TPUs, better network plumbing and more regions.

How we harden our infrastructure is endlessly interesting to you, as are tips about how to use our security services.

Open source is always a crowd-pleaser, particularly if it presents a cloud-native solution to an age-old problem.

You’re inspired by Google innovation — unique technologies that we developed to address internal, Google-scale problems.
So, without further ado, we present to you the most-read stories of 2017.

Open, hybrid development

When you think about GCP and open source, Kubernetes springs to mind. We open-sourced the container management platform back in 2014, but this year we showed that GCP is an optimal place to run it. It’s consistently among the first cloud services to run the latest version (most recently, Kubernetes 1.8) and comes with advanced management features out of the box. And as of this fall, it’s certified as a conformant Kubernetes distribution, complete with a new name: Google Kubernetes Engine.

Part of Kubernetes’ draw is as a platform-agnostic stepping stone to the cloud. Accordingly, many of you flocked to stories about Kubernetes and containers in hybrid scenarios. Think Pivotal Container Service and Kubernetes’ role in our new partnership with Cisco. The developers among you were smitten with Cloud Container Builder, a stand-alone tool for building container images, regardless of where you deploy them.

Google innovation

In distributed database circles, Google’s Spanner is legendary, so many of you were delighted when we announced Cloud Spanner and a discussion of how it defies the CAP Theorem. Having a scalable database that offers strong consistency and great performance seemed to really change your conception of what’s possible — as did Cloud IoT Core, our platform for connecting and managing “things” at scale. CREs, meanwhile, showed you the Google way to handle an incident.

Lastly, we want to thank all our customers, partners and readers for your continued loyalty and support this year, and wish you a peaceful, joyful, holiday season. And be sure to rest up and visit us again Next year. Because if you thought we had a lot to say in 2017, well, hold onto your hats.

In our previous post we discussed the benefits of sharing internal postmortems outside your company. You may adopt a one:many approach with an incident summary that tells all your customers what happened and how you'll prevent it from happening again. Or, if the incident impacted a major customer, you may share something close to your original postmortem with them.

In this post, we consider how to review a postmortem with your affected customer(s) for better actionable data and also to help customers improve their systems and practices. We also present a worked example of a shared postmortem based on the SRE Book postmortem template.

Postmortems should fix your customer too

How to get outages to benefit everyone

Even if the fault was 100% on you, the platform side, an external postmortem can still help customers improve their reliability. Now that we know what happens when a particular failure occurs, how can we generalize this to help the customer mitigate the impact, and reduce MTTD and MTTR for a similar incident in the future?

One of the best sources of data for any postmortem is your customers’ SLOs, with their ability to measure the impact of a platform outage. Our CRE team talks about SLOs quite a lot in the CRE Life Lessons series, and there’s a reason why: SLOs and error budgets inform more than just whether to release features in your software.

For customers with defined SLOs who suffered a significant error budget impact, we recommend conducting a postmortem review with them. The review is partly to ensure that the customer’s concerns were addressed, but also to identify “what went wrong,” “where we got lucky” and how to identify actions which would address these for the customer.

For example, the platform’s storage service suffered increased latency for a certain class of objects in a region. This is not the customer’s fault, but they may still be able to do something about it.

The internal postmortem might read something like:

What went well

The shared monitoring implemented with CustomerName showed a clear single-region latency hit which resulted in a quick escalation to storage oncall.

What went wrong

A new release of the storage frontend suffered from a performance regression for uncached reads that was not detected during testing or rollout.

Where we got lucky

Only reads of objects between 4KB and 32KB in size were materially affected.

Action items

Add explicit read/write latency testing in testing for both cached and uncached objects in buckets of 1KB, 4KB, 32KB, …

Have paging alerts for latency over SLO limits, aggregated by Cloud region, for both cached and uncached objects, in buckets of 1KB, 4KB, 32KB, ...

When a customer writes their own postmortem about this incident, using the shared postmortem to understand better what broke in the platform and when, that postmortem might look like:

What went well

We had anticipated a generic single-region platform failure and had the capability to fail over out of an affected region.

What went wrong

Although the latency increase was detected quickly, we didn’t have accessible thru-stack monitoring that could show us that it was coming from platform storage-service rather than our own backends.

Our decision to fail out of the affected region took nearly 30 minutes to complete because we had not practiced it for one year and our playbook instructions were out of date.

Where we got lucky

This happened during business hours so our development team was on hand to help diagnose the cause.

Action items

Add explicit dashboard monitoring for aggregate read and write latency to and from platform storage-service.

Run periodic (at least once per quarter) test failovers out of a region to validate that the failover instructions still work and increase ops team confidence with the process.

Prioritize and track your action items

A postmortem isn’t complete until the root causes have been fixed

Sharing the current status of your postmortem action items is tricky. It's unlikely that the customer will be using the same issue tracking system as you are, so neither side will have a “live” view of which action items from a postmortem have been resolved, and which are still open. Within Google we have automation which tracks this and “reminds” us of unclosed critical actions from postmortems, but customers can’t see those unless we surface them in the externally-visible part of our issue tracking system, which is not our normal practice.

Currently, we hold a monthly SLO review with each customer, where we list the major incidents and postmortem/incident report for each incident; we use that occasion to report on open critical bug statuses from previous months’ incidents, and check to see how the customer is doing on their actions.

Other benefits

Opening up is an opportunity

There are practical reliability benefits of sharing postmortems, but there are other benefits too. Customers who are evolving towards an SRE culture and adopting blameless postmortems can use the external postmortem as a model for their own internal write-ups. We’re the first to admit that it’s really hard to write your own first postmortem from scratch—having a collection of “known-good” postmortems as a reference can be very helpful.

At a higher level, shared postmortems give your customer a “glimpse behind the curtain.” When a customer moves from on-premises hardware to the cloud, it can be frightening; they're giving up a lot of control of and visibility into the platform on which their service runs. The cloud is expected to encapsulate the operational details of the services it offers, but unfortunately it can be guilty of hiding information that the customer really wants to see. A detailed external postmortem makes that information visible, giving the customer a timeline and deeper detail, which hopefully they can relate to.

Joint postmortems

If you want joint operations, you need joint postmortems

The final step in the path to shared postmortems is creating a joint postmortem. Until this point, we’ve discussed how to externalize an existing document, where the action items, for example, are written by you and assigned to you. With some customers, however, it makes sense to do a joint postmortem where you both contribute to all sections of the document. It will not only reflect your thoughts from the event, but it will also capture the customer’s thoughts and reactions, too. It will even include action items that you assign to your customer, and vice-versa!

Of course, you can’t do joint postmortems with large numbers of your customers, but doing so with at least a few of them helps you (a) build shared SRE culture, and (b) keep the customer perspective in your debugging, design and planning work.

Joint postmortems are also one of the most effective tools you have to persuade your product teams to re-prioritize items on their roadmap, because they present a clear end-user story of how those items can prevent or mitigate future outages.

Summary

Sharing your postmortems with your customers is not an easy thing to do; however, we have found that it helps:

Gain a better understanding of the impact and consequences of your outages

Increase the reliability of your customers’ service

Give customers confidence in continuing to run on your platform even after an outage.

To get you started, here's an example of an external postmortem for the aforementioned storage frontend outage, using the SRE Book postmortem template. (Note: Text relating to the customer (“JaneCorp”) is marked in purple for clarity.) We hope it sets you on the path to learning and growing from your outages. Happy shared postmortem writing!

We here on Google’s Site Reliability Engineering (SRE) teams have found that writing a blameless postmortem — a recap and analysis of a service outage — makes systems more reliable, and helps service owners learn from the event.

Postmortems are easy to do within your company — but what about sharing them outside your organization? Why indeed, would you do this in the first place? It turns out that if you're a service or platform provider, sharing postmortems with your customers can be good for you and them too.

In this installment of CRE Life Lessons, we discuss the benefits and complications that external postmortems can bring, and some practical lessons about how to craft them.

Well-known external postmortems

There is prior art, and you should read it.

Over the years, we’ve had our share of outages and recently, we’ve been sharing more detail about them than we used to. For example, on April 11 2016 Google Compute Engine dropped inbound traffic, resulting in this public incident report.

Other companies are also publishing detailed postmortems about their own outages. Who can forget the time when:

We in Google SRE love reading these postmortems — and not because of schadenfreude. Indeed, many of us read them and think “there but for the grace of (Deity) go we” and wonder whether we would withstand a similar failure. Indeed, when you’re thinking this, it’s a good time to run a DiRT exercise.

For platform providers that offer a wide range of services to a wide range of users, fully public postmortems such as these make sense (even though they're a lot of work to prepare and open you up to criticism from competitors and press). But even if the impact of your outage isn’t as broad, if you are practising SRE, it can still make sense to share postmortems with customers that have been directly impacted. Caring about your customers’ reliability means sharing the details of your outages.

This is the position we take on the Google Cloud Platform (GCP) Customer Reliability Engineering. To help customers run reliably on GCP, we teach them how to engineer increased reliability for their service by implementing SRE best practices in our work together. We identify and quantify architectural and operational risks to each customer’s service, and work with them to mitigate those risks and drive to sustain system reliability at their SLO (Service Level Objectives) target.

Specifically, the CRE team works with each customer to help them meet the availability target expressed by their SLOs. For this, the principal steps are to:

Define a comprehensive set of business-relevant SLOs

Get the customer to measure compliance to those SLOs in their monitoring platform (how much of the service error budget has been consumed)

Share that live SLO information with Google support and product SRE teams (which we term shared monitoring)

Jointly monitor and react to SLO breaches with the customer (shared operational fate)

If you run a platform — or some approximation thereof — then you too should practice SRE with your customers to get that increased reliability, prevent your customers from tripping over your changes, and gain better insights into the impact and scope of your failures.

Then, when an incident occurs that causes the service to exceed its error budget — or consumes an unacceptably high proportion of the error budget — the service owner needs to determine:

How much of the error budget did this consume in total?

Why did the incident happen?

What can / should be done to stop it from happening again?

Answering Question #1 is easy, but the mechanism for evaluating Questions #2 and #3 is a postmortem. If the incident root cause was purely on the customer’s side, that’s easy — but what if the trigger was an event on your platform side? This is when you should consider an external postmortem.

Foundations of an external postmortem

Analyzing outages — and subsequently writing about them in a postmortem — benefits from having a two-way flow of monitoring data between the platform operator and the service owner, which provides an objective measure of the external impact of the incident: When did it start, how long did it last, how severe was it, and what was the total impact on the customer’s error budget? Here on the GCP CRE team, we have found this particularly useful, since it's hard to estimate the impact of problems in lower-level cloud services on end users. We may have observed a 1% error rate and increased latency internally, but was it noticeable externally after traveling through many layers of the stack?

Based on the monitoring data from the service owner and their own monitoring, the platform team can write their postmortem following the standard practices and our postmortem template. This results in an internally reviewed document that has the canonical view of the incident timeline, the scope and magnitude of impact, and a set of prioritized actions to reduce the probability of occurrence of the situation (increased Mean Time Between Failures), reduce the expected impact, improve detection (reduced Mean Time To Detect) and/or recover from the incident more quickly (reduced Mean Time To Recover).

With a shared postmortem, though, this is not the end: we want to expose some —though likely not all — of the postmortem information to the affected customer.

Selecting an audience for your external postmortem

If your customers have defined SLOs, they (and you) know how badly this affected them. Generally, the greater the error budget that has been consumed by the incident, the more interested they are in the details, and the more important it will be to share with them. They're also more likely to be able to give relevant feedback to the postmortem about the scope, timing and impact of the incident, which might not have been apparent immediately after the event.

If your customer’s SLOs weren’t violated but this problem still affected their customers, that’s an action item for the customer’s own postmortem: what changes need to be made to either the SLO or its measurements? For example, was the availability measurement further down in the stack compared to where the actual problem occurred?

If your customer doesn’t have SLOs that represent the end-user experience, it’s difficult to make an objective call about this. Unless there are obvious reasons why the incident disproportionately affected a particular customer, you should probably default to a more generic incident report.

Another factor you should consider is whether the customers with whom you want to share the information are under NDA; if not, this will inevitably severely limit what you're able to share.

If the outage has impacted most of your customers, then you should consider whether the externalized postmortem might be the basis for writing a public postmortem or incident report, like the examples we quoted above. Of course, these are more labor-intensive than external postmortems shared with select customers (i.e., editing the internal postmortem and obtaining internal approvals), but provide additional benefits.

The greatest gain from a fully public postmortem can be to restore trust from your user base. From the point of view of a single user of your platform, it’s easy to feel that their particular problems don’t matter to you. A public postmortem gives them visibility into what happened to their service, why, and how you're trying to prevent it from happening again. It’s also an opportunity for them to conduct their own mini-postmortem based on the information in the public post, asking themselves “If this happened again, how would I detect it and how could I mitigate the effects on my service?”

Deciding how much to share, and why?

Another question when writing external postmortems is how deep to get into the weeds of the outage. At one end of the spectrum you might share your entire internal postmortem with a minimum of redaction; at the other you might write a short incident summary. This is a tricky issue that we’ve debated internally.

The two factors we believe to be most important in determining whether to expose the full detail of a postmortem to a customer, rather than just a summary, are:

How important are the details to understanding how to defend against a future re-occurrence of the event?

How badly did the event damage their service, i.e., how much error budget did it consume?

As an example, if the customer can see the detailed timeline of the event from the internal postmortem, they may be able to correlate it with signals from their own monitoring and reduce their time-to-detection for future events. Conversely, if the outage only consumed 8% of their 30-day error budget then all the customer wants to know is whether the event is likely to happen more often than once a month.

We have found that, with a combination of automation and practice, we can produce a shareable version of an internal postmortem with about 10% additional work, plus internal review. The downside is that you have to wait for the postmortem to be complete or nearly complete before you start. By contrast, you can write an incident report with a similar amount of effort as soon as the postmortem author is reasonably confident in the root cause.

What to say in a postmortem

By the time the postmortem is published, the the incident has been resolved, and the customer really cares about three questions:

Why did this happen?

Could it have been worse?

How can we make sure it won’t happen again?

“Why did this happen?” comes from the “Root causes and Trigger” and “What went wrong” sections of our postmortem template. “Could it have been worse?” comes from “Where we got lucky.”These are two sections which you should do your best to retain as-is in an external postmortem, though you may need to do some rewording for clarity.

“How can we make sure it won’t happen again” will come from the Action items table of the postmortem.

What not to say

With that said, postmortems should never include these three things:

Names of humans - Rather than “John Smith accidentally kicked over a server”, say “a network engineer accidentally kicked over a server,” Internally, we try to express the role of humans in terms of role rather than name. This helps us keep a blameless postmortem culture.

Names of internal systems - The names of your internal systems are not clarifying for your users and creates a burden on them to discover how these things fit together. For example, even though we’ve discussed Chubby externally, we still refer to it in postmortems we make external as “our globally distributed lock system.”

Customer-specific information - The internal version of your postmortem will likely say things like “on XX:XX, Acme Corp filed a Support ticket alerting us to a problem.” It’s not your place to share this kind of detail externally as it may create an undue burden for the reporting company (in this case Acme Corp.). Rather, simply say “on XX:XX, a customer filed…”. If you’re going to reference more than one customer, then just label them Customer A, Customer B, etc..

Other things to watch out for

Another source of difficulty when rewriting a shared postmortem is the “Background” section that sets the scene for the incident. An internal postmortem assumes the reader has basic knowledge of the technical and operational background; this is unlikely to be true for your customer. We try to write the least detailed explanation that still allows the reader to understand why the incident happened; too much detail here is more likely to be off-putting than helpful.

Google SREs are fans of embedding monitoring graphs in postmortems; monitoring data is objective and doesn’t generally lie to you (although our colleague Sebastian Kirsch has some very useful guidance as to when this is not true). When you share a postmortem outside the company, however, be careful what information these graphs reveal about traffic levels and number of users of a service. Our rule of thumb is to leave the X axis (time) alone, but for the Y axis either remove the labels and quantities all together, or only show percentages. This is equally true for incorporating customer-generated data in an internal postmortem.

A side note on the role of luck

With apologies to Tina Turner, What’s luck got to do, got to do with it? What’s luck but a source of future failures?

As well as “What went well” and “What went badly” our internal postmortem template includes the section “Where we got lucky.” This is a useful place to tease out risks of future failures that were revealed by an incident. In many cases an incident had less impact than it might have, because of relatively random factors such as timing, presence of a particular person as the on-call, or co-incidence with another outage that resulted in more active scrutiny of the production systems than normal.

“Where we got lucky” is an opportunity to identify additional action items for the postmortem, e.g.,

“the right person was on-call” implies tribal knowledge that needs to be fed into a playbook and exercised in a DiRT test

“this other thing (e.g., a batch process or user action) wasn’t happening at the same time” implies that your system may not have sufficient surplus capacity to handle a peak load, and you should consider adding resources

“the incident happened during business hours” implies a need for automated alerting and 24-hour pager coverage by an on-call

“we were already watching monitoring” implies a need to tune alerting rules to pick up the leading edge of a similar incident if it isn’t being actively inspected.

Sometimes teams also add “Where we got unlucky,” when the incident impact was aggravated by a set of circumstances that are unlikely to re-occur. Some examples of unlucky behavior are:

an outage occurred on your busiest day of the year

you had a fix for the problem that hadn't been rolled out for other reasons

a weather event caused a power loss.

A major risk in having a “Where we got unlucky” category is that it’s used to label problems that aren’t actually due to blind misfortune. Consider this example from an internal postmortem:

Where we got unlucky

There were various production inconsistencies caused by past outages and experiments. These weren’t cleaned up properly and made it difficult to reason about the state of production.

This should instead be in “What went badly,” because there are clear action items that could remediate this for the future.

When you have these unlucky situations, you should always document them as part of "What went badly," while assessing the likelihood of them happening again and determining what actions you should take. You may choose not to mitigate every risk since you don’t have infinite engineering time, but you should always enumerate and quantify all the risks you can see so that “future you” can revisit your decision as circumstances change.

Summary

Hopefully we've provided a clear motivation for platform and service providers to share their internal postmortems outside the company, at some appropriate level of detail. In the next installment, we'll discuss how to get the greatest benefit out of these postmortems.

By Robert van Gent and Stephen Thorne, Customer Reliability Engineers and Cody Smith, Site Reliability Engineer

In a previous episode of CRE Life Lessons, we discussed how choosing good service level indicators (SLIs) and service level objectives (SLOs) is critical for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we’re going to get meta and go into more detail about some best practices we use at Google to formulate good SLOs for our SLIs.

SLO musings

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Your business needs to be able to defend an endangered SLO by reducing the frequency of outages, or reducing the impact of outages when they occur. Some ways to do this might include: slowing down the rate at which you release new versions, or by implementing reliability improvements instead of features. All parts of your business need to acknowledge that these SLOs are valuable and should be defended through trade-offs.

Here are some important things to keep in mind when designing your SLOs:

An SLO can be a useful tool for resolving meaningful uncertainty about what a team should be doing. The objective is a line in the sand between "we definitely need to work on this issue" and "we might not need to work on this issue." Therefore, don’t pick SLO targets that are higher than what you actually need, even if you happen to be meeting them now, as that reduces your flexibility to change things in the future, including trade offs against reliability, like development velocity.

Group queries into SLOs by user experience, rather than by specific product elements or internal implementation details. For example, direct responses to user action should be grouped into a different SLO than background or ancillary responses (e.g., thumbnails). Similarly, “read” operations (e.g., view product) should be grouped into a different SLO than lower volume but more important “write” ones (e.g., check out). Each SLO will likely have different availability and latency targets.

Be explicit about the scope of your SLOs and what they cover (which queries, which data objects) and under what conditions they are offered. Be sure to consider questions like whether or not to count invalid user requests as errors, or happens when a single client spams you with lots of requests.

Finally, though somewhat in tension with the above, keep your SLOs simple and specific. It’s better not to cover non-critical operations with an SLO than to dilute what you really care about. Gain experience with a small set of SLOs; launch and iterate!

Example SLOs

Availability

Here we're trying to answer the question "Was the service available to our user?" Our approach is to count the failures and known missed requests, and report the measurement as a percentage. Record errors from the first point that is in your control (e.g., data from your Load Balancer, not from the browser’s HTTP requests). For requests between microservices, record data from the client side, not the server side.

That leaves us with an SLO of the form:

Availability: <service> will <respond successfully> for <a customer scope> for at least <percentage> of requests in the <SLO period>

For example . . .

Availability:Node.js will respond with a non-503 within 30 seconds for browser pageviews for at least 99.95% of requests in the month.

. . . and . . .

Availability: Node.js will respond with a non-503 within 60 seconds for mobile API calls for at least 99.9% of requests in the month.

For requests that took longer than 30 seconds (60 second for mobile), the service might as well have been down, so they count against our availability SLO.

Latency

Latency is a measure of how well a service performed for our users. We count the number of queries that are slower than a threshold, and report them as a percentage of total queries. The best measurements are done as close to the client as possible, so measure latency at the Load Balancer for incoming web requests, and from the client not the server for requests between microservices.

Latency: <service> will respond within <time limit> for at least <percentage> of requests in the <SLO period>.

For example . . .

Latency: Node.js will respond within 250ms for at least 50% of requests in the month, and within 3000ms for at least 99% of requests in the month.

Percentages are your friend . . .

Note that we expressed our latency SLI as a percentage: “percentage of requests with latency < 3000ms” with target of 99%, not “99th percentile latency in ms” with target “< 3000ms”. This keeps SLOs consistent and easy to understand, because they all have the same unit and the same range. Also, accurately computing percentiles across large data sets is hard, while counting the number of requests below a threshold is easy. You’ll likely want to monitor multiple thresholds (e.g., percentage of requests < 50ms, < 250ms, . . .), but having SLO targets of 99% for one threshold, and possibly 50% for another, is generally sufficient.

Avoid targeting average (mean) latency — it's almost never what you want. Averages can hide outliers, and sufficiently small values are indistinguishable from zero; users will not notice a difference between 50 ms and 250 ms for a full page response time, and thus they should be comparably good. There’s a big difference between an average of 250ms because all requests are taking 250ms, and an average of 250ms because 95% of requests are taking 1ms and 5% of requests are taking 5s.

. . . except 100%

A target of 100% is impossible over any meaningful length of time. It’s also likely not necessary. SREs use SLOs to embrace risk; the inverse of your SLO target is your error budget, and if your SLO target is 100% that means you have no error budget! In addition, SLOs are a tool for establishing team priorities — dividing top-priority work from work that's prioritized on a case-by-case basis. SLOs tend to lose their credibility if every individual failure is treated as a top priority.

Regardless of the SLO target that you eventually choose, the discussion is likely to be very interesting; be sure to capture the rationale for your chosen target for posterity.

Reporting

Report on your SLOs quarterly, and use quarterly aggregates to guide policies, particularly pager thresholds. Using shorter periods tends to shift focus to smaller, day-to-day issues, and away from the larger, infrequent issues that are more damaging. Any live reporting should use the same sliding window as the quarterly report, to avoid confusion; the published quarterly report is merely a snapshot of the live report.

Example quarterly SLO summary

This is how you might present the historical performance of your service against SLO, e.g., for a semi-annual service report, where the SLO period is one quarter:

SLO

Target

Q2

Q3

Web Availability

99.95%

99.92%

99.96%

Mobile Availability

99.9%

99.91%

99.97%

Latency ≤ 250ms

50%

74%

70%

Latency ≤ 3000ms

99%

99.4%

98.9%

For SLO-dependent policies such as paging alerts or freezing of releases when you’ve spent the error budget, use a sliding window shorter than a quarter. For example, you might trigger a page if you spent ≥1% of the quarterly error budget over the last four hours, or you might freeze releases if you spent ≥ ⅓ of the quarterly budget in the last 30 days.

Breakdowns of SLI performance (by region, by zone, by customer, by specific RPC, etc.) are useful for debugging and possibly for alerting, but aren’t usually necessary in the SLO definition or quarterly summary.

Finally, be mindful about with whom you share your SLOs, especially early on. They can be a very useful tool for communicating expectations about your service, but the more broadly they are exposed the harder it is to change them.

Conclusion

SLOs are a deep topic, but we’re often asked about handy rules of thumb people can use to start reasoning about them. The SRE book has more on the topic, but if you start with these basic guidelines, you’ll be well on your way to avoiding the most common mistakes people make when starting with SLOs. Thanks for reading, we hope this post has been helpful. And as we say here at Google, may the queries flow, your SLOs be met and the pager stay silent!

In the first part of this series, we introduced you to the concept of dark launches. In a dark launch, you take a copy of your incoming traffic and send it to the new service, then throw away the result. Dark launches are useful when you want to launch a new version of an existing service, but don’t want nasty surprises when you turn it on.

This isn’t always straightforward as it sounds, however. In this blog post, we’ll look at some of the circumstances that can make things difficult for you, and teach you how to work around them.

Finding a traffic source

Do you actually have existing traffic for your service? If you’re launching a new web service which is not more-or-less-directly replacing an existing service, you may not.

As an example, say you’re an online catalog company that lets users browse items from your physical store’s inventory. The system is working well, but now you want to give users the ability to purchase one of those items. How would you do a dark launch of this feature? How can you approximate real usage when no user is even seeing the option to purchase an item?

One approach is to fire off a dark-launch query to your new component for every user query to the original component. In our example, we might send a background “purchase” request for an item whenever the user sends a “view” request for that item. Realistically, not every user who views an item will go on to purchase it, so we might randomize the dark launch by only sending a “purchase” request for one in every five views.

This will hopefully give you an approximation of live traffic in terms of volume and pattern. Note that this can’t be expected to be totally accurate when it comes to to live traffic when the service is launched. But, it’s better than nothing.

Dark launching mutating services

Generally, a read-only service is fairly easy to dark-launch. A service with queries that mutate backend storage is far less easy. There are still strong reasons for doing the dark launch in this situation, because it gives you some degree of testing that you can’t reasonably get elsewhere, but you’ll need to invest significant effort to get the most from dark-launching.

Unless you’re doing a storage migration, you’ll need to make significant effort/payoff tradeoffs doing dark launches for mutating queries. The easiest option is to disable the mutates for the dark-launch traffic, returning a dummy response after the mutate is prepared but before it’s sent. This is safe, but it does mean that you’re not getting a full measurement of the dark launched service — what if it has a bug that causes 10% of the mutate requests to be incorrectly specified?

Alternatively, you might choose to send the mutation to a temporary duplicate of your existing storage. This is much better for the fidelity of your test, but great care will be needed to avoid sending real users the response from your temporary duplicate. It would also be very unfortunate for everyone if, at the end of your dark launch, you end up making the new service live when it’s still sending mutations to the temporary duplicate storage.

Storage migration

If you’re doing a storage migration — moving an existing system’s stored data from one storage system to another (for instance, MySQL to MongoDB because you’ve decided that you don’t really need SQL after all) — you’ll find that dark launches will be crucial in this migration, but you’ll have to be particularly careful about how you handle mutation-inducing queries. Eventually you’ll need mutations to take effect in both your old and new storage systems, and then you’ll need to make the new storage system the canonical storage for all user queries.

A good principle is that, during this migration, you should always make sure that you can revert to the old storage system if something goes wrong with the new one. You should know which of your systems (old and new) is the master for a given set of queries, and hence holds the canonical state. The mastership generally needs to be easily mutable and able to revert responsibility to the original storage system without losing data.

The universal requirement for a storage migration is a detailed written plan reviewed by not just your system stakeholders but also by your technical experts from the involved systems. Inevitably, your plan will miss things and will have to adapt as you move through the migration. Moving between storage systems can be an awfully big adventure — expect us to address this in a future blog post.

Duplicate traffic costs

The great thing about a well-implemented dark launch is that it exercises the full service in processing a query, for both the original and new service. The problem this brings is that each query costs twice as much to process. That means you should do the following:

Make sure your backends are appropriately provisioned for 2x the current traffic. If you have quota in other teams’ backends, make sure it’s temporarily increased to cover the dark launch as well.

If you’re connection-sensitive, ensure that your frontends have sufficient slack to accommodate a 2x connection count.

You should already be monitoring latency from your existing frontends, but keep a close eye on this monitoring stat and consider tightening your existing alerting thresholds. As service latency increases, service memory likely also increases, so you’ll want to be alert for either of these stats breaching established limits.

In some cases, the service traffic is so large that a 100% dark launch is not practical. In these instances, we suggest that you determine the largest percentage launch that is practical and plan accordingly, aiming to get the most representative selection of traffic in the dark launch. Within Google, we tend to launch a new service to Googlers first before making the service public. However, experience has taught us that Googlers are often not representative of the rest of the world in how they use a service.

An important consideration if your service makes substantial use of caching is that a sub-50% dark launch is unlikely to see material benefits from caching and hence will probably significantly overstate estimated load at 100%.

You may also choose to test-load your new service at over 100% of current traffic by duplicating some traffic — say, firing off two queries to the new service for every original query. This is fine, but you should scale your quota increases accordingly. If your service is cache-sensitive, then this approach will probably not be useful as your cache hit rate will be artificially high.

Because of the load impact of duplicate traffic, you should carefully consider how to use load shedding in this experiment. In particular, all dark launch traffic should be marked “sheddable” and hence be the first requests to be dropped by your system when under load.

In any case, if your service on-call sees an unexpected increase in CPU/memory/latency, they should drop the dark launch to 0% and see if that helps.

Summary

If you’re thinking about a dark launch for a new service, consider writing a dark launch plan. In that plan, make sure you answer the following questions:

Do you have existing traffic which you can fork and send to your new service?

Where will you fork the traffic: the application frontend, or somewhere else?

Will you fire off the message to the new backend asynchronously, or will you wait for it and impose a timeout?

What will you do with requests that generate mutations?

How and where will you log the responses from the original and new services, and how will you compare them?

Are you logging the following things: response code, backend latency, and response message size?

Will you be diffing responses? Are there fields that cannot meaningfully be diffed which you should skip in your comparison?

Have you made sure that your backends can handle 2x the current peak traffic, and have you given them temporary quota for it?

If not, at what percentage traffic will you stop the dark launch?

How are you going to select traffic for participation in the dark launch percentage: randomly, or by hashing on a key such as user ID?

Which teams need to know that this dark launch is happening? Do they know how to escalate concerns?

What’s your rollback plan after you make your new service live?

It may be that you don’t have enough surprises or excitement in your life; in that case, you don’t need to worry about dark launches. But if you feel that your service gives you enough adrenaline rushes already, dark launching is a great technique to make service launches really, really boring.

Say you’re about to launch a new service. You want to make sure it’s ready for the traffic you expect, but you also don’t want to impact real users with any hiccups along the way. How can you find your problem areas before your service goes live? Consider a dark launch.

A dark launch sends a copy of real user-generated traffic to your new service, and discards the result from the new service before it's returned to the user. (Note: We’ve also seen dark launches referred to as “feature toggles,” but this doesn’t generally capture the “dark” or hidden traffic aspect of the launch.)

Dark launches allow you to do two things:

Verify that your new service handles realistic user queries in the same way as the existing service, so you don’t introduce a regression.

Measure how your service performs under realistic load.

Dark launches typically transition gradually from a small percentage of the original traffic to a full (100%) dark launch where all traffic is copied to the new backend, discovering and resolving correctness and scaling issues along the way. If you already have a source of traffic for your new site — for instance, when you’re migrating from an existing frontend to a new frontend — then you’re an excellent candidate for a dark launch.

Where to fork traffic: clients vs. servers

When considering a dark launch, one key question is where the traffic copying/forking should happen. Normally this is the application frontend, i.e. the first service, which (after load balancing) receives the HTTP request from your user and calculates the response. This is the ideal place to do the fork, since it has the lowest friction of change — specifically, in varying the percentage of external traffic sent to the new backend. Being able to quickly push a configuration change to your application frontend that drops the dark launch traffic fraction back down to 0% is an important — though not crucial — requirement of a dark launch process.

If you don’t want to alter the existing application frontend, you could replace it with a new proxy service which does the traffic forking to both your original and a new version of the application frontend and handles the response diffing. However, this increases the dark launch’s complexity, since you’ll have to juggle load balancing configurations to insert the proxy before the dark launch and remove it afterwards. Your proxy almost certainly needs to have its own monitoring and alerting — all your user traffic will be going through it, and it’s completely new code. What if it breaks?

One alternative is to send traffic at the client level to two different URLs, one for the original service, and the other for the new service. This may be the only practical solution if you’re dark launching an entirely new app frontend and it’s not practical to forward traffic from the existing app frontend — for instance, if you’re planning to move a website from being served by an open-source binary to your own custom application. However, this approach comes with its own set of challenges.

The main risk in client changes is the lack of control over the client’s behavior. If you need to turn down the traffic to the new application, then you’ll at least need to push a configuration update to every affected mobile application. Most mobile applications don’t have a built-in framework for dynamically propagating configuration changes, so in this case you’ll need to make a new release of your mobile app. It also potentially doubles the traffic from mobile apps, which may increase user data consumption.

Another client change risk is that the destination change gets noticed, especially for mobile apps whose teardowns are a regular source of external publicity. Response diffing and logging results is also substantially easier within an application frontend than within a client.

How to measure a dark launch

It’s little use running a dark launch if you’re not actually measuring its effect. Once you’ve got your traffic forked, how do you tell if your new service is actually working? How will you measure its performance under load?

The easiest way is to monitor the load on the new service as the fraction of dark launch traffic ramps up. In effect, it’s a very realistic load test, using live traffic rather than canned traffic. Once you’re at 100% dark launch and have run over a typical load cycle — generally, at least one day — you can be reasonably confident that your server won’t actually fall over when the launch goes live.

If you’re planning a publicity push for your service, you should try to maximize the additional load you put on your service and adjust your launch estimate based on a conservative multiplier. For example, say that you can generate 3 dark launch queries for every live user query without affecting end-user latency. That lets you test how your dark-launched service handles three times the peak traffic. Do note, however, that increasing traffic flow through the system by this amount carries operational risks. There is a danger that your “dark” launch suddenly generates a lot of “light” — specifically, a flickering yellow-orange light which comes from the fire currently burning down your service. If you’re not already talking to your SREs, you need to open a channel to them right now to tell them what you’re planning.

Different services have different peak times. A service that serves worldwide traffic and directly faces users will often peak Monday through Thursday in the morning in the US as this is where users normally dominate traffic. By contrast, a service like a photo upload receiver is likely to peak on weekends when users take more photos, and will get huge spikes on major holidays like New Years. Your dark launch should try to cover the heaviest live traffic that it’s reasonable to wait for.

We believe that you should always measure service load during a dark launch as it is very representative data for your service and requires near-zero effort to do.

Load is not the only thing you should be looking at, however, as the following measurements should also be considered.

Logging needs

The point where incoming requests are forked to the original and new backends — generally, the application front end — is typically also the point where the responses come back. This is, therefore, a great place to record the responses for later analysis. The new backend results aren’t being returned to the user, so they’re not normally visible directly in monitoring at the application frontend. Instead, the application will want to log these responses internally.

Typically the application will want to log response code (e.g. 20x/40x/50x), latency of the query to the backend, and perhaps the response size, too. It should log this information for both the old and new backends so that the analysis can be a proper comparison. For instance, if the old backend is returning a 40x response for a given request, the new backend should be expected to return the same response, and the logs should enable developers to make this comparison easily and spot discrepancies.

We also strongly recommend that responses from original and new services are logged and compared throughout dark launches. This tells you whether your new service is behaving as you expect with real traffic. If your logging volume is very high, and you choose to use sampling to reduce the impact on performance and cost, make sure that you account in some way for the undetected errors in your traffic that were not included in the logs sample.

Timeouts as a protection

It’s quite possible that the new backend is slower than the original — for some or all traffic. (It may also be quicker, of course, but that’s less interesting.) This slowness can be problematic if the application or client is waiting for both original and new backends to return a response before returning to the client.

The usual approaches are either to make the new backend call asynchronous, or to enforce an appropriately short timeout for the new backend call after which the request is dropped and a timeout logged. The asynchronous approach is preferred, since the latter can negatively impact average and percentile latency for live traffic.

You must set an appropriate timeout for calls to your new service, and you should also make those calls asynchronous from the main user path, as this minimizes the effect of the dark launch on live traffic.

Diffing: What’s changed, and does it matter?

Dark launches where the responses from the old and new services can be explicitly diff’ed produce the most confidence in a new service. This is often not possible with mutations, because you can’t sensibly apply the same mutation twice in parallel; it’s a recipe for conflicts and confusion.

Diffing is nearly the perfect way to ensure that your new backend is drop-in compatible with the original. At Google, it’s generally done at the level of protocol buffer fields. There may be fields where it’s acceptable to tolerate differences, e.g. ordering changes in lists. There’s a trade-off between the additional development work required for a precise meaningful comparison and the reduced launch risk this comparison brings. Alternatively, if you expect a small number of responses to differ, you might give your new service a “diff error budget” within which it must fit before being ready to launch for real.

You should explicitly diff original and new results, particularly those with complex contents, as this can give you confidence that the new service is a drop-in replacement for the old one. In the case of complex responses, we strongly recommend either setting a diff “error budget” (accept up to 1% of responses differing, for instance) or excluding low-information, hard-to-diff fields from comparison.

This is all well and good, but what’s the best way to do this diffing? While you can do the diffing inline in your service, export some stats, and log diffs, this isn't always the best option. It may be better to offload diffing and reporting out of the service that issues the dark launch requests.

Within Google, we have a number of diffing services. Some run batch comparisons, some process data in live streams, others provide a UI for viewing diffs in live traffic. For your own service, work out what you need from your diffing and implement something appropriate.

Going live

In theory, once you’ve dark-launched 100% of your traffic to the new service, making it go “live” is almost trivial. At the point where the traffic is forked to the original and new service, you’ll return the new service response instead of the original service response. If you have an enforced timeout on the new service, you’ll change that to be a timeout on the old service. Job done! Now you can disable monitoring of your original service, turn it off, reclaim its compute resources, and delete it from your source code repository. (A team meal celebrating the turn-down is optional, but strongly recommended.) Every service running in production is a tax on support and reliability, and reducing the service count by turning off a service is at least as important as adding a new service.

Unfortunately, life is seldom that simple. (As id Software’s John Cash once noted, “I want to move to ‘theory,’ everything works there.”) At the very least, you’ll need to keep your old service running and receiving traffic for several weeks in case you run across a bug in the new service. If things start to break in your new service, your reflexive action should be to make the original service the definitive request handler because you know it works. Then you can debug the problem with your new service under less time pressure.

The process of switching services may also be more complex than we’ve suggested above. In our next blog post, we’ll dig into some of the plumbing issues that increase the transition complexity and risk.

Summary

Hopefully you’ll agree that dark launching is a valuable tool to have when launching a new service on existing traffic, and that managing it doesn’t have to be hard. In the second part of this series, we’ll look at some of the cases that make dark launching a little more difficult to arrange, and teach you how to work around them.