Splicing SRE DNA Sequences in the Biggest Software Company on the Planet (Greg Veith, Director of Azure Site Reliability Engineering @ Microsoft)

Microsoft Azure has 58 service offerings, spans 24 regions and 100 data centers, adds 120K new subscriptions per month, and the Azure IoT services alone process 3.3 million messages per second. The company has been transforming toward open source, Linux, and Apache software.

Deep down I believe that the features already shipped and in use by customers are the most important; new features are a nice-to-have. In 2009, six months after launch, Bing suffered a five-minute outage that led to a couple of days of postmortem analysis as we tried to capture learnings and turn them into action items. In the end, many changes were made to how we built and released software, and the executive team decided to invest heavily in building an SRE function.

SRE is not something that you can call yourself; it is something that you earn. I labelled my team as SRE as a matter of intent and direction. There is a cultural engineering problem that needs to be dealt with: I joined the team, was welcomed, and immediately received a pager. People usually want an easy button for SRE, but SRE is not Operations 2.0. There were a couple hundred folks in service operations, and they weren't an offense team; they were a defensive, classic operations team. The right folks for SRE are pioneers, and you need the right models and the right people in place. Production services remain online for many years, and they need to be properly looked after. As I built out the SRE function in Azure, I was given three goals: 1) build SRE, 2) don't mess up operations, 3) improve optics into the system.

Symptoms of Success

Availability and reliability meet SLOs (defend customer trust)

Eliminate human touches to production (toil elimination)

Speed up deployments (reduce inventory, ship fast, safely)

All of the above reinforce measurement, which is the foundation of reliability.

3 Strategic Pillars

Start SRE at Microsoft (establish principles)

Prove the model (apply principles)

Accelerate and improve (scale the principles)

I laid the groundwork by hiring pioneers - subject matter experts from Google, EMC, Amazon, etc. We are a team that constantly challenges itself and eliminates groupthink. We also built a cross-Microsoft group to define the SRE role and maintain a hiring bar together as a company - not only within Azure.

Newer Service Facing Rapid Growth (Growth and Maturation): SRE attaches to the team, develops targeted improvements to prepare for growth, and gets on call.

Greenfield Services or Redesign (Design and Architecture): operability and continuous innovation; design for scale from the beginning.

Examples

SRE is like your doctor or your lawyer, the more you share with us, the more we can help you.

The IoT services were new; we delivered an opportunities document that touched on reducing build times, making builds idempotent, real-time metrics to measure the SLOs of the services, and refactorings to decouple services. This was a team of seven engineers. Our principle was to leave it better than we found it.

The Azure storage service was gigantic and at planetary scale. The SRE team built a demand-shaping service, balancing capacity across multiple regions. This was a stack-augmentation approach, which is important to note because we're fully formed engineers; we're not managing stuff that is thrown over the wall. It required us to traverse more of the system. We eliminated toil, and we were also on call.

Production Virtuous Cycle

Goal: enable this loop to run as fast and often as possible while maintaining SLOs.

Nobody needs more data than SRE. If you don't measure it, it doesn't matter; measure what matters so you can quickly identify issues. One of the first things I did was build the Operational Intelligence team. We defined SLOs, worked on data hygiene and quality, and made sure the metrics were world-class. I needed them to be able to contribute to any code base very quickly.
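To make "measure what matters" concrete, here is a minimal sketch of measuring availability against an SLO target and tracking the remaining error budget. The function names and the request-count approach are illustrative assumptions, not Microsoft's actual tooling.

```python
# Hypothetical sketch: availability and error budget computed from
# success/total request counts. Illustrative, not actual Azure tooling.

def availability(successes: int, total: int) -> float:
    """Fraction of requests that succeeded."""
    return successes / total if total else 1.0

def error_budget_remaining(successes: int, total: int, slo: float) -> float:
    """Fraction of the error budget left, given an SLO target like 0.999."""
    allowed_failures = (1 - slo) * total
    actual_failures = total - successes
    if allowed_failures == 0:
        return 0.0 if actual_failures else 1.0
    return max(0.0, 1 - actual_failures / allowed_failures)

print(availability(999_500, 1_000_000))                      # 0.9995
print(round(error_budget_remaining(999_500, 1_000_000, 0.999), 3))  # 0.5
```

With a 99.9% SLO over a million requests, 500 failures consume half of the 1,000-failure budget - the kind of signal that tells a team whether it can keep shipping or must slow down.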

Incident Response

The most important features are the ones that we've already shipped and customers are using. Therefore, minimize incident time.

Critical Moves, Learnings

Build and protect the SRE brand: pull in operational leaders from different parts of the company to co-create this. Interaction with the HR business partner (HRBP) was a great advantage, and engaging recruiting has been an important relationship because we hadn't built SRE teams before.

Manage the change

Meet teams where they are, not where you want them to be: be happy to jump in, don't accept that fire alarms are the way to go, and eliminate the toil.

Grab a shovel (and build a backhoe)

Find the bright spots

Culture is not the problem; it is an outcome of incentives and results. The power to drive this change exists within your organization, and it's a multi-year effort.

Doorman: Global, Distributed, Client-Side Rate Limiting (YouTube SRE)

In the SRE function at YouTube, we deal with a lot of video uploads and a lot of people watching videos simultaneously. There's a magical number that appears next to each video indicating the number of views, with some spam views removed. All of this information is stored in a MySQL database. One of the problems with MySQL is that it doesn't rate limit very well, so we were left with the problem of figuring out how the tasks that computed views and removed spam should best rate limit themselves.

We wanted a solution in which any number of client tasks could coordinate among themselves so that, together, they wouldn't send more traffic to a shared resource such as MySQL than it could handle, while still fully utilizing the available capacity. From that came Doorman, a system for global, distributed, client-side rate limiting:

Doorman

Doorman apportions the available capacity and leases it to clients globally. Each client gets its fair share, for some pluggable definition of fair. The Doorman server gives out capacity based on a client's requested (wanted) capacity, and clients uphold the limits with the help of a Doorman-provided library.
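The talk leaves the definition of "fair" pluggable, so as one concrete possibility here is a sketch of max-min fairness: every unsatisfied client repeatedly receives an equal share, capped at what it wants, and leftovers are redistributed. This is an assumed illustration, not Doorman's actual algorithm.

```python
# Illustrative max-min fair apportionment - one possible pluggable
# definition of "fair", not necessarily the one Doorman uses.

def maxmin_fair(capacity: float, wants: dict[str, float]) -> dict[str, float]:
    """Split `capacity` among clients: give each unsatisfied client an
    equal share, capped at its wanted capacity, and redistribute the rest."""
    gets = {c: 0.0 for c in wants}
    remaining = capacity
    unsatisfied = {c for c, w in wants.items() if w > 0}
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)
        for c in list(unsatisfied):
            grant = min(share, wants[c] - gets[c])
            gets[c] += grant
            remaining -= grant
            if gets[c] >= wants[c] - 1e-9:
                unsatisfied.discard(c)
    return gets

print(maxmin_fair(10, {"a": 2, "b": 5, "c": 8}))
```

With 10 units of capacity and wants of 2, 5, and 8, the small client is fully satisfied (2) and the remaining 8 units are split evenly (4 and 4) between the two larger clients.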

Doorman Protocol

Any client that wants to participate in Doorman does so using the Doorman protocol. The first step is to discover the master. A request is composed of:

client id + priority

resource id

current capacity (has)

wanted capacity (wants)

The response to any request contains the:

assigned capacity (gets)

lease expiration time (in 5 minutes)

refresh interval (every 5 seconds)

It should be noted that determining the wanted capacity parameter is hard for a client. For this reason, the client libraries provided by Doorman have magical logic baked into them that looks at the wait patterns when executing RPC calls to figure out how fast the client would like to go. Capacity is a floating point number - the Doorman server is agnostic as to what it means (e.g. queries per second, query cost per second, or max in-flight transactions).
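The request/response fields above can be modeled roughly as follows. The field names follow the talk's has/wants/gets terminology; the dataclasses and the toy `handle` function are illustrative assumptions, not Doorman's actual protobuf messages or server logic.

```python
# Rough model of the Doorman request/response shapes described above.
# Illustrative only - not the real Doorman protobufs.
from dataclasses import dataclass

@dataclass
class CapacityRequest:
    client_id: str
    priority: int
    resource_id: str
    has: float    # capacity the client currently holds
    wants: float  # capacity the client would like

@dataclass
class CapacityResponse:
    gets: float             # assigned capacity
    expires_in_s: float     # lease expiration (5 minutes out)
    refresh_after_s: float  # when to refresh (every 5 seconds)

def handle(req: CapacityRequest, available: float) -> CapacityResponse:
    """Toy server: grant the smaller of what's wanted and what's free."""
    gets = min(req.wants, available)
    return CapacityResponse(gets=gets, expires_in_s=300, refresh_after_s=5)

req = CapacityRequest("task-7", 0, "mysql-views", has=50, wants=120)
resp = handle(req, available=100)
print(resp.gets)  # 100.0 - the client wanted 120 but only 100 was free
```

The short refresh interval relative to the lease expiration is what lets the system tolerate a brief master outage: clients keep operating on their last lease until it expires.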

Doorman Client Library

The client library currently supports the Go programming language. The Python and C++ libraries were (or will be) open sourced, and there are plans to support Java and Lisp. The client library is an interface for:

registering a new resource

declaring the wanted capacity for that resource

registering a callback for when a new capacity lease comes in

There is a helper within the library for rate limiting, with the added bonus, as noted earlier, that it figures out the wanted capacity for the client.
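As a sketch of how such a helper might uphold an assigned capacity (`gets`) expressed in queries per second, here is a simple token-bucket limiter whose rate can be updated when a new lease arrives. This is hypothetical code, not the actual Doorman client library.

```python
# Sketch of a client-side rate limiter that upholds the assigned
# capacity (gets, in QPS) via a token bucket. Hypothetical, not the
# real Doorman helper.
import time

class RateLimiter:
    def __init__(self, qps: float):
        self.qps = qps
        self.tokens = qps  # bucket starts full: allow up to 1s of burst
        self.last = time.monotonic()

    def update_capacity(self, gets: float) -> None:
        """Called from the lease callback when new capacity is assigned."""
        self.qps = gets

    def wait(self) -> None:
        """Block until one query's worth of capacity is available."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.qps, self.tokens + (now - self.last) * self.qps)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.qps)

limiter = RateLimiter(qps=200.0)
limiter.wait()  # returns immediately; the bucket starts full
```

A caller would invoke `limiter.wait()` before each MySQL query and wire `update_capacity` into the new-lease callback, so the whole fleet's aggregate rate tracks what the Doorman tree has granted.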

Python has some issues with multi-threaded code, and if you’re not aware of these issues, you shouldn’t be writing Python code!

Doorman Server Tree

There could be Doorman servers per country or per data center. The highest-level Doorman servers know about all of their clients - how much capacity they want and how much they have. The larger the tree, the slower convergence will be; in a three-node tree, the tree will rebalance itself within 30 seconds. If a client on the right-hand side of the tree begins to want more capacity, the tree rebalances by taking capacity from other clients to serve the needs of the heavy client on the right.
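One way to picture the rebalancing is that each node reports its subtree's total wanted capacity upward, and each parent splits its capacity among children in proportion to those wants. The sketch below shows a single top-down pass under that assumption; the real system converges over several refresh intervals rather than in one step.

```python
# Illustrative one-step capacity split over a two-level server tree,
# proportional to each subtree's total wants. An assumption for
# illustration, not Doorman's actual rebalancing code.

def subtree_wants(node: dict) -> float:
    if "clients" in node:  # leaf server: clients attach here
        return sum(node["clients"].values())
    return sum(subtree_wants(c) for c in node["children"])

def apportion(node: dict, capacity: float) -> dict:
    """Top-down pass splitting capacity in proportion to wants."""
    total = subtree_wants(node)
    if "clients" in node:
        return {cid: capacity * w / total for cid, w in node["clients"].items()}
    out = {}
    for child in node["children"]:
        out.update(apportion(child, capacity * subtree_wants(child) / total))
    return out

tree = {"children": [
    {"clients": {"us-1": 30, "us-2": 30}},  # left data center
    {"clients": {"eu-1": 90}},              # right data center (heavy client)
]}
print(apportion(tree, 100))  # {'us-1': 20.0, 'us-2': 20.0, 'eu-1': 60.0}
```

When the heavy client on the right wants more, its subtree's reported total grows, and the next pass shifts capacity away from the left-hand clients - exactly the "stealing" behavior described above.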

Doorman Server

There is a single master server for each node in the server tree; etcd master election is used to appoint the master from a number of candidates. The master keeps the entire state in RAM - there is no syncing with replicas and no backing store. When starting up, the server learns the current state of the world from its clients (learning mode). Clients use an application-specific discovery protocol to find the master for their node.
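Since there is no backing store, a freshly elected master has to reconstruct outstanding leases from the `has` values clients report on their next refresh. The sketch below illustrates that learning-mode idea; the class and its logic are assumptions for illustration, not the real server.

```python
# Sketch of "learning mode": a new master with no persisted state
# trusts clients' reported `has` until it has rebuilt the picture.
# Illustrative only - not the actual Doorman server.

class DoormanServer:
    def __init__(self, capacity: float):
        self.capacity = capacity
        self.leases = {}      # client_id -> capacity that client holds
        self.learning = True  # honor reported `has` while rebuilding state

    def refresh(self, client_id: str, has: float, wants: float) -> float:
        if self.learning:
            # Trust the client's claim and just record it.
            self.leases[client_id] = has
            return has
        # Normal mode: grant up to what is free (the client's own
        # current lease is returned to the pool before regranting).
        free = self.capacity - sum(self.leases.values()) + self.leases.get(client_id, 0.0)
        gets = min(wants, free)
        self.leases[client_id] = gets
        return gets

server = DoormanServer(capacity=100)
print(server.refresh("task-1", has=40, wants=60))  # 40: learning mode echoes `has`
server.learning = False
print(server.refresh("task-1", has=40, wants=60))  # 60: free capacity is now granted
```

In practice the server would exit learning mode once one full lease-expiration window has passed, since by then every live client must have checked in.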

We’re SREs and we don’t write systems that have single points of failure!

In the previous blog posts, we learned how to set up Carbon (caches) and Whisper, publish metrics, and visualize the information and the behavior of the Carbon processes. In this blog post, I'll present another feature of Carbon: the aggregator.

In this blog post, I'll go through the process of stress testing a single carbon-cache process and analyze its behavior as more and more metrics are published. I'll also introduce the Stresser, a simple metric-publishing tool.

Now that we have the back-end components up and running and storing numeric time-series data in the formats that we have specified, it's time to take a look at the front-end components of Graphite. Specifically, we need a way to query and visualize the information that is stored.

In the previous blog post, we installed Carbon and Whisper - the backend components of Graphite. We then spun up a carbon-cache process to listen for incoming data points and store them using Whisper. In this blog post I describe in more detail how Whisper stores the data points in the filesystem and how you can control these details.