Crafting sustainable on-call rotations

With the advent of devops, many engineers these days find themselves participating in an on-call rotation, something that was once solely the responsibility of sysadmins or operations engineers. Carrying the pager (usually with some off-hours requirements) is not a task that most people enjoy. On-call has the tendency to be disruptive to our sleep, disruptive to the regular work we try to get done during our days, and disruptive to our lives as a whole. With more and more teams participating in these on-call rotations, what can we, as individuals, teams, and organizations, do to make the requirements of on-call more sustainable and more humane?

Often the first thing that people think of when they think about going on-call is the negative impact it will have on their sleep; nobody wants to be woken up in the middle of the night by PagerDuty. If your organization or team grows large enough, you could adopt a “follow-the-sun” rotation, where teams across multiple time zones participate in the same rotation, with on-call shifts being shorter so that each time zone is only on-call during their local business (or at least waking) hours. Setting up such a rotation can do wonders for reducing the toll that on-call can take on its participants.
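As a rough illustration of how a follow-the-sun handoff could be expressed, here is a minimal sketch. The three team names and the eight-hour UTC windows are assumptions made up for the example, not a recommendation for how to split your own regions.

```python
from datetime import datetime, timezone

# Hypothetical shift table: (start_hour_utc, end_hour_utc, team).
# Each team covers eight hours that fall roughly in its local daytime.
SHIFTS = [
    (0, 8, "apac"),
    (8, 16, "emea"),
    (16, 24, "americas"),
]


def on_call_team(now: datetime) -> str:
    """Return the team whose shift covers the given timezone-aware moment."""
    hour = now.astimezone(timezone.utc).hour
    for start, end, team in SHIFTS:
        if start <= hour < end:
            return team
    raise ValueError(f"no shift covers hour {hour}")
```

In practice the scheduling feature of your paging tool would handle this for you; the point of the sketch is only that each region's window is short and lines up with its waking hours.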

If you don’t have the headcount and geographic distribution for follow-the-sun, there are still things that can be done to reduce the likelihood that people will get woken up unnecessarily. (After all, it’s one thing to rouse yourself from bed at 4am to solve an actual, customer-facing issue; it’s another thing entirely to wake up only to find yourself dealing with a false alarm.) It can help to audit all of the alerts you have configured and ask your team which of them truly need to wake someone up off-hours and which can wait until morning. It may be difficult to get people to agree to turn off some off-hours alerts, especially if issues being missed have caused problems in the past, but it’s important to remember that an engineer who is sleep-deprived is not the most effective engineer. Save those off-hours alerts for when they’re really, truly important. Whether through Nagios’ notification periods or separate schedules in PagerDuty, most alerting tools these days allow you to configure different rules for off-hours alerts.
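The outcome of such an audit can be thought of as a small routing policy. The sketch below is tool-agnostic and purely illustrative: the alert names, the business-hours window, and the "page" and "queue" labels are all assumptions, and a real setup would express the same idea in your alerting tool's own configuration.

```python
from datetime import time

# Hypothetical policy: the alert names and hours are illustrative only.
CUSTOMER_FACING = {"checkout_errors", "site_down"}
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def route_alert(name: str, local_time: time) -> str:
    """Return "page" or "queue" for an alert at the on-call person's local time."""
    in_hours = BUSINESS_START <= local_time < BUSINESS_END
    if in_hours or name in CUSTOMER_FACING:
        return "page"
    return "queue"  # hold until morning instead of waking someone up
```

With this policy, a disk-usage warning at 4am is queued, while a site-down alert at the same hour still pages.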

Other ways of dealing with the disruptions to sleep involve more cultural changes. One way to approach this is to track your alerts, paying specific attention to when alerts come in and whether or not they are actionable. Opsweekly is a tool created and open-sourced by Etsy that allows teams to track and categorize the alerts they get. It can generate graphs that show you how many alerts have woken people up (by leveraging opt-in data from Jawbone or Fitbit sleep trackers) as well as how many alerts actually required action to be taken by a human. Using these technologies, you can track the effectiveness of your on-call rotation (and its impact on sleep) over time.

The people sharing a rotation can also play a part in making sure that each person on the team is getting sufficient rest. Create a culture that encourages people to look out for themselves: if you lose sleep because you got paged during the night, it should be okay for you to sleep in the next morning to try to catch a bit more shut-eye before starting your day. Team members can look out for each other as well: when teams are sharing their sleep data with each other through something like Opsweekly, they can go to their on-call colleagues and say, “Hey, it looks like you had a rough night with PagerDuty last night—want me to cover for you tonight so you can get some rest?” Encourage people to support each other in this manner, and discourage the sort of “hero culture” where people will stretch themselves to the breaking point to avoid asking for help.

When engineers are tired because on-call has woken them up, they obviously won’t be operating at 100% capacity during the day at work, but even without considering sleep deprivation, on-call can have other impacts on work as well. One of the biggest drains from on-call comes from the interruption factor: a single interruption can cost at least 20 minutes to lost focus and context switching. It’s likely that your teams will have other sources of interruptions, such as tickets generated by other teams or requests and questions that come in via chat or email. Depending on the volume of these other interrupts, you might consider adding them to your existing on-call rotation or setting up a second rotation just to deal with them.

It is important to consider on-call when you are planning the work that the team will be doing, both on a long-term and a short-term basis. If your team tends to have fairly intense on-call shifts, this fact will need to be factored into long-term planning, as you may have an entire headcount effectively dedicated to on-call work (rather than other work) at any given time. In sprint or other short-term planning, you may find that the on-call person isn’t able to meet deadlines because of their on-call responsibilities—this should be expected, and the rest of the team should be willing to adjust and help in order to make sure that work gets done and the on-call person gets supported. Unless your on-call person is never paged, this rotation will have an impact on people’s capacity for other work—do not expect the on-call person to work nights to complete scheduled projects in addition to carrying the pager with them off-hours.

Teams will need to figure out a sustainable way to deal with the additional work that is generated by on-call. That work might be work to fix real problems that were discovered via the monitoring and alerting systems, or it might be work to fix the monitoring and alerting to reduce false-positive alerts. Whatever the nature of the work generated, it is important to distribute that work across the team in a fair and sustainable manner. Not all on-call shifts are created equal, so saying that the person who got the alert is the person responsible for all the work generated by it is likely to result in an uneven distribution of work. It might make more sense for the on-call person to be responsible for planning or distributing the work, with the expectation that the rest of the team will be willing to help out with the completion of the work that is generated.

Consider the impact that on-call participation has on its participants’ lives outside of work. When you’re on-call, you’ll likely feel tied to your pager and laptop, whether that means carrying a laptop and a MiFi with you everywhere or simply not leaving your home except to go to and from the office. Being on-call usually means having to forgo things like making plans with friends or family for the duration of your shift. Depending on the number of people in your rotation, how long each shift is, and how intensive each shift tends to be, on-call might be placing an undue burden on the people involved. You may need to experiment with the length and scheduling of your shifts to find a schedule that works for at least most of the people involved, as different teams and people will have different priorities and preferences.

It is critical to recognize the impact that on-call will have on people’s lives, both at the management level and at the individual level. It has to be noted that these impacts will be felt more by people with less privilege. For example, if you have to spend time taking care of children or other family members, or if you find that the majority of housework falls on your shoulders, you already have less time and energy than someone who doesn’t have these responsibilities. This sort of “second shift” or “third shift” work tends to disproportionately impact women and people of color (who are already paid less than their white male colleagues for doing the same work), and if you set up an on-call rotation with a schedule or intensity that assumes the participants have no real responsibilities outside of the office, you are limiting the people who will be able to participate on your team.

Encourage people to try to maintain as much of their regular schedule as possible. You should consider doing something like providing a team MiFi so people can leave the house with their laptop/MiFi and still have some semblance of a life. Encourage people to swap hours with each other if need be for short periods of time so that people can go to the gym or attend doctor’s appointments while on-call. Don’t create a culture where on-call is expected to mean that engineers do literally nothing but be on-call. Work/life balance is an important part of any job, but especially so when considering off-hours work, and the more senior members of your team should lead by example and show the rest of the team what it looks like to have as much work/life balance as possible while on-call.

On an individual level, make sure that you explain what on-call means to your friends, family members, partners, pets, etc. (Your cats will probably not care, since they’re already up at 4am when PagerDuty calls anyway, though they will be in no way willing to help out with those pages.) Make sure you make up for any missed friend/family time once your shift is over, and if you can, consider setting up a silent alarm (on a smartwatch, for example) that wakes you by buzzing your wrist, so that a page doesn’t wake up anyone around you as well. Find ways of taking care of yourself while you are in the midst of your on-call shift and when it is over. You might want to put together an “on-call emergency survival kit” of things that help you relax: a playlist of your favorite music, a favorite book, or time set aside to play with a pet. Managers should be encouraging self-care by giving people a day off after a week-long on-call shift and by making sure that people are asking for (and receiving) help when they need it.

In general, on-call work should not be accepted as just being terrible: you have the opportunity and the responsibility as someone involved in an on-call rotation to actively work to make that rotation better for the people on it, and this generally means making sure people get paged less and get paged smarter. Again, tracking the usefulness of your alerts using something like Opsweekly can go a long way towards figuring out what makes your rotation’s alerts annoying and fixing them. For unactionable alerts, ask yourself if there are ways to get rid of these alerts—maybe this means only making them fire during business hours, because there are some things you just don’t need to respond to in the middle of the night. Don’t be afraid to delete alerts, change when they fire, or change them to be email-only rather than paging. Experimentation and iteration are key to making on-call rotations better over time.

For alerts that are actually actionable, you’ll want to consider how easy it is for the on-call engineer to take the action they need. Every alert that fires should have a runbook that goes along with it; consider using a tool like nagios-herald to add runbook links to your alerts. If an alert is so simple that it doesn’t need a runbook, it’s probably simple enough that you can automate the response using something like Nagios’ event handlers, which save humans the effort of having to wake up or get interrupted for easily automatable tasks. Both runbooks and nagios-herald can help you to add valuable context to your alerts so that people can respond to them more effectively. See if you can answer common questions such as: When was the last time this alert fired? Who responded to it last time, and what action did they end up taking (if any)? What other alerts tend to pop up at the same time as this one, and are they related? This sort of contextual information often ends up living only in people’s brains, so encouraging a culture of documenting and sharing contextual information can reduce the amount of overhead necessary to respond to alerts.
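To make the first two of those questions concrete, here is a minimal sketch of looking up the most recent response to an alert from a shared history. The `Response` record and its fields are hypothetical; they stand in for whatever your team actually records, and this is not how nagios-herald or any other tool stores its data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Response:
    """One hypothetical record of an alert firing and what was done about it."""
    alert: str
    fired_at: datetime
    responder: str
    action: str


def last_response(history: list[Response], alert: str) -> Optional[Response]:
    """Return the most recent record for the named alert, or None if it never fired."""
    matches = [r for r in history if r.alert == alert]
    return max(matches, key=lambda r: r.fired_at) if matches else None
```

Attaching the result of a lookup like this to the alert itself means the person being paged does not have to go hunting for it at 4am.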

A significant part of the drain that comes from being on-call is the fact that it never ends—if your team has an on-call rotation, it is unlikely that that rotation will end anytime in the foreseeable future. On-call is never-ending and it tends to feel like it will always be terrible. This lack of hope is a big mental drain that can contribute to stress and burnout, so addressing the perception (in addition to the reality) that on-call will always be awful is a good place to start when thinking about your rotations in the long term.

In order to give people hope that on-call can improve, it is necessary to have visibility into the system (the same on-call tracking and categorization that I mentioned previously). Keep track of how many alerts you have over time, what percentage of them are actionable, and how many of them wake people up, and then work on creating a culture that encourages people to make things better. If you have a big rotation, it can be tempting, as soon as you get off-call, to throw up your hands and say “that’s future-me’s problem” rather than digging in to fix things. After all, who wants to spend more effort on on-call-related work than they absolutely have to? This is where a culture of empathy can make a big difference, because you are caring not only about your own on-call happiness, but about that of your coworkers as well.
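The numbers mentioned above are simple to compute once alerts are being categorized. The following is a minimal sketch under assumed data: the `Alert` record and its two boolean fields are hypothetical, standing in for whatever categorization your tracking tool provides.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """One hypothetical categorized alert from a single on-call shift."""
    name: str
    actionable: bool
    woke_someone: bool


def summarize(alerts: list[Alert]) -> dict[str, float]:
    """Summarize a shift: total alerts, percent actionable, and number of wake-ups."""
    total = len(alerts)
    actionable = sum(a.actionable for a in alerts)
    wakeups = sum(a.woke_someone for a in alerts)
    pct = 100.0 * actionable / total if total else 0.0
    return {"total": total, "actionable_pct": pct, "wakeups": wakeups}
```

Comparing these summaries from one shift to the next is what shows whether the rotation is actually getting better.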

Empathy is a big part of what allows us to effectively incentivize work that improves the on-call experience. As either a manager or an individual contributor, you can positively recognize or even reward people for behavior that makes on-call better. Operations is one of those fields where engineers often feel like people only pay attention to them when things go wrong: people will be right there to yell at them when the site goes down, but they rarely recognize all the behind-the-scenes work that operations engineers invest into making sure the site stays up the rest of the time. Recognition of operations work can go a long way, whether that is thanking someone in a meeting or in a team-wide email for improving a particular alert or technical aspect of on-call, or giving someone time off for covering someone else’s shift for a while.

Encourage people to spend time and effort working on things that will improve on-call in the long term. If your team has an on-call rotation, you need to be planning and prioritizing this work the same way you would any other work on your roadmap. On-call is 90% entropy, and unless you actively work to make it better, it will tend to get worse and worse over time. Work with your team to find out what motivates and rewards the people on it, and then use that to encourage people to reduce alert noise, write runbooks, and create tooling that solves their on-call problems. Whatever you do, don’t just accept a terrible on-call rotation as an unchangeable part of the status quo.

About the author

Ryn Daniels is a senior operations engineer at Etsy and the author of “Effective DevOps.”