Server Monitoring Alerts: War Stories from the Ops Trenches

No alerts, no incidents, nothing. Your infrastructure works like clockwork without a hitch and therefore without alerts. No news is good news. Right?

Um, yes. But what happens when the illusion is inevitably shattered? What types of scenarios do we face here at Server Density, and how do we respond to them? What systems and tools do we use?

This post is a collection of war stories from our very own ops environment.

A Note on our Infrastructure

Most of it is hosted on a hybrid cloud/dedicated environment at Softlayer. A small portion, however, is hosted on Google Compute Engine. That includes our Puppet Master (1 server), our Build servers (2 servers), and the Staging environment (12 servers).

There are a number of reasons why we chose Google Compute Engine for this small part of our infrastructure. The main one was that we wanted to keep those services completely separate from production. If we were to host them on Softlayer we would need a different account.

Here is a low-res snapshot of our infrastructure.

5 Types of Server Monitoring Alerts

Let’s start by stating the obvious. We monitor and get alerts for a whole bunch of services: MongoDB, RabbitMQ, Zookeeper and Kafka, Web Servers, Load Balancers, you name it. Here are some examples.

1. Non events

As much as we try to minimise the amount of noise (see below), there will always be times when our alerts are inconsequential. At some point 50% of the alerts we got were simply ignored. Obviously, we couldn’t abide by such a high level of interruption. So we recently ran a project that systematically eliminated the majority of those. We’ve now reached a point where such alerts are the rare exception rather than the rule.

Further up the value chain are those alerts we resolve quickly, almost mechanically. It’s the types of incidents we add zero value and shouldn’t have to deal with (in an ideal world). For example, the quintessential . . .

2. “Random disk failures in the middle of the night”

Sound familiar? We get woken up by a no data alert, open an incident in Jira, try to SSH, no response, launch the provider console, see a bunch of io errors, open a ticket with the provider for them to change disks, go back to bed. Process takes less than 30 minutes.

Speaking of providers, here is another scenario we’ve seen a couple of times.

3. “They started our server without telling us”

Provider went through a schedule maintenance. As part of this they had to reboot the entire datacenter. We were prepared for the downtime. What we didn’t expect was for them to subsequently restart the servers that were explicitly shut down to begin with. Obviously we were on the lookout for strange things to happen so we were quick to shut those servers down as soon as we got their alerts.

Occasionally, the alerts we receive are mere clues. We have to piece things together before we can unearth broader system issues. Here is an example.

4. Queue Lengths

We monitor our own queue lengths. As part of asynchronous processing, a worker will take an item in a queue and process it. Long queues could indicate either that the workers are too slow or that the producers go too fast. The real underlying issue however could have nothing to do with workers, producers or their queues.

What if there is too much network traffic, for example? A network problem won’t break the system and we may never know about it until we deduce it from indirect metrics like, say, queue lengths.

5. MongoDB seconds_behind_master alert

If a replica slows down, failover is at risk. If there is packet loss or a small capacity link then there is no sufficient traffic between primary and secondary. This means that certain operations can’t be replicated to secondary. The risk is that in the case of a failover, the delta is lost.

In October 2014 we experienced a weird outage that exemplifies this type of failure. Our primary MongoDB cluster is hosted on our Washington DC datacenter (WDC) and the secondary is in San Jose (SJC).

Interesting things started to happen when some roadworks in the neighboring state of Virginia caused a fibre cut. This cut broke the connection between the two data centers. It also severed the link between WDC to the voting server in Dallas (DAL). The voting server carries no data. It is a mere arbiter that votes for promotions and demotions of the other two servers.

Not long after the outage, based on a majority vote from both DAL and SJC, the latter was promoted to Primary. Here is where things get hairy. Following the SJC promotion, the link between the voting server (DAL) and WDC was somehow restored, but the link between WDC and SJC remained down. This weird succession of events left both WDC and SJC in primary server mode for a short amount of time, which meant we had to roll back some operations.

How often does that happen?

War Games

Responding to alerts involves fixing production issues. Needless to say, tinkering with production environments requires some knowledge. One of the greatest time sinks in team environments is knowledge silos. To combat that risk, we document absolutely everything, and we use checklists. On top of that, every three months we organise War Games.

As the name suggests, War Games is a simulation of incidents that test our readiness (knowledge) to resolve production incidents. Pedro, our Operations Manager hand-picks a varied range of scenarios and keeps them private until the day of the main event.

He then sets up a private HipChat room with each engineer (everyone who participates in our on-call rotation). And then, once everything is ready, he sounds a bulb horn to signal the arrival of alerts.

The same alert appears in every private HipChat room. Each participant then types down their troubleshooting commands and Pedro simulates the system responses.

Other than the obvious benefits of increasing team readiness, War Games have often helped us discover workarounds for several limitations we thought we had. There is always a better and faster way of fixing things, and War Games is a great way to surface those. Knowledge sharing at its best. A side-outcome is improved documentation too.

Summary

Here at Server Density we spend a significant amount of our time improving our infrastructure. This includes rethinking and optimising the nature of alerts we receive and—perhaps most crucially—how we respond to them.

We’ve faced a lot of ops scenarios over the last six years, and—together with scar tissue—we’re also accumulating knowledge. To stay productive, we strive to keep this knowledge “fresh” and accessible to everyone in the team. We leverage documentation, checklists, and we also host War Games: the best way to surface hidden nuggets of knowledge for everyone in the team to use.

What about you? What types of incidents do you respond to, and how’ve you improved your ops readiness?