VictorOps PIR of an Incident on December 22, 2017

In an effort to provide transparency and share key learnings, we’re making our Post-Incident Review notes public. Read on to understand our approach to detection, response, remediation, analysis, and readiness of this recent incident.

Background Information

Reliable alert delivery is, without question, a core operational tenet at VictorOps. On December 22, 2017, we encountered a technical challenge adversely affecting our ability to deliver alerts in a timely manner.

Above and beyond providing the technical forensics around this incident, we also extend a genuine and sincere apology for any challenges this incident caused you, your colleagues, or your customers.
Summary of the Incident

On December 22 at 02:32 MST, the VictorOps production Cassandra cluster experienced failures that degraded alert delivery and functionality for all customers. Alerts were still being processed and notifications were still being sent, but with delays.

This continued until 03:50 MST, when our support team started receiving reports of delayed alerts, delayed acknowledgments, and delayed resolve operations issued by our clients. Support escalated to engineering teams at 03:59 MST to assist in troubleshooting. A fix was identified and deployed by 04:17 MST, and at 04:22 MST the issue was resolved.

Services Impacted

Alert processing was delayed
Ack/resolve operations were delayed

Duration

This incident lasted 110 minutes, from 02:32 MST to 04:22 MST.

Severity

P1 – this was a customer-affecting issue with a straightforward remediation.

Customer Impact

Most customers experienced delays in alert processing due to increased latency and failures in our production Cassandra services. This caused intermittent delays when sending notifications as critical alerts were held up in processing.

Some customers also experienced delays in acknowledgment and resolve operations issued from all clients. This resulted in our application continuing to page some users after the incidents were acknowledged.

Timeline

12/14/2017: Cassandra upgrades were performed on our production cluster. The upgrades completed without issue

12/14/17 - 12/22/17: The upgraded cluster was operating normally

12/22/17 02:32 MST: Service interruption started

12/22/17 04:17 MST: Fix was deployed to the Cassandra cluster

12/22/17 04:22 MST: Customer issues were confirmed resolved

12/22/17 - 12/28/17: To completely remediate the issue with our Cassandra ring, we rebuilt and deployed each member of our production cluster. This service maintenance did not have any customer impact

Causal and Contributing Factors

Our upgrade to DSE 5.1.5 on Dec. 14 exposed us to a bug (introduced in DSE 5.1.3) that can cause a node to hang during compaction if the node is configured to use multiple drives, as ours were.

We did not have monitoring in place to detect the kinds of processing delays customers experienced during this service disruption.

Countermeasures and Resolution

Reconfigure our Cassandra ring so each node only has one drive, effectively mitigating the bug in question.
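The relevant change is to the `data_file_directories` setting in each node's cassandra.yaml. A minimal sketch of what that reconfiguration could look like (the directory paths here are illustrative, not our actual mount points):

```yaml
# cassandra.yaml (fragment) -- paths are hypothetical examples.
#
# Before: one data directory per physical drive, the multi-drive
# configuration affected by the compaction-hang bug:
#
# data_file_directories:
#   - /mnt/disk1/cassandra/data
#   - /mnt/disk2/cassandra/data
#
# After: a single data directory per node, sidestepping the bug:
data_file_directories:
  - /var/lib/cassandra/data
```

Because this changes where a node stores its data, each node has to be rebuilt and rejoined to the ring, which is why the full remediation ran from 12/22 through 12/28.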

Reach out to DataStax to perform an audit of our Cassandra configuration to check for any non-standard settings that could potentially lead to issues.

Introduce new monitoring checks that detect processing delays such as those experienced on 12/22/17.
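One simple form such a check can take is measuring the age of the oldest unprocessed alert in the pipeline and alarming when it exceeds a threshold. The sketch below is a hypothetical illustration of that idea, not our actual monitoring code; the queue shape, field names, and 60-second threshold are all assumptions.

```python
import time

# Hypothetical threshold: how stale the oldest pending alert may be
# before the check fires. Chosen for illustration only.
ALERT_DELAY_THRESHOLD_SECONDS = 60


def oldest_pending_age(pending_alerts, now=None):
    """Return the age in seconds of the oldest unprocessed alert.

    `pending_alerts` is a list of dicts, each carrying an `enqueued_at`
    epoch timestamp (an assumed shape). An empty queue has age 0.
    """
    now = time.time() if now is None else now
    if not pending_alerts:
        return 0.0
    return now - min(a["enqueued_at"] for a in pending_alerts)


def processing_delayed(pending_alerts, now=None):
    """True when the oldest pending alert exceeds the delay threshold."""
    return oldest_pending_age(pending_alerts, now) > ALERT_DELAY_THRESHOLD_SECONDS
```

A check like this catches the symptom customers actually experienced (alerts sitting in the pipeline) rather than only the underlying cause, so it would fire regardless of which component is slowing processing down.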

We hope the information in this report gives you a detailed and transparent view into the incident itself and into the lessons we learned from it going forward.

Dec 28, 2017 - 13:32 MST

Resolved

This incident has been resolved.

Dec 22, 2017 - 04:35 MST

Monitoring

A fix has been implemented and we are monitoring the results.

Dec 22, 2017 - 04:29 MST

Identified

We have identified the issue and the Engineering team is working toward resolution.

Dec 22, 2017 - 04:25 MST

Investigating

We are sorry for the inconvenience, but we are in the process of investigating a minor partial outage affecting parts of the platform (incident timeline, email notifications, etc.). Follow @VOSupport or this page for updates.