Introduction

PagerDuty released Multi-User Alerting in early 2014, which allowed notifying and assigning multiple people when an incident is triggered. In addition to assigning multiple users to an incident, multi-user alerting also makes it possible for an incident to have multiple acknowledgers. This post will demonstrate the changes we made to our data model to implement multi-user alerting and the resulting sophistication added to our SQL queries to maintain their performance.

Data Model Migrations

The logic and data model behind our incident process exists as part of a Rails app. Before the migration, our data model looked like:

At the surface, this type of migration seems straightforward: A 1:1 relationship becomes a 1:many relationship. A total cake walk straight out of Data Modeling 101, right?

Not really.

Some of the most common queries PagerDuty performs require finding all incidents for an account with a given state: triggered, acknowledged, or resolved. We’ll discuss how we migrated each of these queries, starting with the easiest.

Resolved Incidents

The query for resolved incidents did not change at all with Multi-User Alerting. An incident can only be resolved once, with or without Multi-User Alerting. That leads to the following scopes:

Triggered Incidents

Before Multi-User Alerting, the triggered scope was:

scope :triggered, unresolved.where(acknowledged_at:nil)

A triggered incident was merely one where both acknowledged_at and resolved_at were nil. Under Multi-User Alerting, an incident can have multiple acknowledgements, so this scope needed to be changed. We needed a query to find unresolved incidents that had exactly zero acknowledgements.

In SQL, how does one find if a join table has exactly zero associations? LEFT JOIN of course!

As a refresher, a left join guarantees that the result set contains all of the rows from the original table.

For example, consider the following two incidents, where one has some acknowledgements:

Acknowledged Incidents

Before Multi-User Alerting, the scope for acknowledged incidents was:

scope :acknowledged, unresolved.where("acknowledged_at IS NOT NULL")

Extending this to support multiple acknowledgements was similar to how triggered was extended. In this case, we are looking for unresolved incidents with at least one entry in the acknowledgements table. Instead of using LEFT JOIN with acknowledgements.incident_id IS NOT NULL, we can simply use a plain old INNER JOIN.

Unfortunately, naively using INNER JOIN as we did with triggered incidents is incorrect. To illustrate, consider the following implementation of acknowledged:

The result set has each incident occurring once for each acknowledgement. However, we want an incident to occur exactly once if it has at least one acknowledgement, so we need a way to remove these duplicate incidents. It turns out that efficiently removing duplicates in SQL is not as trivial as it seems.

Unfortunately, the above query is not very performant. For accounts with a large number of acknowledged incidents, we found this query was 2 to 3 times slower than the straightforward (though incorrect) query. The explain plan for the above query showed Using temporary, which indicates that MySQL was using a temporary table to compute the distinct rows. We suspect that a large part of the slowdown is based on MySQL copying all of the columns of each row to the temporary table for deduplication.

Use GROUP BY

In suspecting that the slowdown was caused by copying all of the columns (DISTINCT incidents.*), we tried to find a way to have DISTINCT run on just one column. GROUP BY can be used exactly for this purpose, so we attempted the following query as a starting point:

For accounts with large numbers of acknowledged incidents, this query was significantly faster than using DISTINCT. Furthermore, the explain plan for this query did not contain Using temporary. Unfortunately, this query too had one problem due to the implementation of .count in Rails.

GROUP BY affects the results of aggregate functions, specifically COUNT. In Rails, to count the total number of incidents from a scope, you will use count (e.g., Incident.acknowledged.count). However, with group, .count returns the count per group, not the total count. In other words,
instead of getting an integer returned:

Incident.acknowledged.count => 7

the return value is a map of incident id to count for that incident

Incident.acknowledged.count => {19 => 1, 20 => 1, 21 => 1}

Lots of places in our app assume that calling count returns an Integer, so changing that behaviour was not desirable. Wrapping the GROUP BY query in its own subquery restored the correct behaviour of count, but that killed the performance gains since a temporary table was again required. So close, yet so far.

Use a second LEFT JOIN

The crux of the initial problem of joining acknowledgements directly against incidents was that a single incident could have multiple acknowledgements, leading to duplicate incidents in the joined result.

What if there was a way to select a single exemplar acknowledgement for each incident? The MySQL manual describes a technique using LEFT JOIN for finding the group-wise maximum of a certain column. Using that trick with our query, we can find the acknowledgement with the max id for a given incident. We finally arrive at the following query:

From the MySQL manual, the LEFT OUTER JOIN works on the basis that when acknowledgements.id is at its maximum value, there is no acknowledgements_uniq.id with a greater value and the acknowledgements_uniq rows values will be NULL. Nick Kallen, author of Arel, calls this technique “one of the great mind-blowing SQL queries”.

The fantastic thing about this query is that it is just as performant as the GROUP BY version, and does not require the use of a temporary table. In addition, this query does not adversely affect count in Rails. Perfect!

Conclusion

SQL’s left join operator turned out to be a valuable tool in writing correct and efficient queries for multi-user alerting. Its usage was typical for the triggered incidents scope, which involved finding incidents with zero acknowledgements. Left join additionally proved versatile for the acknowledged incidents scope, where it was used as an efficient proxy for DISTINCT to find unique incidents with at least one acknowledgement.

Deriving efficient queries involved profiling the original SQL queries, understanding the behaviour of Percona’s query optimizer, and experimenting with alternative queries. If challenges problems like these excite you, PagerDuty is hiring.

Thanks to Steve Rice and Evan Gilman for reviewing drafts of this article.