Why Consistency Matters When Building Reliable Systems

We recently faced what could have been a serious user-facing outage: we rolled out a bad patch that caused the deletion and recreation of several hundred thousand folders from user’s inboxes. As a Nylas customer, you may have seen dozens of folders getting deleted then recreated.

As we do after every outage, we conducted an internal postmortem to learn what went wrong and prevent issues like this from happening in the future.

This time, we’ve decided to share this postmortem publicly because we think the bug was interesting!

What caused it?

Before digging into the issue, it’s helpful to have context on how the Nylas sync engine works. The sync engine is the core of our infrastructure and is responsible for syncing billions of emails, calendar data, and contacts with our customer’s platforms -- which are often customer relationship management platforms (CRMs), applicant tracking systems, and such.

Keeping email data in a two-way sync between the end-user’s inbox and our customer’s platform is a pretty interesting challenge because protocols like IMAP and Exchange weren’t really built for bi-directional sync (in fact, they were built long before most people even know what a CRM was).

For example, when you rename an IMAP folder, you’re not just re-writing the folder name. Instead, your email app has to re-download all the emails in this folder and reconcile them with the emails it’s previously synced. This means that we can not delete IMAP folders immediately. Instead, we mark them as deleted and move on. Later on, we'll detect whether deletion was actually a rename by looking at the messages inside it. Folders that weren't renamed are then garbage-collected by a separate process to make sure we’re not storing them forever.

This garbage collection process was the source of the incident. Let’s dive into the part of the source code that caused the issue (if you’re curious, you can find the full source code here).

Here and below, Categories is the model we use to track IMAP folders.

def gc_deleted_categories(self):
# Delete categories which have been deleted on the backend.
# Go through all the categories and check if there are messages
# associated with it. If not, delete it.
with session_scope(self.namespace_id) as db_session:
categories = db_session.query(Category).filter(
Category.namespace_id == self.namespace_id,
Category.deleted_at > EPOCH)
for category in categories:
# Check if no message is associated with the category. If yes,
# delete it.
count = db_session.query(func.count(MessageCategory.id)).filter(
MessageCategory.category_id == category.id).scalar()
if count == 0:
db_session.delete(category)
db_session.commit()

Pretty straightforward, right? Get all deleted Categories, then check if they have actual messages associated with them. If not, delete them.

So, this is a pretty standard model too. It defines a bunch of fields and sets a UniqueConstraint to make sure we’re not storing duplicate folders. There is one problem though — because of a 12-year-old MySQL bug we have to set deleted_at = epoch in order to mark folders as not deleted.

Fast forward to two weeks ago: we noticed a rare condition where we would delete a folder right before detecting a rename. To work around this, we decided to add a six hour delay before deleting folders. To do that, we changed our previously innocuous query from this:

This code would be perfectly normal if Category.deleted_at had a default value of None. However, in our case it had a default value of EPOCH to work around this MySQL bug — causing us to delete a whopping 300k (empty) folders in the 45 minutes it took us to notice the issue and roll back this patch.

Timeline

(All times in PST)

13:15 We started rolling out this patch as part of our daily code push to production.

13:25 The oncall engineer got paged because of an elevated number of accounts failing. They immediately created a status page incident and started looking into the issue.

13:35 After escalating the issue, several engineers started digging into our logs to figure out what was happening.

13:45 We decided to rollback as a preventive measure

Later in the afternoon, we dug through the patches that got shipped and found the real cause of the issue.

What measures are we taking to prevent this from happening again?

This is an interesting bug because it highlights the importance of consistency in complex systems: all our models follow the convention where deleted_at is NULL if the object isn’t soft-deleted. Both the person who wrote the code and the two people who reviewed assumed it was innocuous. There is no obvious culprit; just unexpected interactions.

To make sure that this doesn’t happen again we started working on two things: