Speeding up our Webhooks System 60x

Last December, we had an interesting problem. Our webhook system, which lets our customers know whenever a change happens to a Nylas account, was struggling to keep up with the traffic. We were seeing some webhooks being delayed as much as 10 minutes, which is a long time for most of our end-users. We decided to take the plunge and rearchitect our webhook system to be faster. In the end, we made it 60x faster and made sure it’ll stay that way for the foreseeable future. In this post you’ll see how.

Detour: A brief introduction to the Nylas transaction log

The Nylas API is built around the idea of a transaction log. The transaction log is an append-only log of all the changes that happened to our API objects. If you send a message through the Nylas API, we will create a transaction log entry noting that a “message” object was created. The same thing happens if you update an event, calendar or any other API object.

We use the transaction log to power all our change notification APIs. For example, the transaction log is how you can ask our delta stream API all the changes for a specific account in the last 24 hours.

Behind the scenes, the transaction log is implemented as a regular MySQL table. We’ve instrumented our ORM (SQLAlchemy) to write to this table whenever there is a change to an API object. In practice, this works relatively well, even though this part of the codebase relies a lot on SQLAlchemy internals, which makes it very brittle (if you’re curious about how it works, feel free to take a lot at the sync-engine source code).

How our legacy webhook system worked

Our original webhook system was pretty simple and really reliable. It would spawn one reader thread per webhook. Each reader thread would sequentially read from each of our MySQL shards and send the changes it found. This let us offload most of the reliability work to MySQL – for example, if one of the webhooks machines crashed, we would just have to reboot it and it would pick up things where it left off. That also meant that if a customer webhook went down then came back up, we’d be able to send them all the changes that happened in the interval, which made outage recovery for our customers a lot easier.

Unfortunately, as we grew from a couple MySQL shards to several hundred, and from a half-dozen customers to hundreds, this architecture started making less and less sense. Having one thread per customer meant that as our number of customer grew, our system would get slower and slower.

This might sound like a glaring limitation of our legacy system, but this was the right call when it was written three years ago. At the time, it was more important to have a simple and reliable system than to have something that was fast and scaled.

Rebuilding webhooks

Once we decided to rebuild the system, we had to figure out what kind of architecture would work best for our workload. To do that, we started by looking at the flamegraphs for our legacy system. Here’s a typical one:

One things jumps out immediately: we’re spending a lot of time executing SQLAlchemy code, and waiting for our MySQL shards. Seeing this confirmed a hunch we’ve had for a long time – we had too many readers.

To figure out if this was right, we decided to built a prototype that uses a single-reader architecture to send webhooks. Here’s the architecture we came up with:

Basically, we’d be moving from several readers per shards to one reader per-shard. We decided to try this out and see if it would solve our load problems.

The phantom reads problem

After three weeks of work, we felt that we had a system that was reliable enough to run production workloads, so we decided to ship a test version of the system that wouldn’t send actual webhooks. This way, we’d be able to iron out performance issues and make sure that there were no consistency issues between the legacy and new systems.

After doing a lot of testing, one issue kept happening – sometimes the new system wouldn’t send a transaction that should have been sent. This happened unpredictably and at any time of the day.

We spent a lot of time trying to figure out the issue – was it a problem in the way we were creating transaction objects? Was there a subtle bug in the way we were consuming the transaction table?

The problem was both more and less complicated. Every time we’d send a transaction to a customer we’d save its id to know where to resume in case of an interruption. However, it turns out that this isn’t safe at all, because of a misunderstanding we had about MySQL autoincrements: we assumed that they were generated at COMMIT time – which means that they’d be always incrementing.

However, this isn’t the case all the time, for example if two transactions are executing concurrently, one of these may not be – here’s an example of why:

This issue meant we couldn’t rely on MySQL to get transactions in order. From our point of view, there were three different ways we could go:

We could read from the MySQL binlog, which is the mechanism MySQL uses for its replication

We could use Apache Kafka

We could use Amazon Web Services’ Kafka clone, AWS Kinesis

In the end, we ended up going with AWS Kinesis mainly because it’s the easiest one to operate – both the MySQL binlog and Kafka have a significant operational cost that we couldn’t pay at the time (by the way, we’re hiring ops people 😅).

Kinesis to the rescue

Kinesis is an interesting system – it’s really reliable and easy to operate, as long as you fit inside of its (seemingly arbitrary) constraints. For example, Kinesis supports sharding, and you have to decide of the shard assignment yourself. This is fine but there’s a catch – every Kinesis is limited to 1MB of writes per second and 2MB of reads. On top of that, you can only have 5 transactions per seconds for reads, so having several processes reading from the same shard was out of question.

Obviously, this constraints aren’t the end of the world be we had to build our system around them. Here’s the system we ended up with – like the previous system we end up having a single reader per shard, except this time it’s reading from Kinesis instead of the database.

One interesting property of the new system is that it only uses Kinesis for getting an ordered log of changes – for the rest (like for example, catching up a customer webhook after a period of downtime), the new system will read the data from our database, which is a plus for durability, and helps us avoid Kinesis’ five transactions per second limitation.

Roll out

Webhooks are really really important for our customers. Breaking the system would mean breaking their apps for an undefined amount of time. To avoid doing this, we had to roll out this new service a little differently than the way we usually do it.

The way we usually roll out a new service is to do a staged rollout – we start by shipping the changes to 10% of our userbase, and then gradually increase the numbers throughout the day. We could have done this with the new webhooks service but we wanted to avoid any unexpected issues, were they dropping webhooks, or performance regressions, etc.

To do this, we decided at the start of the project to work on a version of the service that would simulate sending webhooks instead of actually sending them. This would give us confidence that we wouldn’t run into any issues at the launch. By the time we rolled out the service to our customers, it had been running in “simulated mode” for two months. That gave us a lot of confidence in the reliability of the new system.

We didn’t stop there though – given that our new system shouldn’t be dropping webhooks, ever, we spent several weeks instrumenting it to make sure that wasn’t the case. There weren’t a lot of good solutions for this, so we ended up instrumenting both the legacy webhooks systems and the new system to make sure that they were sending the same transactions. There’s no good way to do this, so we had to build a custom Flask app that got pinged whenever we sent a webhook.

Eventually, we became confident enough in the new system to roll it out to all our customers. That meant making a copy of each existing webhook, pointing that copy to the new system, turning off the old webhooks then turning on the new one and making sure no transactions were dropped.

Conclusion

Once the rollout was complete, we were able to measure the latency improvement for all our customers. The results were drastic – the old system would very often take more than a minute just to process a transaction and send it. Our P90 latency was over 90 seconds. With the new system, our P90 latency sits at around 1.5 secs – that’s a 60x improvement!

Finally, thanks to Russell Cohen who’s contracted with Nylas on this project and was instrumental in getting it out the door!