Why We Moved to Amazon SQS

SparkPost’s technology delivers over 25% of the world’s non-spam email. As you might expect, that means we generate a lot of data. Seriously a lot. In an earlier post, I described the stack we use to manage that reporting. Now I’m back to share how we’ve improved our analytics and webhook infrastructure for better scalability and reliability.

That’s good for our customers—and for our operations team. The math is pretty simple: these improvements mean fewer pages to our on-call engineers. Fewer pages = more sleep. More sleep = happier engineers. Happier engineers = a better product for you. Allow me to explain in a little more detail why our on-call engineers will get more sleep, and how we did it.

Late last year, we successfully replaced RabbitMQ (RMQ) with Amazon Simple Queue Service (SQS). What does this mean for you? Besides a more reliable service, you shouldn’t see any significant change in behavior in our reporting and analytics functionality, including our Message Events API and metrics reporting. There are a couple of small exceptions: webhook batch sizes have decreased from a maximum of 10,000 messages to 100 per batch, and suppressions will soon take effect in real time.

RabbitMQ

RabbitMQ is a great message queue, but the manner in which we used it was no longer meeting our needs. This piece of our stack was a good fit for earlier generations of our on-premises software. We needed a scalable reporting system and had to move millions of events into a database. We also needed an architecture that our customers could manage without much effort, so we decided to put RMQ as close to the data as possible, which meant on the same servers as our MTAs. We implemented the topic exchange pattern and bound queues to the exchange by routing keys. This worked well: we published each message once, but it was routed to four different queues—metrics, webhooks, message events, and suppressions. Based on routing keys, each queue received only a subset of the data.
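To make the fan-out concrete, here is a purely illustrative Node.js sketch of topic-exchange routing. The binding patterns and event names are hypothetical (the post doesn’t specify them), and for brevity it only supports AMQP’s `*` wildcard, not `#` (zero or more words):

```javascript
// Illustrative sketch of topic-exchange routing: publish once, deliver to
// every queue whose binding pattern matches the routing key.
// "*" matches exactly one dot-delimited word (AMQP's "#" is omitted here).
function matches(pattern, routingKey) {
  const p = pattern.split('.');
  const k = routingKey.split('.');
  if (p.length !== k.length) return false;
  return p.every((word, i) => word === '*' || word === k[i]);
}

// Hypothetical bindings: each queue subscribes to a subset of events.
const bindings = [
  { queue: 'metrics', pattern: 'message.*' },
  { queue: 'webhooks', pattern: 'message.*' },
  { queue: 'message_events', pattern: 'message.*' },
  { queue: 'suppressions', pattern: 'message.bounce' },
];

// The "exchange": return every queue a routing key fans out to.
function route(routingKey) {
  return bindings
    .filter((b) => matches(b.pattern, routingKey))
    .map((b) => b.queue);
}
```

With bindings like these, a `message.bounce` event would land in all four queues, while other message events would skip the suppressions queue.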

We had various business requirements, including not losing data (surprise, surprise!) and not transmitting the same data twice, so we had to persist messages and be smart about detecting whether data had already been processed. Persistence means writing the data somewhere durable, to disk in this case, which required us to have disk volumes optimized for I/O.
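The “never transmit twice” requirement is commonly met with an idempotency check keyed on a message ID. This sketch is a generic illustration, not SparkPost’s actual mechanism (which would also need durable storage to survive restarts):

```javascript
// Generic duplicate-suppression sketch: remember processed message IDs so a
// redelivered message is not handled (or transmitted downstream) twice.
// In-memory Set for illustration only; production needs durable state.
const processed = new Set();

function processOnce(message, handler) {
  if (processed.has(message.id)) return false; // already handled; skip it
  handler(message);
  processed.add(message.id);
  return true;
}
```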

When we started building out SparkPost we decided to keep this piece of architecture because it had worked well for us at scale. However, we quickly learned that what worked well for building scalable commercial software would not work as well for managing a service at scale.

One issue we ran into was that when downstream entities (databases, webhook consumers) became congested with traffic, so did the queue. This posed higher risks for our MTAs, since they were sharing resources with RMQ. Persistent queues work well when data is constantly flowing, but when RMQ gets backed up it has to write to disk and pull from disk, which creates more overhead on CPU, memory, and I/O. In extreme cases, we put on our firefighter hats and worked to ensure we weren’t losing your emails or data about those emails. No one wanted to go through this high-stress process, and just like everything else, it usually happened at odd hours. I won’t get into the details of how we worked around the risk of failures for our service, but it required many people troubleshooting networking, nginx, ETLs, MTAs, etc. I can’t speak for the team, but I wanted to be like Steve Winwood and be back in the high life again… enter SQS and Omni ETL: one ETL to rule them all.

Omni ETL

We learned a lot from using RMQ and separate Node.js ETL processes. That setup made batching, transforming, loading, and acknowledging (deleting from the message queue) messages easy, but as you saw above, there were some drawbacks. We investigated message queue alternatives and landed on Amazon SQS. If you’re a regular reader of our blog, you know we are all about AWS. SQS came out on top, but we had to change how we did pub-sub of this data. Previously, our event hose logged events one at a time to a RMQ exchange, then each ETL read them off, batched them, and acknowledged them. A lot of cycles were wasted reading and batching the same data in each distinct process.

We ended up changing where we batch up messages. Rather than doing this in our ETLs (consumers), it is now done in the event hose (producer), which also compresses each batch before posting it to SQS. Compression was required because of SQS’s message size limitation of 256 KB. We went from having separate clustered Node.js processes to one parent process that spawns different child processes, which our clever engineer Jason Sorensen aptly named “Omni ETL.”

We needed a technology that could quickly keep track of state for each service, so it could determine when it was OK to remove (acknowledge, in RMQ terms) a message from an SQS queue. We opted for ElastiCache with a Redis backend, and we are now really obsessed with seeing what else we can use it for within SparkPost. ElastiCache is just a managed service around a Redis cluster, which helps us scale as needed. So what does this buy us?
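The bookkeeping might look something like the sketch below, where a plain in-memory Map stands in for the ElastiCache/Redis cluster and the key scheme and service names are made up for illustration. The idea is that a batch is only safe to delete from SQS once every consuming service has marked it processed:

```javascript
// Sketch of per-service acknowledgment tracking. A Map stands in for Redis;
// key names and service list are hypothetical.
const store = new Map(); // batchId -> Set of services that finished it

const SERVICES = ['metrics', 'webhooks', 'message_events', 'suppressions'];

// Record that `service` finished `batchId`; return true once every service
// has, meaning the corresponding SQS message can be deleted.
function markProcessed(batchId, service) {
  const key = `batch:${batchId}`;
  const done = store.get(key) || new Set();
  done.add(service);
  store.set(key, done);
  return done.size === SERVICES.length;
}
```

With real Redis this maps naturally onto a set per batch (`SADD` plus `SCARD`), but that wiring is beyond this sketch.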

There were many drivers behind this new architecture, not just avoiding impending fires. Cost efficiency was a big one. The disks we had to use for the RMQ partition aren’t cheap, and they weren’t fully utilized except at peak periods. Simplification was another reason: when this is completely rolled out, it will cut the number of Node.js processes by 75%. It also makes it easier to auto-scale during peak periods and to centralize monitoring and remediation. Lastly, we’re no longer putting our MTAs at risk when fighting the queue backup fires I described above, since SQS does not have the same side effects as RMQ.

What’s Next?

Cloud-first is a core value at our company and Omni ETL is a testament to that. We were able to eliminate another self-managed piece of technology in our architecture, which pays dividends in our engineers’ sleep. As our CEO Phillip Merrick recently said, SparkPost is growing fast, and our focus is delivering email at scale. These changes to our stack were a critical step as the volume of email (and data) we process continues to grow. It gives our ops team some peace of mind, and allows all of us to continue to deliver new scalable solutions.

I hope this has shed some light on the innards of SparkPost and how we can continue to deliver the data you need to improve your email operations. I did say real-time suppressions are coming soon, and because of technologies we started using in Omni ETL, we are close to launching. Stay tuned! If you just stumbled upon this post and have no idea what our reporting capabilities, webhooks, or suppressions can do, sign up for an account. Take it for a test drive for free. And as always, come talk to us on SparkPost community slack.

2 Comments

RP on Apr. 19, 2017 at 10:50 am

This is a great case study. Thank you for sharing.
Given your situation I am curious if you considered the Kinesis family of services (mainly Streams and Firehose) as alternatives to SQS (and likely SNS for multicast topics)? If so, why did you move away from those as options? Given your move from RMQ exchanges and a heavy in-pipeline ETL use case I am surprised Firehose (with Lambda transforms), or Kinesis Streams consumers wouldn’t be more appropriate… Even AWS hosted Kafka with KStreams may have been a consideration as well. I am sincerely interested in your evaluation criteria and vetting process that led to this decision – I am not trying to be a troll.

Great questions, and I appreciate your interest. One reason we went with this approach is that we considered it the most straightforward and least risky incremental step from our prior architecture. We were able to modify our existing ETL and webhook processing, written in Node.js, to swap out RabbitMQ for SQS. Our custom ETL software is very modular and working well, so we were able to avoid rewriting it completely. Admittedly, the batching we had to add to the event producer/consumer turned out to be more complicated than we expected.
That said, we will probably look at Kafka or Kinesis in the future as our needs evolve. If we were starting from scratch, we probably would have more seriously considered something like that.