This post is written by Dan Di Spaltro, Director of Product for Rackspace Cloud Monitoring. Cloud Monitoring lets you monitor any server in any data center so that you can always make sure that your application infrastructure is a-ok. Before leading the Cloud Monitoring engineering team, Dan co-founded Cloudkick, a Y-Combinator startup focused on monitoring cloud infrastructure that was acquired by Rackspace in 2010.

The engineering team at Rackspace is always looking for ways to use Rackspace products in the solutions that we build for customers. We believe that when we use our own products, we build better software for our customers because we can provide feedback to the team sitting right across the hallway. During the Rackspace acquisition of Mailgun, we had a unique opportunity to test out Rackspace’s newest set of services in our Cloud Monitoring product.

A big part of what Cloud Monitoring does when an error occurs is send an email to the customer. These emails can be triggered when the website being monitored has timed out, responds with a 404 error, or when some other user-defined condition in your Alarm Language is met. When we investigated Mailgun, our objective was to reliably enable email notifications for our customers, not to master the details of email creation, delivery and analytics, three essential components of email automation at our scale. When we took Mailgun out for a spin, we were very impressed. We got the benefits we expected and even some we didn’t. For example:

Sane bouncing – Mailgun offers a rational, API-based way to deal with bounces which happen frequently for lots of different reasons.

Analytics and tagging – We are constantly testing failure scenarios in our production environment, so we needed a way to isolate certain accounts we’ve marked “special”. Mailgun can segment all reporting based on tags which makes testing easy.

Accountability – Just sending an email is not enough for us. Mailgun logs provide an end-to-end view of exactly what happened to each email.

Before looking at how we implemented and rolled out Mailgun, let’s look at how Cloud Monitoring is architected.

Cloud Monitoring Pipeline

The Rackspace Cloud Monitoring system is broken into a pipeline of different services. As a refresher, let’s look at a 50,000 ft view of the architecture.

Each step represents one or more distributed services, connected by a data stream. We use Facebook’s Scribe software for point-to-point distribution, routing and buffering. With that data stream we can slice, dice and distribute the data to whichever corner of infrastructure we like. Easy experimentation with the data stream is critical to our velocity. It allows us to ship data anywhere, and it is a core part of the design that supports our high availability as well as product acceleration and development (side fact: it is why the product is code-named ELE).

Preparing for the test

For a production system, wholesale switching is never a great idea, so we decided to take a different approach. We phased Mailgun in first on accounts that we use for our testing. This allowed us to go beyond just micro-validations and unit test validations to get comfortable with the system. Thanks to the easy support for HTTP in Node.js, we were able to make quick work of this integration.

Mocking the Mailgun API

Any time we add an external dependency, we try to mock out the API if feasible. Mocking an external API can be very simple or very elaborate, but it is important either way. Our goal is not to add extra work for the developers but to remain confident in our interactions with an external service and how it is supposed to function. These days, most APIs are versioned, so major changes come with a version bump. This allows us to quickly and confidently mock out APIs to build more elaborate end-to-end tests in a controlled environment.

After determining how we wanted to roll out the test on our testing account, we built our own mock of the Mailgun API, which allowed us to get comfortable with the new integration. This process was dead simple. The Mailgun API for all the outgoing and analytics operations was simple and straightforward to implement. Only after mocking it out did we realize that there is a test mode to sandbox sending emails, but this was a valuable learning experience nonetheless.
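
As a rough illustration of how small such a mock can be, here is a minimal sketch in Node.js. It is not our actual mock: `createMockMailgun`, the `handle` helper and the validation rules are hypothetical, and only the path shape (`/v3/<domain>/messages`) and the "Queued. Thank you." response are modeled on Mailgun's public HTTP API.

```javascript
const http = require('http');

// Hypothetical in-process mock of Mailgun's "send message" endpoint.
function createMockMailgun() {
  const sent = []; // every accepted message, kept for assertions in tests

  // Pure request handler, so unit tests can call it without opening a socket.
  function handle(method, path, formBody) {
    const match = /^\/v3\/([^/]+)\/messages$/.exec(path);
    if (method !== 'POST' || !match) {
      return { status: 404, body: { message: 'Not Found' } };
    }
    const fields = new URLSearchParams(formBody);
    for (const required of ['from', 'to', 'subject']) {
      if (!fields.get(required)) {
        return { status: 400, body: { message: `'${required}' parameter is missing` } };
      }
    }
    const id = `<mock-${sent.length + 1}@${match[1]}>`;
    sent.push({ id, domain: match[1], fields });
    return { status: 200, body: { id, message: 'Queued. Thank you.' } };
  }

  // Thin HTTP wrapper around handle(); call server.listen(port) to use the
  // mock from an end-to-end test.
  const server = http.createServer((req, res) => {
    let raw = '';
    req.on('data', (chunk) => { raw += chunk; });
    req.on('end', () => {
      const result = handle(req.method, req.url, raw);
      res.writeHead(result.status, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify(result.body));
    });
  });

  return { handle, sent, server };
}
```

Keeping the handler separate from the HTTP server means the same logic backs both fast unit tests and slower end-to-end tests that talk to a real socket.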

Surgical Replacement

With preparations for the test complete, we began to implement and monitor email performance to gain confidence that Mailgun was something we could roll out more broadly to our users. In the Cloud Monitoring application, the component that ensures delivery of an alert is a complex piece of software aptly named messenger. It performs a variety of tasks to increase the reliability of our email notifications. Internally, messenger is a staged execution pipeline. It has the following stages:

Receive message

Lock on message atom

Perform account/entity/check/alarm lookups

Find notification plan

Deliver Alerts

Create email

Deliver

Track
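
To make the idea of a staged execution pipeline concrete, here is a generic sketch, not messenger's real code. The `runStages` helper and the placeholder stage bodies are hypothetical; only the stage names come from the list above. Each stage receives the context built so far and returns an updated one, and a failing stage stops the run and reports where it stopped.

```javascript
// Hypothetical stage runner: stages is an array of [name, fn] pairs, where
// fn takes the current context and returns the next one (or throws).
function runStages(stages, message) {
  const completed = [];
  let context = { message };
  for (const [name, stage] of stages) {
    try {
      context = stage(context);
    } catch (err) {
      return { ok: false, failedAt: name, completed, error: err };
    }
    completed.push(name);
  }
  return { ok: true, completed, context };
}

// Placeholder stages for illustration only; real stages would do lookups,
// locking and delivery.
const exampleStages = [
  ['receive message', (ctx) => ctx],
  ['find notification plan', (ctx) => ({ ...ctx, plan: { email: 'ops@example.com' } })],
  ['create email', (ctx) => ({
    ...ctx,
    email: { to: ctx.plan.email, subject: ctx.message.summary },
  })],
];
```

Calling `runStages(exampleStages, { summary: 'check failed' })` returns the final context together with the list of completed stage names, which makes it easy to see how far a message got.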

During the test, we added code to the Deliver Alerts stage that checks for the special account and routes its messages to Mailgun. This process was simple because very little code was involved.
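
A minimal sketch of what this kind of routing might look like, under stated assumptions: the account ID, the `senders` object, the `recordTiming` hook and the helper names are all hypothetical, and only the `sendMailgunEmail` name comes from our code. Both senders are assumed to follow the Node.js `(email, callback(err, result))` convention.

```javascript
// Hypothetical set of account IDs enrolled in the Mailgun trial.
const MAILGUN_TEST_ACCOUNTS = new Set(['acTest01']);

function usesMailgun(accountId) {
  return MAILGUN_TEST_ACCOUNTS.has(accountId);
}

// Try `send` up to maxAttempts times, timing every attempt. If every attempt
// fails, hand the email to `fallback` instead.
function deliverWithRetries(email, send, fallback, options, callback) {
  const maxAttempts = options.maxAttempts || 3;
  const recordTiming = options.recordTiming || (() => {});
  let attempt = 0;

  function tryOnce() {
    attempt += 1;
    const startedAt = Date.now();
    send(email, (err, result) => {
      // Measure how long the API call took, whether or not it succeeded.
      recordTiming('mailgun.send', Date.now() - startedAt, !err);
      if (!err) {
        return callback(null, { via: 'mailgun', attempts: attempt, result });
      }
      if (attempt < maxAttempts) {
        return tryOnce(); // production code would usually back off here
      }
      fallback(email, (fallbackErr, fallbackResult) => {
        if (fallbackErr) return callback(fallbackErr);
        callback(null, { via: 'fallback', attempts: attempt, result: fallbackResult });
      });
    });
  }

  tryOnce();
}

// Delivery step: only trial accounts are routed through Mailgun.
function deliverAlertEmail(accountId, email, senders, options, callback) {
  if (!usesMailgun(accountId)) {
    return senders.legacy(email, (err, result) => {
      if (err) return callback(err);
      callback(null, { via: 'legacy', attempts: 1, result });
    });
  }
  deliverWithRetries(email, senders.sendMailgunEmail, senders.legacy, options, callback);
}
```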

What’s going on above? First, we call the sendMailgunEmail function to deliver the monitoring notification email. In a quest to measure EVERYTHING, we also measure the time it takes to use the Mailgun API. And there is a series of retries that wraps this code with a fallback option, a best practice when relying on any 3rd-party system in your application.
Want to see how simple the sendMailgunEmail function is?
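
Here is a minimal sketch of such a function using only Node.js built-ins, not our production code. The form parameter names (`from`, `to`, `subject`, `text`, `html`, `h:*`, `o:testmode`, `o:tag`) and the `api:<key>` basic auth are Mailgun's; the shape of `email` and `config`, the `buildMailgunForm` helper and the `expectedStatusCodes` option are assumptions made for illustration. The numbered comments line up with the notes that follow.

```javascript
const https = require('https');

// Build the form-encoded body for Mailgun's "send message" endpoint.
function buildMailgunForm(email, config) {
  const form = new URLSearchParams();
  form.append('from', config.from);
  form.append('to', email.to);
  form.append('subject', email.subject);
  form.append('text', email.text);
  if (email.html) form.append('html', email.html);               // 2) HTML body
  for (const [name, value] of Object.entries(email.headers || {})) {
    form.append(`h:${name}`, String(value));                      // 3) custom headers
  }
  if (config.testMode) form.append('o:testmode', 'yes');          // 4) test mode
  for (const tag of email.tags || []) form.append('o:tag', tag);  // 5) tags
  return form.toString();
}

function sendMailgunEmail(email, config, callback) {
  const body = buildMailgunForm(email, config);
  const expected = config.expectedStatusCodes || [200];
  const req = https.request({
    method: 'POST',
    hostname: 'api.mailgun.net',
    path: `/v3/${config.domain}/messages`,
    auth: `api:${config.apiKey}`,
    headers: {
      'Content-Type': 'application/x-www-form-urlencoded',
      'Content-Length': Buffer.byteLength(body),
    },
  }, (res) => {
    let raw = '';
    res.setEncoding('utf8');
    res.on('data', (chunk) => { raw += chunk; });
    res.on('end', () => {
      // 1) anything outside the expected status codes is reported as an error
      if (!expected.includes(res.statusCode)) {
        return callback(new Error(`Mailgun responded with ${res.statusCode}: ${raw}`));
      }
      let parsed;
      try {
        parsed = JSON.parse(raw);
      } catch (err) {
        return callback(err);
      }
      callback(null, parsed);
    });
  });
  req.on('error', callback);
  req.end(body);
}
```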

1) The request function abstracts away some of the complexity of making HTTP requests and handling response codes, which is what expectedresponsecodes is used for.
2) Easily deal with sending HTML, which is generally a pain.
3) Merge in headers, an easy way to enrich emails. These headers allow us to create customized emails for each user with details about their alert, like alarmId, checkId, and tenantId. You can pass almost anything in these headers to customize your emails, including details that would normally just clutter the body of the email.
4) Easily flip a bit to send test emails.
5) User-defined tags allow the tracking and analytical faceting used in our production testing scenarios.

Look at how easy this was. We’ve fully automated email creation, delivery, and tracking using Mailgun, something that is difficult with a traditional DIY setup.

Summary

Making a switch to a hosted offering is a big decision for any application, especially for something as critical as email notifications. With a rigorous test plan and fanatical support from the Mailgun team on everything from setting up DKIM records to API integration, we were able to successfully transfer all our email notifications for Cloud Monitoring to Mailgun. As an added bonus, the Mailgun team sits right next to me ;).