Unscheduled downtime - July 20, 2013

by Colin Nederkoorn

Customer.io’s management interface and inbound event processing experienced 4 hours 52 minutes of downtime starting at 5:22 pm Eastern on Saturday, July 20th. Here are the events that caused and prolonged the outage.

As you may know, we have just moved to a new data center. We’re now at OVH in Canada and have been pleased with the responsiveness of the staff and the increased performance of the network and servers.

On Saturday we had our first experience with what happens when something goes wrong.

5:22 pm app server 1 (out of 2) has a hardware failure

OVH detected a hardware failure, and its automated systems booted the first of our two app servers into recovery mode. This on its own is fine. We have two app servers. Then…

5:23 pm app server 2 (out of 2) is hard rebooted by OVH.

Ordinarily in a reboot scenario, services will restart. Because we had migrated only the week before, we had not yet finalized all the startup scripts for the services, and the application server did not start on boot. At this point, one server was down and the other was responding with a 500 error.
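A gap like this can be caught with a small check run after every boot. The following is a minimal sketch, not the script we actually use: the function name `is_listening`, the host, and the port are all hypothetical, and it only confirms that something accepts TCP connections on the expected port.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: it only shows that *something* is accepting
    connections, not that the application is healthy.
    """
    try:
        # create_connection raises OSError if the port is closed,
        # unreachable, or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run shortly after boot, for example as `is_listening("127.0.0.1", 8080)` with whatever port the application server is assumed to use, a `False` result would be the cue to alert someone before a customer notices.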

These servers run the Customer.io management interface. They also receive and process data from our geographically distributed collection servers (we call them “pickers”).

Almost immediately a customer got in touch saying the app was down.

At this time, John is on a plane

I had spoken to John a few minutes before the outage, when he was in the airport. John is responsible for our infrastructure and has been handling ops for Customer.io. He did the migration to this new data center the week before. Now he was on a flight back to New York and out of contact. I first verified that the outage was limited and isolated. Rather than risk damaging anything, I limited myself to running diagnostics on the servers.

So what went right during this outage?

Data collection stays up & no data is lost

We collect data on clusters of geographically distributed servers. It gets queued for processing. The first thing I checked was whether or not data collection was still working. It was.

Unable to send the data on, the distributed servers held it in their queues, waiting for our servers to become responsive again.
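For readers curious how this kind of store-and-forward behavior works in general, here is a minimal sketch. It is an illustration under assumptions, not the code our pickers run: the class `StoreAndForwardQueue`, the `send` callback, and the choice of `ConnectionError` as the failure signal are all hypothetical, and a real collector would persist the queue to disk rather than hold it in memory.

```python
from collections import deque
from typing import Callable, Deque, Dict

Event = Dict[str, str]


class StoreAndForwardQueue:
    """Holds events until the downstream server accepts them (illustrative)."""

    def __init__(self, send: Callable[[Event], None]) -> None:
        # `send` is assumed to raise ConnectionError while the server is down.
        self._send = send
        self._pending: Deque[Event] = deque()

    def enqueue(self, event: Event) -> None:
        """Accept an event locally, whether or not the server is reachable."""
        self._pending.append(event)

    def flush(self) -> int:
        """Try to deliver queued events in order; return how many were sent."""
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                self._send(event)
            except ConnectionError:
                # Server still unavailable: keep the event and stop for now.
                break
            # Remove the event only after a successful send, so nothing is lost.
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self) -> int:
        return len(self._pending)
```

The key design choice is that an event leaves the queue only after it has been sent successfully, so calling `flush()` repeatedly during an outage costs nothing and the backlog drains in order once the server returns.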

Emails continue to be sent

We also have a set of background workers that look for people who match your campaigns and send them emails. According to our logs, people were still matching your campaigns and receiving emails.

Where do we go from here?

Overall, I’m happy that critical pieces of our infrastructure kept operating and there was no data loss. The management site was inoperable for several hours on a Saturday evening, but most of you probably had no idea.

We’ll continue to make each piece of the infrastructure that delivers Customer.io more resilient – including having multiple people on the team able to respond to an outage.

I know we can do better than the 4 hours 52 minutes the site was down for you. We’ll be working to improve that. Thanks for your trust and confidence in us to deliver the Customer.io service to you.