Loggly is a cloud service that offers, as one of its services, Systems Monitoring & Alerting.

Systems Monitoring & Alerting

Alerting on log events has never been so easy. Alert Birds will help you eliminate problems before they start by allowing you to monitor for specific events and errors. Create a better user experience and improve customer satisfaction through proactive monitoring and troubleshooting. Alert Birds are available to squawk & chirp when things go awry.

Yet Loggly itself suffered an extended outage. It was triggered by AWS rebooting 100% of Loggly's servers, but that accounts for only half of the downtime. The other half was due to Loggly not knowing the service was down.

Loggly's Outage for December 19th

Posted 19 Dec, 2011 by Kord Campbell

Sometimes there's just no other way to say "we're down" than just admitting you screwed up and are down. We're coming back up now, and in theory by the time this is read, we'll be serving the app again normally. There will be a good amount of time until we can rebuild the indexes for historic data of our paid customers. This is our largest outage to date, and I'm not at all proud of it.

...

Loggly uses a variety of monitoring mechanisms to ensure our services are healthy. These include, but are not limited to, extensive monitoring with Nagios, external monitors like Zerigo, and using a slew of our own API calls for monitoring for errors in our logs. When the mass reboot occurred we failed to alert because a) our monitoring server was rebooted and failed to complete the boot cycle, b) the external monitors were only set to test for pings and established connections to syslog and http (more about that in a moment), and c) the custom API calls using us were no longer running because we were down.

Combined, these failures effectively prevented us from noticing we were down. This in and of itself was the cause of at least half our down time, and to me, the most unacceptable part of this whole situation.
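All three failure modes above have one thing in common: the monitor went quiet, and silence was read as "everything is fine". One common guard against that is a heartbeat check (a "dead man's switch") that runs somewhere independent and treats silence as a failure. The sketch below is only an illustration of that idea, not anything Loggly describes using; the class name `HeartbeatWatchdog`, the source names, and the 60-second threshold are all assumptions made up for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HeartbeatWatchdog:
    """Hypothetical dead man's switch: flags monitors that have gone silent."""

    max_silence_seconds: float
    # Maps a monitor name to the timestamp of its most recent heartbeat.
    last_seen: Dict[str, float] = field(default_factory=dict)

    def beat(self, source: str, now: Optional[float] = None) -> None:
        """Record that `source` (e.g. a monitoring server) just checked in."""
        self.last_seen[source] = time.time() if now is None else now

    def silent_sources(self, now: Optional[float] = None) -> List[str]:
        """Return every source whose last heartbeat is older than the limit."""
        current = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if current - seen > self.max_silence_seconds
        )


# Example with explicit timestamps so the outcome is deterministic.
watchdog = HeartbeatWatchdog(max_silence_seconds=60)
watchdog.beat("monitoring-server", now=0)   # last heard from at t=0
watchdog.beat("api-log-checks", now=100)    # last heard from at t=100
stale = watchdog.silent_sources(now=120)    # only the first source is stale
```

The watchdog only helps if it runs on infrastructure that does not share a fate with the thing it watches; hosted on the same boxes, it would have been rebooted along with everything else.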

The other half of the outage was caused by Loggly not testing for a 100% reboot of all machines.

The Human Element

The other cause to our failures is what some of you on Twitter are calling "a failure to architect for the cloud". I would refine that a bit to say "a failure to architect for a bunch of guys randomly rebooting 100% of your boxes". A reboot of all boxes has never been tested at Loggly before. It's a test we've failed completely as of today. We've been told by Amazon they actually had to work hard at rebooting a few of our instances, and one scrappy little box actually survived their reboot wrath.

One lesson Loggly learned, which some of my software buddies and I are applying in a software design of our own, is to use more than one monitoring solution.

The second step is to ensure more robust external monitoring. With multiple deployments, this issue becomes less of an issue, but clearly we need more reliable checks than what we rely on with Zerigo or other services. Sorry, but simple HTTP checks, pings and established connections to a box do not guarantee it's up!
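To make the last point concrete: a check that only proves a port accepts connections says nothing about whether the service does its job. A more robust external check exercises the whole path, for example by sending a uniquely tagged event and confirming it can be found again. The sketch below assumes two caller-supplied functions, `send_event` and `search`, which are placeholders for whatever ingest and query calls a real service exposes; they are not Loggly API names, and the timeout values are arbitrary.

```python
import time
import uuid
from typing import Callable


def end_to_end_check(
    send_event: Callable[[str], object],
    search: Callable[[str], bool],
    timeout_seconds: float = 30.0,
    poll_interval: float = 1.0,
) -> bool:
    """Return True only if a freshly written marker can be read back.

    `send_event` and `search` are hypothetical hooks supplied by the caller;
    any exception they raise is treated as a failed check.
    """
    marker = f"healthcheck-{uuid.uuid4()}"  # unique per run, so stale data cannot pass
    try:
        send_event(marker)
    except Exception:
        return False  # could not even write: the service is not healthy

    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if search(marker):
                return True  # the event made it through ingest and indexing
        except Exception:
            pass  # a failing query counts as "not found yet"
        if time.monotonic() >= deadline:
            return False  # accepted the write but never served it back
        time.sleep(poll_interval)


# Example against an in-memory stand-in for a real service.
store = set()
healthy = end_to_end_check(store.add, lambda m: m in store, timeout_seconds=1.0)
```

A service that accepts connections but silently drops events passes a ping or connection test and fails this one, which is exactly the gap the quoted paragraph complains about.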