We were down...

On 23 February 2013, we’ve been down for 8 hours. We received a monitoring alert (and some e-mails and tweets as well) at around 9:50 PM (CET) notifying us of this event. After looking into it, we discovered there was not much to do about this besides sit and wait… We do wish to apologize and clarify the events for this outage (our first since July 2012).

The root cause of this outage was a global outage of Windows Azure Storage. This service is one of the code building blocks of Windows Azure and has never in the past failed (that’s 4 years of no issues). This storage system works based on the HTTP protocol and has both http and https endpoints. Most applications built on top of Windows Azure, including the platform’s own building blocks, are using the https endpoints to prevent transport-level attacks, including MyGet. Unfortunately, most clusters of Windows Azure Storage were running an expired SSL certificate on this https endpoint, the reason for this global outage of Windows Azure and every application hosted on the platform, including MyGet and the official NuGet package source and every service directly depending on www.nuget.org.

MyGet runs in the Windows Azure Europe West region (Amsterdam), with a cold disaster recovery location in the Windows Azure Europe North region (Dublin). We can fail over to this location within hours if compute or storage in the main datacenter location fail. In case of a serious outage, we can restore a disaster recovery copy of our services in any Windows Azure region around the globe. Unfortunately, there’s nothing much we can do in case of a global outage…

Our status page can always be found at http://status.myget.org. For reference, here are our uptime numbers for the past year.