Month: March 2012

This issue appears to be due to a time calculation that was incorrect for the leap year

Later, on March 9, there was a detailed follow-up on why this downtime actually occurred. As speculated, it was a leap year bug that caused this issue. Here’s an excerpt:

In Windows Azure, cloud applications consist of virtual machines running on physical servers in Microsoft data centers. Servers are grouped into “clusters” of about 1000 that are each independently managed by a scaled-out and redundant platform software component called the Fabric Controller (FC).

Part of Windows Azure’s Platform as a Service (PaaS) functionality requires its tight integration with applications that run in VMs through the use of a “guest agent” (GA) that it deploys into the OS image used by the VMs.

Each server has a “host agent” (HA) that the FC leverages to deploy application secrets, like SSL certificates that an application includes in its package for securing HTTPS endpoints, as well as to “heart beat” with the GA to determine whether the VM is healthy or if the FC should take recovery actions.

When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UST of the current day as the valid-from date and one year from that date as thevalid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.

Learnings

There was a lot of analysis that was done after the event. Here’s what MS learnt:

Service Dashboard. The Windows Azure Dashboard is the primary mechanism to communicate individual service health to customers. However the service dashboard experienced intermittent availability issues, didn’t provide a summary of the situation in its entirety, and didn’t provide the granularity of detail and transparency our customers need and expect.

Customer Support. During this incident, we had exceptionally high call volumes that led to longer than expected wait times. While we are staffed to handle high call volumes in the event of an outage the intermittent availability of the service dashboard and lack of updates through other communication channels contributed to the increased call volume. We are evaluating our customer support staffing needs and taking steps to provide more transparent communication through a broader set of channels.

Other Communication Channels. A significant number of customers are asking us to better use our blog, Facebook page, and Twitter handle to communicate with them in the event of an incident. They are also asking that we provide official communication through email more quickly in the days following the incident. We are taking steps to improve our communication overall and to provide more proactive information through these vehicles. We are also taking steps to provide more granular tools to customers and support to diagnose problems with their specific services.

Service Credits

MS realized that money was also plays a part in an apology. An email was also sent confirming the credit to all customers.

Microsoft recognizes that this outage had a significant impact on many of our customers. We stand behind the quality of our service and our Service Level Agreement (SLA), and we remain committed to our customers. Due to the extraordinary nature of this event, we have decided to provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted.