Leap-Day Bug Fingered in Microsoft Azure Cloud Outage

Microsoft uses the tagline "I laugh in the face of unpredictability" in ads for Azure. Photos: Microsoft

Microsoft’s Azure cloud platform experienced a major outage on Wednesday, with its service management system down for several hours, reports said earlier today. But the fix did not follow the cloud script, and the problem, identified as a leap-day security certificate bug, continues.

“We have started a gradual rollout of the hotfix in North Central US sub-region. As we proceed through the rollout, we will progressively enable service management back for customers. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers,” Microsoft said earlier today.

But the situation that was supposed to be resolved quickly, thanks to the cloud, instead got worse, according to Microsoft: “We continue to work through the issues that are blocking the restoration of service management for some customers in North Central US, South Central US and North Europe sub-regions. Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers.”

At about noon Pacific Time on Wednesday, the Azure service dashboard showed multiple outages, including Access Control 2.0 in the South Central U.S. and Northern Europe, as well as SQL Azure Management Portals in at least six regions. Service management remains down globally.

Microsoft’s notification that the problem relates to a certificate issue triggered by today being Feb. 29, or Leap Day, which comes around every four years, will not go a long way toward ensuring confidence in the cloud.
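Microsoft's notice does not say how the certificate dates were calculated, so the following is only an illustrative sketch, not Microsoft's code. It shows one common way a leap-day bug can creep into an expiry-date calculation: adding one to the year while keeping the same month and day. The function names and the "valid for one year" rule are assumptions made for the example.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical naive expiry rule: same month and day, year + 1."""
    # Raises ValueError when start is Feb. 29, because the following
    # year is never a leap year and has no Feb. 29.
    return date(start.year + 1, start.month, start.day)


def safe_one_year_later(start: date) -> date:
    """Same rule, but fall back to Feb. 28 when the target date does not exist."""
    try:
        return date(start.year + 1, start.month, start.day)
    except ValueError:
        # Only reachable for Feb. 29: every other month/day exists in every year.
        return date(start.year + 1, 2, 28)


issued = date(2012, 2, 29)  # the leap day in question

try:
    naive_one_year_later(issued)
    naive_failed = False
except ValueError:
    naive_failed = True

print("naive calculation failed:", naive_failed)
print("safe expiry date:", safe_one_year_later(issued))
```

A calculation like the naive one works on every other day of the year, which is why this kind of bug can sit unnoticed until Feb. 29 arrives.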

One of the key benefits cited regarding the cloud is that a centralized platform can be patched more quickly when unexpected things like this happen. Today’s Azure outage is not following the narrative.

Amazon has had its fair share of outages too, so this is not just a Microsoft issue. But what kind of damage does it do to the cloud’s image? Is this just the nature of computing, cloud or not? Or is it reasonable to expect more from our cloud providers across the board: pricing, service availability, interoperability, etc.?

Update: As of 9am PST, March 1, most of the problems with Azure have been resolved, except for “Windows Azure Compute [South Central US]” (according to the Service Dashboard).