Details of the December 28th, 2012 Windows Azure Storage Disruption in US South

Introduction

On December 28th, 2012 there was a service interruption that affected 1.8% of the Windows Azure Storage accounts. The affected storage accounts were in one storage stamp (cluster) in the U.S. South region. We apologize for the disruption and any issues it caused affected customers. We want to provide more information on the root cause of the interruption, the recovery process, what we’ve learned and what we’re doing to improve the service. We are proactively issuing a service credit to impacted customers as outlined below. We are hard at work implementing what we have learned from this incident to improve the quality of our service.

Windows Azure Overview

To help you better understand what happened, we’d first like to share some additional information on the components of Windows Azure before diving into the details of the service disruption.

Windows Azure runs many cloud services across different data centers and different geographic regions. Windows Azure Storage runs as a cloud service on Windows Azure. There are multiple physical storage service deployments per region, which we call stamps. Each storage stamp has multiple racks of storage nodes. The incident affected a single storage stamp in the US South region.

The Windows Azure Fabric Controller is a resource provisioning and management layer that manages the hardware, and provides resource allocation, deployment/upgrade, and management for cloud services on the Windows Azure platform. There is a separate Fabric Controller per storage stamp in order to prevent certain classes of errors from affecting more than one stamp, which helped in this incident.

The Fabric Controller provides node management, network configuration, health monitoring, starting/stopping of service instances, and service deployment for the storage stamp. In addition, the stamp retrieves network topology information, physical layout of the nodes, and hardware configuration of the storage nodes from the Fabric Controller. The storage service is responsible for managing the replication and data placement across the disks and load balancing the data and application traffic in the storage cluster.

To prevent the Fabric Controller from erroneously performing an action like reformatting disks, the Fabric Controller has a line of defense called “node protection”. Our storage service leverages this capability to protect all its nodes. A storage stamp is meant to always be protected but there are legitimate scenarios where we want to turn the protection off for a given storage node. The most common one is when a machine is serviced (out for repair). Therefore, when a node comes back from repair the protection is turned off while it is being prepared to be assimilated back into the fabric. After the preparatory actions are done, the node protection is turned back on for that storage node.
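The node protection behavior described above can be illustrated with a minimal sketch. It is not the Fabric Controller's actual code: the names Node, prepare, and return_from_repair are hypothetical, and the sketch assumes that protection is a simple per-node flag checked before any destructive action.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of a storage node; not the real Fabric Controller type."""
    name: str
    protected: bool = True
    formatted: bool = False


class ProtectedNodeError(Exception):
    """Raised when a destructive action targets a protected node."""


def prepare(node: Node) -> None:
    # Destructive step (stands in for the quick format): refuse if protected.
    if node.protected:
        raise ProtectedNodeError(node.name)
    node.formatted = True


def return_from_repair(node: Node) -> None:
    # Protection is turned off only for the duration of the preparation.
    node.protected = False
    try:
        prepare(node)
    finally:
        # Always turn protection back on, even if preparation fails.
        node.protected = True
```

In this sketch, re-enabling protection in a finally block means a node cannot leave the repair path unprotected. The incident described below involved nodes where that final step was missed.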

Root Cause

There were three issues that when combined led to the disruption of service.

First, within the single storage stamp affected, some storage nodes, brought back (over a period of time) into production after being out for repair, did not have node protection turned on. This was caused by human error in the configuration process and led to approximately 10% of the storage nodes in the stamp running without node protection.

Second, our monitoring system for detecting configuration errors associated with bringing storage nodes back in from repair had a defect which resulted in failure of alarm and escalation.
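As an illustration only, a configuration check of this kind could look like the sketch below. The inventory format, the field names (state, protected), and the alert callback are assumptions made for the example and do not describe the actual monitoring system.

```python
def find_unprotected(nodes):
    """Return names of in-production nodes whose protection flag is off."""
    return [n["name"] for n in nodes
            if n["state"] == "production" and not n["protected"]]


def audit(nodes, alert):
    """Run the check and raise one alert if any node is misconfigured."""
    bad = find_unprotected(nodes)
    if bad:
        alert(f"{len(bad)} production node(s) unprotected: {', '.join(bad)}")
    return bad


# Hypothetical inventory: one node in production was left unprotected.
inventory = [
    {"name": "node-01", "state": "production", "protected": True},
    {"name": "node-02", "state": "production", "protected": False},
    {"name": "node-03", "state": "repair", "protected": False},
]
audit(inventory, alert=print)
```

A node that is still out for repair is expected to be unprotected, so the sketch only flags nodes that are back in production.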

Finally, on December 28th at 7:09am PST, a transition to a new primary node was initiated for the Fabric Controller of this storage stamp. A transition to a new primary node is a normal occurrence that happens often for any number of reasons including normal maintenance and hardware updates. During the configuration of the new primary, the Fabric Controller loads the existing cluster state, which in this case resulted in the Fabric Controller hitting a bug that incorrectly triggered a ‘prepare’ action against the unprotected storage nodes. A prepare action makes the unprotected storage nodes ready for use, which includes a quick format of the drives on those nodes. Node protection is intended to ensure that the Fabric Controller will never format protected nodes. Unfortunately, because 10% of the active storage nodes in this stamp had been incorrectly flagged as unprotected, they were formatted as a part of the prepare action.

Within a storage stamp we keep 3 copies of data spread across 3 separate fault domains (on separate power supplies, networking, and racks). Normally, this would allow us to survive the simultaneous failure of 2 nodes with your data within a stamp. However, the reformatted nodes were spread across all fault domains, which, in some cases, led to all 3 copies of data becoming unavailable.
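To see why only some data lost all 3 copies, consider a simplified model. It assumes, purely for illustration, that the reformatted nodes were spread evenly across the 3 fault domains and that each replica is placed on a uniformly chosen node in its domain. The real placement policy and the real fraction of affected data are not stated here. The sketch enumerates every possible placement and counts those where all replicas land on reformatted nodes.

```python
from itertools import product


def fraction_fully_lost(nodes_per_domain: int,
                        formatted_per_domain: int,
                        domains: int = 3) -> float:
    """Fraction of replica placements with every copy on a formatted node."""
    # Label the formatted nodes 0..k-1 in each fault domain.
    formatted = set(range(formatted_per_domain))
    # One replica per fault domain, any node within the domain.
    placements = list(product(range(nodes_per_domain), repeat=domains))
    lost = sum(all(node in formatted for node in p) for p in placements)
    return lost / len(placements)


# 20 nodes per domain with 2 formatted (10%) gives 0.1 ** 3 of placements.
print(fraction_fully_lost(20, 2))  # 0.001
```

Under these assumptions roughly 0.1% of placements lose every copy, which matches the observation that the loss of all 3 copies happened only in some cases.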

Recovering the Storage Service

We determined that the nodes had been reformatted and were considering two approaches for restoring service:

Recover Data in Place in US South (Plan A) - Restore volumes and all of the disks on the storage stamp in US South to restore availability with no data loss.

Geo Failover to US North (Plan B) - Fail over the customers in the storage stamp from US South to US North (geo-replicated location) to restore availability. This would have resulted in:

Loss of very recent updates to Windows Azure Blobs and Tables, since geo-replication is asynchronous.

Loss of Windows Azure Queue data for customers in the stamp, since Windows Azure Queues have only local redundancy at this time.

Our primary goal is to preserve customer data so we went with Plan A with the idea that we would fall back to Plan B if we were unable to execute Plan A within a reasonable period of time.

Next, we’ll describe the steps we took to recover the storage service in place, and then go into more detail about geo-replication and the failover process.

Restoring the Data in Place (Plan A)

Though very targeted, this outage was a unique situation. In order to recover the storage stamp in US South we had to design an approach to recover the data on the volumes in place. Our approach also needed to be efficient to avoid copying data from one drive to the next.

The Fabric Controller had performed a ‘quick format’ of the volumes on these storage nodes to prepare them for use. This type of format creates all new metadata files with an initial MFT (Master File Table). We were able to recover the MFT, except for a few initial records. After recovering the MFT, running ‘chkdsk /f’ on the volume will recreate the entire volume except the few initial MFT records that were overwritten by the quick format.

These initial MFT records contained the root of the volume’s directory structure, which was lost. When we performed ‘chkdsk /f’ it placed the files for which it was unable to find a parent directory in the “found” directory on the volume. Each of the files in the found directory represented an extent (file containing customer data). Our naming convention for these files includes a checksum which contains the full path of the original file. This allowed us to determine the proper full path of each extent to fully recover the original directory structure and move the recovered extents from the “found” directory to the proper location.
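The relocation step can be sketched as follows. The actual naming convention is not given in this post, so the sketch invents one for illustration: each file name is assumed to be the URL-safe base64 encoding of the file's original relative path. The function names and the directory layout are likewise hypothetical.

```python
import base64
import shutil
from pathlib import Path


def encode_name(rel_path: str) -> str:
    # Hypothetical convention: the file name encodes the original path.
    return base64.urlsafe_b64encode(rel_path.encode()).decode()


def decode_name(name: str) -> str:
    return base64.urlsafe_b64decode(name.encode()).decode()


def restore_from_found(volume: Path, found_dir: str = "found") -> list[str]:
    """Move each orphaned file back to the path recovered from its name."""
    restored = []
    for orphan in sorted((volume / found_dir).iterdir()):
        rel = decode_name(orphan.name)
        target = volume / rel
        # Recreate the lost directory structure as needed.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(orphan), str(target))
        restored.append(rel)
    return restored
```

Because the original location is derived entirely from the file name, no surviving directory metadata is needed, which is the property that made this recovery possible.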

One of the disks on each storage node is a journal drive which needed more work to recover. The journal drive is used to quickly commit writes to a storage node and acknowledge the write back to the customer while the storage node puts those writes to one of the many destination drives. For the journal volumes we needed to write a tool to find the journal file signature on the volume in order to copy the journal for recovery.

At 32 hours into the incident, we had confirmed that we could use the above techniques to fully recover the volumes from a storage node in our test environment, which gave us confidence to continue with Plan A.

We tested our solution before putting it into production and discovered an additional complexity that added time to the recovery. To perform the recovery of the volumes, we had to lock a volume. When the Fabric Controller discovered that the volume was locked it would consider the node as bad and would reboot it, which would have been an issue when we were in the middle of a ‘chkdsk /f’ to recover the volume. So, we needed to put the nodes into a protected state that removed them from any Fabric Controller service management activity during the data recovery period.

Additionally, when bringing back nodes after they are out for repair, the Fabric Controller expects all of the disks to be functional. The problem is that we were bringing back nodes that might have one or more bad disks, since we do not send a storage node out for repair until there are multiple failed disks on the node. Under normal circumstances, the Fabric Controller would not allow these nodes to come back online to be part of the functioning storage stamp. To get around this, we rolled out a hotfix to the Fabric Controller for this storage stamp allowing nodes with missing disks to be brought back into the fold.

By 2:00pm PST, 12/30/12, all tools and processes had been verified and we were able to start the full recovery of the storage stamp in production.

The storage recovery was methodically performed on all of the storage nodes and double checked, and the nodes were then brought back under the management of the Fabric Controller. We then loaded all of the partitions for the storage stamp, and by 9:32pm PST, 12/30/12 the storage stamp was live and serving traffic for customers.

We always had the full list of extents that were in the storage stamp, and all extents were recovered. In addition, all checksums were validated for the extents on the stamp, which confirmed that all data was recovered without any data loss.

Geo Redundant Storage and Geo-Failover (Plan B)

A common inquiry from customers relating to this outage focuses on why we did not immediately initiate Geo-Failover of the service once the full impact of the outage was understood. To address this inquiry, let us first provide an overview of Geo Redundant Storage.

What is Geo Redundant Storage?

Geo-replication is configured when a storage account is created. The location where a customer chooses to create the storage account is the ‘primary’ location. The location where a customer’s data is geo-replicated is referred to as the secondary location. The secondary location is automatically determined based on the location of the primary as described here.

Geo Redundant Storage provides our highest level of durability by storing customer data in two locations: a primary location and a second location hundreds of miles away from the primary location. All Windows Azure Blob and Table data is geo-replicated. Queue data is not geo-replicated at this time. With Geo Redundant Storage we maintain 3 copies (replicas) of customer data in both the primary location and in the secondary location (for a total of 6 copies). This ensures that each data center can recover from common hardware failures (e.g. bad drives, bad nodes, rack failures, network or power issues). This also provides a geo-replicated copy of the data in a second datacenter in case of a major disaster. Geo-replication uses asynchronous replication where the geo-replication is done in the background, so there is no impact on the update performance for the storage account. The two locations only have to talk to each other to geo-replicate the updates. They do not have to talk to each other to recover. This is important, because it means that if we have to do an immediate failover from the primary to the secondary stamp, then all the data that had been committed to the secondary location via geo-replication will already be durable there, and the customer’s application only has to handle data lost from recent changes that have not yet been geo-replicated.
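The effect of asynchronous replication on failover can be shown with a toy model. It is a sketch under simplifying assumptions, not the real replication protocol: the class GeoReplicatedStore and its methods are hypothetical, each location is a plain dictionary, and updates are shipped in commit order.

```python
from collections import deque


class GeoReplicatedStore:
    """Toy model of a primary with an asynchronously updated secondary."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        # Updates committed on the primary but not yet geo-replicated.
        self.pending = deque()

    def write(self, key, value):
        # The write is acknowledged once the primary has it; the
        # secondary is updated later, in the background.
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate(self, max_updates=None):
        # Ship up to max_updates pending updates to the secondary.
        n = len(self.pending)
        if max_updates is not None:
            n = min(n, max_updates)
        for _ in range(n):
            key, value = self.pending.popleft()
            self.secondary[key] = value

    def failover(self):
        # The secondary becomes the new primary without contacting the
        # old one; anything still pending is lost.
        lost_updates = list(self.pending)
        self.pending.clear()
        self.primary = self.secondary
        return lost_updates
```

In this model the secondary never needs the old primary in order to take over, and the only data at risk is whatever sits in the pending queue at the moment of failover.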

Failover Process

Failover is the process of changing a customer’s primary storage account location from what was the primary storage stamp to the secondary storage stamp.

For the geo-replication in production today, the failover is at the storage stamp level, not at the storage account level. This means that if we need to perform a geo failover, we would fail over all the customers in a single storage stamp from the old primary stamp to the secondary stamp (new primary). We strive to minimize customer data loss, so we only perform a failover if there is a major disaster on the primary for which we are not able to recover within a few days.

Why was Geo-Failover not triggered for this Disruption?

Since geo-replication is asynchronous, if we had failed over from US South to US North, recent updates to the storage accounts would have been lost, as expected. For the storage stamp that was impacted, we estimate that 8 GB of recent data updates would have been lost across all customers in the stamp. In addition, Windows Azure Queues are not geo-replicated (we are working to deliver this in CY13). So, if we had performed the failover, all Windows Azure Queue data would have been lost for storage accounts in the affected stamp. In addition, all customers that chose to have Locally Redundant Storage would have lost their data.

Some customers expressed a desire to recover the stamp in place, optimizing for data durability over availability. Other customers expressed a desire to optimize for availability and to fail over quickly. Since we believed we could bring back the primary storage stamp in US South without any data loss in a few days, we prioritized data durability.

Improving the Service

After an incident occurs, we take the time to analyze the incident and the ways we can improve our engineering, operations and communications. To learn as much as we can, we do a root cause analysis and analyze all aspects of the incident so we can improve the reliability of our platform for our customers.

This analysis is organized into four major areas, looking at each part of the incident lifecycle as well as the engineering process that preceded it:

Detection – how to rapidly surface failures and prioritize recovery

Recovery – how to reduce the recovery time and impact on our customers

Prevention – how the system can avoid, isolate, and/or recover from failures

Response – how to support our customers during an incident

Detection

Our monitoring system for detecting configuration errors associated with bringing storage nodes back in from repair has been fixed.

Recovery

A major action item for us is to allow customers to choose between durability and availability, and to enable customers to build resilient applications. We are working towards this through the following future features:

Read-Only Access to Geo-Replicated Storage Accounts - We plan to enable read-only access to a customer’s storage account from the secondary location.

Customer Controlled Failover for Geo-Replicated Storage Accounts – We plan to enable customers to prioritize service availability over data durability based on their individual business needs. We plan to provide an API to allow customers to trigger the failover of a storage account.

Prevention

We have fixed the bug in the Fabric Controller that resulted in storage nodes being driven to a prepare state. We are also re-examining our operational processes to ensure appropriate training is ongoing and that processes are continuously improving.

Response

On December 28, 2012, from 7:30 am (PST) to approximately 9:00 am (PST) the Primary Service Health Dashboard was unavailable, because it relied on data in the affected storage stamp. The initial dashboard update related to the outage was attempted at 8:45 am (PST) and failed. The failover site was activated at 8:45 am (PST) and was successful at 9:01 am (PST).

We have already improved the multi-level failover procedures for the Azure service dashboard, to mitigate the impact in case a similar incident should occur.

We do our best to post what we know, real-time, on the Windows Azure dashboard.

We are continuously taking steps to improve the Windows Azure Platform and our processes so that similar trigger conditions do not cause a live site incident in the future. Many of those improvements, while targeted at fixing issues, improving monitoring and alerting, and strengthening overall platform resiliency, will also improve our capability to provide communications that are more detailed and timely during an outage. These improvements will also help provide better visibility to an ETA for resolution.

Service Credits

We recognize that this service interruption had a significant impact on affected customers. Due to the extraordinary nature and duration of this event we are providing a 100% service credit to all affected customers on all storage capacity and storage transaction charges for the impacted monthly billing period.

These credits will be applied proactively and will be reflected on a billing period subsequent to the affected billing period. Customers who have additional questions can contact Windows Azure Support for more information.

Conclusion

The Windows Azure team will continue to review the issues outlined above, as well as the impact to our customers, over the coming weeks and take all steps to improve our service. We sincerely regret the impact this outage had on our customers and will continue to work diligently to provide a highly available service that meets your business needs.