The Azure outage of 2017.03.31

Starting at 13:50 UTC on 31 Mar 2017, a subset of customers in Japan East may experience difficulties connecting to their resources hosted in this region. Engineers have identified the underlying cause as a loss of cooling, which caused some resources to undergo an automated shutdown to avoid overheating and to ensure data integrity and resilience. Engineers have restored the cooling units and are working to recover the affected resources; they will then validate control plane and data plane availability for all affected services. Some customers may already see signs of recovery. The next update will be provided in 60 minutes or as events warrant.

RCA – Cooling Event – Japan East

Summary of impact: Between 13:28 UTC and 22:16 UTC on 31 Mar 2017, a subset of customers in the Japan East region may have experienced unavailability of Virtual Machines (VMs), VM reboots, degraded performance, or connectivity failures when accessing those resources and/or service resources dependent on the Storage service in this region.

Root cause and mitigation: Initial investigation revealed that one RUPS (rotary uninterruptible power supply) system failed in a manner that caused the power distribution feeding all of the air handler units (AHUs) in the Japan East datacenter to fail. With the air handlers down, the temperature continued to rise throughout the entire datacenter. The Japan East region is managed by a third-party vendor, which owns three dedicated security spaces at the location and reported to Microsoft that all of those spaces were impacted. The cooling system is designed for N+1 redundancy (also called parallel redundancy), and the power distribution design was running at N+2. Microsoft and the third-party vendor are still investigating why the single faulty RUPS in the N+2 parallel line-up cut off all power supply to the AHUs. As part of standard monitoring, Azure engineers received alerts for availability drops in this region and identified the underlying cause as the failure within the power distribution system, which left the cooling system in this datacenter without power. As a consequence of the cooling system going down, some resources were automatically shut down to avoid overheating and to ensure data integrity and resilience. At 14:12 UTC, the facility team (the third-party vendor) and Microsoft's site services personnel were onsite and restarted the cooling system air handlers, using outside airflow to force-cool the datacenter. In parallel, multiple Microsoft service teams prepared to bring systems back online in a controlled process, to prevent automated processes from destabilizing neighboring devices. At 16:08 UTC, temperature readings were back within operational ranges, and power-up began using safe power recovery procedures.
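The protective behaviour described above — shutting resources down once cooling is lost, before heat can threaten data integrity — can be sketched as a simple threshold monitor. This is a hypothetical illustration; the threshold value, `Host` model, and function names are assumptions, not Azure's actual implementation:

```python
# Hypothetical sketch of temperature-triggered protective shutdown.
# The threshold and host model are illustrative assumptions, not Azure internals.

SHUTDOWN_THRESHOLD_C = 35.0  # assumed inlet-temperature limit for this sketch

class Host:
    def __init__(self, name):
        self.name = name
        self.running = True

    def shutdown(self):
        # A real implementation would flush caches and commit pending writes
        # before power-off, to preserve data integrity.
        self.running = False

def check_thermals(hosts, inlet_temp_c):
    """Shut down running hosts once the inlet temperature exceeds the limit."""
    stopped = []
    if inlet_temp_c > SHUTDOWN_THRESHOLD_C:
        for host in hosts:
            if host.running:
                host.shutdown()
                stopped.append(host.name)
    return stopped
```

Shutting down automatically trades availability for integrity: a clean stop is recoverable, while data corrupted by overheating hardware may not be.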
A thorough health check was completed after the RUPS system and cooling system were restored; any suspect or failed components were isolated and replaced, and are being sent for analysis. At 16:53 UTC, engineers confirmed that approximately 95% of all switches and network devices had been restored successfully, and power-up began on the impacted scale units that host the Software Load Balancing (SLB) services and the control plane. At 17:16 UTC, the majority of the core infrastructure was back online, and networking engineers began restoring the SLB services in a controlled process so that the SLB programming could establish a quorum promptly. Once SLB was up and running, engineers confirmed at 18:51 UTC that the majority of services had recovered automatically and successfully. Residual impact to some Virtual Machines was found; engineers investigated and continued recovering the impacted Virtual Machines, and in parallel notified the customers who had experienced residual VM impact. At 22:16 UTC, engineers confirmed that Storage and all storage-dependent services had recovered successfully.
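The controlled recovery sequence above — network devices first, then SLB and the control plane, then storage and the VMs that depend on it — is essentially a dependency-ordered bring-up. A minimal sketch of that idea, where the stage names and dependency edges are illustrative assumptions rather than Azure's actual recovery plan:

```python
# Hypothetical sketch of a staged, dependency-ordered power-up.
# Stage names and edges are illustrative, not Azure's actual recovery plan.

RECOVERY_DEPENDENCIES = {
    "network": [],                        # switches/network devices come first
    "control_plane": ["network"],
    "slb": ["network"],                   # software load balancing needs the network
    "storage": ["control_plane", "slb"],
    "virtual_machines": ["storage"],      # VMs depend on the Storage service
}

def recovery_order(deps):
    """Topologically sort stages so nothing starts before its dependencies."""
    order, done = [], set()

    def visit(stage):
        if stage in done:
            return
        for dep in deps[stage]:
            visit(dep)
        done.add(stage)
        order.append(stage)

    for stage in deps:
        visit(stage)
    return order
```

Bringing stages up in dependency order, rather than all at once, is what keeps automated recovery processes from thrashing against services whose own dependencies are still down.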

Next steps: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future; in this case, those steps include (but are not limited to):

1. The failed RUPS unit is being sent off for analysis. Root cause analysis continues with site operations, facility engineers, and equipment manufacturers to further mitigate the risk of recurrence.
2. Reviewing the Azure services impacted by this incident so that they can tolerate incidents of this sort with minimal disruption, by maintaining service resources across multiple scale units or implementing a geo-redundancy strategy.
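The second follow-up item — spreading service resources across multiple scale units or regions — pays off only if clients can fail over to a healthy replica when the primary is unreachable. A minimal client-side sketch of that pattern; the region names and the injected `fetch` callable are hypothetical, not an Azure API:

```python
# Hypothetical sketch of client-side regional failover.
# Region names and the fetch callable are illustrative assumptions.

REGIONS = ["japaneast", "japanwest"]  # primary first, then a geo-paired fallback

def fetch_with_failover(fetch, regions=REGIONS):
    """Try each region in order; return the first successful response."""
    last_error = None
    for region in regions:
        try:
            return fetch(region)
        except ConnectionError as exc:
            last_error = exc  # this region is unavailable; try the next one
    raise last_error  # every region failed; surface the last error
```

With resources replicated this way, a single-datacenter cooling failure like this one degrades into a failover event for clients rather than an outage.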