Microsoft explains one Azure authentication outage as another one happens

In a stroke of bad timing that would be comical if it weren’t so annoying, Microsoft’s multifactor authentication (MFA) system, used for Azure, Office 365, and Dynamics, has gone down for a second time this month, just hours after the company published its findings on a 14-hour outage on November 19.

In the November 19 incident, the Azure Active Directory Multifactor Authentication services went offline just before 05:00 UTC and remained nonfunctional until just before 19:00 UTC. The servers initially affected were those serving the Europe and Middle East region and the Asia-Pacific region; as those regions woke up and tried to authenticate, the servers became overloaded and went down. Microsoft tried to redirect some authentication attempts to US servers, but this merely overloaded those, too.
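The way a redirect can spread an overload rather than relieve it is easy to see in a toy model. The sketch below is purely illustrative: the function name, the region labels, and every traffic and capacity figure are invented for the example, and nothing here reflects Microsoft's actual routing logic or numbers. It assumes that traffic from any overloaded region is split evenly among the regions still standing.

```python
def simulate_failover(load, capacity):
    """Toy model of cascading overload (hypothetical, not Microsoft's logic).

    load and capacity map a region name to requests per second.
    Any region whose load exceeds its capacity fails, and its traffic
    is split evenly among the regions that are still up.
    Returns the regions that failed, in the order they went down.
    """
    load = dict(load)  # work on a copy so the caller's data is untouched
    failed = []
    while True:
        overloaded = [r for r in load
                      if r not in failed and load[r] > capacity[r]]
        if not overloaded:
            return failed
        failed.extend(overloaded)
        survivors = [r for r in load if r not in failed]
        if not survivors:
            return failed
        # Redirect the orphaned traffic to whatever is still running.
        orphaned = sum(load[r] for r in overloaded)
        for r in overloaded:
            load[r] = 0
        for r in survivors:
            load[r] += orphaned / len(survivors)


# Made-up figures: two regions slightly over capacity, one with headroom.
capacity = {"Europe/Middle East": 100, "Asia-Pacific": 100, "US": 100}
load = {"Europe/Middle East": 120, "Asia-Pacific": 110, "US": 60}
print(simulate_failover(load, capacity))
# ['Europe/Middle East', 'Asia-Pacific', 'US']
```

With these invented numbers, the US region has spare capacity on its own, but absorbing the traffic from the two failed regions pushes it far past its limit, so it goes down as well.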

The company’s subsequent analysis has shown that three separate bugs came together to cause the problems. On November 19, a code change that had been progressively deployed over the previous six days provoked a cascade of failures. Above a certain traffic level, the new code caused a significant increase in latency between front-end servers and cache servers. This in turn revealed a race condition in the back-end servers, causing them to reset the front-end servers over and over. That then revealed a third issue: the back-end servers would create more and more processes, eventually starving themselves of resources and becoming unresponsive.

Today’s problems are still under investigation. The MFA servers have been timing out since 14:25 UTC, causing login attempts to fail when MFA is in use. Currently, the company believes that the resolution of an earlier DNS error has produced a barrage of authentication attempts, essentially flooding the MFA system with more requests than it can handle.