‘Facebook is down so I had to actually talk to my family! They seem nice.’

‘Facebook is down! I may have to talk to people! No wait, Twitter is OK. Phew!’

But the reality was that the paragon of uptime, capacity scaling and self-reliance experienced an outage that left millions unable to access its services.

Starting around 09:00 US Pacific Time (16:00 UTC) on 13 March 2019, the outage persisted until resolution around 23:00 (06:00 UTC the following day). That is a long working day, in any time zone.

So what happened?

Facebook is fabled for its resilience, automation and intelligent operation.

Speaking in November 2018 at the Data Centres Ireland conference, Niall McEntegart, data centre operations director, EMEA, Facebook, said that the ratio of servers to technicians, with the aid of automation and orchestration technologies, had exceeded 40,000 to 1!

He talked about Facebook Auto-Remediation (FBAR), the automation system that allows the company to manage outages and uptime.

A 2015 article on the company pages says FBAR retrieves a list of outages, calculates remediation workflows, inserts the flows into a job queue, and then returns to step one, endlessly.
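That 2015 description amounts to a simple control loop. A minimal sketch of the idea, assuming entirely hypothetical helpers (get_outages, plan_remediation, a plain job queue — none of these names come from Facebook's actual system):

```python
import queue

def get_outages():
    """Hypothetical: fetch the current list of detected outages."""
    return [{"host": "host-001", "fault": "disk-failure"}]

def plan_remediation(outage):
    """Hypothetical: map a detected fault to an ordered remediation workflow."""
    return ["drain", "repair-" + outage["fault"], "undrain"]

job_queue = queue.Queue()

def fbar_loop(iterations=1):
    """The loop as described: list outages, plan fixes, enqueue, return to step one."""
    for _ in range(iterations):  # a production loop would be `while True`
        for outage in get_outages():
            job_queue.put((outage["host"], plan_remediation(outage)))

fbar_loop()
host, steps = job_queue.get_nowait()
print(host, steps)  # → host-001 ['drain', 'repair-disk-failure', 'undrain']
```

The point of the structure is that remediation planning and execution are decoupled: workers consume from the queue independently, which is what lets one automation loop drive tens of thousands of hosts.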


McEntegart described FBAR automation in 2018, saying it processes hundreds of thousands of alerts ‘per moment’, remediates tens of thousands of hosts ‘per moment’, repairs tens of thousands of servers per day and drains/undrains tens of thousands of servers per day.

The reliance on and benefits of automation are based on the key points of:

Automate and integrate repairs, monitoring and management

Properly pre-diagnose issues

Automate logistics functions

Simplify all repairs

Define workflows, measurements and SLAs

In this context, then, Facebook's official response was somewhat unenlightening.

A tweet from 5:49 PM (UTC), 13 March 2019 ran:

“We’re aware that some people are currently having trouble accessing the Facebook family of apps. We’re working to resolve the issue as soon as possible.”

This was followed by a denial that the outage was related to a DDoS attack.

At 7:03 PM (UTC) came:

“We’re focused on working to resolve the issue as soon as possible, but can confirm that the issue is not related to a DDoS attack.”

One would imagine that whatever about suffering a technical issue, the prospect of even a perception of an attack being able to take Facebook offline would be unbearable.

Then finally, many hours after the restoration of services, came this:

“Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services. We’ve now resolved the issues and our systems are recovering. We’re very sorry for the inconvenience and appreciate everyone’s patience.”

In the context of the FBAR description above, what could possibly have gone wrong and why is Facebook so reticent in providing detail?

Could entire clusters of the outage-list, remediation-calculation or job-scheduling servers have suddenly fallen over? And how? The reference to a configuration change would suggest some kind of human action. If automated systems fail, human intervention is inevitably required, but it is hard to judge whether this was the cause of the issue or an exacerbation of it.

Was Facebook trying to implement some kind of change that killed a critical stage of the process? Did the ravenous need for scale bite them hard as an issue took out swathes of capacity?

We may never know, as Facebook remains quite secretive about actual operations, as opposed to frameworks and architectures.

But it is heartening, the next time you have to face a manager and say ‘we are down, and it will take a while to fix’. Cite this instance from the very essence of cloud companies and say, ‘Sure, it could happen to a Facebook!’