Facebook’s Daylong Malfunction Is a Reminder of the Internet’s Fragility – The New York Times

A coding mistake led to a major service interruption on Facebook and its other properties, including WhatsApp, Instagram and Messenger. Credit: Josh Edelson/Agence France-Presse, Getty Images

SAN FRANCISCO — Facebook said on Thursday that it had repaired a technical error that led to long lapses in service at its various properties, including Instagram, WhatsApp and Messenger.

The interruption lasted nearly 24 hours on some of the services and was the longest in Facebook’s recent history. It was an eye-opening reminder that even the most powerful internet companies, employing the best computer scientists and cutting-edge technology, can still be crippled by human error.

“All of the big web companies have multiple lines of defense, but sometimes a coding mistake made by one engineer can make its way onto many thousands of computers and cause major errors,” said Alex Stamos, a former chief security officer at Facebook and a lecturer at Stanford University. “In other words, rebooting something as complex as Facebook is very, very hard.”

A “server configuration change” made on Wednesday had a cascading effect through the company’s network, a Facebook spokesman said. That created a repeating loop of problems that kept growing and could not be immediately fixed, according to one current and one former Facebook employee, who spoke on the condition of anonymity because they were not allowed to talk to reporters.

That small mistake had big consequences. Instagram users couldn’t view other profiles, WhatsApp users couldn’t send messages, and news feeds across Facebook’s main app went blank.

Downdetector, which likens itself to a weather report for the internet, said it had received 7.5 million problem reports about Facebook’s apps. In comparison, widespread problems on YouTube in October prompted just 2.7 million reports. Downdetector measures service interruptions in part by counting reports from users who are experiencing problems.

“Never before have we seen such a large-scale outage,” said Tom Sanders, a co-founder of Downdetector.

Early Thursday, Facebook was able to pull most of its systems back online. The company is still trying to figure out how that error reverberated throughout its network. Facebook officials emphasized that the problem had not been caused by hacking or a cyberassault like a so-called denial-of-service attack, which hits servers with a wave of traffic that causes them to stop working.

For years, Facebook has recruited engineers on the idea that within weeks they can release computer code that touches billions of people.

A map, provided by Downdetector, showing the Facebook outage, centered in some of the company’s biggest markets. Credit: via Downdetector

“I still get a large amount of fulfillment from seeing my work make a meaningful impact on so many people’s lives,” a testimonial from one employee says on Facebook’s “careers” recruiting page.

But that also means a single employee’s mistake can have widespread consequences, especially as Facebook works on a recently detailed plan to consolidate the infrastructure of its “family of apps.” The more tightly woven a computer network becomes, the more likely it is that a small technical problem can grow into a large one.

Facebook, like other internet giants, prides itself on never going offline. That predictability has helped it become one of the most influential — and criticized — companies in the world. An estimated two billion-plus people use one or several of its services daily.

As people become more dependent on Facebook’s services, for chatting with family and friends as well as doing their jobs, they have higher expectations for performance, Mr. Sanders said.

“The tolerance for down time decreases, and people are increasingly expecting services to operate flawlessly 365 days per year,” he said.

Although the incident was an irritation for many users, it had more urgent consequences for businesses, such as advertisers, that rely on Facebook’s network to generate revenue.

Kieley Taylor, global head of social at the advertising agency GroupM, said her firm hadn’t been able to get access to Facebook’s system, meaning new advertising campaigns were delayed.

“It’s never a good day for an outage,” she said. “Luckily, it was relatively a short period, but it was fully out.”

Her company was still trying to determine how many ad campaigns had been hit. Ms. Taylor said that because Facebook’s ad system worked on a pay-as-you-go basis, GroupM wouldn’t need to seek reimbursements from Facebook for ad campaigns that weren’t delivered.

GroupM diverted advertising to Google search, YouTube and other websites, but said Facebook had unique reach given its size.

“Because of all the people who are on the platform, it continues to be a really powerful digital marketing platform,” Ms. Taylor added.