The Ripple Effect of Facebook’s Outage

June 15, 2012

Facebook recently suffered sporadic outages: some users briefly lost access entirely, while others experienced hiccups with site functionality or noticed no issues at all when loading the site.

The graph below (based on approximately 130,000 observations of Facebook every 2 hours, as measured from the Internet backbone across more than 3,000 customers) shows that errors were sporadic, with the error rate reaching 30-40% during spikes. Because Facebook is a large, distributed service, it wasn’t completely down. While the service disruptions had a fairly minimal impact on Facebook itself, they had a widespread ripple effect on many of the Web’s most highly trafficked sites.

Analyzing data across 3,000+ of our customers, we found that more than 5,100 web properties were potentially impacted. When we looked specifically at US news media and retail sites, which are likely to use the Facebook ‘Like’ button, we found a direct correlation between Facebook’s issues and slower average page load times for the top 20 US news sites and top 60 US retail sites.

The recent issues with Facebook are yet another example of how interdependent websites have become and how heavily they rely on third-party services for advertising, social plugins, etc. In the past year we’ve seen major outages at large providers like Amazon, Microsoft and Akamai, which have had a cascading impact on the companies that integrate with those services. And we can count on more to come in the future.

A few short years ago, an outage on a prominent website would have resulted in only that single site being impacted, but because today’s modern applications call services from a variety of third-party providers, outages now cascade throughout the Internet and no service is too big to fail.

It’s no longer enough to control only the factors within your firewall; you must also mitigate the risks imposed by third-party services that can slow your site’s performance or take it down.

With the average website connecting to more than twelve domains before ultimately being served to the end user, serving a website or application now means making sure all these pieces assemble in a way that yields the best possible service to the end user. The graph below depicts the differences in the use of third-party domains across vertical industries. Industries like News, Media, Retail and Travel that typically have lots of feature-rich content and functionality are very dependent on third-party services.

Third-party services can enrich your website’s functionality to drive more traffic to your site, offer interactive experiences, increase conversion, and add new functionality when it becomes available. As we’ve seen, however, they can also make your site vulnerable to degraded performance or a complete shutdown if any one of these components fails.

So the question remains: can you manage your third-party services effectively and recover quickly if there’s an issue?

We recommend putting processes in place that ensure the successful implementation and continued functioning of third-party web services. You can significantly reduce risk by:

benchmarking response time and availability of each third-party web component before signing contracts

testing components before launch in multiple phases and under various conditions, including pre-production testing for a proof of concept and load testing to evaluate performance under conditions similar to those expected on your site during high-traffic periods

devising fast-fail mechanisms that keep your overall site functioning even if any one component should crash

considering redundant services for the most critical features of your site, such as your shopping cart.
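As an illustration of the fast-fail idea in the list above, here is a minimal Python sketch, assuming the third-party component is fetched server-side. The names `call_with_fast_fail`, `slow_like_button` and `healthy_like_button` are hypothetical, and the pool size and timeouts are arbitrary values chosen for the example, not recommendations.

```python
import concurrent.futures
import time

# Shared worker pool for third-party calls (the size is an arbitrary assumption).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_fast_fail(fetch, timeout_s, fallback):
    """Run fetch(), but give up after timeout_s seconds and return fallback.

    An exception raised inside fetch() is also turned into the fallback,
    so a broken component cannot break the page around it.
    """
    future = _POOL.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both the timeout and errors raised inside fetch().
        future.cancel()  # only has an effect if the call has not started yet
        return fallback


# Hypothetical stand-ins for a call to a third-party widget provider.
def slow_like_button():
    time.sleep(0.5)  # simulates a degraded provider
    return "<like-button/>"


def healthy_like_button():
    return "<like-button/>"


if __name__ == "__main__":
    # The healthy call returns its markup; the slow one is abandoned after
    # 0.1 seconds and the empty fallback is used instead.
    print(repr(call_with_fast_fail(healthy_like_button, 1.0, "")))
    print(repr(call_with_fast_fail(slow_like_button, 0.1, "")))
```

Note that the abandoned call keeps running in its worker thread until it finishes; the sketch only stops the rest of the page from waiting on it. A production version would also need to bound how many slow calls can pile up in the pool.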

Once you’ve pre-tested your third-party components to assess their potential impact on performance, the real work begins. Even though vendors may be contracted to maintain certain standards, it is ultimately your vigilance that will ensure those standards are met. Here are a few pointers.

Set measurable and objective levels for SLAs that reference accurate data all parties can see and understand. Without defined standards, there is no way to hold vendors to account.

Measure and keep measuring over time. The occasional snapshot won’t do it; to ensure quality under all loads and across all markets, you must implement a regime of 24/7 monitoring.

Automate the removal of a failing object (e.g. ads, tracking pixels) once it has been identified as an issue, and bring it back once it’s working again.

Share the wealth of data with your vendors. The objective measurements you gather allow you to work amicably with your third-party partners to align interests and resolve problems.
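The pointer about automatically removing a failing object and bringing it back could be sketched along the following lines. This is only one possible approach (a simple circuit-breaker style switch fed by monitoring results); the class name `ComponentSwitch`, its parameters and the default thresholds are all assumptions made for the example.

```python
import time


class ComponentSwitch:
    """Disables a third-party component after repeated failed checks and
    re-enables it as soon as a later check succeeds."""

    def __init__(self, failure_threshold=3, retry_after_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after_s = retry_after_s
        self._clock = clock  # injectable so the logic can be tested
        self._consecutive_failures = 0
        self._disabled_at = None  # None means the component is being served

    @property
    def enabled(self):
        return self._disabled_at is None

    def should_probe(self):
        """True when it is time to check the component's health again."""
        if self.enabled:
            return True
        return self._clock() - self._disabled_at >= self.retry_after_s

    def record(self, ok):
        """Feed in the result of one monitoring check."""
        if ok:
            self._consecutive_failures = 0
            self._disabled_at = None  # bring the component back
            return
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold:
            self._disabled_at = self._clock()  # remove it, or keep it removed


def render_page(body, ad_tag, switch):
    """Include the (hypothetical) ad tag only while its switch is enabled."""
    return body + (ad_tag if switch.enabled else "")
```

In use, each result from your monitoring would be passed to `record()`, and the page template would consult `enabled` before emitting the component. While a component is disabled, `should_probe()` tells the monitor when it is worth checking it again.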

While third-party services provide extensive functionality that enables a richer online experience, they can also introduce performance risks and wreak havoc on thousands of websites. This dependency on outside sources means that you need to evaluate, monitor and optimize every element between the data center and the end user.

Steve has been working in the software and IT services industry for almost 20 years and has contributed in key roles to leading the evolution and modernization of IT performance, including the adoption of Dynatrace as one of the fastest growing APM solutions in the industry. As Vice President for Product Management, he is responsible for product vision and go-to-market strategy. Prior to this role, Steve held management positions in product management, technical sales and engineering. Steve also enjoys cycling, coaching and his children (not in that order). Steve has a B.A. in Economics from Kalamazoo College.