
Q&A: The Impact of Big Data at Facebook

What's the difference between the insight and the impact of big data? Facebook's director of analytics, Ken Rudin, explains how the company uses big data, the benefits it offers, and how "real time" your data really needs to be.

April 2, 2013

[Editor's note: Ken Rudin, Facebook's director of analytics, is focused on ensuring that analytics is a key strategic asset for the company. He is presenting the opening keynote at the TDWI World Conference in Chicago, May 5–10, 2013. In this Q&A, we talked to Ken about how Facebook uses big data and what that means for enterprises everywhere.]

How is Facebook using big data today?

Ken Rudin: There are three big use cases at Facebook. The first is that we use it to power different parts of our product. For example, the feature "People You May Know," which suggests new connections for you on Facebook, runs on our data infrastructure. Our "Messages" product -- which people use to send billions of messages to friends every day -- is built on HBase.

The second big use case is to help keep our site secure and free of spam and other kinds of abuse; we build systems on our data infrastructure that can help detect and prevent issues like these.

The third case -- and by far the biggest of the three -- is for analytics. This includes everything from performing analyses on the thousands of product tests we're running at any given time, to measuring the performance of our advertising products, to gauging the effectiveness of the work we do to grow our user base and keep them engaged.

How do you define "impact" when working with big data?

When we think about this question, we draw a pretty strong distinction between "insights" and "impact." Most analysts know the feeling of running a great analysis that delivers really strong insights -- but then no one ever does anything with it. If you're an analyst at Facebook, you need to own that analysis and get people to act on it. If no one acts, you're not having any impact.

One good example here is the evolution of our "social reporting" tool. This tool provides an easy way for users to ask friends to take down photos of them that they're not comfortable with. When we analyzed how people were using this tool, we found that a large number of users were abandoning the reporting process at the moment they were asked to write a message to their friend about why they wanted the photo taken down. Through testing, we discovered that if we auto-populated a sample message, the number of users who completed the flow jumped from 20 percent to 60 percent. Thanks to that analysis and the follow-up from our team, the auto-population feature is now an official part of the tool.
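The size of that jump is worth putting in perspective. As a rough sketch (the counts here are hypothetical, not Facebook's actual numbers or tooling), a standard two-proportion z-test shows why a move from 20 percent to 60 percent completion is unambiguous even with modest sample sizes:

```python
from math import sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test statistic for a difference in completion rates."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled rate under the null hypothesis that both variants are equal
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical counts: control completes at 20%, variant at 60%,
# with 1,000 users in each arm.
z = two_proportion_z(200, 1000, 600, 1000)
print(round(z, 2))  # => 18.26 -- far beyond any conventional threshold
```

With an effect this large, a test can be called quickly; the subtler lifts, where the statistics actually get close, are where careful testing discipline matters most.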

Facebook has obviously invested heavily in open source Hadoop. What should enterprises be thinking about when they invest in their own big data infrastructures?

We're very focused on using the right tool for the job. Hadoop is a phenomenal tool for doing things we couldn't have imagined a decade ago in terms of the scale and amount of data you can process. Without it, Facebook wouldn't be where it is today, but as amazing as Hadoop is, we think it's important to view it as an addition to more-traditional database technologies and not as a total replacement for them.

If there are things in your data infrastructure that work really well in a relational database -- things like analyzing your customer base by location, age, or other characteristics -- you should keep them there. You should begin investing in Hadoop when you're ready to start conducting more-exploratory analysis, such as "At what price point should I sell this product?"
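The kind of workload that belongs in relational is easy to picture. As an illustrative sketch only -- a toy table, nothing like Facebook's actual schema -- segmenting customers by location and age is a single declarative query:

```python
import sqlite3

# Toy customer table in an in-memory relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, location TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Chicago", 34), (2, "Chicago", 41), (3, "Austin", 28)],
)

# Segmenting a customer base by location/age is a classic relational job:
# the schema is known up front and the question maps directly to SQL.
rows = conn.execute(
    "SELECT location, COUNT(*), AVG(age) FROM customers "
    "GROUP BY location ORDER BY location"
).fetchall()
print(rows)  # [('Austin', 1, 28.0), ('Chicago', 2, 37.5)]
```

Exploratory questions like the pricing example don't reduce to one clean query over a fixed schema, which is where Hadoop-style processing over raw data earns its keep.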

What role do data analysts play at Facebook? How big is the team?

Data analysts at Facebook are specialists who partner with specific teams -- product, engineering, finance, and so on -- to deliver impact. Though they're part of the central analytics team, they physically sit with the teams they support, with a goal of making sure that all decisions made in all areas of the company are appropriately informed by data. Our analysts are both analytics savvy and business savvy, so they can do more than just get answers -- they can help our teams figure out which questions to ask.

It's not a huge team, but it doesn't have to be. We try to instill this focus on data in all of our teams to get them asking their own questions and using good analysis to help them get to the answers. We even run a two-week program where anyone at the company -- product managers, engineers, marketers -- can learn how to frame questions, use our tools, and run analyses.

The industry has increasingly been focused on trying to derive insights in "real time." How important do you think this is, and how "real time" does it actually need to be?

Speed of execution matters for every business, and in many cases your ability to execute is going to be tied to your ability to measure results. For Facebook, it's especially important to the way we develop our product.

As I mentioned earlier, we're running thousands of product tests at any given time, measuring how subsets of our user base engage with slightly different versions of new features. We want to see what's happening as quickly as possible so we can adjust accordingly and make our product more engaging. We can't wait until the next day or the next week for results -- those lost days and weeks add up very quickly.

When it comes to big data, Facebook operates at a huge scale. Is it fair to say that the challenges you face are beyond what the average enterprise is likely to face?

Our analytics data store holds more than 100 PB of data, and we're adding more than 500 TB of new data every day. I've seen some recent estimates from IDC that project there will be more than 40 zettabytes of data in the world by the year 2020. That's more than 5 TB for every man, woman, and child on the planet. Yes, Facebook is at the head of this curve, but given how quickly things are moving, our challenges today are going to be your challenges tomorrow.
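The per-person figure follows directly from the projection. A quick back-of-the-envelope check, using decimal units and a rough 2020 world population of 7 billion (an assumption on my part, not a number from the interview):

```python
# Sanity-checking the IDC projection quoted above.
ZB = 10 ** 21  # bytes in a zettabyte (decimal)
TB = 10 ** 12  # bytes in a terabyte (decimal)

world_data = 40 * ZB           # projected global data by 2020
population = 7_000_000_000     # rough world population

per_person_tb = world_data / population / TB
print(round(per_person_tb, 1))  # => 5.7, i.e. more than 5 TB per person
```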

The other thing that's important to remember here is that good data warehousing and analysis aren't just about managing scale. They're also about making sure your company is effective in using the data you collect. Are you using the right tools? Are you focusing on impact? Are you iterating quickly? Are you building a data-driven culture? These are questions you need to be asking, whether you're at 1 TB or at 100 PB.