Facebook On Big Data Analytics: An Insider's View

Few businesses are on the scale of Facebook, but the problems it's dealing with today might influence the best practices smaller companies will be putting in place tomorrow.

Just as Facebook is shaping big data hardware and data centers through its Open Compute Project initiative, it's also influencing the software tools and platforms for big data analysis, including Hadoop, Hive, graph analysis and more. Hive, Hadoop's data warehousing infrastructure, originated at Facebook, and according to Jay Parikh, VP of infrastructure engineering, the company is hard at work on ways to make Hive work faster and support more SQL query capabilities.

Parikh also tells InformationWeek that Facebook is working on new real-time and graph-analysis platforms, but the heart and soul of this interview is about big data analytics. There's plenty of detail on how Facebook answers operational and business questions, but read on to get Parikh's advice on how to avoid "wasting a lot of money" or "missing huge opportunities" in big data.

InformationWeek: The topic at hand is big data analytics, but let's start by exploring Facebook's infrastructure to get some context.

Jay Parikh: There are a few areas that we invest in to scale massive amounts of data. If you consider just the photos on Facebook, we have more than 250 billion photos on the site and we get 350 million new photos every day. It's a core, immersive experience for our users, so we've had to rethink and innovate at all levels of the stack, not just the software, to manage these files and to serve them, store them and make sure that they're available when users go back through their timeline to view them. That has meant changes at the hardware level, the network level and the data center level. It's a custom stack, and it doesn't involve Hadoop or Hive or any open source big data platforms.

Another area where we invest is in storing user actions. When you "like" something, post a status update or make a friend on Facebook, we use a very distributed, highly optimized, highly customized version of MySQL to store that data. We run the site, basically, storing all of our user action data in MySQL. That's the second pillar.

The third area is Hadoop infrastructure. We do a lot with Hadoop. It's used in every product and in many different ways. A few years ago we launched a new version of Facebook Messaging, for example, and it runs on top of HBase [the Hadoop NoSQL database framework]. All of the messages you send on mobile and desktop get persisted to HBase. We relied on our expertise in Hadoop and HDFS to scale HBase to store messages.

We also use a version of Hadoop and Hive to run the business, including a lot of our analytics around optimizing our products, generating reports for our third-party developers, who need to know how their applications are running on the site, and generating reports for advertisers, who need to know how their campaigns are doing. All of those analytics are driven off of Hadoop, HDFS, Hive and interfaces that we've developed for developers, internal data scientists, product managers and external advertisers.

Parikh: There's lots of hype in the [IT] industry today about everything needing to be real time. That has been true for us for a long time. We push the front-end website code twice a day. We have thousands of different versions of the site running at any given moment. We launched Light Stand, a new version of our newsfeed, last week, and we launched Facebook Graph Search in January. As people are adopting new products like this, we need to understand whether they're working or not. Are people engaged? Are they missing key features? Are they still liking things as much? If the warehouse or analytics platform can't keep up, then we can't come up with new iterations of our products very quickly. Real-time measurement has been a key element for us.

Great article, Doug. Glad you brought the viewpoints of the true pioneers, adopters and practitioner's viewpoints for the benefit of the mainstream enterprise. It was very interesting to read how they push front end code twice a day for analysis and Scuba. Any reason why they did not go with established in-memory databases - a technology which is pretty matured when they adopted MySQL for other purposes?

Parikh is pretty up front about the limitations of Hive that Facebook is tying to overcome, but he makes it clear it will take a yet-to-be-announced new platform -- expected this summer -- to address real-time analysis needs. Given the many real-time initiatives now underway in the Hadoop community, it will be interesting to see whether Facebook's new platform is embraced the way Hive was embraced way back when.

Most IT teams have their conventional databases covered in terms of security and business continuity. But as we enter the era of big data, Hadoop, and NoSQL, protection schemes need to evolve. In fact, big data could drive the next big security strategy shift.

Why should big data be more difficult to secure? In a word, variety. But the business won’t wait to use it to predict customer behavior, find correlations across disparate data sources, predict fraud or financial risk, and more.