New Facebook tool answers queries in under a second—even across hundreds of gigabytes of data

A network as large as Facebook needs to be fast—and the social network's engineering team has a new tool to help make the experience even faster.

Not satisfied with how its current set of tools was measuring site performance across its myriad data centers, Facebook has instead developed Scuba, a system for "real-time, ad-hoc analysis of arbitrary datasets."

In other words, Scuba is helping Facebook gather important data on the health and status of its server infrastructure—and fast. In a blog post, site performance team engineer Lior Abraham says Facebook traditionally relied on pre-aggregated graphs or on sample data stored in MySQL databases. As the site has grown, however—it now boasts just over 900 million users—these tools became slow and unwieldy.

Instead, "the same approach we use to serve News Feed stories in real-time"—where data is stored primarily in memory—"can be used to serve statistics data about our internal systems," he wrote.

As a result, queries now take less than a second to fulfill, "even traversing hundreds of millions of samples and hundreds of gigabytes of data." Employees can also view heat maps of racks and servers based on factors such as CPU consumption and packet throughput, or pull up data on "sick" machines that are performing the worst.

However, speedy access to such massive amounts of data comes with a tradeoff. Because the tool is designed for real-time analysis of recent data, "we typically only keep around 30 days of data," said Abraham in an e-mail exchange with GigaOm.

Scuba's front end, meanwhile, is designed to visualize all of this data in a simple, easy-to-understand manner, using what the team describes as "goggles." Nearly any dimension, including "page, server, source version, datacenter, and country, to name just a few common ones," can be aggregated and visualized as a dataset.
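As a rough illustration of that kind of slice-and-dice aggregation, here is a minimal Python sketch that groups in-memory samples by an arbitrary dimension and averages a metric; the records and field names are hypothetical, not Scuba's actual schema or API.

    from collections import defaultdict

    # Hypothetical in-memory samples; field names are illustrative only.
    samples = [
        {"page": "profile", "datacenter": "prn", "country": "US", "latency_ms": 42},
        {"page": "feed", "datacenter": "ash", "country": "BR", "latency_ms": 87},
        {"page": "feed", "datacenter": "prn", "country": "US", "latency_ms": 63},
    ]

    def average_by(rows, dimension, metric):
        """Group rows by an arbitrary dimension and average the chosen metric."""
        sums, counts = defaultdict(float), defaultdict(int)
        for row in rows:
            sums[row[dimension]] += row[metric]
            counts[row[dimension]] += 1
        return {key: sums[key] / counts[key] for key in sums}

    # For example, average latency per datacenter or per country:
    print(average_by(samples, "datacenter", "latency_ms"))
    print(average_by(samples, "country", "latency_ms"))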

Internally, the tool has proven successful. Abraham says that dozens of engineers contributed features and fixes, "and now over 500 employees use the tool monthly."

From the description I read on their site, it seems like a map-reduce operation performed on an in-memory datastore: the data is partitioned across many machines, each machine calculates an aggregate on whatever dimensions you want, and then the results are collected back in a central location.
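That reading is only the commenter's inference, but assuming it is roughly right, the pattern is a classic scatter-gather: each machine aggregates its own in-memory slice of the data, and a coordinator merges the partial results. A minimal Python sketch, with made-up partitions and field names:

    from collections import Counter
    from concurrent.futures import ThreadPoolExecutor

    # Made-up partitions: each inner list stands in for the samples one machine keeps in memory.
    partitions = [
        [{"country": "US"}, {"country": "BR"}, {"country": "US"}],
        [{"country": "IN"}, {"country": "US"}],
        [{"country": "BR"}, {"country": "IN"}, {"country": "IN"}],
    ]

    def local_count(rows, dimension):
        """Each machine counts its own samples per value of the requested dimension."""
        return Counter(row[dimension] for row in rows)

    def global_count(partitions, dimension):
        """Fan the query out to every partition in parallel, then merge the partial counts."""
        with ThreadPoolExecutor() as pool:
            partials = pool.map(local_count, partitions, [dimension] * len(partitions))
        merged = Counter()
        for partial in partials:
            merged.update(partial)
        return merged

    print(global_count(partitions, "country"))  # Counter({'US': 3, 'IN': 3, 'BR': 2})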