Meet the Data Brains Behind the Rise of Facebook

Facebook's Jay Parikh. Photo: Ariel Zambelich/Wired

Jay Parikh sits at a desk inside Building 16 at Facebook’s headquarters in Menlo Park, California, and his administrative assistant, Genie Samuel, sits next to him. Every so often, Parikh will hear her giggle, and that means she just tagged him in some sort of semi-embarrassing photo uploaded to, yes, Facebook. Typically, her giggle is immediately followed by a notification on his Facebook page. If it’s not, he has some work to do.

Parikh is Facebook’s vice president of infrastructure engineering. He oversees the hardware and software that underpin the world’s most popular social network, and if that notification doesn’t appear within seconds, it’s his job to find out why. The trouble is that the Facebook infrastructure now spans four data centers in four separate parts of the world, tens of thousands of computer servers, and more software tools than you could list without taking a deep breath in the middle of it all. The cause of that missing notification is buried somewhere inside one of the largest operations on the net.


But that’s why Parikh and his team build tools like Scuba. Scuba is a new-age software platform that lets Facebook engineers instantly analyze data describing the length and breadth of the company’s massive infrastructure. Typically, when you crunch such enormous amounts of information, there’s a time lag. You might need hours to process it all. But Scuba is what’s called an in-memory data store. It keeps all that data in the high-speed memory systems running across hundreds of computer servers — not the hard disks, the memory systems — and this means you can query the data in near realtime.

“It gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting,” Parikh says. “So, when Genie tags me in a photo and it doesn’t show up within seconds, we can look to Scuba.”

In the nine years since Mark Zuckerberg launched Facebook out of his Harvard dorm room — Monday marks the anniversary of the service — it has evolved into more than just the world’s most popular social network. Zuckerberg and company have also built one of the most sophisticated engineering operations on the planet — largely because they had to. Facebook is faced with a uniquely difficult task — how to serve a personalized homepage to one billion different people, juggling one billion different sets of messages, photos, videos, and so many other data feeds — and this requires more tech talent than you might expect.

That operation also includes a team of top engineers who deal in data — an increasingly important part of modern online operations. Scuba is just one of many “Big Data” software platforms Facebook has fashioned to harness the information generated by its online operation — platforms that push the boundaries of distributed computing, the art of training hundreds or even thousands of computers on a single task.

Google’s Big Data platforms are still viewed as the web’s most advanced, but as Facebook strives to expand its own online empire, it isn’t far behind, and in contrast to Google, Facebook is intent on sharing much of its software with the rest of the world. Google often shares its big ideas, but Facebook also shares its code, hoping others will make good use of it. “Our mission as a company is to make the world more open and connected,” Parikh says, “and in building our infrastructure, we’re also contributing to that mission.”

The Tale of the Broken News Feed

Facebook’s data team was founded by a man named Jeff Hammerbacher. Hammerbacher was a contemporary of Mark Zuckerberg at Harvard, where he studied mathematics, and before taking a job at Facebook in the spring of 2006, he worked as a data scientist inside the big-name (but now defunct) New York financial house Bear Stearns.

Hammerbacher likes to say that the roots of Facebook’s data operation stretch back to an afternoon at Bear Stearns when the Reuters data feed suddenly went belly up. With the data feed down, no one could make trades — or make any money — and the feed stayed down for a good hour, because the one guy who ran the thing was out to lunch. For Hammerbacher, this snafu showed that data tools were just as important as data experts — if not more so.

“I realized that the delta between the data models that I generated and the models generated by a mathematician at another firm was going to be pretty small compared to the amount of money we lost during that two hours without the Reuters data feed,” Hammerbacher remembers. “I felt like there was an opportunity to build a complete system that starts with data ingest and runs all the way to data model building — and try to optimize that system at every point.”


That’s basically what he did at Facebook. The company hired him as a data scientist — someone who could help make sense of the company’s operation through information analysis — but with that broken Reuters data feed in the back of his mind, he went several steps further. He built a team that would take control of the company’s data. The team would not only analyze data. It would build and operate the tools needed to collect and process that data.

When he first joined the company, it was still trying to juggle information using an old-school Oracle data warehouse. But such software wasn’t designed to accommodate an operation growing as quickly as Facebook, and Hammerbacher helped push the company onto Hadoop, an open source software platform that had only recently been bootstrapped by Yahoo.

Hadoop spreads data across a sea of commodity servers, before using the collective power of those machines to transform the data into something useful. It’s attractive because commodity servers are cheap, and as your data expands, you just add more of them.
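That division of labor can be sketched in miniature. The toy Python script below — a single-process illustration of the MapReduce idea behind Hadoop, not anything Facebook runs — splits data into chunks, runs a map function on each chunk independently (as separate servers would), and then merges the partial results in a reduce step:

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Count words in one chunk -- on Hadoop, each server does this on its local slice."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(a, b):
    """Merge two partial counts -- Hadoop shuffles and reduces these across the cluster."""
    return a + b

# Pretend each inner list is the slice of data stored on one commodity server.
chunks = [
    ["the quick brown fox", "the lazy dog"],
    ["the dog barks"],
]
partials = [map_chunk(c) for c in chunks]          # runs in parallel on a real cluster
totals = reduce(reduce_counts, partials, Counter())
print(totals["the"])  # 3
```

Adding capacity means adding another entry to `chunks` — or, in Hadoop’s case, another cheap server — without changing the map or reduce logic.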

Yahoo used Hadoop to build an index for its web search engine, but Hammerbacher and Facebook saw it as a means of empowering the company’s data scientists — a way of analyzing much larger amounts of information than it could stuff into an Oracle data warehouse. The company went to work on a tool called Hive — which would let analysts crunch data atop Hadoop using something very similar to the structured query language (SQL) that has been widely used since the ’80s — and this soon became its primary tool for analyzing the performance of online ads, among other things.

Hadoop of the Future

Today, Hadoop underpins a who’s who of web services, from Twitter to eBay to LinkedIn, and Facebook is now pushing the platform to new extremes. According to Jay Parikh — who, as head of infrastructure, oversees the company’s Big Data work — Facebook runs the world’s largest Hadoop cluster. Just one of several Hadoop clusters operated by the company, it spans more than 4,000 machines, and it houses over 100 petabytes of data — more than 100 million gigabytes.

This cluster is so large, it has already outgrown four data centers, says Facebook engineer Raghu Murthy. On four separate occasions, as it struggled to cope with the ever expanding collection of data generated by its site, Facebook filled its allotted data center space with Hadoop servers, and each time, it was forced to find a new facility. “Our planning horizon was always, like, forever,” says Murthy, a cornerstone of the company’s Big Data work since Jeff Hammerbacher hired him away from a Stanford Ph.D. program more than four years ago. “But then we would have to go through this process of shipping all the data over to a new place.”


But after the last move, the company vowed it would never do this again, and it set about building a Hadoop cluster that would span multiple data centers. The project was led by Murthy, who had caught Hammerbacher’s eye after building a pre-Hadoop distributed computing system at Yahoo and had already worked on several key projects at Facebook, including Hive. But this was something different. Hadoop wasn’t designed to run across multiple facilities. Typically, because it requires such heavy communication between servers, clusters are limited to a single data center.

The solution is Prism, a platform Murthy and crew are currently rolling out across the Facebook infrastructure. The typical Hadoop cluster is governed by a single “namespace,” a list of computing resources available for each job, but Prism carves out multiple namespaces, creating many “logical clusters” that operate atop the same physical cluster.

These namespaces can then be divided across various Facebook teams — each team gets its own namespace — but all of them have access to a common dataset, and this dataset can span multiple data centers. The trick is that when a team runs a job, it can replicate the particular data needed for that job and move it into a single data center. “We’re pushing the capacity planning down to the individual teams,” Murthy says. “They have a better understanding of the particular needs of the site.”
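Prism’s internals aren’t public, but the namespace scheme described above can be sketched in the abstract. In this toy Python model — every name in it is hypothetical, invented for illustration — each team owns a namespace listing the datasets it may use, a shared catalog tracks which data centers hold a replica of each dataset, and launching a job copies just that team’s datasets into one target facility:

```python
# Toy sketch of "logical clusters" over shared data -- not Facebook's Prism code.
# All dataset, team, and data center names are hypothetical.

# Shared catalog: dataset -> set of data centers currently holding a replica.
catalog = {
    "ads_clicks": {"prineville"},
    "photo_tags": {"forest_city"},
}

# Each team's namespace lists the datasets it is allowed to use.
namespaces = {
    "ads_team":    ["ads_clicks"],
    "photos_team": ["photo_tags", "ads_clicks"],
}

def run_job(team, target_dc):
    """Before a job runs, replicate the team's datasets into one data center."""
    for dataset in namespaces[team]:
        if target_dc not in catalog[dataset]:
            catalog[dataset].add(target_dc)  # stand-in for a bulk cross-site copy
    # The job now sees all of its data locally in target_dc.
    return sorted(ds for ds in namespaces[team] if target_dc in catalog[ds])

print(run_job("photos_team", "lulea"))  # both datasets now local to "lulea"
```

The point of the model is that capacity decisions live with the team: each namespace decides what it replicates and where, while the physical cluster underneath stays shared.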

According to Murthy, the system can expand to an infinite number of servers — at least in theory. That means the company needn’t worry about maxing out another data center. But for Santosh Janardhan — who runs “operations” for the data team, meaning he ensures all this infrastructure runs smoothly — there’s an added benefit. “Having the whole Hadoop cluster in one data center scares the shit out of me,” he says. “Prism helps with this.”

Prism is just one part of a sweeping effort to improve and expand Hadoop. Led by another ex-Yahoo man, Avery Ching, a second team of engineers recently deployed a new platform called Corona, which allows many jobs to run atop a single Hadoop cluster without crashing the thing. And Murthy has also helped fashion a tool called Peregrine, which lets you query Hadoop data far more quickly than the norm. Hadoop was designed as a “batch system,” meaning you typically have to wait while jobs run, but much like Impala — a system built by Hammerbacher and Cloudera — Peregrine takes the platform closer to realtime.

Facebook has yet to share all this software with the outside world, but it has shared Corona, and if history is a guide, it will likely share more. That’s one of the reasons engineers like Avery Ching are here. “At Facebook, we face problems before others do,” he says. “Others can benefit from that. They don’t have to reinvent the wheel.”

Data Minds in Candy Land

Hadoop is the bedrock of Facebook’s data operation — and it will be for years to come. But with tools like Scuba, the company is also moving in new directions.

Built by a team of engineers that includes Josh Metzler — who attracted the company’s attention after he repeatedly placed among the highest scorers in the programming competitions run by Top Coder — Scuba is one of a growing number of in-memory data stores that seek to significantly improve the speed of information analysis. Tiny software agents running across Facebook’s data centers collect information about the behavior of the company’s infrastructure, and then Scuba compresses this log data into the memory systems of hundreds of machines. This data can then be queried almost instantly.

“It’s kind of like an Excel pivot table,” says Parikh, referring to the common spreadsheet tool that lets you slice and dice data, “except you’re dealing with hundreds of millions of rows of data and you can pivot that data with a sub-second response time.”
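The pivot-table analogy boils down to grouping rows by one column and aggregating another, with all the rows held in memory. Scuba itself is closed and vastly more elaborate — compression, queries fanned out across hundreds of servers — but a minimal Python version of the query pattern, over invented sample log rows, looks like this:

```python
from collections import defaultdict

# Invented sample log rows, tiny stand-ins for Scuba's in-memory tables.
rows = [
    {"datacenter": "prineville",  "endpoint": "photo_tag", "latency_ms": 12},
    {"datacenter": "prineville",  "endpoint": "news_feed", "latency_ms": 40},
    {"datacenter": "forest_city", "endpoint": "photo_tag", "latency_ms": 18},
]

def pivot_avg(rows, group_by, value):
    """Group rows by one column and average another -- the pivot-table pattern."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[group_by]] += row[value]
        counts[row[group_by]] += 1
    return {key: sums[key] / counts[key] for key in sums}

result = pivot_avg(rows, "endpoint", "latency_ms")
print(result)  # photo_tag averages (12 + 18) / 2 = 15.0
```

Because the data never touches a disk, a scan like this stays fast even at scale — the sub-second response Parikh describes comes from running the same kind of loop in parallel across hundreds of machines’ RAM.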

Yes, the project seems to overlap with Peregrine — at least in part. But as Jeff Hammerbacher points out, that too is part of the Facebook ethos. “The Facebook way of building things is to go for the shortest path solution,” he says. “It doesn’t always build one monolithic system that does everything.” Like so many Facebook projects, Scuba grew out of a company hackathon. Engineers see a problem and they tackle it. They don’t wait for another project to solve it for them.

And these problems are everywhere. Santosh Janardhan juggled data at PayPal and YouTube, but he says these jobs now seem tiny by comparison. “Facebook blows all of them out,” he says. “It’s just staggering to me…the sheer rate at which data grows here.” That, he says, is the main reason they’re all here. They want to solve the big problems. “If you’re a technical guy, this is like a Candy Land.”