Out in the Open: Hacker Vows to Instantly Analyze Your Big Data


Former AddThis CTO Stewart Allen.

Photo: Stewart Allen

These days, Hadoop is everywhere.

It began as an esoteric data-crunching platform used by vanguard web companies like Yahoo, Facebook, and Twitter, and now, less than a decade later, it's a sensation that spans industries. IBM uses Hadoop inside Watson, its Jeopardy-winning supercomputer. The NSA uses Hadoop to juggle the massive amounts of surveillance data it collects with each passing second. And this open source success story is even finding its way into businesses that play in more traditional markets like finance and insurance.

But Stewart Allen, the former CTO of social media company AddThis, thinks it's all a bit much. "I refer to what's going on as the Lost Decade of Hadoop," he says. "So many people are wasting time building software that Hadoop was never meant for."


Naturally, he has a solution. At AddThis, he led the creation of Hydra, a big data processing system designed specifically to handle data analytics in real time – something that many people assume Hadoop is good for, even though it's not. This open source project is just one of a growing number of tools that seek to leapfrog Hadoop and provide a much more rapid breed of data analysis.

You've probably seen those little AddThis buttons that litter the web. The company offers an analytics dashboard that lets companies track how their online content is being shared across the net. Every time someone uses one of those buttons to share something on Facebook, Twitter, or one of the myriad other social networks that these buttons tie into, a tiny bit of data is sent back to AddThis. The company then aggregates this data and sends it to customers via that dashboard.

Allen started building Hydra back in 2006 when all that small data started adding up to big data. "The data rate was beginning to overwhelm the system capacity," Allen says. The company needed new ways of juggling all this information, but nothing fit the bill. At the time, there was no Hadoop and no massive database like Cassandra, tools that let you store and retrieve data across thousands of computer servers. So Allen built a new tool from scratch.

What Allen and company came up with is a bit like Hadoop, in that Hydra distributes data across multiple servers. But it's really quite different. Hadoop was originally designed to analyze one big static data set. This is called "batch processing," and it's great if you've already collected all your data. But you need another tool if your data set is getting larger by the minute. You need a tool that analyzes data in real time, as it comes off the net.
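To make the distinction concrete, here is a minimal sketch in Python. It is not code from Hydra or Hadoop, and the function names and sample share events are hypothetical, chosen only for illustration. The batch version needs the complete data set before it can answer anything; the streaming version keeps running totals and can report an up-to-date answer after every event.

```python
from collections import Counter
from typing import Iterable, Iterator


def batch_count(events: list[str]) -> Counter:
    """Batch style: the whole data set must exist before analysis starts."""
    return Counter(events)


def streaming_count(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update running totals as each event arrives."""
    totals: Counter = Counter()
    for service in events:
        totals[service] += 1
        # Yield a snapshot so a dashboard could read current totals at any time.
        yield totals.copy()


if __name__ == "__main__":
    # Hypothetical share events, one per button click.
    shares = ["facebook", "twitter", "facebook"]

    print("batch:", dict(batch_count(shares)))
    for snapshot in streaming_count(shares):
        print("stream:", dict(snapshot))
```

Both approaches end with the same totals. The difference is that the streaming version has a usable answer at every step, which matters when the data never stops arriving.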

Nowadays, there are many tools that do this. Facebook built a system called Puma. Twitter uses tools called Summingbird and Storm. Yahoo is exploring a tool called Spark. And the developers of Hadoop hope to improve its real-time chops with something called YARN. But Allen and the AddThis team still think Hydra has an edge on many of these systems. "I don't think all these different tools interface well," he says. "Hydra's view of the world is much simpler and cleaner."

Although AddThis focuses on real-time analysis, Hydra can do batch processing too. It also offers a simple language for asking questions of the data and a graphical dashboard for managing clusters of servers, and as Chris Burroughs, an engineer working on Hydra at AddThis, points out, it's battle-tested. AddThis has used it since 2006. "Hydra has been violently used for all these years," he says. Today, the company runs six Hydra clusters. The largest spans 156 servers and processes 3.5 billion actions per day.

Allen left the company in April of last year to found a couple of new startups. He's tight-lipped about the names of the new companies or what they'll do, but he does say that he's using Hydra for one of them. It may never be as popular as Hadoop. But its tentacles are beginning to stretch across the web.