Former Yahoo! Hadoop honcho uncloaks from stealth

Making elephants dance in real-time

Structure Data 2012 Yet another big data startup that traces its roots to the Hadoop MapReduce project started, and later open sourced, by Yahoo! has uncloaked.

At the Structure Data 2012 conference in New York on Wednesday, Todd Papaioannou, formerly vice president of cloud architecture at the internet media company and the head honcho through the years when Yahoo! took its Hadoop MapReduce engine and Hadoop Distributed File System open source as an Apache project, gave a brief – and not particularly detailed – introduction to the company he has founded, called Continuuity. (Yeah, that's with a double U, just to annoy the crap out of us.)

Papaioannou said that the trick in this business is to capture the "digital exhaust" – a term coined by Google – that we emit as we go about our lives on the internet and to make use of that vastly expanding amount of data. I would call it digital yeast myself, but I'm a homebrewer, so you'd expect that sort of thing. Perhaps digital methane might be even more accurate. Gartner says that the amount of data will grow by 800 per cent over the next four years, and that 80 per cent of that data will be unstructured.

Given that he comes from Yahoo!, you would expect Papaioannou to say that consumer intelligence – something that his new company will be focused on – is "the first archetypal application pattern that is emerging in the big data space." But it is more than just targeting people on the web to serve them ads, content, and deals, or to do sentiment analysis.

To prove his point about how dramatic an effect consumer intelligence can have, Papaioannou trotted out some statistics from some of the batch Hadoop operations at Yahoo!

Back in the day, Yahoo! News gave everyone the same homepage. But after gathering up data about Yahoo! users who go to the news site, and not only serving them up more appropriate ads, but also serving them up precisely targeted content – over 3 million different homepages were generated for the news site – Yahoo! was able to increase the clickthrough rate for news by more than 300 per cent.

"Obviously human editors could not have done that," explained Papaioannou, which is why Yahoo! built a content serving engine that figured out what pages to serve by mining the data about the stories that users actually read each day and feeding them more of the right stuff to keep them reading.

The problem with all of this is that the Hadoop backend is batch oriented. "Hadoop has been a fantastic platform for doing that," says Papaioannou. "But actually, the web is moving towards much more of a real-time experience, people are expecting much more of a real-time experience."

It doesn't matter if you can sift through mounds of data to find "the signal" buried in it that tells you what to do with an end user, says Papaioannou. "You need to be able to act on that signal in real-time." And Yahoo! was not able to get Hadoop to run in real-time, despite the MapReduce Online and S4 efforts.

There has been an evolution from relational databases, which didn't scale very well, to sharded databases (like distributed MySQL) in an attempt to move from enterprise scale to hyperscale web applications, explains Papaioannou, who threw up this quick graphic:

The evolution of big data, according to Todd P and Continuuity

To get around the scalability barriers of the traditional relational database, you shatter the database and distribute it across a bunch of database nodes, which all feed into the compute node working on the data. With systems like Hadoop, you have a single master node that controls the MapReduce job that is crunching data, and you actually move the necessary compute jobs out to run on top of the data store nodes.
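For readers who want to see the shape of that idea, here is a toy, single-process Python sketch. It is not Hadoop's API or anything Papaioannou showed: the names `shard_for`, `map_on_node`, and `reduce_counts`, the four-shard cluster, and the click events are all invented for illustration. It routes records to shards by a stable hash, counts clicks on each shard where the data sits, and then merges the small partial results in one place.

```python
from collections import Counter
from zlib import crc32

NUM_SHARDS = 4  # hypothetical cluster size, chosen only for the example


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard with a stable hash (sharded-database style)."""
    return crc32(key.encode("utf-8")) % num_shards


def partition(events, num_shards: int = NUM_SHARDS):
    """Spread (user, story) click events across shards by user id."""
    shards = [[] for _ in range(num_shards)]
    for user, story in events:
        shards[shard_for(user, num_shards)].append((user, story))
    return shards


def map_on_node(shard):
    """Map step: runs next to the data and emits per-story partial counts."""
    return Counter(story for _user, story in shard)


def reduce_counts(partials):
    """Reduce step: merge the small partial counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def clicks_per_story(events, num_shards: int = NUM_SHARDS):
    """Partition the events, count on each shard, then merge."""
    shards = partition(events, num_shards)
    return reduce_counts(map_on_node(shard) for shard in shards)
```

The point of the split is that only the compact per-shard counters travel to the merge step, while the raw click records stay on the node that stores them.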

But in the future, what Papaioannou envisions is that companies will not wait to run their log files and clickstreams through a batch-oriented Hadoop cluster, but rather run the data through the compute nodes and process it all in real time, and presumably in a massively parallel fashion. He was not precise about how this might be accomplished.
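Since the talk offered no mechanism, the following Python sketch is only one guess at what the contrast could look like, and it makes no claim about Continuuity's design. The function `batch_top_story`, the class `StreamingTopStory`, and the sample events are hypothetical. The batch version needs the complete log before it can answer; the streaming version updates a running count on every click and can answer after each one.

```python
from collections import Counter


def batch_top_story(log):
    """Batch style: wait for the whole log, then scan it once."""
    counts = Counter(story for _user, story in log)
    return counts.most_common(1)[0][0] if counts else None


class StreamingTopStory:
    """Streaming style: update state per event and answer at any moment."""

    def __init__(self):
        self.counts = Counter()
        self.top = None

    def on_event(self, user, story):
        """Fold one click into the running counts and return the current leader."""
        self.counts[story] += 1
        # Replace the leader only when another story strictly overtakes it.
        if self.top is None or self.counts[story] > self.counts[self.top]:
            self.top = story
        return self.top


if __name__ == "__main__":
    log = [("u1", "x"), ("u2", "y"), ("u3", "y"), ("u1", "y"), ("u2", "x")]
    stream = StreamingTopStory()
    for user, story in log:
        print(user, story, "leader so far:", stream.on_event(user, story))
    print("batch answer:", batch_top_story(log))
```

When one story is the clear leader, both approaches agree once the last event has arrived. On exact ties they can name different stories, because each breaks ties by its own ordering.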

"This is a pretty fundamental change compared to the application architecture of Hadoop," says Papaioannou. "You walk around here and you see a lot of people talking about real-time. It's not clear to me, as an industry, that we have nailed that problem. It is clear to me that we need to solve that problem, and that the next big wave of applications is going to be real-time and to get to real-time, you have to take the human out of the loop." Just like Yahoo! did with its homepage.