I’ll Take Hadoop for $400, Alex

By oom – Thu, Feb 24, 2011 2:01 PM EST

See what Yahoo! and Jeopardy! have in common.

This week, IBM’s supercomputer, Watson (named after IBM’s founder, Thomas J. Watson), took on two of the most successful Jeopardy! contestants of all time in an exhilarating million-dollar Jeopardy! face-off between man and machine.

Watson defeated Jeopardy! defenders Ken Jennings and Brad Rutter, amassing $77,147 in winnings in a nail-biting three-night tournament that sparked interest in the fields of artificial intelligence and data analytics.

What you may not realize is that Yahoo! played a role in it.

IBM's Watson depends on 200 million pages of content and 500 gigabytes of preprocessed information to answer the Jeopardy! questions. That huge catalog of documents had to be indexed so that Watson could answer questions within the 3-second time limit. On a single computer, generating a catalog and index that large would take a long time, but dividing the work across many computers makes it much faster.

Apache Hadoop is the industry-standard framework for processing large amounts of data on many computers in parallel. By using Hadoop MapReduce, Watson's development team was able to run their application easily and reliably on a large number of computers.
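To make the idea of dividing indexing work concrete, here is a minimal sketch of the MapReduce pattern applied to building a word-to-document index. It is a single-process simulation in plain Python, not Hadoop's API and not Watson's actual pipeline; the function names (map_phase, shuffle, reduce_phase, build_index) and the sample documents are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reduce step: merge all document ids for one word into a sorted list."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    """Run map, shuffle, and reduce in sequence (hypothetical driver)."""
    pairs = []
    for doc_id, text in documents.items():
        # On a cluster, each document (or block of documents) could be
        # mapped on a different machine; here we simply loop.
        pairs.extend(map_phase(doc_id, text))
    return dict(reduce_phase(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample data, not Watson's content.
docs = {
    "d1": "Watson plays Jeopardy",
    "d2": "Hadoop indexes pages",
    "d3": "Watson uses Hadoop",
}
index = build_index(docs)
```

Because each map call looks at only one document and each reduce call at only one word, both steps can be spread over many machines without the pieces needing to coordinate, which is what makes the approach scale.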

Yahoo! has been the primary contributor to Hadoop for the last 5 years, since the start of the project.

The graph below (Hadoop Patches) shows the number of patches contributed by different companies over time. Since Hadoop is part of Apache, no company owns it, but clearly Yahoo! has made the largest contribution. Furthermore, with a few exceptions, all of the members of the Hadoop Project Management Committee either work at Yahoo! or are Yahoo! alumni. Yahoo! also hosts the Bay Area Hadoop User Groups, and last Wednesday we had over 300 people come and listen to Arun Murthy talk about the next generation of Hadoop MapReduce.

Hadoop is critical to Yahoo!'s business, touching many of our key products. Yahoo! Mail uses Hadoop to find spam, and it is much more effective than the previous system. The Yahoo! front page content is optimized for each user via Hadoop, based on their interests, which has led to a much higher page view rate.

Links displayed to the user are also optimized with Hadoop, and that has dramatically increased click-through rates. Finally, picking the best ads to display to each user is done with Hadoop.

What all of these have in common is that they require processing huge amounts of data very quickly and reliably on large numbers of computers, and they have a direct impact on Yahoo!'s performance as a company.

So if you’re ever on Jeopardy!... “It is the company that is the biggest contributor to the technology behind Hadoop….”