The Hadoop Summit of 2010 started off with a vuvuzela blast from Blake Irving, Chief Product Officer for Yahoo. Yahoo delivered keynote addresses that outlined the scale of their use, technical directions for their contributions, and architectural patterns in how they apply the technology.

The increasing interest in Hadoop was evident: this year’s conference had 1000 people in attendance and was sold out 10 days before it started, increasing from about 300 two years ago and 650 last year. The father of Java, James Gosling was also in attendance at the conference. This conference marked the fifth anniversary of Hadoop (approximately). Irving noted that only 5% of the world’s data is structured, with unstructured data having tremendous growth and that this new data is of a far more transient nature. He highlighted that Yahoo uses Hadoop to analyze every page click and optimizes rankings for content, updating the results every 7 minutes. He noted "we believe Hadoop is now ready for mainstream enterprise use."

Yahoo’s SVP of Cloud Computing, Shelton Shugar noted that Yahoo inputs 120 TB of data per day for 100 billion events, currently storing 70 PB with 170 PB of capacity. Yahoo processes 3 PB per day, running over a million jobs per month on 38,000 servers. As Yahoo’s usage of Hadoop has grown, they have needed to build support for mainstream application programmers and better management tools and data security. He noted that Yahoo uses Hadoop in production for a variety of products:

data analytics

content optimization

content enrichment

Yahoo! Mail Anti-Spam

Ad products

Ad optimization

Ad selection

Big data processing & ETL

Yahoo is also using Hadoop heavily for applied science, such as:

user interest prediction

ad inventory prediction

search ranking

ad targeting

spam filtering

Eric Baldeschwieler, VP of Hadoop Software Development of Yahoo noted that in the last year Yahoo has:

increased their clusters from 2000 to 4000 nodes each

doubled the number of jobs per node, benefitting from increased CPU power due to Moore’s law.

Now have over 80% disk utilization and typically 50-60% CPU utilization, and that data use is growing faster than processing use.

contributed over 70% of the total patches to Hadoop.

Their focus in the last year was to improve Hadoop map-reduce:

a new capacity scheduler

job tracker stability and robustness supporting mixed workloads

adding limits to resource use: safety rails

Their focus is now on developing Hadoop’s distributed file system, HDFS:

every node in their clusters now has 12 TB of storage. They are now building a single 48 PB cluster – "that blows Hadoop’s mind" because of scalability limits in the name node

improving use of memory, connections and buffers, and to provide metrics

breaking up storage into a set of volumes (uses multiple HDFS clusters)

releasing federated storage across HDFS instances, in the next major Hadoop release

Baldeschwieler explained how Yahoo personalizes their home page:

Realtime serving systems use Apache that read maps of users to interests from database

Every 5 minutes they use a production Hadoop cluster to rerank content based on recent data, updating results every 7 minutes

Every week they recompute their machine learning models for categories in a science Hadoop cluster

Yahoo Mail uses Hadoop in a similar way:

Scoring mail against a spam model in a production cluster frequently

Retraining antispam models on a science cluster every few hours.

This system powers 5 billion deliveries per day to over 450 million mail boxes.

Because HDFS has a single point of failure (the name node), it is a risk for a high availability production system. To mitigate that, Yahoo replicates data to multiple clusters, so an outage of one distributed file system can be compensated for by using a backup file system. In their presentations Yahoo noted that they are using the Hive data warehousing project of Hadoop, in addition to their own Pig project.

Baldeschwieler announced that Yahoo has released a beta test of Hadoop Security, which uses Kerberos for authentication and allows colocation of business sensitive data within the same cluster. They have also released Oozie, a workflow engine for Hadoop, which has become the de-facto ETL standard at Yahoo. It is integrated with MapReduce, HDFS, and Pig and Hadoop Security.

Overall, Yahoo demonstrated continued leadership in developing Hadoop technologies, and at the same time they were clearly pleased that a number of leading Internet companies and independent technology vendors have emerged as part of the ecosystem.

Just to further exnapd why we did this (and not directly post patches to MySQL upstream), most of the changes apply generally and would be beneficial to the MySQL codebase. However, it’s easier for us to go ahead and work in the open and to be able to pass these changes upstream, rather than pass them off privately and wait potentially a year or more to get them back in a public release. On top of that, we want to be able to collaborate with Percona, Facebook, Google, MariaDB, etc., on some of the actual changes, to make things better for everyone.This is simply the first step towards that goal.