Yahoo Tightens Up Hadoop for Security, Workflow Management

June 29, 2010

Santa Clara, Calif. -- Yahoo unveiled a new beta version of Hadoop, the open source, distributed filesystem that the Web giant created and has used internally for years to power its homepage and numerous other online assets.

The new version, announced here Tuesday at Yahoo's annual Hadoop developer conference, integrates Kerberos open source software, which the company said will enable more secure collaboration and sharing of authenticated data.

Yahoo (NASDAQ: YHOO) is also contributing its Oozie workflow engine to the open source filesystem. Oozie, which is also open source, manages workflows and data pipelines for many Hadoop jobs at Yahoo.

With Kerberos support and Oozie now in place, company officials declared Hadoop ready for broader enterprise use, particularly as companies see a need to process more unstructured data such as content generated by blogs, wikis and other social media.

"This is about 'big data' and an unprecedented ability to process huge data at speeds never before thought possible," Blake Irving, Yahoo's chief product officer, said in his keynote address.

As an example, Irving noted that Yahoo uses Hadoop to process 120 terabytes of information every day, including five billion daily e-mail messages for 500 million users.

"A lot of that is spam, and we're able to take that anonymized information to identify spam behavior that our production system learns from," said Eric Baldeschwieler, vice president of Hadoop software at Yahoo. As proof of its efficacy, Baldeschwieler pointed to an independent research study that indicated Yahoo Mail users saw almost 40 percent less spam than Hotmail and 55 percent less spam than Gmail.

While Yahoo is both a prime mover in the Hadoop community and the software's biggest user, Google (NASDAQ: GOOG), Facebook, LinkedIn and Amazon (NASDAQ: AMZN) head the list of other major Web and technology companies (including IBM) that use Hadoop. Interest in Hadoop is growing, judging by the attendance at the sold-out event, which attracted over 1,000 developers, up from 700 at last year's Hadoop Summit.

Yahoo officials say they support Hadoop as an open source effort because it's a "virtuous circle" that results in contributions from other companies and developers from which Yahoo itself can benefit.

"We really think this latest release is a game changer, with the security features that's going to get a lot more companies interested in Hadoop," Shelton Sugar, senior vice president of cloud computing at Yahoo, told InternetNews.com.

IDC research director Melanie Posey said Hadoop should appeal to companies looking for ways to leverage the vast quantity of data they're increasingly collecting from customers and other sources.

"Apache Hadoop is an efficient solution for processing data at scale," Posey said in a statement. "Hadoop has matured and is now becoming an enterprise-ready cloud computing technology with the addition of Kerberos authentication.

"Now organizations of various sizes can leverage Yahoo's Hadoop investment and deployments to run it on their own systems and build out their own Hadoop deployments without starting from scratch on internal science experiments," she added.

The news comes at a time when Yahoo faces intense competition to hold on to the hundreds of millions of visitors to its homepage and other properties, with rivalry from Google on the search side and from Facebook and Twitter in social networking and media. Yahoo has, however, integrated feeds from Facebook, Twitter and other social networks into its homepage to help keep consumers using its site as a starting point for their Web surfing.

Yahoo said some three million versions of the Yahoo homepage are created each day, based on what its Hadoop-powered system thinks visitors are most interested in seeing.

The online pioneer is also in the process of handing off its search infrastructure to Microsoft as part of a long-term partnership between the two companies, under which Microsoft will provide the backend for Yahoo's search. However, Yahoo retains control over the "search experience" -- including the user interface -- and customer relationships, and Hadoop plays an important role in that aspect, as part of a significant technology investment that already involves some 38,000 servers.

"Over time you'll see our usage grow," Sugar said.

Gartner analyst Mike McGuire said Yahoo has demonstrated that it is committed to investing in technology with a long-term view.

"When you look at the investments Yahoo is making and the open source angle, there are a lot of things that aren't necessarily going to pay off right away, but when you look at what it's already delivered and where this could lead in terms of new advertising and content services, it's very impressive," he told InternetNews.com.