Mahout, There It Is! Open Source Algorithms Remake Overstock.com

SALT LAKE CITY – Judd Bagley set out to build a web app that would serve up a never-ending stream of news stories tailored to your particular tastes. And he did. It's called MyCurrent. But in creating this clever little app, Bagley also pushed online retailer Overstock.com away from the $2-million-a-year service it was using to generate product recommendations for web shoppers, and onto a system that did the same thing for free – and did it better.

Bagley is a software developer with Overstock's fledgling O Labs, a mini-research-and-development operation tucked into the fifth floor of the company's Salt Lake City headquarters, just outside the office of CEO Patrick Byrne. O Labs was founded to incubate projects that can push the company in new directions, and MyCurrent was the first of the lot. A personal news reader may seem like an odd thing to emerge from an online retailer, but that's largely the point. And in the end, the project pumped new life into the company's primary retail operation.

In building MyCurrent, Bagley and his O Labs cohorts stumbled onto an open source software project known as Mahout. Founded in 2009, Mahout provides the world with a set of freely available machine learning algorithms – algorithms that give computing systems at least a modicum of artificial intelligence, letting them adjust their behavior according to what's happened in the past. Inside O Labs, the idea was to use Mahout as a means of examining the news stories you've enjoyed in the past and then selecting stories you're likely to enjoy, well, right now.

>'We're saving $2 million a year with Mahout, and that never would have happened if not for the sort of experimental stuff we're doing in the labs. We're discovering things that can then have benefit across the company.'

— Judd Bagley

Mahout worked well – so well that Overstock decided it could be used to generate the product recommendations for users on its main website. The company was using a commercial recommendation system from a company called RichRelevance, but a few months ago, says Saum Noursalehi, who oversees O Labs, it replaced this system with an engine based on Mahout and a sister platform known as Hadoop, a hugely popular open source system that uses a sea of ordinary computer servers to process massive amounts of data.

The tale highlights the benefit of a blue-sky R&D operation. Overstock was founded in 1999 and went public in 2002, and Byrne – the company's swashbuckling chief exec – created O Labs about a year ago to feed a bit more of the entrepreneurial ethos back into the company. "We're saving $2 million a year with Mahout, and that never would have happened if not for the sort of experimental stuff we're doing in the labs," says Bagley. "We're discovering things that can then have benefit across the company."

But it also shows how Hadoop and related open source tools continue to evolve and push even further across the web and into businesses. Mahout – which was specifically built for use with Hadoop – is little more than 3 years old, and it has already attracted the attention of several big-name web operations, including not only Overstock, but AOL, Foursquare, Yahoo, Twitter, and even Amazon.

Originally bootstrapped by Yahoo and Facebook, Hadoop mimics two sweeping software platforms that Google built to underpin its search engine. It's widely used across the web, and now it's pushing into other businesses as well, thanks in part to Hadoop-minded software startups such as Cloudera and MapR. It can be used to analyze data, but it can also crunch massive amounts of data for use in live applications – such as the Overstock recommendations service.

Hadoop has also spawned a wide range of sister projects, including HBase, a database for storing particularly large amounts of information; Hive, a means of querying data crunched by Hadoop; ZooKeeper, a means of synchronizing Hadoop and other platforms across a large cluster of servers; and, yes, Mahout, one of the newer projects. Hadoop is named after a yellow stuffed elephant that belonged to the son of the project's founder, Doug Cutting, and the Mahout moniker plays off this bit of trivia. In India, a mahout is someone who rides an elephant.

According to Ted Dunning – a MapR engineer who works on the Mahout project – the project has been adopted by "dozens" of sites to help drive user recommendations, including Amazon, one of the companies that pioneered such recommendations more than a decade ago. It's unclear how Amazon is using Mahout, but according to a job listing on LinkedIn, it has been used by the team that oversees Amazon's "Personalization Platform" – i.e., the software platform used to personalize content across the site.

But Dunning is quick to point out that Mahout is still a young project. And it's important to realize that it is merely a library of algorithms – something you use to build larger applications. "It's not a product. It's not a package. It's not a service," he says. "Batteries are not included. And you will find rough corners. Various aspects of Mahout are better or worse in terms of code maturity. Some parts are literally student projects – and are really bad. Other parts are absolutely production quality."

So, even though Overstock is saving $2 million a year in dropping its commercial recommendations tool, its switch to Mahout did involve development costs. But Overstock's Saum Noursalehi tells us that the company built its system on its own – without paid help from the likes of MapR or Cloudera. The team that runs the project spans about six developers and a product manager.

According to Noursalehi, Hadoop logs everything that any Overstock customer does on the site, and then it feeds this data into a system based on Mahout. The Mahout library includes hundreds of algorithms, and Overstock is in the process of A/B testing many of these to determine which work the best. It's also starting to "cluster" recommendations, creating groups of people who are likely to respond to certain types of recommendations.
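To make the idea concrete, here is a minimal sketch of one common recommendation technique, item co-occurrence counting, written in plain Python. It is not Overstock's code and it does not use Mahout's API. The event log, the customer and product names, and the functions `build_cooccurrence` and `recommend` are all invented for illustration. A production system would run this kind of counting across a Hadoop cluster rather than in memory.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical activity log: (customer_id, product_id) pairs.
# A real log would hold far more events and live in a cluster.
EVENTS = [
    ("alice", "sofa"), ("alice", "rug"), ("alice", "lamp"),
    ("bob", "sofa"), ("bob", "rug"),
    ("carol", "rug"), ("carol", "lamp"),
    ("dave", "sofa"),
]


def build_cooccurrence(events):
    """Group products by customer, then count how often each pair of
    products shows up in the same customer's history."""
    baskets = defaultdict(set)
    for customer, product in events:
        baskets[customer].add(product)

    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return baskets, counts


def recommend(customer, baskets, counts, top_n=3):
    """Score unseen products by how often they co-occur with products
    the customer has already interacted with."""
    seen = baskets.get(customer, set())
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts[item].items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


baskets, counts = build_cooccurrence(EVENTS)
print(recommend("dave", baskets, counts))  # prints ['rug', 'lamp']
```

In this toy data, Dave has only looked at a sofa, and other sofa shoppers most often also looked at rugs, so the rug ranks first. Swapping in a different scoring rule is the kind of change an A/B test would compare.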

"You might find the people living in certain zip codes are high-income people," Noursalehi says, "and their recommendations might be slightly different than those we provide to people in other regions." Similarly, the company is looking to create clusters around members of its loyalty program or its most active customers.
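A rough sketch of that kind of segment-level tailoring might look like the following. Everything here is hypothetical: the segment labels, the purchase data, and the function `top_products_by_segment`. It also groups customers by a fixed label rather than a learned one; a clustering algorithm of the sort found in Mahout would discover the groups from the data instead.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from customer to a segment label
# (for example, one derived from zip code).
SEGMENT_BY_CUSTOMER = {
    "alice": "high_income_zip",
    "bob": "high_income_zip",
    "carol": "other",
    "dave": "other",
}

# Hypothetical purchase log: (customer_id, product_id) pairs.
PURCHASES = [
    ("alice", "espresso machine"),
    ("bob", "espresso machine"),
    ("bob", "wine rack"),
    ("carol", "coffee maker"),
    ("dave", "coffee maker"),
    ("dave", "wine rack"),
]


def top_products_by_segment(purchases, segment_by_customer, top_n=2):
    """Count purchases per segment and return each segment's best sellers."""
    counts = defaultdict(Counter)
    for customer, product in purchases:
        segment = segment_by_customer.get(customer, "unknown")
        counts[segment][product] += 1

    result = {}
    for segment, counter in counts.items():
        # Most purchased first; ties broken alphabetically.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        result[segment] = [product for product, _ in ranked[:top_n]]
    return result


print(top_products_by_segment(PURCHASES, SEGMENT_BY_CUSTOMER))
```

Each segment ends up with its own ranked list, which could then be blended with a customer's individual recommendations.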

In other words, Overstock is behaving like an online retail operation. The difference is that it's generating these online recommendations with open source algorithms.