"Mahout" is a Hindi term for a person who rides an elephant. The elephant, in this case, is Hadoop -- and Mahout is one of the many projects that can sit on top of Hadoop, although you do not always need MapReduce to run it.

Mahout puts powerful mathematical tools in the hands of the mere mortal developers who write the InterWebs. It's a package of implementations of the most popular and important machine-learning algorithms, with the majority of the implementations designed specifically to use Hadoop to enable scalable processing of huge data sets. Some algorithms are available only in a nonparallelizable "serial" form due to the nature of the algorithm, but all can take advantage of HDFS for convenient access to data in your Hadoop processing pipeline.

Machine learning is probably the most practical subset of artificial intelligence (AI), focusing on probabilistic and statistical learning techniques. For all you AI geeks, here are some of the machine-learning algorithms included with Mahout: K-means clustering, fuzzy K-means clustering, latent Dirichlet allocation, singular value decomposition, logistic regression, naive Bayes, and random forests. Mahout also features higher-level abstractions for generating "recommendations" (à la popular e-commerce sites or social networks).

I know, when someone starts talking machine learning, AI, and Tanimoto coefficients, you probably grab popcorn and perk up, right? Me neither. Oddly, despite the complexity of the math, Mahout has an easy-to-use API. Here's a taste:
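An item-based recommender in Mahout's Taste API looks roughly like this. This is a sketch against the classic (pre-0.9) Taste class names -- FileDataModel, LogLikelihoodSimilarity, GenericItemBasedRecommender -- and it assumes data.txt holds user/item/preference triples; you'd need the Mahout jars on your classpath to compile it:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommenderExample {
    public static void main(String[] args) throws Exception {
        // Load user/item/preference triples from a flat file.
        DataModel model = new FileDataModel(new File("data.txt"));
        // Log-likelihood is a reasonable default item-similarity metric.
        ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
        GenericItemBasedRecommender recommender =
            new GenericItemBasedRecommender(model, similarity);

        // For each item, fetch the 10 most similar items.
        LongPrimitiveIterator items = model.getItemIDs();
        while (items.hasNext()) {
            long itemId = items.nextLong();
            List<RecommendedItem> similar = recommender.mostSimilarItems(itemId, 10);
            System.out.println(itemId + " -> " + similar);
        }
    }
}
```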

What this little snip does is load a data file, cycle through the items, then get 10 recommended items based on their similarity. This is a common e-commerce task. However, just because two items are similar doesn't mean I want them both. In fact, in many cases I probably don't want to buy two similar items. I mean, I recently bought a bike -- I don't want the most similar item, which would be another bike. However, other users who bought bikes also bought tire pumps, so Mahout offers user-based recommenders as well.
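A user-based recommender reads almost the same way. Again, a sketch against the classic Taste classes, with an assumed similarity threshold of 0.1 and an assumed user ID of 2 -- tune both to your data:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("data.txt"));
        // Pearson correlation compares users by how their ratings co-vary.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood: users whose similarity exceeds the 0.1 threshold.
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, model);
        UserBasedRecommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for the user with ID 2.
        List<RecommendedItem> recommendations = recommender.recommend(2, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println(item);
        }
    }
}
```

Swap the neighborhood and similarity implementations, and you've changed the character of the recommender without touching the rest of the code.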

Both examples are very simple recommenders, and Mahout offers more advanced recommenders that take in more than a few factors and can balance user tastes against product features. None of these require advanced distributed computing, but Mahout has other algorithms that do.

Of course, the devil is in the details and I've glossed over the really important part, which is that very first line:

DataModel model = new FileDataModel(new File("data.txt"));

Hey, if you could get some math geeks to do all the work and reduce all of computing down to the 10 or so lines that compose the algorithm, we'd all be out of a job. However, how did that data get into the format the recommender needs? Designing that part of the pipeline is why developers make the big bucks, and even if Mahout doesn't need Hadoop to run many of its machine-learning algorithms, you might need Hadoop to get the data into the three columns the simple recommender requires.
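For the record, those three columns are plain comma-separated user/item/preference triples, one per line -- that's the flat format FileDataModel reads. The IDs and ratings below are invented for illustration:

```
1,101,5.0
1,102,3.0
2,101,2.5
2,103,4.5
3,102,4.0
```

Getting your clickstream, order history, or ratings data boiled down to lines like these is exactly the kind of job a MapReduce pass over HDFS is good for.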

Mahout is a great way to leverage a range of capabilities, from recommendation engines to pattern recognition to data mining. Once we as an industry get done with the big, fat Hadoop deploy, interest in machine learning, and possibly AI more generally, will explode, as one insightful commentator on my Hadoop article observed. Mahout will be there to help.

Andrew C. Oliver is a professional cat herder who moonlights as a software consultant. He is president and founder of Mammoth Data (formerly Open Software Integrators), a big data consulting firm based in Durham, N.C.