Software generalist, backend engineer, bit-herder, building distributed systems on the JVM for FullContact.

Mahout & Dropwizard: Collaborative Filtering & Recommenders

Recommenders are in use all throughout the web. Chances are, you’ve interacted with dozens of recommender systems through the course of simply finding this article.

Amazon is built on recommender systems.

Using techniques such as item-based recommenders and itemset generation, they can not only find what’s relevant to you, but also surface things related to what you’re browsing and items frequently bought together with yours (and perhaps offer a combo bonus to shake the cobwebs from your wallet).

As with any Machine Learning task, the first step is to define your problem. What are you trying to do? Who is your customer? Is it your Business Intelligence team? Is it the consumer browsing your webapp? Is it your fault detection system attempting to draw relations? Spike detection?

The problem I’m attempting to solve is a fairly common one. On a site (which will remain nameless for now) members rate threads containing media. For the sake of simplicity, let’s say the threads are albums posted by members along with their reviews of said albums. The members of this site are extremely opinionated, and want good recommendations.

The problem with a 1-5 star rating system is that it really doesn’t tell you anything about the content other than the perceived quality of the resource. It unfortunately loses a lot of resolution which might be important to your use case. Some members may enjoy an album for its rich harmonics and energetic, uplifting lyrics. Some may hate it because it’s a social commentary on the sad state of US Foreign Policy. Such users generally fall into cohorts of like-minded users who all tend to rate things on similar scales. Nickelback’s newest album could easily garner a rotten 1-star from all the Sabaton and Hammerfall fans.

The theory is to exploit this similarity between members to cluster them. Once clustered, the rated-item sets of all the users in a cluster can be overlaid, and with very basic set magic we can find content a user hasn’t seen before that, based on his cohort, could lead to a very interesting musical adventure for him. Absolutely fantastic! How about some details? This technique is known as collaborative filtering.

The first step is a neighborhood generation algorithm. A user’s neighborhood is an attempt to define that user’s cohort. Having a neighborhood allows you to sample from it and create a network. In Mahout, there are two implementations: ThresholdUserNeighborhood and NearestNUserNeighborhood. A threshold neighborhood includes every user whose similarity to you exceeds a fixed cutoff, while a nearest-N neighborhood takes the N users most similar to you, however similar they happen to be.
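Here’s a minimal sketch of wiring up both neighborhood flavors in Mahout. The CSV path, the 0.7 threshold, and the N of 25 are made-up example values, not the ones from my deployment:

```java
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class Neighborhoods {
    public static void main(String[] args) throws Exception {
        // ratings.csv: userID,itemID,rating -- one preference per line
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Everyone whose similarity to the target user is at least 0.7...
        UserNeighborhood byThreshold = new ThresholdUserNeighborhood(0.7, similarity, model);
        // ...versus the 25 most similar users, no matter how similar they are.
        UserNeighborhood byCount = new NearestNUserNeighborhood(25, similarity, model);
    }
}
```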

Similarity is a mathematical operation that takes two users (or items) and compares them in a quantitative fashion. You can read more on similarity metrics.

Pearson Similarity is a good choice for rating-based data. It’s the same algorithm used to determine whether two variables are correlated — look at a scatterplot: is it trending, or is it unfocused bullshit?

Pearson correlation determines if your data generally trends. We’ll exploit this to find users similar to each other, even the ones that consistently rate items at different baselines.
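To make that concrete, here’s a self-contained sketch of the Pearson correlation itself. The two rating arrays are invented example data: user B rates everything exactly one star higher than user A, yet the two trend identically, so their correlation is a perfect 1.0 — exactly the “different baselines” case we want to catch:

```java
// Pearson correlation over two users' ratings of the same set of items.
public class Pearson {
    static double correlation(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        // Covariance over the product of standard deviations.
        double num = 0, denX = 0, denY = 0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            num += dx * dy;
            denX += dx * dx;
            denY += dy * dy;
        }
        return num / Math.sqrt(denX * denY);
    }

    public static void main(String[] args) {
        // B is A shifted up a star, but they agree on what's good.
        double[] a = {1, 2, 3, 4, 5};
        double[] b = {2, 3, 4, 5, 6};
        System.out.println(Pearson.correlation(a, b)); // prints 1.0
    }
}
```

Mahout’s PearsonCorrelationSimilarity does this (plus the bookkeeping of finding co-rated items) for you.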

The Code

Such code can be written with a basic understanding of the algorithm and a head for details. Fortunately, all the hard work has been done for you in Apache Mahout. Mahout is a machine learning toolkit, a Swiss Army knife for getting stuff done, and doing it fast. Much of Mahout’s work can be farmed off to your favorite Hadoop cluster, which makes it ideal for a lot of data.

I don’t have a lot of data, a little over 50,000 rows, but it’s all real data with no known right answers. Another good dataset to play with is the GroupLens dataset, as the parameters of that data have been well established.

The next step is to build a basic evaluator. You can use this to explore the effect of different similarity measures, user neighborhoods, preference inferrers, and other techniques to improve your algorithm for your use case.
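A hold-out evaluation in Mahout can be sketched roughly like this: train on 90% of the data, score against the remaining 10%. The file name and the 0.7 threshold are placeholder values; the similarity and neighborhood inside the builder are exactly the knobs you’d swap out to compare configurations:

```java
import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;

public class Evaluate {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // The evaluator calls this builder with each random training split.
        RecommenderBuilder builder = trainingModel -> {
            PearsonCorrelationSimilarity similarity =
                    new PearsonCorrelationSimilarity(trainingModel);
            return new GenericUserBasedRecommender(trainingModel,
                    new ThresholdUserNeighborhood(0.7, similarity, trainingModel),
                    similarity);
        };

        // Average absolute difference between predicted and actual ratings;
        // lower is better. 0.9 = training fraction, 1.0 = evaluate all users.
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
        System.out.println("Average absolute difference: " + score);
    }
}
```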

We now have a decent recommender which appears to be performing well on the validation set. By not training with 100% of the data we can avoid model overfitting and see how well our model and data really fit the problem we’re trying to solve.

For my use case, Pearson worked better than the rest. Extend GenericUserBasedRecommender and we’re off to the races!
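Assembled, the winning configuration looks roughly like this. The user id 42 and the top-10 count are hypothetical examples:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class Recommend {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.7, similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 10 unseen items for user 42, with their estimated ratings.
        List<RecommendedItem> top = recommender.recommend(42L, 10);
        for (RecommendedItem item : top) {
            System.out.println(item.getItemID() + " @ " + item.getValue());
        }
    }
}
```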

The recommender works, but then the question becomes: how do you make it useful? In a Java application, the answer is relatively simple: just embed it. But this is vBulletin, a PHP forum package, so we need to interface the Java and PHP components. With that in mind, I set out to wrap the recommender in an HTTP interface usable cross-process and via XHR (AJAX) requests.

The first technology I tried was the mahout-integrations package. The mahout-integrations package contains a servlet useful for serving out recommendations as a plain-text response, XML, or JSON. The configuration is easy as well, and you can embed Jetty in a shaded jar with almost no issues. The code is pretty weak sauce, but if it works, it works, right?

Wrong. While the Jetty+Servlet approach is easy, I have the requirement of updating tastes on my recommender as well as periodically refreshing the data model without a full stop/start cycle of the application (during which time it’d be unavailable — unacceptable, this is life or death :) ). I set out to enhance the default servlet to support this behavior, only to be handily thwarted by final classes. My initial thought was to delegate to the recommender (hah).

Okay, I understand the Open/Closed Principle, but in practice it’s a headache and a half if the class isn’t written to support common operations. Normally I’d delegate to the servlet, but the recommender itself is marked private and I didn’t really want to deal with using the RecommenderSingleton to get the recommender again for an operation which should already be supported. In the end I grabbed the source code and modified it to my own needs, flying brazenly in the face of DRY code and OCP.

This worked in production, but I really didn’t like it. As my requirements shifted and I felt worse and worse about the code and the momentary downtimes (I needed to support faster reloading), I decided to rewrite the service in Dropwizard. Dropwizard is a high-productivity, lightweight collection of some of the best battle-tested libraries Java has to offer, wrapped up with a great environment manager and configuration manager. Mix in Coda Hale’s best-of-breed Metrics library and you’re all set. Like many of the most brilliant ideas, in hindsight Coda’s mashup seems as obvious as it is powerful.
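In Dropwizard, the servlet I fought with becomes a plain JAX-RS resource, and the refresh behavior I couldn’t bolt on is a few lines. This is a sketch, not my production code; the class name, paths, and the no-arg refresh are illustrative:

```java
import java.util.List;
import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.QueryParam;
import javax.ws.rs.core.MediaType;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

@Path("/recommend")
@Produces(MediaType.APPLICATION_JSON)
public class RecommendationResource {
    private final Recommender recommender;

    public RecommendationResource(Recommender recommender) {
        this.recommender = recommender;
    }

    @GET
    @Path("/{userId}")
    public List<RecommendedItem> recommend(@PathParam("userId") long userId,
                                           @QueryParam("count") @DefaultValue("10") int count)
            throws TasteException {
        return recommender.recommend(userId, count);
    }

    @POST
    @Path("/refresh")
    public void refresh() {
        // Re-reads the underlying data model without a stop/start cycle --
        // the capability the stock servlet's final classes made so hard to add.
        recommender.refresh(null);
    }
}
```

Register the resource in your Application’s run() method and Jersey, Jetty, and Metrics instrumentation come along for free.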