Building a food recommendation engine with Spark / MLlib and Play

Recommendation engines have become very popular over the last decade with the explosion of e-commerce, on-demand music and movie services, dating sites, local reviews, news aggregation and advertising (behavioral targeting, intent targeting, …). Based on your past actions (e.g., purchases, reviews left, pages visited) or your interests (e.g., Facebook likes, Twitter follows), the recommendation engine presents other products that might interest you, drawing on other users' actions and behaviors (page clicks, page views, time spent on a page, clicks on images or reviews, …).

In this post, we’re going to implement a food recommender using Apache Spark and MLlib. For instance, if a user is interested in coffee products, we might recommend her other coffee brands, coffee filters, or related products that other users also like.

To make it more interactive, we implemented a simple web interface using the Play framework that presents a list of products for you to rate (à la Hot or Not, but for food). After rating a certain number of products, you can request recommendations by clicking the “Recommendation” tab at the top. This trains the recommender and then displays about 10 products that might interest you. Note that this can take a minute or so (this process is normally run offline).

As a training set, we use Amazon reviews from the Fine Foods section written between 2010 and 2012. These reviews can be downloaded from the SNAP Stanford website. The dataset contains about 500,000 reviews written by 256,000 users on 74,000 products. For the sake of simplicity, we keep only reviews left by users who wrote between 10 and 20 reviews, which leaves us with around 60,000 reviews.

You can run the script download.sh to download the file and convert it into a simple CSV (userId, productId, rating).
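Once the CSV is in place, each line can be parsed into a small review record. This is a minimal sketch (the `Review` case class and `parseLine` helper are illustrative names, not part of the actual script):

```scala
// Hypothetical parser for the (userId, productId, rating) CSV
// produced by download.sh; the field order matches the one above.
case class Review(userId: String, productId: String, rating: Double)

def parseLine(line: String): Option[Review] =
  line.split(",") match {
    case Array(u, p, r) => Some(Review(u.trim, p.trim, r.trim.toDouble))
    case _              => None // skip malformed lines
  }
```

Returning an `Option` lets malformed lines be dropped with a simple `flatMap` over the file.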

The Play application is pretty straightforward, with two main routes, /rating and /recommendation, that map to the Application class. The rating method picks a random product from the CSV and displays it to be rated; the rating is then stored in MongoDB.
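The corresponding `conf/routes` entries might look like the following sketch (the HTTP methods and action names are assumptions; only the paths and the `Application` class come from the description above):

```
# Hypothetical conf/routes sketch
GET     /rating            controllers.Application.rating
GET     /recommendation    controllers.Application.recommendation
```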

The ALS recommender accepts as input an RDD of Rating(user: Int, product: Int, rating: Double). Since all the IDs in the CSV are String identifiers, we create a simple Dictionary class to map each String to its position in an index.
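A minimal sketch of such a Dictionary, assuming it only needs to assign each distinct String a stable Int position and translate back (method names are illustrative):

```scala
// Maps String identifiers to dense Int indices, as required by
// MLlib's Rating(user: Int, product: Int, rating: Double).
class Dictionary(ids: Seq[String]) {
  private val items: IndexedSeq[String] = ids.distinct.toIndexedSeq
  private val positions: Map[String, Int] = items.zipWithIndex.toMap

  def indexOf(id: String): Int = positions(id) // String -> Int
  def idOf(index: Int): String = items(index)  // Int -> String
  def size: Int = items.size
}
```

Two dictionaries are needed in practice: one for user IDs and one for product IDs.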

Then, to train the recommender, we pass in the existing ratings from the CSV as well as the ratings that you left, and predict your ratings for all the products you haven’t rated. We then sort the predictions by rating and keep the top 10.
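This step can be sketched with the RDD-based spark.mllib API as follows. The function name, parameter defaults (rank, iterations, lambda) and wiring are assumptions for illustration, not the post's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Train on the historical ratings plus the ratings the current user
// just left, then predict that user's rating for every product she
// hasn't rated yet, keeping the 10 highest predictions.
def recommend(sc: SparkContext,
              historical: RDD[Rating],
              userRatings: Seq[Rating],
              userId: Int,
              allProducts: Seq[Int],
              rank: Int = 10,
              iterations: Int = 10,
              lambda: Double = 0.01): Seq[Rating] = {
  val training = historical.union(sc.parallelize(userRatings))
  val model = ALS.train(training, rank, iterations, lambda)

  // Candidate (user, product) pairs: everything this user hasn't rated.
  val rated = userRatings.map(_.product).toSet
  val candidates =
    sc.parallelize(allProducts.filterNot(rated).map(p => (userId, p)))

  // Highest predicted ratings first.
  model.predict(candidates).top(10)(Ordering.by(_.rating)).toSeq
}
```

Retraining the full model per request is what makes the minute-long delay mentioned above; in production this would run as an offline batch job.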

Conclusion

We showed in this post how to implement a simple recommender. We skipped the steps to determine the best values for lambda, the rank and the number of iterations using cross-validation, which are well explained on the official Spark tutorial pages on recommendations with MLlib. As one can see, the recommendation phase takes some time (around a minute for only 60,000 ratings) and thus cannot run in real time. It is generally computed offline, and recommendations are usually shown or sent by email with a slight delay. In the past 10 years, recommendation algorithms have improved to support incremental updates (Incremental Collaborative Filtering, or ICF) to provide real-time recommendations, which is particularly useful in advertising for real-time intent targeting. Indeed, presenting relevant ads to a user based on her immediate history has more impact than presenting ads based on her history from two hours ago.