First steps with PredictionIO : a simple recommendation server

A recommendation system seems to be a must in this current era websites, where you want to keep the current visitor inside your website, by providing things that will hold him there.

Anyone can build a basic recommendation engine by joining a few tables in a relational databases, and start recommending an item A based on another item B by looking for similarities between these 2 items (common tags/categories, common keywords in name and description, etc, check ShowerHacks.com as example …).

If I said I watched The Dark Knight, the most 2 obvious recommendations you will give me will be the The Dark Knight Rises and Batman Begins. Lot of common keywords, tags, staff, and you obviously have a “sequel/prequel” relation between these movies.

This far, we were dealing with item-to-item recommendation. It’s the most easy recommendation you can implement, you’re just dealing with similarities between two entities.

Now, recommend me a third movie … Superman ? Because I like superheroes movie ? Or Inception, because I like the cast ? You can’t really decide without knowing my preferences. In this user-to-item relation, you have to know all my previous watched movies and behaviors (most viewed genre, actors, theme, etc …), before reaching a conclusion.

So, let’s try Superman … Which one ? The first, second, or the third ?

Here comes the machine learning system: you feed it users and items data, as well as their relations (like, rating, views, etc …), and it will predict the future, based on various algorithm.

Apache Mahout is among one of the popular and free machine learning library, written in Java. It’s used by some big names such as Amazon, Foursquare, Twitter, Yahoo, etc … It uses Hadoop as database, can be scaled, and can process a lot of data. Installing and managing these tools can be intimidating and frustrating, but PredictionIO helps us doing all these petty tasks. In the end, you’ll just have to install predictionIO and start it, all the hadoop and mahout stuff is hidden from you.

PredictionIO, an open source Machine Learning server

PredictionIO is a “one package” tool that will install and setup all the dependencies automatically, then start a tomcat server to expose a REST API, only gateway to your machine learning server. You can learn more about the server structure here.

The dashboard will be available at http://localhost:9000. You’re free to use another port by editing ADMIN_PORT in bin/common.sh. On my system, the port 9000 was already taken by php5-fpm.

This dashboard is the main advantage of using predictionIO compared with a vanilla Hadoop+Mahout installation, as it provides a neat web interface to organize and setup your engines. The REST API can also be consumed by everyone, regardless or your preferred programing language. A PHP, Ruby, Python and Java SDK are already available, and offer basic functions. You’re free to write your own, or implements more functions on top of existent one.

The dashboard is password protected, and you can create a user account easily with

bin/users

After login, you’ll be asked to create your first App.

You’ll obtain an App Key, used to for authenticate all API call.

Next step is to create engine. An engine predicts a relation between 2 entities. If you have some posts, movies, books, etc … one engine can only deal with 2 entities : user-movie, or user-book, or movie-book, etc … Although engines can deal with multiple items relations, staying with a two entities relation raise the accuracy of the prediction.

And as the users and items data are shared among all engines, you’re not losing anything.

There’s 2 kind of engines :

Item recommendation engine

Items similarity prediction engine

As of version 0.4, only Item Recommendation Engine is available. No ETA was given on the other and more interesting engine availability.

Each engines can be fine-tuned by choosing a different prediction algorithm.

The engine is now ready to predict the future. But before that, you need to input some user, item, and behavioral data to train the machine. The more data you’ll add, the more accurate your prediction will be. Sandra Cobain from BestForTheKids.com is helping me with this currently.

PredictionIO doc have some tutorials about building recommendation engine

As far as I know, the only way to input data in predictionIO is to use the API, so when adding 1 millions of data, have some fun with the for loop …

Alternatives

PredictionIO is still young and in development. There’s not much all-on-one free machine learning server out there.

The only other one I found is Myrrix, a similar product also based on Apache Mahout, but packaged as one .jar file.

Usage can not be easier, you just download and run the .jar, and your machine learning server is online. It also used a REST API for adding/editing data and to get predictions.

A server in Myrrix correspond to an engine in predictionIO. So, to have multiple engines, you’ll end up running multiple myrrix servers, on different port. Each server is isolated, so the user data in the user-movie server can not be shared with the user-book server.

Myrrix is also in development, and still in beta. Its website is very complete, with tons of examples, tutorials and use cases.

ResqueBoard is a web interface to monitor your Php Resque activities in realtime. It uses Square’s Cube (requires MongoDB and NodeJS) to collect, compute and stream datas/events/metrics in realtime. We all know that metrics are important, especially in a bigger-than-average application. I guess you don’t need resque at all if you have a blog, but […]