Saturday, February 11, 2017

Recommendations with Apache Mahout

Recommendation?

Have you ever been recommended a friend on Facebook? Or visited a
shopping portal that shows items recommended for you, or seen an item
you might be interested in on Amazon? If so, then you've benefited from
the value of recommendation systems.

For example, you often see personalized recommendations phrased
something like, “If you liked that item, you might also like this
one...” These sites use recommendations to help drive users to other
things they offer in an intelligent, meaningful way, tailored
specifically to the user and the user’s preferences.

Recommendation
systems apply knowledge discovery techniques to the problem of making
recommendations that are personalized for each user. Recommendation
systems are one way we can use algorithms to help us sort through the
masses of information to find the “good stuff” in a very managed way.

From an algorithmic standpoint, the recommendation systems we’ll talk
about today belong to the k-nearest-neighbor family of approaches
(another type would be an SVD-based recommender). We want to predict the
estimated preference of a user towards an item they have never seen
before. We also want to generate a list of the items the user might be
most interested in, ranked by preference score. Two well-known styles of
recommendation algorithms are item-based recommenders and user-based
recommenders. Both types rely on a similarity function/metric (e.g.
Euclidean distance, log likelihood), computed either between users or
between items.
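
To make the idea of a similarity metric concrete, here is a minimal sketch of a Euclidean-distance-based similarity between two users. It is an illustration only: the function name, the sample users, and the 1 / (1 + distance) scaling are assumptions made for this example, not Mahout's exact implementation.

```python
import math


def euclidean_similarity(ratings_a, ratings_b):
    """Return a similarity in (0, 1] based on Euclidean distance.

    Only items rated by both users are compared. The 1 / (1 + d) scaling
    is an assumption for this sketch; it maps distance 0 to similarity 1.
    """
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0  # no co-rated items: treat as no evidence of similarity
    distance = math.sqrt(
        sum((ratings_a[item] - ratings_b[item]) ** 2 for item in shared)
    )
    return 1.0 / (1.0 + distance)


# Hypothetical users and ratings, keyed by made-up movie ids.
alice = {"m1": 5.0, "m2": 3.0, "m3": 4.0}
bob = {"m1": 5.0, "m2": 3.0, "m4": 2.0}
carol = {"m1": 1.0, "m2": 5.0}

print(euclidean_similarity(alice, bob))
print(euclidean_similarity(alice, carol))
```

Alice and Bob agree on every movie they both rated, so they come out as more similar than Alice and Carol. The same function works for item-to-item similarity if each dictionary maps user ids to the ratings a given item received.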

Overview of a recommendation engine

The main purpose of a recommendation engine is to make inferences on
existing data to show relationships between objects and entities.
Objects can be many things, including users, items, products (in short,
user-related data), and so on. Relationships provide a degree of
likeness or belonging between objects. For example, relationships can
represent ratings of how much a user likes an item, or indicate whether
a user bookmarked a particular page.

To make a recommendation, recommendation engines perform several steps
to mine the data (data mining). Initially, you begin with input data
that represents the objects as well as their relationships. Input data
consists of object identifiers and the relationships to other objects.

Consider the ratings users give to items. Using this input data, a
recommendation engine computes a similarity between objects. Computing
the similarity between objects (co-similarity) can take a great deal of
time, depending on the size of the data and the particular algorithm.
A distributed processing framework such as Apache Hadoop, used together
with Mahout, can parallelize the computation of the similarities. There
are different types of algorithms to compute similarities. Finally,
using the similarity information, the recommendation engine can answer
recommendation requests based on the parameters supplied.
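
The steps above (input data, similarity computation, recommendation) can be sketched in a few lines of single-machine Python. This is a toy illustration under stated assumptions: it uses plain co-occurrence counts as the item-to-item similarity and a weighted average of the user's own ratings as the predicted preference. The function names and the sample data are hypothetical, and this is not how Mahout distributes the work across MapReduce jobs.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(prefs):
    """Count, for every pair of items, how many users rated both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(prefs, user, top_n=3):
    """Rank items the user has not rated by estimated preference.

    The estimate is the user's own ratings averaged with co-occurrence
    counts as weights (an assumption made for this sketch).
    """
    counts = cooccurrence_counts(prefs)
    seen = prefs[user]
    scores = {}
    for candidate in counts:
        if candidate in seen:
            continue
        weights = [(counts[candidate].get(item, 0), rating)
                   for item, rating in seen.items()]
        total = sum(weight for weight, _ in weights)
        if total:
            scores[candidate] = sum(w * r for w, r in weights) / total
    # Highest score first; ties broken by item id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical input: user id -> {item id: rating}.
prefs = {
    1: {"A": 5, "B": 3},
    2: {"A": 4, "B": 2, "C": 5},
    3: {"B": 4, "C": 3},
}

print(recommend(prefs, 1))
```

Counting co-occurrences touches every pair of items in every user's history, which is the part that becomes expensive on large data sets and the part that benefits from being parallelized.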

For Example:

GroupLens Movie Data

The input data for this demo is based on 1M anonymous ratings that
6,040 MovieLens users gave to approximately 4,000 movies, which you can
download from the www.grouplens.org site. The zip file contains four
files:

movies.dat (movie ids with title and category)

ratings.dat (ratings of movies)

README

users.dat (user information)

The ratings file is most interesting to us since it’s the main input to our recommendation job. Each line has the format:

Ratings.dat description

UserID::MovieID::Rating::Timestamp

So let’s adjust our input file to match what we need to run our job. First download the file and unzip it locally from:

Next run the command:

tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv

This produces the csv output format we’ll use in the next section when we run our “Itembased Collaborative Filtering” job.
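
For readers who prefer to do the same conversion in a script, here is a rough Python equivalent of the tr/cut pipeline. It assumes every line follows the UserID::MovieID::Rating::Timestamp layout described above; the function names and the sample line are only illustrative.

```python
def ratings_line_to_csv(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,MovieID,Rating'."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return f"{user_id},{movie_id},{rating}"


def convert_ratings_file(src_path, dst_path):
    """Convert a whole ratings.dat-style file, skipping blank lines."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(ratings_line_to_csv(line) + "\n")


# Illustrative input line in the documented format.
print(ratings_line_to_csv("1::1193::5::978300760"))
```

Calling convert_ratings_file("ratings.dat", "ratings.csv") would write the same three-column file as the shell command, with the timestamp column dropped.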

hadoop fs -put [my_local_file] [user_file_location_in_hdfs]

This command puts the input file on HDFS.

Next, create a user.txt file that stores the IDs of the users for whom we want to generate recommendations.

Put it on HDFS under the users directory.

With our user list in HDFS we can now run the Mahout recommendation job with a command in the form of:

The job will run for a while (a chain of 10 MapReduce jobs) and then
write the item recommendations into HDFS, where we can take a look at
them. We can print the output from the RecommenderJob with the command:

hadoop fs -cat [output-hdfs-path]/part-r-00000

The output will show each user (as provided in user.txt) with the recommended items.
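
If you want to post-process the recommendations, a small parser helps. The sketch below assumes each output line has the form userID, a tab, then a bracketed list of itemID:score pairs, for example "3<TAB>[1:4.5,2:3.25]". That layout is an assumption about the job's text output, so check it against your own part-r-00000 file before relying on it. The function name and the sample line are hypothetical.

```python
def parse_recommendation_line(line):
    """Parse 'userID<TAB>[itemID:score,...]' into (user_id, [(item_id, score), ...])."""
    user_id, items = line.strip().split("\t")
    pairs = [pair for pair in items.strip("[]").split(",") if pair]
    recommendations = []
    for pair in pairs:
        item_id, score = pair.split(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_id), recommendations


# Made-up sample line in the assumed format.
user, recs = parse_recommendation_line("3\t[1:4.5,2:3.25]")
print(user, recs)
```

The item ids in the result can then be joined against movies.dat to show titles instead of numeric ids.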