
Introduction to Recommendations with Map-Reduce and mrjob

Thursday, August 23, 2012

Hi all,

In this post I will show how we can use the map-reduce programming model to make recommendations. Recommender systems are quite popular among shopping sites and social networks these days. How do they do it? Generally, the user interaction data available from items and products on shopping sites and social networks is enough information to build a recommendation engine using classic techniques such as Collaborative Filtering.

Why Map-Reduce?

MapReduce is a framework originally developed at Google that allows easy large-scale distributed computing across a number of domains. Apache Hadoop is an open source implementation of it. It scales well to many thousands of nodes and can handle petabytes of data.

For recommendations, where we have to find products similar to the one you are interested in, we must calculate how similar pairs of items are. For instance, if someone watches the movie The Matrix, the recommender would suggest the film Blade Runner. So we need to compute the similarity between two movies. One way is to find the correlation between pairs of items. But if you own a shopping site with 500,000 products, we would potentially have to perform up to 250 billion (500,000 × 500,000) similarity computations. Besides the computation, the correlation data will be sparse, because it's unlikely that every pair of items will have some user interested in both of them. So we have a large and sparse dataset. We also have to deal with the temporal aspect: user interest in products changes over time, so the correlations must be recomputed periodically to keep the results up to date.

For these reasons, the best way to handle this scenario is to follow a divide and conquer pattern, and MapReduce is a powerful framework that can be used to implement data mining algorithms. You can take a look at this post about MapReduce or go to these video classes about Hadoop.

Map-Reduce Architecture

Meeting mrjob

mrjob is a Python package that helps you write and run Hadoop Streaming jobs. It supports Amazon's Elastic MapReduce (EMR) and also works with your own Hadoop cluster. It was released as an open-source framework by Yelp, and we will use it as our interface to Hadoop because of its readability and the ease with which it handles MapReduce tasks. Check this link to see how to download and use it.

Movie Similarities

Imagine that you own an online movie business and you want to offer movie recommendations to your clients. Your system runs a rating system, that is, people can rate movies with 1 to 5 stars, and we will assume for simplicity that all of the ratings are stored in a CSV file somewhere.

Our goal is to calculate how similar pairs of movies are, so that we recommend movies similar to movies you liked. Using the correlation we can:

For every pair of movies A and B, find all the people who rated both A and B.

Use these ratings to form a Movie A vector and a Movie B vector.

Calculate the correlation between those two vectors.

When someone watches a movie, recommend the movies most correlated with it.

So the first step is to get our movies file, which has three columns: (user, movie, rating). For this task we will use the MovieLens dataset of movie ratings, with 100,000 ratings from 1000 users on 1700 movies (you can download it at this link).

Here is a sample of the dataset file after normalization.

So let's start by reading the ratings into the MovieSimilarities job.
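As a rough illustration of this reading step, the sketch below parses one line of the normalized ratings file into a (user, movie, rating) tuple. The pipe delimiter and the function name `parse_rating` are assumptions made for this example, not necessarily what the MovieSimilarities job uses; adjust them to match your file.

```python
def parse_rating(line, delimiter="|"):
    """Split one 'user<delim>movie<delim>rating' line into typed fields.

    The delimiter is an assumption. Splitting once from the left and once
    from the right keeps movie titles intact even if they contain it.
    """
    user, rest = line.strip().split(delimiter, 1)
    movie, rating = rest.rsplit(delimiter, 1)
    return user, movie, float(rating)
```

Inside a mapper, each input line would be passed through a function like this before anything is emitted.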

You want to compute how similar pairs of movies are, so that if someone watches the movie The Matrix, you can recommend movies like Blade Runner. So how should you define the similarity between two movies?

One possibility is to compute their correlation. The basic idea is, for every pair of movies A and B, to find all the people who rated both A and B. Use these ratings to form a Movie A vector and a Movie B vector. Then calculate the correlation between these two vectors. Now when someone watches a movie, you can recommend them the movies most correlated with it.

So let's divide and conquer. Our first task is, for each user, to emit a row containing their 'postings' (item, rating). The reducer then emits each user's rating count and sum for use in later steps.
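This first step can be sketched in plain Python, without Hadoop or mrjob, to show the shape of the data. The function names and the in-memory `run_step` helper are hypothetical stand-ins for the mapper, the reducer and the shuffle phase; they are not the code from the original job.

```python
from collections import defaultdict


def group_by_user_mapper(user, movie, rating):
    """Map: key each rating by its user so one reducer sees all of a user's ratings."""
    yield user, (movie, rating)


def count_ratings_reducer(user, postings):
    """Reduce: emit the user's rating count, rating sum and list of postings."""
    postings = list(postings)
    rating_sum = sum(rating for _, rating in postings)
    yield user, (len(postings), rating_sum, postings)


def run_step(records):
    """Tiny in-memory stand-in for map, shuffle and reduce (not how Hadoop does it)."""
    grouped = defaultdict(list)
    for user, movie, rating in records:
        for key, value in group_by_user_mapper(user, movie, rating):
            grouped[key].append(value)
    output = {}
    for user, postings in grouped.items():
        for key, value in count_ratings_reducer(user, postings):
            output[key] = value
    return output
```

In a real job the framework does the grouping between the two functions; the helper only mimics it so the sketch can run on a handful of records.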

Before using these rating pairs to calculate the correlation, let's see how we can compute it. We know that the ratings can be arranged as vectors, so we can use linear algebra to compute norms and dot products, as well as the length of each vector or the sum over all elements in each vector. By representing them as matrices, we can perform several operations on those movies.
To summarize, each row in calculate similarity will compute the number of people who rated both movie and movie2 (n), the sum over all elements in each ratings vector (sum_x, sum_y), the sum of squares of each vector (sum_xx, sum_yy) and the dot product of the two vectors (sum_xy). So we can now calculate the correlation between the movies. The correlation can be expressed as:

correlation = (n * sum_xy - sum_x * sum_y) / (sqrt(n * sum_xx - sum_x^2) * sqrt(n * sum_yy - sum_y^2))
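A minimal sketch of this computation follows, assuming the per-user postings from the previous step are available as a dictionary. It accumulates the running sums for every pair of movies rated by the same user and then applies the standard Pearson correlation formula to them. The names `pair_statistics`, `correlation` and `sum_xy` (the dot product of the two rating vectors) are chosen for this example.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pair_statistics(user_postings):
    """Accumulate (n, sum_x, sum_y, sum_xx, sum_yy, sum_xy) for each movie pair.

    user_postings maps each user to a list of (movie, rating) tuples.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0, 0.0])
    for postings in user_postings.values():
        # Sort so that (A, B) and (B, A) always land on the same key.
        for (movie_x, x), (movie_y, y) in combinations(sorted(postings), 2):
            s = stats[(movie_x, movie_y)]
            s[0] += 1
            s[1] += x
            s[2] += y
            s[3] += x * x
            s[4] += y * y
            s[5] += x * y
    return {pair: tuple(values) for pair, values in stats.items()}


def correlation(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson correlation computed from running sums only."""
    numerator = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x ** 2
    var_y = n * sum_yy - sum_y ** 2
    if var_x <= 0 or var_y <= 0:
        # A constant rating vector has no defined correlation; report 0.
        return 0.0
    return numerator / sqrt(var_x * var_y)
```

Working from sums rather than from the full vectors is what makes the calculation fit the reduce step: each reducer only has to add up numbers per movie pair.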

So that's it! The last step of the job sorts the top-correlated items for each item and prints them to the output.
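One possible way to express that final ranking in plain Python is sketched below. It assumes the correlations are held in a dictionary keyed by movie pair; the function name `top_correlated` and the `limit` parameter are hypothetical, and the original job may order or truncate its output differently.

```python
import heapq
from collections import defaultdict


def top_correlated(pair_scores, limit=10):
    """Return, for each movie, its `limit` most correlated movies, best first.

    pair_scores maps (movie_a, movie_b) to a correlation value.
    """
    by_movie = defaultdict(list)
    for (movie_a, movie_b), score in pair_scores.items():
        # Similarity is symmetric, so register the pair under both movies.
        by_movie[movie_a].append((score, movie_b))
        by_movie[movie_b].append((score, movie_a))
    return {
        movie: heapq.nlargest(limit, candidates)
        for movie, candidates in by_movie.items()
    }
```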

So let's see the output. Here's a sample of the top output I got:

| MovieA | MovieB | Correlation |
| --- | --- | --- |
| Return of the Jedi (1983) | Empire Strikes Back, The (1980) | 0.787655 |
| Star Trek: The Motion Picture (1979) | Star Trek III: The Search for Spock (1984) | 0.758751 |
| Star Trek: Generations (1994) | Star Trek V: The Final Frontier (1989) | 0.72042 |
| Star Wars (1977) | Return of the Jedi (1983) | 0.687749 |
| Star Trek VI: The Undiscovered Country (1991) | Star Trek III: The Search for Spock (1984) | 0.635803 |
| Star Trek V: The Final Frontier (1989) | Star Trek III: The Search for Spock (1984) | 0.632764 |
| Star Trek: Generations (1994) | Star Trek: First Contact (1996) | 0.602729 |
| Star Trek: The Motion Picture (1979) | Star Trek: First Contact (1996) | 0.593454 |
| Star Trek: First Contact (1996) | Star Trek VI: The Undiscovered Country (1991) | 0.546233 |
| Star Trek V: The Final Frontier (1989) | Star Trek: Generations (1994) | 0.4693 |
| Star Trek: Generations (1994) | Star Trek: The Wrath of Khan (1982) | 0.424847 |
| Star Trek IV: The Voyage Home (1986) | Empire Strikes Back, The (1980) | 0.38947 |
| Star Trek III: The Search for Spock (1984) | Empire Strikes Back, The (1980) | 0.371294 |
| Star Trek IV: The Voyage Home (1986) | Star Trek VI: The Undiscovered Country (1991) | 0.360103 |
| Star Trek: The Wrath of Khan (1982) | Empire Strikes Back, The (1980) | 0.35366 |
| Stargate (1994) | Star Trek: Generations (1994) | 0.347169 |
| Star Trek VI: The Undiscovered Country (1991) | Empire Strikes Back, The (1980) | 0.340193 |
| Star Trek V: The Final Frontier (1989) | Stargate (1994) | 0.315828 |
| Star Trek: The Wrath of Khan (1982) | Star Trek VI: The Undiscovered Country (1991) | 0.222516 |
| Star Wars (1977) | Star Trek: Generations (1994) | 0.219273 |
| Star Trek V: The Final Frontier (1989) | Star Trek: The Wrath of Khan (1982) | 0.180544 |
| Stargate (1994) | Star Wars (1977) | 0.153285 |
| Star Trek V: The Final Frontier (1989) | Empire Strikes Back, The (1980) | 0.084117 |

As we would expect, we can notice that:

Star Trek movies are similar to other Star Trek movies.

People who like Star Trek movies are not such big fans of Star Wars, and vice versa.

Star Wars fans will always be fans! :D

The sci-fi movies are quite similar to each other.

Star Trek III: The Search for Spock (1984) is one of the best Star Trek movies (several positive correlations).

Let's see another dataset. What about book ratings? Let's look at this dataset of 1 million book ratings. Here's again a sample of it:

But now we want to compute other similarity measures besides correlation. Let's take a look at them.

Cosine Similarity

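Cosine similarity is another common vector-based similarity measure: the dot product of the two rating vectors divided by the product of their lengths.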
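Since the job already tracks the dot product and the sums of squares for each movie pair, cosine similarity can be derived from the same running sums. The following is a small sketch under that assumption; the function and parameter names are chosen for this example.

```python
from math import sqrt


def cosine_similarity(sum_xy, sum_xx, sum_yy):
    """Cosine of the angle between two rating vectors: dot(x, y) / (|x| * |y|)."""
    denominator = sqrt(sum_xx) * sqrt(sum_yy)
    # An all-zero vector has no direction; report 0 instead of dividing by zero.
    return sum_xy / denominator if denominator else 0.0
```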

Regularized Correlation

We could use regularized correlation by adding N virtual movie pairs that have zero correlation. This helps avoid noise if some movie pairs have very few raters in common.
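One common way to write this down is as a weighted average between the observed correlation and a prior, where the weight grows with the number of common raters. The sketch below assumes that formulation; `virtual_count` (the N virtual pairs) and `prior_correlation` are hypothetical parameter names and the default of 10 is arbitrary.

```python
def regularized_correlation(n, correlation, virtual_count=10, prior_correlation=0.0):
    """Shrink a correlation toward a prior when few users rated both movies.

    n is the number of common raters; virtual_count plays the role of the
    N virtual pairs, each contributing prior_correlation (zero by default).
    """
    if n + virtual_count == 0:
        return prior_correlation
    weight = n / (n + virtual_count)
    return weight * correlation + (1.0 - weight) * prior_correlation
```

With 10 virtual pairs, a correlation of 0.8 backed by only 10 common raters is halved, while the same value backed by thousands of raters is left almost unchanged.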

Jaccard
Implicit data can be useful too. In some cases the mere fact that you rated a Toy Story movie, even if you rated it quite horribly, suggests that you may still be interested in similar animated movies. So we can ignore the value of each rating and use a set-based similarity measure such as the Jaccard similarity.
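Treating each movie as the set of users who rated it, the Jaccard similarity is the size of the intersection divided by the size of the union. A minimal sketch working from counts, with parameter names chosen for this example:

```python
def jaccard_similarity(common_raters, raters_x, raters_y):
    """Jaccard similarity from counts: |X and Y| / |X or Y|.

    common_raters is the number of users who rated both movies; raters_x and
    raters_y are the total numbers of raters of each movie.
    """
    union = raters_x + raters_y - common_raters
    return common_raters / union if union else 0.0
```

The union needs the total number of raters of each movie, not just the users they have in common, so those totals have to be counted separately.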

Now, let's add all those similarities to our MapReduce job and make some adjustments by adding a new job that counts the number of raters for each movie. It is required for computing the Jaccard similarity.

Ok, let's take a look at the book similarities now with those new fields.

But is it possible to generalize our input and make our code generate similarities for different inputs? Yes, it is. Let's abstract our input. For this, we will create a VectorSimilarities class that represents input data in the following format:

So if we want to define a new input format, we just subclass the VectorSimilarities class and implement the input method.

So here's the class for the book recommendations using our new VectorSimilarities.

And here's the class for the movie recommendations. It simply reads from a data file and lets the VectorSimilarities superclass do the work.

Conclusions

As you noticed, map-reduce is a powerful technique for numerical computation, especially when you have to process large datasets. There are several optimizations I could make to those scripts, such as NumPy vectorization for computing the similarities. I will explore these features further in the next posts: one dealing with recommender systems and popular social networks, and another on how you can use the Amazon EMR infrastructure to run your jobs!

I'd like to thank Edwin Chen, whose post using these examples with Scala inspired me to explore them in Python.

All the code for the examples above can be downloaded from my GitHub repository.

When I was taught about MapReduce, one of the key components was the combiner. It is a step between the mapper and the reducer which essentially runs the reducer at the end of the map phase in order to decrease the number of lines of data that the mapper outputs. As the size of the data I need to process increases (at the multi-terabyte scale), the reduce step becomes prohibitively slow. I talked to a friend of mine and he says that this has been his experience too, and that instead of using a combiner, he partitions his reduce key using a hash function, which reduces the number of values that go to each key in the reduce step. I tried this and it worked. Has anyone else had this experience with the combiner step not scaling well, and why can't I find any documentation of this problem or of the workaround? I'd rather not use a workaround if there is a way to make the combiner step scale.


Hi, I am using the MovieLens dataset. Is there a way to convert it into a user-movie rating matrix, with users in the rows, movies in the columns, and the rating in the corresponding cell? Kindly let me know.


Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, and in educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and online courses. In 2014, I assumed a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.