Non-Personalized Recommender systems with Pandas and Python

Tuesday, October 22, 2013

Hi all,

At the last PythonBrasil I gave a tutorial on Python and data analysis focused on recommender systems, the main topic I have been studying for the last few years. Pandas is a popular Python package among statisticians and data scientists. I had watched several talks and keynotes about it, but I had never tried it myself. The tutorial gave me that chance, and afterwards the audience and I felt quite excited about the potential and power this library offers.

This post starts a series of articles that I will write about recommender systems, and it also introduces the refreshed library that I am working on: Crab, a Python library for building recommender systems. :)

This post covers the first topic of the series: non-personalized recommender systems, with several examples using the Python package Pandas. In the future I will also post an alternative version of this post that uses Crab and shows how it handles the same tasks.

But first let's introduce what Pandas is.

Introduction to Pandas

Pandas is a data analysis library for Python that is great for data preparation, joining and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For a further introduction to Pandas, check this website or this notebook.

Non-personalized Recommenders

Non-personalized recommenders recommend items to customers based on what other customers have said about those items on average. That is, the recommendations are independent of the customer, so every customer gets the same recommendation. For example, if you go to amazon.com as an anonymous user, it shows items that other members are currently viewing.

Generally the output comes in two flavours: predictions or recommendations. Predictions are simple statements expressed as scores, stars or counts. Recommendations, on the other hand, are generally just a list of items shown without any number attached.

Let's go through an example:

Simple Prediction using Average

On a scale of 1 to 5, the book Programming Collective Intelligence scored 4.5 stars.
This is an example of a simple prediction. It displays a simple average of the other customers' reviews of the book.
The math behind it is quite simple:

Score(book) = Sum of all customer ratings for the book / Total number of ratings for the book
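As a minimal sketch of this kind of prediction, the snippet below averages a handful of star ratings. The ratings list and the name predict_score are made up for illustration; they are not real review data.

```python
from statistics import mean

# Hypothetical star ratings (1 to 5) left by customers for a single book.
ratings = [5, 4, 5, 4, 5, 4]


def predict_score(ratings):
    """Non-personalized prediction: the plain average of all ratings."""
    return mean(ratings)


score = predict_score(ratings)
print(f"{score:.1f} stars out of 5")
```

Every visitor sees the same number, because nothing in the calculation depends on who is looking at the page.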

On the same page, the site also displays the other books that customers bought after buying Programming Collective Intelligence: a list of recommended books presented to anyone who visits the product's page. This is an example of a recommendation.

But how did Amazon come up with those recommendations? There are several techniques that could be applied. One would be association rule mining, a data mining technique that generates a set of rules and combinations of items that were bought together. Or it could be a simple measure based on the proportion of customers who bought both x and y among those who bought x. Let's explain using some maths:

Let X be the book Programming Collective Intelligence and let Y be any other book purchased by the customers who bought X. You need to compute the ratio given below for each book Y and sort the books by score in descending order. Finally, pick the top K books and show them as related. :D

Score(X, Y) = Total Customers who purchased X and Y / Total Customers who purchased X
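The score above can be sketched in plain Python with one set of books per customer. The five baskets below are hypothetical; they were chosen only so that the counts agree with the numbers used in this post (two buyers of X, three non-buyers). The names purchases, score and related are illustrative as well.

```python
X = "Programming Collective Intelligence"

# Hypothetical purchase baskets: customer id -> set of books bought.
purchases = {
    "c1": {X, "Python for Data Analysis", "Startup Playbook"},
    "c2": {X, "Python for Data Analysis", "Startup Playbook"},
    "c3": {"Python for Data Analysis", "Startup Playbook"},
    "c4": {"Startup Playbook", "MongoDB Definitive Guide"},
    "c5": {"Startup Playbook", "Machine Learning for Hackers"},
}


def score(purchases, x, y):
    """Share of the customers who bought x that also bought y."""
    bought_x = [basket for basket in purchases.values() if x in basket]
    if not bought_x:
        return 0.0
    return sum(y in basket for basket in bought_x) / len(bought_x)


def related(purchases, x, k=3):
    """Top-k other books by score, ties broken alphabetically."""
    others = set().union(*purchases.values()) - {x}
    ranked = sorted(others, key=lambda y: (-score(purchases, x, y), y))
    return [(y, score(purchases, x, y)) for y in ranked[:k]]


for book, value in related(purchases, X, k=4):
    print(f"{book}: {value:.0%}")
```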

Using this simple score function for all the books you will get:

Python for Data Analysis 100%

Startup Playbook 100%

MongoDB Definitive Guide 0%

Machine Learning for Hackers 0%

As we imagined, the book Python for Data Analysis makes perfect sense. But why did Startup Playbook come out on top when it has also been purchased by customers who have not purchased Programming Collective Intelligence? This is a well-known pitfall in e-commerce applications called the banana trap. Let's explain: in a grocery store most customers will buy bananas. If someone buys a razor and a banana, you cannot conclude that the purchase of the razor influenced the purchase of the banana. Hence we need to adjust the math to handle this case as well. The modified version divides the original score by the same ratio computed over the customers who did not purchase X:

Score(X, Y) = (Total Customers who purchased X and Y / Total Customers who purchased X) / (Total Customers who did not purchase X but got Y / Total Customers who did not purchase X)

Substituting the numbers we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1 / 3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1 / 1 = 1

The denominator acts as a normalizer, and you can see that Python for Data Analysis clearly stands out. Interesting, isn't it?
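The substitution can be checked with a few lines of Python. This is only a sketch: the function name adjusted_score and its parameter names are hypothetical labels, and the counts passed in are the ones used in this post. Returning None when the ratio is undefined is my own choice, not part of the original formula.

```python
def adjusted_score(n_x_and_y, n_x, n_y_without_x, n_without_x):
    """Ratio of P(Y | bought X) to P(Y | did not buy X).

    Returns None when the ratio is undefined, for example when nobody
    outside the X buyers purchased Y (division by zero).
    """
    if n_x == 0 or n_without_x == 0 or n_y_without_x == 0:
        return None
    share_among_x = n_x_and_y / n_x
    share_among_others = n_y_without_x / n_without_x
    return share_among_x / share_among_others


# Counts taken from the worked example in the text.
python_for_data_analysis = adjusted_score(2, 2, 1, 3)
startup_playbook = adjusted_score(2, 2, 3, 3)

print(f"Python for Data Analysis: {python_for_data_analysis:.2f}")
print(f"Startup Playbook: {startup_playbook:.2f}")
```

A value close to 1 means that buying X tells you little about buying Y, which is exactly the banana case; values well above 1 suggest a real association.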

In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed for Atepassar.com for ranking professors. :)

Examples with real dataset (let's play with CourseTalk dataset)

To present non-personalized recommenders let's play with some data. I decided to crawl the data from Course Talk, a popular ranking site for MOOCs. It is an aggregator of several MOOCs where people can rate the courses and write reviews. The dataset is a snapshot taken on 10/11/2013 and it is used here for study purposes only.

Let's use Pandas to read all the data, show what we can do with Python, and present a list of top courses ranked by some non-personalized metrics :)
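As a rough idea of what such a ranking can look like in Pandas, here is a minimal sketch. It does not use the CourseTalk data: the tiny DataFrame, the column names course and rating, and the min_reviews cut-off are all assumptions made for illustration. With the real dataset the frame would be loaded from the crawled file instead of being typed inline.

```python
import pandas as pd

# Hypothetical stand-in for the crawled reviews (one row per review).
# The column names "course" and "rating" are assumptions, not the real schema.
reviews = pd.DataFrame(
    {
        "course": ["Course A", "Course A", "Course A",
                   "Course B", "Course B", "Course C"],
        "rating": [5.0, 4.0, 4.5, 3.0, 4.0, 5.0],
    }
)


def top_courses(reviews, min_reviews=2, k=10):
    """Rank courses by mean rating, ignoring courses with too few reviews."""
    stats = reviews.groupby("course")["rating"].agg(["mean", "count"])
    popular = stats[stats["count"] >= min_reviews]
    return popular.sort_values("mean", ascending=False).head(k)


ranking = top_courses(reviews)
print(ranking)
```

The minimum review count is there because a course with a single 5-star review would otherwise beat a course with hundreds of reviews averaging slightly lower.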

Update: for easier reading, I have hosted all the code in an IPython Notebook, viewable through nbviewer at the following link.

The whole dataset and the source code will be provided at Crab's GitHub. The idea is to build on those notebooks to produce a future book about recommender systems :)

I hope you enjoyed this article, and stay tuned for the next one about another type of non-personalized recommender: ranking algorithms for vote up/vote down systems!

78 comments:

I've been using pandas for a while now, it's really great for data management. The only downside is that pandas has limited out-of-core capabilities. My dataset is ~200GB big and I have to use a high-performance cluster to be able to use it with pandas. But apparently Wes McKinney is working on that (see his last post: http://wesmckinney.com/blog/?p=697).

Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, as well as educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and courses. In 2014 I took a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.