
Tuesday, October 22, 2013

Hi all,

At the last PythonBrasil I gave a tutorial about Python and data analysis focused on recommender systems, the main topic I've been studying for the last few years. There is a Python package called Pandas that is popular among statisticians and data scientists. I had watched several talks and keynotes about it, but had never tried it. The tutorial gave me this chance, and afterwards the audience and I felt quite excited about the potential and power that this library offers.

This post starts a series of articles that I will write about recommender systems, and it also serves as an introduction to the old-but-refreshed library that I am working on: Crab, a Python library for building recommender systems. :)

This post covers the first topic of the series, non-personalized recommender systems, with several examples using the Python package Pandas. In the future I will also post an alternative version of this post based on Crab, showing how it handles the same tasks.

But first let's introduce what Pandas is.

Introduction to Pandas

Pandas is a data analysis library for Python that is great for data preparation, joining and ultimately generating well-formed, tabular data that's easy to use in a variety of visualization tools or (as we will see here) machine learning applications. For further introduction about pandas, check this website or this notebook.

Non-personalized Recommenders

Non-personalized recommenders can recommend items to consumers based on what other consumers have said about the items on average. That is, the recommendations are independent of the customer, so each customer gets the same recommendation. For example, if you go to amazon.com as an anonymous user it shows items that are currently viewed by other members.

Generally the output comes in two flavours: predictions or recommendations. Predictions are simple statements expressed as scores, stars or counts. Recommendations, on the other hand, are generally just a list of items shown without any number associated with them.

Let's go through an example:

Simple Prediction using Average

On a scale of 1 to 5, the book Programming Collective Intelligence scored 4.5 stars.
This is an example of a simple prediction. It displays a simple average of the other customers' reviews of the book.
The math behind it is quite simple: the score is the sum of all ratings given to the book divided by the number of ratings.
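
As a minimal sketch of that average (the review scores below are made up for illustration and are not Amazon's data; the function name `average_rating` is also just an assumption of this example):

```python
def average_rating(ratings):
    """Return the arithmetic mean of a non-empty list of star ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(ratings) / len(ratings)


# Hypothetical review scores on a 1-to-5 scale.
reviews = [5, 4, 5, 4]
print(average_rating(reviews))  # 4.5
```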

On the same page, Amazon also displays the other books that customers bought after buying Programming Collective Intelligence: a list of recommended books presented to anyone who visits the product's page. This is an example of a recommendation.

But how did Amazon come up with those recommendations? There are several techniques that could be applied. One would be association rule mining, a data mining technique that generates a set of rules and combinations of items that were bought together. Or it could be a simple measure based on the proportion of customers who bought both X and Y among those who bought X. Let's explain using some maths:

Let X be the book Programming Collective Intelligence and let Y be any other book its buyers purchased. Compute the ratio given below for each book Y and sort the books in descending order of score. Finally, pick the top K books and show them as related. :D

Score(X, Y) = Total Customers who purchased X and Y / Total Customers who purchased X

Using this simple score function for all the books you will get:

Python for Data Analysis 100%

Startup Playbook 100%

MongoDB Definitive Guide 0%

Machine Learning for Hackers 0%
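
The score function can be sketched in a few lines of plain Python. The purchase baskets below are hypothetical: they were chosen only so that the counts reproduce the percentages listed above, and the helper names `simple_score` and `rank_related` are assumptions of this sketch.

```python
PCI = "Programming Collective Intelligence"

# Hypothetical baskets, one set of titles per customer.
BASKETS = [
    {PCI, "Python for Data Analysis", "Startup Playbook"},
    {PCI, "Python for Data Analysis", "Startup Playbook"},
    {"Python for Data Analysis", "Startup Playbook", "MongoDB Definitive Guide"},
    {"Startup Playbook", "Machine Learning for Hackers"},
    {"Startup Playbook", "MongoDB Definitive Guide"},
]


def simple_score(baskets, x, y):
    """Customers who purchased X and Y divided by customers who purchased X."""
    bought_x = [items for items in baskets if x in items]
    if not bought_x:
        return 0.0
    return sum(1 for items in bought_x if y in items) / len(bought_x)


def rank_related(baskets, x, k):
    """Return the top-k (title, score) pairs, highest score first."""
    titles = set().union(*baskets) - {x}
    scored = [(y, simple_score(baskets, x, y)) for y in titles]
    # Sort by descending score; ties are broken alphabetically by title.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


for title, score in rank_related(BASKETS, PCI, k=4):
    print(f"{title}: {score:.0%}")
```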

As we imagined, the book Python for Data Analysis makes perfect sense. But why did the book Startup Playbook come out on top when it has also been purchased by the customers who have not purchased Programming Collective Intelligence? This is a famous pitfall in e-commerce applications called the banana trap. Let's explain: in a grocery store, most customers will buy bananas. If someone buys a razor and a banana, you cannot conclude that the purchase of the razor influenced the purchase of the banana. Hence we need to adjust the math to handle this case as well. The modified version divides the previous score by the same ratio computed over the customers who did not purchase X:

Score(X, Y) = (Total Customers who purchased X and Y / Total Customers who purchased X) / (Total Customers who did not purchase X but got Y / Total Customers who did not purchase X)

Substituting the numbers, we get:

Python for Data Analysis = (2 / 2) / (1 / 3) = 1 / (1/3) = 3

Startup Playbook = (2 / 2) / (3 / 3) = 1 / 1 = 1

The denominator acts as a normalizer, and you can see that Python for Data Analysis clearly stands out. Interesting, isn't it?
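
Here is a rough sketch of the adjusted score in plain Python. The baskets are again hypothetical, picked so that the counts match the substitution above (2 buyers of X, 3 non-buyers). The name `adjusted_score` and the handling of a zero denominator are assumptions of this sketch, not part of the original formula.

```python
PCI = "Programming Collective Intelligence"

# Hypothetical baskets, one set of titles per customer.
BASKETS = [
    {PCI, "Python for Data Analysis", "Startup Playbook"},
    {PCI, "Python for Data Analysis", "Startup Playbook"},
    {"Python for Data Analysis", "Startup Playbook"},
    {"Startup Playbook"},
    {"Startup Playbook"},
]


def adjusted_score(baskets, x, y):
    """(X and Y / X) divided by (not X but Y / not X)."""
    with_x = [items for items in baskets if x in items]
    without_x = [items for items in baskets if x not in items]
    if not with_x or not without_x:
        raise ValueError("need customers both with and without X")
    both = sum(1 for items in with_x if y in items)
    y_only = sum(1 for items in without_x if y in items)
    if y_only == 0:
        # Assumption: Y is never bought without X, so the ratio is unbounded.
        return float("inf") if both else 0.0
    # Same ratio as in the text, rearranged to use a single division.
    return (both * len(without_x)) / (len(with_x) * y_only)


print(adjusted_score(BASKETS, PCI, "Python for Data Analysis"))  # 3.0
print(adjusted_score(BASKETS, PCI, "Startup Playbook"))          # 1.0
```

A score close to 1 means that buying X tells us little about buying Y, which is exactly the banana case.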

In the next article I will work more with non-personalized recommenders, presenting some ranking algorithms that I developed for Atepassar.com for ranking professors. :)

Examples with real dataset (let's play with CourseTalk dataset)

To present non-personalized recommenders, let's play with some data. I decided to crawl the data from Course Talk, a popular ranking site for MOOCs. It is an aggregator of several MOOCs where people can rate the courses and write reviews. The dataset is a snapshot taken on 10/11/2013 and is used here for study purposes only.

Let's use Pandas to read all the data and start showing what we can do with Python and present a list of top courses ranked by some non-personalized metrics :)
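
As a small taste of what such a ranking could look like, here is a sketch with Pandas. The ratings and the column names `course` and `rating` are toy placeholders invented for this example; they are not taken from the CourseTalk dataset, whose real layout may differ.

```python
import pandas as pd

# Toy ratings, NOT the CourseTalk data; column names are assumptions.
ratings = pd.DataFrame(
    {
        "course": ["A", "A", "A", "B", "B", "C"],
        "rating": [5, 4, 5, 5, 5, 3],
    }
)

# Mean rating and number of ratings per course, best courses first.
stats = (
    ratings.groupby("course")["rating"]
    .agg(["mean", "count"])
    .sort_values(["mean", "count"], ascending=False)
)

top_courses = stats.head(2)
print(top_courses)
```

Ranking by the plain mean favours courses with very few ratings, so in practice the `count` column is worth keeping an eye on.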

Update: For easier analysis, all the code is available as an IPython Notebook at the following link, hosted with nbviewer.

All the datasets and source code will be provided at Crab's GitHub. The idea is to build on those notebooks toward a future book about recommender systems :)

I hope you enjoyed this article, and stay tuned for the next one about another type of non-personalized recommender: ranking algorithms for vote up/vote down systems!


Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high performance and data visualization, and in educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and online courses. In 2014, I took on a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.