
Collaborative Filtering : Implementation with Python!

Tuesday, November 10, 2009

Continuing the series of articles on recommendation engines, in this article I'm going to present an implementation of the collaborative filtering (CF) algorithm, which filters information for a user based on a collection of user profiles. Users with similar profiles may share similar interests, so for a given user, information can be filtered in or out according to the behavior of similar users.

User profiles can be collected either explicitly or implicitly. An explicit profile is built by asking users to rate what they have used or purchased. An implicit profile is based on passive observation and contains the user's historical interaction data.

The most common use of collaborative filtering is to make recommendations. That's why collaborative filtering is so closely associated with recommender systems in the literature.

The implementation shown here is in Python, so if you're not familiar with the programming language you can see more about it here. The advantage of using Python is that you can get things running with very few lines of code. Regardless of the underlying implementation, collaborative filters tend to solve the same broad problem using much the same data.

Generally you have a crowd of users, a big pile of items, and ratings that some of the users gave to some of the items (what they think of them). You want to suggest more items to a user, and you'd prefer your recommendations to be relevant to their likely interests. As you will see, the algorithm uses the opinions that people have recorded about items they have bought to make a good guess as to which items they haven't bought but might like.

The first step is to collect the users' preferences. My collaborative filtering implementation stores its data in two 2D matrices. Each user is a row, with a column for each item that he or she rated, as you can see in Figure 01 below.

Figure 01. The 2D Matrix User:Book:Rating

To keep things simple, let's represent our matrices as two levels of Python dict objects (a dictionary is simply a hash table, if you're not familiar with Python). The key of the first-level dict is a userID and the key of the second-level dict is a book title, so to get the rating that the user "Bryan" gave to the book "Classical Mythology" we look in the first-level dict for "Bryan", then in the second-level dict for "Classical Mythology". Our problem scope here will be book recommendations. The complete dataset can be downloaded for free at this link. Freely available for research, the Book-Crossing dataset contains 278,858 (anonymized) users providing 1,149,780 ratings (explicit and implicit) about 271,379 books, collected in a 4-week crawl (August/September 2004).
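To make the two-level lookup concrete, here is a minimal sketch. The users, titles and ratings are made up for illustration, and get_rating is a hypothetical helper that is not part of my implementation.

```python
# Hypothetical sample data: user names, titles and ratings are made up.
prefs = {
    "Bryan": {"Classical Mythology": 8.0, "Lost Symbol": 6.0},
    "Ana": {"Classical Mythology": 5.0},
}


def get_rating(prefs, user, book, default=None):
    """Return the rating `user` gave to `book`, or `default` if there is none."""
    # First level: user -> dict of that user's ratings.
    # Second level: book title -> rating.
    return prefs.get(user, {}).get(book, default)


rating = get_rating(prefs, "Bryan", "Classical Mythology")
missing = get_rating(prefs, "Ana", "Lost Symbol")
print(rating, missing)
```

Using dict.get at both levels avoids a KeyError for users or books that have no entry, which happens all the time with sparse rating data.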

In this article we will use only the data stored in Bx-books.csv and Bx-book-Ratings.csv, which contain the book identifiers and titles, and the ratings given by the users, respectively. To download the data already pre-processed, click here. If you prefer to do it all by yourself, I also provided some code (loadDataset) in the implementation. It's important to note that each user in this dataset is represented by a unique numeric identifier, to protect the users' privacy.
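If you want to build the dictionary yourself, the loading step could look roughly like the sketch below. This is not the loadDataset code from my implementation: the function name load_dataset, the semicolon delimiter and the column names ("ISBN", "Book-Title", "User-ID", "Book-Rating") are assumptions about the file layout, and the real files may need extra handling (encoding, malformed rows). The sample rows are made up.

```python
import csv
import io


def load_dataset(books_file, ratings_file):
    """Build {user_id: {title: rating}} from two semicolon-separated files.

    Column names are assumptions about the file layout.
    """
    # Map each ISBN to its title.
    titles = {}
    for row in csv.DictReader(books_file, delimiter=";"):
        titles[row["ISBN"]] = row["Book-Title"]

    prefs = {}
    for row in csv.DictReader(ratings_file, delimiter=";"):
        title = titles.get(row["ISBN"])
        if title is None:
            continue  # rating for a book that is missing from the books file
        prefs.setdefault(row["User-ID"], {})[title] = float(row["Book-Rating"])
    return prefs


# Tiny in-memory stand-ins for the two CSV files (made-up rows).
books = io.StringIO('"ISBN";"Book-Title"\n"111";"Fortune"\n"222";"Dune"\n')
ratings = io.StringIO(
    '"User-ID";"ISBN";"Book-Rating"\n'
    '"1001";"111";"6"\n'
    '"1002";"222";"9"\n'
    '"1002";"999";"3"\n'
)
sample = load_dataset(books, ratings)
print(sample)
```

With real files you would pass open file objects instead of the io.StringIO stand-ins.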

>>> from critics import *

>>> critics['228054']

{'Fortune': 6.0}

>>>

Once you have collected data about what the users like, you need a metric to determine how similar other users' tastes are to yours. To measure this, you compare each user with every other user using a similarity measure. There are several functions for this, but in this article I will use the Euclidean distance and the Pearson correlation. I am not going to explain the mathematics behind these measures, because plenty of information about them is easy to find. The basic idea is that the more similar two users' tastes are, the closer they are to each other in the preference space. Which one to use? It depends on your problem: test them all and check which one gives better results. Generally, the Pearson correlation gives slightly better results, since it shows how much the variables change together and so is less affected by users who consistently rate higher or lower than others. To play with them, check the implementation of the functions sim_pearson and sim_distance (the one based on the Euclidean distance). Those functions will be used as parameters of the functions defined in the rest of this article.

>>>recommendations.sim_distance(critics,'98556', '180727')

0.058823529411764705

>>>

>>> recommendations.sim_pearson(critics.critics,'180727', '177432')

0.6622661785325219

>>>
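For readers who want to see what such similarity functions can look like, here is a minimal sketch. It is not necessarily identical to the code in my implementation: in particular, turning the Euclidean distance into a similarity with 1 / (1 + sum of squared differences) is one common variant and is an assumption here. The sample ratings are made up.

```python
from math import sqrt


def sim_distance(prefs, a, b):
    """Euclidean-based similarity in (0, 1]; 0.0 if no items are shared."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared)
    # Assumed variant: identical ratings give 1.0, larger distances approach 0.
    return 1.0 / (1.0 + sum_sq)


def sim_pearson(prefs, a, b):
    """Pearson correlation over shared items, in [-1, 1]; 0.0 if undefined."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n == 0:
        return 0.0
    xs = [prefs[a][i] for i in shared]
    ys = [prefs[b][i] for i in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    if denom == 0:
        return 0.0  # one of the users gave the same rating to every shared item
    return cov / denom


# Made-up ratings: u2 always rates two points lower than u1.
sample = {
    "u1": {"A": 8.0, "B": 6.0, "C": 10.0},
    "u2": {"A": 6.0, "B": 4.0, "C": 8.0},
}
print(sim_distance(sample, "u1", "u2"))
print(sim_pearson(sample, "u1", "u2"))
```

In this sample the Euclidean-based score is low because the absolute ratings differ, while the Pearson score is at its maximum because the two users rank the books in exactly the same way.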

Now that we have similarity measures to compare two users, we can define another function that scores all users against a specified user and finds the most similar ones. In this particular case, the goal is to find users who rated the same books and have tastes similar to mine, so I know whom to ask for advice when I want to choose a book. The function topMatches returns a sorted list of the n users whose preferences are most similar to those of a specified user. With that list, I can look at the ratings given by the users most similar to me, see which books they rated, and choose new books from those.
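The ranking step described above can be sketched as follows. The names top_matches and sim_distance and the sample data are illustrative assumptions, not a copy of my implementation, and the similarity uses the assumed 1 / (1 + sum of squared differences) variant.

```python
def sim_distance(prefs, a, b):
    """Euclidean-based similarity (assumed variant); 0.0 if nothing is shared."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared)
    return 1.0 / (1.0 + sum_sq)


def top_matches(prefs, person, n=5, similarity=sim_distance):
    """Return the n (score, user) pairs most similar to `person`, best first."""
    scores = [
        (similarity(prefs, person, other), other)
        for other in prefs
        if other != person  # never compare a user with themselves
    ]
    scores.sort(reverse=True)
    return scores[:n]


# Made-up ratings for four hypothetical users.
sample = {
    "u1": {"A": 8.0, "B": 6.0},
    "u2": {"A": 8.0, "B": 7.0},
    "u3": {"A": 2.0, "B": 1.0},
    "u4": {"C": 5.0},
}
matches = top_matches(sample, "u1", n=2)
print(matches)
```

Passing the similarity function as a parameter makes it easy to swap one measure for the other without touching the ranking code.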

Finding someone similar whose recommendations I can read is great, but generally what we really want is a recommendation of books, not users. I could simply look at a similar user's profile and search for books that this user likes and I haven't read yet, but this is not so clever. This approach could pick a user who hasn't rated some of the books that I might like. It could also pick a user who liked a book that was rated badly by all the other users returned by topMatches. To solve those problems, you score each item with a weighted average, in which each user's rating is weighted by that user's similarity to me. The implementation of this item recommendation is simple and works with both similarity measures.

The function getRecommendations looks at every user except the one passed as a parameter. It calculates how similar each of them is to the specified user and then looks at each item rated by those users. As a result you get a ranked list of books along with the estimated rating that I would give to each book in it. This report lets me decide which books I want to read and which ones I'd rather skip. It's important to note that you can decide not to make any recommendation if no result reaches a threshold specified by the user.

>>> recommendations.getRecommendations(critics.critics,'180727')[0:3]

[(10.000000000000002, 'The Two Towers (The Lord of the Rings, Part 2)'),

(10.000000000000002, 'The Return of the King (The Lord of the Rings, Part 3)'),

(10.000000000000002, 'Hawaii')]

>>> recommendations.getRecommendations(critics.critics,'180727',

similarity=recommendations.sim_distance)[0:4]

[(10.000000000000002, 'Dune'), (10.000000000000002, 'Best Friends'),

(10.000000000000002, 'All Creatures Great and Small'),

(10.000000000000002, 'A Christmas Carol (Dover Thrift Editions)')]
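The weighted average behind these recommendations can be sketched as below. This is an illustration under stated assumptions rather than my exact code: the names get_recommendations and sim_distance, the optional min_score threshold parameter and the sample data are all hypothetical.

```python
def sim_distance(prefs, a, b):
    """Euclidean-based similarity (assumed variant); 0.0 if nothing is shared."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][i] - prefs[b][i]) ** 2 for i in shared)
    return 1.0 / (1.0 + sum_sq)


def get_recommendations(prefs, person, similarity=sim_distance, min_score=None):
    """Return (estimated_rating, item) pairs for items `person` has not rated."""
    totals = {}    # item -> sum of (rating * similarity)
    sim_sums = {}  # item -> sum of similarities of the users who rated it
    for other, ratings in prefs.items():
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        if sim <= 0:
            continue  # ignore users with no or negative similarity
        for item, rating in ratings.items():
            if item in prefs[person]:
                continue  # only score items the person has not rated yet
            totals[item] = totals.get(item, 0.0) + rating * sim
            sim_sums[item] = sim_sums.get(item, 0.0) + sim

    # Dividing by the similarity sum keeps the score on the rating scale.
    rankings = [(total / sim_sums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    if min_score is not None:
        # Hypothetical threshold: drop estimates below min_score.
        rankings = [(score, item) for score, item in rankings if score >= min_score]
    return rankings


# Made-up ratings for three hypothetical users.
sample = {
    "me": {"A": 8.0, "B": 6.0},
    "u2": {"A": 8.0, "B": 7.0, "C": 10.0, "D": 4.0},
    "u3": {"A": 6.0, "B": 6.0, "C": 4.0},
}
recs = get_recommendations(sample, "me")
print(recs)
```

Because each score is a sum of floating-point products divided by a sum of weights, small rounding errors are normal, which is why an estimate can come out as 10.000000000000002 instead of exactly 10.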

Now we know how to find similar users and recommend items to a user, but how about finding similar items? You see these recommendations at online stores, especially when the store hasn't collected much information about your preferences. One of the online stores that uses this type of recommendation is Amazon, as you can see here.

Figure 02. Amazon Web Store Recommendation System

In this case, you can evaluate similarity by looking for the users who liked a particular item and seeing which other items they liked as well. To do this, you can use the same functions defined earlier in this article; the only change is to swap the users and the items, so that you can find the items most similar to a specified item.

I provided a function transformPrefs to do that. It builds a new dictionary in which each key is a book name and each value holds the (user, rating) pairs for that book.
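A transformation like this takes only a few lines. The sketch below is an illustration of the idea, with the hypothetical name transform_prefs and made-up data, not a copy of my transformPrefs.

```python
def transform_prefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            # Create the inner dict for the item on first use.
            result.setdefault(item, {})[user] = rating
    return result


# Made-up ratings for two hypothetical users.
sample = {
    "u1": {"A": 8.0, "B": 6.0},
    "u2": {"A": 5.0},
}
by_item = transform_prefs(sample)
print(by_item)
```

Since the result has the same two-level shape as the original dictionary, it can be passed unchanged to the similarity and ranking functions to compare books instead of users.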

It's not so obvious that swapping users and items will lead to useful results, but in many cases it makes interesting comparisons possible. Imagine an online store that collects purchase histories in order to recommend products to particular people. By swapping people and products, you allow the system to recommend users who might buy specific products. This is very useful when the marketing department of your company wants to plan a big marketing effort for a clearance sale. It could also be used to check whether the recommended links shown on a web page are really seen by the users who are most likely to like them.

So that's it. I hope you enjoyed this article. As you can see, the recommendation engine using collaborative filtering shown here works well when you don't have a great amount of items or users. When you deal with a big store like Amazon, which has millions of users and items, comparing one user against all the others and then against each item they rated can be extremely slow. An alternative technique to get over this limitation is item-based filtering. It's very useful when you have a big dataset. This technique can give better results and allows many of the calculations to be done in advance, before a user asks for a recommendation, so the recommendations can be shown quickly.
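To give a rough idea of what "done in advance" means, the sketch below precomputes, for every item, the list of its most similar items. It is only an illustration under assumptions: the names build_item_similarities, transform_prefs and sim_distance and the sample data are hypothetical, and a real item-based recommender needs more than this table.

```python
def sim_distance(prefs, a, b):
    """Euclidean-based similarity (assumed variant); 0.0 if nothing is shared."""
    shared = [key for key in prefs[a] if key in prefs[b]]
    if not shared:
        return 0.0
    sum_sq = sum((prefs[a][k] - prefs[b][k]) ** 2 for k in shared)
    return 1.0 / (1.0 + sum_sq)


def transform_prefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for user, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[user] = rating
    return result


def build_item_similarities(prefs, n=10, similarity=sim_distance):
    """Precompute {item: [(score, other_item), ...]} with the n best matches."""
    item_prefs = transform_prefs(prefs)
    table = {}
    for item in item_prefs:
        scores = [
            (similarity(item_prefs, item, other), other)
            for other in item_prefs
            if other != item
        ]
        scores.sort(reverse=True)
        table[item] = scores[:n]
    return table


# Made-up ratings: both users rate A and B identically, and C differently.
sample = {
    "u1": {"A": 8.0, "B": 8.0, "C": 2.0},
    "u2": {"A": 6.0, "B": 6.0, "C": 9.0},
}
table = build_item_similarities(sample)
print(table["A"])
```

The table is still quadratic in the number of items to build, but it can be computed offline and reused for every user, since relations between items tend to change more slowly than a single user's ratings.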

You can download a copy of my sample collaborative filtering implementation as collaborative_recommendation.py. In the next article we will study the item-based filtering technique.

PS: I'm planning, together with my colleague Ricardo Caspirro, to develop a Python library for recommendations. We are very excited and planning great stuff for Python and recommendation systems enthusiasts! Expect great news soon!

Awesome post ! Is this algorithm a neighbourhood method or a latent factor models ? I understand that collaborative filtering has two different methods, either neighbourhood based or latent factor model based.

This entire write-up is a thinly veiled copy of content from Programming Collective Intelligence. It is pathetic that you did not acknowledge the source material or cite it in the references. Plagiarist.

https://pastebin.com/VtBkmi2S I'm doing a project on the same dataset used in this article, but I have a problem when running this function (pastebin above): the running time is really very long. To give you an idea, the first print, which displays 100 / len(ItemPrefs) (with len(ItemPrefs) corresponding to approximately 67381), takes about 60 seconds, and the whole calculation takes forever. What do you suggest?



Marcel Caraciolo

I am a Brazilian data scientist, entrepreneur, Python hacker and technology consultant. Nowadays I work with data-centric applications, especially in machine learning, recommender systems and bioinformatics. I am also interested in distributed computing, high-performance computing, data visualization, and educational and bioinformatics ventures.

Until 2013 I was the co-founder of two companies: Atepassar.com, a social network for students in Brazil, and PyCursos, an online startup for Python training and online courses. In 2014, I took a new position as CTO at Genomika Diagnósticos, a Brazilian genetic testing laboratory.