Person B is more likely to share A’s opinion on apples than some other random person is.

The implications of collaborative filtering are obvious: you can predict and recommend items to users based on preference similarities. There are two types of collaborative filtering: user-based and item-based.

Item Based Collaborative Filtering looks at the similarities between items’ consumption histories. User Based Collaborative Filtering considers the similarities between users’ consumption histories.

Case: Last.FM Music

The data set contains information about users, their gender, their age, and which artists they have listened to on Last.FM. We will not use the entire dataset. For simplicity’s sake we only use songs in Germany, and we will transform the data to an item frequency matrix. This means each row will represent a user, and each column will represent an artist. For this we use R’s “reshape” package. This is largely administrative, so we will start with the transformed dataset.

Item Based Collaborative Filtering

In item based collaborative filtering we do not really care about the users. So the first thing we should do is drop the user column from our data. This is really easy since it is the first column, but if it was not the first column we would still be able to drop it with the following code:
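The line in question, taken from the full listing at the end of the post, is shown here on a small made-up data frame. The toy values and artist names are assumptions for illustration only; the real `data.germany` comes from the CSV file.

```r
# Toy stand-in for data.germany: one row per user, one 0/1 column per artist
# (the values below are made up for illustration)
data.germany <- data.frame(
  user  = c(1, 33, 42),
  abba  = c(0, 1, 1),
  ac.dc = c(1, 0, 1)
)

# Drop the user column by name, wherever it sits in the data frame
data.germany.ibs <- data.germany[, !(names(data.germany) %in% c("user"))]

print(colnames(data.germany.ibs))
```

Selecting by name rather than by position is what makes this work regardless of where the user column sits.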

We then want to calculate the similarity of each song with the rest of the songs. This means that we want to compare each column in our “data.germany.ibs” data set with every other column in the data set. Specifically, we will compute what is known as the “Cosine Similarity” between them.

The cosine similarity, in essence, takes the sum product of the first and second columns and divides it by the product of the square roots of the sums of squares of each column (that was a mouthful!).
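Written out, the description above corresponds to the standard cosine formula. The symbols are notation introduced here, not the post's: x and y are two columns, and u runs over the users (rows).

```latex
\[
\cos(x, y) \;=\; \frac{\sum_{u} x_u \, y_u}{\sqrt{\sum_{u} x_u^{2}} \; \sqrt{\sum_{u} y_u^{2}}}
\]
```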

The important thing to know is that the resulting number represents how “similar” the first column is to the second column. We will use the following helper function to produce the Cosine Similarity:
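The helper itself appears in the full listing at the end of the post. It is repeated here with a small worked example; the two vectors `a` and `b` are made up for illustration.

```r
# Helper from the full listing: cosine similarity between two numeric vectors
getCosine <- function(x, y) {
  this.cosine <- sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))
  return(this.cosine)
}

# Two made-up listening columns (1 = listened, 0 = did not)
a <- c(1, 0, 1, 1)
b <- c(1, 1, 0, 1)

# sum(a * b) is 2 and each column has a sum of squares of 3,
# so the similarity is 2 / (sqrt(3) * sqrt(3)) = 2/3
print(getCosine(a, b))
```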

We are now ready to start comparing each of our songs (items). We first need a placeholder to store the results of our cosine similarities. This placeholder will have the songs in both columns and rows:
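A sketch of that placeholder, mirroring the `holder` matrix in the full listing at the end of the post. The three-artist data frame is a made-up stand-in for the real `data.germany.ibs`.

```r
# Toy item matrix standing in for data.germany.ibs (values are made up)
data.germany.ibs <- data.frame(
  abba  = c(0, 1, 1),
  ac.dc = c(1, 0, 1),
  adele = c(1, 1, 0)
)

# Empty item-by-item matrix with the item names on both rows and columns
data.germany.ibs.similarity <- matrix(
  NA,
  nrow = ncol(data.germany.ibs),
  ncol = ncol(data.germany.ibs),
  dimnames = list(colnames(data.germany.ibs), colnames(data.germany.ibs))
)

print(dim(data.germany.ibs.similarity))
```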

Perfect, all that’s left is to loop column by column and calculate the cosine similarities with our helper function, and then put the results into the placeholder data table. That sounds like a pretty straight-forward nested for-loop:

# Let's fill in those empty spaces with cosine similarities
# Loop through the columns
for(i in 1:ncol(data.germany.ibs)) {
  # Loop through the columns for each column
  for(j in 1:ncol(data.germany.ibs)) {
    # Fill in placeholder with cosine similarities
    data.germany.ibs.similarity[i,j] <- getCosine(as.matrix(data.germany.ibs[i]),as.matrix(data.germany.ibs[j]))
  }
}
# Back to dataframe
data.germany.ibs.similarity <- as.data.frame(data.germany.ibs.similarity)

Note: For loops in R are infernally slow. We use as.matrix() to transform the columns into matrices since matrix operations run a lot faster. We transform the similarity matrix into a data.frame for later processes that we will use.
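For readers bothered by the speed of the nested loop, one possible alternative (not the approach used in this post) is to compute every pairwise cosine at once with matrix algebra. The sketch below uses base R's `crossprod()` and `outer()` on a made-up item matrix and checks the result against the loop version.

```r
getCosine <- function(x, y) sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))

# Toy item matrix standing in for data.germany.ibs (values are made up)
data.germany.ibs <- data.frame(
  abba  = c(0, 1, 1, 0),
  ac.dc = c(1, 0, 1, 1),
  adele = c(1, 1, 0, 1)
)
m <- as.matrix(data.germany.ibs)

# All pairwise dot products at once: t(m) %*% m
dot.products <- crossprod(m)
# The diagonal holds each column's sum of squares
norms <- sqrt(diag(dot.products))
# Divide each dot product by the product of the two column norms
similarity.vectorised <- dot.products / outer(norms, norms)

# Loop version, for comparison only
similarity.loop <- matrix(NA, ncol(m), ncol(m), dimnames = list(colnames(m), colnames(m)))
for (i in 1:ncol(m)) {
  for (j in 1:ncol(m)) {
    similarity.loop[i, j] <- getCosine(m[, i], m[, j])
  }
}

print(max(abs(similarity.vectorised - similarity.loop)))
```

As with the loop, a column that is all zeros has a norm of zero, so its similarities come out as NaN in both versions.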

We have our similarity matrix. Now the question is … so what?

We are now in a position to make recommendations! We look at the top 10 neighbours of each song – those would be the recommendations we make to people listening to those songs.

We start off by creating a placeholder:

# Get the top 10 neighbours for each
data.germany.neighbours <- matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,dimnames=list(colnames(data.germany.ibs.similarity)))
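The loop that fills this placeholder is only shown in the full listing at the end of the post. The sketch below reuses that expression on a made-up four-artist similarity table. It keeps 3 columns (the item itself plus 2 neighbours) instead of the post's 11, because the toy table has only four items; `n.keep` and `items` are names introduced here.

```r
# Made-up symmetric similarity table standing in for data.germany.ibs.similarity
items <- c("abba", "ac.dc", "adele", "air")
data.germany.ibs.similarity <- as.data.frame(matrix(
  c(1.0, 0.2, 0.5, 0.1,
    0.2, 1.0, 0.3, 0.6,
    0.5, 0.3, 1.0, 0.4,
    0.1, 0.6, 0.4, 1.0),
  nrow = 4, dimnames = list(items, items)
))

n.keep <- 3  # the item itself plus its top 2 neighbours (the post uses 11)
data.germany.neighbours <- matrix(
  NA, nrow = ncol(data.germany.ibs.similarity), ncol = n.keep,
  dimnames = list(colnames(data.germany.ibs.similarity), NULL)
)

for (i in 1:ncol(data.germany.ibs.similarity)) {
  # Sort the rows by column i, then keep the names of the first n.keep rows
  data.germany.neighbours[i, ] <- t(head(n = n.keep, rownames(
    data.germany.ibs.similarity[order(data.germany.ibs.similarity[, i], decreasing = TRUE), ][i]
  )))
}

print(data.germany.neighbours)
```

The first column is always the item itself, since every item has a similarity of 1 with itself. That is why the placeholder has 11 columns for 10 neighbours, and why the full listing writes out `data.germany.neighbours[,-1]`.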

The rest is one big ugly nested loop. First the loop, then we will break it down step by step:

# Loop through the users (rows)
for(i in 1:nrow(holder))
{
  # Loop through the products (columns)
  for(j in 1:ncol(holder))
  {
    # Get the user's name and the product's name
    # We look them up by name so that differently sorted vectors do not cause mismatches
    user <- rownames(holder)[i]
    product <- colnames(holder)[j]
    # We do not want to recommend products you have already consumed
    # If you have already consumed it, we store an empty string
    if(as.integer(data.germany[data.germany$user==user,product]) == 1)
    {
      holder[i,j]<-""
    } else {
      # We first have to get a product's top 10 neighbours sorted by similarity
      topN<-((head(n=11,(data.germany.ibs.similarity[order(data.germany.ibs.similarity[,product],decreasing=TRUE),][product]))))
      topN.names <- as.character(rownames(topN))
      topN.similarities <- as.numeric(topN[,1])
      # Drop the first one because it will always be the same song
      topN.similarities<-topN.similarities[-1]
      topN.names<-topN.names[-1]
      # We then get the user's purchase history for those 10 items
      topN.purchases<- data.germany[,c("user",topN.names)]
      topN.userPurchases<-topN.purchases[topN.purchases$user==user,]
      topN.userPurchases <- as.numeric(topN.userPurchases[!(names(topN.userPurchases) %in% c("user"))])
      # We then calculate the score for that product and that user
      holder[i,j]<-getScore(similarities=topN.similarities,history=topN.userPurchases)
    } # close else statement
  } # end product for loop
} # end user for loop
data.germany.user.scores <- holder

The loop starts by taking each user (row) and then jumps into another loop that takes each column (artists).
We then store the user’s name and artist name in variables to use them easily later.
We then use an if statement to filter out artists that a user has already listened to – this is a business case decision.

The next bit gets the item based similarity scores for the artist under consideration.

# We first have to get a product's top 10 neighbours sorted by similarity
topN<-((head(n=11,(data.germany.ibs.similarity[order(data.germany.ibs.similarity[,product],decreasing=TRUE),][product]))))
topN.names <- as.character(rownames(topN))
topN.similarities <- as.numeric(topN[,1])
# Drop the first one because it will always be the same song
topN.similarities<-topN.similarities[-1]
topN.names<-topN.names[-1]

It is important to note that the number of neighbours you pick matters. We pick the top 10.
We store the similarity scores and song names.
We also drop the first entry because, as we saw, it always represents the same song.

We’re almost there. We just need the user’s purchase history for the top 10 songs.

# We then get the user's purchase history for those 10 items
topN.purchases<- data.germany[,c("user",topN.names)]
topN.userPurchases<-topN.purchases[topN.purchases$user==user,]
topN.userPurchases <- as.numeric(topN.userPurchases[!(names(topN.userPurchases) %in% c("user"))])

We use the original data set to look up the user’s purchase history for the song’s top 10 neighbours.
We keep only the row that matches the current user in the loop and then drop the user column, leaving just the purchase flags.

We are now ready to calculate the score and store it in our holder matrix:

# We then calculate the score for that product and that user
holder[i,j]<-getScore(similarities=topN.similarities,history=topN.userPurchases)
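The `getScore` helper is defined only in the full listing at the end of the post. It is repeated here with a small worked example. The three similarities and purchase flags are made up, and the post itself uses ten neighbours rather than three.

```r
# Helper from the full listing: similarity-weighted average of the purchase history
getScore <- function(history, similarities) {
  x <- sum(history * similarities) / sum(similarities)
  x
}

# Made-up values for three neighbours of the product being scored
topN.similarities  <- c(0.5, 0.3, 0.2)  # similarity of each neighbour to the product
topN.userPurchases <- c(1, 0, 1)        # 1 = the user consumed that neighbour

# (0.5*1 + 0.3*0 + 0.2*1) / (0.5 + 0.3 + 0.2) = 0.7
score <- getScore(similarities = topN.similarities, history = topN.userPurchases)
print(score)
```

With 0/1 histories the score lies between 0 and 1, and it rises when the user has consumed the neighbours that are most similar to the product.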

References

This case is based on Professor Miguel Canela’s “Designing a music recommendation app”.

Entire Code

# Admin stuff here, nothing special
options(digits=4)
data <- read.csv(file="lastfm-data.csv")
data.germany <- read.csv(file="lastfm-matrix-germany.csv")
############################
# Item Based Similarity #
############################
# Drop the user column and make a new data frame
data.germany.ibs <- (data.germany[,!(names(data.germany) %in% c("user"))])
# Create a helper function to calculate the cosine between two vectors
getCosine <- function(x,y)
{
  this.cosine <- sum(x*y) / (sqrt(sum(x*x)) * sqrt(sum(y*y)))
  return(this.cosine)
}
# Create a placeholder dataframe listing item vs. item
holder <- matrix(NA, nrow=ncol(data.germany.ibs),ncol=ncol(data.germany.ibs),dimnames=list(colnames(data.germany.ibs),colnames(data.germany.ibs)))
data.germany.ibs.similarity <- as.data.frame(holder)
# Let's fill in those empty spaces with cosine similarities
for(i in 1:ncol(data.germany.ibs)) {
  for(j in 1:ncol(data.germany.ibs)) {
    data.germany.ibs.similarity[i,j]= getCosine(data.germany.ibs[i],data.germany.ibs[j])
  }
}
# Output similarity results to a file
write.csv(data.germany.ibs.similarity,file="final-germany-similarity.csv")
# Get the top 10 neighbours for each
data.germany.neighbours <- matrix(NA, nrow=ncol(data.germany.ibs.similarity),ncol=11,dimnames=list(colnames(data.germany.ibs.similarity)))
for(i in 1:ncol(data.germany.ibs))
{
  data.germany.neighbours[i,] <- (t(head(n=11,rownames(data.germany.ibs.similarity[order(data.germany.ibs.similarity[,i],decreasing=TRUE),][i]))))
}
# Output neighbour results to a file
write.csv(file="final-germany-item-neighbours.csv",x=data.germany.neighbours[,-1])
############################
# User Scores Matrix #
############################
# Process:
# Choose a product, see if the user purchased a product
# Get the similarities of that product's top 10 neighbours
# Get the purchase record of that user of the top 10 neighbours
# Do the formula: sumproduct(purchaseHistory, similarities)/sum(similarities)
# Let's make a helper function to calculate the scores
getScore <- function(history, similarities)
{
  x <- sum(history*similarities)/sum(similarities)
  x
}
# A placeholder matrix
holder <- matrix(NA, nrow=nrow(data.germany),ncol=ncol(data.germany)-1,dimnames=list((data.germany$user),colnames(data.germany[-1])))
# Loop through the users (rows)
for(i in 1:nrow(holder))
{
  # Loop through the products (columns)
  for(j in 1:ncol(holder))
  {
    # Get the user's name and the product's name
    # We look them up by name so that differently sorted vectors do not cause mismatches
    user <- rownames(holder)[i]
    product <- colnames(holder)[j]
    # We do not want to recommend products you have already consumed
    # If you have already consumed it, we store an empty string
    if(as.integer(data.germany[data.germany$user==user,product]) == 1)
    {
      holder[i,j]<-""
    } else {
      # We first have to get a product's top 10 neighbours sorted by similarity
      topN<-((head(n=11,(data.germany.ibs.similarity[order(data.germany.ibs.similarity[,product],decreasing=TRUE),][product]))))
      topN.names <- as.character(rownames(topN))
      topN.similarities <- as.numeric(topN[,1])
      # Drop the first one because it will always be the same song
      topN.similarities<-topN.similarities[-1]
      topN.names<-topN.names[-1]
      # We then get the user's purchase history for those 10 items
      topN.purchases<- data.germany[,c("user",topN.names)]
      topN.userPurchases<-topN.purchases[topN.purchases$user==user,]
      topN.userPurchases <- as.numeric(topN.userPurchases[!(names(topN.userPurchases) %in% c("user"))])
      # We then calculate the score for that product and that user
      holder[i,j]<-getScore(similarities=topN.similarities,history=topN.userPurchases)
    } # close else statement
  } # end product for loop
} # end user for loop
# Output the results to a file
data.germany.user.scores <- holder
write.csv(file="final-user-scores.csv",data.germany.user.scores)
# Let's make our recommendations pretty
data.germany.user.scores.holder <- matrix(NA, nrow=nrow(data.germany.user.scores),ncol=100,dimnames=list(rownames(data.germany.user.scores)))
for(i in 1:nrow(data.germany.user.scores))
{
  data.germany.user.scores.holder[i,] <- names(head(n=100,(data.germany.user.scores[,order(data.germany.user.scores[i,],decreasing=TRUE)])[i,]))
}
# Write output to file
write.csv(file="final-user-recommendations.csv",data.germany.user.scores.holder)

84 Comments on “Collaborative Filtering with R”

Thanks for the clear explanation. I have a few queries regarding similarity measures. The example that you considered here is implicit feedback information, so it only contains 0’s and 1’s (1 – user purchased a product; 0 – user didn’t purchase a product). My doubt is: when we have categorical data, does cosine (or) Pearson’s correlation coefficient similarity make any sense?

Salem, thank you so much for your detailed explanation of collaborative filtering.

Here, you have dealt with data which is binary (0s and 1s). However, what if the recommendations are to be made on the basis of data which gives the number of times an user listened to a particular artist? Not only 0s and 1s but 0 to any range. It will range up to the number of times that particular user listened to that artist. Maybe this scenario does not make much sense in a music recommendation app, but lets say we have purchase or sales data? One user might have a very high affinity to a certain product or certain groups of product. Does the reco engine change in that case?

I have tried running your code with the provided data set. Do you have any tips, advice or comments on how to make it run faster? Lets say in cases where there are over 10,000 users! In that scenario does the process of K-Means clustering come into place? If so, could you briefly describe how to go about such a case?

There is no reason why you cannot create a matrix of the number of times a user listened to an artist and create a similarity matrix from that – the scores will adjust based on the similarities. It is extremely slow because of the for loop that I use in the code. For loops are the arch enemy of R 🙁 If you have a large dataset I would recommend you port (part of) the code to Python, which runs much faster. There is probably a smarter way to do the for loop bits.

To be honest I do not know how to improve the performance of the for loops in this code with R.
I would recommend using more processing power either on your machine (what OS are you using, I can give you hints on how to dedicate more memory to R) or send it to the cloud (Amazon has this service).

I also recommend you try porting the data processing bit to Python which is a lot friendlier with loops of this sort (I might just make a port of this snippet since it seems to be getting popular).

I know this is not much help but I hope it puts you on the right track at least.

There is small correction in the following piece of code,
# Fill in placeholder with cosine similarities
data.germany.ibs.similarity[i,j] <- getCosine(as.matrix(data.germany.ibs[i]),as.matrix(data.germany.ibs[j]))

I think it should be like this,
data.germany.ibs.similarity[i,j] <- getCosine(as.matrix(data.germany.ibs[,i]),as.matrix(data.germany.ibs[,j]))

This depends on your data set and how it is formatted. The purpose of the post is to demonstrate collaborative filtering in R. If you would like consulting or work done for you on data munging please feel free to email me and I can offer you those facilities at a standard daily rate.
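One way to read this exchange (offered as an assumption, not as the author's stated reasoning): when `data.germany.ibs` is a data frame, as in the post, both indexing forms select a whole column and give the same cosine. They only differ when the object is a plain matrix. A quick check on made-up data:

```r
getCosine <- function(x, y) sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))

# Made-up data frame standing in for data.germany.ibs
data.germany.ibs <- data.frame(abba = c(0, 1, 1, 0), ac.dc = c(1, 0, 1, 1))

# On a data frame, df[i] is a one-column data frame and df[, i] is a vector;
# after as.matrix() both are the same column of numbers
with.single.index <- getCosine(as.matrix(data.germany.ibs[1]), as.matrix(data.germany.ibs[2]))
with.comma.index  <- getCosine(as.matrix(data.germany.ibs[, 1]), as.matrix(data.germany.ibs[, 2]))
print(c(with.single.index, with.comma.index))

# On a plain matrix the two forms are not interchangeable:
# m[1] is a single cell, m[, 1] is the whole first column
m <- as.matrix(data.germany.ibs)
print(length(m[1]))
print(length(m[, 1]))
```

So the comma form is the safer choice if the data has already been converted to a matrix.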

For the part “User Based Recommendations” I ran your code, but R is still running (more than 2 hours) and I am still awaiting the result. Is this normal? Is there a mistake in your code?

I know this is old, but you can always embed one or two print statements in a (non-parallel) for statement to see its progress. I ran this on 1500 products and 13,000 users. It took a couple of days but I could see the progress from the RStudio command line.
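A minimal sketch of that suggestion, assuming you only want an occasional status line rather than one per iteration. The names `n.users` and `report.every` are hypothetical, and the loop body is a placeholder for the real per-user work.

```r
n.users <- 25        # hypothetical number of users to process
report.every <- 10   # hypothetical reporting interval

progress.lines <- character(0)
for (i in 1:n.users) {
  # ... the per-user scoring work would go here ...

  # Report every report.every users, and once more at the very end
  if (i %% report.every == 0 || i == n.users) {
    status <- sprintf("processed %d of %d users", i, n.users)
    progress.lines <- c(progress.lines, status)
    message(status)
  }
}
```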

Hi Salem
I am not good at writing functions. Would you be able to write a function for the one big ugly nested loop? I want to test the code’s performance under the package SNOW for a data set of 30000 observations by 200 variables.
Thanks
Mittal

If I’m dealing with a sparse matrix, might I want to substitute “if(as.integer(data.germany[data.germany$user==user,product]) == 1)” for something more like if(!is.na(data.germany[data.germany$User==user,product]))? As it stands, I’m getting an error (missing value where TRUE/FALSE needed).
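That error appears when the condition inside `if()` evaluates to NA. One possible guard, sketched below under the assumption that 1 means "consumed" and that both 0 and NA mean "not consumed", is to wrap the comparison in `isTRUE()`. The helper name `alreadyConsumed` and the toy data are made up for illustration.

```r
# Made-up data with a missing value for the second user
data.germany <- data.frame(user = c(1, 2, 3), abba = c(1, NA, 0))

# isTRUE() returns FALSE for NA (and for zero-length input),
# so if() never receives a missing value
alreadyConsumed <- function(value) {
  isTRUE(as.integer(value) == 1)
}

flags <- sapply(data.germany$abba, alreadyConsumed)
print(flags)
```

Note that the `!is.na(...)` test proposed in the comment encodes a different rule: it treats every non-missing entry as consumed, which only fits data where "not consumed" is always stored as NA.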

I have a question on user-based filtering. Pictorial depiction on top of your article shows that we are recommending grapes and orange to 3rd boy because this boy is similar to 1st girl because of 2 common fruits among them water-melon and strawberry.

However, in your code and explanation for user-based filtering you are not doing any user-to-user similarity calculation and are only using the cosine similarities calculated for item-based filtering. Please help me in understanding this.

I have read another article which is based on rating scores and it calculates cosine similarities for user while applying user-based filtering.

Thank you for your post. This was really helpful. My question is based on this current methodology, how do you evaluate the recommendations? More specifically, how can one calculate the MAE/RMSE for the recommendations?

Hi Salem,
Many thanks for your explanation. Must appreciate.
Can I ask if this is similar to the recommenderlab package in R?
And how do you suggest validating the results it gives? It recommends, in order, what each user is most likely to want, so there should be a way to validate that. In my case I am using supermarket data to recommend products to customers.
Any input would be helpful.
Thanks,
Prashanth

Hi, how can I evaluate the accuracy of recommendation using MAP(Mean Average Precision)in this dataset ?
If I divide the dataset into test data and training data, I don’t know how to rank the test data.
Do you have any idea?
Thank you

I would like to ask you. If I want to insert this recommender into a web site:
– Where reside the data? I mean, how would be the big picture of the architecture?
– Where is the matrix store?
– Suppose I have data about the users in a relational database, how would it work?
– Do I have to run the code with the online user together with the stored data and then obtain the recommendations?

My questions are maybe too general, but any comment would be highly appreciated.

Thanks for clear explanation. I work in a country wherein there are multiple religions, and there are few singers who sings for multiple religions. I can’t recommend a song on the basis of singer or song only ( chances to recommend a song of other religion can’t be ruled out). Please help me to add a religion column in same example.

Hi, Thanks for the code, I saw the same Raj said. I think you missed the comma when you calculate the cosine in the loop. Am I missing sth? I’m curious now because you haven’t corrected. Maybe I’m wrong. Thanks again.

Hey ….i got the solution for that ….
But My New Question is How you created Matrix ‘lastfm-matrix-germany.csv’
actually how should be the approach for data collection …for explanation take example song recommendation itself ..Can you explain the pre processing step before matrix formation..it will great help /..

your code meticulously explains all details behind filtering. i was trying to do both item based & user based filtering. i have followed your process for item based process & it ran well. but facing a problem for user based case, as in my sparse data, i do not have the user column(as you have in your data set). i have set my sparse data as,

& the sparse_data has no user specific information. i use this same sparse_data for item based filtering, faced no problem. but how can i do user based filtering from this? more specifically, to create the holder entity you have used the code

Thanks a lot for this post. Algorithm and explanation is brilliant! In your method we are making recommendations for all the users. Would there be a way to provide a threshold so that users with only top scores would get recommendations. Would you have an idea about how these thresholds could be set?

I have 30 GB of data. How can I process it with this algorithm? I am using SparkR, but my server does not give any response during read.csv(); I am also trying fread() to read it.
Is there any way to handle the big data problem?
Does SparkR have any limit on the data size it can handle?

I’m using a different database and after performing the similarity analysis, I have to recommend the results (I’m recommending places) to the user in an Android app. Can anyone help me? How should I go about it? How can I link RStudio results to an Android app? I need help urgently. Please
Thanks in advance.

Cool tutorial, thank you! Though it may be useful to offer an alternative solution to the nested for-loop. As we all know, R is quite slow in loops and way faster when using vectorized apply-type functions. Therefore, instead of using a second loop, one can use an “apply” function instead, which speeds up the process tremendously.
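The commenter's own code did not survive in this copy, so the following is only a guess at the kind of change meant: the inner loop over columns is replaced by a single `apply()` call. The item matrix is made up, and no timing claim is made here.

```r
getCosine <- function(x, y) sum(x * y) / (sqrt(sum(x * x)) * sqrt(sum(y * y)))

# Toy item matrix standing in for data.germany.ibs (values are made up)
data.germany.ibs <- data.frame(
  abba  = c(0, 1, 1, 0),
  ac.dc = c(1, 0, 1, 1),
  adele = c(1, 1, 0, 1)
)
m <- as.matrix(data.germany.ibs)

data.germany.ibs.similarity <- matrix(
  NA, nrow = ncol(m), ncol = ncol(m),
  dimnames = list(colnames(m), colnames(m))
)

for (i in 1:ncol(m)) {
  # apply() walks over every column of m, replacing the inner for-loop:
  # it returns the cosine between column i and each column in turn
  data.germany.ibs.similarity[i, ] <- apply(m, 2, function(column) getCosine(m[, i], column))
}

print(data.germany.ibs.similarity)
```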

I guess you were talking about this solution. Well its a good idea but I am already using coSparse function to get cosine similarity for sparse matrix and its quite fast as well. However I am stuck in this sorting step –
topN<- (head(n=11,(data.similarity[order(data.similarity[,product],decreasing=TRUE),][product])))
Is it possible to vectorize this operation as well? Thanks a lot for your help!
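One possible answer, sketched on made-up data: rank every column of the similarity matrix in a single `apply()` call and keep the neighbour names for all products at once, instead of re-sorting inside the user loop. It assumes each item ranks first in its own column (similarity 1), so the first entry is skipped. `n.neighbours` and `top.names` are names introduced here.

```r
# Made-up symmetric similarity matrix standing in for data.similarity
items <- c("abba", "ac.dc", "adele", "air")
data.similarity <- matrix(
  c(1.0, 0.2, 0.5, 0.1,
    0.2, 1.0, 0.3, 0.6,
    0.5, 0.3, 1.0, 0.4,
    0.1, 0.6, 0.4, 1.0),
  nrow = 4, dimnames = list(items, items)
)

n.neighbours <- 2  # the post uses 10

# For every product (column), order the items by similarity and keep
# positions 2 .. n.neighbours + 1, skipping the product itself in position 1
top.names <- apply(data.similarity, 2, function(column) {
  ranked <- rownames(data.similarity)[order(column, decreasing = TRUE)]
  ranked[2:(n.neighbours + 1)]
})

# One column per product, one row per neighbour rank
print(top.names)
```

Inside the scoring loop, `top.names[, product]` would then give the neighbour names directly, and `data.similarity[top.names[, product], product]` the matching similarities.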