4 Introduction - Technologies that can help us sift through all the available information to find what is most valuable to us. Recommender systems apply knowledge-discovery techniques to the problem of making personalized recommendations for information, products, or services, usually during a live interaction.

5 Introduction – cont. Neighbors of x = users who have historically had a similar taste to that of x. The items that the neighbors like make up the recommendation.

6 Introduction – cont. Goals: improve the scalability of collaborative filtering algorithms, and improve the quality of recommendations for the users. The bottleneck is the search for neighbors; it can be avoided by first exploring the relatively static relationships between items rather than between users.

7 Introduction – cont. The problem: predict the opinion a user will have of the different items, and be able to recommend the "best" items to each user.

8 The Collaborative Filtering Process - Trying to predict the opinion the user will have of the different items and to recommend the "best" items to each user, based on the user's previous likings and the opinions of other like-minded users.

9 The CF Process – cont. A list of m users and a list of n items. Each user has a list of items he/she has expressed an opinion about (this can be a null set). Explicit opinion: a rating score on a numerical scale. Implicit opinion: purchase records. The active user is the one for whom the CF task is performed.

10 The CF Process – cont. The task of a CF algorithm is to find item likeliness in two forms: Prediction - a numerical value expressing the predicted likeliness of an item the user hasn't expressed an opinion about. Recommendation - a list of N items the active user will like the most (Top-N recommendation).

14 Challenges of User-based CF Algorithms - Sparsity: even over large item sets, a user's purchases typically cover under 1% of the items. This makes it difficult to form predictions with nearest-neighbor algorithms, so the accuracy of recommendations may be poor.

15 Challenges of User-based CF Algorithms – cont. Scalability: nearest-neighbor search requires computation that grows with both the number of users and the number of items. Semi-intelligent filtering agents that rely on syntactic features capture poor relationships among like-minded but sparse-rating users. One solution: use LSI to capture the similarity between users and items in a reduced-dimensional space. Another: analyze the user-item matrix directly, on the assumption that a user will be interested in items similar to the items he liked earlier; this does not require identifying a neighborhood of similar users.

16 Item-Based CF Algorithm - Looks at the set of items the target user has rated, computes how similar they are to the target item, and then selects the k most similar items. The prediction is computed by taking a weighted average of the target user's ratings on those most similar items.
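The neighbor-selection step can be sketched as follows, a minimal illustration with made-up item names and similarity values (the similarity computation itself is covered on the next slides):

```python
def k_most_similar(rated_items, sims_to_target, k):
    """Of the items the user has rated, return the k most similar to the target item.

    rated_items: items the target user has rated.
    sims_to_target: dict mapping item -> similarity to the target item.
    """
    candidates = [(j, sims_to_target[j]) for j in rated_items if j in sims_to_target]
    candidates.sort(key=lambda p: p[1], reverse=True)
    return candidates[:k]

# Hypothetical example: user rated A, B, C; D is similar but unrated, so it
# cannot serve as a neighbor for prediction.
neighbors = k_most_similar(["A", "B", "C"],
                           {"A": 0.9, "B": 0.2, "C": 0.6, "D": 0.8}, k=2)
# neighbors == [("A", 0.9), ("C", 0.6)]
```

Note that only items the user has actually rated can contribute to the weighted average, which is why the unrated item D is filtered out despite its high similarity.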

17 Item Similarity Computation - The similarity between items i and j is computed by isolating the users who have rated them and then applying a similarity computation technique. Cosine-based similarity: items are treated as vectors in the m-dimensional user space (differences in rating scale between users are not taken into account).
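A minimal sketch of the cosine measure, representing each item as a sparse vector in user space (a dict mapping user to rating, with missing ratings treated as zero):

```python
import math

def cosine_sim(i_vec, j_vec):
    """Cosine of the angle between two item vectors in the user space.

    i_vec, j_vec: dicts mapping user -> rating; absent users count as 0.
    """
    dot = sum(r * j_vec.get(u, 0) for u, r in i_vec.items())
    norm_i = math.sqrt(sum(r * r for r in i_vec.values()))
    norm_j = math.sqrt(sum(r * r for r in j_vec.values()))
    return dot / (norm_i * norm_j)

# Proportional rating vectors point in the same direction -> similarity 1.0,
# even though u1 rates on a higher scale than u2 (the slide's caveat).
s = cosine_sim({"u1": 4, "u2": 2}, {"u1": 2, "u2": 1})
# s == 1.0
```

The example also shows the caveat from the slide: because raw ratings are used, two users with different rating scales still contribute as if they were directly comparable.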

18 Item Similarity Computation – cont. Correlation-based similarity: using the Pearson-r correlation, computed only over the cases where the users rated both item i and item j. Notation: R(u,i) = rating of user u on item i; R(i) = average rating of the i-th item.
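A sketch of the correlation measure under the definitions above; here each item's average R(i) is taken over all of its ratings, while the sums run only over the co-rating users:

```python
import math

def pearson_sim(i_vec, j_vec):
    """Pearson-r correlation between items i and j over their co-rating users.

    i_vec, j_vec: dicts mapping user -> rating.
    """
    common = set(i_vec) & set(j_vec)              # users who rated both items
    r_i = sum(i_vec.values()) / len(i_vec)        # R(i): average rating of item i
    r_j = sum(j_vec.values()) / len(j_vec)        # R(j): average rating of item j
    num = sum((i_vec[u] - r_i) * (j_vec[u] - r_j) for u in common)
    den_i = math.sqrt(sum((i_vec[u] - r_i) ** 2 for u in common))
    den_j = math.sqrt(sum((j_vec[u] - r_j) ** 2 for u in common))
    return num / (den_i * den_j)

# Item j's ratings rise and fall exactly with item i's -> correlation 1.0.
s = pearson_sim({"u1": 1, "u2": 2, "u3": 3}, {"u1": 2, "u2": 4, "u3": 6})
# s == 1.0
```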

19 Item Similarity Computation – cont. Adjusted cosine similarity - each pair in the co-rated set corresponds to a different user, and each user's average rating is subtracted from his/her ratings (this takes care of the differences in rating scale). Notation: R(u,i) = rating of user u on item i; R(u) = average rating of the u-th user.
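A sketch of the adjusted cosine measure: the same shape as Pearson correlation, but the deviation is from each user's own average R(u) rather than from the item average:

```python
import math

def adjusted_cosine_sim(i_vec, j_vec, user_avg):
    """Adjusted cosine similarity between items i and j.

    i_vec, j_vec: dicts mapping user -> rating.
    user_avg: dict mapping user -> that user's average rating R(u).
    """
    common = set(i_vec) & set(j_vec)              # users who rated both items
    num = sum((i_vec[u] - user_avg[u]) * (j_vec[u] - user_avg[u]) for u in common)
    den_i = math.sqrt(sum((i_vec[u] - user_avg[u]) ** 2 for u in common))
    den_j = math.sqrt(sum((j_vec[u] - user_avg[u]) ** 2 for u in common))
    return num / (den_i * den_j)

# Both items sit 2 points above u1's average and 2 below u2's -> similarity 1.0.
s = adjusted_cosine_sim({"u1": 5, "u2": 1}, {"u1": 5, "u2": 1},
                        {"u1": 3.0, "u2": 3.0})
# s == 1.0
```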

21 Prediction Computation - Generating the prediction: look at the target user's ratings on the similar items and combine them. Weighted sum: captures how the active user rates the similar items, weighting each rating by the item-item similarity.
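A sketch of the weighted-sum prediction, dividing by the sum of absolute similarities so the result stays on the rating scale:

```python
def weighted_sum(user_ratings, neighbor_sims):
    """Predict the active user's rating of the target item.

    P(u, i) = sum_j s(i, j) * R(u, j) / sum_j |s(i, j)|,
    summed over the similar items j the user has rated.

    user_ratings: dict item -> the active user's rating R(u, j).
    neighbor_sims: dict item -> similarity s(i, j) to the target item.
    """
    num = sum(s * user_ratings[j] for j, s in neighbor_sims.items())
    den = sum(abs(s) for s in neighbor_sims.values())
    return num / den

# Hypothetical neighbors A and C: (0.9*5 + 0.6*4) / (0.9 + 0.6) = 4.6
p = weighted_sum({"A": 5, "C": 4}, {"A": 0.9, "C": 0.6})
# p == 4.6
```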

22 Prediction Computation – cont. Regression: an approximation of the ratings based on a linear regression model, instead of using the ratings of similar items directly. This helps when similarity measures rank two rating vectors as close even though they are distant in Euclidean distance. R'(N) = ratings of the similar item N approximated by the regression model; α, β = regression model parameters; ε = the error term.
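A sketch of the regression idea: fit a least-squares line relating the similar item's ratings to the target item's ratings over their co-rating users, then use the fitted line to produce the approximated rating R'(N) = α·R + β that replaces the raw rating in the weighted sum. (The exact choice of regressor variable here is a simplifying assumption.)

```python
def fit_line(x, y):
    """Least-squares fit y ≈ alpha * x + beta.

    x: target item's ratings over the co-rating users.
    y: similar item N's ratings over the same users.
    Returns the regression parameters (alpha, beta).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    alpha = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    beta = my - alpha * mx
    return alpha, beta

# Hypothetical co-ratings: N's ratings are 2*target + 1, so the fit recovers
# alpha = 2, beta = 1; the approximated rating is then alpha * r + beta.
alpha, beta = fit_line([1, 2, 3], [3, 5, 7])
r_approx = alpha * 2 + beta   # approximated R'(N) when the target rating is 2
```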

26 Experiments: The Data Set - MovieLens, a web-based movie recommender system with 43,000 users and over 3,500 movies. 100,000 ratings were used from the DB (only users who rated 20 or more movies). 80% of the data formed the training set and 20% the test set. The data is in the form of a user-item matrix: one row per user, 1,682 columns (items/movies, each rated by at least one of the users).

27 Experiments: The Data Set – cont. Sparsity level of the data set: 1 - (nonzero entries / total entries), computed for the movie data set.
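The sparsity computation as a sketch; the 943-user row count is an assumption (the slide states only the 100,000 ratings and the 1,682 columns):

```python
# Sparsity level = 1 - (nonzero entries / total entries).
nonzero_entries = 100_000        # ratings present in the matrix (from the slide)
n_users = 943                    # assumed row count; not given on the slide
n_items = 1_682                  # column count (from the slide)

total_entries = n_users * n_items
sparsity = 1 - nonzero_entries / total_entries
# Under these assumptions the matrix is roughly 94% empty.
```

The high sparsity is exactly the challenge motivating item-based CF: most user-item pairs carry no rating at all.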

28 Evaluation Metrics - Measures for evaluating the quality of a recommender system: Statistical accuracy metrics - compare numerical recommendation scores against the actual user ratings for the user-item pairs in the test data set. Decision-support accuracy metrics - how effective a prediction engine is at helping a user select high-quality items from the set of all items.

29 Evaluation Metrics – cont. MAE (Mean Absolute Error): the average deviation of recommendations from their true user-specified values. The lower the MAE, the more accurately the recommendation engine predicts user ratings. MAE is the most commonly used metric and the easiest to interpret.
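The metric itself is a one-liner; a minimal sketch with made-up predictions:

```python
def mae(predicted, actual):
    """Mean Absolute Error: average of |p_i - q_i| over all test rating pairs."""
    return sum(abs(p, ) if False else abs(p - q) for p, q in zip(predicted, actual)) / len(actual)
```

Wait - that conditional is a typo; the clean version is:

```python
def mae(predicted, actual):
    """Mean Absolute Error: average of |p_i - q_i| over all test rating pairs."""
    return sum(abs(p - q) for p, q in zip(predicted, actual)) / len(actual)

# Two predictions, one off by 1 and one exact: MAE = (1 + 0) / 2 = 0.5
err = mae([4.0, 3.0], [5.0, 3.0])
# err == 0.5
```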

30 Experimental Procedure - Experimental steps: divide the data into a train and a test portion. Assess the quality of recommendations by determining the sensitivity to the neighborhood size, the train/test ratio, and the effect of different similarity measures. Use only the training data, subdividing it further into a train and a test portion. 10-fold cross-validation: randomly choose different train and test sets and take the average of the MAE values.
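The cross-validation step can be sketched as below; the predictor is passed in as a function so any of the CF schemes above could be plugged in (the fixed seed and the fold-striding split are illustrative choices, not the paper's exact procedure):

```python
import random

def cross_validated_mae(ratings, train_and_predict, folds=10):
    """Average MAE over k folds.

    ratings: list of (example, true_rating) pairs.
    train_and_predict: function (train_set, example) -> predicted rating.
    """
    data = ratings[:]
    random.Random(0).shuffle(data)            # fixed seed for reproducibility
    fold_maes = []
    for f in range(folds):
        test = data[f::folds]                 # every folds-th example is held out
        train = [r for k, r in enumerate(data) if k % folds != f]
        errs = [abs(train_and_predict(train, x) - y) for x, y in test]
        fold_maes.append(sum(errs) / len(errs))
    return sum(fold_maes) / folds

# Sanity check with a constant predictor on constant data: MAE must be 0.
avg = cross_validated_mae([(0, 3.0)] * 20, lambda train, x: 3.0)
# avg == 0.0
```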

32 Experimental Results - Effect of the similarity algorithms: for each similarity algorithm, the neighborhood was computed and the weighted-sum algorithm was used to generate the prediction. Experiments were conducted on the training data; the test set was used to compute MAE.

42 Scalability Challenges - Sensitivity to the model size: the impact of the number of items on the quality of the prediction. A model size of l means we keep only the l best similarity values at model-building time, and later use k < l of those values to generate the prediction. The number of items used for the similarity computation was varied from 25 to 200.
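The model-size truncation can be sketched per item: at model-building time only the l strongest similarities are retained, so the prediction step later chooses its k neighbors from this reduced table rather than from all n items.

```python
def truncate_model(sim_row, l):
    """Model of size l: keep only the l largest similarity values for one item.

    sim_row: dict mapping other-item -> similarity to this item.
    """
    top = sorted(sim_row.items(), key=lambda p: p[1], reverse=True)[:l]
    return dict(top)

# Hypothetical similarity row for one item, truncated to model size l = 3;
# the weakest entry (B, 0.2) is dropped from the precomputed model.
model_row = truncate_model({"A": 0.9, "B": 0.2, "C": 0.6, "D": 0.8}, l=3)
# model_row == {"A": 0.9, "D": 0.8, "C": 0.6}
```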

43 Scalability Challenges – cont. Precompute the item similarities for different model sizes, using the weighted-sum prediction. MAE is computed from the test data. The process is repeated for 3 different train/test ratios.

45 Scalability Challenges – cont. MAE values improve as we increase the model size, but the improvement gradually slows down. For a train/test ratio of 0.8, we come within 96% of the full item-item scheme's accuracy using only 1.9%-3% of the items. High accuracy can be achieved using just a fraction of the items, so precomputing the item similarities is useful.