domingo, 3 de setembro de 2017

At Globo.com we try to optimize our personalized recommendation providing better content in a ranked list. We have a hybrid system that can automatically essemble collaborative filtering, content based and non personalized algorithms.

But none of our implemented algorithms are intented to rank. Our best collaborative filtering algorithm, based on the paper Collaborative Filtering for Implicit Feedback Datasets, doesn't optimizes a ranking metric like MRR or NDCG.

Even so we have good results of those algorithms choosing the best hyper parameters focusing in MRR, we thought we could get better results using algorithms focusing in optimizing a ranking metric: Our first hypothesis is xCLiMF.

xCLiMF is a latent factor collaborative filtering variant wich optimizes a lower bound of the smoothed reciprocal rank of "relevant" items in ranked recommendation list. It is a evolution o CLiMF algorithm that can handle using multiple levels of relevance data instead of using only unary user preferences.

My implementation

So my implementation also have some differences from original papers:

In the paper they don't mention any kind of previously normalization, but it's always a good practice to do it for machine learning algorithms input data. So I normalized dividing all rating by the maximum rating in the dataset. Using original ratings lead to some math errors:

Exponential function of large values causes math range error;

Exponential function of very large negative numbers get 0, and lead to some division by zero error;

The paper experimentation setup uses N random ratings by user in training and in the cross validation data. Even so I have implemented this way, I have put an option to select 2N top ratings by user, randomly distributed in training and cross validation. For me, this last protocol reflected a more realistic kind of problem that xCLiMF is going to solve.

In the original paper they described the algorithm as two chained loops. Firstly I implemented as described in the paper. But them I updated the code to use a more vectorized solution, having a 54% improvement in the performance. I maintained the chained loop solution for comparison purpose ( https://github.com/timotta/xclimf/blob/master/tests/looping_update.py )

For the objective function, the paper describes that it's not necessary to divide by the number of users, because it is only used for validating if the gradient ascending is still increasing after many interactions. But I made this division for interpretability purposes.

Results

Using the experimentation protocol of N random ratings by user in training and in the cross validation, applied to the MovieLens 1M dataset, I got MRR variying from 0.08 to 0.09 at the 50 iteration using the hyper-parameters described in the paper.

Using the top K experimentation protocol in the same dataset I got MRR varying from 0.2 to 0.26. I tried those two experimentation protocol with the Alternative Least Squares implemented at Spark and the maximum MRR I got was 0.01.

Conclusions and future

Now we have a working xCLiMF implementation in python, but it is not feasible to train it with Globo.com dataset. Running it using MovieLens 1M dataset that have 6k users took 4 minutes. With MovieLens 20M dataset that have 130k users took more than one hour. Globo.com datasets are much bigger than that, if we count only the logged users, our dataset of users preferences on shows has more than 4 millions users, it would take more than 40 hours to run.

That's why my next step is implementing this same algorithm on Spark. Since the updating user by user step is independent, it is possible to scale it running in parallel on our hadoop cluster.