Criteo’s take on RecSys 2017: 10 papers you should not miss

At Criteo, we closely watch and contribute to the state of the art in recommender systems, a field in which interest is growing rapidly. This year, the RecSys conference brought together over 600 participants from industry and academia to discuss the challenges of building recommender systems.

Criteo was also a proud sponsor of RecSys 2017 (for the third year in a row). Being a booth-level sponsor meant we had a dedicated Criteo stand filled with fun swag and staffed by a number of our engineers, researchers, and recruiters. This allowed us to have a tête-à-tête with attendees who visited our booth and to drive fruitful conversations about some of the coolest technologies and latest methods around.

This year, our events team decided to take us on an adventure on Lake Como. The Criteo RecSys cruise gathered close to one hundred familiar and new faces from the conference to discover some of the opulent villas along the lake. The cruise lasted over two hours, with good Italian food and endless conversation among peers.

In this post, we would like to highlight research work in areas that we are particularly interested in.

Session-based recommendation

Session-based recommendation is a setup that is gaining more and more attention from the community. Contrary to the classical setup, where user-item pairs are considered independent observations, session-based recommendation represents a more realistic setting in which the current user-item interaction depends on the sequence of previous ones.

Three papers in particular caught our attention, each using a variation of a neural-network-based architecture.

In Personalizing Session-based Recommendations with Hierarchical Recurrent Neural Networks by M. Quadrana et al., the authors extend their previous work on RNNs for session-based recommendation to Hierarchical RNNs (HRNNs), in order to leverage long-term user behaviour for their next-event prediction task. The proposed architecture is based on two levels of connected RNNs: a session-level RNN (denoted GRU-ses) passes its hidden vector as input to a higher-level RNN (denoted GRU-usr) that predicts the user's start state in the next session. The user-level representation is fixed throughout the session and is updated only when the session ends. Thanks to GRU-usr, the session-level recommendations change with respect to the baseline case, where the recommender system only has access to the current user session. The authors experiment with two versions of HRNN, namely:

HRNN-init, where each session is initialized with the user vector predicted by GRU-usr.

HRNN-all, where the user vector is passed to each state in the session.

In their experiments they find that HRNN-init outperforms HRNN-all by a large margin, leading them to conclude that session dynamics are more important than long-term user behavior for predicting the next item event. Finally, they compare their architecture against previous RNN-based and item-based predictors, and find that their new method achieves state-of-the-art results.
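To make the two-level architecture concrete, here is a minimal numpy sketch of the HRNN-init variant; the dimensions, random weights, and toy sessions are our own illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, W, U, b):
    """One GRU step; W, U, b each stack update/reset/candidate parameters."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])        # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])        # reset gate
    n = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1 - z) * h + z * n

def init_params(d_in, d_h, rng):
    W = rng.normal(scale=0.1, size=(3, d_in, d_h))
    U = rng.normal(scale=0.1, size=(3, d_h, d_h))
    b = np.zeros((3, d_h))
    return W, U, b

rng = np.random.default_rng(0)
d_item, d_hidden = 8, 16

ses_params = init_params(d_item, d_hidden, rng)    # GRU-ses (session level)
usr_params = init_params(d_hidden, d_hidden, rng)  # GRU-usr (user level)

h_usr = np.zeros(d_hidden)
for session in [rng.normal(size=(5, d_item)), rng.normal(size=(3, d_item))]:
    h_ses = h_usr.copy()               # HRNN-init: user state seeds the session
    for x in session:                  # unroll GRU-ses over session events
        h_ses = gru_cell(x, h_ses, *ses_params)
    # user state is updated only at session end, from the session summary
    h_usr = gru_cell(h_ses, h_usr, *usr_params)
```

HRNN-all would additionally feed `h_usr` into every step of the session-level loop, rather than only seeding its initial state.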

Sequential User-based Recurrent Neural Network Recommendations by T. Donkers et al. starts from the observation that the original RNN architecture compresses the user sequence using only the combination of historical items and their sequential ordering, without explicitly taking the user-specific bias into account. To this end, they introduce a new type of Gated Recurrent Unit that is parametrized by a user embedding vector, and experiment with three variants, namely linear, rectified, and attentional. As expected, the attention-based model outperforms the vanilla RNN and the other two types of personalized RNNs. Though interesting for its performance improvements, the proposed architecture has the shortcoming of needing one vector per user, which for some large-scale use cases such as ours – where we are faced with more than 500 million users every day – leads to enormous memory requirements. It will be interesting to see if the authors follow up with a more compact solution for personalization.
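A minimal sketch of the linear variant of such a user-parametrized GRU might look as follows; all names, shapes, and weights are our own illustrative choices, the key point being the extra per-user term in every gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def user_gru_cell(x, h, u, p):
    """GRU step whose gates also receive a per-user embedding u (linear variant)."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + u @ p["Vz"])   # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + u @ p["Vr"])   # reset gate
    n = np.tanh(x @ p["Wn"] + (r * h) @ p["Un"] + u @ p["Vn"])
    return (1 - z) * h + z * n

rng = np.random.default_rng(1)
d_x, d_h, d_u = 8, 16, 4
p = {k: rng.normal(scale=0.1, size=s) for k, s in {
    "Wz": (d_x, d_h), "Uz": (d_h, d_h), "Vz": (d_u, d_h),
    "Wr": (d_x, d_h), "Ur": (d_h, d_h), "Vr": (d_u, d_h),
    "Wn": (d_x, d_h), "Un": (d_h, d_h), "Vn": (d_u, d_h)}.items()}

u = rng.normal(size=d_u)   # one embedding per user: the memory cost noted above
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):   # a toy sequence of item vectors
    h = user_gru_cell(x, h, u, p)
```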

Running a bit against the recent trend of RNN-based sequence models, 3D Convolutional Networks for Session-based Recommendation with Content Features by T. Xuan et al. proposes an architecture built on 3D convolutions, which capture structure not only across local features but also across time, and are thus able to capture sequence information. Apart from introducing 3D convolutions, another notable innovation of the architecture is that it treats item IDs as words and compresses their representation using a character-level encoding: for example, an item with ID '23' is approximated as a combination of the input embeddings for the digits '2' and '3'. The authors claim that their method outperforms previous RNN-based methods on most benchmarked datasets. This work opens up the field of sequence modeling for recommendation to convolutional models, which have the potential to be faster to train and easier to understand.
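The character-level ID encoding can be sketched in a few lines; the embedding dimension and the mean-pooling of digit embeddings are our assumptions for illustration:

```python
import numpy as np

# Each item ID is treated as a string of digit "characters"; its input vector
# is a combination (here, the mean) of the per-digit embeddings, so the
# embedding table has 10 rows instead of one per catalog item.
rng = np.random.default_rng(2)
d = 8
digit_emb = {c: rng.normal(size=d) for c in "0123456789"}

def encode_item(item_id: str) -> np.ndarray:
    return np.mean([digit_emb[c] for c in item_id], axis=0)

v23 = encode_item("23")   # combination of the embeddings for '2' and '3'
```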

Representation learning

Representation learning is everywhere, in topics ranging from using item content to mitigating the cold-start problem. Some interesting papers leverage text reviews to make better rating predictions, such as TransNets: Learning to Transform for Recommendation by R. Catherine et al., where the main idea is to use the review as a level of “indirection”: first predict a representation of the review that a user will give to an item, and then predict the rating from it.

Other papers approach the problem of jointly learning representations of users and items in a novel way, such as Translation-based Recommendation by R. He et al., the runner-up for the best paper award. In the context of next-item recommendation, the authors propose a simple framework where items are embedded in a “transition” space and users are represented as translation vectors: the next item recommended to a user is then just a translation away from the previous item.

We also found Expediting Exploration by Attribute-to-Feature Mapping for Cold-Start Recommendations by D. Cohen et al. quite interesting, as it tackles the cold-start problem inherent to all interaction-based recommender systems. The paper builds on previous work to propose a way to map an item's latent features – learned by matrix factorization, for example – to its attributes – movie genres, … – so that for new items we have an initial latent vector that can then be used for random exploration.
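The attribute-to-feature mapping idea can be sketched as a simple least-squares regression from item attributes to MF latent vectors; the synthetic data and dimensions below are our own placeholders:

```python
import numpy as np

# Learn a linear map from item attributes (e.g. genre flags) to the latent
# vectors produced by matrix factorization, then use it to seed the latent
# vector of a brand-new item that has attributes but no interactions yet.
rng = np.random.default_rng(7)
n_items, n_attrs, d = 500, 12, 8
attrs = (rng.random((n_items, n_attrs)) < 0.3).astype(float)   # genre flags
# stand-in for MF latent vectors, correlated with attributes plus noise
latent = attrs @ rng.normal(size=(n_attrs, d)) + 0.1 * rng.normal(size=(n_items, d))

M, *_ = np.linalg.lstsq(attrs, latent, rcond=None)   # attribute -> latent map

new_item_attrs = (rng.random(n_attrs) < 0.3).astype(float)
init_latent = new_item_attrs @ M                     # cold-start initialization
```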

Folding: Why Good Models Sometimes Make Spurious Recommendations by D. Xin et al. addresses the issue of spurious recommendations in representation-learning-based systems, showing that optimizing for goodness metrics can actually lead to more undesired recommendations: most metrics estimate how good a recommender system is while ignoring the impact of spurious recommendations, even though these have a big impact on the perceived quality.

The folding phenomenon is introduced in the context of matrix factorization. The motivation is that if there is no relation between two groups of users and items – the two groups being completely independent – then learning the vector representations of both with one model is almost the same as using two separate models. The representation spaces of the two models are then folded one onto the other, and we end up with spurious similarities that lead to bad recommendations. The authors propose a metric to quantify the severity of this effect and study its correlation with goodness metrics.

Scaling recommender systems

Scalability is an essential topic at Criteo: our recommender system must be designed to handle more than 1.2 billion users, 6 billion products, and 5 billion queries per day! With more and more attendees and speakers from industry, it has also become a core topic of RecSys, with two industry sessions and a one-day workshop on large-scale systems.

With the recent progress of deep learning in the recommendation domain, we were pleased to see new ideas to make DNN models scale. In that regard, we really liked the paper Getting Deep Recommenders Fit: Bloom Embeddings for Sparse Binary Input/Output Networks by J. Serrà et al. The authors noticed that in the popular gated RNN architecture, when using 330k products, the input and output layers represent 99.9% of the model's parameters. The paper shows how to compute embeddings of the original products that are compact, fast to compute – constant time – and can be directly plugged into the input and output of a DNN model without changing its architecture. The main idea is to generalize the well-known hashing trick with k independent hash functions, projecting the original products into an embedding space of lower dimension. To reverse the output, the authors propose to use a property of Bloom filters to compute the likelihood of each original product. They also show, on several datasets and models, that this process doesn't decrease the accuracy of the system as long as the embedding size is not too small – up to 5x smaller than the original dimension.
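A minimal sketch of the Bloom-embedding encoding, assuming MD5-based hash functions and illustrative sizes (the paper's exact hashing scheme may differ):

```python
import hashlib
import numpy as np

# Each product ID is hashed by k independent hash functions into an
# m-dimensional binary vector (m << catalog size); the DNN consumes this
# vector in place of a full one-hot encoding.
m, k = 1000, 4   # hypothetical compressed dimension and hash count

def bloom_encode(product_id: str) -> np.ndarray:
    v = np.zeros(m)
    for seed in range(k):   # seed makes the k hash functions independent
        h = hashlib.md5(f"{seed}:{product_id}".encode()).hexdigest()
        v[int(h, 16) % m] = 1.0
    return v

def bloom_score(output_probs: np.ndarray, product_id: str) -> float:
    """Likelihood proxy: min of the k output units the product hashes to."""
    idx = [int(hashlib.md5(f"{s}:{product_id}".encode()).hexdigest(), 16) % m
           for s in range(k)]
    return float(np.min(output_probs[idx]))

x = bloom_encode("product_42")
```

Decoding exploits the Bloom-filter property that a product is plausible only if all k of its output units are active; taking the minimum of those activations is one simple way to turn that into a score.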

We also liked the idea presented in Latent Representations for Efficient Ranking: Empirical Assessment by M. Kula, which proposes to speed up a recommender system by using binary representations of users and items instead of the classic floating-point latent vectors. Binary representations are smaller – 5% of the original size – and faster to compute, and can be learned using a trick introduced in “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks” by M. Rastegari et al. Interestingly, the author reports that this method doesn't beat a simpler baseline – decreasing the dimension of the floating-point embeddings. Such a negative result is quite uncommon, but still a useful conclusion.
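The XNOR-style binarization of latent vectors can be sketched as follows: keep only the sign pattern plus one scaling factor per vector, then rank with the approximate dot product (random vectors as placeholders for learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(5)
d = 32
user = rng.normal(size=d)
items = rng.normal(size=(100, d))

def binarize(v):
    # XNOR-Net binarization: {-1,+1} pattern plus a per-vector scaling factor
    return np.sign(v), np.abs(v).mean()

bu, au = binarize(user)
# approximate score: scales times the dot product of the sign patterns
approx = np.array([au * ai * (bu @ bi) for bi, ai in map(binarize, items)])
exact = items @ user   # the floating-point scores it approximates
```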

Integrating psychological knowledge into recommender systems

Some research directions are focusing on explaining model behavior using a psychological approach.

This year's best paper award went to Modeling the Assimilation-Contrast Effects in Online Product Rating Systems: Debiasing and Recommendations by X. Zhang et al. This decision was motivated by the fact that the work is cross-domain and integrates psychological theories to explain user behavior. The authors study the effect of the prior expectation, i.e. the rating that other users gave to the product at purchase time, on the buyer's final rating, compared to the true rating, i.e. the rating of the product after a sufficiently large number of ratings have been given. Indeed, it has been observed that the prior expectation has an effect on the rating given by the buyer. For example, if the prior rating is a 4 and the true rating is a 3, a buyer will tend to give a rating of 2, as his expectations were not fulfilled: this is the contrast effect. What was left unexplained is the fact that this effect tends to disappear when the prior rating becomes extreme: for example, a prior expectation of 5 will have no impact on the buyer's final rating. The paper demonstrates how a mix of assimilation (the tendency to agree with the prior expectation) and contrast (the tendency to disagree with other people) can explain this peculiar effect, and the proposed hypothesis matches the bias estimated by the model in the ratings.

User Preferences Analysis using Visual Stimuli by P. Gaspar also proposes to take a psychological prior into account, in the form of color preferences. This work integrates features extracted from images into collaborative filtering to enhance the results. Both low-level and high-level features are extracted from the images: the low-level features are the mean of the image values and of the diagonals, the standard deviation of all values, and Pleasure-Arousal-Dominance features computed from colors; the high-level features are embeddings obtained from the Inception-v3 network. The baseline considered is a content-based recommender system on the task of recommending movies: movies are embedded based on their synopsis and category, and a k-NN is used to find the movies closest to the user's history, keeping the N closest. To take the image features into account, the baseline is still used to select candidates, but another k-NN is then run in image space to re-rank the movies. Results show a small increase in metrics – precision and nDCG@10.
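The two-stage pipeline can be sketched as follows, with random embeddings standing in for the synopsis and image features:

```python
import numpy as np

rng = np.random.default_rng(6)
n, d_txt, d_img, N = 200, 16, 8, 20
text_emb = rng.normal(size=(n, d_txt))   # synopsis/category embeddings
img_emb = rng.normal(size=(n, d_img))    # image-feature embeddings

def knn(query, emb, ids, k):
    """Return the k ids whose embeddings are closest to the query."""
    dists = np.linalg.norm(emb[ids] - query, axis=1)
    return [ids[i] for i in np.argsort(dists)[:k]]

user_txt = rng.normal(size=d_txt)   # user profile in synopsis space
user_img = rng.normal(size=d_img)   # user profile in image space

# Stage 1: content-based baseline selects N candidate movies.
candidates = knn(user_txt, text_emb, list(range(n)), N)
# Stage 2: a second k-NN in image space re-ranks those candidates.
reranked = knn(user_img, img_emb, candidates, 10)
```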

Conclusion

We really enjoyed our time at RecSys 2017, the synergy between academia and industry, the quality of the research work presented, and the attendees from different backgrounds and interests.

We could see some major research directions emerging, like using representation learning to model content-based features as well as interaction-based ones, or putting the emphasis on the sequential nature of item consumption and thus of recommendation. It was interesting to see the links with our own research efforts: the two Criteo papers at the conference were about exactly these subjects. However, we still think there is work to be done to make evaluation methods more uniform: it is hard to compare methods and techniques that use different datasets, even for the same task, with different evaluation methodologies. But we are confident that the field is becoming more and more mature and that this area will continue to improve.