Collaborative filtering with embeddings

Embeddings are not just for use in natural language processing. Here we apply embeddings to a common task in collaborative filtering - predicting user ratings - and on our way, strive for a better understanding of what an embedding layer really does.

What is so attractive about this concept? Embeddings incorporate the concept of distributed representations, an encoding of information not at specialized locations (dedicated neurons, say), but as a pattern of activations spread out over a network. No better source to cite than Geoffrey Hinton, who played an important role in the development of the concept (Rumelhart, McClelland, and PDP Research Group 1986):

“Distributed representation means a many to many relationship between two types of representation (such as concepts and neurons). Each concept is represented by many neurons. Each neuron participates in the representation of many concepts.”

The advantages are manifold. Perhaps the most famous effect of using embeddings is that we can learn and make use of semantic similarity.

Let’s take a task like sentiment analysis. Initially, what we feed the network are sequences of words, essentially encoded as factors. In this setup, all words are equidistant: Orange is as different from kiwi as it is from thunderstorm. An ensuing embedding layer then maps these representations to dense vectors of floating point numbers, which can be checked for mutual similarity via various similarity measures such as cosine similarity.
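To make the contrast concrete, here is a minimal base R sketch. The embedding values are made up for illustration (they are not learned by any model), and `cosine_similarity` is a helper defined here, not a Keras function.

```r
# Cosine similarity between two numeric vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

words <- c("orange", "kiwi", "thunderstorm")

# One-hot encoding: every word sits on its own axis, so all pairs look alike.
one_hot <- diag(length(words))
rownames(one_hot) <- words

cosine_similarity(one_hot["orange", ], one_hot["kiwi", ])          # 0
cosine_similarity(one_hot["orange", ], one_hot["thunderstorm", ])  # 0

# Hypothetical 3-dimensional embeddings (invented numbers).
emb <- rbind(
  orange       = c(0.9, 0.8, 0.1),
  kiwi         = c(0.8, 0.9, 0.2),
  thunderstorm = c(0.1, 0.2, 0.9)
)

sim_fruit <- cosine_similarity(emb["orange", ], emb["kiwi", ])
sim_storm <- cosine_similarity(emb["orange", ], emb["thunderstorm", ])

# With dense vectors, orange can be closer to kiwi than to thunderstorm.
sim_fruit > sim_storm  # TRUE
```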

We hope that when we feed these “meaningful” vectors to the next layer(s), better classification will result. In addition, we may be interested in exploring that semantic space for its own sake, or use it in multi-modal transfer learning (Frome et al. 2013).

In this post, we’d like to do two things: First, we want to show an interesting application of embeddings beyond natural language processing, namely, their use in collaborative filtering. In this, we follow ideas developed in lesson5-movielens.ipynb which is part of fast.ai’s Deep Learning for Coders class. Second, to gather more intuition, we’d like to take a look “under the hood” at how a simple embedding layer can be implemented.

So first, let’s jump into collaborative filtering. Just like the notebook that inspired us, we’ll predict movie ratings. We will use the 2016 ml-latest-small dataset from MovieLens that contains ~100000 ratings of ~9900 movies, rated by ~700 users.

Embeddings for collaborative filtering

In collaborative filtering, we try to generate recommendations based not on elaborate knowledge about our users and not on detailed profiles of our products, but on how users and products go together. Is product \(\mathbf{p}\) a fit for user \(\mathbf{u}\)? If so, we’ll recommend it.

Often, this is done via matrix factorization. See, for example, this nice article by the winners of the 2009 Netflix prize, introducing the why and how of matrix factorization techniques as used in collaborative filtering.

The diagram takes its example from the context of text analysis, assuming a co-occurrence matrix of hashtags and users (\(\mathbf{A}\)). As stated above, we’ll instead work with a dataset of movie ratings.

Were we doing matrix factorization, we would need to somehow address the fact that not every user has rated every movie. As we’ll be using embeddings instead, we won’t have that problem. For the sake of argument, though, let’s assume for a moment the ratings were a matrix, not a dataframe in tidy format.

In that case, \(\mathbf{A}\) would store the ratings, with each row containing the ratings one user gave to all movies.
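As a small illustration of that matrix view, the following base R sketch turns a tidy ratings table into a user-by-movie matrix. The five ratings are invented, and the column names `userId`, `movieId` and `rating` are assumptions about the layout of the data.

```r
# Hypothetical ratings in tidy format: one row per (user, movie, rating).
ratings <- data.frame(
  userId  = c(1, 1, 2, 3, 3),
  movieId = c(10, 20, 20, 10, 30),
  rating  = c(4.0, 3.5, 5.0, 2.0, 4.5)
)

users  <- sort(unique(ratings$userId))
movies <- sort(unique(ratings$movieId))

# Users in rows, movies in columns; unrated combinations stay NA.
A <- matrix(
  NA_real_,
  nrow = length(users), ncol = length(movies),
  dimnames = list(user = users, movie = movies)
)
A[cbind(match(ratings$userId, users),
        match(ratings$movieId, movies))] <- ratings$rating

A

# Share of user-movie pairs without a rating.
mean(is.na(A))
```

Even in this tiny example, almost half of the cells are empty. This is the gap a matrix factorization approach would have to deal with.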

This matrix then gets decomposed into three matrices:

\(\mathbf{\Sigma}\) stores the importance of the latent factors governing the relationship between users and movies.

\(\mathbf{U}\) contains information on how users score on these latent factors. It’s a representation (embedding) of users by the ratings they gave to the movies.

\(\mathbf{V}\) stores how movies score on these same latent factors. It’s a representation (embedding) of movies by how they got rated by said users.
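Assuming the decomposition meant here is a (truncated) singular value decomposition with \(k\) latent factors, the three matrices fit together as follows. The dimension symbols \(n_u\) (number of users), \(n_m\) (number of movies) and \(k\) are introduced just for this sketch:

\[\mathbf{A} \approx \mathbf{U}\,\mathbf{\Sigma}\,\mathbf{V}^t, \qquad \mathbf{U} \in \mathbb{R}^{n_u \times k},\quad \mathbf{\Sigma} \in \mathbb{R}^{k \times k} \text{ (diagonal)},\quad \mathbf{V} \in \mathbb{R}^{n_m \times k}\]

Under that reading, the reconstructed rating of movie \(j\) by user \(i\) is a weighted sum over the latent factors:

\[\hat{a}_{ij} = \sum_{f=1}^{k} u_{if}\,\sigma_{f}\,v_{jf}\]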

As soon as we have a representation of movies as well as users in the same latent space, we can determine their mutual fit by a simple dot product \(\mathbf{m}^t\mathbf{u}\). Assuming the user and movie vectors have been normalized to length 1, this is equivalent to calculating the cosine similarity.
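A quick numerical check of that equivalence, sketched in base R. The vectors are invented, and `normalize` and `cosine` are small helpers defined here for the purpose.

```r
# Hypothetical 3-factor vectors for one user and two movies.
u  <- c(0.5, -1.0, 2.0)
m1 <- c(1.0, -0.5, 1.5)
m2 <- c(-1.0, 2.0, 0.5)

# Mutual fit as a plain dot product.
sum(m1 * u)  # 4
sum(m2 * u)  # -1.5

# Scale a vector to length 1.
normalize <- function(x) x / sqrt(sum(x^2))

# Cosine similarity computed directly from the unnormalized vectors.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# After normalization, the dot product is the cosine similarity.
all.equal(sum(normalize(m1) * normalize(u)), cosine(m1, u))  # TRUE
```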

What does all this have to do with embeddings?

Well, the same overall principles apply when we work with user and movie embeddings instead of vectors obtained from matrix factorization. We’ll have one layer_embedding for users, one layer_embedding for movies, and a layer_lambda that calculates the dot product.

How well does this work? Final RMSE (the square root of the MSE loss we were using) on the validation set is around 1.08, while popular benchmarks (e.g., of the LibRec recommender system) lie around 0.91. Also, we’re overfitting early. It looks like we need a slightly more sophisticated system.

Training curve for simple dot product model
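For reference, this is how the RMSE relates to the MSE loss, sketched in base R on a handful of invented ratings and predictions (not the model’s actual output):

```r
# Root mean squared error: the square root of the MSE loss.
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Hypothetical true ratings and predictions.
y_true <- c(4, 3, 5, 2)
y_pred <- c(3.5, 3, 4, 3)

mean((y_true - y_pred)^2)  # MSE:  0.5625
rmse(y_true, y_pred)       # RMSE: 0.75
```

Because the RMSE is on the same scale as the ratings themselves, a value around 1 means predictions are typically off by about one rating point.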

Accounting for user and movie biases

A problem with our method is that we attribute the rating as a whole to user-movie interaction. However, some users are intrinsically more critical, while others tend to be more lenient. Analogously, films differ by average rating. We hope to get better predictions when factoring in these biases.

Conceptually, we then calculate a prediction like this:

\[\mathit{pred} = \mathit{avg} + \mathit{bias}_m + \mathit{bias}_u + \mathbf{m}^t\mathbf{u}\]

The corresponding Keras model gets just slightly more complex. In addition to the user and movie embeddings we’ve already been working with, the below model embeds the average user and the average movie in 1-d space. We then add both biases to the dot product encoding user-movie interaction. A sigmoid activation normalizes to a value between 0 and 1, which then gets mapped back to the original space.
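The arithmetic of that forward pass can be sketched in base R for a single user-movie pair. This follows the verbal description only: it is not the Keras model itself, it does not model the global average separately, and it assumes that “mapping back” means a linear rescale to the rating range. The function name `predict_rating`, the example vectors and the default range of 0.5 to 5 (the MovieLens rating scale) are assumptions of this sketch.

```r
sigmoid <- function(x) 1 / (1 + exp(-x))

# u, m: latent factor vectors; bias_u, bias_m: scalar biases.
# min_rating and max_rating are assumed bounds of the rating scale.
predict_rating <- function(u, m, bias_u, bias_m,
                           min_rating = 0.5, max_rating = 5) {
  # User-movie interaction plus both biases.
  score <- sum(m * u) + bias_u + bias_m
  # Squash to (0, 1), then rescale to the rating range.
  sigmoid(score) * (max_rating - min_rating) + min_rating
}

# Hypothetical embeddings for one user and one movie.
u <- c(0.2, -0.1, 0.4)
m <- c(0.3, 0.5, -0.2)

# Same interaction term, different user bias.
lenient  <- predict_rating(u, m, bias_u =  0.8, bias_m = 0.1)
critical <- predict_rating(u, m, bias_u = -0.8, bias_m = 0.1)

lenient > critical  # TRUE
```

One consequence of the sigmoid is that predictions can never leave the valid rating range, however large the raw score gets.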

Note how in this model, we also use dropout on the user and movie embeddings (again, the best dropout rate is open to experimentation).

Not only does this model overfit later, it actually reaches a way better RMSE of 0.88 on the validation set!

Training curve for dot product model with biases

Spending some time on hyperparameter optimization could very well lead to even better results. As this post focuses on the conceptual side though, we want to see what else we can do with those embeddings.

Embeddings: a closer look

We can easily extract the embedding matrices from the respective layers. Let’s do this for movies now.

How are they distributed? Here’s a heatmap of the first 20 movies. (Note how we increment the row indices by 1, because the very first row in the embedding matrix belongs to a movie id 0 which does not exist in our dataset.) We see that the embeddings look rather uniformly distributed between -0.5 and 0.5.

We’ll leave it to the knowledgeable reader to name these factors, and proceed to our second topic: How does an embedding layer do what it does?

Do-it-yourself embeddings

You may have heard people say that all an embedding layer does is a lookup. Imagine you had a dataset that, in addition to continuous variables like temperature or barometric pressure, contained a categorical column characterization consisting of tags like “foggy” or “cloudy”. Say characterization had 7 possible values, encoded as a factor with levels 1-7.

Were we going to feed this variable to a non-embedding layer, layer_dense say, we’d have to take care that those numbers do not get taken for integers, thus falsely implying an interval (or at least ordered) scale. But when we use an embedding as the first layer in a Keras model, we feed in integers all the time! For example, in text classification, a sentence might get encoded as a vector padded with zeroes, like this:

2 77 4 5 122 55 1 3 0 0

The thing that makes this work is that the embedding layer actually does perform a lookup. Below, you’ll find a very simple custom layer that does essentially the same thing as Keras’ layer_embedding:

It has a weight matrix self$embeddings that maps from an input space (movies, say) to the output space of latent factors (embeddings).

When we call the layer, as in

x <- k_gather(self$embeddings, x)

it looks up the passed-in row number in the weight matrix, thus retrieving an item’s distributed representation from the matrix.
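The same idea can be tried out without Keras. The base R sketch below treats a randomly initialized matrix as the embedding weights (7 rows, as for the 7 levels of the hypothetical characterization column; the embedding size of 3 and the initialization range are arbitrary choices) and shows that indexing rows gives the same result as multiplying one-hot vectors by the weight matrix. Note that R counts rows from 1, whereas the embedding matrices extracted above start with a row for id 0.

```r
set.seed(42)

n_items       <- 7  # number of categories
embedding_dim <- 3  # size of each embedding vector (arbitrary)

# Stand-in for the layer's weight matrix: one row per category.
embeddings <- matrix(
  runif(n_items * embedding_dim, min = -0.05, max = 0.05),
  nrow = n_items
)

# A batch of integer-encoded inputs (1-based, as usual in R).
ids <- c(2, 7, 2, 5)

# The lookup: pick the matching rows.
looked_up <- embeddings[ids, , drop = FALSE]

# The equivalent, but wasteful, route via one-hot vectors.
one_hot    <- diag(n_items)[ids, , drop = FALSE]
multiplied <- one_hot %*% embeddings

all.equal(looked_up, multiplied)  # TRUE
dim(looked_up)                    # 4 3
```

The lookup avoids ever materializing the one-hot vectors, which matters when the number of categories (users, movies, words) is large.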

We end up with an RMSE of 1.13 on the validation set, which is not far from the 1.08 we obtained when using layer_embedding. At least, this should tell us that we successfully reproduced the approach.

Conclusion

Our goals in this post were twofold: Shed some light on how an embedding layer can be implemented, and show how embeddings calculated by a neural network can be used as a substitute for component matrices obtained from matrix decomposition. Of course, this is not the only thing that’s fascinating about embeddings!

For example, a very practical question is how much actual predictions can be improved by using embeddings instead of one-hot vectors; another is how learned embeddings might differ depending on what task they were trained on. Last but not least: how do latent factors learned via embeddings differ from those learned by an autoencoder?

In that spirit, there is no lack of topics for exploration and poking around …

Custom models are a recent Keras feature that allows for a flexible definition of the forward pass. While the current use case does not require using a custom model, it nicely illustrates how the network’s logic can quickly be grasped by looking at the call method.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".