Distributed representation of anything

In this review, we explore distributed representations of various things we find on the Internet – words, paragraphs, people, photographs. These representations can be used for a variety of purposes, as illustrated below. We deliberately select subjects that seem disparate, rather than providing a comprehensive review of all applications of distributed representations.

| Input | Model output |
| --- | --- |
| Word vectors | Sentiment analysis |
| Paragraph vectors | Clustering paragraphs |
| People vectors (Wiki articles) | Comparisons |
| Photos and Words vectors | Photograph retrieval |

Excited? I am! Let’s jump in.

Distributed representation of words

This is where the story begins: the idea of representing some qualitative concept (e.g. words) in a quantitative manner. If we look up a word in a dictionary, we get its definition in terms of other qualitative words – helpful for humans, but not much help to a computer (unless we do additional post-processing, e.g. feeding the word vectors of the definition words into another neural network). In this previous post, we introduced the idea of word vectors – a numeric representation of a qualitative word.

In summary, current NLP practice often substitutes each word with a fixed-length numeric vector, so that words of similar meanings have similar numeric vectors. It is worth re-highlighting the training concept: the numeric vector of a word (let's call it the center word) is optimized to predict the surrounding context words. This training concept is extended into the novel applications we will see below.

Distributed representation of paragraphs

An interesting extension of word2vec is the distributed representation of paragraphs: just as a fixed-length vector can represent a word, a separate fixed-length vector can represent an entire paragraph.

Simply summing word-vectors across the paragraph is a reasonable approach: "as the word vectors are trained to predict the surrounding words in the sentence, the vectors are representing the distribution of the context in which a word appears. These values are related logarithmically to the probabilities computed by the output layer, so the sum of two word vectors is related to the product of the two context distributions. [1]" Since the summation of word-vectors is commutative – the order of summation doesn't matter – this approach doesn't preserve word order. Below, we review two alternative ways to train paragraph vectors.
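A tiny sketch of the summation approach, using random stand-ins for pre-trained word vectors (the words and dimensions are arbitrary). It also demonstrates the limitation just mentioned: because addition is commutative, reversing the word order yields the exact same paragraph vector.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical pre-trained word vectors (random here, for illustration only).
vectors = {w: rng.normal(size=5) for w in ["dogs", "chase", "cats"]}

def paragraph_vector(tokens):
    """Represent a paragraph as the sum of its word vectors."""
    return np.sum([vectors[t] for t in tokens], axis=0)

a = paragraph_vector(["dogs", "chase", "cats"])
b = paragraph_vector(["cats", "chase", "dogs"])  # same words, reversed order
# Summation is commutative, so word order is lost:
assert np.allclose(a, b)
```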

[2] proposed two ways to train paragraph vectors; what the two methods share is that both representations are learned by predicting words from the paragraph.

The first method is PV-DM (paragraph vector: distributed memory). It samples a fixed-length context window (say, 3 words) from the paragraph. Each of the 3 words is represented by a 7-dimensional vector, and the paragraph itself by another 7-dimensional vector. The 4 (3 + 1) vectors are either concatenated (into a 28-dimensional vector) or averaged (into a 7-dimensional vector) and used as input to predict the next word. The concatenation of word vectors in a small context window takes word order into consideration.

PV-DM illustration.
Source: [2]
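The PV-DM input construction above can be sketched in a few lines – random vectors stand in for learned embeddings, and the dimensions follow the text's illustrative example (7-dimensional vectors, a 3-word window):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 7  # embedding dimension from the text's example

para_vec = rng.normal(size=D)                     # one vector per paragraph
context = [rng.normal(size=D) for _ in range(3)]  # 3 sampled context-word vectors

# PV-DM combines the paragraph vector with the context-word vectors either by
# concatenation (order-aware) ...
x_concat = np.concatenate([para_vec] + context)   # shape (28,)
# ... or by averaging (order-free):
x_avg = np.mean([para_vec] + context, axis=0)     # shape (7,)
```

Either `x_concat` or `x_avg` then feeds a softmax classifier that predicts the next word.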

The second method is PV-DBOW (paragraph vector: distributed bag of words). It randomly samples words (say, 4) from the paragraph and uses only the paragraph vector as input to predict them.

PV-DM draws target words from the window surrounding the center word; PV-DBOW draws target words from anywhere in the paragraph.

PV-DBOW stores less data – only the softmax weights are stored, as opposed to both softmax weights and word vectors in PV-DM.

PV-DBOW illustration
Source: [2]
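A minimal sketch of one PV-DBOW training sample, again with random stand-ins for learned parameters: the paragraph vector alone is scored against the softmax weights, and the loss is taken over 4 words sampled from the paragraph. Note the only parameters are the paragraph vector and the softmax weights – no word-vector table, which is why PV-DBOW stores less.

```python
import numpy as np

rng = np.random.default_rng(3)
V, D = 50, 7                                    # vocabulary size, embedding dim
paragraph = rng.integers(0, V, size=20)         # a paragraph as word indices
para_vec = rng.normal(scale=0.1, size=D)        # the ONLY input PV-DBOW uses
W_softmax = rng.normal(scale=0.1, size=(V, D))  # softmax weights (the only other parameters)

# Randomly sample 4 target words from the paragraph, as described above.
targets = rng.choice(paragraph, size=4, replace=False)

scores = W_softmax @ para_vec
p = np.exp(scores - scores.max())
p /= p.sum()                                    # softmax over the vocabulary
loss = -np.log(p[targets]).sum()                # PV-DBOW objective for this sample
```

Gradient steps on this loss (omitted here) would update `para_vec` and `W_softmax` exactly as in the skip-gram sketch earlier.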

Now that we have paragraph vectors, we can perform clustering on these high-dimensional vectors. Paragraph embeddings, whether trained using the two methods above or by simple summation, enable text articles (e.g. medical notes [3]) to be clustered.
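For instance, plain k-means works directly on the paragraph vectors. The sketch below clusters 20 hypothetical article vectors drawn from two well-separated groups (random data, purely for illustration); a library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical paragraph vectors for 20 articles: two topic groups, separated
# on purpose so the clusters are easy to see (random data, illustration only).
docs = np.vstack([
    rng.normal(0.0, 0.3, size=(10, 7)),
    rng.normal(3.0, 0.3, size=(10, 7)),
])

def kmeans(X, k, iters=20):
    """Plain k-means on the paragraph vectors."""
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if (labels == j).any():  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(0)
    return labels

labels = kmeans(docs, k=2)
```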

Distributed representation of people: Ayumi Hamasaki vs. Lady Gaga

[4] investigates training paragraph vectors on Wikipedia articles: one paragraph vector represents one Wiki article. By training the word vectors jointly with the paragraph vectors, the authors show that the Japanese equivalent of "Lady Gaga" can be found by vector operations: "Lady Gaga" − "American" + "Japanese" ≈ "Ayumi Hamasaki".

The mixed use of word vectors and paragraph vectors is powerful: it can explain the difference between two articles in one word, or the difference between two words in one article. We can, for example, find the word vector that best approximates the difference between the paragraph vectors of "Donald Trump" and "Barack Obama".
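The mechanics of that lookup are just vector arithmetic plus a nearest-neighbor search by cosine similarity. In this sketch, random vectors stand in for the jointly trained embeddings, and the word list is invented, so the "best" word here is arbitrary – the point is the operation itself:

```python
import numpy as np

rng = np.random.default_rng(5)
D = 7
# Hypothetical jointly trained embeddings (random here, for illustration only).
word_vecs = {w: rng.normal(size=D) for w in ["Republican", "Democrat", "singer"]}
article_vecs = {a: rng.normal(size=D) for a in ["Donald Trump", "Barack Obama"]}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# The word whose vector best approximates the difference between two articles:
diff = article_vecs["Donald Trump"] - article_vecs["Barack Obama"]
best = max(word_vecs, key=lambda w: cosine(word_vecs[w], diff))
```

With real embeddings, `best` would be the one-word explanation of the difference between the two articles.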

Exciting, isn’t it? There’s more: Stitch Fix has shown that we can apply these operations to pictures.

Distributed representations for picture retrieval

There is a great existing post from the authors, so please visit it for more details on this fascinating work. In summary, if a client likes an item of apparel, we could add the "pregnant" word vector to the item's vector and retrieve a similar-style maternity version.
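The retrieval step reduces to nearest-neighbor search in a shared embedding space. The sketch below uses random vectors for both the catalog items and the "pregnant" word vector – the item names and dimensions are invented, and this is only the mechanics of the query, not Stitch Fix's actual system:

```python
import numpy as np

rng = np.random.default_rng(6)
D = 7
# Hypothetical apparel-photo embeddings living in the same space as word
# vectors (random here, purely to illustrate the retrieval mechanics).
items = {f"style_{i}": rng.normal(size=D) for i in range(5)}
pregnant = rng.normal(size=D)  # stand-in for the "pregnant" word vector

# Query: the liked item, shifted by the "pregnant" direction.
query = items["style_0"] + pregnant

def nearest(q, catalog, exclude=()):
    """Return the catalog key closest to the query vector (Euclidean)."""
    return min((k for k in catalog if k not in exclude),
               key=lambda k: np.linalg.norm(catalog[k] - q))

match = nearest(query, items, exclude={"style_0"})
```

With real embeddings, `match` would be a similar-style maternity item.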

This article reviews how the notion that a subject is defined by its surrounding context can be used to represent words, paragraphs, people, and even pictures. Math operations can then be performed on these vectors to yield insights and/or retrieve information.

Have I missed any other interesting applications? Please let me know in the comments below!