Word embeddings: the (very) basics

This post is the first of a series on word embeddings, i.e. vector representations of words in a vector space. Word embeddings have been known to linguists for quite some time. Recently, artificial neural networks have taken word embeddings to the next level. I will explain what makes artificial-intelligence-flavored word vectors so appealing in a future post. Right now, I wish to reel back a little and explain the basics.

Words of a feather flock together

A corpus is not a mere bag of words: its most obvious facet is its distributional nature. Distribution takes many forms. Words in a corpus tend to concentrate in certain contexts and to be more diffuse in others (dispersion). They tend to co-occur with specific other words (collocation) or in specific grammatical contexts (colligation). For these reasons, corpus linguistics is mainly a distributional science (Gries 2014, 365).

Unsurprisingly, one obvious source of inspiration for corpus linguistics is the Distributional Hypothesis, henceforth DH (Firth 1957, Harris 1954, Miller & Charles 1991). In its simplest form, the DH states that lexemes with a similar distribution have similar "meanings". Whether you subscribe to the DH depends on how far you are willing to stretch the definition of meaning. Consider the three examples below:

We had a wonderful time in Paris.

We had a terrible time in Paris.

We had a wonderful dinner in Paris.

Opponents of the DH will rightly argue that the adjectives wonderful and terrible have identical distributions, yet they are antonyms. The same goes for the nouns time and dinner, which have little in common semantically. Supporters of the DH will counter that wonderful and terrible both belong to the hyper-category of appreciative adjectives, and that time and dinner both belong to the hyper-category of nouns that are open to appreciative qualification. Pick a side. My position is somewhere in the middle. The definition of meaning that the DH posits is clearly limited, but at least we know its limits. Another serious issue is that the DH forces us to think in terms of similarities instead of semantic relations. Having said that, the DH is easily operationalized in a corpus.

Word embeddings

Word embeddings are a computational implementation of the DH. Suppose we have a mini corpus with seven words (bee, eagle, goose, helicopter, drone, rocket, and jet) and three contexts (wings, engine, and sky). Each word is characterized by three coordinates, which correspond to the number of times the word is found in each context. For example, helicopter is not found in the wings context and occurs four times and twice in the contexts engine and sky, respectively. Its coordinates are therefore (0, 4, 2). Each word occupies a specific position in the vector space, as represented in Fig. 1.

Fig. 1. A vector space of seven words in three contexts

The word vector is the arrow from the origin, i.e. the point where the three axes intersect, to the end point defined by the word's coordinates.

The presupposition underlying word embeddings is that semantic similarity is indexed on contextual affinity. For example, helicopter and drone occur in similar contexts, so they have similar vector profiles and are therefore close in the vector space.

It is customary to collect all coordinates in a matrix such as the one below.

word         context: wings   context: engine   context: sky
bee                3                 2                0
eagle              3                 3                0
goose              2                 4                0
helicopter         0                 4                2
drone              0                 3                3
rocket             0                 2                4
jet                1                 1                1

Each line is a vector. The vectors contained in the matrix are said to be explicit because each dimension corresponds to a well-identified context. Most of the time, matrices of explicit vectors contain many "empty" cells, i.e. cells whose value is zero. This matrix is deliberately simple: each vector is three-dimensional. In the real world, such a matrix can easily reach several thousand lines and columns, depending on the size of the corpus.
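For the code snippets further down, I assume the matrix is stored in a variable named m. It can be built in base R as follows:

```r
# build the word-by-context frequency matrix from the table above
m <- matrix(
  c(3, 2, 0,   # bee
    3, 3, 0,   # eagle
    2, 4, 0,   # goose
    0, 4, 2,   # helicopter
    0, 3, 3,   # drone
    0, 2, 4,   # rocket
    1, 1, 1),  # jet
  nrow = 7, byrow = TRUE,
  dimnames = list(
    c("bee", "eagle", "goose", "helicopter", "drone", "rocket", "jet"),
    c("wings", "engine", "sky")
  )
)
m["helicopter", ]  # the (0, 4, 2) vector from Fig. 1
```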

Cosine similarity

As said above, the DH forces us to think in terms of similarities. Although this results in a simplistic view of meaning, a nice consequence is that vector coordinates can be used to calculate the proximity between words. This is done with cosine similarity (\cos \theta), i.e. the cosine of the angle between two word vectors.

Fig. 2 below shows that the similarity between helicopter and drone (\theta_1) is the same as the similarity between drone and rocket (\theta_2). It also shows that helicopter is not as similar to rocket (\theta_1 + \theta_2) as drone is.

Fig. 2. Cosine similarities between ‘helicopter’ and ‘drone’, and between ‘drone’ and ‘rocket’

Let us see briefly how cosine similarity is measured. Let \vec{a} and \vec{b} denote two vectors of dimension n. Cosine similarity is calculated as follows:

\cos \theta = \frac{\vec{a} \cdot \vec{b}}{\lVert \vec{a} \rVert \, \lVert \vec{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
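As a sanity check, the formula can be applied by hand in base R. The two vectors below are the helicopter and drone rows of the frequency matrix:

```r
# cosine similarity from first principles:
# the dot product divided by the product of the two vector norms
a <- c(0, 4, 2)  # helicopter
b <- c(0, 3, 3)  # drone

cos_sim <- sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
round(cos_sim, 2)  # 0.95
```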

The lsa package (Wild 2007) has a dedicated function for the calculation of cosine similarity: cosine(). Install and load the package.

install.packages("lsa")
library(lsa)

When you run the cosine() function, transpose the matrix with t(): cosine() computes similarities between the columns of its input, so transposing guarantees that you compute cosine similarity between the words, not the contexts.

cosine(t(m))

My recommendation is to round the results to two decimals. This makes the matrix easier to interpret.

round(cosine(t(m)), 2)

You obtain a matrix of pairwise cosine similarities.

Theoretically, similarity scores range from -1 (complete opposition) to 1 (identity). A score of 0 indicates orthogonality (decorrelation). Values in between indicate intermediate degrees of similarity (between 0 and 1) or dissimilarity (between 0 and -1). Here, the cosine similarities range from 0 to 1 because word frequencies are never negative: the angle between two word vectors cannot exceed 90°.

Because the matrix is symmetric, it is divided into two triangles on either side of the diagonal of exact similarity (\cos \theta = 1) between each word and itself. The lowest similarity is observed between bee and rocket (\cos \theta = 0.25).

min(round(cosine(t(m)), 2))

Aside from the diagonal of exact similarity, the highest similarity is observed between bee and eagle (\cos \theta = 0.98). We should not take these results at face value because (a) they are based on made-up frequencies, and (b) reasoning in terms of distributional similarities is far more reductive than reasoning in terms of semantic relations.
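These extreme values can also be retrieved programmatically. The sketch below recomputes the similarity matrix in base R, which should match cosine(t(m)) from lsa, masks the trivial diagonal, and looks up the most and least similar pairs (the matrix m is the toy frequency matrix from earlier, rebuilt here so the snippet is self-contained):

```r
# word-by-context frequency matrix (rows = words, columns = contexts)
m <- matrix(
  c(3, 2, 0,  3, 3, 0,  2, 4, 0,  0, 4, 2,  0, 3, 3,  0, 2, 4,  1, 1, 1),
  nrow = 7, byrow = TRUE,
  dimnames = list(c("bee", "eagle", "goose", "helicopter", "drone", "rocket", "jet"),
                  c("wings", "engine", "sky"))
)

# base-R cosine matrix: row dot products divided by the outer product of row norms
norms <- sqrt(rowSums(m^2))
sim <- (m %*% t(m)) / (norms %o% norms)

diag(sim) <- NA  # mask the word-with-itself similarities
which(sim == max(sim, na.rm = TRUE), arr.ind = TRUE)  # bee & eagle (twice: symmetry)
round(min(sim, na.rm = TRUE), 2)                      # 0.25 (bee & rocket)
```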

The above is but the tip of the iceberg, yet it is a prerequisite for understanding more elaborate methods that operationalize the DH, namely Distributional Semantic Models (DSMs), which I will introduce in a future post. Their underlying logic is that the statistical distribution of lexemes in linguistic contexts can be used to model meaning. Nowadays, DSMs rely on matrices of dense vectors, i.e. vectors whose dimensions cannot be interpreted explicitly and whose values are real numbers, not just integers.