Introduction to Word Embeddings

Word embeddings are commonly used in many Natural Language Processing (NLP) tasks because they have proven to be useful representations of words and often lead to better performance across those tasks. Given their widespread use, this post introduces the concept of word embeddings to the prospective NLP practitioner.

Word embeddings allow words to be represented by series of numbers – which we will refer to as real-valued vectors from now on. For example, each word in a phrase can be represented by its own vector, each vector having a dimension of 2.
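To make this concrete, here is a minimal sketch in Python. The vector values are made up purely for illustration – real embeddings are learned from data, as discussed below.

```python
# Toy illustration: each word in the phrase "the cat sat" is mapped to a
# 2-dimensional real-valued vector. The values are made up for this
# example; real embeddings are learned from a large corpus.
embeddings = {
    "the": [0.1, -0.4],
    "cat": [0.7, 0.3],
    "sat": [-0.2, 0.9],
}

phrase = "the cat sat"
vectors = [embeddings[word] for word in phrase.split()]
print(vectors)  # three vectors, each of dimension 2
```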

I found it intriguing that a qualitative and abstract idea such as a word could be represented by a numeric vector of a fixed dimension. A common question is: what do these numbers represent, and how are they decided? Wouldn’t it be nice if one dimension corresponded to a degree of happiness, another to a degree of formality, and the algorithm placed the words in the vector space according to these interpretable dimensions? Unfortunately, the dimensions are not optimized to represent concepts. There have been, however, studies that use dimension reduction techniques to reveal clusters of words with similar meanings.

Word vector values

Broadly speaking, the vector of each word is optimized to predict the words that surround it.

“You shall know a word by the company it keeps.” – Firth, J. R. (1957)

In computational linguistics, this idea is known as the distributional hypothesis. It underlies the training of the two most popular sets of word embeddings in the community – word2vec [1] and GloVe [2]. During training, the algorithm repeatedly adjusts the values of each word vector so that it better predicts its surrounding context words. If you would like to learn more details, I highly recommend this lecture from Stanford University and, in fact, the entire series.
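The raw signal that this training relies on can be sketched without any machine learning at all: simply tabulate which words occur near each target word. The toy corpus below is an assumption for illustration; word2vec goes further and learns vectors that predict these context words, rather than merely counting them.

```python
from collections import Counter, defaultdict

# A minimal sketch of the distributional hypothesis: count which words
# appear within a +/-2 word window of each target word in a tiny toy
# corpus. Words with similar meanings tend to share context words.
corpus = [
    "the frog sat on the log",
    "the toad sat on the rock",
]

window = 2
contexts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                contexts[target][tokens[j]] += 1

# 'frog' and 'toad' end up with the same context words ('the', 'sat',
# 'on'), hinting that their learned vectors would land close together.
print(contexts["frog"])
print(contexts["toad"])
```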

For these two sets of word embeddings, a large corpus – with billions of tokens – is used to train to convergence. Once trained, the word vectors are dense, distributed representations of the words: dense in the sense that they combat the problem of sparsity, discussed in the next section; distributed because the meaning – formally, the semantic content – of each word is spread across the dimensions of the vector.

Motivation

Before neural network architectures became popular in NLP, a common approach was to use Support Vector Machines (SVMs) to tackle text classification problems. A typical way to handle the input was an n-gram approach: count the number of occurrences of each uni-, bi- and tri-gram, where a unigram is a single word and a bigram is a two-word phrase. The major drawback of this approach is the curse of dimensionality, also known as the problem of sparsity. The English vocabulary is very large – let’s take 20k words as an example. Although ‘frog’ and ‘toad’ have similar meanings, under a unigram representation this pair is as different as ‘frog’ and ‘hotel’. This is clearly undesirable, and word vectors tackle the problem because similar words have similar vectors.
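The frog/toad/hotel point can be demonstrated directly. Under one-hot (unigram) vectors, every pair of distinct words has identical similarity; with dense vectors, similar words score high. The dense vector values below are made-up assumptions for illustration.

```python
import math

# Sketch of the sparsity problem: under one-hot (unigram) vectors, every
# pair of distinct words is equally dissimilar, so 'frog'/'toad' look no
# more alike than 'frog'/'hotel'. Toy dense vectors recover the similarity.
vocab = ["frog", "toad", "hotel"]
one_hot = {w: [1.0 if w == v else 0.0 for v in vocab] for w in vocab}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up dense vectors: 'frog' and 'toad' point in similar directions.
dense = {"frog": [0.9, 0.8], "toad": [0.85, 0.75], "hotel": [-0.7, 0.2]}

print(cosine(one_hot["frog"], one_hot["toad"]))  # 0.0 -- same as frog/hotel
print(cosine(dense["frog"], dense["toad"]))      # close to 1.0
```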

Being memory intensive is another problem of the n-gram approach. If the unigram representation requires 20k columns to count occurrences, bigram and trigram counters require combinatorially more columns – up to 20k² and 20k³ respectively. Researchers have also investigated ways to represent multiple words: [3] extends word representations to phrase representations, and [4] further extends the idea to represent paragraphs. Because of its success, dense distributed embedding has been an active area of research, extending into other interdisciplinary domains – for example, the medical domain, to represent medical concepts [5], notes, visits [6] and patients [7].

Desirable properties

Through optimizing the word vectors to best predict their context words, clusters of similar words and relationships between words are formed.

As previously mentioned, dimension reduction techniques can be applied to project the multi-dimensional vectors into 2D space to visualize the clusters and relationships between words.
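One common way to do this projection is PCA. The sketch below projects toy 5-dimensional vectors (made-up values, standing in for real word2vec/GloVe vectors) down to 2D via the SVD; the resulting coordinates are what one would scatter-plot.

```python
import numpy as np

# Illustrative sketch: project toy 5-dimensional word vectors down to 2D
# with PCA (via SVD) so they can be plotted. The values are made up; in
# practice one would project real word2vec/GloVe vectors the same way.
words = ["frog", "toad", "hotel", "motel"]
X = np.array([
    [0.90, 0.80, 0.10, 0.00, 0.20],   # frog
    [0.85, 0.75, 0.15, 0.05, 0.25],   # toad
    [-0.70, 0.20, 0.90, 0.80, -0.10], # hotel
    [-0.65, 0.25, 0.85, 0.75, -0.15], # motel
])

X_centered = X - X.mean(axis=0)           # PCA requires mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T              # coordinates along top-2 principal axes

print(dict(zip(words, X_2d.round(2).tolist())))
```

Even in 2D, 'frog' stays near 'toad' and far from 'hotel', which is exactly the clustering such plots reveal.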

Words that are close in meaning are clustered near one another. As illustrated on the GloVe website, the nearest neighbors of ‘frog’ include ‘frogs’, ‘toads’ and ‘litoria’. This means a classifier can see only ‘frog’ during training, never ‘litoria’, and still not be thrown off when it encounters ‘litoria’ during testing, because the two word vectors are similar.
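Nearest neighbors are typically found by ranking the vocabulary by cosine similarity to a query vector. The 2-d vectors below are made-up assumptions chosen so that the toy ranking mirrors the GloVe example; real neighbors are computed over vectors with hundreds of dimensions.

```python
import math

# Toy sketch: rank words by cosine similarity to 'frog' using made-up
# 2-d vectors. With real GloVe vectors, the top neighbors of 'frog'
# include 'frogs', 'toads', and 'litoria'.
vectors = {
    "frog":    [0.90, 0.80],
    "frogs":   [0.91, 0.81],
    "toads":   [0.82, 0.70],
    "litoria": [0.85, 0.60],
    "hotel":   [-0.70, 0.20],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = vectors["frog"]
neighbors = sorted(
    (w for w in vectors if w != "frog"),
    key=lambda w: cosine(query, vectors[w]),
    reverse=True,
)
print(neighbors)  # 'hotel' ranks last
```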

As illustrated on the GloVe website, word embeddings also learn relationships – formally, linear substructures – between words. The vector difference between a pair of words can be added to another word vector to find the analogous word. For example, “man” – “woman” + “queen” ≈ “king”.
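This analogy arithmetic can be sketched with idealized vectors in which one axis encodes "royalty" and the other "gender". Real embeddings exhibit this only approximately, and in far higher dimensions; the exact values here are assumptions for illustration.

```python
# Idealized 2-d vectors: axis 0 encodes "royalty", axis 1 encodes "gender".
# Real embeddings satisfy such analogies only approximately.
vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "king":  [1.0, 1.0],
}

# "man" - "woman" + "queen" should land on (or near) "king".
result = [m - w + q for m, w, q in
          zip(vectors["man"], vectors["woman"], vectors["queen"])]
print(result)  # [1.0, 1.0], i.e. exactly vectors["king"] in this toy setup
```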

Conclusion

This post introduced word embeddings in natural language processing. It briefly discussed the process through which the values of word vectors are adjusted, the motivation for using word vectors, and the desirable properties associated with them.