Fast Text and Skip-Gram

Sep 28, 2016
14 minute read

In the last few years word embeddings have proved to be very effective in various natural language processing tasks like classification (Kim's paper on sentence classification is a good example). The focus of this post is to understand word embeddings through code, which leaves scope for easy experimentation by the reader on the specific problems they are dealing with.

There are various fantastic posts on word embeddings and the details behind them. Here is a short list of posts.

In this post, we will implement a very simple version of the fastText paper on word embeddings. We will build up to it through the concepts it relies on, and eventually implement the fastText model itself. Word embeddings are a way to represent words as dense vectors instead of just indices or a bag of words. The reasons for doing so are as follows:

When you represent words as indices, the fact that words by themselves carry meaning is not adequately captured.

the:1, hello:2, cat:3, dog:4, television:5 ..

Here, even though cat and dog are both animals, the indices that represent them have no relationship to each other. What would be ideal is a representation in which related words also have related vectors or indices.

Bag-of-words representations also suffer from similar problems, and more details about those problems can be found in the resources mentioned above.

Now that the motivation is clear, the goal of word embeddings or word vectors is to have a representation for each word that also inherently carries some meaning.

In the above diagram the words are related to each other in the vector space, so vector arithmetic gives some interesting properties like the following:

king - man + woman = queen
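To see this arithmetic concretely, here is a tiny hand-rolled sketch. The vectors below are made up for illustration (real embeddings are learned, not hand-picked), and the `analogy` helper is a hypothetical name, not a library function:

```python
import numpy as np

# Toy 3-d embeddings, chosen by hand so that the offset between
# (king, queen) matches the offset between (man, woman).
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.2, 0.8]),
    "apple": np.array([0.1, 0.5, 0.3]),  # distractor word
}

def analogy(a, b, c):
    """Return the word closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    def cosine(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the query words themselves, as word2vec's analogy evaluation does.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

print(analogy("king", "man", "woman"))  # → queen
```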

Now, the details of how word embeddings are constructed is where things get really interesting. The key idea behind word2vec is the distributional hypothesis: words are characterized by the words they hang out with. For example, the word “rose” is more likely to be seen around the word “red”, and the word “sky” is more likely to be seen around the word “blue”. This part will become clearer through code.

Let’s start by using the Airbnb dataset. It can be found here. Also, I did some preprocessing, but it should be fairly easy to just extract the text field by loading the CSV into a pandas data frame and getting the review column.

```python
import pandas as pd

data = pd.read_csv('AirbnbData/reviews.csv')
```

The data is quite interesting and there is a lot of scope to use it for other
purposes, but we are only interested in the text column, so let’s concentrate on
that. Here is a random example of a review.

data['text'][4]

'I enjoy playing and watching sports and listening to music...all types and all sorts!'

Skip-Gram approach

The first concept we will go through is skip-gram. Here we want to learn word representations based on how words occur in a sentence, specifically the words they hang out with (the distributional-hypothesis idea we discussed above).
The fifth sentence in the dataset is “I enjoy playing and watching sports and listening to music…all types and all sorts!”. To create a training dataset that exploits the distributional hypothesis, we will build training batches of (word, context) pairs for each word. What we want is for the words adjacent to a given word to have a higher probability of occurring together, and for words away from it to have a lower probability. (Not quite true; essentially, words that are likely to occur together should have a higher probability than words that don’t.) For example, in the sentence “Color of the rose is red”, we want to maximize p(red|is) and minimize, say, p(green|is), which is a noisy example.

The goal is to have a dataset where we can distinguish whether a word is present in a context and mark it positive, else mark the word as negative. Keras has some useful utilities that let you do that very easily.
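To make the pair construction concrete, here is a pure-Python sketch of what a helper like Keras’s `skipgrams` does. The function name and the negative-sampling details below are my own simplification, not the library’s exact behavior:

```python
import random

def skipgram_pairs(sequence, vocab_size, window_size=2, seed=0):
    """Generate ([target, context], label) pairs: label 1 for words seen
    together within the window, label 0 for randomly drawn negative words."""
    rng = random.Random(seed)
    couples, labels = [], []
    for i, target in enumerate(sequence):
        lo = max(0, i - window_size)
        hi = min(len(sequence), i + window_size + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            # Positive pair: target and a real context word.
            couples.append([target, sequence[j]])
            labels.append(1)
            # Negative pair: target and a random word from the vocabulary.
            couples.append([target, rng.randrange(1, vocab_size)])
            labels.append(0)
    return couples, labels

# "I enjoy playing" encoded as word indices 1, 2, 3.
pairs, labels = skipgram_pairs([1, 2, 3], vocab_size=50, window_size=1)
```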

Just as a quick note, we will randomly sample words from the dataset to create our training data. There is a problem with this, though: more common words will get sampled more frequently than uncommon ones. For instance, the word “the” will be sampled really frequently simply because it occurs often. Since we do not want to sample such words that frequently, we will use a sampling table. A sampling table essentially gives the probability of sampling the i-th most common word in a dataset (more common words should be sampled less frequently, for balance) [from the Keras documentation].
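As a rough illustration, here is the word2vec-style subsampling formula such a table is based on. The function below is a simplified sketch: Keras’s actual `make_sampling_table` approximates word frequencies with Zipf’s law by rank rather than taking frequencies as input:

```python
import math

def sampling_probability(word_frequency, t=1e-3):
    """Word2vec-style keep probability: very frequent words (large f)
    are kept with probability sqrt(t / f); rare words are always kept."""
    return min(1.0, math.sqrt(t / word_frequency))

# 'the' might make up ~5% of all tokens; a rare word maybe 0.001%.
print(sampling_probability(0.05))      # frequent word: sampled rarely
print(sampling_probability(0.00001))   # rare word: always kept
```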

To go through the details of the model: we create target and context pairs first (more details on how to create these later). Then each word is represented in a new vector space of dimension 100. The layers “target” and “context” represent the two words; if the target word and the context word appear together in a context, the label is 1, otherwise 0. This is how the task has been framed as a binary classification problem. So in our example,

‘I enjoy playing and watching sports and listening to music…all types and all sorts!’

The [target, context] pairs will be for instance,

[enjoy, I], [enjoy, playing] with label 1 (since these words occur next to each other), and some noisy examples from the vocabulary, such as [enjoy, green], with label 0

Just to go through the details of every step: when we take the dot product along the second axis in the Merge layer, we are essentially finding the similarity between the two vectors, the context word and the target word. The reason for doing it this way is that you can now think of contexts in different ways. A context may not just be the words a word occurs with, but also the characters it contains; character n-grams can be a context. Yes, this is where the fastText word embeddings come in. More on that later in this post.
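To illustrate what that dot product computes, here is a small NumPy sketch of the scoring step. The matrices, dimensions, and `score` helper below are made up for illustration; in the real model the embedding matrices are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100

# Two embedding matrices, one for target words and one for context words,
# mirroring the two Embedding layers feeding the dot-product Merge layer.
target_embeddings = rng.normal(scale=0.1, size=(vocab_size, dim))
context_embeddings = rng.normal(scale=0.1, size=(vocab_size, dim))

def score(target_idx, context_idx):
    """Dot product of the two word vectors, squashed through a sigmoid
    so it can be read as P(label = 1 | target, context)."""
    dot = target_embeddings[target_idx] @ context_embeddings[context_idx]
    return 1.0 / (1.0 + np.exp(-dot))

p = score(5, 42)  # probability that words 5 and 42 co-occur
```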

But let’s dive into contexts a bit more, and how specific problems can define contexts differently. Maybe in your task you can define contexts with not just words and characters, but with the shape of the word, for instance. Do similarly shaped words tend to have similar meanings? Maybe “dog” and “cat” both have the shape “char-char-char”. Or are they always nouns? Pronouns? Verbs? You get the idea.
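As a toy illustration of a shape-based context, here is one possible (entirely hypothetical) word-shape function:

```python
def word_shape(word):
    """Map each character to a coarse class: lowercase letter, uppercase
    letter, digit, or other. 'dog' and 'cat' both become 'ccc'."""
    out = []
    for ch in word:
        if ch.islower():
            out.append('c')
        elif ch.isupper():
            out.append('C')
        elif ch.isdigit():
            out.append('d')
        else:
            out.append('x')
    return ''.join(out)

print(word_shape('dog'), word_shape('cat'))  # both 'ccc'
```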

This is how we get the word vectors in skip-gram.

We will come back to skip-gram again when we discuss the fastText embeddings. But there is another word embedding approach, known as CBOW, or continuous bag of words.

Now in CBOW the opposite happens: from the surrounding context words we try to predict the target word.
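A minimal sketch of that idea, assuming random (untrained) matrices: average the context word vectors, then score every vocabulary word against that average. The names and sizes below are illustrative, not from any library:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, dim = 100, 16
embeddings = rng.normal(scale=0.1, size=(vocab_size, dim))      # input vectors
output_weights = rng.normal(scale=0.1, size=(vocab_size, dim))  # output vectors

def cbow_scores(context_indices):
    """Average the context word vectors, then score every vocabulary word
    against that average; the highest score is CBOW's predicted target."""
    context_mean = embeddings[context_indices].mean(axis=0)
    return output_weights @ context_mean

scores = cbow_scores([3, 7, 7, 12])  # context words around a missing target
predicted = int(np.argmax(scores))
```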

Subword Information

The skip-gram approach is effective and useful because it emphasizes the specific word and the words it generally occurs with. This intuitively makes sense: we expect to see words like “Oktoberfest, beer, Germany” occur together, and words like “France, Germany, England” occur together.

But each word also contains information that we want to capture, such as the relationships between and within its characters. This is where character-based n-grams come in, and this is the “subword” information the fastText paper refers to.

So the way fastText works is just with a new scoring function compared to the skip-gram model. The new scoring function is described as follows:
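Denoting by G_w the set of n-grams appearing in word w, by z_g the vector of n-gram g, and by v_c the context word vector, the fastText paper defines the score as

s(w, c) = Σ_{g ∈ G_w} z_gᵀ v_c

i.e. the sum of the dot products between each n-gram vector of the word and the context vector.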

For skip-gram, as you saw, we took the dot product of the two word embedding vectors and that was the score. fastText instead takes the dot product of the context vector with not just the word but all of its corresponding character n-grams of length 3 to 6. So the vector for a word will be built from the collection of its n-grams along with the word itself.

This is the first important part: we need the corresponding character n-grams for each word. The char_ngram_generator function generates the n-grams for a word. The variables n1 and n2 control which n-gram lengths we use. The paper refers to adding a special character to the beginning and the end of the word, so after padding we take n-grams of length 4 up to (but not including) 7. Also, for each word, the list contains the actual word in addition to its n-grams.

```python
# This creates the character n-grams like it is described in fastText
def char_ngram_generator(text, n1=4, n2=7):
    z = []
    # There is a sentence in the paper where they mention they add a
    # special character for the beginning and end of the word to
    # distinguish prefixes and suffixes. This is what I understood.
    # Feel free to send a pull request if this means something else
    text2 = '*' + text + '*'
    for k in range(n1, n2):
        z.append([text2[i:i + k] for i in range(len(text2) - k + 1)])
    z = [ngram for ngrams in z for ngram in ngrams]
    z.append(text)
    return z

ngrams2Idx = {}
ngrams_list = []
vocab_ngrams = {}
for i in vocab:
    ngrams_list.append(char_ngram_generator(i))
    vocab_ngrams[i] = char_ngram_generator(i)
ngrams_vocab = [ngram for ngrams in ngrams_list for ngram in ngrams]
# Offset the n-gram ids so they start after the word-index range
ngrams2Idx = dict((c, i + 6568) for i, c in enumerate(ngrams_vocab))
ngrams2Idx.update(tokenizer.word_index)
words_and_ngrams_vocab = len(ngrams2Idx)
print(words_and_ngrams_vocab)
```

Even though we are not using our own layer in Keras here, Keras provides an extremely easy way to extend and write one’s own layers. Here is an example of how one can add all the rows of a matrix, where each row represents one char n-gram, to get the overall vector for the entire word. This is the key idea behind the subword information of each word: each word is essentially the sum of the vectors of all its corresponding n-grams.
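The summing idea can be sketched in NumPy as follows. The matrix below stands in for the n-gram embeddings such a layer would receive; the values are random for illustration:

```python
import numpy as np

# Suppose each row of this matrix is the embedding of one char n-gram of a
# word: 10 n-grams, 100-dimensional embeddings (values are illustrative).
rng = np.random.default_rng(2)
ngram_vectors = rng.normal(scale=0.1, size=(10, 100))

# The word's vector is simply the sum of its n-gram vectors; inside a
# custom Keras layer this would be a sum over the n-gram axis.
word_vector = ngram_vectors.sum(axis=0)
```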

Here is the network. Each word can have only a certain number of n-grams; here we are limiting that to 10. Each of these n-grams, along with the word itself, is then trained, and we get a corresponding vector for each word and each n-gram in the word. The n-grams form a superset of the vocabulary. Also, because we first created a dictionary of words and then a dictionary of the char n-grams, the word “as” and the bigram “as” inside the word “paste” are assigned different vectors. Finally, we add the corresponding n-gram vectors in each word to get the final representation of the word from its char n-grams, and take a dot product of these two vectors to find the similarity of the two words. Notice that the only difference from normal skip-gram in word2vec is that this time each word also carries the information of its corresponding character n-grams. This is the subword information the paper refers to.