How to use pre-trained word vectors with Keras

Ever wondered how to use pre-trained word vectors like Word2Vec or FastText to train your neural network to its maximum performance? Here’s where to start.

What are pre-trained word vectors?

Word vectors are a representation of actual words using vectors of numbers.

But wait, what are vectors?

In computer science, a vector is essentially a row of numerical values. It can be represented like this:

[23, 18, 45, 56, 41, 98]

Usually, programming languages index positions of a vector starting from 0. In our vector, position 0 has value 23, position 1 has value 18 and so on.
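For example, in Python the same vector can be written as a plain list and indexed starting from 0:

# the vector from above, written as a Python list
vector = [23, 18, 45, 56, 41, 98]

print(vector[0])  # position 0 -> 23
print(vector[1])  # position 1 -> 18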

How can word vectors help us?

In Natural Language Processing, word vectors are very popular because they teach the model the contexts in which a word appears. Adding context to an NLP model can significantly improve its accuracy.

The values in a word vector are the coordinates of that specific word in a (usually) 300-dimensional space.

These vectors can capture semantic similarity between words. That means you can do math operations with words.

For example: husband - man + woman = wife
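As a quick sketch, assuming you already have a set of pre-trained vectors saved in word2vec format (the file name below is only a placeholder), this kind of arithmetic can be reproduced with gensim:

from gensim.models import KeyedVectors

# load pre-trained vectors (placeholder file name, adjust to your own file)
vectors = KeyedVectors.load_word2vec_format('pretrained_vectors.bin', binary=True)

# husband - man + woman -> should land near 'wife'
print(vectors.most_similar(positive=['husband', 'woman'], negative=['man'], topn=1))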

When do we need word vectors?

As I said above, word vectors can help the model understand the context in our text, but context is not always needed.

In a machine learning problem, word vectors are worth using when the classes or continuous values the model has to predict are semantically correlated with the context in the dataset.

What is Word2Vec?

Word2Vec is a shallow, two-layered neural network that is trained on a large corpus of text and outputs a vector space with hundreds of dimensions.

The Word2Vec model can be trained using different architectures to produce different outputs:
- CBOW (continuous bag-of-words): the order of the context words does not influence prediction.
- Skip-gram: nearby context words are weighted more heavily than distant ones.

We will see in the code how exactly we can manipulate this kind of model.

Now let’s see what the Word2Vec class expects as parameters:
- size=n -> the dimension that the Word2Vec vectors will have (300 in our case).
- min_count=n -> include a word in our vocabulary only after n encounters (1 in our case).
- iter=n -> the number of epochs over which the Word2Vec model learns the semantic correlations (10 in our case).
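Here is a minimal training sketch, assuming sentences is a list of tokenized sentences from our own dataset (note: in gensim 4.x the size and iter arguments were renamed to vector_size and epochs):

from gensim.models import Word2Vec

# toy example; in practice these come from your own corpus
sentences = [['the', 'quick', 'brown', 'fox'], ['the', 'lazy', 'dog']]

# sg=0 selects the CBOW architecture, sg=1 selects skip-gram
w2v_model = Word2Vec(sentences, size=300, min_count=1, iter=10, sg=0)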

After the training, we can see what the Word2Vec model learned by using the most_similar function.
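For example, with the model trained above (the query word here is just a placeholder; use any word from your vocabulary):

# nearest neighbours of a word in the learned vector space
print(w2v_model.wv.most_similar('fox', topn=5))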

To use these vectors in Keras, we build an embedding matrix that maps each index from the tokenizer's word_index to its pre-trained vector (here, vocab_and_vectors maps each word to its vector):

import numpy as np

# +1 because index 0 is reserved for padding by the Keras Tokenizer
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = vocab_and_vectors.get(word)
    # words that cannot be found keep a row of zeros
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
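The embedding matrix can then be handed to a Keras Embedding layer as its initial weights and frozen, so the pre-trained vectors are not modified during training. A minimal sketch (only the embedding layer, not a full model):

import tensorflow as tf

# use the pre-trained vectors as fixed weights of the embedding layer
embedding_layer = tf.keras.layers.Embedding(
    input_dim=len(word_index) + 1,   # vocabulary size (+1 for the padding index 0)
    output_dim=300,                  # dimensionality of the word vectors
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,                 # keep the pre-trained vectors frozen
)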
