Using pre-trained GloVe embeddings in TensorFlow

Embeddings can be used in machine learning to represent data, reducing the dimensionality of the dataset while learning latent factors between data points. Commonly this is applied to words, say reducing a 400,000-word one-hot vector to a 50 dimensional dense vector, but it could equally be used to map post codes or other token encoded data. Another use case might be in recommender systems. GloVe (Global Vectors for Word Representation) was developed at Stanford and more information can be found here. There are a few pre-trained datasets, including Wikipedia, a web crawl and a Twitter set, each increasing the number of words in its vocabulary and offering a range of embedding dimensions. We will be using the smallest Wikipedia dataset and for this sample will pick the 50 dimensional embedding.

Obtaining the embeddings

First we define a few paths to make things easier and ensure our Python script can obtain and extract the data, whether we already have it locally or need to retrieve it from the web. Here we also define EMBEDDING_DIMENSION, the length of the vector used to represent each word. After parsing the weight file we will define VOCAB_LENGTH, the total number of word tokens we will use, and later UNKNOWN_WORD, a token used for any words we encounter that aren’t in the dataset.
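As a minimal sketch of this setup (the download URL, file names and directory layout here are assumptions for illustration, not taken verbatim from this post), obtaining the data and defining the constant might look like:

import zipfile
import urllib.request
from pathlib import Path

EMBEDDING_DIMENSION = 50  # length of each GloVe word vector we will use

# Hypothetical local layout: the zip archive and the extracted directory
glove_archive = Path('glove.6B.zip')
glove_data_directory = Path('glove.6B')

if not glove_data_directory.exists():
    if not glove_archive.exists():
        # glove.6B.zip is the smallest (Wikipedia-based) GloVe release
        urllib.request.urlretrieve('http://nlp.stanford.edu/data/glove.6B.zip',
                                   str(glove_archive))
    with zipfile.ZipFile(glove_archive) as archive:
        archive.extractall(glove_data_directory)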

Loading embeddings

The downloaded file has one word per line, in descending order of frequency of usage in the dataset. Each line is space separated, with the word first followed by the decimal values of that word's vector representation. Here we will keep three data structures for various uses:

word2idx: a dictionary for mapping words to their index token - used for converting a sequence of words to a sequence of integers for embedding lookup

idx2word: a list of words in order - used for decoding an integer sequence to words

weights: a matrix of size VOCAB_LENGTH x EMBEDDING_DIMENSION containing the vectors for each word

import numpy as np

word2idx = {}       # word -> integer index
idx2word = ['PAD']  # index -> word; PAD token reserved at index 0
weights = [np.zeros(EMBEDDING_DIMENSION, dtype=np.float32)]  # PAD vector (assumed zero vector)
with (glove_data_directory / 'glove.6B.50d.txt').open('r') as file:
    for index, line in enumerate(file):
        values = line.split()  # Word and weights separated by space
        word = values[0]  # Word is first symbol on each line
        word_weights = np.asarray(values[1:], dtype=np.float32)  # Remainder of line is weights for word
        word2idx[word] = index + 1  # PAD is our zeroth index so shift by one
        idx2word.append(word)
        weights.append(word_weights)
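Following on from this, one simple way (an assumption for illustration; the 'UNK' name and the mean-vector choice are not prescribed by the post) to add the UNKNOWN_WORD token mentioned earlier and record VOCAB_LENGTH is:

UNKNOWN_WORD = len(weights)      # index of the out-of-vocabulary token
word2idx['UNK'] = UNKNOWN_WORD
idx2word.append('UNK')
weights.append(np.mean(np.asarray(weights[1:]), axis=0))  # e.g. the mean of all known vectors
weights = np.asarray(weights, dtype=np.float32)  # VOCAB_LENGTH x EMBEDDING_DIMENSION matrix

VOCAB_LENGTH = weights.shape[0]  # total number of word tokens, including PAD and UNK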

Finally, within TensorFlow we define a variable to hold our embedding weights using tf.get_variable. This will either create or load the variable into the graph. The most common initializer would create random weights, but we wish to load our GloVe weights, so we use a tf.constant_initializer to initialize the variable with the weights we loaded previously, and indicate that they shouldn’t be updated during training by setting trainable=False. The actual embedding of our sequence of word indices into embedded vectors is then done by tf.nn.embedding_lookup. This is essentially a row retrieval: the word indices index the zeroth axis of the weights matrix and the corresponding embedding vectors are returned.
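Putting those pieces together, a minimal sketch might look like the following (the names embedding_weights and word_indices are illustrative assumptions; VOCAB_LENGTH, EMBEDDING_DIMENSION and weights come from the loading step above):

import tensorflow as tf

# Integer token ids produced by mapping words through word2idx
word_indices = tf.placeholder(tf.int32, shape=[None], name='word_indices')

glove_initializer = tf.constant_initializer(weights)
embedding_weights = tf.get_variable(
    name='embedding_weights',
    shape=(VOCAB_LENGTH, EMBEDDING_DIMENSION),
    initializer=glove_initializer,
    trainable=False)  # keep the pre-trained GloVe vectors fixed

# Rows of embedding_weights selected by word_indices along the zeroth axis
embedded_words = tf.nn.embedding_lookup(embedding_weights, word_indices)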