Tokenizing

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. -Wikipedia

We want to tokenize each string to get a list of words, usually by lowercasing everything and splitting on whitespace. In contrast, lemmatization involves reducing each word to its root form, which can be helpful but is more computationally expensive (enough so that you would want to preprocess your text rather than do it on the fly).
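A minimal tokenizer along these lines might look like the sketch below; the function name and the regex are illustrative choices, not taken from the original code.

```python
import re

def tokenize(line):
    """Lowercase a line and split it into word tokens.
    A minimal sketch; real projects often reach for NLTK or spaCy instead."""
    return re.findall(r"[a-z0-9']+", line.lower())

print(tokenize("The quick brown Fox!"))  # ['the', 'quick', 'brown', 'fox']
```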

We first create a SentenceGenerator class which yields our text line by line, already tokenized. This generator is passed to the Gensim Word2Vec model, which takes care of the training. We can forward parameters through our function to the model as keyword arguments via **params.
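Here is a sketch of that setup, assuming the corpus lives as plain-text files in a directory; the directory name and the helper function are illustrative, and in Gensim 4.x the dimensionality argument is `vector_size` rather than the older `size`.

```python
import os
from gensim.models import Word2Vec

class SentenceGenerator:
    """Iterate over every file in a directory, yielding one tokenized line at a time.
    Word2Vec can re-iterate this object, so the corpus never has to fit in memory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname)) as f:
                for line in f:
                    yield line.lower().split()

def train_word2vec(dirname, **params):
    # Keyword params (window, min_count, vector_size, ...) are forwarded
    # straight to the Gensim model.
    sentences = SentenceGenerator(dirname)
    return Word2Vec(sentences, **params)

model = train_word2vec('corpus/', min_count=1)  # 'corpus/' is a placeholder path
```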

Key Observation

The syn0 weight matrix in Gensim corresponds exactly to the weights of the Embedding layer in Keras. We want to save it so that we can use it later, so we dump it to a file. We also want to save the vocabulary so that we know which rows of the Gensim weight matrix correspond to which words; in Keras, this dictionary will tell us which index to pass to the Embedding layer for a given word. We'll dump this as a JSON file to make it more human-readable.
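A sketch of this export step is below. The file names are illustrative, and the attribute names assume an older Gensim release (syn0 and model.wv.vocab); in Gensim 4.x the equivalents are model.wv.vectors and model.wv.key_to_index.

```python
import json
import numpy as np

# The embedding weight matrix: one row per word, one column per dimension.
weights = model.wv.syn0
np.save('embedding_weights.npy', weights)

# Map each word to its row index in the weight matrix; this is the index
# we will later feed to the Keras Embedding layer.
vocab = {word: info.index for word, info in model.wv.vocab.items()}
with open('vocab.json', 'w') as f:
    json.dump(vocab, f)
```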

The canonical sanity check for word embeddings is that similar words end up near each other. We can measure the cosine similarity between two words with a simple Keras model like the one below (note that we aren't training it, just using it to compute the similarity).
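The following is a minimal sketch of such a model, assuming the `weights` matrix and `vocab` dictionary from the previous step; it uses the Dot layer with `normalize=True`, which turns the dot product into a cosine similarity (older Keras versions exposed this through a merge layer instead). The example words are placeholders.

```python
import numpy as np
from keras.layers import Input, Embedding, Flatten, Dot
from keras.models import Model

vocab_size, embed_dim = weights.shape

# Two word indices in, one cosine similarity out.
word_a = Input(shape=(1,), dtype='int32')
word_b = Input(shape=(1,), dtype='int32')

# Frozen Embedding layer seeded with the Gensim weight matrix; no training happens.
embedding = Embedding(input_dim=vocab_size, output_dim=embed_dim,
                      weights=[weights], trainable=False)

# Flatten (batch, 1, dim) -> (batch, dim), then a normalized dot product,
# which is exactly the cosine similarity.
vec_a = Flatten()(embedding(word_a))
vec_b = Flatten()(embedding(word_b))
similarity = Dot(axes=1, normalize=True)([vec_a, vec_b])

model = Model(inputs=[word_a, word_b], outputs=similarity)

# No training: just call predict with two word indices from the vocab dict.
print(model.predict([np.array([vocab['king']]), np.array([vocab['queen']])]))
```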