Word embedding is a technique that treats words as vectors whose relative similarities correlate with semantic similarity. This technique is one of the most successful applications of unsupervised learning. Natural language processing (NLP) systems traditionally encode words as strings, which are arbitrary and provide no useful information to the system regarding the relationships that may exist between different words. Word embedding is an alternative technique in NLP, whereby words or phrases from the vocabulary are mapped to vectors of real numbers in a low-dimensional space relative to the vocabulary size, and the similarities between the vectors correlate with the words’ semantic similarity.

For example, take the words woman, man, queen, and king. We can get their vector representations and use basic algebraic operations to uncover semantic relationships; similarity between vectors can be measured with, for example, cosine similarity. If we subtract the vector of the word man from the vector of the word woman, the resulting vector is close, in cosine distance, to the vector we get by subtracting king from queen (see Figure 1).

W("woman")−W("man") ≃ W("queen")−W("king")

Figure 1. Gender vectors. Source: Lior Shkiller.
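To make this concrete, here is a minimal sketch of that comparison. It assumes we already have trained word vectors (for example, from the gensim word2vec model we train later in this article); the variable model is hypothetical and stands in for such a trained model, with model.wv[word] returning the embedding of word:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: the dot product of the two vectors, normalized by their lengths.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# `model` is assumed to be a trained gensim word2vec model (see the training code below).
gender_direction_1 = model.wv['woman'] - model.wv['man']
gender_direction_2 = model.wv['queen'] - model.wv['king']

# If the embeddings capture the gender relation, this value is close to 1.
print(cosine_similarity(gender_direction_1, gender_direction_2))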

Many different types of models have been proposed for representing words as continuous vectors, including latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). The idea behind those methods is that related words often appear in the same documents. For instance, backpack, school, notebook, and teacher are likely to appear together, while school, tiger, apple, and basketball would probably not appear together consistently. To represent words as vectors, using the assumption that similar words occur in similar documents, LSA builds a matrix in which the rows represent unique words and the columns represent paragraphs. It then applies singular value decomposition (SVD) to reduce the dimensionality while preserving the similarity structure among the columns. The problem is that these models become computationally very expensive on large data.
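To illustrate the LSA idea, here is a rough sketch (not the method used in the rest of this article) that builds a tiny word-by-document count matrix with made-up numbers and reduces it with SVD, so each word ends up as a short dense vector:

import numpy as np

# Toy term-document count matrix: rows are words, columns are documents.
# The counts are invented purely for illustration.
vocab = ['backpack', 'school', 'notebook', 'teacher', 'tiger', 'basketball']
counts = np.array([
    [2, 1, 0],   # backpack
    [3, 2, 0],   # school
    [1, 2, 0],   # notebook
    [2, 3, 0],   # teacher
    [0, 0, 2],   # tiger
    [0, 0, 1],   # basketball
], dtype=float)

# Truncated SVD keeps only the k largest singular values,
# giving each word a dense k-dimensional vector.
k = 2
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
word_vectors = U[:, :k] * S[:k]

for word, vec in zip(vocab, word_vectors):
    print(word, vec)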

Instead of computing and storing large amounts of data, we can train a neural network model that learns the relationships between words, and does so efficiently.

Word2Vec

The most popular word embedding model is word2vec, created by Mikolov et al. in 2013. The model showed strong results and large improvements in training efficiency. Mikolov et al. also presented the negative-sampling approach as a more efficient way of deriving word embeddings; you can read more about it in their papers.

The CBOW model

In the CBOW (continuous bag-of-words) architecture, the model predicts the current word from a window of surrounding context words. Mikolov et al. thus use both the n words before and the n words after the target word w to predict it.

A sequence of words can be treated like a sequence of items; by replacing the term word with item, the same method can be applied to collaborative filtering and recommender systems. CBOW is several times faster to train than the skip-gram model and has slightly better accuracy for frequent words (see Figure 2).
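Here is a minimal sketch of training a CBOW model with gensim. The toy sentences are made up, and the parameter names assume gensim 4.x (sg=0 selects CBOW, window=2 uses two context words on each side of the target):

from gensim.models import Word2Vec

# A tiny made-up corpus; in practice this would be the Wikipedia corpus built below.
sentences = [
    ['the', 'king', 'rules', 'the', 'kingdom'],
    ['the', 'queen', 'rules', 'the', 'kingdom'],
    ['a', 'man', 'walks', 'in', 'the', 'city'],
    ['a', 'woman', 'walks', 'in', 'the', 'city'],
]

# sg=0 selects the CBOW architecture.
cbow_model = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1)
print(cbow_model.wv.most_similar('king', topn=3))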

The continuous skip-gram model

In the skip-gram model, instead of using the surrounding words to predict the center word, the model uses the center word to predict the surrounding words (see Figure 3). According to Mikolov et al., skip-gram works well with small amounts of training data and does a good job of representing even rare words and phrases.
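Switching to skip-gram in gensim only requires setting sg=1. A minimal sketch, again with a made-up toy corpus and gensim 4.x parameter names:

from gensim.models import Word2Vec

# sg=1 selects the skip-gram architecture instead of CBOW.
sentences = [
    ['the', 'king', 'rules', 'the', 'kingdom'],
    ['the', 'queen', 'rules', 'the', 'kingdom'],
]
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, sg=1, min_count=1)
print(skipgram_model.wv.most_similar('queen', topn=3))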

Next, to make things easy, we will install gensim, a Python package that implements word2vec.

pip install --upgrade gensim

We need to create the corpus from Wikipedia, which we will use to train the word2vec model. The output of the following code is "wiki..text", a file that contains all the words of all the articles of the Wikipedia edition in the chosen language.
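A minimal sketch of that corpus-building step uses gensim's WikiCorpus; the dump filename and the output filename below are placeholders for the Wikipedia dump you download and the text file mentioned above:

from gensim.corpora import WikiCorpus

# Placeholder path to the downloaded Wikipedia dump (a .xml.bz2 archive).
dump_path = 'wiki-latest-pages-articles.xml.bz2'

# Passing dictionary={} skips building a vocabulary, which we don't need here.
wiki = WikiCorpus(dump_path, dictionary={})

# Write each Wikipedia article as one line of space-separated tokens.
with open('wiki.text', 'w', encoding='utf-8') as output:
    for i, tokens in enumerate(wiki.get_texts()):
        output.write(' '.join(tokens) + '\n')
        if (i + 1) % 10000 == 0:
            print('Processed', i + 1, 'articles')
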

Lior Shkiller is the co-founder of Deep Solutions. He is a machine learning practitioner, and is passionate about AI and cognitive science. He has a degree in Computer Science and Psychology from Tel Aviv University and has more than 10 years of experience in software development.
Deep Solutions delivers end-to-end software solutions based on innovative deep learning algorithms for computer vision, natural language processing, anomaly detection, recommendation systems, and more.
