From word2vec to term2vec

In our previous article, we saw that finding similar nouns or verbs with word2vec is all well and good, but the terms extracted from customer reviews that hold meaningful insights are generally composed of multiple words. So what happens then?

Limits of word2vec

We have already talked about word vectors, and about how linear functions can easily be used over them to find words that are semantically related to each other. However, as its name suggests, the vectorization mechanism was designed for words. Since each word has its own vector, it is only possible to compare single words.

Using multi-word tokens

In some cases it just seems more intuitive to build a vector for a whole phrase than for each of the single words that compose it. Because we are looking for a representation of meaning, using two vectors for “customer” and “service” instead of “customer service” seems wrong.

One trick that has been proposed (Mikolov et al., 2013) is to treat some multi-word expressions as a single token. After all, what we are after is similarity in meaning, so there is no reason to restrict ourselves to similarity between one-word tokens. Using distance and cosine measures, the authors showed that the model benefits when neighbouring words such as “New” and “York” are merged into the single token “New_York”. This holds true for a number of identifiable named entities.
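The phrase-detection heuristic behind this merging can be sketched as follows. The scoring formula, score(a, b) = (count(ab) − δ) / (count(a) · count(b)), comes from the Mikolov et al. (2013) paper; the function names, corpus, and threshold value below are illustrative assumptions:

```python
from collections import Counter

def merge_phrases(tokenized_sentences, delta=1, threshold=0.005):
    """Merge frequent word pairs into single tokens such as 'New_York'.

    Uses the scoring heuristic from Mikolov et al. (2013):
    score(a, b) = (count(a b) - delta) / (count(a) * count(b)).
    The threshold value here is an illustrative assumption.
    """
    unigrams = Counter()
    bigrams = Counter()
    for sent in tokenized_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))

    def score(a, b):
        return (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])

    merged = []
    for sent in tokenized_sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + "_" + sent[i + 1])
                i += 2  # skip the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged
```

In practice a library implementation such as gensim's `Phrases` model applies the same idea at scale, and can be run repeatedly to build tokens longer than two words.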

The problem here is that we don’t want to work with named entities. To extract useful information from our reviews, we need to work with longer, much less idiomatic expressions.

Furthermore, using multi-word tokens increases the amount of data required for training (which is already large for one-word tokens). Just like n-grams with bigger values of n, ever more data is required to ensure that each multi-word term occurs often enough during training.

More and more data

Usually, training such models requires more data than you have. A direct solution would be to gather data from external sources and train the word2vec model on this larger corpus. Besides the time cost (for crawling and training), we still face a very important problem: domain specificity, that is to say, words that have different meanings in different domains.

For example, the word “power” could mean electrical power in the household-appliances domain, while representing human strength in the sports-equipment domain. The domain-specific meaning of a word can be overwhelmed by the meaning of the same word in the external data.

Even with enough data, in a multi-word token setup the model will likely fail to generalize many features, and results will be very poor (we actually tried). Moreover, when a new multi-word expression appears, we cannot get its vector even if each of its composing words exists in the model.

A broader framework

While a new method designed specifically for vectorizing multi-word expressions would be welcome, an ideal solution would generalize the problem to all terms, regardless of their length in words. Furthermore, we would like to use external data without drowning out the domain-specific meaning of terms.

Concatenation

The solution we adopted, here at Dictanova, is to use pre-trained word2vec-like models trained on large general corpora, e.g. Wikipedia, as external data. We then concatenate each word vector from the domain-specific model with its counterpart in the external model.
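A minimal sketch of this concatenation, assuming both models are available as dictionaries mapping words to numpy arrays (the dimensions 50 and 300 are illustrative, not the actual sizes used):

```python
import numpy as np

def concat_embedding(word, domain_model, general_model, d_dim=50, g_dim=300):
    """Concatenate the domain-specific vector with the general-domain one.

    domain_model / general_model are assumed to map words to 1-D numpy
    arrays of sizes d_dim and g_dim; both dimensions are illustrative.
    A word missing from either model falls back to a zero vector, so the
    result always has the same total size.
    """
    dv = domain_model.get(word, np.zeros(d_dim))
    gv = general_model.get(word, np.zeros(g_dim))
    return np.concatenate([dv, gv])
```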

So how can this avoid the bias from the general model? Actually, by itself it doesn’t. It all depends on the size of the vectors in the specific model. If the domain-specific word embeddings are larger than the external vectors, then the prevailing meaning of the concatenated vector will be given by the domain-specific embeddings.

In practice, we rather use small (yet not too small) vector sizes in the specific model. This way we can provide robust models based on general-domain external models, while still taking domain specificity into account through vectors from models trained on much less data (typically yours).
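The effect of the relative sizes can be illustrated with toy vectors (all values below are hypothetical): two words that agree in the general model but disagree in the domain model look more similar when the domain part is small, and less similar when the domain part grows.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: identical in the general model,
# orthogonal (disagreeing) in the domain model.
general_1 = np.ones(8)
general_2 = np.ones(8)
domain_1 = np.array([1.0, 0.0])
domain_2 = np.array([0.0, 1.0])

# Small domain part: the general model dominates the similarity.
small_domain = cosine(np.concatenate([domain_1, general_1]),
                      np.concatenate([domain_2, general_2]))

# Enlarging the domain part shifts the similarity toward the
# domain-specific judgement (lower here, since the words disagree).
big_domain = cosine(np.concatenate([np.repeat(domain_1, 8), general_1]),
                    np.concatenate([np.repeat(domain_2, 8), general_2]))
```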

Representing multi-word terms (MWT)

The definition of a term in our real-world scenario is broader and more flexible than its definition in the terminology domain.

The following examples give you a flavour of what MWTs can be:

“Wallet on chain pouch”, “wallet on chain” and “woc” are all variants of each other. We call this synonymy.

“Buy a watch” and “purchase of a watch” are also considered variants; nominal and verbal phrases should be recognized alike.

Recall that we want a length-independent representation, so that the vectors on the right side of the figure are comparable even though the terms differ in length.

Since such variants are already handled by word2vec, which, for example, assigns similar vectors to the words “buy” and “purchase”, all that remains is to use word-level embeddings to produce multi-word expression embeddings.

Composing term embeddings

Since treating an MWT as a single token does not provide a scalable, convenient solution, we decided to build MWT representations from their composing elements. We simply compute the average of the normalized vectors of each MWT component. For example, the vector for “wallet on chain” is the mean of the normalized vectors of “wallet”, “on”, and “chain”.
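This averaging can be sketched in a few lines, assuming the word model is a dictionary mapping single words to numpy arrays (the function name and fallback behaviour are illustrative assumptions):

```python
import numpy as np

def term_embedding(term, model):
    """Average of the L2-normalized vectors of each known component word.

    `model` is assumed to map single words to 1-D numpy arrays; words
    absent from the model are simply skipped, so a new MWT still gets a
    vector as long as at least one of its words is known.
    """
    vectors = []
    for word in term.split():
        if word in model:
            v = model[word]
            vectors.append(v / np.linalg.norm(v))
    if not vectors:
        raise KeyError(f"no component of {term!r} is in the model")
    return np.mean(vectors, axis=0)
```

Two resulting term vectors can then be compared directly with a cosine measure, whatever the lengths of the original terms.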

By using the mean vector, each word contributes equally to the overall meaning of the MWT while compositionality is preserved. Another neat effect of this approach is that when a new MWT is encountered, we can still compute its embedding vector, provided that at least one of its composing words exists in a previously trained model.

So finally, all terms are represented by a single vector, which allows us to easily compare their similarity regardless of their length. This is what lets us bring you a powerful tool for gathering terms into topics when using our semantic API.