Word Representation Revisited

This is a rearranged set of notes for the first two lectures of the course CS224d taught at Stanford this year.
It covers how we arrived at word embeddings and what has been done along that route.
Note that some points are still not clear to me; I need to consult a few papers and revisit this post in several days.

Word Representation from the Beginning

At first, people approached word representation in several ways, such as taxonomies and synonym sets.

WordNet is an example of a taxonomy. It organizes words by their hypernym relations, e.g. an eagle "is a" bird.

A synonym set is another kind of resource, grouping together words with similar meanings.

But these kinds of resources have problems.

New words are hard to add to an existing resource.

The arrangement is subjective.

Building such a synonym set or a WordNet requires a lot of human effort.

It’s hard to accurately assess the similarity between words in the resource.

SVD methods

First, people tried building a word-document matrix $X$. Entry $X_{ij}$ contains the number of times word $w_i$ appears in document $d_j$. The matrix then has the shape $|V| \times M$, where $|V|$ is the vocabulary size and $M$ is the number of documents.
This kind of matrix gave rise to research on topic models for documents, like LSA, pLSA, LDA and so on. Those will not be covered in this post.

Another kind of matrix is the word-word matrix $X$. Entry $X_{ij}$ contains the number of times words $w_i$ and $w_j$ co-occur. In this way, the matrix has the shape $|V| \times |V|$.
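As a concrete illustration, here is a minimal sketch of building such a word-word co-occurrence matrix from a toy corpus with a symmetric window of size 1. The corpus and window size are made-up choices for illustration, not values from the lecture.

```python
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word i appears within the window of word j.
X = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
window = 1
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[w], idx[sent[j]]] += 1

print(X.shape)  # (7, 7): the matrix is |V| x |V|
```

With a symmetric window the matrix comes out symmetric, and frequent neighbors (here "i" and "like") get the largest counts.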

Using SVD, we can decompose the matrix as $X = U S V^T$; keeping the top $k$ singular values gives factors with shapes $|V| \times k$, $k \times k$, and $k \times |V|$ respectively.

Up to now, if we use the first matrix $U$ from the SVD, we get some cool results. Each row of this matrix is a condensed word representation. Similar words cluster together even though we never gave the model any labels. These are just statistical properties of the corpus.
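The decomposition step can be sketched with numpy's SVD. Here $X$ is a small random symmetric count matrix standing in for a real word-word matrix, and $k$ is the number of singular values we keep; both are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(7, 7))
X = counts + counts.T            # symmetric, like a co-occurrence matrix

U, S, Vt = np.linalg.svd(X)      # shapes: (7, 7), (7,), (7, 7)
k = 2
word_vectors = U[:, :k] * S[:k]  # each row is a k-dimensional word vector
print(word_vectors.shape)        # (7, 2)
```

Scaling the columns of $U$ by the singular values preserves the relative importance of each latent dimension; using $U[:, :k]$ alone is the other common choice.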

But SVD methods do have other problems, too.

The matrix size grows with the vocabulary size.

It needs a lot of storage, even though it is actually a very sparse matrix.

The model is not very robust: it may change a lot when the corpus or vocabulary changes.

The matrix is high-dimensional, and performing the SVD has quadratic cost.

We can apply some hacks to mitigate these problems:

For really high-frequency words like the, a, is, …

those words carry much less semantic content than long-tail words,

so we may ignore them or set an upper bound (say 100) on the co-occurrence count.

Co-occurrences may be weighted by the distance between the two words.

Use the Pearson correlation instead of raw counts (though I'm not sure exactly how):

many word pairs never occur together, and it's meaningless to deal with them;

set the value to 0 if the correlation is negative.
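Two of the hacks above can be sketched while accumulating counts: weight each co-occurrence by 1/distance so closer words count more, then clip very frequent pairs at an upper bound. The cap value and example sentence are illustrative assumptions, not values from the lecture.

```python
import numpy as np

sentence = ["the", "cat", "jumps", "over", "the", "puddle"]
vocab = sorted(set(sentence))
idx = {w: i for i, w in enumerate(vocab)}

X = np.zeros((len(vocab), len(vocab)))
window, cap = 2, 100.0
for i, w in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            # distance weighting: adjacent words add 1.0, words two apart add 0.5
            X[idx[w], idx[sentence[j]]] += 1.0 / abs(i - j)

X = np.minimum(X, cap)  # upper bound on the (weighted) count
```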

And the representations given by SVD are still not ideal. We may want something that captures syntactic or even semantic properties.

So we move to iterative models, where each step processes only a small portion of the data.

Continuous Bag of Words Model

As with n-gram models, a sequence can be assigned a probability. Again, n-gram models may need to compute over global data to get those probabilities.

In the CBOW model, we use the surrounding words to predict the center word. Take the following sentence as an example.

“The cat jumps over the puddle.”

model parameters

What we know: the context words (like cat, over) in one-hot representation, and the center word (jumps).

What we don’t know: input matrix and output matrix , where is just a constant we choose as the size of embedding space.

loss function

Cross entropy is often used as the objective function in the setting where a predicted probability distribution $\hat{y}$ is to be learned to match the true one $y$. The cross entropy is defined as

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log \hat{y}_j$$

Here $y$ is in fact a one-hot vector, so it is 0 everywhere except at some index $i$, and the sum reduces to $-\log \hat{y}_i$.
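A quick numeric check of this collapse, with a made-up three-word distribution: when the target is one-hot, the full sum equals minus the log-probability of the true class.

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])  # predicted distribution
y = np.array([0.0, 1.0, 0.0])      # one-hot target, true class i = 1

H = -np.sum(y * np.log(y_hat))     # full cross-entropy sum
# equals -log(y_hat[1]) because every other term is multiplied by 0
```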