FastText (summation composition function)

FastText is arguably the most influential and effective recent model. It represents each word as a
bag of character n-grams. Once learned, the character n-gram representations can be combined (via
simple summation) to represent out-of-vocabulary (OOV) words.
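The summation composition can be sketched in a few lines of plain Python/NumPy. This is a hedged illustration, not the actual implementation: `char_ngrams` uses FastText-style `<`/`>` boundary markers, and `ngram_vectors` stands in for a learned n-gram embedding table.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word, using FastText-style
    boundary markers '<' and '>'."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def fasttext_vector(word, ngram_vectors, dim=300):
    """Compose a word vector by summing the vectors of its character
    n-grams; unseen n-grams are skipped, so an OOV word still gets a
    representation from whatever subwords are known."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec
```

Because the word vector is just a sum over n-gram vectors, any character sequence can be embedded, whether or not it appeared in training.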

CNN and RNN subword-level composition functions

We contribute to the discussion of composition functions for constructing subword-level embeddings.
We propose CNN- and RNN-based subword-level word embedding models, which can embed
arbitrary character sequences into vectors.

We also propose a hybrid training scheme that integrates these neural networks directly into the Skip-Gram
model.
We train two sets of word embeddings simultaneously:
one comes from a lookup table, as in the traditional Skip-Gram model,
and the other comes from a convolutional or recurrent neural network.
The former is better at capturing semantic similarity.
The latter focuses more on morphology and can learn embeddings for OOV words.
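The hybrid idea above can be sketched as follows. This is a toy NumPy illustration, not the Chainer models used in the repo: the RNN weights, character embeddings, and the choice to sum the two embedding sets at lookup time are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8                                       # toy embedding size
chars = "abcdefghijklmnopqrstuvwxyz"
char_emb = {c: rng.normal(size=DIM) for c in chars}   # character embeddings

# Recurrent composition: run a simple Elman-style RNN over the word's
# characters and take the final hidden state as the subword embedding.
# W_h and W_x stand in for parameters trained jointly with Skip-Gram.
W_h = rng.normal(size=(DIM, DIM)) * 0.1
W_x = rng.normal(size=(DIM, DIM)) * 0.1

def rnn_compose(word):
    h = np.zeros(DIM)
    for c in word:
        h = np.tanh(W_h @ h + W_x @ char_emb[c])
    return h

# Hybrid scheme: in-vocabulary words also have a lookup-table vector,
# trained simultaneously with the composition network.
lookup = {"where": rng.normal(size=DIM)}

def embed(word):
    # OOV words have no lookup entry, so they fall back to the
    # composed (subword) vector alone.
    if word in lookup:
        return lookup[word] + rnn_compose(word)
    return rnn_compose(word)
```

A CNN-based composition function would replace `rnn_compose` with convolutions over the character embedding sequence followed by pooling; the hybrid structure stays the same.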

The overall architectures of the original Skip-Gram, FastText, and our subword-level models are shown in the
figure above.

Japanese subword-level embeddings

We also apply the subword-level composition functions to Japanese.

The overall architecture of the Japanese embedding models is shown in the figure above.
For Japanese, we investigate the effect of explicitly including kanji and kanji components (bushu).
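Explicit inclusion of kanji components can be sketched as expanding each word into its characters plus their bushu. The tiny decomposition table below is a hypothetical stand-in for a real kanji-decomposition resource.

```python
# Hypothetical mini decomposition table (a real resource would cover
# thousands of kanji).
BUSHU = {
    "休": ["亻", "木"],        # "rest" = person + tree
    "森": ["木", "木", "木"],  # "forest" = three trees
}

def subword_units(word):
    """Expand a word into its characters plus, for each kanji with a
    known decomposition, its components (bushu)."""
    units = list(word)
    for ch in word:
        units.extend(BUSHU.get(ch, []))
    return units
```

These units can then be fed to the same composition functions (summation, CNN, or RNN) used for English character n-grams.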

Usage

We implement all the subword-level models (including FastText) using the Chainer deep learning framework.

Experiments

Results on word similarity and word analogy datasets.
For the hybrid training scheme, we denote the embeddings that come from the word vector lookup table as "Model_word"
and the embeddings that come from the composition function as "Model_subword".
We denote the vanilla (non-hybrid) models as "Model_vanilla".
"FastText_external" refers to the publicly available FastText embeddings,
which are trained on the full Wikipedia corpus. We also test versions where OOV words are expanded,
denoted as "Model+OOV". Model combinations are shown as gray rows,
and the best results among them are marked in bold. The Rare Words dataset (blue column) has a 43.3% OOV rate,
while the other word similarity datasets have at most a 4.6% OOV rate. Morphology-related categories are shown as almond-colored columns.

Visualization of the learned word embeddings: each dot represents a word,
and different colors represent different affixes.
We use t-SNE to project the word vectors from 300 dimensions to 2.
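The projection step can be reproduced with scikit-learn's t-SNE; this is a generic sketch with random stand-in vectors, not our actual embeddings or plotting code.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned 300-d word vectors (50 words).
vectors = np.random.rand(50, 300)

# Project to 2-d for visualization; perplexity must be smaller than
# the number of points.
coords = TSNE(n_components=2, init="pca", random_state=0,
              perplexity=10).fit_transform(vectors)
# coords[i] is the 2-d position of word i, ready for scatter-plotting.
```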