Word features

OpenNMT supports additional features on source and target words in the form of discrete labels.

On the source side, these features act as additional information to the encoder. An
embedding will be optimized for each label and then fed as additional source input
alongside the word it annotates.

On the target side, these features will be predicted by the network. The
decoder is then able to decode a sentence and annotate each word.

To use additional features, modify your data directly by appending labels to each word with
the special character ￨ (Unicode character U+FFE8). There can be an arbitrary number of additional
features in the form word￨feat1￨feat2￨...￨featN, but every word must have the same number of
features, in the same order. Source and target data can have a different number of additional features.
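As a sketch, annotating a sentence in this format could be done with a small script like the one below. The three case labels (U, C, L) and the lowercasing scheme are illustrative assumptions, not OpenNMT's own casing rules:

```python
# Sketch: appending a case feature to each word using the "￨" separator.
# The labels U/C/L and the lowercasing below are illustrative only.

SEP = "\uFFE8"  # halfwidth vertical bar "￨" (U+FFE8)

def case_of(word):
    if word.isupper() and len(word) > 1:
        return "U"      # all uppercase
    if word[:1].isupper():
        return "C"      # capitalized
    return "L"          # lowercase (or no letters)

def annotate(sentence):
    """Append a case feature to every word: word￨feat."""
    return " ".join(w.lower() + SEP + case_of(w) for w in sentence.split())

print(annotate("The cat sat on the MAT"))
# the￨C cat￨L sat￨L on￨L the￨L mat￨U
```

Every token ends up with exactly one feature here; mixing annotated and unannotated words in the same file would violate the fixed-feature-count requirement above.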

As an example, see data/src-train-case.txt, which uses a separate feature
to represent the case of each word. Using case as a feature is a way to optimize the word
dictionary (no duplicated words like "the" and "The") and gives the system additional
information that can be useful for optimizing its objective function.

By default, word features on the target side are automatically shifted relative to the words so that their prediction directly depends on the word they annotate. This way, the decoder architecture is similar to an RNN-based sequence tagger, where the output at each timestep is the tag of that timestep's input.
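The shift can be pictured as delaying the feature stream by one position relative to the word stream. The sketch below is an assumption about the mechanism, not OpenNMT's exact implementation, and the padding label is hypothetical:

```python
# Sketch of the target-side feature shift: the feature emitted at step t
# annotates the word produced at step t-1, so the feature sequence is
# delayed by one position. PAD is a hypothetical placeholder label.

PAD = "<blank>"

def shift_features(words, feats):
    """Pair each word with the feature of the previous word."""
    assert len(words) == len(feats)
    shifted = [PAD] + feats[:-1]
    return list(zip(words, shifted))

print(shift_features(["the", "cat", "sat"], ["C", "L", "L"]))
# [('the', '<blank>'), ('cat', 'C'), ('sat', 'L')]
```

With this pairing, by the time a feature must be predicted, the word it annotates has already been generated, which is what makes the prediction well-posed.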

By default, feature vocabulary sizes are unlimited. Depending on the type of features you are using, you may want to limit their vocabulary during preprocessing with the -src_vocab_size and -tgt_vocab_size options, in the format word_vocab_size[ feat1_vocab_size[ feat2_vocab_size[ ...]]]. For example:

```text
# unlimited source features vocabulary size
-src_vocab_size 50000

# first feature vocabulary is limited to 60, others are unlimited
-src_vocab_size "50000 60"

# second feature vocabulary is limited to 100, others are unlimited
-src_vocab_size "50000 0 100"

# limit vocabulary size of the first and second feature
-src_vocab_size "50000 60 100"
```

You can similarly use -src_words_min_frequency and -tgt_words_min_frequency to limit vocabulary by frequency instead of absolute size.

Like word vocabularies, word feature vocabularies can be reused across datasets with the -features_vocabs_prefix option. For example, if the preprocessing generates these feature dictionaries:

```text
data/demo.source_feature_1.dict
data/demo.source_feature_2.dict
data/demo.source_feature_3.dict
```

you have to set -features_vocabs_prefix data/demo as a command line option.

The feature embedding size is automatically computed based on the number of values the feature takes. This default size reduction works well for features with few values, like case or POS tags.
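One common way to derive an embedding size from a vocabulary size is an exponent rule; the exponent 0.7 below mirrors the feat_vec_exponent default in OpenNMT-py, and is shown here only as an illustrative assumption — the exact formula used by default may differ:

```python
# Illustrative heuristic only: embedding size grows sublinearly with the
# number of distinct feature values (exponent 0.7 is an assumption).
import math

def feature_embedding_size(num_values, exponent=0.7):
    return math.ceil(num_values ** exponent)

for n in (4, 20, 1000):
    print(n, "->", feature_embedding_size(n))
# 4 -> 3, 20 -> 9, 1000 -> 126
```

The point of the sublinear growth is that a feature with a handful of values (like case) gets a tiny embedding, while a richer feature (like a large POS tagset) gets proportionally more capacity without dominating the word embedding.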

For other features, you may want to choose the embedding sizes manually with the -src_word_vec_size and -tgt_word_vec_size options. They behave similarly to -src_vocab_size, taking a list of embedding sizes: word_vec_size[ feat1_vec_size[ feat2_vec_size[ ...]]].

The feature embeddings are then concatenated together by default. You can instead choose to sum them by setting -feat_merge sum. Finally, the resulting merged embedding is concatenated to the word embedding.
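The two merge modes can be sketched with plain lists standing in for embedding vectors (the real model uses learned tensors):

```python
# Sketch of the two -feat_merge modes; plain Python lists stand in for
# embedding vectors.

def merge_concat(feat_embs):
    """-feat_merge concat (default): join feature embeddings end to end."""
    return [x for emb in feat_embs for x in emb]

def merge_sum(feat_embs):
    """-feat_merge sum: element-wise sum; all embeddings must share one size."""
    assert len({len(e) for e in feat_embs}) == 1, \
        "sum merge requires a common -feat_vec_size"
    return [sum(xs) for xs in zip(*feat_embs)]

word_emb = [0.1, 0.2, 0.3, 0.4]
case_emb = [1.0, 2.0]
pos_emb = [3.0, 4.0]

# The merged feature embedding is then concatenated to the word embedding:
print(word_emb + merge_concat([case_emb, pos_emb]))
# [0.1, 0.2, 0.3, 0.4, 1.0, 2.0, 3.0, 4.0]
print(merge_sum([case_emb, pos_emb]))
# [4.0, 6.0]
```

Note that concat grows the input dimension with each added feature, whereas sum keeps it fixed at the common feature size, which is the trade-off behind the warning below.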

Warning

In the sum case, each feature embedding must have the same dimension. You can set the common embedding size with -feat_vec_size.