Topics

BLEU

On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.

Page 1, “Abstract”

The NNJM features also produce a gain of +6.3 BLEU on top of a simpler baseline equivalent to Chiang’s (2007) original Hiero implementation.

Page 1, “Abstract”

Additionally, we present several variations of this model which provide significant additive BLEU gains.

Page 1, “Introduction”

The NNJM features produce an improvement of +3.0 BLEU on top of a baseline that is already better than the 1st place MT12 result and includes a target-only NNLM.

Page 1, “Introduction”

Additionally, on top of a simpler decoder equivalent to Chiang’s (2007) original Hiero implementation, our NNJM features are able to produce an improvement of +6.3 BLEU, as much as all of the other features in our strong baseline system combined.

Page 2, “Introduction”

We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.

Page 4, “Neural Network Joint Model (NNJM)”

We demonstrate in Section 6.6 that using the self-normalized/pre-computed NNJM results in only a very small BLEU degradation compared to the standard NNJM.

Page 4, “Neural Network Joint Model (NNJM)”

OpenMT12 1st Place: Ar-En BLEU 49.5, Ch-En BLEU 32.6

Page 6, “Model Variations”

BLEU scores are mixed-case.

Page 6, “Model Variations”

On Arabic-English, the primary S2T/L2R NNJM gains +1.4 BLEU on top of our baseline, while the S2T NNLTM gains another +0.8, and the directional variations gain +0.8 BLEU more.

Page 6, “Model Variations”

This leads to a total improvement of +3.0 BLEU from the NNJM and its variations.

hidden layer

When used in conjunction with a pre-computed hidden layer, these techniques speed up NNJM computation by a factor of 10,000x, with only a small reduction in MT accuracy.

Page 1, “Introduction”

We use two 512-dimensional hidden layers with tanh activation functions.

Page 2, “Neural Network Joint Model (NNJM)”
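As a concrete illustration of this architecture, here is a minimal NumPy sketch of the forward pass. The input layout is an assumption inferred from the 2689-dimensional input vector mentioned on page 4 (14 context words × 192-dimensional embeddings, plus a bias term); the vocabulary size and weights are illustrative placeholders, not the paper’s trained values.

```python
# Minimal sketch of an NNJM-style forward pass (assumed layout: 14 context
# words x 192-dim embeddings + bias = 2689-dim input; two 512-dim tanh layers).
import numpy as np

rng = np.random.default_rng(0)
V, D, N_WORDS, H = 1000, 192, 14, 512            # V is an illustrative vocab size

E     = rng.standard_normal((V, D)) * 0.05                 # word embeddings
W1    = rng.standard_normal((N_WORDS * D + 1, H)) * 0.05   # first hidden layer (+bias row)
W2    = rng.standard_normal((H + 1, H)) * 0.05             # second hidden layer (+bias row)
W_out = rng.standard_normal((H + 1, V)) * 0.05             # output layer (+bias row)

def forward(context_ids):
    """Unnormalized output scores for one (target history, source window) context."""
    x  = np.append(E[context_ids].ravel(), 1.0)  # 14*192 + 1 = 2689-dim input vector
    h1 = np.append(np.tanh(x @ W1), 1.0)         # first 512-dim tanh hidden layer
    h2 = np.append(np.tanh(h1 @ W2), 1.0)        # second 512-dim tanh hidden layer
    return h2 @ W_out                            # softmax omitted for brevity

scores = forward(rng.integers(0, V, size=N_WORDS))
print(scores.shape)                              # one score per output-vocab word
```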

We chose these values for the hidden layer size, vocabulary size, and source window size because they seemed to work best on our data sets — larger sizes did not improve results, while smaller sizes degraded results.

Page 2, “Neural Network Joint Model (NNJM)”

2.4 Pre-Computing the Hidden Layer

Page 4, “Neural Network Joint Model (NNJM)”

Here, we present a “trick” for pre-computing the first hidden layer, which further increases the speed of NNJM lookups by a factor of 1,000x.

Page 4, “Neural Network Joint Model (NNJM)”

Note that this technique only results in a significant speedup for self-normalized, feed-forward, NNLM-style networks with one hidden layer.

Page 4, “Neural Network Joint Model (NNJM)”

We demonstrate in Section 6.6 that using one hidden layer instead of two has minimal effect on BLEU.

Page 4, “Neural Network Joint Model (NNJM)”

For the neural network described in Section 2.1, computing the first hidden layer requires multiplying a 2689-dimensional input vector with a 2689 × 512 dimensional hidden layer matrix.

Page 4, “Neural Network Joint Model (NNJM)”

Therefore, for every word in the vocabulary, and for each position, we can pre-compute the dot product between the word embedding and the first hidden layer.

Page 4, “Neural Network Joint Model (NNJM)”

Computing the first hidden layer now only requires 15 scalar additions for each of the 512 hidden rows, one for each word in the input vector.

Page 4, “Neural Network Joint Model (NNJM)”

If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score. Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for the self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.
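The following NumPy sketch illustrates the pre-computation trick under the same assumed 14-word × 192-dimension layout (the table shapes and vocabulary size are illustrative, not the paper’s). The one-time table build pays the full matrix cost once; each lookup then reduces to the additions, tanh calls, and final dot product counted above.

```python
# Sketch of pre-computing the first hidden layer for a one-hidden-layer,
# self-normalized network: one table entry per (position, word) pair.
import numpy as np

rng = np.random.default_rng(0)
V, D, N_POS, H = 1000, 192, 14, 512

E     = rng.standard_normal((V, D)) * 0.05
W     = rng.standard_normal((N_POS * D, H)) * 0.05   # first hidden-layer matrix
b     = np.zeros(H)                                  # hidden-layer bias
w_out = rng.standard_normal(H + 1) * 0.05            # output row for one target word

# One-time cost: dot product of every word embedding with every positional
# slice of the hidden matrix, giving a (N_POS, V, H) lookup table.
W_pos = W.reshape(N_POS, D, H)
table = np.stack([E @ W_pos[p] for p in range(N_POS)])

def nnjm_lookup(context_ids):
    """Per lookup: ~15 scalar additions per hidden row, 512 tanh calls,
    and one 513-dim dot product (self-normalized, so no softmax)."""
    h = b.copy()
    for pos, w_id in enumerate(context_ids):
        h += table[pos, w_id]          # one pre-computed vector per input word
    h = np.tanh(h)
    return np.append(h, 1.0) @ w_out

print(nnjm_lookup(rng.integers(0, V, size=N_POS)))
```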

NIST

On the NIST OpenMT12 Arabic-English condition, the NNJM features produce a gain of +3.0 BLEU on top of a powerful, feature-rich baseline which already includes a target-only NNLM.

Page 1, “Abstract”

We show primary results on the NIST OpenMT12 Arabic-English condition.

Page 1, “Introduction”

We also show strong improvements on the NIST OpenMT12 Chinese-English task, as well as the DARPA BOLT (Broad Operational Language Translation) Arabic-English and Chinese-English conditions.

Page 2, “Introduction”

For Arabic word tokenization, we use the MADA-ARZ tokenizer (Habash et al., 2013) for the BOLT condition, and the Sakhr tokenizer for the NIST condition.

Page 6, “Model Variations”

We present MT primary results on Arabic-English and Chinese-English for the NIST OpenMT12 and DARPA BOLT conditions.

Page 6, “Model Variations”

6.1 NIST OpenMT12 Results

Page 6, “Model Variations”

Our NIST system is fully compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training for Arabic, and 25M words for Chinese. The Kneser-Ney LM is trained on 5B words of data from English GigaWord.

n-gram

Specifically, we introduce a novel formulation for a neural network joint model (NNJM), which augments an n-gram target language model with an m-word source window.

Page 1, “Introduction”

Formally, our model approximates the probability of target hypothesis T conditioned on source sentence S. We follow the standard n-gram LM decomposition of the target, where each target word t_i is conditioned on the previous n − 1 target words.

Page 2, “Neural Network Joint Model (NNJM)”
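Written out, the decomposition this excerpt describes is the standard n-gram chain-rule approximation:

```latex
P(T \mid S) \approx \prod_{i=1}^{|T|} P\bigl(t_i \mid t_{i-1}, \ldots, t_{i-n+1}\bigr)
```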

If our neural network has only one hidden layer and is self-normalized, the only remaining computation is 512 calls to tanh() and a single 513-dimensional dot product for the final output score. Thus, only ~3500 arithmetic operations are required per n-gram lookup, compared to ~2.8M for the self-normalized NNJM without pre-computation, and ~35M for the standard NNJM.

Page 4, “Neural Network Joint Model (NNJM)”

“lookups/sec” is the number of unique n-gram probabilities that can be computed per second.

Page 4, “Neural Network Joint Model (NNJM)”

Because our NNJM is fundamentally an n-gram NNLM with additional source context, it can easily be integrated into any SMT decoder.

Page 4, “Decoding with the NNJM”

When performing hierarchical decoding with an n-gram LM, the leftmost and rightmost n − 1 words from each constituent must be stored in the state space.

Page 5, “Decoding with the NNJM”
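As an illustration of that bookkeeping (names and structure are hypothetical, not from the paper’s decoder), the state a constituent carries can be as small as:

```python
# Illustrative n-gram LM state for hierarchical decoding: a constituent only
# needs its leftmost and rightmost n-1 words to be scored against neighbors.
from dataclasses import dataclass

N = 4  # n-gram order

@dataclass(frozen=True)
class LMState:
    left: tuple   # leftmost n-1 words: their full left context is not yet known
    right: tuple  # rightmost n-1 words: context for anything attached on the right

def make_state(words):
    # Assumes the constituent has at least n-1 words, for simplicity.
    return LMState(tuple(words[:N - 1]), tuple(words[-(N - 1):]))

print(make_state("the quick brown fox jumps".split()))
```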

We also train a separate lower-order n-gram model, which is necessary to compute estimate scores during hierarchical decoding.

This does not include the cost of n-gram creation or cached lookups, which amount to ~0.03 seconds per source word in our current implementation. However, the n-grams created for the NNJM can be shared with the Kneser-Ney LM, which reduces the cost of that feature.

Page 8, “Model Variations”

In our decoder, roughly 95% of NNJM n-gram lookups within the same sentence are duplicates.
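Given that duplication rate, a simple per-sentence memoization of lookups recovers most of the cost; the sketch below is a hypothetical stand-in, not the paper’s implementation:

```python
# Sketch of caching duplicate NNJM n-gram lookups within a sentence.
from functools import lru_cache

def raw_nnjm_score(context_ids):
    """Stand-in for the real (expensive) neural network scoring call."""
    return sum(context_ids) * 1e-3

@lru_cache(maxsize=None)          # a real decoder would clear this per sentence
def cached_score(context_ids: tuple):
    return raw_nnjm_score(context_ids)

cached_score((3, 14, 159, 26))    # computed once
cached_score((3, 14, 159, 26))    # served from cache (~95% of lookups in practice)
```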

joint model

To make this a joint model, we also condition on source context vector s_i:

Page 2, “Neural Network Joint Model (NNJM)”
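In symbols, this source conditioning turns the n-gram decomposition above into the following, where s_i denotes the m-word source window associated with target word t_i (centered on its affiliated source word):

```latex
P(T \mid S) \approx \prod_{i=1}^{|T|} P\bigl(t_i \mid t_{i-1}, \ldots, t_{i-n+1}, \, s_i\bigr)
```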

Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy.

Page 8, “Model Variations”

This is consistent with our rescoring-only result, which indicates that k-best rescoring is too shallow to take advantage of the power of a joint model.

Page 9, “Model Variations”

We have described a novel formulation for a neural network-based machine translation joint model, along with several simple variations of this model.

lexicalized

Our model is purely lexicalized and can be integrated into any MT decoder.

Page 1, “Abstract”

In this paper we use a basic neural network architecture and a lexicalized probability model to create a powerful MT decoding feature.

Page 1, “Introduction”

Although there has been a substantial amount of past work in lexicalized joint models (Marino et al., 2006; Crego and Yvon, 2010), nearly all of these papers have used older statistical techniques such as Kneser-Ney or Maximum Entropy.

Page 8, “Model Variations”

Le’s model also uses minimal phrases rather than being purely lexicalized, which has two main downsides: (a) a number of complex, handcrafted heuristics are required to define phrase boundaries, which may not transfer well to new languages, (b) the effective vocabulary size is much larger, which substantially increases data sparsity issues.

Page 9, “Model Variations”

The fact that the model is purely lexicalized, which avoids both data sparsity and implementation complexity.

Page 9, “Model Variations”

For example, creating a new type of decoder centered around a purely lexicalized neural network model.