TECHNICAL PUBLICATIONS:

On some pitfalls in automatic evaluation and significance testing for MT

We investigate some pitfalls concerning the discriminatory power of MT evaluation measures and the accuracy of statistical significance tests. In a discriminative reranking experiment for phrase-based SMT, we show that the NIST metric is more sensitive than BLEU or F-score, despite the latter measures' incorporation of aspects of fluency or meaning adequacy into MT evaluation. In an experimental comparison of two statistical significance tests, we show that the bootstrap underestimates the true $p$-value, thus increasing the likelihood of type-I errors, whereas the approximate randomization test is more accurate. Lastly, we point out a pitfall of randomly assessing significance in multiple pairwise comparisons, where repeated testing inflates the chance of spurious rejections. We conclude with a recommendation to combine NIST with approximate randomization, at more stringent rejection levels than is currently standard.
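The approximate randomization test recommended above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-sentence metric scores as input and uses a simple sum as the test statistic; for corpus-level metrics such as BLEU or NIST one would instead shuffle sentence-level sufficient statistics and recompute the corpus score in each trial. All function and variable names here are hypothetical.

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Paired approximate randomization test for comparing two systems.

    scores_a, scores_b: per-sentence metric scores for systems A and B
    on the same test set. Returns an estimated p-value for the null
    hypothesis that the two systems are equivalent.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    exceed = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        # Under the null hypothesis the system labels are exchangeable,
        # so randomly swap each aligned pair of scores between systems.
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) >= observed:
            exceed += 1
    # Add-one smoothing keeps the estimate conservative.
    return (exceed + 1) / (trials + 1)
```

Unlike the bootstrap, which resamples the test set and (per the result above) tends to underestimate the true $p$-value, this test shuffles only the assignment of outputs to systems, which is what makes it the more accurate of the two.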