Assessing Human and Automated Quality Judgments in the French MT Evaluation Campaign CESTA

This paper analyzes the results of CESTA, the French MT evaluation campaign (2003-2006). The campaign itself is first briefly described. The paper then focuses on the results of its two runs, which used human judgments of quality, such as fluency and adequacy, as well as automated metrics, based mainly on n-gram comparison and word error rates. The results show that the quality of the systems can be reliably compared using these metrics, and that the adaptability of some systems to a given domain – the focus of CESTA's second run – is not strictly related to their intrinsic performance.
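As a point of reference for the automated metrics mentioned above, the sketch below shows one common way to compute a word error rate: a word-level Levenshtein (edit) distance between a reference translation and a system hypothesis, normalized by the reference length. This is an illustrative implementation of the general WER formula, not the specific scoring tool used in CESTA.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, against the reference "the cat sat on the mat", the hypothesis "the cat sat mat" incurs two deletions, giving a WER of 2/6. Lower is better; a perfect match scores 0.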