NIST Open MT Evaluations
- Purpose: advance the state of the art of MT technology
- Method:
  - Evaluations at regular intervals since 2002
  - Open to all who wish to participate
  - Multiple language pairs, two training conditions
- Metrics:
  - Automatic metrics (primary: BLEU)
  - Human assessments

LREC 2008, Marrakech, Morocco (May 2008)
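The slide names BLEU as the primary automatic metric. At its core, BLEU combines modified (reference-clipped) n-gram precision with a brevity penalty; the sketch below is a minimal single-reference, sentence-level illustration of that idea, with simple smoothing added so the geometric mean is defined (not the exact NIST scoring implementation, which operates corpus-level over multiple references).

```python
import math
from collections import Counter

def bleu_sketch(candidate: str, reference: str, max_n: int = 4) -> float:
    """Illustrative sentence-level BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # floor zero counts so log() stays defined (crude smoothing)
        log_prec_sum += math.log(max(clipped, 1e-9) / total)
    # brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)
```

A candidate identical to its reference scores 1.0; scores fall toward 0 as n-gram overlap drops or the candidate shortens.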

Opportunity knocks…
- The new assessment model provided an opportunity for research on human assessment:
  - Application design: how do we best accommodate the requirements of an MT human-assessment evaluation?
  - Assessment tasks: what exactly are we to measure, and how?
  - Documentation and assessor training procedures: how do we maximize the quality of assessors' judgments?

Assessment Tasks
- Adequacy: measures the semantic adequacy of a system translation compared to a reference translation
- Preference: measures which of two system translations is preferable compared to a reference translation

Conclusions & Future Directions
- Continue improving human assessments as an important measure of MT quality and a validation of automatic metrics
  - What exactly are we measuring that we want automatic metrics to correlate with?
  - Which questions are the most meaningful to ask?
  - How do we achieve better inter-rater agreement?
- Continue post-test analyses
  - What are the most insightful analyses of results?
  - Adjudicated "gold" score vs. statistics over many assessors?
- Incorporate user feedback into tool design and assessment tasks
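The inter-rater agreement question above is commonly quantified with a chance-corrected statistic. As one concrete illustration (the slide does not specify a statistic; this is an assumption), Cohen's kappa for two raters assigning one categorical label per item can be computed as:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters,
    each assigning one categorical label (e.g. an adequacy score) per item."""
    assert len(rater_a) == len(rater_b) and rater_a, "need paired judgments"
    n = len(rater_a)
    # observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # expected agreement under independent labeling with each rater's marginals
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always use the same label
    return (observed - expected) / (1 - expected)
```

Kappa is 1 for perfect agreement, 0 for chance-level agreement, and negative when raters agree less often than chance; for more than two raters, Fleiss' kappa is the usual generalization.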