TER

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

"8.0 0.5 1.0 1.5 2.0 TER between pre- and post-edit translation

Page 3, “Crowdsourcing Translation”

Aggressiveness (x-axis) is measured as the TER between the pre-edit and post-edit version of the translation, and effectiveness (y-axis) is measured as the average amount by which the editing reduces the translation’s TER_gold.

Page 3, “Crowdsourcing Translation”

We use translation edit rate (TER) as a measure of translation similarity.

Page 3, “Crowdsourcing Translation”

TER represents the amount of change necessary to transform one sentence into another, so a low TER means the two sentences are similar.

Page 3, “Crowdsourcing Translation”
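As a rough illustration of the metric, the sketch below computes a simplified word-level TER: edit distance (insertions, deletions, substitutions) divided by reference length. Full TER also allows phrase shifts, which this sketch omits, so it only approximates the score used in the paper.

```python
def simple_ter(hypothesis: str, reference: str) -> float:
    """Simplified word-level TER: edit distance / reference length (no shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    # Standard dynamic-programming edit distance over words.
    dist = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dist[i][0] = i
    for j in range(len(ref) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # substitution
    return dist[len(hyp)][len(ref)] / max(len(ref), 1)

print(simple_ter("the cat sat on mat", "the cat sat on the mat"))  # ~0.167
```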

Editor A: TER_d 0.01–0.03

Page 4, “Crowdsourcing Translation”

To capture the quality (“professionalness”) of a translation, we take the average TER of the translation against each of our gold translations.

Page 4, “Crowdsourcing Translation”

We measure aggressiveness by looking at the TER between

Page 4, “Crowdsourcing Translation”

the pre- and post-edited versions of each editor’s translations; higher TER implies more aggressive editing.

Page 4, “Crowdsourcing Translation”
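A minimal sketch of how these two per-editor statistics could be computed, assuming the simple_ter helper above; the argument layout (pre/post-edit pairs plus aligned gold translations) is illustrative, not the paper's data format.

```python
from statistics import mean

def editor_stats(edits, golds):
    """edits: list of (pre_edit, post_edit) pairs for one editor.
    golds: list of gold-translation lists aligned with those pairs.
    Returns (aggressiveness, effectiveness)."""
    # Aggressiveness: average TER between the pre- and post-edit versions.
    aggressiveness = mean(simple_ter(post, pre) for pre, post in edits)

    def ter_gold(sentence, refs):
        # Average TER of a sentence against the gold translations.
        return mean(simple_ter(sentence, ref) for ref in refs)

    # Effectiveness: average reduction in TER_gold achieved by the edit.
    effectiveness = mean(ter_gold(pre, refs) - ter_gold(post, refs)
                         for (pre, post), refs in zip(edits, golds))
    return aggressiveness, effectiveness
```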

Lowest TER: 35.78 (BLEU)

Page 8, “Evaluation”

The first method selects the translation with the minimum average TER (Snover et al., 2006) against the other translations; intuitively, this would represent the “consensus” translation.

Page 8, “Evaluation”

The second method selects the translation generated by the Turker who, on average, provides translations with the minimum average TER.
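A sketch of this consensus selection for one source segment, again using the simplified TER above: the chosen candidate is the one with the lowest average TER against the remaining candidates.

```python
def consensus_translation(candidates):
    """Return the candidate with minimum average TER against the others."""
    def avg_ter_to_others(cand):
        others = [c for c in candidates if c is not cand]
        return sum(simple_ter(cand, o) for o in others) / max(len(others), 1)
    return min(candidates, key=avg_ter_to_others)
```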

BLEU

Appears in 11 sentences as: BLEU (12)

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

Metric: Since we have four professional translation sets, we can calculate the Bilingual Evaluation Understudy (BLEU) score (Papineni et al., 2002) for one professional translator (P1) using the other three (P2,3,4) as a reference set.

Page 7, “Evaluation”
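This leave-one-out setup can be reproduced with an off-the-shelf BLEU implementation; the sketch below uses sacrebleu and assumes `professional` maps each translator (P1–P4) to their list of translated segments in a fixed order (the names and data layout are illustrative).

```python
import sacrebleu

def leave_one_out_bleu(professional):
    """professional: dict like {"P1": [...], "P2": [...], "P3": [...], "P4": [...]}."""
    scores = {}
    for held_out, hypotheses in professional.items():
        # The other three professional translators serve as the reference set.
        references = [segs for name, segs in professional.items() if name != held_out]
        scores[held_out] = sacrebleu.corpus_bleu(hypotheses, references).score
    return scores
```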

In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.

Page 7, “Evaluation”

This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.

Page 7, “Evaluation”

Table 2: Overall BLEU performance for all methods (with and without post-editing).

Page 8, “Evaluation”

The first oracle operates at the segment level on the sentences produced by translators only: for each source segment, we choose from the translations the one that scores highest (in terms of BLEU) against the reference sentences.

Page 8, “Evaluation”
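The segment-level oracle can be sketched in the same style, assuming sentence-level BLEU from sacrebleu and that each entry of `candidates_per_segment` holds the competing translations for one source segment (illustrative names).

```python
import sacrebleu

def oracle_selection(candidates_per_segment, references_per_segment):
    """For each segment, keep the candidate with the highest sentence-level
    BLEU against that segment's reference translations."""
    chosen = []
    for candidates, refs in zip(candidates_per_segment, references_per_segment):
        best = max(candidates,
                   key=lambda c: sacrebleu.sentence_bleu(c, refs).score)
        chosen.append(best)
    return chosen
```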

As expected, random selection yields bad performance, with a BLEU score of 30.52.

Page 8, “Evaluation”

Figure 5: Effect of candidate-Turker coupling (λ) on BLEU score.

Page 8, “Evaluation”

The approach which selects the translations with the minimum average TER (Snover et al., 2006) against the other three translations (the “consensus” translation) achieves a BLEU score of 35.78.

Page 8, “Evaluation”

Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to the score of 28.13 reported by Zaidan and Callison-Burch (2011), which they achieved using linear feature-based classification.

Page 8, “Evaluation”

This boost in BLEU score confirms our intuition that the hidden collaboration networks between candidate translations and translator/editor pairs are indeed useful.

Page 8, “Evaluation”

In order to determine a value for λ, we used the average BLEU, computed against the professional references.
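The tuning procedure is only summarized here; a simple grid search of this kind, with rank_and_select(lam) standing in for a hypothetical helper that runs the graph-based selection with coupling λ and returns the selected translations, would look like:

```python
import sacrebleu

def tune_lambda(lambdas, rank_and_select, references):
    """Pick the λ whose selected translations score highest in average BLEU.
    rank_and_select(lam) -> list of selected translations (hypothetical helper);
    references: list of reference streams, one per professional translator."""
    return max(lambdas,
               key=lambda lam: sacrebleu.corpus_bleu(rank_and_select(lam),
                                                     references).score)
```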

machine translation

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

These have focused on an iterative collaboration between monolingual speakers of the two languages, facilitated by a machine translation system.

Page 2, “Related work”

In our setup, the poor translations are produced by bilingual individuals who are weak in the target language, whereas in their experiments the translations are the output of a machine translation system. Another significant difference is that the HCI studies assume cooperative participants.

Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low-cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).

Page 2, “Related work”

A state-of-the-art machine translation system (the syntax-based variant of Joshua) achieves a score of 26.91, as reported in (Zaidan and Callison-Burch, 2011).

Page 8, “Evaluation”

In addition to its benefits of cost and scalability, crowdsourcing provides access to languages that currently fall outside the scope of statistical machine translation research.

BLEU score

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

In the following sections, we evaluate each of our methods by calculating BLEU scores against the same four sets of three reference translations.

Page 7, “Evaluation”

This allows us to compare the BLEU score achieved by our methods against the BLEU scores achievable by professional translators.

Page 7, “Evaluation”

As expected, random selection yields bad performance, with a BLEU score of 30.52.

Page 8, “Evaluation”

Figure 5: Effect of candidate-Turker coupling (λ) on BLEU score.

Page 8, “Evaluation”

The approach which selects the translations with the minimum average TER (Snover et al., 2006) against the other three translations (the “consensus” translation) achieves a BLEU score of 35.78.

Page 8, “Evaluation”

Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to the score of 28.13 reported by Zaidan and Callison-Burch (2011), which they achieved using linear feature-based classification.

Page 8, “Evaluation”

This boost in BLEU score confirms our intuition that the hidden collaboration networks between candidate translations and translator/editor pairs are indeed useful.

Mechanical Turk

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

Most NLP research into crowdsourcing has focused on Mechanical Turk, following pioneering work by Snow et al. (2008).

Page 2, “Related work”

Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low-cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).

Page 2, “Related work”

This data set consists of 1,792 Urdu sentences from a variety of news and online sources, each paired with English translations provided by non-professional translators on Mechanical Turk.

graph-based

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

We develop graph-based ranking models that automatically select the best output from multiple redundant versions of translations and edits, improving translation quality to a level closer to that of professionals.

Page 1, “Abstract”

A new graph-based algorithm for selecting the best translation among multiple translations of the same input.

Page 2, “Introduction”

Using the raw translations without post-editing, our graph-based ranking method achieves a BLEU score of 38.89, compared to the score of 28.13 reported by Zaidan and Callison-Burch (2011), which they achieved using linear feature-based classification.

Page 8, “Evaluation”

In contrast, our proposed graph-based ranking framework achieves a score of 41.43 when using the same information.

translation system

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

These have focused on an iterative collaboration between monolingual speakers of the two languages, facilitated by a machine translation system.

Page 2, “Related work”

Although hiring professional translators to create bilingual training data for machine translation systems has been deemed infeasible, Mechanical Turk has provided a low-cost way of creating large volumes of translations (Callison-Burch, 2009; Ambati and Vogel, 2010).

Page 2, “Related work”

(2013) translated 1.5 million words of Levantine Arabic and Egyptian Arabic, and showed that a statistical translation system trained on the dialect data outperformed a system trained on 100 times more MSA data.

Page 2, “Related work”

A state-of-the-art machine translation system (the syntax-based variant of Joshua) achieves a score of 26.91, as reported in (Zaidan and Callison-Burch, 2011).

PageRank

Appears in 3 sentences as: PageRank (3)

In Are Two Heads Better than One? Crowdsourced Translation via a Two-Step Collaboration of Non-Professional Translators and Editors

The standard PageRank algorithm starts from an arbitrary node and, at each step, either follows a random outgoing edge (according to the weighted transition matrix) or jumps to a random node (with all nodes treated as equally probable).

Page 6, “Problem Formulation”

where 1 is a vector of all ones whose size corresponds to that of V_0 or V_T, and μ is the damping factor, usually set to 0.85, as in the PageRank algorithm.

Page 7, “Problem Formulation”
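The update rule itself is not quoted on this page; a plausible reconstruction, assuming the standard damped random-walk formulation with a row-normalized weighted transition matrix W over the relevant node set V, is

```latex
\mathbf{r} \;=\; \mu\, W^{\top}\mathbf{r} \;+\; (1-\mu)\,\frac{\mathbf{1}}{|V|}
```

where r is the ranking vector, 1 the all-ones vector, and μ = 0.85.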

We set the damping factor μ to 0.85, following the standard PageRank paradigm.
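A compact sketch of this ranking step as a damped power iteration; how the translation and Turker nodes are connected and weighted follows the paper and is not reproduced here, so the weight matrix below is simply an input.

```python
import numpy as np

def pagerank(weights, mu=0.85, iters=100, tol=1e-8):
    """weights[i][j]: edge weight from node i to node j.
    Returns PageRank-style scores; higher means a more central node."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    # Row-normalize to get the transition matrix; rows with no outgoing
    # edges are treated as uniform over all nodes.
    row_sums = W.sum(axis=1, keepdims=True)
    P = np.where(row_sums > 0, W / np.maximum(row_sums, 1e-12), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r_next = mu * P.T @ r + (1.0 - mu) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Toy usage: rank three redundant translations by mutual similarity weights.
scores = pagerank([[0.0, 0.8, 0.6],
                   [0.8, 0.0, 0.7],
                   [0.6, 0.7, 0.0]])
print(int(scores.argmax()))  # index of the highest-ranked candidate
```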