19 November 2007

If you look at MT papers published in the *ACL conferences and siblings, I imagine you'll find a cornucopia of results for translating into English (usually, from Chinese or Arabic, though sometimes from German or Spanish or French, if the corpus used is from EU or UN or de-news). The parenthetical options are usually just due to the availability of corpora for those languages. The former two are due to our friend DARPA's interest in translating from Chinese and Arabic into English. There are certainly groups out there who work on translation with other target languages; the ones that come most readily to mind are Microsoft (which wants to translate its product docs from English into a handful of "commercially viable" languages), and a group that works on translation into Hungarian, which seems to be quite a difficult proposition!

Maybe I'm stretching here, but I feel like we have a pretty good handle on translation into English at this point. Our beloved n-gram language models work beautifully on a language with such a fixed word order and (to a first approximation) no morphology. Between the phrase-based models that have dominated for a few years, and their hierarchical cousins (both with and without WSJ as input), I think we're doing a pretty good job on this task.

I think the state of affairs in translation out of English is much worse. In particular, I mean translation from a morphologically-poor language to a morphologically-rich language. (Yes, I will concede that, in comparison to English, Chinese is morphologically-poorer, but I think the difference is not particularly substantial.)

Why do I think this is an interesting problem? For one, I think it challenges a handful of preconceived notions about translation. For instance, my impression is that while language modeling is pretty darn good in English, it's pretty darn bad in languages with complex morphology. There was a JHU workshop in 2002 on speech recognition for Arabic, a large component of which was on language modeling. From their final report, "All these morphology-based language models yielded slight but consistent reductions in word error rate when combined with standard word-based language models." I don't want to belittle that work---it was fantastic. But one would hope for more than a slight reduction, given how badly word-based n-gram models work in Arabic.

Second, one of my biggest pet peeves about MT (well, at least, why I think MT is easier than most people usually think of it) is that interpretation doesn't seem to be a key component. That is, the goal of MT is just to map from one language to another. It is still up to the human reading the (translated) document to do interpretation. One place where this (partially) falls down is when there is less information in the source language than you need to produce grammatical sentences in the target language. This is not a morphological issue per se (for instance, the degree to which context plays a role in interpretation is significantly higher in Japanese than in English---directly translated sentences would often not be interpretable in English), but really an issue of information-poor to information-rich translation. It just so happens that a lot of this information is often marked in morphology, which languages like English lack.

That said, there is at least one good reason why one might not work on translation out of English. For me, at least, I don't speak another language well enough to really figure out what's going on in translations (i.e., I would be bad at error analysis). The language other than English that I speak best is Japanese. But in Japanese I could probably only catch gross translation errors, nothing particularly subtle. Moreover, Japanese is not what I would call a morphologically rich language. I would imagine Japanese to English might actually be harder than the other way, due to the huge amount of dropping (pro-drop and then some) that goes on in Japanese.

If I spoke Arabic, I think English to Arabic translation would be quite interesting. Not only do we have huge amounts of data (just flip all our "Arabic to English" data :P), but Arabic has complex but well-studied morphology (even in the NLP literature). As cited above, there's been some progress in language modeling for Arabic, but I think it's far from solved. Finally, one big advantage of going out of English is that, if we wanted, we have a ton of very good tools we could throw at the source language: parsers, POS taggers, NE recognition, coreference systems, etc. Such things might be important in generating, e.g., gender and number morphemes. But alas, my Arabic is not quite up to par.

(p.s., I recognize that there's no reason English even has to be one of the languages; it's just that most of our parallel data includes English and it's a very widely spoken language and so it seems at least not unnatural to include it. Moreover, from the perspective of "information poor", it's pretty close to the top!)

14 comments:

I think another reason for working on other translation pairs is commercial. Just look at the percentage increase of Chinese, Portuguese, and Arabic speakers on the Internet:

http://www.internetworldstats.com/stats7.htm

There's gotta be interest in translation from English to X, as well as translation pairs not involving English. Now, the latter is yet another intriguing MT research problem, i.e. should we do bridge translation (X1->English, English->X2), direct translation (X1->X2), or a combination?

In fact, another (much smaller) DARPA program, the TransTac speech-to-speech translation project, IS looking at translation from English to (Iraqi) Arabic. The motivation is to allow full 2-way communication between the parties.

Yes, the morphological complexity of Arabic does cause a problem, but mainly for the metrics, rather than the translation itself, it would seem. BLEU and TER both show much worse performance for E2A than A2E, but subjective Likert-scale evaluation shows the two translation directions as performing about the same in many evals.

that usage list is pretty cool... would have been great if they could have worked the % increase into the graph itself. but it's pretty amazing.

as for the direct versus bridge, i guess it would depend primarily on if you have actual direct parallel data. if you don't, you're pretty much hosed and have to do bridge. it's possible that even with some parallel data, it might be better to do a combination (i wouldn't find this surprising at all). i guess one question is whether you can do anything more interesting than just X1 -> N-best-E and then N-best-E -> N^2-best-X2 and then rerank.
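The baseline bridge setup described above (X1 -> N-best-E, then N-best-E -> N^2-best-X2, then rerank) can be sketched roughly as follows. This is only a toy illustration, not anyone's actual system: the two "translators" are hypothetical lookup tables with made-up log-probabilities, and the reranker simply adds the two stage scores and keeps the best path for each distinct target string.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (sentence, log-probability)
Translator = Callable[[str, int], List[Hypothesis]]


def bridge_translate(source: str,
                     src_to_pivot: Translator,
                     pivot_to_tgt: Translator,
                     n: int = 5) -> List[Hypothesis]:
    """Translate via a pivot language and rerank the (up to) n*n candidates."""
    best: Dict[str, float] = {}
    for pivot, lp_pivot in src_to_pivot(source, n):
        for target, lp_target in pivot_to_tgt(pivot, n):
            score = lp_pivot + lp_target  # log-probs of the two stages add
            # Keep only the best-scoring path to each distinct target string.
            if target not in best or score > best[target]:
                best[target] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy "systems": French -> English -> German, invented scores.
def toy_fr_en(sentence: str, n: int) -> List[Hypothesis]:
    table = {"bonjour": [("good morning", -0.4), ("good day", -0.9)]}
    return table.get(sentence, [])[:n]


def toy_en_de(sentence: str, n: int) -> List[Hypothesis]:
    table = {
        "good morning": [("guten Morgen", -0.1), ("guten Tag", -1.6)],
        "good day": [("guten Tag", -0.2), ("schoenen Tag", -1.2)],
    }
    return table.get(sentence, [])[:n]


ranked = bridge_translate("bonjour", toy_fr_en, toy_en_de, n=2)
for sentence, score in ranked:
    print(f"{score:.2f}\t{sentence}")
```

Taking the max over pivot paths is just one choice; summing probabilities over all pivots that reach the same target, or adding a target-side language model score before reranking, would be natural variations.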

dave --

i had forgotten about this project, but my sense is that speech to speech in the case of transtac is a very very limited domain and that classification-based and small rule-based systems actually do quite well. are the results you're quoting for E2A vs A2E on this domain, or is it for the more general text-to-text in (eg) news? if the former, then my guess is there isn't actually much "generation" going on, which may explain away some of the good performance.

A lot of people think that the TransTac domain is very narrow, but that's not true, at least not when compared with other speech systems like dialog systems. Quite a wide range of topics are covered within its purview, and the Arabic vocab is about 75K. All of the surviving systems in the program in fact use statistical machine translation as their primary mechanism.

Now granted, it is not as broad as news (GALE), nor as syntactically demanding in terms of complex clausal structure to get right, etc. I don't know how E2A translation would do for that domain. Probably it would not be totally awful, though.

There has been some research in our lab on adapting n-gram models to Finnish (and related languages), which is morphologically rich. The approach has been to segment the words automatically into morpheme-like units learned without supervision. This improves n-gram performance significantly for morphologically rich languages. The segmentation was also applied to statistical MT, for which Finnish is a difficult source and target. There, however, the scores did not improve that much. The first paper below seems to have applied the methods to Arabic as well.

Some papers in case you're interested:

http://www.cis.hut.fi/vsiivola/papers/creutz07naacl.pdf
http://www.cis.hut.fi/svirpioj/papers/virpioja07mtsummit.pdf
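To give a feel for why segmenting into morpheme-like units helps n-gram models, here is a toy sketch. It is not the method from the papers above: a hand-written, hypothetical list of three Finnish case suffixes stands in for the unsupervised segmentation, and the only thing measured is the out-of-vocabulary rate on a tiny invented test set.

```python
from typing import List

# Hypothetical stand-in for a learned segmentation: three Finnish case endings.
SUFFIXES = ("ssa", "sta", "lle")


def segment(word: str) -> List[str]:
    """Split off one known suffix, marking it with '+'; otherwise keep the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return [word[:-len(suffix)], "+" + suffix]
    return [word]


def segment_all(words: List[str]) -> List[str]:
    return [unit for word in words for unit in segment(word)]


def oov_rate(train_tokens: List[str], test_tokens: List[str]) -> float:
    """Fraction of test tokens never seen in training."""
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


# Invented data: inflected forms of "talo" (house) and "auto" (car).
train_words = ["talossa", "autosta", "talolle"]
test_words = ["autossa", "talosta", "autolle"]

word_oov = oov_rate(train_words, test_words)
unit_oov = oov_rate(segment_all(train_words), segment_all(test_words))
print(f"word-level OOV rate: {word_oov:.2f}")
print(f"unit-level OOV rate: {unit_oov:.2f}")
```

Every test word is unseen as a whole word, yet every stem and ending was seen in training, so a model over the segmented units has something to estimate from. The price is that each word now spans more tokens, so a fixed n-gram order covers less context.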

You make a lot of valid points. The commercial interest in English-X may be even bigger than in X-English, but the funding situation in the US is different. Fortunately things are better here in the old world.

I have found in the translation of European languages that morphology is one of the main reasons why translation into a language is worse than translation out of it: generating morphology is much harder than translating it. I don't think this is just an artefact of the BLEU score.

Now it's time to plug my Europarl (2005) paper and the recent work on factored models...

I would disagree with you on the quality of automatic translation into English...still not reliable enough. I do agree that it's pretty impressive what n-gram models have yielded, but we might be close to the limits of what we can do with them.