Chief Family Officer

Thinking about machine translation

About the webinar

In this post I reflect on a recent webinar on Machine Translation (MT) (view the 57-min recording here), co-hosted by Asia Online and Moravia Worldwide. The webinar was packed with information and insights, and I strongly recommend it to anyone who works with multilingual content and wants to learn more about this up-and-coming translation technology.

The New Model for Partnerships in MT webinar comprehensively covered differences between the two main varieties of MT, trends in content creation and localization including the synthesis of MT with human translation and editing, the pressure on businesses to localize more content within ever shortening timeframes, how MT can help solve this problem, and what the bottom-line impact is.

Machine translation: the big picture

The presenters, Kirti Vashee, VP Enterprise Translation Sales at Asia Online (@kvashee on Twitter and blogger at eMpTy Pages) and Bob Myers, COO at Moravia Worldwide, went through a lot of content-packed slides to lay out the basics:

The point is not to compare and contrast MT and human translation; it is about the two working together to create a rich, complex model.

Two MT approaches have been most influential. Rule-based MT applications (e.g. Systran) have benefited from over 50 years of research, and might have reached their limits. The statistical MT approach (e.g. Google Translate, Asia Online) is younger, still growing rapidly, and is the one that the presenters are betting on.

It is useful to distinguish between generic and customized MT. Generic MT (e.g. Babelfish, Google Translate) gives a general gist of the source content, is not specialized, and is designed for broad application. Customized MT (Asia Online’s model) creates a tailored offering for an individual client or industry (e.g. IT or travel), fine-tuning the MT to maximize performance in that specific area.

To train the MT engine, both bilingual content (large volumes of normalized and cleaned up TM data, glossaries, “golden” translations) and very high quality monolingual content are necessary inputs. If you have good examples of how monolingual content is applied to the MT engine, I’d love to hear them!

The business case for Machine Translation

According to Moravia and Asia Online, the problem is that only 0.5% of what needs to be translated actually gets translated. The proposed solution is to boost productivity by delivering translation that is “good enough” – that is, MT plus some (non-perfect) human editing, with the goal to have translated content that the user can read, understand, and use to complete the task they are trying to carry out.

Over the past several years, the trend has been a shift from assisted customer support (phone, email), to automated support (phone prompts, computer-suggested responses, indexed manuals), to community support (“users are always online and talking”). It is easy to see how this can directly and significantly cut the cost of customer support and improve user satisfaction: tap into the community, give users the power to respond to queries and solve other users’ problems, and use an efficient technology to make the solutions available in more languages.

ROI calculations that Moravia and Asia Online have completed support this claim, showing that it’s much cheaper to shift most of the support burden to the community, while saving time with instant (MT-powered) translations: most recent support articles are the most important ones because they are addressing the issues that users are dealing with right now.

Does MT work, and what makes it work?

Quality is a tricky subject in both human and machine translation, because of varying power and interest levels of key stakeholders in charge of evaluating and validating translations. The proposed solution is to turn over the decision of whether the translation quality is “good enough” to the users: if they are served a page that was translated with the help of an MT engine, does it result in them solving their problem?

The following factors are crucial for success with MT: large volumes of input data (translation memories as well as monolingual content); high quality of that data (something that many clients tend to overlook, believing in quantity over quality); human editing at various stages to clean the corpus and continuously train the translation engine; and extensive, thorough glossaries.

The key takeaway about MT output quality: data must be clean, since what you feed to the engine is what it learns. It’s not enough to have tons of data; the cleaning process that improves the content quality and adapts it for MT use is just as important.
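To make “clean data” a little more concrete, here is a minimal sketch in Python of the kind of filtering that could be applied to translation memory segments before training. It is not the presenters’ process: the function names, the rules and the `max_length_ratio` threshold are all my own illustrative assumptions, and a real pipeline would need more care (for example, brand names legitimately stay identical in source and target).

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_segments(pairs, max_length_ratio=3.0):
    """Return normalized (source, target) pairs with obvious noise removed.

    The rules and the max_length_ratio threshold are illustrative assumptions.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = normalize(source), normalize(target)
        if not source or not target:
            continue  # one side is empty
        if source == target:
            continue  # segment was probably left untranslated
        longer = max(len(source), len(target))
        shorter = min(len(source), len(target))
        if longer / shorter > max_length_ratio:
            continue  # lengths differ too much, so likely misaligned
        if (source, target) in seen:
            continue  # exact duplicate after normalization
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


sample = [
    ("Click  the Save button.", "Cliquez sur le bouton Enregistrer."),
    ("Click the Save button.", "Cliquez sur le bouton Enregistrer."),
    ("Cancel", "Cancel"),
    ("OK", ""),
    ("Yes", "Cette phrase est beaucoup trop longue pour la source."),
]
print(clean_segments(sample))
```

Of the five sample segments, only the first survives: the second is a duplicate once whitespace is normalized, the third looks untranslated, the fourth has an empty target, and the fifth fails the length check.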

In closing…

As MT becomes more widespread, I wonder how it will affect the way people communicate as we get more exposure to machine-influenced, or “good enough” content. Will we start to write, speak and think in more machine-like ways?

I would love to have the PPT version of the webinar, because slides were tightly packed with text and images. Unfortunately, Moravia did not make the presentation file downloadable, so you will need to watch the webinar to get the insights. Another reason I wanted the PPT is that I absolutely adored the multicultural avatars: little dudes wearing their kimonos, Mexican hats, turbans and so on. I can totally imagine an entire animated shot starring the multicultural dudes explaining technology and localization. I would have shared a picture if I’d managed to get hold of the presentation file, which sadly, I did not.

Update: my plea was heard and the presentation popped up in my mailbox. Introducing the “multicultural dudes”, courtesy of presenters:

Overall, this webinar is a very useful introduction for someone who is considering MT for their business, or anyone involved with localization business and willing to stay on top of the trends. Moravia Worldwide also offers a free consult for interested companies, to evaluate the cost and ROI (details and contacts at the end of the webinar). Even though I’m not in the market for machine translation right now, I am definitely sold on its significance for businesses who want to bring more relevant, useful content to their customers across the globe.

11 thoughts on “Thinking about machine translation”

Thank you for your kind comments about the webinar. I am glad that you found it useful.

The question you raise about monolingual data is explained in some detail in a white paper I wrote on MT which is available at the L10NCafe Site in the resources section http://l10ncafe.com/Lists/Resources/AllItems.aspx (requires registration). I also share the paper at my LinkedIn profile. Basically, monolingual data is used by SMT systems to develop an understanding of the grammar and style and is nothing more than an analysis of patterns in native language content.
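As a rough illustration of “an analysis of patterns in native language content”, here is a minimal Python sketch of a bigram language model trained only on monolingual text. It is a toy under stated assumptions, not how any particular SMT system is built: the corpus, the function names and the add-one smoothing are all chosen just for this example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams in whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_probability(sentence, unigrams, bigrams):
    """Score a sentence with add-one smoothed bigram probabilities."""
    vocabulary_size = len(unigrams)
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    total = 0.0
    for previous, current in zip(tokens, tokens[1:]):
        numerator = bigrams[(previous, current)] + 1
        denominator = unigrams[previous] + vocabulary_size
        total += math.log(numerator / denominator)
    return total


# Toy monolingual corpus (an assumption made for this example).
corpus = [
    "the printer is out of paper",
    "the printer is offline",
    "restart the printer",
]
unigrams, bigrams = train_bigram_model(corpus)
fluent = log_probability("the printer is offline", unigrams, bigrams)
scrambled = log_probability("printer the offline is", unigrams, bigrams)
print(fluent > scrambled)
```

The model has never seen a translation, yet it scores the natural word order higher than the scrambled one. In an SMT system, a score of this kind can be used to prefer the more fluent of several candidate translations.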

The L10NCafe also has a white paper that describes the customer support opportunity you describe here in some detail: “Reaching the Global Customer with Human Driven MT”

We will continue to present other high-volume content scenarios to show how man-machine collaboration can handle translation projects that involve tens of millions of words. I believe this will become much more commonplace.

We are only at the beginning of the content deluge and I am sure it will very likely mean a lot more work for everybody in the translation field.

I will definitely check out the webinar, as it is essential for agencies and translators to be ahead of the game regarding machine translation. I really enjoyed your summary of the current situation. Obviously, incorporating MT into current workflows and informing clients of their possibilities is essential in forging solid relationships and producing high-quality material.

Kirti, thank you for a personal note! Unfortunately, I’m not able to connect to L10NCafe (asks me for a login and a password for some reason) but I downloaded the white paper from your LinkedIn profile – it is great that it covers the topic in depth, with very specific examples about company applications and references to other experts.

I agree that the content is on the rise and that the translation industry will continue to evolve to meet the demand even though matching incremental investments cannot be taken for granted. The MT trend in general and your blog in particular are something to watch for anyone involved with localization.

Thank you for your very kind words on both the presentation and the graphics. We have gone to great pains to try and explain how machine translation is no longer a tool for the very technical and can now be used by the mainstream translation community. The artwork and graphics are key to delivering these messages clearly and concisely so that the business side of machine translation can be better understood. Our “multicultural dudes” as you called them 🙂 will be appearing a lot more. We have designed many upcoming webinars that include them and a range of others in an attempt to explain how machine translation works and how language service providers and enterprises can benefit from multicultural content and multi-language delivery models.

Another webinar that we delivered recently may also be of interest – How Machine Translation Can Help Language Service Providers (LSPs) Grow Their Business. The video replay is available at http://www.languagestudio.com/Webinars.aspx along with some of the videos from the Localization and Translation conference that was held in Bangkok last December.

@Kirti: thank you for the explanation. Now that I’ve finally managed to register, it makes sense how L10NCafe membership works – somehow, it was not really obvious before. Will be looking through that content…

@Dion: thanks for stopping by, and I am looking forward to more multicultural dudes! They are very personable and get the business point across in a casual yet very convincing way. And yes, I plan to be watching and covering more webinars on the topic.

Jenia,
Nicely written. I do think it’s important to point out some misconceptions. While Statistical MT is newer than Rule-Based, it is not necessarily better. Same goes for the reverse. It all depends on what project requirements you have. What’s better is to merge the two methods. This is called “Hybrid”. Systran has recently released a Hybrid system. Others are working on one as well.

Cary
I agree that it cannot be said that one approach is definitely ALWAYS better than the other. There are many successful uses of both. In fact, at this point in time there may be more examples of RbMT successes. However, there is clear evidence that SMT continues to gain momentum and is increasingly the preferred approach. RbMT has been around for 50 years and the engines we see are in many cases (e.g. Systran English>French) the result of decades of investment and research. SMT is barely 5 years old and is only beginning.

The people best suited to answer the question of which is better are those who have explored both RbMT & SMT paradigms deeply, to solve the same problem. Unfortunately there are very few of these people around. The only ones I know for sure that have this knowledge are the Google Translate and Microsoft Translate teams and they have both voted in favor of SMT.

Today RbMT still makes sense when you have very little of the data that is necessary to get a good SMT engine into place, or where you have a good foundation engine already in place, which has been tested and is a good starting point for customization. Some say RbMT engines also perform better on languages with very large structural and morphological differences.

What most people miss is that the free online engines are not a good representation of the best output possible with MT today. The best systems come after focused customization efforts, and the best examples for both RbMT and SMT are carefully customized systems that are built for very specific enterprise needs rather than for general translation.
It has also become very fashionable to use the word hybrid of late. From my viewpoint, characterizing the new Systran effort as a hybrid engine is misleading. It is an RbMT engine that applies a statistical post-process on the RbMT output to try and improve fluency. Fluency has always been a problem for RbMT and this process is an attempt to improve the quality of the RbMT output and thus this approach is not a true hybrid from my point of view. In the same way, linguistics are being added to SMT engines in different ways to handle issues like word order and dramatically different morphology which have been a problem for pure data based SMT approaches. I agree that statistics, data and linguistics are all necessary to get better results.

I would also like to present my case for the emerging dominance of SMT with some data that I think we can mostly agree is factual and true, and not just a matter of my opinion.

Fact 1: Google used the Systran RbMT system as their translation engine for many years before switching to SMT. The Google engines are general purpose baseline systems (i.e. non domain focused). Most people will agree that Google compares favorably with Babelfish, which is an RbMT engine. I am told they switched because they saw a better-quality future and continuing evolution with SMT, which CONTINUES TO IMPROVE as more data becomes available and corrective feedback is provided. Most people agree that the Google engines have continued to improve since they switched to SMT.
Fact 2: Most of the widely used RbMT systems have been developed over many years (decades in some cases) while none of the SMT systems are over 5 years old and are still in infancy.
Fact 3: Microsoft switched from a RbMT engine to an SMT approach for all their public translation engines in the MSN Live portal. I presume for similar reasons as Google. They also use a largely SMT based approach to translate millions of words in their knowledge bases into 9 languages which is perhaps the most widely used corporate MT application. They too expect that the quality will improve faster than anything that could be done with their previous Systran approach.
Fact 4: Worldlingo switched from a RbMT foundation to SMT to get broader language coverage and attempt to reverse a loss of traffic (mostly to Google)
Fact 5: SMT providers have been able to easily outstrip RbMT providers in terms of language coverage and we are only at the beginning of this trend. Google had a base of 25 languages while they were RbMT based but now have over 45 language pairs that can go into any other language and apparently over 1,000 combinations with their SMT engines.
Fact 6: The Moses Open Source SMT engine has been downloaded over 4,000 times in the last two years. Many will be overwhelmed by the complexity but many new initiatives are coming forth from this exploration of SMT by the open source community and we have not yet really seen the impact of this.

Google and Microsoft have placed their bets. Even IBM, which still has a legacy RbMT offering (albeit dead), has their Arabic and Chinese speech systems linked to an SMT engine that they have developed. So now we have three of the largest IT companies in the world focused on SMT-based approaches.
However, this is just the data for the public online free engines. Many of us know that customized, domain focused systems are different and for enterprise use, the area that matters most. How easy is it to customize an SMT vs RbMT engine?

Fact 7: Philipp Koehn and his Univ. of Edinburgh team have published a paper (funded by Euromatrix) in which they compared systems for 6 European languages, both as baselines and after domain tuning (with TM data for SMT and dictionaries for RbMT). They found that Czech, French, Spanish and German to English all had better domain results with SMT. Only Eng>Ger had better results with RbMT on domain focused systems. However, he did find that RbMT had better baselines in some cases than his SMT systems, since he does not have the data resources that Google or Microsoft have.
Fact 8: Asia Online has been involved with patent domain focused systems in Chinese and Japanese. We have produced higher quality translations than RbMT systems which have been carefully developed over almost a decade of dictionary and rules tuning. The SMT systems were built over 3-6 months and will continue to improve. It should be noted that in both cases Asia Online is using linguistic rules in addition to raw data-based engine development.
Fact 9: The intellectual investment from the computational linguistics and NLP community is heavily biased towards SMT maybe by as much as a factor of 10X or 20X. You can verify this by looking at the focus of major conferences on MT in the recent past and in 2010. I suspect that this will mean continued advance and progress in the quality of SMT based approaches.

Some of my personal bias and general opinion on this issue:
— If you have a lot of bilingual matching phrase pairs (100K+) you should try SMT, and in most cases you will get better results than with RbMT, especially if you spend some time providing corrective feedback in an environment like Asia Online. I think man-machine collaborations are much more easily engineered in SMT frameworks. Corrective feedback can be immediately useful and can leverage future quality.
— SMT systems will continue to improve as long as you have clean data foundations, continue to provide corrective feedback, and retrain these systems periodically after “teaching” them what they are getting wrong.
— SMT will win the English to German quality game in the next 3 years or sooner.
— SMT will become the preferred approach for most of the new high value markets like Br PT, Chinese, Indic Languages, Indonesian, Thai, Malaysian and major African markets.
— SMT will continue to improve significantly in future because – Open Source + Academic Research + Growing Data on Web + Crowdsourcing Feedback are all at play with this technology

SMT systems will improve as more data becomes available, bad data is removed and as pre and post processing technologies around the systems improve. I also suspect that the future systems will be some variation of SMT + Linguistics rather than just raw data-only based approaches. I also see that humans will be essential to driving the technology forward and that some in the professional industry will be at the helm, as they do in fact understand how to manage large scale translation projects better than most.

I have also covered this in some detail (and fairly I think) in a white paper that can be found in the L10NCafe or on my LinkedIn profile and there is much discussion about this subject in the Automated Language Translation group in LinkedIn where you can also read the views of others who disagree with me.

For both the Rule-based approach and the Statistical-based approach, each can be broken down into both Generic MT and Customizable MT.
I explain this in my presentation:
Inbound versus Outbound Translation, by Jeff Allen. Presented at Localization World, Bonn, Germany, June 29 – July 1, 2004. https://www.box.net/shared/d19y1bd3e3

The hybrid approach is not new. I’ve been writing about it for the past decade on LANTRA, ProZ and Translatorscafe, based on Multi-engine MT projects I conducted at the Center for MT.

What is new is that SMT has become a commercial offering over the past several years, and it is possible to customize it.

As for hybrid, well, there is not a single meaning to hybrid. It all depends on what the hybrid system does, and how. I’ve written some comments on the Systran Facebook wall about their hybrid system, based on attending their open demo forum at the beginning of December 2009 and asking the presenters several very specific questions. That hybrid approach is very different from the AsiaOnline hybrid approach.

In evaluating a hybrid system, it is important to ask some very specific questions and get a visual diagram of what the system is doing, how it is interacting with data, what types of data formats the different processes can handle, and what the system is doing with internal modules at each phase.

@Cary: thank you for stopping by! I agree with you that it would be misleading to say that approach X is better in all instances – and it’s also true that different problems call for different solutions, and that blending approaches (whether you call it hybrid, or something else) helps to have “the best of all worlds”. I’m admittedly the rookie when it comes to MT, in comparison to everyone else who joined the discussion 🙂 My point of view: it did not seem to me that Moravia & Asia Online tried to bend facts when presenting the case for SMT, and based on the data and opinions they presented, I became convinced that SMT is (or is going to be) the most significant growth driver. To have a fair benchmark, I’d love to watch a Systran webinar for comparison, but I haven’t found any on the company website 😦

@Kirti: wow, thank you for a comment full of insights, as usual. I wanted to shout “blog post alert!” but then checked my Google Reader and it was already there.

@Jeff: thank you for sharing your thoughts and your presentation – a lot of useful examples and although the answer “it depends” can be frustrating when it comes to the quest for the best approach, this is the case with a lot of complex problems – and this is the domain where translation belongs.