Paradigm shift on machine translation?

The April-May issue of Multilingual, which I’m just catching up with, features seven articles on machine translation (MT). Having a long-term interest in this area (which is not to say any expertise) and in its potential for less-widely spoken languages, and having broached the topic on this blog once previously, I thought I’d take a moment to briefly review these articles. They are (links lead to abstracts for non-subscribers):

In the first of the articles, Jaap van der Meer characterizes changes in attitudes about MT over the last 4 years as “revolutionary” — a move “from complete denial [of MT’s utility] to complete acceptance.” What happened? The answer seems to be a number of events and changes rather than a single triggering factor, perhaps an evolution to a “tipping point” of sorts. There have been ongoing improvements in MT, there was the establishment of the Translation Automation User Society (TAUS) in 2004 which “helped stimulate a positive mindset towards MT,” and the empowerment of internet users in the use of MT. Van der Meer also points out a shift in emphasis from finding “fully automated high quality translation” (FAHQT) to what he calls “fully automated useful translation” (FAUT – an acronym that presumably should not be read in French). The latter is not only a more realistic goal, but also one that reflects needs and uses in many cases.

As for the future, van der Meer sees a “shift from traditional national languages to ever more specialized technical languages.” My question is whether we can at the same time also see significant moves for less widely spoken languages.

Van der Meer’s article sets the tone and has me asking if indeed we are at a point where a fundamental shift is occurring in the way we think of MT. The other articles look at specific issues.

Vadim Berman looks at some hurdles to making MT work, highlighting the importance of educating users – including mention of a recurrent theme: the importance of clean text going into the translation.

Two of the articles, by Lou Cremers and by Kerstin Berns and Laura Ramírez, discuss the practical value of MT in enterprise settings.

Cremers has some interesting thoughts about the utility of MT in an enterprise setting, something that has long seemed impractical, certainly when compared to translation memory (TM). He begins by noting that “a high end MT system will really work if used correctly, and may save a considerable amount of time and money,” and then proceeds to discuss several factors he sees as key to getting good ROI: terminologies and dictionaries; quality input text; volume (pointing out among other things the fact that good MT will tend to lead to a larger amount of text being translated – a key point for considering the value of MT in other spheres of activity, I might add); and workflow.

The “correct use” of MT relates largely to the quality of the text: “surprisingly simple writing rules governing the use of articles and punctuation marks will drastically improve MT output.”

Cremers offers a summation which seems to speak for several of the articles:

It’s not the absolute quality of the MT output that is important, but rather how much time it saves the translator in completing the task. In that way it is not different from TM. In both cases, human intervention is needed to produce high-quality translations.

Berns and Ramírez walk through the costs and benefits of MT in a business context. Here the issue is investing in a system but the reasoning could be applicable to different settings. They suggest that the kind of material to be translated is (unsurprisingly) a good guide to the potential utility of MT:

Do you have large text volumes with very short translation times and a high terminology density? Then it is very likely that MT will be a good solution for you. On the other hand, if you have small text volumes with varying text types and complex sentence structures, then it probably will be too much effort to set up an effective process.

Two of the articles, by Hugh Lawson-Tancred and Rafael Guzmán, discuss “post-editing” as a tool to improve the output of MT.

Lawson-Tancred suggests – contrary to several of the other authors – that the utility of preparing the text going into MT may not be so critical, and that “the monolingual environment of the post-editor is a better place to smooth out the wrinkles of the translation process….” Interestingly, this concept focuses on context, with the basic unit for processing being 5-20 words (that is, between the word level of dictionaries and whole sentences). He concludes by speculating that automated post-editing could “develop into a whole new area of applied computational linguistics.”

Guzmán, who has written a number of other articles on post-editing, discusses the use of TM in the context of verifying (post-editing) the product of MT. This basically involves ways of lining up texts in the source and translated languages for context and disambiguation. There are several examples using Spanish and English.

Finally, Dion Wiggins and Philipp Koehn discuss MT involving Asian languages, which most often entails different scripts. There are examples from several Asian languages illustrating the challenges involved.

This is an interesting set of articles to read to get a sense of the current state of the art as regards the application and applied research on MT. It’s a bit of a stretch for a non-specialist with limited context like me to wrap his mind around the ensemble of technical concepts and practices. One does come away, though, with the impression that MT is already a practical tool for a range of real-world tasks, and that we will be seeing much more widespread and sophisticated uses of it, often in tandem with allied applications (notably TM and post-editing). Are we seeing a paradigm shift in attitudes about MT?

At this time I’d really like to see a program to encourage young computer science students from diverse linguistic backgrounds in developing countries and indigenous communities to get into the field of research on MT. I’m convinced that it has the potential if approached strategically to revolutionize the prospects for minority languages and the ways we think about “language barriers.” That is more than just words – it has to do with education, knowledge and enhanced modes of communication. By extension, the set of human language technologies of which MT is a part, can in one way or another play a significant role in the evolution of linguistic diversity and common language(s) over the coming generations.


12 thoughts on “Paradigm shift on machine translation?”

Thanks for this useful summary.
There are a number of interesting developments that are pushing MT into practice, rather than being the endless “holy grail” tech project with a 5-year window.
• Statistical methods are speeding up engine production (no rules to craft manually),
• the blogosphere and social networking generally are thrusting people’s noses against the language barrier in their everyday computing,
• the open source community is getting to grips with simple and powerful tools to handle translation,
• and large companies with global communication needs cannot get their stuff out quick enough with the laborious workflows found in commercial translation contexts, so they are looking closely at MT.

On the language front, Asia Online is working closely on a number of Asian languages, and India has a strong R&D base and seems to be developing solutions (not just PhD theses) for its language spread.
Language Weaver (MT company) has a Hausa-English engine, and South Africa must surely be looking at the EU’s extensive experience in using tools and tech to run a multilingual country.

The key hope though is that communities will hijack the technology and push MT into new uses for new languages, so that we have a broad spectrum of experiments from which to select and build what Jaap van der Meer likes to call “translation out of the wall” – MT as a utility!

Thanks Andrew, I appreciate the feedback and info. I was not aware of Language Weaver’s work on Hausa (among African languages they’ve also done something on Somali as well as Arabic, the latter to and from several languages). In general, though, my impression is that outside of South Africa and in particular the Meraka Institute, and apart from a very few individual experts, there has been little attention to MT for African languages. The hurdles in terms of lack of corpora and in some cases lack of stable orthographies are part of the reason, but a commitment to finding a way to leverage HLT for African languages is the key need. I do think that a program focusing on supporting a new generation of African experts in NLP/HLT could yield benefits for the experts, the field, and most of all the diverse language communities of Africa.

Health Connections International, an international, non-profit public health organisation based in the Netherlands, is establishing, in response to the rapidly expanding HIV and AIDS pandemic, a multi-lingual web-based platform where non-English speaking physicians, healthcare practitioners, and other HIV professionals in low income countries may seek information and advice from their English-speaking colleagues.

I am seeking guidance and advice in the selection and use of appropriate CAT programs or other solutions. Can you assist? Your advice is most appreciated. Look forward to reactions. If you have any questions please do not hesitate to contact me.

Kind regards,

Murdo Bijl
________________________________________
Director of International Development
Health Connections International

The pursuit of useful rather than the ultimate high-quality machine translation is not new and not a paradigm shift of the last few years. I worked as an MT researcher in the eighties and already at that time a more practical approach to MT development was prevalent.

Thanks Job. Maybe I should use the term “watershed” (or “tipping point”?) in trying to describe what has been changing. From the view of a lot of users, apparently, there is a difference. Van der Meer’s use of the term “revolutionary” in writing about changes in attitudes about the utility of MT is what had me thinking along the lines that at least some people’s mindsets had shifted.

“At this time I’d really like to see a program to encourage young
computer science students from diverse linguistic backgrounds in
developing countries and indigenous communities to get into the
field of research on MT.”

From my understanding there are quite a few students from these
backgrounds working on research into machine translation. Part of the
problem that I see is that the research often does not get turned into
development — _if_ the research does get funding (for example the work
on indigenous South American languages at CMU in the AVENUE project),
often the development of the system only lasts until the funding runs
out, and is limited to prototypes and proofs-of-concept.

It’s worth noting that only in recent (the last 3-4) years have free
software machine translation platforms (e.g. Apertium[1] and Moses) been
available. Before this, research had to be done with closed-source or
“academic/research use only” tools which limited the prospects for
commercialisation (and therefore funding for further development beyond
the research stage).

Anyway, thanks for the post, and this is where the spam comes in (see
disclaimer[1] below — and sorry if you’ve heard it before): Apertium
(http://wiki.apertium.org/) is a free-software (GPL) project aimed at
creating machine translation systems for lesser-resourced and
marginalised languages. Most of our work has been done on Romance
languages, but we’re actively seeking collaborators for new language
pairs.

Most of our research involves finding methods of speeding up
rule-based machine translation system development in an unsupervised way
(using corpora where available) such that we can decrease the amount of
time it takes to create a new system. One example of our system in use
is La Voz de Galicia (a regional newspaper in Galicia),
which is now printed bilingually (in Spanish and Galician) thanks to
machine translation.

Thanks Fran, and no problem about letting us know more about Apertium. I actually learned about it a little while back from Mikel Forcada. I found interesting the concept of “shallow transfer” MT between closely related languages. However things play out wrt rule-based vs. statistical MT, I’m wondering if this approach might be helpful for work among closely-related minority languages without the resources (corpora) to seriously do statistical based MT.

For instance, I understand that in South Africa, the path for translations into Nguni languages (Zulu, Xhosa, Swati, Northern Ndebele) goes first into one of those tongues and then from that to the others. Having a good “shallow transfer” MT as a tool for this second stage (and other translations between them) would, one imagines, be quite useful.

Having worked on both rule-based MT systems (which I couldn’t get to work properly) and later statistical MT systems (which I did get to work with great success), I am slightly prejudiced towards the statistical side. But I believe that statistical MT has resulted in a break-through in MT and given a breadth and quality of systems that has convinced the ordinary user that “MT may be useful”.

One important side-effect of the move to statistical methods is that knowing the languages being translated is less important than with rule-based systems. So when you say:

“At this time I’d really like to see a program to encourage young computer science students from diverse linguistic backgrounds in developing countries and indigenous communities to get into the field of research on MT.”

you may be making it harder than necessary. Mono-lingual (or non-indigenous) computer scientists (or computational linguists like me) can do most of the work. Of course you need “target language” speakers along the way, but they need not know any computer science.

Thanks Søren, I’m just getting to appreciate the possibilities of statistical based MT. It also apparently has the challenge from the perspective of many less widely spoken or less-resourced languages of requiring corpus resources that are not yet there.

I understand your point regarding how statistical based MT techniques have less need for speakers of the languages in question. I should amend my pitch on this (later) but for the moment would put it into the context that I think it is important to train a new generation of computer scientists and computational linguists from developing countries where so many less-resourced languages are spoken.

Don, your blog entry provides an interesting perspective and I think a thoughtful summary on what is going on with the specific issue of MT as described in that issue of Multilingual. (It would be great if they would let you link to the actual article rather than the abstract, as your blog entry seems to have triggered an interesting sequence of comments and could become more popular over time.)

I would like to offer some opinions that take a step back and put this increasing interest and awareness of MT in some perspective. I offer this as someone who is not a localization industry expert, but as someone who has been in the broad IT industry for 20 years, and recently involved with evangelizing early SMT technology for 3 years at Language Weaver and now promoting exciting new 2nd generation SMT at Asia Online. This should make my bias clear. My intent is not to make authoritative or conclusive remarks here, (though it may seem so sometimes), but simply to offer some initial observations to further the dialogue on where the business translation market is headed.

I think it is useful to look at some of the broad questions to understand what is happening and also perhaps get some initial understanding of why it is happening. This can be done I think by asking a set of questions that raise awareness of the forces at work.

The questions, I think are:

What is being translated? How is this changing?
Who is doing the translating?
How does the translation industry work and how is this changing?
What are the technological advances that are emerging?
What is the combined impact of all these things?

What is being translated? How is this changing?
There are really two worlds here, the enterprise (corporate/government) world and the online user world. The enterprise market tends to focus on documentation, marketing materials and light web content which is all relatively static content but apparently a $10B+ market. The localization industry is focused on optimizing this activity. However, this is changing as more companies realize that there is great value in providing richer and deeper knowledge resources for global customers. Microsoft has led the way with their MT based conversion of their massive knowledge bases, which have millions of users happily using mostly raw SMT. Given the volume of the data, there is no option other than MT to make this content available in the many languages that they do make it available in. Others are following the MSFT model now but the localization industry players so far have added very little to enable and facilitate this. Possibly, some will emerge that learn how to help corporate customers do this. The other big change is that this new content is huge in volume and much more dynamic compared to the traditional content that the localization industry is optimized for. This trend is building momentum as others want to emulate what Microsoft has done. It will likely expand even further into community forums as has already begun at Microsoft. The amount of information in general that asks to be translated is growing at an internet pace, i.e. very rapidly. I suspect the demand for translation will expand to many new kinds of knowledge that facilitate collaboration in global enterprises and continue to increase the engagement with the global customer.

The online users have several “free” translation sites to choose from and the SMT initiative from Google in particular shows that the translation quality is getting better all the time. Millions of users are translating millions of web pages a day and this is rapidly becoming the most common form of translation seen by anybody who is online. This will continue to expand as the increasingly non-English online population (largely Asian) expands. So it is predictable that the languages will expand and the volume of usage will grow. The global enterprise has an opportunity to take control of this trend and ensure that the automated translations in their domain are superior to any available free translation by learning how to effectively use SMT for their own focused domains.

Who is doing the translating?
It is estimated that there are 500,000 to 750,000 professional translators, freelancers out there who make money from performing translation work. They often are working for a Language Service Provider (LSP) or sometimes directly for an enterprise with a multilingual focus. A very small percentage of these people will use tools like TM or other CAT tools but the trend is definitely to try and use some technology leverage to raise productivity. While this group does the bulk of the “high quality” enterprise translations there are many who do not consider themselves professionals who are also competent to do simple translations.

Google estimates that there are 600M+ people on the web who are at least bilingual if not trilingual. Many of these people could be drawn in to offer translation services for things they care about. Microsoft and SUN are examples of how even software programs can be localized with community support. SUN has 69 different language versions of OpenOffice, the web community at large was responsible for creating 60 of these languages. SUN facilitated this through an enlightened management and community engagement approach. This is a trend that will gather momentum and be further accelerated with improving MT systems. The open source model for localization will continue to make more inroads. Facebook is another recent example which shows the community can be drawn into the translation process when it matters to users. Is it possible that Facebook chose the community approach to translating their interfaces because they are really interested in making 200 languages available, not just the standard FIGS CJK configuration that the localization industry is optimized around?

How does the translation industry work and how is this changing?
Very simply put, the localization world is separated into Buyers (typically corporate enterprises) and Vendors (LSPs who help the buyers manage their mostly static content focused translation efforts). Translators scattered across the globe interact with these entities to get the work of enterprise translation done. The industry has developed some efficiencies in how they manage the translation process but relative to a lot of other technology areas in IT, the translation industry is still a very fragmented cottage industry, and still relatively inefficient. The change agents in the industry rarely come from standard localization professional backgrounds and we are at the cusp of seeing new kinds of people entering the industry. A few localization professionals who are driving change are now beginning to drive new translation initiatives, usually going beyond static content and finding content that can drive and enhance international business initiatives. As we all know, while user manuals are necessary, very few users actually ever open them, and so the value of the documentation has generally been closely linked to the status of the localization professional in the organization. This is beginning to change as translation is seen as the means to engage with the global customer and raise customer satisfaction and loyalty. As business line managers enter into the translation focused agenda, the value of translation will rise as the links to global revenue and initiatives are made and better understood. However, the nature of the content that is being translated and is associated with high value is changing and is much more dynamic. The localization industry is unlikely to be able to cope with this without an increasing use of translation automation technology. It is becoming clear even to long time naysayers that the time for MT is now.

What are the technological tools/advances that are emerging?
In general many kinds of CAT and translation automation tools are coming into the market with a few being firmly established. TM tools are perhaps the most widely used automation tool today but this is already evolving into next generation TM called advanced leveraging. Some industry analysts are already predicting that TM will “evolve” into merged TM/MT systems or more likely into next generation TM/SMT solutions. It is somewhat shocking that in an industry that is claimed to be in the $12B range the total market for all software tools is less than $100M according to CSA. The lack of standards and true data interchange makes most of the TMS systems risky investments for the enterprise as products that were core translation management infrastructure last year, could be phased out this year after an acquisition. New TMS systems continue to emerge as the need is great but for the most part they create islands of incompatible data and further complicate data interchange. As people begin to understand that organization of linguistic assets is a high value proposition, it is hoped that stronger standards will emerge and define how the community can share and leverage the translation assets that they have. The TMX 2.0 standard is a good starting point.

While there continues to be debate over whether SMT is better than RbMT, it is clear that the most global companies in the world (MSFT, Google, IBM) have placed a stake in the ground in favor of SMT. The evidence suggests that SMT systems can be built faster, improve rapidly and are much more amenable to rapid feedback and quality improvements from feedback. However, there are some language combinations where RbMT seems to have greater success, e.g. English-Hungarian or English-Korean. The leaders in SMT system development are now adding more linguistic aspects to their methodology and thus should soon be able to handle even difficult pairs like those just mentioned.

Some of the new tools that show some promise include tools that facilitate linguistic asset consolidation, web based collaboration so that tools are centralized and translation processes better managed. The SaaS model is one that is particularly well suited for the emerging web based collaborative model and there should be some interesting new productivity enhancing possibilities around this.

Asia Online has a comprehensive SaaS based infrastructure that facilitates data cleaning/preparation, linguistic asset organization, building custom SMT engines with a tight post-editing and proof reading feedback loop that allows the system to continuously improve and learn. This is Web 2.0 SMT technology and a great leap forward from the first attempts at building SMT systems pioneered by others. This kind of comprehensive infrastructure can be integrated into TMS systems and TM tools used by translators. It will allow a much greater level of control on how SMT engines are built and how users can collaborate to improve the systems on an on-going basis. One of the major promises of SMT is that they can learn and continue to improve, the Asia Online system is infrastructure that makes this possible. This infrastructure makes it possible to undertake translation projects where millions of pages are translated in a few months. It is possible that other new tools that help collect and clean and align data will also emerge in the not so distant future.

There are also tools that help you simplify and clean up the source so that they translate better and more easily for both MT and human translators. This is an area that we can expect will continue to improve but of course as we move into more dynamic content on blogs and user forums these tools will also need to adapt to the language style that you find there.

What is the combined impact of all these things?
I think we are seeing some of the trends at work already as described in Don’s summary. There is clearly a growing shift away from static content to more dynamic content, especially in the customer support area. There is a growing awareness that linguistic assets can be leveraged by SMT and second generation TM technologies. There is a growing awareness of data preparation and consolidation strategies to leverage future translation initiatives. There is a growing willingness to share data even though a workable sharing mechanism has yet to be developed.

As the web moves more from a static content delivery mechanism to a medium to build communities with shared goals it will enable much more interaction between companies and their customers. We are already seeing this in some IT areas and we can expect that this growing dialogue will become more multilingual. Partner/User communities may in future actually create the best documentation since they are the ones with the greatest vested interest and there is already evidence of this. Customers can help to clean up raw MT output to help drive a continuously improving body of knowledge content.

I think you will also see that it will be possible to extend translation beyond the narrow information dissemination role it has had to a true communication model. In a customer support situation it is already possible to conceive of MT systems that enable real time cross lingual chat between a global customer and customer support person.

In the world of patents we see a very heavy concentration of patents filed in English, German and Japanese speaking countries. SMT offers the possibility of extending this important knowledge to many other languages so that other language groups can also participate in the scientific and technical innovation process.

I am sure other readers can come up with many more examples that these forces will enable.

“I think it is useful to look at some of the broad questions to understand what is happening and also perhaps get some initial understanding of why it is happening.”

Thank you Kirti, for a most interesting set of perspectives on the current state of MT and TM. It is useful to explore the field via questions such as you’ve done, and indeed in so doing the follow-on questions become very interesting. It does seem that there are a number of things going on in this area.

The notion of a merging of MT and TM is something I wondered about, as we use large amounts of text to develop SMT (statistical MT) and use TM for managing translation of large amounts of text.
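
A rough sketch of how such a merged TM/MT lookup might work, in Python: try the translation memory first with a fuzzy match, and fall back to MT only when nothing close enough is stored. Everything here is an assumption for illustration only: the toy `TM` entries, the `machine_translate` placeholder (no real engine is called), the 0.8 threshold, and the use of `difflib` character similarity in place of whatever matching real TM tools use.

```python
from difflib import SequenceMatcher

# Toy translation memory: source segment -> stored human translation.
# The entries are made up for illustration.
TM = {
    "Save the file.": "Guarde el archivo.",
    "Close the window.": "Cierre la ventana.",
}


def machine_translate(segment):
    """Hypothetical stand-in for a real MT engine; it only tags the input."""
    return f"[MT draft] {segment}"


def translate(segment, threshold=0.8):
    """Return (text, origin): a TM hit if similar enough, else an MT draft."""
    best_source, best_score = None, 0.0
    for source in TM:
        # Character-level similarity in [0, 1]; an assumed matching scheme.
        score = SequenceMatcher(None, segment, source).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is not None and best_score >= threshold:
        return TM[best_source], "TM"
    return machine_translate(segment), "MT"
```

Calling `translate("Save the files.")` reuses the stored translation of the near-identical segment, while an unrelated sentence falls through to the MT placeholder. In either case a human would still need to review the result, which is the point Cremers makes about both TM and MT.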

Re RBMT (rule-based MT) vs. SMT, another area where the former might do well is with pairs of closely-related languages. Where there is not a significant amount of text in a language with which to develop SMT, RBMT may also have some advantages.