Pages

Tuesday, January 18, 2011

It has been commonly understood by many in the world of statistical MT that success with SMT depends almost completely on the volume of data that you have. The experts at Google, ISI and TAUS have all been saying this for years. There has been some discussion questioning this “only data matters” assumption, but many in the SMT world continue to believe the statement below, because to some extent it has actually been true. Many of us have witnessed the steady quality improvements at Google Translate in our casual use to read an occasional web page (especially after they switched to SMT), but for the most part these MT engines rarely rise above gisting quality.

"The more data we feed into the system, the better it gets..." Franz Och, Head of SMT at Google

However, in an interesting review of the challenges of Google’s MT efforts in the Guardian, we begin to see some recognition that MT is a REALLY TOUGH problem to solve with machines, data and science alone. The article also quotes Douglas Hofstadter, who questions whether MT will ever work as a human replacement, since language is the most human of human activities. He is very skeptical and suggests that this quest to create accurate MT (as a total replacement for human translators) is basically impossible. While I too have serious doubts whether machines will ever learn meaning and nuance at a level that compares with competent humans, I think we should focus on the real discovery here: more data is not always better, and computers and data alone are not enough. MT is still a valuable tool, and if used correctly can provide great value in many different situations. The Google admission, according to this article, is as follows:

“Each doubling of the amount of translated data input led to about a 0.5% improvement in the quality of the output,” and "We are now at this limit where there isn't that much more data in the world that we can use." Andreas Zollmann of Google Translate.

But, Google is hardly throwing in the towel on MT, they will try “to add on different approaches and (explore) rules-based models."

Interestingly, the “more data is better” issue is also being challenged in the search arena. In their zeal to index the world’s information, Google attempts to crawl and index as many sites as possible (because more data is better, right?). However, spammers are creating SEO-focused “crap content” that increasingly shows up at the top of Google searches. (I experienced this first hand myself, when I searched for widgets to enhance this blog. I gave up after going through page after page of SEO-focused crap.) This article describes the impact of this low-quality content created by companies like Demand Media, and it is summarized succinctly in the quote below.

Searching Google is now like asking a question in a crowded flea market of hungry, desperate, sleazy salesmen who all claim to have the answer to every question you ask. Marco Arment

But getting back to the issue of data volume and MT engine improvements, have we reached the end of the road? I think this is possibly true for some languages, i.e. data-rich languages like French, Spanish and Portuguese, where it is quite possible that tens of billions of words already underlie the MT systems. It is not necessarily true for sparse-data languages, or languages with a smaller presence on the net (pretty much anything other than FIGS and maybe CJK), and we will hopefully see these other languages continue to improve as more data becomes available. In the graphic below we can see a very rough and generalized relationship between data volume and engine quality. I have a very rough estimate of the Google scale on top, and a lower data-volume scale for customized systems at the bottom (generally focused on a single domain), where less is often more.

Ultan O’Broin provides an important clue (I think anyway) for continued progress: “There's a message about information quality there, surely.” At Asia Online we have always been skeptical of the “more data is better” view and have ALWAYS claimed that data quality is more important than volume. One of the problems created by large-scale automated data-scraping is that it is more than possible to pick up large amounts of noise, digital dirt or just plain crap through this approach. Early SMT developers all used crawler-based web-scraping techniques to acquire the training data to build their baseline systems. We have all learned by now, I hope, that it is very difficult to identify and remove noise from a large corpus, since by definition noise is random and unidentifiable through automated cleaning routines, which can usually only target known patterns. (It is interesting to see that “crap content” also undermines the search algorithms, since machines (i.e. spider programs) don’t make quality judgments on the data they crawl. Thus Google can, and does, easily identify crap content as the most relevant and important content, for all the wrong reasons, as Arment points out above.)
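To make concrete why automated cleaning only catches known patterns, here is a minimal sketch of the kind of heuristic filters typically applied to web-scraped parallel data. The specific rules and thresholds are my own illustrative assumptions, not anything Asia Online or Google has published; the point is that each filter targets a known noise pattern, so random noise slips straight through.

```python
import re

def looks_clean(src, tgt, max_ratio=3.0):
    """Heuristic filters for a scraped sentence pair.

    Each check targets a *known* noise pattern (markup residue,
    untranslated copy-through, wild length ratios); noise that
    fits none of these patterns passes undetected.
    """
    if not src.strip() or not tgt.strip():
        return False
    # Untranslated copy-through: target identical to source.
    if src.strip() == tgt.strip():
        return False
    # Leftover markup or encoding debris.
    if re.search(r"<[^>]+>|&#\d+;|\ufffd", src + tgt):
        return False
    # Implausible length ratio between the two sides.
    ls, lt = len(src.split()), len(tgt.split())
    if ls == 0 or lt == 0 or max(ls, lt) / min(ls, lt) > max_ratio:
        return False
    return True

pairs = [
    ("The cat sat on the mat.", "Le chat est assis sur le tapis."),
    ("Click <a href='x'>here</a>", "Cliquez ici"),
    ("Hello world", "Hello world"),
]
clean = [p for p in pairs if looks_clean(*p)]
```

Running this keeps only the first pair; the markup-laden and copy-through pairs are dropped, but a fluent sentence mistranslated by a sloppy scraper would sail through every filter.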

Though corporate translation memories (TM) can sometimes be of higher quality than web-scraped data, TM also tends to gather digital debris over time. This noise comes from a) tools vendors who try to create lock-in situations by adding proprietary meta-data to the basic linguistic data, b) the lack of uniformity between human translators and c) poor standards that make consolidation and data sharing highly problematic. In a blog article describing a study of TAUS TM data consolidation, Common Sense Advisory describes this problem quite clearly: “Our recent MT research contended that many organizations will find that their TMs are not up to snuff — these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.”

So what is the way forward, if we still want to see ongoing improvements?

I cannot really speak to what Google should do (they have lots of people smarter than me thinking about this), but I can share the basic elements of a strategy that I see clearly working to produce continuously improving customized MT systems developed by Asia Online. It is much easier to improve customer-specific systems than a universal baseline.

Make sure that your foundation data is squeaky clean and of good linguistic quality (which means that linguistically competent humans are involved in assessing and approving all the data that is used in developing these systems).

Normalize, clean and standardize your data on an ongoing and regular basis.

Focus 75% of your development effort on data analysis and data preparation.

Focus on a single domain.

Understand that dealing with MT is more akin to interaction with an idiot-savant than with a competent and intelligent human translator.

Involve competent linguists through various stages of the process to ensure that the right quality focused decisions are being made.

Use linguistically informed development strategies, as pure data-based strategies are only likely to work to a point.

For language pairs with very different syntax, morphology and grammar it will probably be necessary to add linguistic rules.

Use linguists to identify error patterns and develop corrective strategies.

Understand the content that you are going to translate and understand the quality that you need to deliver.

Clean and simplify the source content before translation.

And if quality really matters, always use human validation and review.
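The normalization and cleaning steps recommended above could be sketched, very roughly, as a small pipeline. The function names and normalization rules here are my own assumptions for illustration; a real workflow would add the human linguistic review the post insists on at each stage.

```python
import re
import unicodedata

def normalize(segment):
    """Standardize a TM segment before training (illustrative rules)."""
    s = unicodedata.normalize("NFC", segment)            # one Unicode form
    s = re.sub(r"\s+", " ", s).strip()                   # collapse whitespace
    s = s.replace("\u201c", '"').replace("\u201d", '"')  # straighten quotes
    return s

def prepare_corpus(pairs):
    """Normalize, deduplicate, and drop empty pairs.

    A real pipeline would insert human assessment of the surviving
    segments here, per the recommendations above.
    """
    seen, out = set(), []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue                    # drop empty sides
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue                    # drop exact duplicates
        seen.add(key)
        out.append((src, tgt))
    return out

corpus = prepare_corpus([
    ("Press  the \u201cStart\u201d button", "Appuyez sur le bouton"),
    ("press the \"start\" button", "appuyez sur le bouton"),
    ("", "vide"),
])
```

After normalization the first two segments collapse to the same pair and the empty-source segment is dropped, leaving a single clean entry; on real TM data this kind of consolidation is exactly where inconsistent terminology and vendor meta-data surface.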

All of this could be summarized simply as, make sure that your data is of high quality and use competent human linguists throughout the development process to improve the quality. This is true today and will be true tomorrow.

I suspect that effective man-machine collaborations will outperform pure data-driven approaches in the future, as we are already seeing with both MT and search, and I would not be so quick to write off Google. I am sure that they can still find many ways to continue to improve. As long as the 6 billion people not working in the professional translation industry care about getting access to multilingual content, people will continue to try and improve MT. And if somebody tells you that machines can generally outperform or replace human translators (in 5 years no less), don’t believe them (but understand there is great value in learning how to use MT technology more effectively anyway). We have quite a way to go yet till we get there, if ever at all.

I recall a conversation with somebody at DARPA a few years ago, who said that the universal translator in Star Trek was the single most complex piece of technology on the Starship Enterprise, and that mankind was likely to invent everything else on the ship, before they had anything close to the translator that Captain Kirk used.

MT is actually still making great progress, but it is wise to always be skeptical of the hype. As we have seen lately, huge hype does not necessarily lead to success, as Google Buzz and Wave have shown.

8 comments:

Replacing the human translator with a machine translator is impossible because we do not have an exact and complete understanding of what the human translators are doing. In fact, we do not yet have an exact and complete understanding of human communication. So, I believe that replacing a human translator with a machine translator is currently impossible, but only temporarily. Once we understand exactly what we are doing, then we can program the computer to do the same.

The choice of links and quotes to support this blog post is somewhat inconsistent, although the general argument appears balanced.

You provide two links to TAUS articles to illustrate the supposed TAUS view that more data is always better. Yet the first link (“the volume of data that you have.”) leads to an article from March 2009 which states:

“No, the key is really how to put one and one together, in other words how to get the best results out of the combination of the data and the engine.”

“Yes, data sharing is beneficial, but equally (if not more) important is it to have ‘clean' data. And... wouldn't it be great if we all used the same terminology if we refer to the same thing? Things we are learning, and putting in practice. In this latest research TAUS is getting technical and practical for our members who want to get their hands ‘dirty'.”

The second link (“TAUS”) leads to an article from May 2009, which describes the Google approach, congratulates them, and provides a comparison to TAUS Data Association:

“We share the vision but execute differently. TDA works with trusted translations only and classifies data by owner, industry, domain and content type to give a significant quality push for MT for members who can pool data within their industry. This classification will also support cross-lingual industry-specific taxonomies and search capabilities in an effective way. TDA respects IP rights and is owned by its members. TDA members maintain the data and watch over the quality.”

From these articles it’s clear that the TAUS perspective and execution are not as this blog post characterizes them.

Later in the post you use a quote from Common Sense Advisory on sharing TM data in TDA to highlight the problem ‘eloquently’.

You do not provide links to any of the use cases on the TDA site or other sites which prove that shared data is beneficial for training customized MT engines.

You do not use any of the quotes from the TDA homepage. One example:

"We joined TDA because it’s the first of its kind and the prominence of other like minded members. We believe from this group great ideas will spring." Paula Shannon Lionbridge, 2008

Given the increasing number of TM sharing offerings in the market, we might reasonably conclude that there’s some value to be gained.

You don’t quote practitioners:

"We're expecting substantial quality improvements in our and competitors’ machine translation engines, as vast amounts of domain-specific data become available through TDA. It's likely there'll be issues with leveraging multiple TMs if terminology and style are too different. And so certain pre-selection, normalization and standardization processes may be needed." Manuel Herranz, Pangeanic, 2009, TAUS Innovation and Interoperability Report

My focus in the blog was on “Is more data enough to keep GOOGLE MT systems improving?” and “Have we reached the limit of that view?” My point of focus was the Guardian article and the reactions to it from @localization and @translationguy in particular.

The post does not undermine or question the TAUS / TDA mission in any way, which I am aware can, and in fact has, added value for some if not many users and members. In the Asia Online TDA data consolidation study we indicated very clearly that when the TDA TM data is cleaned, standardized and normalized, pooling data does make sense and does produce better SMT systems. We did this in gory detail in a 50-page report that describes exactly what Manuel is saying in the quote you have provided. However, we also warn that TM in general can be noisy and thus less useful when this is the case, as several TAUS members have also pointed out at AMTA and TDA meetings.

My intention in providing the links is to show that all of us (especially Google) initially believed that data volume was enough to make continuing progress. And anybody who followed the links would see many of the quotes you provide above, and also see that this “more data is better” notion can be nuanced sometimes as you rightly point out.

The data vs. quality chart also shows that systems improve with more data TO A POINT. Most people will need to pool data to achieve anywhere close to those volumes. Google is the first to really hit the volume limits. I suspect that they probably have more data in English-Finnish alone than TDA has in total. In building customized systems, most of us are not close to the volume where this is really an issue.

I did not include TDA member comments on why they joined, because I am not sure how this is really relevant to the core focus of the post.

The CSA comments also validate the “volume is not enough” issue and simply that quality and consistency also matter. (BTW I simply state that they say this clearly). The links enable people to see the comment in its original form (without bias I may introduce) and I have simply used a selective quote to highlight the quality issue.

I think we are all learning that hype is likely to create disappointment with MT, SMT, Speech-2-Speech systems, Google Buzz, Google Wave and even the TDA.

While the TDA helps some people who need TM, this is not enough for really good SMT systems; it is also necessary to have huge volumes of monolingual data. The big discovery from the Guardian article, from my viewpoint, is that there is one more huge reason to focus on data quality, as we now understand that data volume alone may not be enough. Quality matters too and may matter more.

Interesting discussion, with two separate aspects, philosophy and practicalities. I fully agree with the author’s recommendations to prep linguistic assets before incorporation into any serious corpus or project.

On the other hand, the conversation about the extent to which increases in the size of linguistic corpora drive MT quality improvements reminds me of the speed of light: it is theoretically possible, but unlikely.

In strict adherence to the Pareto principle, you can achieve gisting quality with relative ease. Moving beyond that point seems to become exponentially more difficult. There may be movement, but it is so small that it is probably difficult to identify.

That said, I am sure that MT will replace translators... in certain uses. It is already happening. While people may not buy a novel translated using MT, millions of web denizens are using online translation systems to answer some basic questions about information presented to them in other languages. At that level, where the goal is to get some basic information for no cost, MT is in fact replacing translators. Or to be more precise, MT is providing a service that did not exist before: people went without it because professional translation was too expensive for information not deemed critical enough to pay for.

Finally, let us remember that human translators, the gold standard of quality, make mistakes... just like software programs. In the case of MT vs. human translation, mistakes are a question of number and caliber, not an either/or.

I posted this on the LinkedIn group. I will paste it here for the general public as well.

Whilst data is paramount to any SMT system, not all types of data are relevant to a specific SMT application. Better-harvested data has proved to provide better results than "any" type of data. By "better-harvested" I mean data which is relevant to the domain for which the system is designed. Take away anything that is not relevant for my domain (biology, software, technical, etc.). Post-editors can live with words left in the original language because they are bilingual, so if the engine does not have the word "mule" or "snail" in software, nor "network" or "folder" in biology, it is probably because it was never needed until that time. Picking the right data sets is very important, and from there, the more the better. Since SMT systems work on n-grams, repetitions also seem to have a good impact, apparently.
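The point above about picking domain-relevant data is often implemented by scoring candidate sentences against a small trusted in-domain sample using simple n-gram overlap. A minimal sketch follows; the scoring scheme and the toy sentences are my own illustration, not something from the comment (production systems typically use more sophisticated methods such as language-model cross-entropy).

```python
from collections import Counter

def ngrams(text, n=2):
    """Lowercase word n-grams of a sentence."""
    toks = text.lower().split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def domain_score(sentence, in_domain_counts, n=2):
    """Fraction of the sentence's bigrams seen in the in-domain sample."""
    grams = ngrams(sentence, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if in_domain_counts[g] > 0)
    return hits / len(grams)

# Tiny trusted in-domain (software-documentation) sample.
in_domain = Counter(g for s in [
    "open the folder on the network drive",
    "save the file to the folder",
] for g in ngrams(s))

candidates = [
    "copy the file to the folder",          # in-domain phrasing
    "the snail crossed the garden slowly",  # out-of-domain
]
ranked = sorted(candidates,
                key=lambda s: domain_score(s, in_domain),
                reverse=True)
```

Here the software-style sentence scores 0.8 (four of its five bigrams appear in the sample) while the out-of-domain one scores 0.0, so harvesting would keep the former and discard the latter.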

I have heard from a lot of Moses users, for example, that additional effort had to be paid to the "building" or "self-generation" of texts which helped the statistical system decide one way or another. This supports the point that "post-edited material improves the scores": indeed, because you are feeding similar material (and repetitions) into the engine. I deal with this briefly in my latest blog entry http://pangeanic.wordpress.com/2010/12/12/moses-is-not-the-new-messiah/ stating the limitations and also the good points of Moses, which has become a de facto standard in the industry, much as GT has.

The issue with GT, as acknowledged by most statisticians, is that it is too big, too general, although the output in some languages (particularly French and Spanish) is becoming very good. Many companies need an internal system as they do not want their pre-published documentation flying over the Internet. Most translators do not care and use plug-ins to web tools (mainly GT).

Maybe the "the more data, the better" concept should be modified to "the more GOOD data, the better". This is at least what Asia Online machine translation specialists are saying. Their approach is to concentrate on a "small volume of clean data" instead of a "large volume of dirty data". Please read more at their web site: open http://www.asiaonline.net/technology.aspx and search for "unique approach".