
Thursday, June 8, 2017

For as long as I have been engaged with the professional translation industry, I have seen great confusion and ambiguity around the concept of "translation quality". This is a services industry in which nobody has been able to coherently define "quality" in a way that makes sense to a new buyer and potential customer of translation services. It is also, unfortunately, the basis of many of the differentiation claims made by translation agencies in competitive situations. Is it surprising, then, that many buyers of translation services are mystified and confused about what this really means?

To this day, my sense is that the best objective measures of "translation quality", imperfect and flawed though they may be, come from the machine translation community. The computational linguistics community has very clear definitions of adequacy and fluency that can be reduced to a number, with the tidy order that mathematics provides. The translation industry, however, is reduced to confusing discussions in which, ironically, the very words and terms used in the descriptions are ambiguous and open to multiple interpretations. It is hard to simply say, "We produce translations that are accurate, fluent, and natural," since we have seen that these words mean different things to different people. To add to the confusion, discussions of translation output quality are often conflated with translation process issues. I maintain that the most articulate and generally useful discussion of this issue comes from the MT and NLP communities.

I feel compelled to provide something on this subject below that might be useful to a few, but I acknowledge that this remains an unresolved issue that undermines the perceived value of the primary product this industry produces. Here are the basic criteria that a Translation Service Provider offering a quality service should fulfill:

a) Translation

Correct transfer of information from the source text to the target text.

Appropriate choice of terminology, vocabulary, idiom, and register in the target language.

Appropriate use of grammar, spelling, punctuation, and syntax, as well as the accurate transfer of dates, names, figures, etc. in the target language.

Appropriate style for the purpose of the text.

b) Work process

Certification in accordance with national and/or international quality standards.

Gábor Ugray provides an interesting perspective on "Translation Quality" below, and again raises some fundamental questions about the value of newfangled quality assessment tools when we have yet to clarify why we do what we do. He also provides very thoughtful guidance on the way forward and suggests some things that, IMO, might actually improve the quality of the translation product. Quality definitions based on error counts and the like are possibly useful to the dying bulk market, as Gábor points out; as he says, "real quality" comes from clarifying intent, understanding the target audience, long-term communication and writing experience, and from new in-situ, in-process tools that enhance the translator's work and the knowledge gained through execution. Humans learn and improve by watching carefully when they make mistakes (how, why, where), not by keeping really accurate counts of errors made.

We desperately need new tools that go beyond the TM and MT paradigm as we know it today, tools that really understand what might be useful and valuable to a translator or an evolving translation process. Fortunately, Gábor is in a place where he might get some people to listen to these new ideas, and even try new implementations that actually produce higher quality. The emphasis and callouts in his post below are almost all mine.

================

An idiosyncratic mix of human and machine translation might be the key to tracking down the notorious ransomware, WannaCry. What does the incident tell us about the translating profession's prospects? A post on – translation quality.

Quality matters, and it doesn’t

Flashpoint’s stunning linguistic analysis[1] of the WannaCry malware was easily the most intriguing piece of news I read last week (and we do live in interesting times). This one detail by itself blows my mind: WannaCry’s ransom notice was dutifully localized into no less [2] than 28 languages. When even the rogues are with us on the #L10n bandwagon, what other proof do you need that we live in a globalized age?

But it gets more exciting. A close look at those texts reveals that only the two Chinese versions and the English text were authored by a human; the other 25 are all machine translations. A typo in the Chinese suggests that a Pinyin input method was used. Substituting 帮组 bāngzǔ for 帮助 bāngzhù is indicative of a Chinese speaker hailing from a southern topolect. Other vocabulary choices support the same theory. The English, in turn, “appears to be written by someone with a strong command of English, [but] a glaring grammatical error in the note suggests the speaker is non-native or perhaps poorly educated.” According to Language Log[3], the error is “But you have not so enough time.”

I find all this revealing for two reasons. One, language matters. With a bit of luck (for us, not the hackers), a typo and an ungrammatical sentence may ultimately deliver a life sentence for the shareholders of this particular venture. Two, language matters only so much. In these criminals’ cost-benefit analysis, free MT was exactly the amount of investment those 25 languages deserved.

This is the entire translating profession’s current existential narrative in a nutshell. One, translation is a high-value and high-stakes affair that decides lawsuits; it’s the difference between lost business and market success. Two, translation is a commodity, and bulk-market translators will be replaced by MT real soon. Intriguingly, the WannaCry story seems to support both of these contradictory statements.

Did the industry sidestep the real question?

I remember how 5 to 10 years ago panel discussions about translation quality were the most amusing parts of conferences. Quality was a hot topic and hotly debated. My subjective takeaway from those discussions was that (a) everyone feels strongly about quality, and (b) there’s no consensus on what quality is. It was the combination of these two circumstances that gave rise to memorable, and often intense, debates.

Fast-forward to 2017, and the industry seems to have moved on from this debate, perhaps admitting through its silence that there’s no clear answer.

Or is there? The heated debates may be over, but quality assessment software seems to be all the rage. There’s TAUS’s DQF initiative[4]. Its four cornerstones are (1) content profiling and knowledge base; (2) tools; (3) a quality dashboard; (4) an API. CSA’s Arle Lommel just wrote [5] about three new QA tools on the block: ContentQuo, LexiQA, and TQAuditor. Trados Studio has TQA, and memoQ has LQA, both built-in modules for quality assessment.

I have a bad feeling about this. Could it be that the industry simply forgot that it never really answered the two key questions, What is quality? and How do you achieve it? Are we diving headlong into building tools that record, measure, aggregate, compile into scorecards and visualize in dashboards, without knowing exactly what and why?

A personal affair with translation quality

I recently released a pet project, a collaborative website for a German-speaking audience. It has a mix of content that’s partly software UI, partly long-form, highly domain-specific text. I authored all of it in English and produced a rough German translation that a professional translator friend reviewed meticulously. We went over dozens of choices ranging from formal versus informal address to just the right degree of vagueness where vagueness is needed, versus compulsive correctness where that is called for.

How would my rough translation have fared in a formal evaluation? I can see the right kind of red flags raised for my typos and lapses in grammar, for sure. But I cannot for the life of me imagine how the two-way intellectual exchange that made up the bulk of our work could be quantified. It’s not a question of correct vs. incorrect. The effort was all about clarifying intent, understanding the target audience, and making micro-decisions at every step of the way in order to achieve my goals through the medium of language.

Lessons from software development

The quality evaluation of translations has a close equivalent in software development.

With the latest surge of quality tools, CAT tools now have quality metrics based on input from human evaluators. Software developers have testers, bug tracking systems and code reviews that do the same.

But that’s where the similarities end. Let me let you in on a secret. No company anywhere evaluates or incentivizes developers through scorecards that show how many bugs each developer produced.

Some did try, 20+ years ago. They promptly changed their mind or went out of business.[6]

Ugly crashes notwithstanding, the software industry as a whole has made incredible progress. It is now able to produce more and better applications than ever before. Just compare the experience of Gmail or your iPhone to, well, anything you had on your PC in the early 2000s.

The secret lies in better tooling, empowering people, and in methodologies that create tight feedback loops.

Tooling, empowerment, feedback

In software, better tooling means development environments that understand your code incredibly well, give you automatic suggestions, allow you to quickly make changes that affect hundreds of files, and to instantly test those changes in a simulated environment.

No matter how you define quality, in intellectual work, it improves if people improve. People, in turn, improve through making mistakes and learning from them. That is why empowerment is key. In a command-and-control culture, there’s no room for initiative; no room for mistakes; and consequently, no room for improvement.

But learning only happens through meaningful feedback. That is a key ingredient of methodologies like agile. The aim is to work in short iterations; roll out results; observe the outcome; adjust course. Rinse and repeat.

Takeaways for the translation industry

How do these lessons translate (no pun intended) to the translation industry, and how can technology be a part of that?

The split. It’s a bit of an elephant in the room that the so-called bulk translation market is struggling. Kevin Hendzel wrote about this in very dramatic terms in a recent post[7]. There is definitely a large amount of content where clients are bound to decide, after a short cost-benefit analysis, that MT makes the most sense. Depending on the circumstances, it may be generic MT or the more expensive specialized flavor, but it will definitely not be human translators. Remember, even the WannaCry hackers made that choice for 25 languages.

But there is, and will always be, a massive and expanding market for high-quality human translation. Even from a purely technological angle, it’s easy to see why MT systems don’t translate from scratch. They extrapolate from existing human translations, and those need to come from somewhere.

My bad feeling. I am concerned that the recent quality assessment tools make the mistake of addressing the fading bulk market. If that’s the case, the mistake is obvious: no investment will yield a return if the underlying market disappears.

Source: TAUS Quality Dashboard [link]

Why do I think that is the case? Because the market that will remain is the high-quality, high-value market, and I don’t see how the sort of charts shown in the image above will make anyone a better translator.

Let’s return to the problems with my own rough translation. There are the trivial errors of grammar, spelling and the like. Those are basically all caught by a good automatic QA checker, and if I want to avoid them, my best bet is a German writing course and a bit of thoroughness. That would take me to an acceptable bulk translator level.

As for the more subtle issues – well, there is only one proven way to improve there. That way involves translating thousands of words every week, for 5 to 10 years on end, and having intense human-to-human discussions about those translations. With that kind of close reading and collaboration, progress doesn’t come down to picking error types from a pre-defined list.

Feedback loops. Reviewer-to-translator feedback would be the equivalent of code reviews in software development, and frankly, that is only part of the picture. That process takes you closer to software that is beautifully crafted on the inside, but it doesn’t take you closer to software that solves the right problems in the right way for its end users. To achieve that, you need user studies, frequent releases and a stable process that channels user feedback into product design and development.

Imagine a scenario where a translation’s end users can send feedback, which is delivered directly to the person who created that translation. I’ll let you in on one more secret: this is already happening. For instance, companies that localize MMO (massively multiplayer online) games receive such feedback in the form of bug reports. They assign those straight to translators, who react to them in a real-time collaborative translation environment like memoQ server. Changes are rolled out on a daily basis, creating a really tight and truly agile feedback loop.

Technology that empowers and facilitates. For me, the scenario I just described is also about empowering people. If, as a translator, you receive direct feedback from a real human, say a gamer who is your translation’s recipient, you can see the purpose of your work and feel ownership. It’s the agile equivalent of naming the translator of a work of literature.

If we put metrics before competence, I see a world where the average competence of translators stagnates. Instead of an upward quality trend throughout the ecosystem, all you have is a fluctuation, where freelancers are data points that show up on this client’s quality dashboard today, and a different client’s tomorrow, moving in endless circles.

I disagree with Kevin Hendzel on one point: technology definitely is an important factor that will continue to shape the industry. But it can only contribute to the high-value segment if it sees its role in empowerment, in connecting people (from translators to end users), in facilitating communication, and in establishing tight and actionable feedback loops. The only measure of translation quality that everyone agrees on, after all, is fitness for purpose.

Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at jealousmarkup.xyz and tweets as @twilliability.

25 comments:

"The computational linguistics community have very clear definitions of adequacy and fluency"

I can't see them, unless you are thinking of the academic definitions, which are hard to measure quantitatively.

The problem with BLEU is not the method itself but the reference samples, which cannot help but be biased: "perfection" is assumed to be human because the references are chosen by humans, and edit distance, even at the segment level, is ineffective for measuring intrinsic quality. By the way, neither of you has given a definition of translation quality yet.

In fact, register and style, which are supposed to affect adequacy and fluency, are vague and ambiguous concepts, and hard to measure quantitatively.

All this is one of the reasons I reach for my gun when I hear quality associated with translation, even more so after 35 years in this sector.

And yes, the industry may have forgotten to answer the two key questions of what quality is and how to achieve it, because it has been doped for years with the intrinsically artistic nature of translation, even when it is an industrial-like business. We use the word 'industry,' don't we?

As for QA tools, the many false positives they produce, which are in fact impossible to eliminate, show that they are still far from being tools that can really make a difference. On the other hand, aren't they all based on Levenshtein distance?
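The commenter's question about Levenshtein distance is easy to make concrete. Below is a minimal illustrative sketch (not the algorithm any particular QA tool is confirmed to use) of the classic dynamic-programming edit distance, along with an example of why it is a blunt instrument for translation quality: two segments that a human reads as equivalent can still sit far apart in edit-distance terms.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance: the minimum
    number of single-character insertions, deletions, and
    substitutions needed to turn a into b."""
    # prev[j] holds the distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
# Same words, reordered: meaning survives, but the character-level
# distance is large -- one source of false positives in checks
# built purely on edit distance.
print(levenshtein("the cat sat", "sat the cat"))
```

The point of the second call is the commenter's: a consistency check that scores surface-level distance has no notion of meaning, so legitimate rephrasings get flagged alongside genuine errors.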

One last comment on the absurdity of the premium vs. bulk market distinction. Translation tools like memoQ exist thanks to the so-called bulk market, which is actually made up primarily of translation shops.

The so-called "premium market" is a hoax; it does not exist. First, if it did exist, it would be a segment, not a market. Second, there are possibly 'premium' customers, paying more and, most importantly, better than ordinary customers (i.e. translation shops), but they are hard to find and retain.

Have you ever noticed that all those who beset colleagues, especially the younger ones, with this idea carefully avoid providing a single proof of the existence of this market, or showing that they benefit from it?

Much of the discussion on translation quality is frustrated by the vagueness of the concept in the translation industry generally, so perhaps I do overstate the precision of the definitions in the CL community. However, compared to the super-vague discussion in the translation industry, this: https://www.taus.net/academy/best-practices/evaluate-best-practices/adequacy-fluency-guidelines

and this, which attempts to get A & F scores without a reference set: http://www.aclweb.org/anthology/P/P11/P11-2027.pdf, are dramatically clearer than anything in the "industry" on what adequacy and fluency mean, and how they could be measured.

I did actually provide a lame-ish attempt at a definition, just under my comment: "I feel compelled to provide something on this subject". I agree it is pretty bad.

And BLEU with multiple references (4 if possible) is actually quite useful if this is properly done, which is often the rub.
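The point about BLEU with multiple references can be sketched in a few lines. This is a deliberately simplified, illustrative BLEU: modified n-gram precision clipped against the best count in any reference, combined by geometric mean, times a brevity penalty. Real implementations use n-grams up to 4 plus smoothing, so treat the numbers as illustrative only; what matters is how extra references give the hypothesis more legitimate ways to match.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, references, max_n=2):
    """Simplified sentence-level BLEU with multiple references.
    Each hypothesis n-gram count is clipped by its maximum count
    in any single reference (the 'modified precision' idea)."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        # best count for each n-gram across all references
        max_ref = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    # brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(round(bleu("the cat is on the mat", refs), 3))  # 1.0 -- exact match
print(round(bleu("a cat sat on a mat", refs), 3))     # 0.365 -- partial overlap
```

Note how the second hypothesis gets credit for "a" only because the second reference contains it: with a single reference, scores are hostage to that one translator's word choices, which is exactly why multiple references, properly collected, make BLEU more useful.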

This is a messy issue to discuss since the discussion generates so little of value, but that may be exactly why we need to keep trying until we find a way to do this in a way that is actually useful and valuable to many. Remember that MT is a path that is littered with failure -- repeated and frequent failure. Yet we continue to try.

Thank you for your opening comments that help take the discussion forward. I think Gábor makes some very interesting points, from whichever angle in the industry we happen to be looking at this topic.

I'd like to make an observation on your comment about QA tools. The many false positives that QA tools traditionally generate can be decreased substantially enough to make a difference in the QA process. The "secret recipe" is to employ locale-specific checks that would apply to every target locale. The challenge with this method is that it takes a lot of time to develop and tailor and refine these algorithms in order to have better results. However, there can be better results and I can vouch for that through my personal experience at lexiQA.

I should also note that we are not using Levenshtein distance metrics to achieve that. Mainstream QA tools use that measurement as the focus of their checks is consistency both at the word/unit level and the segment level. However, consistency checks alone will obviously produce a very high FP rate. This is a topic that I also happen to discuss extensively in a recent article series about linguistic QA (https://www.linkedin.com/pulse/future-linguistic-quality-assurance-localization-vassilis-korkas).

Hmmm, two contrasting articles in a single blog post. I enjoyed Gábor's article, but I am mystified by Kirti's introduction, and especially the regret expressed about the lack of an agreed definition of quality.

Can we agree on what is good music? On good taste in fashion? On the best political convictions, economic policies, product design or other areas? Why should we be surprised about our lack of agreement on what translation quality means?

I am even more mystified by Luigi's insistence on the primacy of measurable factors (e.g. in his suggestion that style and register are not measurable and therefore non-existent), about his assumption that we all regard translation as an "industry", and even more so about his outright rejection of the premium market (or segment, or whatever he may call it) simply because he has not experienced it.

Translation is not a single entity. There are many different use cases, skill levels, production scenarios, career patterns and requirements, and in practice the translation sector is broken down into many different markets. We cannot DEFINE what quality is in a once-and-for-all statement (and still less in an authoritative formula). We can DESCRIBE various types of translation project in terms of the use case needs, the possible tools, the requirements for the standard of the output etc. But to do this, we need to use WORDS. And some people in the translation scene are not comfortable using words - they can only make sense of the discussion by using NUMBERS.

And that, in itself, is an interesting pointer to the vested interests in the debate.

“Why should we be surprised about our lack of agreement on what translation quality means?”

The ability to define quality in some way IMO is necessary for most professional services which expect to build an ongoing customer base. If you consider other services e.g. hotels (temporary stay), haircuts, accounting services you may realize that there are a range of options that vary in price and UNDERSTOOD QUALITY. This ability to understand the differences in quality makes it easier for a customer to decide amongst available options. The following are not the best examples but I think illustrate the general difficulty.

Consider an accounting-firm scenario: Deloitte or KPMG vs. Uncle Joe’s Accounting Svcs. Most customers will understand that the big firms provide comprehensive accounting services, with carefully selected service personnel, to large companies, but Uncle Joe may be much better for small companies – e.g. mechanics, pizza stores, and other small businesses that need to file tax returns and maintain basic accounting records (book-keeping). Several quality indicators are available beyond just price, because the end product has a defined quality that is understood by the buyer.

In translation, the quality discussion and definition are much more muddy and vague when the product is translated output. A $100M software company will have difficulty choosing between a large LSP and a small one to translate documentation, even though the end products could be quite different and vary widely in quality, and it may take bad customer feedback to find this out. This is difficult for a buyer who does not have a localization team, and I think it can scare many potential customers away. When a customer realizes that most LSPs use freelance translators who vary in competence and quality, that can also raise concerns. Probably the safest solution for a buyer is to identify competent, high-quality translators and work directly with them, which happens in some very specialized segments. But we all know how difficult that is.

This inability to define quality in a way that is well understood by all kinds of buyers also lowers barriers to entry and creates “industry” fragmentation – while Uncle Joe needs a CPA to be taken seriously, essentially anyone can claim to be the CEO of an LSP because he has a few translator friends he can broker, or simply because he is functionally bilingual.

Because you are supposed to be in business to make a living, aren't you? In this case, "The ability to define quality in some way IMO is necessary for most professional services which expect to build an ongoing customer base." Possibly you are an amateur, translating for passion and love, not for money. In that case, even your first comment should be read from a totally different perspective.

Victor, perhaps it is not possible to define this in a way that makes sense to buyers. Thus, I said it is not, or should not be, surprising to any observer that buyers are confused. This inability may also be the root of the lack of respect that many in the industry feel. If the people who pay you to do what you do cannot tell when you have done a good job or a bad job, and have little to measure this by, I think it has direct implications for the general value of the service, price, professional status, etc. In contrast, the MT community can talk about BLEU scores and ongoing improvement, and even though we all know that BLEU is flawed, it is still a consistent and persistent attempt to explain quality improvements that buyers can perhaps grasp more easily than most other translation quality descriptions.

Hmm, another ambiguous term. Your "buyers" are completely different from my clients, who know what they want (someone who can handle their specialised subject matter, understand what they are saying in German, and put it into good English). Many of them have done study courses in English-speaking countries, but still prefer to give their translation work to me as a native speaker. At present they are not confused about what they want, but that would probably change if you tried to explain BLEU (or any other standardised numerical system) to them.

I use the word buyer in a broader sense, not for an established client who understands that Victor is reliable and produces acceptable quality. The NEW buyer (who may not be bilingual), who wants to translate content but has never done this before, is very likely going to need to know what he will get before he begins. Even your client has a tacit quality level that needs to be met, and I am sure you would hear complaints if you sent him MT output instead of your own work. To appeal to NEW buyers, it is necessary to have a meaningful and clear conversation about the quality of the delivered product.

Viktor, everything is measurable, more or less loosely. Machine learning is now enabling even the measurement of style (I wrote about this in my blog and in the TAUS Review). Anyway, as long as you provide no objective criteria, style and register remain so subjective as to be non-significant for any measurement whatsoever. And this is exactly what has been happening in translation for centuries, and it is still reinforced every day in translation courses. The distinction between market and segment is very precise in economics. It is a terminological issue in this respect, not a matter of opinion. And please, don't assume: when you assume... You have no grounds to say that I have experienced no premium customers. Actually, in 35 years, I have won and lost more than one. As for a once-and-for-all definition of quality, I'm sorry, there actually is one. You can find it in any of Crosby's articles, in the relevant ISO standards, and elsewhere. What you are doing when you try to "describe the various types of translation projects" is actually outlining requirements. Finally, I'm sorry that you do not value numbers, but this is not even an obsolete stance; it is dull.

You write: "as long as you provide no objective criteria, style and register remain so subjective as to be non-significant for any measurement whatsoever. And this is exactly what has been happening in translation for centuries." In other words, you regard "measurement" as the prime standard by which everything else must be judged, the "holy grail" of translation. That may be relevant to some types of translation project, especially if you are working on the refinement of MT solutions. But in the type of work I deal with, a subjective evaluation by a competent expert is the best way to judge the quality of a translation, and any numerical measurement is usually irrelevant. As you so kindly point out, this approach is based on many centuries of experience, and I do not believe that all of that history should be dispatched to the rubbish heap, lock, stock, and barrel.

Victor, I don't think it is necessary to always have numerical measures, except where they are useful. But I think every NEW client will want a way to define this so that they can use it, know what the product is likely to be before they buy it, and have more than a general idea that the product is ready. The more clarity on the definition of quality upfront, the easier the conversation after the work is delivered. A clear quality definition helps create equivalency of expectations.

Viktor, I must assume that you deal with such stuff as dreams are made on, but I am afraid it has more to do with faith than with reason, and business usually does not allow leaps of faith, although some businesses may require a few, like finance. I am also sorry for Galileo: his work is obviously meaningless to you; maybe it is still blasphemous to you. What else? You are a man of faith, who probably does not believe in science but firmly believes in Übersetzungswissenschaft; I'm a positivist, for whom numbers are a reassuring haven. Take care.

This is an interesting discussion! May I add a new notion? In this article on tcworld (http://bit.ly/2szM0Rm), there's the idea to combine the "human element" with "numbers" to get a grasp on translation quality - based on pre-defined expectations.

Hi Arnold, I read your link, and while the appeal to treat the review process as a constructive part of the workflow seems logical in theory, in practice I feel that the shelf life of appeals to try harder, get things right, give the translator instant feedback, use controlled language in source texts, etc. is probably not much longer than last year's New Year's resolutions. And I didn't notice any significant use of numbers in the article either (apart from a pretty graph with no obvious application), so I am not sure that the MT advocates will be more impressed than I am.

Hi Victor, it's an approach to assess and manage quality, no more, no less. On June 22nd, TAUS had an interesting webinar on "Quality is Measurement" with Bodo Vahldieck from VMWare and Daniel Chin from Spartan Software. They have developed and implemented their own review platform which VMWare uses to measure quality. Bodo mentioned "big data" which allows them to identify problems in real time. So, this similar approach is working, at least for VMWare. Unfortunately, the recording is not online yet.

Hi Arnold, I'm not surprised that TAUS comes into this with its confession of faith, i.e. that "quality is measurement", or that an appeal is made to "big data". If you are even modestly aware of my blog, you may already know that I am not exactly a TAUS fanboy. The problem is that we live in different universes. While TAUS and consorts dine on "big data" and crunch quality-measurement numbers for after-dinner recreation, I continue to translate complex texts which regularly contain terminology that even Google has hardly ever heard of (not to mention the syntactical contortions that are often involved), and I have clients who are grateful when I point out logical inconsistencies in the source text. I have no doubt that the mass-produced translation "industry" exists, and that MT and semi-automated quality evaluation are used in that "industry". But I tend to freak out when MT enthusiasts scold translators like me for not getting on the numbers bandwagon. And I then laugh when MT apologists suggest that translators should engage with the MT community to improve the situation. If chickens were to spend their time negotiating hunting rules with the fox, where would you find a boiled egg in ten years' time?

Hi Victor, I totally understand your point of view, and I couldn't agree more. There are at least two, probably more, universes of translation out there. For your universe, TAUS etc. doesn't make any sense. But for the translation "industry", as you call it, quality measurement and management will become more and more important. Quality has many facets, but in the end both your universe and the translation "industry" have to deliver on this promise. Otherwise, the customers will go elsewhere.