Pages

Thursday, August 10, 2017

This is a guest post by Gábor Ugray on NMT model building challenges and issues. Don't let the playful tone and general sense of frolic in the post fool you. If you look more closely, you will see that it very clearly defines an accurate list of challenges that one might come upon when venturing into building a Neural MT engine. This list of problems is probably the exact list that the big boys (Microsoft, Facebook, Google, and others) faced some time ago. I have previously discussed how SYSTRAN and SDL are solving these problems. While this post describes an experimental system very much from a do-it-yourself perspective, production NMT engines might differ only in the way they handle these various challenges.

This post also points out a basic issue about NMT: while it is clear that NMT works, often surprisingly well, it is still very unclear what predictive patterns are learned, which makes it hard to control and steer. Most (if not all) of the SMT strategies like weighting, language models, terminology override, etc., don't really work here. Data and algorithmic strategies might drive improvement, but linguistic strategies seem harder to implement.

While it took many years before an open source toolkit (Moses) appeared for SMT, NMT already has four open source experimentation options: OpenNMT, Nematus, TensorFlow NMT, and Facebook's Caffe2. It is possible the research community at large may come up with innovative and efficient solutions to the problems described here. Does anybody still seriously believe that LSPs can truly play in this arena, building competitive NMT systems by themselves? I doubt it very much and would recommend that LSPs start thinking about which professional MT solution to align with, because NMT can indeed help build strategic leverage in the translation business if true expertise is involved. The problem with DIY (Do It Yourself) is that having multiple toolkits available is not of much use if you don't know what you are doing.

Discussions on NMT also seem to be often accompanied by people talking about the demise of human translators (by 2029, it seems). I remain deeply skeptical, even though I am sure MT will get pretty damned good on certain kinds of content, and believe that it is wiser to learn how to use MT properly than to dismiss it. I also think the notion of that magical technological convergence they call the Singularity is kind of a stretch. Peter Thiel (aka #buffoonbuddypete) is a big fan of this idea and has a better investment record than I do, so who knows. However, I offer some quotes from Steven Pinker that have the sonorous ring of truth to them: "There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not a pixie dust that magically solves all your problems."

"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

The emphasis below is all mine.

=====

We wanted a Frankenstein translator and ended up with a bilingual chatbot. Try it yourself! (The original title)

I don’t know about you, but I’m in a permanent state of frustration with the flood of headlines hyping machines that “understand language” or are developing human-like “intelligence.” I call bullshit! And yet, undeniably, a breakthrough is happening in machine learning right now. It all started with the oddball marriage of powerful graphics cards and neural networks. With that wedding party still in full swing, I talked Terence Lewis[*] into an even more oddball parallel fiesta. We set out to create a Frankenstein translator, but after running his top-notch GPU on full power for four weeks, we ended up with an astonishingly good translator and an astonishingly stupid bilingual chatbot.

And while we’re at it: Terence is obviously up for mischief, but more importantly, he offers a completely serious English<>Dutch machine translation service commercially. There is even a plugin available for memoQ, and the MyDutchPal system solves many of the MT problems that I’m describing later in this post.

And yet the plane is aloft! A fitting metaphor for AI’s state of the art.
Source: the internets.

So, check out the live demo below this image, then read on to understand what on earth is going on here.

Understanding deep learning

It all started in May when I read Adrian Colyer’s[2] summary of the article Understanding deep learning requires re-thinking generalization[3]. The proposition of Chiyuan Zhang & co-authors is so fascinating and relevant that I’ll just quote it verbatim:

What is it that distinguishes neural networks that generalize well from those that don’t?
[...]
Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions.

The authors describe how they set up a series of original experiments to investigate this. The problem domain they chose is not machine translation, but another classic of deep learning: image recognition. In one experiment, they trained a system to recognize images – except they garbled the data set, randomly shuffling labels and photos. It might have been a panda, but the label said bicycle, and so on, 1.2 million times over. In another experiment, they even replaced the images themselves with random noise.

The paper’s conclusion is… ambiguous. Basically, it shows that neural networks will obediently memorize any random input (noise), but as for the networks’ ability to generalize from a real signal, well, we don’t really know. In other words, the pilot has no clue what they are doing, and yet the plane is still flying, somehow.

I immediately knew that I wanted to try this exact same thing, but with a purpose-built neural MT system. What better way to show that no, there’s no talk about “intelligence” or “understanding” here! We’re really dealing with a potent pattern-recognition-and-extrapolation machine. Let’s throw a garbled training corpus at it: genuine sentences and genuine translations, but matched up all wrong. If we’re just a little bit lucky, it will recognize and extrapolate some mind-bogglingly hilarious non-patterns, our post about it will go viral, and comedians will hate us.

Choices, ingredients, and cooking

OK, let’s build a Frankenstein translator by training an NMT engine on a corpus of garbled sentence pairs. But wait…

What language pair should it be? Something that’s considered “easy” in MT circles. We’re not aiming to crack the really hard nuts; we want a well-known nut and paint it funny. The target language should be English, so you, dear reader, can enjoy the output. The source language… no. Sorry. I want to have my own fun too, and I don’t speak French. But I speak Spanish!

Crooks or crooked cucumbers? There is an abundance of open-source training data[4] to choose from, really. The Hansards are out (no French), but the EU is busy releasing a relentless stream of translated directives, rules and regulations, for instance. It’s just not so much fun to read bureaucratese about cucumber shapes. Let’s talk crooks and romance instead! You guessed right: I went for movie subtitles. You won’t believe how many of those are out there, free to grab.

Too much goodness. The problem is, there are almost 50 million Spanish-English segment pairs in the OpenSub2016[5] corpus. NMT is known to have a healthy appetite for data, but 50 million is a bit over the line. Anything for a good joke, but we don’t have months to train this funny engine. I reduced it to about 9.5 million segment pairs by eliminating duplicates and keeping only the ones where the Spanish was 40 characters or longer. That’s still a lot, and this will be important later.
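The reduction step itself is simple enough to sketch. This is an illustrative Python version of the dedup-and-length filter (toy data and made-up function names, not the actual script used):

```python
# Reduce a parallel corpus: drop duplicate pairs and keep only pairs
# where the Spanish (source) side is at least 40 characters long.
def filter_corpus(pairs, min_src_len=40):
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen or len(src) < min_src_len:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept

pairs = [
    ("¿Qué hora es?", "What time is it?"),  # too short: dropped
    ("La doctora no podía participar en la conferencia.",
     "The doctor could not attend the conference."),
    ("La doctora no podía participar en la conferencia.",
     "The doctor could not attend the conference."),  # duplicate: dropped
]
print(len(filter_corpus(pairs)))  # → 1
```

Run over the real 50-million-pair corpus, the same two rules leave roughly 9.5 million pairs.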

Straight and garbled. At this stage, we realized we actually needed two engines. The funny translator is the one we’re really after, but we should also get a feel for how a real model, trained from the real (non-garbled) data would perform. So I sent Terence two large files instead of one.
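Producing the garbled file amounts to shuffling one column of the corpus so that sources and translations no longer line up. A sketch of the idea (not the actual script):

```python
import random

def garble(pairs, seed=42):
    """Keep the source column in order but randomly permute the target
    column, destroying the alignment between the two sides."""
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    random.Random(seed).shuffle(targets)
    return list(zip(sources, targets))

pairs = [("Hola.", "Hi."), ("Adiós.", "Bye."),
         ("Gracias.", "Thanks."), ("Sí.", "Yes.")]
print(garble(pairs))  # same sentences, wrong partners
```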

The training. I am, of course, extremely knowledgeable about NMT, as far as bar conversations with attractive strangers go. Terence, on the other hand, has spent the past several months building a monster of a PC with an Nvidia GTX 1070 GPU, becoming a Linux magician, and training engines with the OpenNMT framework[6]. You can read about his journey in detail on the eMpTy Pages blog[7]. He launched the training with OpenNMT’s default parameters: standard tokenization, 50k source and target vocabulary, 500-node, 2-layer RNN in both encoder and decoder, 13 epochs. It turned out one epoch took about one day, and we had two models to train. I went on vacation and spent my days in suspense, looking roughly like this:

An astonishingly good translator

The “straight” model was trained first, and it would be an understatement to say I was impressed when I saw the translations it produced. If you’re into that sort of thing, the BLEU score is a commendable 32.10, which is significantly higher than, well, any significantly lower value.[8]

The striking bit is the apparent fluency and naturalness of the translations. I certainly didn’t expect a result like this from our absolutely naïve, out-of-the-box, unoptimized approach. Let’s take just one example:

Did you spot the tiny detail? It’s the feminine pronoun her in the translation. The Spanish equivalent, le, is gender-neutral, so it had to be extrapolated from la doctora – and that’s pretty far away in the sentence! This is the kind of thing where statistical systems would probably just default to masculine. And you can really push the limits. I added stuff to make that distance even longer, and it’s still her in the impossible sentence, La doctora no podía participar en la conferencia que los profesores y los alumnos habían organizado en el gran auditorio de la universidad para el día anterior, además no nos quedaba mucho tiempo, por eso le conté los detalles importantes yo mismo.

But once our enthusiasm is duly curbed, let’s take a closer look at the good, the bad, and the ugly. If you purposely start peeling off the surface layers, the true shape of the emperor’s body begins to emerge. Most of these wardrobe malfunctions are well-known problems with neural MT systems, and much current research focuses on solving them or working around them.

Unknown words. In their plain vanilla form, neural MT systems have a severe limitation on the vocabulary (particularly target-language vocabulary) that they can handle. 50 thousand words is standard, and we rarely, if ever, see systems with a vocabulary over 100k. Unless you invest extra effort into working around this issue, a vanilla system like ours produces a lot of unks[9], like here:

This is a problem with fancy words, but it gets even more acute with proper names, and with rare conjugations of not-even-so-fancy words.
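The mechanics behind those unks are easy to picture: during preprocessing, only the most frequent tokens get a vocabulary entry, and everything else is mapped to a single <unk> token. A toy illustration with a vocabulary of 3 instead of 50,000:

```python
from collections import Counter

def build_vocab(sentences, size):
    """Keep only the `size` most frequent tokens."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(size)}

def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary token with <unk>."""
    return " ".join(tok if tok in vocab else "<unk>" for tok in sentence.split())

corpus = ["the cat sat", "the cat ran", "the dog ran", "a perspicacious axolotl ran"]
vocab = build_vocab(corpus, 3)  # keeps only: the, ran, cat
print(apply_vocab("the perspicacious cat ran", vocab))
# → the <unk> cat ran
```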

Omitted content. Sometimes, stuff that is there in the source simply goes AWOL in the translation. This is related to the fact that NMT systems attempt to find the most likely translation, and unless you add special provisions, they often settle for a shorter output. This can be fatal if the omitted word happens to be a negation. In the sentence below, the omitted part (in red) is less dramatic, but it’s an omission all the same.

Lynch trabaja como siempre, sin orden ni reglas: desde críticas a la televisión actual a sus habituales reflexiones sobre la violencia contra las mujeres, pasando por paranoias mitológicas sobre el bien y el mal en la historia estadounidense.
---
Lynch works as always, without order or rules: from criticism to television on current television to his usual reflections about violence against the women, going through right and wrong in American history.

Hypnotic recursion. Very soon after Google Translate switched to Neural MT for some of its language combinations, people started noticing odd behaviors, often involving loops of repeated phrases.[10] You see one such case in the example above, highlighted in green: that second television seems to come out of thin air. Which is actually pretty adequate for Lynch, if you think about it.

Learning too much. Remember that we’re not dealing with a system that “translates” or “understands” language in any human way. This is about pattern recognition, and the training corpus often contains patterns that are not linguistic in nature.

Mi hermano estaba conduciendo a cien km/h.
---
My brother was driving at a hundred miles an hour.

Since when is a mile a translation of kilometer? And did the system just learn to convert between the two? To some extent, yes. And that’s definitely not linguistic knowledge. But crucially, you don’t want this kind of arbitrary transformation going on in your nuclear power plant’s operating manual.

Numbers. You will have guessed by now: numbers are a problem. There are way too many of them critters to fit into a 50k-vocabulary, and they often behave in odd ways in bilingual texts attested in the wild. Once you stray away from round numbers that probably occur a lot in the training corpus, trouble begins.
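One common mitigation, though not something our out-of-the-box setup did, is to mask numbers with placeholder tokens before translation and copy the originals back into the output afterwards, so the network never has to "translate" digits at all. A minimal sketch (placeholder format is made up):

```python
import re

NUM = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(text):
    """Swap each number for an indexed placeholder; return masked text + originals."""
    numbers = []
    def repl(match):
        numbers.append(match.group(0))
        return f"<num{len(numbers) - 1}>"
    return NUM.sub(repl, text), numbers

def unmask_numbers(text, numbers):
    """Put the original numbers back into the (translated) text."""
    for i, n in enumerate(numbers):
        text = text.replace(f"<num{i}>", n)
    return text

masked, nums = mask_numbers("Mi hermano estaba conduciendo a 100 km/h.")
# masked == "Mi hermano estaba conduciendo a <num0> km/h."
translated = "My brother was driving at <num0> km/h."  # pretend MT output
print(unmask_numbers(translated, nums))
# → My brother was driving at 100 km/h.
```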

Finally, data matters. Our system might be remarkably good, but it’s remarkably good at subtitlese. That’s all it’s ever seen, after all. In Subtitle Land, translations like the one below are fully legit, but they won’t get you far in a speech writing contest for the Queen.

No le voy a contar a la profesora.
---
I'm not gonna tell the teacher.

The garbled model

Now on to the “crazy” model! I made a tremendous mental effort to keep my expectations low, but secretly, at the bottom of my heart, I was hoping for the kind of nonlinear oddity that you get if you start inputting жо into Google Translate[11]:

Let’s just put it this way: I’ve heard funnier jokes before. And those jokes tended to be a lot less repetitive, too. OK, with a bit of luck you do get a few highlights, in the “free self-help advice for nuts” kind of way, but that’s about it.

En este día de Julio, me gustaría escribir algunas reflexiones sobre
como me siento, en relación con mi mismo, que es una de las relaciones
más difíciles y complejas que una persona debe llevar a adelante, y en
relación con los demás...
---
I'm sure you're aware of the fact that you're the only one who's
been able to find out what's going on, and I don't want you to think
that I'm the only one who can help you.

There seem to be two rules to this game:

What you input doesn’t matter a whole lot. The only thing that makes a real difference is how long it is.

The crazy “translations” have nothing to do with the source. They are invariably generic and bland. They could almost be a study in noncommittal replies.

And that last sentence right there is the key, as I realized while I was browsing the OpenNMT forums[12]. It turns out people are using almost the same technology to build chatbots with neural networks. If you think about it, the problem can indeed be defined in the same terms. In translation, you have a corpus of source segments and their translations; you collect a lot of these and train a system to give the right translation for the right source. In a chatbot, your segment pairs are prompts and responses, and you train the system to give the right response to the right prompt.

Except this chatbot thing doesn’t seem to work as well as MT. To quote the OpenNMT forum: "People call it the 'I Don't Know' problem and it is particularly problematic for chatbot type datasets."

For me, this is a key (and unanticipated) take-away from the experiment. We set out to build a crazy translator, but unwittingly we ended up solving a different problem and created a massively uninspired bilingual chatbot.

Two takeaways

Beyond any doubt, the more important outcome for me is the power of neural MT. The quality of the “straight” model that we built drastically exceeded my expectations, particularly because we didn’t even aim to create a high-quality system in the first place. We basically achieved this with an out-of-the-box tool, the right kind of hardware, and freely available data. If that is the baseline, then I am thrilled by the potential of NMT with a serious approach.

The “crazy” system, in contrast, would be a disappointment, were it not for the surprising insight about chatbots. Let’s pause for a moment and think about these. They are all over the press, after all, with enthusiastic predictions that in a very short time, they will pass the Turing test, the ultimate proof of human intelligence.

Well, it don’t look that way to me. Unlike translated sentences, prompts and responses don’t have a direct correlation. There is something going on in the background that humans understand, but which completely eludes a pattern recognition machine. For a neural network, a random sequence of letters in a foreign language is as predictable a response as a genuine answer given by a real human in the original language. In fact, the system comes to the same conclusion in both scenarios: it plays it safe and produces a sequence of letters that’s a generally probable kind of thing for humans to say.

Let’s take the following imaginary prompts and responses:

How old are you?
No, seriously, I took the red door by mistake.

Guess who came to yoga class today.
Poor Mary!

It would be a splendid exercise in creative writing to come up with a short story for both of them. Any of us could do it with ease, and the stories would be pretty amusing. There is an infinite number of realities where these short conversations make perfect sense to a human, and there is an infinite number of realities where they make no sense at all. In neither case can the response be predicted, in any meaningful way, from the prompt or the preceding conversation. Yet that is precisely the space where our so-called artificial “intelligence” currently lives.

The point is, it’s ludicrous to talk about any sort of genuine intelligence in a machine translation system or a chatbot based on recurrent neural networks with a long short-term memory.

Comprehension is that elusive thing between the prompts and the responses in the stories above, and none of today’s technologies contains a metaphorical hidden layer for it. On the level our systems comprehend reality, a random segment in a foreign language is as good a response as Poor Mary!

About Terence *

Terence Lewis, MITI, entered the world of translation as a young brother in an Italian religious order, where he was entrusted with the task of translating some of the founder's speeches into English. His religious studies also called for a knowledge of Latin, Greek, and Hebrew. After some years in South Africa and Brazil, he severed his ties with the Catholic Church and returned to the UK where he worked as a translator, lexicographer[13] and playwright. As an external translator for Unesco, he translated texts ranging from Mongolian cultural legislation to a book by a minor French existentialist. At the age of 50, he taught himself to program and wrote a rule-based Dutch-English machine translation application which has been used to translate documentation for some of the largest engineering projects in Dutch history. For the past 15 years, he has devoted himself to the study and development of translation technology. He recently set up MyDutchPal Ltd to handle the commercial aspects of his software development. He is one of the authors of 101 Things a Translator Needs to Know[14].

References

[1] The live demo is provided "as is", without any guarantees of fitness for purpose, and without any promise of either usefulness or entertainment value. The service will be online for as long as I have the resources available to run it (a few weeks probably).
Oh yes, I'm logging your queries, and rest assured, I will be reading them all. I am tremendously curious to see what you come up with, and I want to enjoy all the entertaining or edifying examples that you find.
[2] the morning paper: an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer. blog.acolyer.org/
[3] Understanding deep learning requires rethinking generalization. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. ICLR 2017 conference submission. openreview.net/forum?id=Sy8gdB9xx&noteId=Sy8gdB9xx
[4] OPUS, the open parallel corpus. Jörg Tiedemann. opus.lingfil.uu.se/
[5] OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. Pierre Lison, Jörg Tiedemann. stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf
[6] OpenNMT: Open-Source Toolkit for Neural Machine Translation. arxiv.org/abs/1701.02810; opennmt.net/
[7] My Journey into "Neural Land". Guest post by Terence Lewis on the eMpTy Pages blog. kv-emptypages.blogspot.com/2017/06/my-journey-into-neural-land.html
[8] Never trust anyone who brags about their BLEU scores without giving any context. I’m not giving you any context, but you have the live demo to see the output for yourself.
Also, a few words about this score. I calculated it on a validation set that contains 3k random segment pairs removed from the corpus before training. So they are in-domain sentences, but they were not part of the training set. The score was calculated on the detokenized text, which is established MT practice, except in NMT circles, who seem to prefer the tokenized text, for reasons that still escape me.
And if you want to max out on the metrics fetish, the validation set’s TER score is 47.28. There. I said it.
[9] Don’t get me wrong, I’m a great fan of unks. They can attend my parties anytime, even without an invitation. If I had a farm I would be raising unks because they are the cutest creatures ever.
[10] Electric sheep. Mark Liberman on Language Log. languagelog.ldc.upenn.edu/nll/?p=32233
[11] From the same Language Log post quoted previously. Translations were retrieved on August 6, 2017; they are likely to change when Google updates their system.

Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at jealousmarkup.xyz and tweets as @twilliability.

Monday, July 24, 2017

This is largely a guest post by Manuel Herranz of Pangeanic, slightly abbreviated and edited from the original to make it more informational and less promotional. Last year we saw Facebook announce that they were going to shift all their MT infrastructure to a Neural MT foundation as rapidly as possible; this was later followed by NMT announcements from SYSTRAN, Google, and Microsoft. In the months since, we have seen that many MT technology vendors have also jumped onto the NMT wagon, some with more conviction than others. The view for those who can go right into the black box and modify things (SDL, MSFT, GOOG, FB, and possibly SYSTRAN) is, I suspect, quite different from that of those who use open source components and have to perform a "workaround" on the output of these black box components. Basically, I see two clear camps amongst MT vendors:

Those who are shifting to NMT as quickly as possible (e.g. SYSTRAN)

Those who are being much more selective and either "going hybrid = SMT+NMT" or building both PB-SMT and NMT engines and choosing the better one (e.g. Iconic).

Pangeanic probably falls in the first group, based on the enthusiasm in this post. Whenever there is a paradigm shift in MT methodology, the notion of "hybrid" invariably comes up. A lot of people who don't understand the degree of coherence needed in the underlying technology generally assume this is a better way. Also, I think that sometimes the MT practitioner has too much investment sunk into the old approach and is reluctant to completely abandon the old for the new. SMT took many years to mature, and what we see today is an automated translation production pipeline that includes multiple models (translation, language, reordering, etc.) together with pre- and post-processing of translation data. The term hybrid is sometimes used to describe this overall pipeline because data can be linguistically informed in some of these pipeline steps.

When SMT first emerged, many problems were noticed (relative to the old RBMT model), and it has taken many years to resolve some of them. The solutions that worked for SMT will not necessarily work for NMT and in fact, there is a good reason to believe they clearly will not. Mostly because the pattern matching technology in SMT is quite different, even though it is much better understood, and more evident than in NMT. The pattern detection and learning that happens in NMT is much more mysterious and unclear at this point. We are still learning what levers to pull to make adjustments and fix weird problems that we see. What can be carried forward easily are data preparation, data and corpus analysis and data quality measures that have been built over time. NMT is a machine learning (pattern matching) technology that learns from data that you show it. Thus far it is limited to translation memory and glossaries.

I am somewhat skeptical about the "hybrid NMT" stuff being thrown around by some vendors. The solutions to NMT problems and challenges are quite different (from PB-SMT), and it makes much more sense to me to go completely one way or the other. I understand that some NMT systems do not yet exceed PB-SMT performance levels, and thus it is logical and smart to continue using the older systems in such cases. But given the overwhelming evidence from NMT research and actual user experience in 2017, I think it is pretty clear that NMT is the way forward across the board. It is a question of when, rather than if, for most languages. Adaptive MT might be an exception in the professional use scenario because it is learning in real time if you work with SDL or Lilt. While hybrid RBMT and SMT made some sense to me, hybrid SMT+NMT does not, and triggers blips on my bullshit radar, as it reeks of marketing-speak rather than science. However, I do think that Adaptive MT built on an NMT foundation might be viable, and could very well be the preferred model for MT in post-editing and professional translator use scenarios for years to come. It is also my feeling that as these more interactive MT/TM capabilities become more widespread, the relative value of pure TM tools will decline dramatically. But I am also going to bet that an industry outsider will drive this change, simply because real change rarely comes from people with sunk costs and vested interests. And surely somebody will come up with a better workbench for translators than standard TM matching, one which provides translation suggestions continuously and learns from ongoing interactions.

I am going to bet that the best NMT systems will come from those who go "all in" with NMT and solve NMT deficiencies without resorting to force-fitting old SMT paradigm remedies on NMT models or trying to go "hybrid", whatever that means.

Google Director of Research Peter Norvig said recently in a video about the future of AI/ML in general that although there is a growing range of tools for building software (e.g. the neural networks), “we have no tools for dealing with data” — that is, tools to build data, and to correct, verify, and check it for bias, as its use in AI expands. In the case of translation, the rapid creation of an MT ecosystem is creating a new need to develop tools for “dealing with language data”: improving data quality and scope automatically by learning through the ecosystem, and transforming language data from today’s sourcing problem (“where can I find the sort of language data I need to train my engine?”) into a more automated supply line.

For me this statement by Norvig is a pretty clear indication that perhaps the greatest value-add opportunities for NMT come from understanding, preparing and tuning the data that ML algorithms learn from. In the professional translation market where MT output quality expectations are the highest, it makes sense that data is better understood and prepared. I have also seen that the state of the aggregate "language data" within most LSPs is pretty bad, maybe even atrocious. It would be wonderful if the TMS systems could help improve this situation and provide a richer data management environment to enable data to be better leveraged for machine learning processes. To do this we need to think beyond organizing data for TM and projects, but at this point, we are still quite far from this. Better NMT systems will often come from better data, which is only possible if you can rapidly understand what data is most relevant (using metadata) and can bring it to bear in a timely and effective way. There is also an excessive focus on TM in my opinion. Focus on the right kind of monolingual corpus can also provide great insight, and help to drive strategies to generate and manufacture the "right kind" of TM to drive MT initiatives further. But this all means that we need to get more comfortable working with billions of words and extracting what we need when a customer situation arises.

===============

The Pangeanic Neural Translation Project

So, time to recap and describe our experience with neural machine translation with tests into 7 languages (Japanese, Russian, Portuguese, French, Italian, German, Spanish), and how Pangeanic has decided to shift all its efforts into neural networks and leave the statistical approach as a support technology for hybridization.

We selected training sets from our SMT engines as clean data to train the same engines with the same data and run parallel human evaluation between the output of each system (existing statistical machine translation engines) and the new engines produced by neural systems. We are aware that if data cleaning was very important in a statistical system, it is even more so with neural networks. We could not add additional material because we wanted to be certain that we were comparing exactly the same data but trained with two different approaches.

A small percentage of bad or dirty data can have a detrimental effect on SMT systems, but if it is small enough, statistics will take care of it and won’t let it feed through the system (although it can also have a far worse side effect, which is lowering statistics all over certain n-grams).

We selected the same training data for languages which we knew were performing very well in SMT (French, Spanish, Portuguese) as well as those that have been known to researchers and practitioners as “the hard lot”: Russian as an example of a morphologically very rich language, and Japanese as a language with a radically different grammatical structure, where re-ordering (which is what hybrid systems have done) has proven to be the only way to improve.

We used a large training corpus of 4.6 million sentences (that is, nearly 60 million running words in English and 76 million in Japanese). In vocabulary terms, that meant 491,600 English words and 283,800 character-words in Japanese. Yes, our brains are able to “compute” all that much and even more, if we add all types of conjugations, verb tenses, cases, etc. For testing purposes, we did what one is supposed to do to avoid inflating percentage scores and took out 2,000 sentences before training started. This is standard in all customization: a small sample is taken out so that the generated engine translates the kind of material it is likely to encounter. Any developer including the test corpus in the training set is likely to achieve very high scores (and will boast about it). But BLEU scores have always been about checking domain engines within MT systems, not across systems (among other things because the training sets have always been different, so a corpus containing many repetitions of the same or similar sentences will obviously produce higher scores). We also made sure that no sentences were repeated, and even similar sentences had been stripped out of the training corpus in order to achieve as much variety as possible. This may produce lower scores compared to other systems, but the results are cleaner and progress can be monitored very easily. This has been the way in academic competitions and has ensured good-quality engines over the years.
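The held-out split described above is a few lines of work; an illustrative sketch (the actual tooling is not shown here):

```python
import random

def train_test_split(pairs, test_size=2000, seed=7):
    """Reserve a random test set BEFORE training; the rest becomes the training set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Toy corpus of 10,000 unique segment pairs:
pairs = [(f"src {i}", f"tgt {i}") for i in range(10000)]
train, test = train_test_split(pairs)
print(len(train), len(test))  # → 8000 2000
```

Because the pairs are unique and the split is disjoint, no test sentence can leak into training, which is exactly what keeps the scores honest.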

The standard automatic metric used in SMT (BLEU) did not detect much difference between the NMT output and the SMT output.

However, WER (word error rate) was showing a new and distinct tendency.

NMT shows better results on longer sentences in Japanese, while SMT seems to be more certain on shorter sentences (the SMT system was trained with 5-gram models).
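For reference, WER is simply the word-level edit distance between hypothesis and reference, normalized by the reference length; a minimal sketch of the metric:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by the
    number of reference words. A minimal sketch of the metric."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance; insertions, deletions
    # and substitutions all cost 1.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# Two words dropped out of six reference words:
print(wer("the cat sat on the mat", "the cat sat mat"))  # → 0.3333...
```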

And this new distinct tendency is what we picked up when the output was evaluated by human linguists. We used the Japanese LSP Business Interactive Japan to rank the output from a conservative point of view, from A to D: A being human-quality translation, B a very good output that requires only a very small percentage of post-editing, C an average output from which some meaning can be extracted but which requires serious post-editing, and D a very low-quality translation with no meaning. Interestingly, our trained statistical MT systems performed better than the neural systems on sentences shorter than 10 words. We can assume that statistical systems are more certain in these cases, when they are only dealing with simple sentences and have enough n-grams giving evidence of a good matching pattern.

We created an Excel sheet (below) for the human evaluators, with the original English on the left next to the reference translation, followed by the neural translation, two columns for the ratings, and finally the statistical output.

Neural-SMT EN>JP ranking comparison showing the original English, the reference translation, the neural MT output and the statistical system output to the right

German, French, Spanish, Portuguese and Russian Neural MT results

The shocking improvement came from the human evaluators themselves. The trend pointed to 90% of sentences being classed as A (perfect, naturally flowing translations) or B (containing all the meaning, with only minor post-editing required). The shift is remarkable in all language pairs, including Japanese, moving from an “OK experience” to remarkable acceptance. In fact, only 6% of sentences were classed as D (“incomprehensible/unintelligible”) in Russian, 1% in French and 2% in German. Portuguese was independently evaluated by the translation company Jaba Translations.

This trend is not particular to Pangeanic. Several presenters at TAUS Tokyo pointed to ratings around 90% for Japanese using off-the-shelf neural systems, compared to carefully crafted hybrid systems. Systran, for one, confirmed that they are focusing only on neural research/artificial intelligence, throwing away years of rule-based, statistical and hybrid efforts.

Systran’s position is meritorious and very forward-thinking. Some current papers and MT providers still resist the fact that, despite all the work we have done over the years, multimodal pattern recognition has gained the upper hand. It was only the lack of computing power, and of GPUs for training, that was holding it back.

Neural networks: Are we heading towards the embedment of artificial intelligence in the translation business?

BLEU may not be the best indication of what is happening in the new neural machine translation systems, but it is an indicator. We were aware of other experiments and results by other companies pointing in a similar direction. Still, although the initial results may have made us think it was of no use, BLEU remains a useful indicator – and in any case, it was always an indicator of an engine’s behavior, not a true measure of one overall system versus another. (See the Wikipedia article https://en.wikipedia.org/wiki/Evaluation_of_machine_translation.)
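For readers unfamiliar with the metric, BLEU is a geometric mean of clipped n-gram precisions multiplied by a brevity penalty. A minimal single-reference sketch follows; for real evaluations, use a standard implementation such as sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Minimal corpus-level BLEU (uniform weights, one reference per
    hypothesis) -- a sketch of the metric, not a production scorer."""
    clipped = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            h_ng, r_ng = ngrams(h, n), ngrams(r, n)
            totals[n - 1] += sum(h_ng.values())
            # "Clipping" stops a repeated n-gram from being rewarded
            # more times than it occurs in the reference.
            clipped[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
    if min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

print(corpus_bleu(["the cat is on the mat"], ["the cat is on the mat"]))  # → 1.0
```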

Machine translation companies and developers face a dilemma: they have to do without the existing research, connectors, plugins and automatic measuring techniques, and build new ones. Building connectors and plugins is not so difficult; changing the core from Moses to a neural system is another matter. NMT is producing amazing translations, but it is still pretty much a black box. Our results show that some kind of hybrid system using the best features of SMT is highly desirable, and academic research is already moving in that direction – as happened with SMT itself some years ago.

Yes, the translation industry is at the peak of the neural networks hype. But looking at the whole picture, and at how artificial intelligence (pattern recognition) is being applied in several other areas to produce intelligent reports, tendencies and data, NMT is here to stay – and it will change the game for many, as more content needs to be produced cheaply with post-editing, at light speed, wherever machine translation is good enough. Amazon and Alibaba are not investing millions in MT for nothing – they want to reach people in their own languages with a high degree of accuracy and at a speed human translators cannot match.

Manuel Herranz is the CEO of Pangeanic. Collaboration with Valencia’s Polytechnic research group and the Computer Science Institute led to the creation of the PangeaMT platform for translation companies. He worked as an engineer for Ford machine-tool suppliers and Rolls Royce Industrial and Marine, handling training and documentation from the buyer’s side, at a time when translation memories had not yet appeared in the LSP landscape. After joining a Japanese group in the late 90’s, he became Pangeanic’s CEO in 2004 and began his machine translation project in 2008, creating the first command-line versions of the first commercial application of Moses (Euromatrixplus). Pangeanic was the first LSP in the world to implement open-source Moses successfully in a commercial environment, including re-training features and tag handling before they became standard in the Moses community.

Tuesday, July 18, 2017

This is a post by Vassilis Korkas on the quality assurance and quality checking processes being used in the professional translation industry. (I still find it really hard to say localization, since that term is really ambiguous to me: I spent many years trying to figure out how to deliver properly localized sound through digital audio platforms. To me, localized sound = cellos from the right and violins from the left of the sound stage. I have a strong preference for instruments to stay in place on the sound stage for the duration of the piece.)

As the
volumes of translated content increase, the need for automated
production lines also grows. The industry is still laden with products
that don't play well with each other, and buyers should insist that
vendors of the various tools that they use enable and allow easy
transport and downstream processing of any translation related content.
From my perspective, automation in the industry is also very limited, and there is a huge need for human project management because tools and processes don't connect well. Hopefully, we will start to see this scenario change. I also hope that the database engines for these new processes are much smarter about NLP and much more ready to integrate machine learning elements, as this too will allow the development of much more powerful, automated, and self-correcting tools.

As an aside, I thought this chart was very interesting (assuming it is actually based on some real research), and it shows why it is much more worthwhile to blog than to share content on LinkedIn, Facebook or Twitter. However, the quality of the content does indeed matter, and other sources say that high-quality content has an even longer life than shown here.

Finally, CNBC had this little clip describing employment growth in the translation sector, where they state: "The number of people employed in the translation and interpretation industry has doubled in the past seven years." Interestingly, this is exactly the period in which we have also seen the use of MT dramatically increase. Apparently, they conclude that technology has helped to drive this growth.

The emphasis in the post below is mine.

==========

In
pretty much any industry these days, the notion of quality is one that
seems to crop up all the time. Sometimes it feels like it’s used merely
as a buzzword, but more often than not quality is a real concern, both
for the seller of a product or service and the consumer or customer. In
the same way, quality appears to be omnipresent in the language services
industry as well. Obviously, when it comes to translation and
localization, the subject of quality has rather unique characteristics compared to other services; ultimately, however, it is the expected goal in any project.

In this article, we will review what the established practices are for monitoring and achieving
linguistic quality in translation and localization, examine what the
challenges are for linguistic quality assurance (LQA) and also attempt
to make some predictions for the future of LQA in the localization
industry.

Quality assessment and quality assurance: same book, different pages

Despite
the fact that industry standards have been around for quite some time,
in practice, terms such as ‘quality assessment’ and ‘quality assurance’,
and sometimes even ‘quality evaluation’, are often used
interchangeably. This may be due to a misunderstanding of what each
process involves but, whatever the reason, this practice leads to
confusion and could create misleading expectations. So, let us take this
opportunity to clarify:

[Translation] Quality Assessment (TQA) is the process of evaluating the overall quality of a completed translation by using a model with pre-determined
values which can be assigned to a number of parameters used for scoring
purposes. Examples of such models are LISA, MQM, DQF, etc.

Quality Assurance
“[QA] refers to systems put in place to pre-empt and avoid errors or
quality problems at any stage of a translation job”. (Drugan, 2013: 76)

Quality
is an ambiguous concept in itself and making ‘objective’ evaluations is
a very difficult task. Even the most rigorous assessment model requires
subjective input by the evaluator who is using it. When it comes to
linguistic quality, in particular, we would be looking to improve on
issues that have to do with punctuation, terminology and glossary
compliance, locale-specific conversions and formatting, consistency,
omissions, untranslatable items and others. It is a job that requires a
lot of attention to detail and strict adherence to rules and guidelines –
and that’s why LQA (most aspects of it, anyway) is a better candidate
for ‘objective’ automation.

Given the volume of translated words
in most localization projects these days, it is practically prohibitive
in terms of time and cost to have in place a comprehensive QA process,
which would safeguard certain expectations of quality both during and
after translation. Therefore it is very common that QA, much like TQA,
is reserved for the post-translation stage. A human reviewer, with or
without the help of technology, will be brought in when the translation
is done and will be asked to review/revise the final product. The
obvious drawback of this process is that significant time and effort
could be saved if somehow revision could occur in parallel with the
translation, perhaps by involving the translator herself with the
process of tracking errors and making these corrections along the way.

The
fact that QA only seems to take place ‘after the fact’ is not the only
problem, however. Volumes are another challenge – too many words to
revise, too little time and too expensive to do it. To address
this challenge, Language Service Providers (LSPs) use sampling (the
partial revision of an agreed small portion of the translation) and
spot-checking (the partial revision of random excerpts of the
translation). In both cases, the proportion of the translation that is
checked is about 10% of the total volume of translated text, and that is
generally considered sufficient to judge whether the whole
translation is good or not. This is an established and accepted industry
practice that was created out of necessity. However, one doesn’t need
to have a degree in statistics to appreciate that this small sample,
whether defined or random, is hardly big enough to reflect the quality
of the overall project.
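A back-of-the-envelope calculation shows what a 10% sample can and cannot tell you. A rough sketch using the normal approximation for a proportion (purely illustrative, not an industry formula):

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an error *rate* estimated from a
    random sample of n segments (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Suppose a 10% sample of a 5,000-segment job (n = 500) finds
# errors in 8% of the sampled segments:
print(round(margin_of_error(0.08, 500), 3))  # → 0.024
```

A 500-segment random sample pins the overall error rate down to roughly ±2.4 percentage points, but it says nothing about where errors cluster: a systematic problem confined to one file or one terminology domain can still slip through entirely.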

The progressive increase of the volumes of
text translated every year (also reflected in the growth of the total
value of the language service industry, as seen below) and the
increasing demands for faster turnaround times makes it even harder for
QA-focused technology to catch up. The need for automation is greater
than ever before.

Built-in
QA checks in CAT tools range from the completely basic to the quite
sophisticated, depending on which CAT tool you’re looking at.
Stand-alone QA tools are mainly designed with error detection/correction
capabilities in mind, but there are some that use translation quality
metrics for assessment purposes – so they’re not quite QA tools as such.
Custom tools are usually developed in order to address specific needs
of a client or a vendor who happens to be using a proprietary
translation management system or something similar. This obviously
presupposes that the technical and human resources are available to
develop such a tool, so this practice is rather rare and exclusive to
large companies that can afford it.

Consistency is king – but is it enough?

Terminology and glossary/wordlist
compliance, empty target segments, untranslated target segments,
segment length, segment-level inconsistency, different or missing
punctuation, different or missing tags/placeholders/symbols, different
or missing numeric or alphanumeric structures – these are the most
common checks that one can find in a QA tool. On the surface at least,
this looks like a very diverse range that should cover the needs of most
users. All these are effectively consistency checks. If a certain
element is present in the source segment, then it should also exist in
the target segment. It is easy to see why this kind of “pattern
matching” can be easily automated and translators/reviewers certainly
appreciate a tool that can do this for them a lot more quickly and
accurately than they can.
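The parity checks listed above are straightforward to automate; a minimal sketch (the patterns are illustrative, not any particular tool's):

```python
import re

def consistency_issues(source, target):
    """Locale-independent parity checks: numbers, tags and
    placeholders found in the source should also appear in the
    target. A minimal sketch of the idea."""
    issues = []
    for label, pattern in [
        ("number", r"\d+(?:[.,]\d+)?"),
        ("tag", r"</?\w+\s*/?>"),
        ("placeholder", r"%\w|\{\w+\}"),
    ]:
        src_items = re.findall(pattern, source)
        tgt_items = re.findall(pattern, target)
        for item in src_items:
            if src_items.count(item) > tgt_items.count(item):
                issues.append((label, item))
    if not target.strip():
        issues.append(("empty target", ""))
    return issues

print(consistency_issues("Press <b>OK</b> within 30 seconds.",
                         "Drücken Sie OK innerhalb von 30 Sekunden."))
# → [('tag', '<b>'), ('tag', '</b>')]
```

Note that such a check is exactly as locale-blind as the article describes: a correctly localized German number (1.234,56 for English 1,234.56) would be flagged as an inconsistency.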

Despite the obvious benefits of these
checks, the methodology on which they run has significant drawbacks.
Consistency checks are effectively locale-independent and that creates
false positives (the tool detects an error when there is none), also
known as “noise”, and false negatives (the tool doesn’t detect an error
when there is one). Noise is one of the biggest shortcomings of QA tools
currently available and that is because of the lack of locale
specificity in the checks provided. It is in fact rather ironic that
the benchmark for QA in localization doesn’t involve locale-specific
checks. To be fair, in some cases users are allowed to configure the
tool in greater depth and define such focused checks on their own
(either through existing options in the tools or with regular
expressions).
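Locale-specific checks of the kind users can define with regular expressions might look like this (the rules and locale codes here are hypothetical examples, not lexiQA's actual checks):

```python
import re

# Hypothetical locale-aware rules: the check depends on the
# conventions of the *target* locale, not just source/target parity.
LOCALE_CHECKS = {
    # French typography expects a (narrow no-break) space before
    # high punctuation marks such as ? ! ; :
    "fr-FR": (r"\S[?!;:]", "missing space before punctuation"),
    # German uses a decimal comma, so a group like 3.5 is suspect.
    "de-DE": (r"\d\.\d", "decimal point in a decimal-comma locale"),
}

def locale_check(target_text, locale):
    pattern, message = LOCALE_CHECKS[locale]
    return [(m.group(), message) for m in re.finditer(pattern, target_text)]

print(locale_check("Êtes-vous sûr?", "fr-FR"))
# → [('r?', 'missing space before punctuation')]
```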

Source: XKCD

But this makes the process more intensive for the user, and it comes
as no surprise that the majority of users of QA tools never bother to do
that. Instead, they perform their QA duties relying on the sub-optimal
consistency checks which are available by default.

Linguistic quality assurance is (not) a holistic approach

In
practice, for the majority of large scale localization projects, only
post-translation LQA takes place, mainly due to time pressure and
associated costs – an issue we touched on earlier in connection with the
practice of sampling. The larger implication of this reality is that:

a) effectively we should be talking about quality control rather than
quality assurance, as everything takes place after the fact; and

b)
quality assurance becomes a second-class citizen in the world of
localization. This contradicts everything we see and hear about the
importance of quality in the industry, where both buyers and providers
of language services prioritise quality as a prime directive.

As
already discussed, the technology does not always help. CAT tools with
integrated QA functionality have a lot of issues with noise, and that is
unlikely to change anytime soon because this kind of functionality is
not a priority for a CAT tool. On the other hand, stand-alone QA tools
with more extensive functionality work independently, which means that
any potential ‘collaboration’ between stand-alone QA tools and CAT tools
can only be achieved in a cumbersome and intermittent workflow:
complete the translation, export it from the CAT tool, import the
bilingual file in the QA tool, run the QA checks, analyse the QA report,
go back to the CAT tool, find the segments which have errors, make
corrections, update the bilingual file and so on.

The continuously
growing demand in the localization industry for the management of
increasing volumes of multilingual content in pressing timelines and the
compliance with quality guidelines means that the challenges described
above will have to be addressed soon. As the trends of online
technologies in translation and localization become stronger, there is
an implicit understanding that existing workflows will have to be
uncomplicated in order to accommodate future needs in the industry. This
can indeed be achieved with the adoption of bolder QA strategies and
more extensive automation. The need in the industry for a more efficient
and effective QA process is here now and it is pressing. Is there a new
workflow model which can produce tangible benefits both in terms of
time and resources? I believe there is, but it will take some faith and
boldness to apply it.

Get ahead of the curve

In
the last few years, the translation technology market has been marked by
substantial shifts in the market shares occupied by offline and online
CAT tools respectively, with the online tools gaining rapidly more
ground. This trend is unlikely to change. At the same time, the age-old
problems of connectivity and compatibility between different platforms
will have to be addressed one way or another. For example, slowly
transitioning to an online CAT tool and still using the same offline QA
tool from your old workflow is as inefficient as it is irrational,
especially in the long run.

A deeper integration between CAT and
QA tools also has other benefits. The QA process can move up a step in
the translation process. Why have QA only in post-translation when you
can also have it in-translation? (And it goes without saying that
pre-translation QA is also vital, but it would apply to the source
content only so it’s a different topic altogether.) This shift is indeed
possible by using API-enabled applications – which are in fact already
standard practice for the majority of online CAT tools. There was a time
when each CAT tool had its own proprietary file formats (as they still
do), and then the TMX and TBX standards were introduced and the industry
changed forever, as it became possible for different CAT tools to
“communicate” with each other. The same will happen again, only this
time APIs will be the agent of change.

Source: API Academy

Looking further ahead, there are also some other exciting ideas which
could bring about truly innovative changes to the quality assurance
process. The first one is the idea of automated corrections. Much in the
same way that a text can be pre-translated in a CAT tool when a
translation memory or a machine translation system is available, in a QA
tool which has been pre-configured with granular settings it would be
possible to “pre-correct” certain errors in the translation before a
human reviewer even starts working on the text. With a deeper
integration scenario in a CAT tool, an error could be corrected in a
live QA environment the moment a translator makes that error.
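The "pre-correction" idea could be as simple as a battery of deterministic, safe substitutions applied before a reviewer opens the file. The rules below are a hypothetical sketch, not any particular tool's behaviour:

```python
import re

# Deterministic, safe fixes applied before human review
# (illustrative rules only).
PRECORRECTIONS = [
    (re.compile(r"  +"), " "),           # collapse multiple spaces
    (re.compile(r" +([,.;:])"), r"\1"),  # strip space before punctuation
    (re.compile(r"\(\s+"), "("),         # "( text" -> "(text"
    (re.compile(r"\s+\)"), ")"),         # "text )" -> "text)"
]

def precorrect(text):
    for pattern, replacement in PRECORRECTIONS:
        text = pattern.sub(replacement, text)
    return text

print(precorrect("Save the file ,  then close ( if prompted )."))
# → Save the file, then close (if prompted).
```

Only corrections that are unambiguous in every context belong in such a list; anything locale- or meaning-dependent still needs a human (or at least a flag rather than a silent fix).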

This
kind of advanced automation in LQA could be taken even a step further
if we consider the principles of machine learning. Access to big data in
the form of bilingual corpora which have been checked and confirmed by
human reviewers makes the potential of this approach even more likely.
Imagine a QA tool that collects all the corrections a reviewer has made
and all the false positives the reviewer has ignored and then it
processes all that information and learns from it. With every new text
processed, the machine learning algorithms make the tool more
accurate in what it should and should not consider to be an error. The possibilities are endless.
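As a toy illustration of that learning loop (hypothetical, and far simpler than a real machine-learning approach, which would learn from features of each flag rather than just its type):

```python
from collections import defaultdict

class AdaptiveQA:
    """Toy sketch: track, per check type, how often reviewers accept
    vs. dismiss a flagged error, and stop surfacing checks whose
    observed false-positive rate is too high."""

    def __init__(self, fp_threshold=0.8, min_samples=20):
        self.stats = defaultdict(lambda: [0, 0])  # check -> [dismissed, total]
        self.fp_threshold = fp_threshold
        self.min_samples = min_samples

    def record(self, check, dismissed):
        s = self.stats[check]
        s[1] += 1
        if dismissed:
            s[0] += 1

    def should_surface(self, check):
        dismissed, total = self.stats[check]
        if total < self.min_samples:
            return True  # not enough evidence yet
        return dismissed / total < self.fp_threshold

qa = AdaptiveQA()
for _ in range(30):
    qa.record("segment_length", dismissed=True)   # reviewers always ignore it
for _ in range(30):
    qa.record("missing_tag", dismissed=False)     # reviewers always fix it
print(qa.should_surface("segment_length"), qa.should_surface("missing_tag"))
# → False True
```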

Despite
the various shortcomings of current practices in LQA, the potential is
there to streamline and improve processes and workflows alike, so much
so that quality assurance will not be seen as a “burden” anymore, but
rather as an inextricable component of localization, both in theory and
in practice. It is up to us to embrace the change and move forward.

Vassilis Korkas is the COO and a co-founder of lexiQA.
Following a 15-year academic career in the UK, in 2015 he decided to
channel his expertise in translation technologies, technical translation
and reviewing into a new tech company. In lexiQA he is now involved with content development, product management, and business operations.

Note: This is the abridged version of a four-part article series published by the author on lexiQA’s blog: Part 1 – Part 2 – Part 3 – Part 4