Thursday, July 14, 2016

This is a guest post by Silvio Picinini, who works in a team at eBay that provides linguistic feedback and addresses linguistic issues, specifically
to enhance the large-scale MT projects underway at eBay.
To my mind, this is an example of best practice in MT, where NLP and MT experts work together with linguists to solve large-scale
translation problems in a collaborative way.

The eBay linguistic team has been producing a number of articles that describe various kinds of linguistic tasks that are increasingly needed
to add value and quality to large-scale MT efforts. I think these articles are worth greater attention, as they have a high SNR (signal-to-noise
ratio). They educate and inform readers about very specific things that, IMO, together add up to examples of best practice.
I am hoping that Silvio and his colleagues become regular contributors to this blog so that more people get access to this valuable
information.

I was honored to be invited by Kirti to write for this blog. I hope to deserve it, by sharing my experiences as a translator working with machine
translation. Recently I was really impressed by Kirti's post on how a lot of content is being
translated outside of the translation services industry. I would like to add a few thoughts to that.

I work with User-Generated Content for eBay. Users all over the world describe what they are selling, creating titles and descriptions for their items,
by the millions. We need to translate the information on these items so that users who speak other languages can buy them. So this is the job:
translate millions of items quickly, almost instantly. A new initiative at eBay is structuring data in a different way and making it easier to create
product reviews. In a short period, we accumulated millions of reviews. A review written in English about a digital camera (a product sold globally) is
probably very useful for a buyer in Germany or in Mexico. So we need these reviews translated for these buyers. Could we do this by hiring human
translators? No. It is easy to see that, given the volume, time and cost involved, human intervention is out of the question. Virtually anything that is
open to users, allowing them to create their own content, will generate volumes that cannot feasibly be translated by hand. These are real
scenarios from eBay, but Facebook also recently announced the translation of posts with its own MT engine, and Amazon is working on MT.

In addition to what is already happening, we live in a world where new forms of content created by users appear every day. This content is of
interest to a lot of people, and that will require translation. So here are some types of User-Generated Content that, in my opinion, will be of
interest beyond their original language. I am guessing that the companies behind them may be interested in translating this content in the (near) future:

Rental home reviews on Airbnb

TripAdvisor reviews of places to see, eat and stay

Netflix movie reviews

LinkedIn articles

Tweets

How-to guides

Knowledge bases

Even Yelp reviews that seem local can be of interest to visitors from other countries or to speakers of a second language in the same country (French
in Canada, Spanish in the US).

So this is what I meant by the title of the post: translators would never be offered User-Generated Content translations, so when these jobs go to
machine translation engines, translators are not really affected in any way. MT is not taking any translators' jobs if there was no job in the first
place. But maybe translators would like to have an effect on this enormous translation market. Kirti has been posting guidance on how translators can
prepare to participate in this opportunity.

From my experience at eBay, here are a few thoughts about the role that translators may play.

MT engines will need to be trained. The specific content needed for training may not be available to be harvested. Therefore, companies will need to
create training data for their engines. This training data will be post-edited from the MT output, and this is a job that requires the human
intervention of post-editors and reviewers. The quality of the MT output also needs to be measured, and the measurement requires (in the case of BLEU) a
human-translated reference. So there is also a role for translators, rather than post-editors, in creating references for MT measurements.
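To make concrete why BLEU depends on a human-translated reference, here is a minimal sketch in Python. It is a simplified, sentence-level version written only for illustration (the function names `ngrams` and `simple_bleu` are made up for this example, and it only looks at word pairs at most). It is not the scoring setup used at eBay. The point is simply that the MT output is scored by how many word sequences it shares with the translation a human produced.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypothesis, reference, max_n=2):
    """Simplified sentence-level BLEU against a single human reference.

    Illustrative assumption: whitespace tokenization, n-grams up to max_n,
    no smoothing. Standard BLEU is computed over a whole test set.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the human reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        if total == 0 or matches == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))

    # Brevity penalty: MT output shorter than the reference is penalized.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))

    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

Calling `simple_bleu(mt_output, human_reference)` returns a value between 0 and 1, and an MT output identical to the reference scores 1.0. In practice, teams usually rely on an established tool such as sacreBLEU and score an entire test set rather than single sentences, but the dependency is the same: without the translator's reference there is nothing to compare against.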

The importance of the pattern over the individual error: the usual mindset for translators and reviewers is to focus on every error that they see,
correct each one and thereby produce perfect quality.
For MT, the mindset should focus on patterns of errors. Translators can make a bigger impact by finding patterns of errors whose correction will
improve quality on a larger scale, in every translation that the MT engine subsequently produces.

Translators have the linguistic ability to see these patterns. In a paper at AMTA 2014, I presented a few patterns found in Brazilian Portuguese:

Diminutives are widely used by users in informal language, and are not commonly present in the training data, which is usually written in more
formal language.

The lack of diacritical marks is common among users, both for accents and for marks that modify letters such as ç, ã and õ. The training data
is usually written in more formal language and will contain all the diacritical marks. The MT will have to deal with these differences, such as
"relogio" vs. "relógio" and "calca" vs. "calça".

Some words are intended as target-language words but also exist as words in the source language, causing issues. "Costumes" is a word in English,
but it is also a word in Portuguese (where it means "customs" or "habits").

Some words are misspelled because certain letters have the same sound, causing issues for MT. For example, "engraçados" spelled as "engrassados" (ç
and ss have the same sound).

Some words are spelled as people pronounce them, and this is different from the correct spelling. For example, "roupa" spelled as
"ropa". MT needs to deal with that.

Some English words are spelled as they would be written under Portuguese spelling rules. So "Michael Jordan" would become "Maico Jordam".
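Once a pattern like the missing diacritical marks has been identified, it can often be handled for all items at once, which is exactly the advantage of the pattern over the individual error. The Python sketch below illustrates one possible approach, under stated assumptions: a list of correctly spelled Portuguese words is available (the `vocabulary` argument), and the helper names `strip_diacritics`, `build_restoration_map` and `restore` are invented for this example. It is not a description of how eBay's engines handle the issue.

```python
import unicodedata


def strip_diacritics(word):
    """Remove accents, tildes and cedillas, e.g. 'relógio' -> 'relogio'."""
    decomposed = unicodedata.normalize("NFD", word)
    # After NFD decomposition, diacritics are separate combining marks ("Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def build_restoration_map(vocabulary):
    """Map each accent-free spelling to its correctly accented form.

    Ambiguous spellings (more than one candidate, or a spelling that is
    itself a listed word) are left out so they are never rewritten blindly.
    """
    candidates = {}
    for word in vocabulary:
        candidates.setdefault(strip_diacritics(word), set()).add(word)
    return {
        bare: next(iter(forms))
        for bare, forms in candidates.items()
        if len(forms) == 1 and bare not in forms
    }


def restore(text, mapping):
    """Replace accent-free tokens with their accented form when it is known."""
    return " ".join(mapping.get(token, token) for token in text.split())
```

With a hypothetical vocabulary of `["relógio", "calça", "roupa"]`, `restore("relogio e calca", build_restoration_map(vocabulary))` gives "relógio e calça", and words that are not in the map are left untouched. The same lookup idea could be extended to the sound-based misspellings above (for example, treating "ss" and "ç" as equivalent when searching the vocabulary), but deciding which equivalences are safe is precisely the kind of judgment that needs a linguist.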

There are MT companies, academic experts and customer engineering teams working with MT. It may be time for the language experts to play a role.

Silvio Picinini has been a Machine Translation Language Specialist at eBay since 2013. With over 20 years in Localization, he has worked in
quality-focused positions for an LSP, and as an in-house English into Brazilian Portuguese translator for Oracle. A former quality engineer, Silvio
holds degrees in electrical and electronic/software engineering. LinkedIn: https://www.linkedin.com/in/silviopicinini