Lost in Google’s Translation

Last week Google Translate announced that it now has over 200 million monthly users. As Alexis Madrigal noted in the Atlantic, this means that Google is now translating as much in a day as all the world’s professional human translators would in a year – an amount of text equivalent to a million books.

Google Translate is far from perfect – its garbled prose, creative grammar and bizarre word substitutions have been dubbed “Dada Processing” – but it is one of the few Google products one can unequivocally say does more good than harm. Because of Google Translate, millions of people access ideas that would have once remained impenetrable. The default dismissal of foreign media is gone.

Since its inception in 2006, Google Translate has added 65 languages from every region in the world, with two notable exceptions: Central Asia and sub-Saharan Africa. No languages from Central Asia make the Google cut, despite the fact that they have far more speakers than many of the languages that do. Pashto (50 million), Uzbek (21 million), and Uyghur (9 million) are not included. Neither are the African languages Hausa (40 million), Yoruba (19 million), or Zulu (10 million). The sole inclusions from sub-Saharan Africa are Swahili and Afrikaans, a language which derives from Dutch. In contrast, Icelandic (310,000), Welsh (480,000), and Irish (60,000) – as well as every other European language – are represented.

Translation has the potential to shift the politics of perception. Here Central Asia and sub-Saharan Africa share something else in common: they are already on the losing side of this game. These two regions are the least likely to be covered by the international media and the most likely to be dismissed as barbaric, obscure or irrelevant by non-specialists.

Political scientist Laura Seay recently wrote about the problems that plague media coverage of Africa: the paucity of international correspondents, the callous framing of tragic events, the echo chamber of repeated coverage, the tendency to ignore regional diversity, the casual racism and condescension. These practices mark Central Asian media coverage as well. “If it takes hundreds of deaths or a revolution to make you report on a country, don’t cover its ‘humorous’ political and economic failures,” Matthew Kupfer wrote last week on Registan, bemoaning the international media mockery of Kyrgyzstan’s inability to pay its bills.

Central Asia and sub-Saharan Africa share another important similarity: their events are framed through the languages of colonization. Most international coverage of Central Asia is written by people who speak Russian. Similarly, sub-Saharan Africa is covered by speakers of English, French and Arabic. This is not the fault of the reporters: when bureaus assign so few people to such large regions, one cannot reasonably expect them to know all the local languages, and so it makes sense for them to rely on a lingua franca. But such reliance is problematic when it comes to the internet, because there is so much room to do better.

It is increasingly common for news agencies to cover a country by reprinting claims made online. This is why CNN films Facebook pages and why complex conflicts get reduced to “Twitter revolutions”. But without an ability to translate local languages, reporters rely on whatever material they can understand – meaning that, to give one example, Russian-language content is often used to represent what is going on in Uzbek communities. Internet content created by speakers of languages like Uzbek – often preferred by citizens of Uzbekistan even if they do know Russian – is ignored.

As a result, important insights and debates remain invisible to the outside world. “There is another internet, a secret internet, in which meaningful political conversations take place in Uzbek, Kyrgyz, Kazakh, Turkmen, and Tajik, yet the majority of the world remains none the wiser,” I wrote in 2010, countering Evgeny Morozov and others who found Kyrgyzstan’s online media irrelevant. This is as true today as it was then.

Google is moving in the right direction: translation for Kazakh and Kyrgyz is being developed, and Google Africa is soliciting contributions from Africans interested in expanding the site’s capabilities. One hopes that these additions will increase the regional knowledge necessary to write with depth and compassion. Nothing substitutes for human translation, especially of Central Asian and African websites filled with jokes, idioms and poetry. But Google Translate can give a sense of what people are concerned about, which may help shift coverage away from the trivialities and biases cited by Seay. Moreover, it allows citizens who only speak regional languages to access foreign media and translate their own works for a broader audience – a feat which the excellent Global Voices has achieved on a more selective scale. For citizens involved in politics or international affairs, this is an invaluable gift.

“There is never interpretation, understanding and knowledge when there is no interest,” Edward Said wrote in Covering Islam, critiquing media bias toward the Muslim world. Central Asian and sub-Saharan African countries face a related but arguably more severe problem: it is hard to create interest in places that most people do not regard as individual, complex entities. In the digital era, impenetrable means invisible; invisible means irrelevant. Adding more local and national languages to Google Translate is a small step toward remedying this problem.

This post was written by...

Sarah Kendzior is an anthropologist who studies politics and the internet in the former Soviet Union. She has a PhD in cultural anthropology from Washington University in Saint Louis and an MA in Central Eurasian Studies from Indiana University. Her research has been published in many academic journals and media outlets, including American Ethnologist, Central Asian Survey, Demokratizatsiya and the Atlantic. She is currently an instructor at Washington University, where she teaches a course called "The Internet, Politics, and Society."

(Full disclosure: I studied Xhosa, a language very close to Zulu, in college.)

Google Translate is limited in part by its methodology: languages are made available only after millions of pages of documents have been scanned and entered into the database, and translation is accomplished using statistical analysis of documents in the languages being translated. Users who suggest alternate translations then help to make the translations more accurate.

This is one of the major reasons some of the languages you mentioned aren’t available: a lack of reliable human-translated documents and a smaller user base able to access and improve machine translation. (Keep in mind that what ultimately improves Google’s machine translation is not the number of speakers, but the number of literate speakers who have internet access.)
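As a rough illustration of the statistical approach described above, here is a minimal Python sketch. It is not Google's actual system: the tiny Spanish-English corpus, the function names `train` and `best_translation`, and the choice of the Dice coefficient as the association score are all assumptions made for illustration. The sketch counts how often words co-occur in human-translated sentence pairs and picks the most strongly associated target word.

```python
from collections import Counter
from itertools import product

# Hypothetical toy parallel corpus: (source sentence, human translation).
# Real statistical systems rely on millions of such pairs.
PARALLEL = [
    ("la casa", "the house"),
    ("la puerta", "the door"),
    ("una casa", "a house"),
    ("una puerta", "a door"),
]

def train(pairs):
    """Count word occurrences and cross-language co-occurrences."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Every source word co-occurs with every target word in the pair.
        joint.update(product(s_words, t_words))
    return src_count, tgt_count, joint

def best_translation(word, model):
    """Return the target word most associated with `word`, or None."""
    src_count, tgt_count, joint = model
    if word not in src_count:
        return None  # no data at all: the word cannot be translated

    def dice(t):
        # Dice coefficient: shared occurrences relative to total occurrences.
        return 2 * joint[(word, t)] / (src_count[word] + tgt_count[t])

    candidates = [t for (s, t) in joint if s == word]
    return max(candidates, key=dice)

model = train(PARALLEL)
print(best_translation("casa", model))   # seen in two sentence pairs
print(best_translation("perro", model))  # never seen in the corpus
```

With only four sentence pairs the sketch can already tell "casa" apart from "la", but any word missing from the corpus yields nothing, which is the data-scarcity problem in miniature.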

The second issue at play is the availability of text for use in machine translation. Not only are many of these languages unlikely to be found in print*, but as the languages are largely spoken, rather than written or read, they are subject to more frequent changes. Unlike speakers of English, French, or even Arabic, speakers of Xhosa and Zulu do not have a long linguistic history captured in print to fall back upon.

TL;DR: Central Asian and sub-Saharan languages pose a challenge to machine translation. This is not to say that Google cannot or should not try to make these languages available for translation in the future, only that working with these languages may require a different approach than the one used to date.

*The South African government does a commendable job of translating government documents and information into each of the nation’s eleven official languages, including Afrikaans, Zulu, and Xhosa.

Michael Hancock-Parmer, April 30, 2012 at 12:39 pm

Speaking for Central Asian languages, I don’t believe these setbacks apply to the degree A Mango suggests. There is a vast wealth of fiction and non-fiction available in Central Asian languages with accompanying Russian translation. I’m not sure what the critical mass is for such machine translations to work, but there’s certainly enough to start. There are also several books in English translation that were originally in Central Asian languages, though the number pales before the number in Russian. In short, I don’t think Google has a good excuse besides its (unwitting?) continuation of institutionalized racism and nationalism. Uzbek and Kazakh are just not as important as Irish and Afrikaans in the Google-envisioned reality in which we live. And that needs to change.

They don’t have these setbacks? Sure, there are translations, but it seems fairly obvious that the volume of machine-readable, reliable translations is not as high for many of the currently unavailable languages. Clearly, though, the problem is surmountable, and if you look at the development history, you see that options really began to take off once translation began to happen via another language (usually English). Don’t be surprised to see Uzbek or even Tajik and Turkmen added in the next couple of years. I have no idea how the algorithm actually works, but I wouldn’t be surprised if the quality of the Russian-to-English translation had to hit a certain level before Central Asian languages, like Armenian and Azerbaijani before them, could be added.

Let’s not forget, though, that Google has no moral obligation to offer translation for every language under the sun. It’s no more institutionally racist or nationalist to do what your algorithm allows you to do well than it is to choose to study the language and literature of just one part of the world. Given that most of us live in a world where money is used to assign value to labor and resources and to decide how to apply them, it shouldn’t be terribly surprising that the languages of Africa’s biggest economy and of one of Europe’s most dynamic economies of the last couple of decades get some love, especially when these are languages that the algorithm would likely be able to tackle fairly easily. It’s borderline miraculous that languages spoken in areas relatively less connected to the global system are being included at all.

The problem is that English is the “bridge” language within the software, and although there are large amounts of text available in both Russian and Kyrgyz, there is comparatively little available in both Kyrgyz and English. Machine translation requires a huge amount of text in order to be effective. That’s why a bunch of interesting young guys have organized a crowdsourced effort to translate texts at enetil.kg. Kazakh will have a similar problem, though if Kazakhstan wants to, I’m sure it could throw money at the problem to get the same or better results.

Either way, expect it to take a year or two before Kazakh shows up even as a beta language, and considerably longer than that before the translations become adequate for more than basic use.
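The bridge-language constraint described in this comment can be sketched in a few lines of Python. This is a simplified word-lookup illustration under stated assumptions, not how Google Translate is implemented: the tables `KY_TO_EN` and `EN_TO_RU` and the functions `pivot_translate` and `coverage` are hypothetical, and the three-word vocabulary is only for demonstration.

```python
# Hypothetical lookup tables. The Kyrgyz-English table is deliberately sparse
# to mimic the shortage of texts that exist in both Kyrgyz and English.
KY_TO_EN = {"китеп": "book", "суу": "water"}
EN_TO_RU = {"book": "книга", "water": "вода", "mountain": "гора"}

def pivot_translate(word, src_to_bridge, bridge_to_tgt):
    """Translate via the bridge language; None if either hop has no entry."""
    bridge_word = src_to_bridge.get(word)
    if bridge_word is None:
        return None
    return bridge_to_tgt.get(bridge_word)

def coverage(words, src_to_bridge, bridge_to_tgt):
    """Fraction of `words` that survive both hops through the bridge."""
    translated = [
        w for w in words
        if pivot_translate(w, src_to_bridge, bridge_to_tgt) is not None
    ]
    return len(translated) / len(words)

sample = ["китеп", "суу", "тоо"]  # "book", "water", "mountain" in Kyrgyz
for w in sample:
    print(w, "->", pivot_translate(w, KY_TO_EN, EN_TO_RU))
print("coverage:", coverage(sample, KY_TO_EN, EN_TO_RU))
```

In this sketch the English-Russian table knows "mountain", yet the Kyrgyz word for it still fails to translate, because the first hop into the bridge language is the weak link. However rich the Kyrgyz-Russian material is, it does not help unless the system can use it directly.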

Andrea Doray, May 1, 2012 at 10:56 am

What a wonderful and heart-rending post! I also appreciate all your comments and insight about machine translation. I am astonished, actually, that so many ways exist to circumvent Babylon. I’ve used Google Translate to communicate with my friends in Turkmenistan who understand Turkmen and Russian, but not English. My Turkmen is too limited, so even a mediocre translation to Russian keeps us connected. I’ve obtained English translations of Turkmen writing, poetry primarily, and it’s true: Central Asia also speaks of the haunting beauty of its steppes, the challenges and delights of its deserts, and the hearts of its people. Knowing their voices, in just the limited ways that I do, I too hope that these voices will soon be heard in their native languages.

Paul, May 12, 2012 at 10:02 am

Yes, as mentioned by the commenters above, the onus is not on Google, as the article implies, but on citizens themselves. Kyrgyz Wikipedia is flourishing solely because of the activism of a group that holds article-writing retreats where each member writes 40-50 articles. The success of Kyrgyz Google Translate will also depend on what happens inside the country, rather than on waiting for internet companies to wave a magic wand.

Kyrgyzstan and other states mentioned are lucky that they have a state to safeguard their language, so any comparison with sub-Saharan Africa, where literally thousands of languages enjoy no constitutional safety and could disappear in a few lifetimes, is misplaced. Google will give the Kyrgyz the opportunity to have accurate Kyrgyz to x language translations if they do some legwork!