How AT&T Can Translate Your Speech in Real Time

AT&T Translator, a service on the company's teleconference system that translates speech between languages in real time, is currently in pilot testing by some of the company's biggest business customers. PopMech caught up with Mazin Gilbert, assistant vice president for technical research at AT&T Labs–Research, to learn about the challenges of teaching machines to understand human speech.

Machine-based language translation has been a longtime dream of science-fiction authors. C-3PO, after all, was fluent in more than 6 million forms of communication. What inspired your researchers to develop AT&T Translator?

Language is one of the largest barriers to communication globally. In the 1980s, we produced a short film of what communications would be like in the future. We had a vision that at some point in our lifetime there would be some intelligence in the network where you could pick up the phone and talk to anyone in the world regardless of the language you spoke.

How did you turn that vision into a reality?

The technology is a product of more than two decades of research at AT&T in speech recognition, speech synthesis, and natural language processing. There's nothing else like this in the world for enabling multiple parties to converse in real time across languages. It requires tremendous expertise in linguistics, machine learning, speech, and signal processing that we have at AT&T.

We demonstrated the first prototype of English-to-Spanish translation in the lab in 1988 (and continued to research and refine the technology). But given that we're a communications company, it fits into our business nicely and that's why we're focused on pushing it out to the market.

What is the user experience like?

You call into a conferencing service. Your user and audience can be any place in the world. You set your preference for native language (or languages), [and] what you hear or read is that speaker in your native language. You can speak in your language and they will receive it in their native language too. It's really very transparent.

Which languages does the translating system currently understand?

English, French, Italian, German, Spanish . . . and Chinese, Japanese [and] Korean, all from speech in and out, [and] 12 other languages from text which we will roll out to speech over time.

What happens when the person talking in Spanish suddenly switches to English to read out loud a street address?

We can deal with that. It's not a simple problem to identify what language a person is speaking. But one of the technologies we have is identifying the language as you speak. So when you change your language (which is not uncommon), then we are able to detect that.

How many steps does it take to translate in real time?

There are many components to this problem. To do multiparty communication, you need people who understand how to do really high-quality speech recognition. Then you need a team to translate that to the target language. Since we don't know what the conversation is going to be about, we have to worry about scale: unlimited vocabulary, [and] the words may be in more than one language.

There are huge numbers of parts back and forth. And you need a team that can work on text-to-speech. Finally, they (the end user) have to hear it in a compelling voice that doesn't sound like a machine is talking.

What's the hardest part?

There are many, many challenges. The hardest part is the real-time nature of this. You have to recognize the language, transcribe the language, translate the language, and do it while the person is talking. The processing power this takes is enormous. It's (also) a very expensive endeavor.
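
As a rough illustration of the steps Gilbert lists (recognize the language, transcribe, translate, and do it while the person is still talking), here is a minimal Python sketch. It is a toy under stated assumptions, not AT&T's system: the word lists, the function names, and the use of short text chunks in place of live audio are all hypothetical stand-ins.

```python
from typing import Iterable, Iterator, List

# Hypothetical toy data; a real system would use trained models, not word lists.
SPANISH_WORDS = {"hola", "gracias", "calle"}
ES_TO_EN = {"hola": "hello", "gracias": "thanks", "calle": "street"}


def detect_language(words: List[str]) -> str:
    """Guess a chunk's language from a tiny Spanish word list."""
    return "es" if any(w in SPANISH_WORDS for w in words) else "en"


def translate_to_english(words: List[str], source: str) -> List[str]:
    """Word-by-word lookup; English input and unknown words pass through."""
    if source == "en":
        return words
    return [ES_TO_EN.get(w, w) for w in words]


def translate_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Process each chunk as it arrives rather than waiting for the full utterance."""
    for chunk in chunks:
        words = chunk.lower().split()    # stands in for speech recognition
        source = detect_language(words)  # the language may change between chunks
        yield " ".join(translate_to_english(words, source))


if __name__ == "__main__":
    # The speaker switches from Spanish to English partway through.
    for line in translate_stream(["hola gracias", "main street"]):
        print(line)
```

Because the language is detected for each chunk, a speaker who switches languages mid-conversation is handled without restarting anything. A production system would replace every stage with a trained model and would add a speech-synthesis step at the end.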

How does this compare in quality to other automated translators, like Google Translate, the robot that translates Web pages into another language? Google's service doesn't always deliver smooth-reading article translations.

A smooth delivery of translation is certainly a quirk to many translation services, given the complexity of language, and no system is perfect today. What makes (AT&T Translator) a strong competitor in speech and language services is that it is powered through the cloud, which provides lower latency and faster results for the end user.

It also uses machine-learning technology, meaning that its accuracy in speech and language services, including translation, improves every time the system is used. We've invested decades in research and development of speech technologies and have more than 600 U.S. patents and additional patent applications, in part to develop more natural-sounding speech that provides a smoother user experience.

How else is this product different from the robotic voices we're used to hearing on airport shuttles, GPS mapping devices, and the like?

One of the things we're working on is to make the (translated) voice sound like the speaker speaking in a different language. Intonation [voice quality] carries a lot of information . . . and you want to take and convey that to the listener. That is our goal: that it will sound like you.

Where is this technology headed? Might we see a real C-3PO one day?

We envision a world where language is no longer a barrier for communication, whether for entertainment, health care, language learning, education, hospitality or just conferencing.

Communication will happen across any device. You will be seeing people interacting across languages, and speaking your language through the magic AT&T network where the intelligence resides. You will be able to watch any program on demand in your own native language, and do that anywhere in the world. The opportunities are endless.
