Google Makes AI Sound Natural

Do you know anyone who enjoys speaking with automated or interactive voice response systems? They offer many benefits, for sure. But these systems are often impersonal as well, delivering a frustrating, unintuitive, unnatural experience.

Now imagine how companies would feel if the tables were turned and you used artificial intelligence to interact with them. Imagine having a digital assistant handle routine calls for you, like scheduling hair appointments or making dinner reservations. Do you think your stylist would be put off by a digital assistant asking for a cut and color at noon on Saturday?

Probably not, if the assistant sounded and acted like a live human being instead of a robot.

Google’s new Assistant, powered by Duplex

At Google’s annual developer conference in May, the company introduced Duplex: technology that gives Google’s new digital Assistant natural, spontaneous conversational skills. To the delight and awe of attendees, Google played recordings of Assistant booking a hair appointment and making dinner reservations.

In each demo, Assistant carried on a two-way conversation like you or I would, complete with natural pauses, inflections and appropriate “mmhmms” and “gotchas.” It also responded to questions and cues, and handled impromptu twists in the conversation, like a restaurant not accepting reservations for parties with fewer than six people.

The biggest challenge the engineers faced when creating the Duplex engine was mimicking the flow of conversation. Human behavior is tricky and unpredictable, and natural language is complex and difficult to comprehend, which makes it all the more remarkable that they were able to create natural-sounding speech complete with expected pauses, responses and tone changes.

According to Google’s AI blog, Google Duplex is successful because of its advances in understanding, interacting, timing and speaking.

Duplex is Google’s gaggle of innovations

If you want to get technical, the blog states, “At the core of Duplex is a recurrent neural network (RNN) designed to cope with these challenges, built using TensorFlow Extended (TFX). To obtain its high precision, we trained Duplex’s RNN on a corpus of anonymized phone conversation data. The network uses the output of Google’s automatic speech recognition (ASR) technology, as well as features from the audio, history of the conversation, parameters of the conversation (e.g., the desired service for an appointment or the current time of day) and more. We trained our understanding model separately for each task, but leveraged the shared corpus across tasks. Finally, we used hyperparameter optimization from TFX to further improve the model.”

Phew! That’s a mouthful. And then to make the voice itself sound natural, Google added, “We use a combination of a concatenative text to speech (TTS) engine and a synthesis TTS engine (using Tacotron and WaveNet) to control intonation depending on the circumstance.”
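To make the understanding model in the first quote a little less abstract, here is a toy sketch of the general idea: at each turn of the call, several kinds of input (recognized words, an audio feature, a conversation parameter such as the time of day) are packed into one feature vector, and a recurrent network blends that vector with a hidden state that carries the history of the conversation. This is not Google's code. The names `build_features` and `ToyRNN`, the tiny vocabulary, and the random, untrained weights are all assumptions made up for illustration.

```python
import math
import random


def build_features(asr_tokens, audio_energy, hour_of_day, vocab):
    """Pack one conversational turn into a flat feature vector.

    Hypothetical encoding: a bag-of-words over the ASR output, one audio
    feature, and one conversation parameter (time of day, scaled to 0..1).
    """
    bag = [1.0 if word in asr_tokens else 0.0 for word in vocab]
    return bag + [audio_energy, hour_of_day / 24.0]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyRNN:
    """A minimal Elman-style recurrent cell with random, untrained weights."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                     for _ in range(hidden_size)]
        self.w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                      for _ in range(hidden_size)]
        # The hidden state plays the role of "history of the conversation".
        self.hidden = [0.0] * hidden_size

    def step(self, features):
        """Combine the current turn with the running hidden state."""
        from_input = matvec(self.w_in, features)
        from_history = matvec(self.w_rec, self.hidden)
        self.hidden = [math.tanh(a + b)
                       for a, b in zip(from_input, from_history)]
        return self.hidden


vocab = ["table", "haircut", "saturday", "noon"]
rnn = ToyRNN(input_size=len(vocab) + 2, hidden_size=3)

turn = build_features(["haircut", "saturday", "noon"], 0.3, 12, vocab)
state_after_first_turn = rnn.step(turn)
state_after_second_turn = rnn.step(turn)  # same words, different history
```

Feeding the same turn in twice produces two different hidden states, because the second step also sees what came before. That memory of earlier turns is what lets a recurrent model follow a conversation rather than treat each sentence in isolation.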

Acronym soup

To create Duplex, then, the engineers combined a long list of technologies, including RNNs, TFX, ASR, TTS, Tacotron and WaveNet, into an Assistant with game-changing new conversational capabilities.

Keep an eye out for it in the near future, as Google plans to test Duplex technology in Assistant this summer, helping users make reservations, book appointments and ask about store hours.

In the meantime, we’ll just have to rough it and call the barber by pressing the numbers on our phone. Or try giving voice commands to the AI assistants available from other major tech companies, such as Amazon’s Alexa, Apple’s Siri or Microsoft’s Cortana.