DeepMind’s WaveNet Voice Synthesizer Is Live in Google Assistant


No one knew exactly what DeepMind was up to when it was acquired by Google a few years back. Now DeepMind is an Alphabet company, working on big machine learning problems like beating humans at Go, improving AI problem solving, and making computer-generated speech more realistic. On that last count, you can experience the fruits of DeepMind’s labors right now if you’ve got an Android phone or a Google Home. The “WaveNet” voice engine is now available in Google Assistant.

Google launched Assistant about a year ago as an evolution of its existing Google voice command system. For the first time, Google voice interactions were available not only on phones, but also as a part of your home with the Google Home smart speaker. Assistant gives you access to Google search data, device control, and smart home integrations. It’s available on all Android phones running v6.0 or higher by long-pressing the home button. So, you don’t have to buy a Google Home to experience Assistant.

The voice model used in Assistant at launch wasn’t bad, but Google just rolled out a vastly improved version of the voices for English and Japanese. DeepMind confirms these are implementations of WaveNet, which it first demoed in 2016. At the time, WaveNet was too computationally intensive for use on consumer devices, but just over a year later, that has changed. You can experience the new Assistant voice below, or open Assistant on your phone and go to Settings > Preferences > Assistant Voice.

WaveNet is a form of parametric text-to-speech (TTS) that is entirely synthetic. Until recently, virtually all TTS systems were based on concatenative synthesis. In concatenative TTS, a large volume of high-quality recordings of a real voice is chopped up into small units and reassembled to form the words. This is expensive and still won’t sound entirely human. Parametric TTS is cheaper, but it often sounds even more robotic.
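The concatenative approach described above can be sketched in a few lines: pre-recorded unit waveforms are looked up and stitched together, with a short crossfade at each joint to smooth the seams. This is a toy illustration with made-up unit names and sine bursts standing in for real recordings, not any production TTS pipeline.

```python
import numpy as np

RATE = 16000  # samples per second

def make_unit(freq, dur=0.05):
    """Stand-in for a recorded speech unit: a short sine burst."""
    t = np.arange(int(RATE * dur)) / RATE
    return np.sin(2 * np.pi * freq * t)

# A "voice database" of pre-recorded units (hypothetical names).
units = {"he": make_unit(220), "llo": make_unit(330)}

def concatenate(unit_names, fade=80):
    """Join units end to end, crossfading `fade` samples at each boundary."""
    out = units[unit_names[0]].copy()
    ramp = np.linspace(0, 1, fade)
    for name in unit_names[1:]:
        nxt = units[name]
        # Blend the tail of the running output into the head of the next unit.
        out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
        out = np.concatenate([out, nxt[fade:]])
    return out

wave = concatenate(["he", "llo"])
```

Even with crossfading, the joins between units recorded in different contexts are what give concatenative voices their characteristic unevenness, which is the problem WaveNet sidesteps by generating the waveform directly.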

DeepMind used a convolutional neural network trained on a large sample of human speech. The resulting speech synthesizer generates more believable voice waveforms from scratch, at a rate of 16,000 samples per second. The audio from WaveNet picks up on natural inflection and accents better, which prevents the flat “robotic” feel from creeping in as often.
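Two ideas make this sample-by-sample generation workable, and both can be sketched briefly: causal convolutions with dilations that double per layer (so the context window grows exponentially with depth), and an autoregressive loop that feeds each generated sample back in as input for the next. The layer sizes and the toy predictor below are assumptions for illustration, not DeepMind’s actual architecture.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal 1-D convolution: y[t] depends only on x[t], x[t - dilation], ..."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i, wi in enumerate(w):
            j = t - i * dilation
            if j >= 0:
                y[t] += wi * x[j]
    return y

# With kernel size 2 and dilations doubling each layer, four layers
# already see 16 past samples: receptive field = 1 + sum(dilations).
dilations = [1, 2, 4, 8]
kernel = [0.5, 0.5]
receptive_field = 1 + sum((len(kernel) - 1) * d for d in dilations)

def generate(n_samples, predict):
    """Autoregressive loop: each new sample conditions on all prior ones."""
    audio = []
    for _ in range(n_samples):
        audio.append(predict(audio))
    return audio

# Toy predictor (an assumption): a decaying echo of the previous sample.
samples = generate(5, lambda a: 1.0 if not a else 0.9 * a[-1])
```

The expensive part is the generation loop: producing one second of audio means running the network 16,000 times in sequence, which is why the original 2016 demo was too slow for consumer devices.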

The new WaveNet model running as part of Google Assistant is 1,000 times faster than the demo version, allowing it to generate 20 seconds of high-quality audio in just one second. DeepMind promises a full paper soon that will detail how this was accomplished.