A year ago, DeepMind, the Alphabet lab working on artificial intelligence, introduced WaveNet, a deep neural network for “generating raw audio waveforms” that is capable of producing more realistic-sounding speech than existing techniques.

Over the past 12 months, the team has worked on making this “computationally intensive” research prototype efficient enough to run in consumer products, starting with the Google Assistant voices for US English and Japanese. The new model generates waveforms 1,000 times faster than the original, with higher fidelity and resolution.

This generative approach to text-to-speech is a big step forward: previously, a human voice actor had to record a large database of speech fragments that would then get stitched together.

However, these concatenative systems can produce unnatural-sounding voices and are also difficult to modify, because an entirely new database has to be recorded whenever a change, such as a new emotion or intonation, is needed.
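To make the contrast concrete, here is a toy sketch of that older concatenative approach: prerecorded fragments are simply looked up and glued together in order. The unit names and waveform values are invented for illustration, not taken from any real speech database.

```python
# A made-up "database" of prerecorded speech fragments: unit name -> samples.
UNIT_DATABASE = {
    "HH": [0.1, 0.3, 0.2],
    "EH": [0.5, 0.6, 0.4],
    "L":  [0.2, 0.1, 0.0],
    "OW": [0.4, 0.5, 0.3],
}

def synthesize(units):
    """Stitch prerecorded fragments together, one after another."""
    waveform = []
    for unit in units:
        waveform.extend(UNIT_DATABASE[unit])  # look up and append each fragment
    return waveform

# A crude unit sequence for "hello"
audio = synthesize(["HH", "EH", "L", "OW"])
print(len(audio))  # 12 samples
```

The limitation described above falls out of this design: the output can only ever contain sounds that exist in `UNIT_DATABASE`, so a new speaking style means re-recording the whole database.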

DeepMind’s breakthrough last year was a “deep generative model that can create individual waveforms from scratch.” It allowed for speech that sounds more natural and flows better, with realistic intonation, accents, and even skeuomorphic details like “lip smacks.”