WaveNet: Mimicking Human Speech through Neural Networks

Google has been driving AI personification for years now, most visibly through DeepMind, which was founded in 2010 and acquired by Google in 2014. Now, almost seven years after DeepMind's founding, they are still strapped in the left seat. In a new blog post on their latest work, the DeepMind team presents a working neural network called “WaveNet” that is designed to replicate human speech. Below is an excerpt from the post:

The above animation shows how a WaveNet is structured. It is a fully convolutional neural network, where the convolutional layers have various dilation factors that allow its receptive field to grow exponentially with depth and cover thousands of timesteps.
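To make the point about dilation concrete, here is a minimal sketch of how the receptive field of a stack of dilated causal convolutions can be computed. The kernel size of 2 and the doubling dilation schedule (1, 2, 4, ..., 512) are assumptions chosen for illustration, and `receptive_field` is a hypothetical helper, not something taken from the excerpt.

```python
def receptive_field(dilations, kernel_size=2):
    """Input timesteps visible to one output of a stack of dilated causal convolutions.

    Each layer with dilation d and kernel size k widens the field by (k - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed schedule: the dilation doubles at every layer (1, 2, 4, ..., 512).
doubling = [2 ** i for i in range(10)]

# Baseline for comparison: ten ordinary layers with no dilation.
undilated = [1] * 10

# Assumed variant: the same doubling block stacked three times.
stacked = doubling * 3

print(receptive_field(doubling))   # 1024
print(receptive_field(undilated))  # 11
print(receptive_field(stacked))    # 3070
```

Under these assumptions, ten undilated layers see only 11 timesteps, while ten layers with doubling dilations see 1024, and repeating the block pushes the field into the thousands.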

At training time, the input sequences are real waveforms recorded from human speakers. After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio.
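The sampling procedure described above can be sketched as a simple loop: predict a distribution, draw one value, append it, and repeat. This is only an illustration. `toy_predict` is a hypothetical stand-in for the trained network, and the 256 quantization levels and the 1024-step context window are assumptions, not details given in the excerpt.

```python
import math
import random

NUM_LEVELS = 256  # assumption: audio samples quantized to 256 discrete values


def toy_predict(context, num_levels=NUM_LEVELS):
    """Hypothetical stand-in for the trained network.

    Returns a probability distribution over the next sample value,
    peaked around the most recent sample in the context.
    """
    center = context[-1] if context else num_levels // 2
    weights = [math.exp(-0.5 * ((v - center) / 4.0) ** 2) for v in range(num_levels)]
    total = sum(weights)
    return [w / total for w in weights]


def generate(predict, num_steps, context_size=1024, seed=0):
    """Generate samples one step at a time, feeding each draw back as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_steps):
        context = samples[-context_size:]  # the model only sees a bounded window
        probs = predict(context)           # distribution over the next value
        value = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        samples.append(value)              # fed back into the input on the next step
    return samples


audio = generate(toy_predict, num_steps=100)
print(len(audio))
```

Because each step needs the value drawn at the previous step, the loop cannot be parallelized across time, which is why the excerpt calls this kind of generation computationally expensive.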

About Johnmark James

Hello, I am a machine cognition enthusiast and the podcaster of the MPCR lab. I currently attend Embry-Riddle Aeronautical University, majoring in Aeronautical Science and Unmanned Aircraft Systems.