Sound demos for "WaveFlow: A Compact Flow-based Model for Raw Audio"

Our small WaveFlow has 5.91M parameters (i.e., 64 residual channels) and can synthesize high-fidelity 22.05 kHz raw audio 42.6× faster than real time on a GPU. In contrast, WaveGlow requires 87.8M parameters (i.e., 256 residual channels) to generate high-fidelity audio, and its performance degrades quickly with fewer residual channels. We also present audio samples from Gaussian autoregressive WaveNet and ClariNet.
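To put these figures in perspective, the short sketch below converts the reported speedup into a sample throughput and compares the two parameter counts. It uses only the numbers quoted above. The function names are illustrative and do not come from the WaveFlow code, and it assumes "42.6× faster than real time" means that 42.6 seconds of audio are generated per second of wall-clock time.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the text.
# All function names are hypothetical helpers, not part of any WaveFlow release.

SAMPLE_RATE_HZ = 22_050      # 22.05 kHz audio
SPEEDUP = 42.6               # reported speedup over real time on a GPU
WAVEFLOW_PARAMS_M = 5.91     # small WaveFlow, 64 residual channels
WAVEGLOW_PARAMS_M = 87.8     # WaveGlow, 256 residual channels


def samples_per_second(sample_rate_hz: float, speedup: float) -> float:
    """Audio samples generated per wall-clock second (assumed interpretation)."""
    return sample_rate_hz * speedup


def synthesis_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize a clip of the given duration."""
    return audio_seconds / speedup


def param_ratio(large_m: float, small_m: float) -> float:
    """How many times more parameters the larger model has."""
    return large_m / small_m


if __name__ == "__main__":
    rate = samples_per_second(SAMPLE_RATE_HZ, SPEEDUP)
    print(f"Throughput: about {rate:,.0f} samples per second")
    print(f"10 s of audio takes about {synthesis_seconds(10.0, SPEEDUP):.3f} s")
    ratio = param_ratio(WAVEGLOW_PARAMS_M, WAVEFLOW_PARAMS_M)
    print(f"WaveGlow has about {ratio:.1f}x the parameters of small WaveFlow")
```

Under that reading, the small model produces roughly 939 thousand samples per second, and WaveGlow carries close to 15 times as many parameters.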

Audio synthesis conditioned on mel spectrogram

WaveFlow (64-layer, res. channels = 256)

WaveGlow (96-layer, res. channels = 256)

Ground-truth (recorded speech)

WaveFlow (64-layer, res. channels = 128)

WaveGlow (96-layer, res. channels = 128)

WaveNet (30-layer, res. channels = 128)

WaveFlow (64-layer, res. channels = 64)

WaveGlow (96-layer, res. channels = 64)

ClariNet (60-layer, res. channels = 64)

Text-to-speech synthesis

The rainbow passage: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it.