Deep Voice: Real-Time Neural Text-to-Speech for Production


Baidu Research presents Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Until now, the biggest obstacle to building such a system has been the speed of audio synthesis: previous approaches have taken minutes or hours to generate only a few seconds of speech. We solve this challenge and show that we can synthesize audio in real time, a speedup of up to 400X over previous WaveNet inference implementations.

Synthesizing artificial human speech from text, commonly known as text-to-speech (TTS), is an essential component in many applications such as speech-enabled devices, navigation systems, and accessibility for the visually impaired. Fundamentally, it allows human-technology interaction without requiring visual interfaces.

Modern TTS systems are based on complex, multi-stage processing pipelines, and each stage may rely on hand-engineered features and heuristics. Due to this complexity, developing new TTS systems can be very labor-intensive and difficult.

Deep Voice is inspired by traditional text-to-speech pipelines and adopts the same structure, while replacing all components with neural networks and using simpler features. This makes our system more readily applicable to new datasets, voices, and domains without any manual data annotation or additional feature engineering.

Deep Voice lays the groundwork for truly end-to-end speech synthesis without a complex processing pipeline and without relying on hand-engineered features for inputs or pre-training.

Our current pipeline is not yet end-to-end, and consists of a phoneme model and an audio synthesis component. The clips below are synthesized from text with our entire pipeline. Here are two utterances chosen at random.