We introduce a technique for augmenting neural text-to-speech (TTS) with low dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-ofthe-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but constructed with higher performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.

Deep Voice 2 learns to generate speech by finding shared qualities between different voices. Specifically, each voice corresponds to a single vector – about 50 numbers which summarize how to generate sounds that imitate the target speaker. Unlike all previous TTS systems, Deep Voice 2 learns these qualities from scratch, without any guidance about what makes voices distinguishable.

In this work, we explore how entirely-neural speech synthesis pipelines may be extended to multispeaker text-to-speech via low-dimensional trainable speaker embeddings. We start by presenting Deep Voice 2, an improved single-speaker model. Next, we demonstrate the applicability of our technique by training both multi-speaker Deep Voice 2 and multi-speaker Tacotron models, and evaluate their quality through MOS. In conclusion, we use our speaker embedding technique to create high quality text-to-speech systems and conclusively show that neural speech synthesis models can learn effectively from small amounts of data spread among hundreds of different speakers.

The results presented in this work suggest many directions for future research. Future work may test the limits of this technique and explore how many speakers these models can generalize to, how little data is truly required per speaker for high quality synthesis, whether new speakers can be added to a system by fixing model parameters and solely training new speaker embeddings, and whether the speaker embeddings can be used as a meaningful vector space, as is possible with word embeddings.