Junichi Yamagishi

The Centre for Speech Technology Research

Thousands of Voices and Geographical GUI for HMM-based Speech Synthesis

Our robust
speaker-adaptive speech synthesis system can generate the voice of
any speaker. It only requires a small amount of data from each
speaker because it uses model adaptation. This means that it is now
possible to create a virtually unlimited number of different
voices.

In fact, we believe this is the largest
known collection of synthetic voices in existence. We built so many
voices (1500+ voices built on ASR corpora plus several voices built
on TTS corpora using the same techniques) that it became impossible
to represent them in list or table form. Instead, we devised an
interactive geographical representation, shown above. Each marker
corresponds to an individual speaker. Blue markers show male
speakers and red markers show female speakers. Some markers are in
arbitrary locations (in the correct country) because precise
location information is not available for all speakers. The box on
the right lists the speakers that the user can choose from, along
with each speaker's gender and nationality. The interface is built on
the Google Maps and AJAX Language (Translation) APIs, together with
our Festival TTS system running on a University of Edinburgh server.
Clicking on a marker
will play synthetic speech from that speaker. Currently the
interactive mode supports all English and some of the Spanish
voices. For other languages only pre-synthesised examples are
available, but we plan to add an interactive text-to-speech feature
in the very near future.
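
As a rough illustration of the marker scheme described above, the per-speaker data behind such a map could be modelled as in the following minimal Python sketch. It is not the code used by the demo: the `Speaker` class, its field names, the helper functions, the speaker identifiers and the coordinates are all hypothetical and chosen only to show the mapping from gender to marker colour and the contents of the speaker list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """One synthetic voice shown on the map (hypothetical structure)."""
    speaker_id: str
    gender: str           # "male" or "female"
    country: str
    latitude: float
    longitude: float
    exact_location: bool  # False if the marker was placed arbitrarily within the country


def marker_colour(speaker: Speaker) -> str:
    """Blue markers show male speakers and red markers show female speakers."""
    colours = {"male": "blue", "female": "red"}
    return colours[speaker.gender]


def side_panel_entries(speakers: list[Speaker]) -> list[str]:
    """Build the speaker list with gender and nationality, sorted by identifier."""
    ordered = sorted(speakers, key=lambda s: s.speaker_id)
    return [f"{s.speaker_id} ({s.gender}, {s.country})" for s in ordered]


# Made-up example entries; identifiers and coordinates are illustrative only.
speakers = [
    Speaker("spk_b", "female", "Spain", 40.42, -3.70, exact_location=False),
    Speaker("spk_a", "male", "United Kingdom", 55.95, -3.19, exact_location=True),
]

for s in speakers:
    print(s.speaker_id, marker_colour(s))
print(side_panel_entries(speakers))
```

Keeping an explicit `exact_location` flag is one way to record which markers sit at an arbitrary point within the correct country rather than at a known location.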

What's more, the method is almost completely automatic and can even
work from existing recordings such as speeches, movies, TV and
podcasts. This will enable new applications of text-to-speech
technology. Please click the [CELEB] section of the demo!

For details, please refer to the following journal paper published
by the IEEE: