Speech generation in a spoken dialogue system

SUNScholar Research Repository

Visagie, Albertus Sybrand

2004-12

ENGLISH ABSTRACT: Spoken dialogue systems accessed over the telephone network are rapidly becoming more
popular as a means to reduce call-centre costs and improve customer experience. It is
now technologically feasible to delegate repetitive and relatively simple tasks conducted
in most telephone calls to automatic systems. Such a system uses speech recognition to
take input from users. This work focuses on the speech generation component that a
specific prototype system uses to convey audible speech output back to the user.

Many commercial systems contain general text-to-speech synthesisers. Text-to-speech
synthesis is a very active branch of speech processing. It aims to build machines that
read text aloud. In some languages this has been a reality for almost two decades. While
these synthesisers are often highly intelligible, they almost never sound natural. The
output quality of synthetic speech is considered to be a very important factor in the user’s
perception of the quality and usability of spoken dialogue systems.

The static nature of the spoken dialogue system is exploited to produce a custom
speech synthesis component that provides very high quality output speech for the particular
application. To this end the current state of the art in speech synthesis is surveyed
and summarised. A unit-selection synthesiser is produced that functions in Afrikaans,
English and Xhosa.

The unit-selection synthesiser selects short waveforms from a recorded speech corpus,
and concatenates them to produce the required utterances. Techniques are developed for
designing a compact corpus and processing it to produce a unit-selection database. Speech
modification methods are investigated to build a framework for natural-sounding speech
concatenation. This framework also provides pitch and duration modification capabilities
that will enable research in languages such as Afrikaans and Xhosa, where text-to-speech
capabilities are relatively immature.
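
The select-and-concatenate idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Unit` fields, the pitch-only `join_cost`, and the function names are assumptions made for illustration, and no target cost is modelled. The search is a standard dynamic-programming (Viterbi) pass that picks one corpus unit per target label so that the summed join cost is smallest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded waveform fragment (hypothetical structure)."""
    label: str      # phonetic label of the unit, e.g. a diphone name
    pitch: float    # assumed single join feature, in Hz
    samples: tuple  # waveform samples of this unit


def join_cost(left: Unit, right: Unit) -> float:
    # Assumed join cost: pitch mismatch between adjacent units.
    return abs(left.pitch - right.pitch)


def select_units(targets, corpus):
    """Choose one unit per target label, minimising the total join cost."""
    if not targets:
        return []
    candidates = [[u for u in corpus if u.label == t] for t in targets]
    if any(not c for c in candidates):
        raise ValueError("corpus lacks a unit for some target label")

    # best[i][j] = (cheapest cost of a path ending in candidates[i][j],
    #               index of the predecessor in candidates[i - 1])
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            costs = [
                best[i - 1][k][0] + join_cost(prev, unit)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_min], k_min))
        best.append(row)

    # Trace back from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path


def concatenate(units):
    """Join the selected waveforms end to end (no smoothing at the joins)."""
    out = []
    for unit in units:
        out.extend(unit.samples)
    return out
```

A real system would add a target cost and smooth the joins, for example with the pitch and duration modification the abstract mentions; this sketch only shows the selection and concatenation steps.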