5aSC41. Modeling segment durations with artificial neural networks.

Session: Friday Morning, December 6


Prediction of segment duration in TTS systems has in the past generally
been accomplished with arithmetic approaches such as multiplicative and
incompressibility models [D. Klatt, J. Acoust. Soc. Am. 54, 1102--1104 (1973);
R. Port, J. Acoust. Soc. Am. 69, 262--274 (1981)] and sums-of-products models
[J. van Santen, Comput. Speech Lang. 8, 95--128 (1994)]. Other research, however,
suggests a more complex speech timing system than is captured by such models [H.
Gopal, J. Phon. 18, 497--518 (1990)]. In this study, a limited domain of vowel
duration phenomena is modeled with several designs of simple feedforward
networks. The networks' performance is then examined by using their output in
our TTS system and evaluating the naturalness of the resulting utterances in a
perceptual experiment. Preliminary results indicate that simple two-layer
perceptrons are able to learn the basic patterns of environmentally conditioned
variations in segment duration, while more sophisticated networks are required
to capture the complexities of these factors' interactions.
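To make the architecture concrete, the following is a minimal sketch of a two-layer perceptron (one hidden layer) trained to map contextual features of a vowel to a duration. The feature set, target durations, and network sizes below are invented for illustration; they are not the study's materials or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary context features for a vowel:
# [stressed, phrase-final, next consonant voiced];
# the target durations (ms) are made up for this demo.
X = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1],
              [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([70., 90., 110., 135., 95., 120., 145., 180.])
t = y / 100.0                        # scale targets for stable training

H = 8                                # hidden units
W1 = rng.normal(0, 0.5, (3, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(20000):               # full-batch gradient descent on MSE
    h = np.tanh(X @ W1 + b1)         # hidden-layer activations
    out = (h @ W2 + b2).ravel()      # linear output (scaled duration)
    err = out - t
    gW2 = h.T @ err[:, None] / len(X)
    gb2 = err.mean()
    dh = (err[:, None] * W2.T) * (1 - h * h)   # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

pred = 100.0 * (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()  # durations in ms
```

With enough hidden units, even this small network can memorize the conditioned duration patterns in the toy data; capturing the interactions among many such factors, as the abstract notes, calls for more sophisticated architectures.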