EXPRESSIVE SPEECH SYNTHESIS USING AMERICAN ENGLISH TOBI: QUESTIONS AND CONTRASTIVE EMPHASIS
John F. Pitrelli and Ellen M. Eide
Proceedings of IEEE ASRU 2003: Automatic Speech Recognition and
Understanding Workshop, St. Thomas, U.S. Virgin Islands, December 1-4, 2003.
ABSTRACT:
We describe American English concatenative text-to-speech synthesis experiments
in which "expressions" (here, questioning and contrastive emphasis) are
each associated with a ToBI prosodic template. ToBI labels, along with
text features, are in turn incorporated into decision-tree models of F0
and segment duration to be used during synthesis, avoiding the need for
large expression-specific corpora and decision trees. Synthesizing using this
approach enables listeners to perform the difficult task of distinguishing
yes-no questions from identically worded declarative sentences 78%
of the time, compared with 50% for the baseline system.
For contrastive emphasis, a sentence is synthesized with emphasis on a word
which is chosen appropriately or inappropriately based on a preceding sentence.
Listeners' mean opinion scores for appropriate emphases exceed those for
inappropriate emphases by 0.40 on a 1-to-5 scale for the experimental system,
compared to a difference of 0.11 for the baseline, a significant system
difference (p < 0.01).
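
As a rough illustration of the idea of attaching a ToBI template to an expression and passing the labels, together with text features, to a prosody model, the following minimal sketch may help. Everything in it is an assumption made for illustration: the abstract does not list the actual templates, features, or model inputs, so the label choices (L* with H-H% for a yes-no question, H* with L-L% for a statement, L+H* for contrastive emphasis), the placement of the nuclear accent on the final word, and the names TEMPLATES, WordFeatures, and label_sentence are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical templates: the abstract does not give the ToBI labels actually used.
TEMPLATES = {
    "statement": {"accent": "H*", "boundary": "L-L%"},
    "yes_no_question": {"accent": "L*", "boundary": "H-H%"},
}
EMPHASIS_ACCENT = "L+H*"  # assumed pitch accent for contrastive emphasis


@dataclass
class WordFeatures:
    word: str
    position: int       # text feature: index of the word in the sentence
    is_final: bool      # text feature: whether this is the last word
    pitch_accent: str   # ToBI pitch accent, or "none"
    boundary_tone: str  # ToBI phrase accent plus boundary tone, or "none"


def label_sentence(words, expression="statement", emphasized=None):
    """Return one feature row per word for a given expression.

    Simplifying assumption: the nuclear accent falls on the final word
    unless another word is marked for contrastive emphasis.
    """
    template = TEMPLATES[expression]
    last = len(words) - 1
    rows = []
    for i, word in enumerate(words):
        if emphasized is not None and i == emphasized:
            accent = EMPHASIS_ACCENT
        elif i == last:
            accent = template["accent"]
        else:
            accent = "none"
        boundary = template["boundary"] if i == last else "none"
        rows.append(WordFeatures(word, i, i == last, accent, boundary))
    return rows


if __name__ == "__main__":
    # Same wording, two expressions: only the ToBI-derived features differ.
    for expression in ("statement", "yes_no_question"):
        for row in label_sentence(["you", "are", "leaving"], expression):
            print(expression, row)
```

In a setup like this, rows of this kind would be the inputs to the F0 and segment-duration decision trees, so a single set of trees can serve several expressions because the expression is encoded in the labels rather than in a separate corpus.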