LINGUIST List 15.1283

Thu Apr 22 2004

Review: Cognitive Science: Tatham & Morton (2004)

Editor for this issue: Naomi Ogasawara <naomilinguistlist.org>

What follows is a review or discussion note contributed to our Book
Discussion Forum. We expect discussions to be informal and
interactive; and the author of the book discussed is cordially invited
to join in.
If you are interested in leading a book discussion, look for books
announced on LINGUIST as "available for review." Then contact
Sheila Dooley Collberg at collberglinguistlist.org.

AUTHOR: Tatham, Mark; Morton, Katherine
TITLE: Expression in Speech
SUBTITLE: Analysis and synthesis
PUBLISHER: Oxford University Press
YEAR: 2004
Announced at http://linguistlist.org/issues/14/14-2847.html
Marianne Jessen, Dept of Logopedics, Fachhochschule Fresenius, Idstein.
Michael Jessen, Forensic Speech and Audio Dept, Bundeskriminalamt,
Wiesbaden.
''Expression in Speech'' focuses on the issue of how current speech
synthesis systems (e.g. within text-to-speech applications or dialogue
systems) can be improved by adding or enhancing acoustic correlates of
expression. ''Expression'' is seen as a ''manner of speaking, a way of
externalizing feelings, attitudes, and moods - conveying information
about our emotional state'' (p. 39); Tatham and Morton (TM) also use
the term ''tone of voice'' synonymously with expression (p. 65). TM are
not interested in any quick, short-sighted solutions to the issue of
expression in speech synthesis. Instead, before turning to more
concrete implementation design proposals in the latter part of their
book, TM go to great lengths to capture the issue of expression
in speech more generally, including its foundations in the biology and
psychology of emotions and the linguistic pragmatics of emotive
expression in speech. They point out explicitly that the phonetics of
expression in speech is not just a set of salient acoustic correlates
of strong basic emotions overlaid on entire utterances, and that it
should not be synthesized in this manner. Instead, what happens in
natural speech is that often very subtle and blended emotions are
conveyed for only small sections of speech, that there is a
complicated interaction between acoustic and linguistic (choice of
lexical items etc.) cues to emotions, and that the speaker is not just
a passive victim of the biology of emotion and its reflection in
speech but that expression in speech can be modified and adjusted on a
cognitive and sometimes conscious level. This cognitive mediation
includes the fact that the speaker can perceive or infer the reaction
of the listener to the expressive content of his/her speech within the
context of the conversation and is able to make adjustments. TM
propose that a speech synthesis system should be able to model all of
these aspects. As for the incorporation of listener reactions, TM
claim that an automatic speech recognition module can increase the
capabilities of the speech synthesis module. In general, TM emphasize
that speech synthesis should not end with a model of the speaker and
her/his expression capabilities but should ultimately be
listener-oriented. This not only would be an appropriate way of
capturing the goal-oriented nature of speech production on a
scientific level but it would also be of commercial interest - after
all, it is the customer who will be the listener of the synthetic
speech.
In the final part of their book, TM propose a speech production model
(see Fig. 16.1, p. 365) in which, on a ''static plane'', the phonology
and phonetics of a language and their interface are captured as
the set of grammatical/linguistic-phonetic rules and constraints of
speech with ''neutral expression'' (p. 302). In addition to this
static plane there is a ''dynamic prosody/phonology tier'',
responsible for planning utterances, and a ''dynamic phonetic tier'',
responsible for rendering utterances. The rendering module receives
input from a ''dynamic cognitive phonetics agent'', which supervises
and modifies the rendering process based on contextual and
environmental information. Apparently, while the static components
cover what is addressed in most of current phonology and phonetics,
the dynamic components focus on psycholinguistic and
linguistic-pragmatic factors. This model implies a plea by TM for a
broad-sighted view of phonetics, in which psycholinguistic and
pragmatic factors are taken into account, so that a topic like
expression in speech does not assume a marginal role in phonetics. TM
mention that their theory of ''Cognitive Phonetics'' (e.g. p. 360) is
a proposal in that direction. TM make proposals as to how their
speech production model and their account of expression in speech can
be implemented as part of a speech synthesis architecture. Within this
agenda they present a number of XML declarations in which they lay out
a prosodic hierarchy. A node <expression> is on top of this hierarchy,
which proceeds further down with prosodic categories such as
<intonational phrase>, <accent group>, and <syllable> (p. 370). Aside
from the practical aspects of this hierarchy (capturing that
expressions usually have a longer temporal domain, i.e. change less
rapidly than units of linguistic prosody) TM also claim that in the
planning of an utterance the speaker first formulates the ''prosodic
wrapper'' and subsequently the segmental content, contrary to the more
traditional notion that the segmental makeup of an utterance is
planned first and then provided with linguistic and expressive prosody
(pp. 384-386).
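As a rough illustration of the hierarchy just described, the following
Python sketch builds a nested XML tree in which an <expression> node
wraps the prosodic units. It is our own minimal sketch, not TM's
declarations: the element names with underscores, the ''type''
attribute, the function name build_utterance and the sample syllables
are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_utterance(expression_type, phrases):
    """Wrap prosodic units in a single <expression> node.

    Hypothetical names throughout: TM's own declarations (p. 370)
    are not reproduced here.

    phrases: a list of intonational phrases, each a list of accent
    groups, each a list of syllable strings.
    """
    # The expressive "wrapper" is created first; segmental content
    # (here, syllables) is filled in underneath it.
    root = ET.Element("expression", {"type": expression_type})
    for phrase in phrases:
        ip = ET.SubElement(root, "intonational_phrase")
        for group in phrase:
            ag = ET.SubElement(ip, "accent_group")
            for syllable in group:
                ET.SubElement(ag, "syllable").text = syllable
    return root


# One intonational phrase containing two accent groups.
utterance = build_utterance(
    "low-emotion friendly",
    [[["hel", "lo"], ["da", "ta"]]],
)
xml_text = ET.tostring(utterance, encoding="unicode")
print(xml_text)
```

Placing the expression at the root means that a single attribute
governs everything beneath it, which is one simple way to mirror the
longer temporal domain of expression relative to linguistic prosody.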
Since ''Expression in Speech'' is a lot about imagining how speech
synthesis can be improved in the future, let us for illustration
purposes (and fun) beam aboard the Enterprise 1701-D and listen to the
type of (Sci-Fi-projected) speech synthesis found there. (To Tatham
and Morton: this is not to ridicule your book but to cherish its
value; to all who don't like or know Paramount Pictures' Star Trek: The
Next Generation: please skip to the next paragraph.) First there is
the voice of the ship's computer, which everybody can talk to from the
bridge, the elevator and all over the ship. The computer speaks in a
voice that is essentially expressionless. Actually the voice is not
fully without expression: it speaks in an overall friendly manner,
which is an illustration of TM's point that ''all speech is
expression-based'' (title of Chapter 14). But this friendly kind of
voice by the computer is always the same, no matter how inappropriate
for the context and how annoying for the listener. In TM's terms, the
node <expression> has an attribute such as ''low-emotion friendly'' as
a permanent setting for every utterance. This kind of inflexible way
of including expression in speech synthesis is what TM argue
against. What will probably meet their expectations, however, is the
voice of the unique android Lieutenant Commander Data. Data is not
able to experience emotions but in his speech and nonverbal behavior
is able to express a certain degree of emotion. He usually cannot
express strong and basic emotions; at least he is not very good at it,
although when demanded in situations like a theater play his
expressive abilities in that direction improve (cf. TM's XML
declaration of emotive aspects in Hamlet's speech, p. 304f.).
According to TM, what is both more difficult and more required of a
speech synthesis system is the ability to express subtle and blended
rather than extreme and basic emotions. What an interactive system
needs in their words is ''less intense expressiveness to increase its
naturalness and credibility'' (p. 90) - a feature certainly met in
Data's speech. Data also meets TM's proposal that a speech synthesis
system should be able to perceive or infer listener reactions and to
relate those reactions to the verbal or vocal expressive content of
its speech with the ability to adjust it. In his regular interactions
with the other crew members Data can for example perceive physical or
verbal/vocal signs of distress in reaction to his behavior and can ask
if he in any way offended the person he talked to. Another point: is
the goal of expressive speech synthesis to model just the expression
or also the physical and perhaps psychological aspects that come with
an emotive reaction, as a stage prior to or interacting with the
expression (TM pp. 277-280 for discussion)? More philosophically: can
or should machines ever cross the body-mind barrier and even be able
to EXPERIENCE emotions? That certainly went wrong with Data's android
brother Lore, who turned into a raving lunatic over his abilities to
experience emotions - but who knows. By the way, being an android,
Data is also the perfect embodiment of an articulatory synthesizer,
which many in the field of speech synthesis think will ultimately be
the best way of doing synthesis.
''Expression in Speech'' in some ways has more the character of a
monograph for the advanced reader than of a basic textbook or handbook
because it presupposes that - or is of maximal value if - the reader
is familiar with or willing to familiarize her/himself elsewhere with
the principles of speech synthesis, with the literature on emotion in
speech, and with background subjects such as phonology or
psycholinguistics. For example, although different speech synthesis
techniques such as formant synthesis, unit-selection synthesis, or
diphone synthesis are all mentioned, discussed and in part
illustrated, the reader still has to turn to other sources when
wanting to know how e.g. formant synthesis works (the distinction
between source and filter parameters, the cascade and the parallel
branch, etc.). And although the most important correlates of emotion
in speech that have been reported in the literature are summarized in
the form of tables (pp. 55, 115), TM essentially do not provide a
literature overview on this topic (by mentioning the original sources
such as Williams and Stevens 1972 and many others) but cite a few
secondary sources, one of which is a probably not very accessible
Ph.D. thesis, to which the interested reader can turn for further
literature. [We want to mention at this point that there has also been
some interesting work on emotion in speech in Germany including
Tischer (1993; with extensive literature review up to that date),
Klasmeyer and Sendlmeier (2000), Burkhardt (2001; with special
reference to emotion in speech synthesis), and Kienast (2002).]
The importance of phonology and prosody is mentioned throughout the
book, but except for a few remarks on the Firthian prosodic framework,
metrical phonology and articulatory phonology (pp. 21f.), their theory
of Cognitive Phonetics (pp. 209, 334 etc.), or on the limitations of
Pierrehumbert's intonation model and the ToBI system for speech
synthesis (p. 118), it is not really clear which model of phonology
TM have in mind as background for their work on expression
in speech (e.g. in their production model mentioned above) or whether
they think a combination of models is best for the practical goals at
hand. In our opinion, for example, it would be too harsh a judgment to
question the usefulness of autosegmental phonology for the purpose of
speech synthesis, if this is what TM have in mind (see Clements and
Hertz 1996 for the autosegmental ''Delta'' model of speech synthesis
and its phonological motivation). The unfamiliar reader would need a
few phonology textbooks and perhaps an introduction to the history of
linguistics explaining the differences between British and American
linguistic traditions (e.g. Anderson 1985) to get a perspective.
Regarding psycholinguistics, it would have been useful had TM
explained how their speech production model is similar to or differs
from at least the one of Levelt (1989). On the positive side, TM
mention quite a bit of literature on the biology and psychology of
emotions. For that purpose they also provide a bibliography (p. 411f.)
following their list of references.
''Expression in Speech'' is written in a clear and explicit style,
avoiding as much technical language as possible. It also focuses on
some topics and explains them in quite some detail (e.g. what the
syllable-internal constituents are and how hierarchical syllable
structure can be expressed in XML; pp. 372-374). These aspects make the
book again more textbook- than monograph-like, and it has the positive
consequence that it will be understood by many interested persons
outside the specialized emotion-in-speech-synthesis community, which
corresponds to the announcement in the text on the book cover that the
book will be of interest for researchers in linguistics, speech
science, pathology, technology and behavioral or cognitive science. In
some instances, however, clarity and explicit style turn into
redundancy. The book contains 16 chapters, not all of which deal
with separate topics. TM have the habit of bringing up a topic and
explaining some aspects of it, then bringing it up again in a
different chapter with a certain shift in detail or perspective. Some
readers will enjoy this way of arranging the book - and it can be a
way of ultimately grasping the subject matter better than with a more
redundancy-free style - but other readers, who cannot invest the same
amount of time or may wish to concentrate on some aspects while
leaving others, might find it difficult to extract the information
they need without missing something important that occurs elsewhere in
the book (TM do, however, provide a subject and author index).
We have two technical comments on speech synthesis. First, to our
knowledge, the HLsyn system by the Sensimetrics company is based on
the revised and expanded parameter set described in Klatt and Klatt
(1990) and not the 1980 model of the Klatt formant synthesizer
(p. 239). Second, it is essentially correct that formant frequencies
and amplitudes (including correlates of articulatory precision) as
well as voice quality parameters cannot be modified with signal
processing methods in concatenative synthesis (see table on
p. 237). However, there has been research and development in that
direction, and it will probably increase strongly in the future,
enhanced in part by the motivation to enable synthesizers to speak
with different individual voices (see e.g. Quatieri and McAulay 1986,
d'Alessandro and Doval 1998, Kain and Macon 1998, Stylianou
2001). [Thanks to Karlheinz Stöber for discussion on that subject
and for giving us information on literature.]
The few critical comments we made here are essentially about issues of
style and the selection and organization of background information.
They leave untouched our central impression of the book: that it is
extremely useful as a guide to anyone working on the interface between
emotion in speech and speech synthesis. Tatham and Morton offer a
far-sighted perspective on this topic and make explicit many issues the
developer of synthesis systems might not think about at all. In this
sense the book is also a very good example of how the linguist and
phonetician can make valuable contributions to speech technology, and
it shows that in the end the best results will be obtained if speech
technologists and linguists/phoneticians work together.
REFERENCES
Anderson, S. R. (1985) Phonology in the twentieth century: theories of
rules and theories of representations, Chicago: The University of
Chicago Press.
Burkhardt, F. (2001) Simulation emotionaler Sprechweise mit
Sprachsynthesesystemen, Aachen: Shaker Verlag.
Clements, G. N. and Hertz, S. R. (1996) An integrated approach to
phonology and phonetics. In Durand, J. and Laks, B. (eds.) Current
trends in phonology: models and methods, pp. 143-173, University of
Salford, European Studies Research Institute.
d'Alessandro, C. and Doval, B. (1998) Experiments in voice quality
modification of natural speech signals: the spectral approach. In: The
Third ESCA/COCOSDA Workshop on Speech Synthesis (on CD).
Kain, A. and Macon, M. (1998) Personalizing a speech synthesizer by
voice adaptation. In: The Third ESCA/COCOSDA Workshop on Speech
Synthesis (on CD).
Kienast, M. (2002) Phonetische Veränderungen in emotionaler
Sprechweise, Aachen: Shaker Verlag.
Klasmeyer, G. and Sendlmeier, W. F. (2000) Voice and emotional states.
In R. D. Kent and M. J. Ball (eds.) Voice quality measurement,
pp. 339-357, San Diego: Singular Publishing Group.
Klatt, D. H. and Klatt, L. C. (1990) Analysis, synthesis, and
perception of voice quality variations among female and male talkers,
Journal of the Acoustical Society of America 87, pp. 820-857.
Levelt, W. J. M. (1989) Speaking: from intention to articulation.
Cambridge, MA: The MIT Press.
Quatieri, T. F. and McAulay, R. J. (1986) Speech transformations based
on sinusoidal representation, IEEE Transactions on Acoustics, Speech,
and Signal Processing, ASSP-34, pp. 1449-1464.
Stylianou, Y. (2001) Applying the harmonic plus noise model in
concatenative speech synthesis, IEEE Transactions on Speech and Audio
Processing, 9, 1, pp. 21-29.
Tischer, B. (1993) Die vokale Kommunikation von Gefühlen. Weinheim:
Beltz.
Williams, C. and Stevens, K. N. (1972) Emotions and speech: some
acoustical correlates, Journal of the Acoustical Society of America
52, pp. 1238-1250.
ABOUT THE REVIEWERS
Marianne Jessen is a lecturer at the Department of Logopedics,
Europa-Fachhochschule Fresenius in Idstein, Germany - the first
academically-based program in Logopedics in Germany - where she is
responsible for the section on voice. Her interests include speech
under stress, voice quality, and dysphagia. Michael Jessen works at
the Forensic Speech and Audio Department of the Bundeskriminalamt
(Federal Criminal Police Office) in Wiesbaden, Germany. His interests
include voicing and voice quality, laboratory phonology, and speaker
identification.