Session 10 - Expression, Speaking Style and Focus
Presentations: Tsinghua University: Toward Synthesizing Expressive Mandarin Speech
France Télécom: Toward Synthesis of Focus in Mandarin TTS System
Discussion:
How is focus different from emphasis? Emphasis is the way that a speaker indicates focus.
Not necessarily [France Télécom speaker gives an example where emphasis isn't focus]
Focus is about semantics, emphasis is about rendering.
The real question is how much annotation to put in SSML.
We have to know how we would want the TTS Engine to render the focus.
There are 2 main levels: logical description vs. rendering. It's the
same thing in HTML, with H1 for instance.
So should a focus, or general logical structure markup, element be added to SSML?
Is something missing in the rendering controls that must be added through focus markup?
Focus can be realised by emphasis and pause. Maybe what's missing is
more controls, like pause, which the current spec doesn't mention
for expressing emphasis.
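A minimal sketch of the point above, rendering a logical focus as emphasis plus a pause. It assumes only the existing SSML 1.0 `emphasis` and `break` elements; the helper name `focus_to_ssml`, the 200 ms default and the choice to put the pause before the focused word are illustrative assumptions, not something agreed in the session. Namespace and language declarations on `speak` are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def focus_to_ssml(words, focus_index, pause_ms=200):
    """Render focus on one word as a pause followed by strong emphasis.

    Hypothetical helper: the mapping focus -> (break + emphasis) is one
    possible rendering, not a rule from the SSML specification.
    """
    speak = ET.Element("speak", {"version": "1.0"})
    s = ET.SubElement(speak, "s")

    # Text before the focused word goes directly into the sentence.
    before = " ".join(words[:focus_index])
    s.text = before + " " if before else ""

    # A short pause before the focused word.
    ET.SubElement(s, "break", {"time": f"{pause_ms}ms"})

    # The focused word itself, marked with strong emphasis.
    emph = ET.SubElement(s, "emphasis", {"level": "strong"})
    emph.text = words[focus_index]

    # Remaining text follows the emphasis element as its tail.
    after = " ".join(words[focus_index + 1:])
    emph.tail = " " + after if after else ""

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(focus_to_ssml(["I", "said", "Tuesday", "not", "Monday"], 2))
```

The logical information (which word is in focus) stays in the function arguments, while the output contains only rendering-level markup, which is the split the discussion is about.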
** Conclusion: we note that when we revisit the topic of semantic vs
rendering level, we should consider focus as a topic.
Speaking styles: "news", "story", "sport", etc.
Styles could be mapped on paragraphs and sentences to have more
information. Possible attributes would give more information about the
way this piece of information needs to be rendered. Maybe there isn't
anything missing.
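A sketch of what mapping a style onto paragraphs and sentences could look like. The `style` attribute used here is purely hypothetical (SSML 1.0 defines no such attribute on `p` or `s`), and the helper name `styled_paragraph` and the fixed set of style names are assumptions taken from the examples mentioned above.

```python
import xml.etree.ElementTree as ET

# Style names mentioned in the discussion; the set is illustrative only.
KNOWN_STYLES = {"news", "story", "sport"}


def styled_paragraph(sentences, style):
    """Build a <p> of <s> elements carrying a HYPOTHETICAL 'style' attribute."""
    if style not in KNOWN_STYLES:
        raise ValueError(f"unknown style: {style!r}")

    # 'style' is NOT an SSML 1.0 attribute; it stands in for a possible extension.
    p = ET.Element("p", {"style": style})
    for text in sentences:
        s = ET.SubElement(p, "s")
        s.text = text

    return ET.tostring(p, encoding="unicode")


if __name__ == "__main__":
    print(styled_paragraph(["Goal!", "What a match."], "sport"))
```

An engine that does not support a given style could ignore the attribute and fall back to its default rendering, which fits the idea of treating styles as an optional feature.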
SSML is a crossing of different semantic levels.
The usefulness of adding semantic markup is to help the synthesizer speak
the text better.
Determining important categories is going to be very hard, just like POS.
Opinions differ among vendors regarding the use of adding markup for
styles. Some think SS with style is too far away and needs more
research. Others think it's nearer than that and have implemented
some.
It could be an optional feature.
There is less agreement on this feature than on some others. The question is
whether this is the right time to standardize.
So what should we do about expressive elements?
Is there enough agreement on how it should be represented, and is there
enough agreement on how it should be rendered? E.g. how many basic
emotions: can we agree to name 6? Then how do you describe what
happens when you use them? Is "anger" ok, or do you need an anger
intensity? There is enough interest.
We should revisit this to understand whether it's ready to be standardised.
There may be an issue with providing support in all voices in all languages.
Optional behaviour, etc.
There are all those levels from semantics to lexical. The problem is whether
SSML must remain on the levels it's at now, or whether we want it to change,
for new purposes like research.