The W3C Voice Browser Working Group has sought to
develop standards to enable access to the web using spoken
interaction. The Speech Synthesis Markup Language Specification
is part of this set of new markup specifications for voice
browsers, and is designed to provide a rich, XML-based markup
language for assisting the generation of synthetic speech in web
and other applications. The essential role of the markup language
is to provide authors of synthesizable content with a standard way to
control aspects of speech such as pronunciation, volume, pitch and
rate across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. The latest status of this document series is maintained
at the W3C.

This is the 3 January 2001 last call Working Draft of the
"Speech Synthesis Markup Language Specification". This last call
review period ends 31 January 2001. You are encouraged to
subscribe to the public discussion list <www-voice@w3.org>
and to mail in your comments before the review period ends. To
subscribe, send an email to <www-voice-request@w3.org>
with the word subscribe in the subject line
(include the word unsubscribe if you want to
unsubscribe). A public
archive is available online.

This specification describes markup for generating synthetic
speech via a speech synthesizer, and forms part of the proposals
for the W3C Speech Interface Framework. This document has been
produced as part of the W3C
Voice Browser Activity, following the procedures set out for
the W3C
Process. The authors of this document are members of the Voice Browser Working
Group (W3C Members only).

To help the Voice Browser working group build an implementation
report (as part of advancing the document on the W3C
Recommendation Track), you are encouraged to implement this
specification and to indicate to W3C which features have been
implemented, and any problems that arose.

Publication as a Working Draft does not imply endorsement by
the W3C Membership. This is a draft document and may be updated,
replaced or obsoleted by other documents at any time. It is
inappropriate to cite W3C Working Drafts as other than "work in
progress". A list
of current public W3C Working Drafts can be found at http://www.w3.org/TR/.

This W3C specification is known as the Speech Synthesis Markup
Language Specification and is based upon the JSML specification, which
is owned by Sun Microsystems, Inc., California, U.S.A.

The Speech Synthesis Markup Language specification is part of
a larger set of markup specifications for voice browsers
developed through the open processes of the W3C. It is designed
to provide a rich, XML-based markup language for assisting the
generation of synthetic speech in web and other applications. The
essential role of the markup language is to give authors of
synthesizable content a standard way to control aspects of speech
output such as pronunciation, volume, pitch and rate across
different synthesis-capable platforms.

A Text-To-Speech (TTS) system that supports the Speech
Synthesis Markup Language will be responsible for rendering a
document as spoken output and for using the information contained
in the markup to render the document as intended by the
author.

Document creation: A text document provided as input
to the TTS system may be produced automatically, by human
authoring, or through a combination of these forms. The Speech
Synthesis markup language defines the form of the document.

Document processing: The following are the six major
processing steps undertaken by a TTS system to convert marked-up
text input into automatically generated voice output. The markup
language is designed to be sufficiently rich so as to allow
control over each of the steps described below so that the
document author (human or machine) can control the final voice
output.

XML Parse: An XML parser is used to extract the
document tree and content from the incoming text document. The
structure, tags and attributes obtained in this step influence
each of the following steps.

Structure analysis: The structure of a document
influences the way in which a document should be read. For
example, there are common speaking patterns associated with
paragraphs and sentences.

- Non-markup behavior: In documents and parts of
documents where structural elements such as "paragraph" and
"sentence" are not used, the TTS system is
responsible for inferring the structure by automated analysis of
the text, often using punctuation and other language-specific
data.

Text normalization: All written languages have special
constructs that require a conversion of the written form
(orthographic form) into the spoken form. Text normalization is
an automated process of the TTS system that performs this
conversion. For example, for English, when "$200" appears in a
document it may be spoken as "two hundred dollars". Similarly,
"1/2" may be spoken as "half", "January second", "February
first", "one of two" and so on.

- Markup support: The "say-as"
element can be used in the input document to explicitly
indicate the presence and type of these constructs and to resolve
ambiguities. The set of constructs that can be marked includes
dates, times, numbers, acronyms, currency amounts and more. The
set covers many of the common constructs that require special
treatment across a wide number of languages but is not and cannot
be a complete set.

- Non-markup behavior: For text content that is not
marked with the "say-as" element the TTS system is expected to
make a reasonable effort to automatically locate and convert
these constructs to a speakable form. Because of inherent
ambiguities (such as the "1/2" example above) and because of the
wide range of possible constructs in any language, this process
may introduce errors in the speech output and may cause different
systems to render the same document differently.

Text-to-phoneme conversion: Once the system has
determined the set of words to be spoken it must convert those
words to a string of phonemes. A phoneme is the basic
unit of sound in a language. Each language (and sometimes each
national or dialect variant of a language) has a specific phoneme
set: e.g. most US English dialects have around 45 phonemes. In
many languages this conversion is ambiguous since the same
written word may have many spoken forms. For example, in English,
"read" may be spoken as "reed" (I will read the book) or "red" (I
have read the book). Another issue is the handling of words with
non-standard spellings or pronunciations. For example, an English
TTS system will often have trouble determining how to speak some
non-English-origin names; e.g. "Tlalpachicatl" which has a
Mexican/Aztec origin.

- Markup support: The "phoneme"
element allows a phonemic sequence to be provided for any
word or word sequence. This provides the content creator with
explicit control over pronunciations. The "say-as" element may also be used to indicate
that text is a proper name that may allow a TTS system to apply
special rules to determine a pronunciation.

- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply
automated capabilities to determine pronunciations. This is
typically achieved by looking up words in a pronunciation
dictionary and applying rules to determine other pronunciations.
Most TTS systems are expert at performing text-to-phoneme
conversions so most words of most documents can be handled
automatically.

Prosody analysis: Prosody is the set of features of
speech output that includes the pitch (also called intonation or
melody), the timing (or rhythm), the pausing, the speaking rate,
the emphasis on words and many other features. Producing
human-like prosody is important for making speech sound natural
and for correctly conveying the meaning of spoken language.

- Markup support: The "emphasis"
element, "break" element and "prosody" element may all be used by document
creators to guide the TTS system in generating appropriate
prosodic features in the speech output. The "lowlevel" element (under Future Study) could
provide particularly precise control of the prosodic
analysis.

- Non-markup behavior: In the absence of these
elements, TTS systems are expert (but not perfect) in
automatically generating suitable prosody. This is achieved
through analysis of the document structure, sentence syntax, and
other information that can be inferred from the text input.

Waveform production: The phonemes and prosodic
information are used by the TTS system in the production of the
audio waveform. There are many approaches to this processing step
so there may be considerable platform-specific variation.

- Markup support: The TTS markup does not provide
explicit controls over the generation of waveforms. The "voice" element allows the document creator to
request a particular voice or specific voice qualities (e.g. a
young male voice). The "audio" element
allows for insertion of recorded audio data into the output
stream.

There are many classes of document creator that will produce
marked-up documents to be spoken by a TTS system. Not all
document creators (including human and machine) have access to
information that can be used in all of the elements or in each of
the processing steps described in the previous
section. The following are some of the common cases.

The document creator has no access to information to mark up
the text. All processing steps in the TTS system must be
performed fully automatically on raw text. The document
requires only the containing "speak" element
to indicate the content is to be spoken.
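
For illustration only, such a document might consist of nothing more
than the root element wrapping plain text, with the rendering then
entirely determined by the TTS system:

<?xml version="1.0"?>
<speak>
The committee meets on Tuesday at noon in the main conference room.
</speak>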

When marked text is generated programmatically the creator may
have specific knowledge of the structure and/or special text
constructs in some or all of the document. For example, an email
reader can mark the location of the time and date of receipt of
email. Such applications may use elements that affect structure,
text normalization, prosody and possibly text-to-phoneme
conversion.

Some document creators make considerable effort to mark as
many details of the document to ensure consistent speech quality
across platforms and to more precisely specify output qualities.
In these cases, the markup may use any or all of the available
elements to tightly control the speech output. For example,
prompts generated in telephony and voice browser applications may
be fine-tuned to maximize the effectiveness of the overall
system.

The most advanced document creators may skip the higher-level
markup (structure, text normalization, text-to-phoneme
conversion, and prosody analysis) and produce low-level TTS markup for segments of documents
or for entire documents. This typically requires tools to
generate sequences of phonemes, plus pitch and timing
information. For instance, tools that do "copy synthesis" or
"prosody transplant" try to emulate human speech by copying
properties from recordings.

The following are important instances of architectures or
designs from which marked-up TTS documents will be generated. The
language design is intended to facilitate each of these
approaches.

Dialog language: It is a requirement that it should
be possible to include documents marked with the speech synthesis
markup language into the dialog description document to be
produced by the Voice Browser Working Group.

Application-specific style-sheet processing: As
mentioned above, there are classes of application that have
knowledge of text content to be spoken and this can be
incorporated into the speech synthesis markup to enhance
rendering of the document. In many cases, it is expected that the
application will use style-sheets to perform transformations of
existing XML documents to speech synthesis markup. This is
equivalent to the use of ACSS with HTML and once again the speech
synthesis markup language is the "final form" representation to
be passed to the speech synthesis engine.

Following the XML
convention, languages are indicated by an
"xml:lang"attribute on the enclosing element
with the value following RFC 1766 to
define language codes. Language information is inherited down the
document hierarchy, i.e. it has to be given only once if the
whole document is in one language, and language information
nests, i.e. inner attributes overwrite outer attributes.
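
The following sketch illustrates this inheritance; the language codes
follow RFC 1766, and the voice actually used for the inner paragraph
is platform-dependent:

<?xml version="1.0"?>
<speak xml:lang="en-US">
<paragraph>I don't speak French.</paragraph>
<paragraph xml:lang="fr">Je ne parle pas français.</paragraph>
</speak>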

Usage note 1: The speech output platform determines
behavior in the case that a document requires speech output in a
language not supported by the speech output platform. This is
currently one of only two allowed exceptions to the conformance
criteria.

Usage note 2: There may be variation across
conformant platforms in the implementation of "xml:lang" for
different markup elements. A document author should be aware that
intra-sentential language changes may not be supported on all
platforms.

Usage note 3: A language change often necessitates a
change in the voice. Where the platform does not have the same
voice in both the enclosing and enclosed languages it should
select a new voice with the inherited voice attributes. Any
change in voice will reset the prosodic attributes to the default
values for the new voice of the enclosed text. Where the
"xml:lang" value is the same as the inherited value there is no
need for any changes in the voice or prosody.

Usage note 4: All elements should process their
contents specific to the enclosing language. For instance, the
phoneme, emphasis and break element should each be rendered in a
manner that is appropriate to the current language.

Usage note 5: Unsupported languages on a conforming
platform could be handled by specifying nothing and relying on
platform behavior, issuing an event to the host environment, or
by providing substitute text in the markup.

A "paragraph" element represents the
paragraph structure in text. A "sentence"
element represents the sentence structure in text. A paragraph
contains zero or more sentences.

<paragraph>
<sentence>This is the first sentence of the paragraph.</sentence>
<sentence>Here's another sentence.</sentence>
</paragraph>

Usage note 1: For brevity, the markup also supports
<p> and <s> as exact equivalents of <paragraph>
and <sentence>. (Note: XML requires that the opening and
closing tags match, so <p> text
</paragraph> is not legal.) Also note that <s> means
"strike-out" in HTML 4.0 and earlier, and in
XHTML-1.0-Transitional but not in XHTML-1.0-Strict.

Usage note 2: The use of paragraph and sentence
elements is optional. Where text occurs without an enclosing
paragraph or sentence elements the speech output system should
attempt to determine the structure using language-specific
knowledge of the format of plain text.

The "say-as" element indicates the type of
text construct contained within the element. This information is
used to help specify the pronunciation of the contained text.
Defining a comprehensive set of text format types is difficult
because of the variety of languages that must be considered and
because of the innate flexibility of written languages. The
"say-as" element has been specified with a
reasonable set of format types. Text substitution may be utilized
for unsupported constructs.

The "type" attribute is a required attribute
that indicates the contained text construct. Its value is a text
type, optionally followed by a colon and a format. The base set of
type values, divided according to broad functionality, is as
follows:

Pronunciation Types

acronym: contained text is an acronym. The characters
in the contained text string are pronounced as individual
characters.

<say-as type="acronym"> USA </say-as>
<!-- U. S. A. -->

Numerical Types

number: contained text contains integers, fractions,
floating points, Roman numerals or some other textual format that
can be interpreted and spoken as a number in the current
language. Format values for numbers are:
"ordinal", where the contained text should be
interpreted as an ordinal. The content may be a digit sequence or
some other textual format that can be interpreted and spoken as
an ordinal in the current language; and
"digits", where the contained text is to be read
as a digit sequence, rather than as a number.
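
For example (the spoken forms shown in the comments are illustrative
only and are language- and platform-dependent):

<say-as type="number"> 3.14 </say-as>
<!-- three point one four -->
<say-as type="number:ordinal"> 12 </say-as>
<!-- twelfth -->
<say-as type="number:digits"> 1234 </say-as>
<!-- one two three four -->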

<say-as type="date:ymd"> 2000/1/20 </say-as>
<!-- January 20th two thousand -->
Proposals are due in <say-as type="date:my"> 5/2001 </say-as>
<!-- Proposals are due in May two thousand and one -->
The total is <say-as type="currency">$20.45</say-as>
<!-- The total is twenty dollars and forty-five cents -->

When multi-field quantities are specified ("dmy", "my", etc.),
it is assumed that the fields are separated by a single
non-alphanumeric character.

Usage note 1: The conversion of the various types of
text and text markup to spoken forms is language and
platform-dependent. For example, <say-as type="date:ymd">
2000/1/20 </say-as> may be read as "January twentieth two
thousand" or as "the twentieth of January two thousand" and so
on. The markup examples above are provided for usage illustration
purposes only.

Usage note 2: It is assumed that pronunciations
generated by the use of explicit text markup always take
precedence over pronunciations produced by a lexicon.

The "phoneme" element provides a phonetic
pronunciation for the contained text. The
"phoneme" element may be empty. However, it is
recommended that the element contain human-readable text that can
be used for non-spoken rendering of the document. For example,
the content may be displayed visually for users with hearing
impairments.

The "alphabet" attribute is an optional
attribute that specifies the phonetic alphabet. The
"ph" attribute is a required attribute that
specifies the phoneme string:

worldbet: The specified phonetic string is composed of
symbols from the
Worldbet (Postscript) phonetic alphabet.

xsampa: The specified phonetic string is composed of
symbols from the X-SAMPA
phonetic alphabet.

<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->

Usage note 1: Characters composing many of the IPA
phonemes are known to display improperly on most platforms.
Additional IPA limitations include the fact that IPA is difficult
to understand even when using ASCII equivalents, IPA is missing
symbols required for many of the world's languages, and IPA
editors and fonts containing IPA characters are not widely
available.

Usage note 2: Entity definitions may be used for
repeated pronunciations. For example:
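One possible form, reusing the phonemic string from the IPA example
above (the entity name here is arbitrary), is:

<?xml version="1.0"?>
<!DOCTYPE speak [
<!ENTITY br_tomato "t&#x252;m&#x251;to&#x28A;">
]>
<speak>
I say <phoneme alphabet="ipa" ph="&br_tomato;"> tomato </phoneme>,
you say <phoneme alphabet="ipa" ph="&br_tomato;"> tomato </phoneme>.
</speak>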

variant: optional attribute indicating a preferred
variant of the other voice characteristics to speak the contained
text. (e.g. the second or next male child voice). Acceptable
values are of type (integer).

name: optional attribute indicating a platform-specific
voice name to speak the contained text. The value may be a
space-separated list of names ordered from top preference down.
Acceptable values are of the form (voice-name-list).
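
For illustration (the voice name used here is hypothetical and
platform-specific):

<voice gender="female" variant="2">This is the second female voice.</voice>
<voice name="Mike">This is spoken by the platform voice named Mike.</voice>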

Usage note 3: A change in voice resets the prosodic
parameters since different voices have different natural pitch
and speaking rates. Volume is the only exception. It may be
possible to preserve prosodic parameters across a voice change by
employing a style sheet. Characteristics specified as "+" or "-"
voice attributes with respect to absolute voice attributes would
not be preserved.

Usage note 4: The "xml:lang" attribute may be used
specially to request usage of a voice with a specific dialect or
other variant of the enclosing language.

<voice xml:lang="en-cockney">Try a Cockney voice
(London area).</voice>
<voice xml:lang="en-brooklyn">Try one New York
accent.</voice>

The "emphasis" element requests that the
contained text be spoken with emphasis (also referred to as
prominence or stress). The synthesizer determines how to render
emphasis since the nature of emphasis differs between languages,
dialects or even voices. The attributes are:

level: the "level" attribute
indicates the strength of emphasis to be applied. Defined values
are "strong", "moderate",
"none" and "reduced". The
default level is "moderate". The meaning of
"strong" and "moderate"
emphasis is interpreted according to the language being spoken
(languages indicate emphasis using a possible combination of
pitch change, timing changes, loudness and other acoustic
differences). The "reduced" level is effectively
the opposite of emphasizing a word. For example, when the phrase
"going to" is reduced it may be spoken as "gonna". The
"none" level is used to prevent the speech
synthesizer from emphasizing words that it might typically
emphasize.

That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!

The "break" element is an empty element that
controls the pausing or other prosodic boundaries between words.
The use of the break element between any pair of words is
optional. If the element is not defined, the speech synthesizer
is expected to automatically determine a break based on the
linguistic context. In practice, the "break"
element is most often used to override the typical automatic
behavior of a speech synthesizer. The attributes are:

size: the "size" attribute
is an optional attribute having one of the following relative
values: "none", "small",
"medium" (default value), or
"large". The value "none"
indicates that a normal break boundary should be used. The other
three values indicate increasingly large break boundaries between
words. The larger boundaries are typically accompanied by
pauses.

time: the "time" attribute
is an optional attribute indicating the duration of a pause in
seconds or milliseconds. It follows the "Times" attribute format
from the Cascading Style Sheet Specification. e.g. "250ms",
"3s".

Take a deep breath <break/>
then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you!

Usage note 1: Using the "size" attribute is generally
preferable to the "time" attribute within normal speech. This is
because the speech synthesizer will modify the properties of the
break according to the speaking rate, voice and possibly other
factors. As an example, a fixed 250ms pause (placed with the
"time" attribute) sounds much longer in fast speech than in slow
speech.

duration: a value in seconds or milliseconds for the
desired time to take to read the element contents. Follows the
Times attribute format from the Cascading Style Sheet
Specification. e.g. "250ms", "3s".

volume: the volume for the contained text in the range
0.0 to 100.0, a relative change or values
"silent", "soft",
"medium", "loud" or
"default".

Relative values

Relative changes for any of the attributes above are specified
as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For
the pitch and range attributes, relative changes in semitones are
permitted: "+5st", "-2st". Since speech synthesizers are not able
to apply arbitrary prosodic values, conforming speech synthesis
processors may set platform-specific limits on the values. This
is the second of only two exceptions allowed in the conformance
criteria for an SSML processor.

The price of XYZ is <prosody rate="-10%">
<say-as type="currency">$45</say-as></prosody>

Pitch contour

The pitch contour is defined as a set of targets at specified
intervals in the speech output. The algorithm for interpolating
between the targets is platform-specific. In each pair of the
form (interval,target), the first value is a
percentage of the period of the contained text and the second
value is the value of the "pitch" attribute
(absolute, relative, relative semitone, or descriptive values are
all permitted). Interval values outside 0% to 100% are ignored.
If a value is not defined for 0% or 100% then the nearest pitch
target is copied.
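
The following sketch raises the pitch early in the contained text and
lets it fall toward the end; the specific targets are arbitrary and
the interpolation between them is platform-specific:

<prosody contour="(0%,+20)(10%,+30%)(40%,+10)(100%,-10%)">
Are you sure you want to do that?
</prosody>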

Usage note 1: The descriptive values ("high",
"medium" etc.) may be specific to the platform, to user
preferences or to the current language and voice. As such, it is
generally preferable to use the descriptive values or the
relative changes over absolute values.

Usage note 2: The default value of all prosodic
attributes is no change. For example, omitting the rate attribute
means that the rate is the same within the element as
outside.

Usage note 3: The "duration" attribute takes
precedence over the "rate" attribute. The "contour" attribute
takes precedence over the "pitch" and "range" attributes.

Usage note 4: All prosodic attribute values are
indicative: if a speech synthesizer is unable to accurately
render a document as specified (e.g. a request to set the pitch
to 1 MHz, or the speaking rate to 1,000,000 words per minute) it
will make a best effort.

The "audio" element supports the insertion of
recorded audio files and the insertion of other audio formats in
conjunction with synthesized speech output. The audio element may
be empty. If the audio element is not empty then the contents
should be the marked-up text to be spoken if the audio document
is not available. The contents may also be used when rendering
the document to non-audible output and for accessibility. The
required attribute is "src", which is the URI of a
document with an appropriate mime-type.
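
For illustration (the audio file names are placeholders):

Please say your name after the tone. <audio src="beep.wav"/>
<!-- empty element: the recording is inserted into the output stream -->
<audio src="welcome.wav">Welcome to the service.</audio>
<!-- container element: the contained text is spoken if the file cannot be played -->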

Usage note 1: The "audio" element is not intended to
be a complete mechanism for synchronizing synthetic speech output
with other audio output or other output media (video etc.).
Instead the "audio" element is intended to support the common
case of embedding audio files in voice output.

Usage note 2: The alternative text may contain
markup. The alternative text may be used when the audio file is
not available, when rendering the document as non-audio output,
or when the speech synthesizer does not support inclusion of
audio files.

A "mark" element is an empty element that
places a marker into the output stream for asynchronous
notification. When audio output of the TTS document reaches the
mark, the speech synthesizer issues an event that includes the
required "name" attribute of the element. The
platform defines the destination of the event. The
"mark" element does not affect the speech output
process.

Go from <mark name="here"/> here, to <mark name="there"/> there!

Usage note 1: When supported by the implementation,
requests can be made to pause and resume at document locations
specified by the mark values.

Usage note 2: The mark names are not required to be
unique within a document.

If a non-validating XML parser is used, an arbitrary XML
element can be included in documents to expose platform-specific
capabilities. If a validating XML parser is used, then
engine-specific elements can be included if they are defined in
an extended schema within the document. These extension elements
are processed by engines that understand them and ignored by
other engines.

Usage note 1: When engines support non-standard
elements and attributes it is good practice for the name to
identify the feature as non-standard, for example, by using an "x"
prefix or a company name prefix.
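
A sketch of such an extension follows; the element name and namespace
URI are hypothetical, and whether the contained text is spoken by
engines that do not understand the extension is platform-dependent:

<speak xmlns:acme="http://www.example.com/synthesis-extensions">
This is standard markup.
<acme:whisper>This is whispered on engines that support the extension.</acme:whisper>
</speak>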

The Voice Browser Working Group is considering the additional
support of the UNIPA phonetic alphabet developed by Lernout and
Hauspie Speech Products. UNIPA was designed to reflect a
one-to-one ASCII representation of existing IPA symbols, greater
ease of use and readability, and ease of portability across
platforms. Issues with UNIPA surround the fact that the symbols
were not specifically designed for use in XML attribute
statements. The use of double quotes, ampersand, and less-than
characters is currently incompatible with SSML DTD usage.

All of the phoneme alphabets currently supported by SSML
suffer from the same defect in that they contain phonemic symbols
not specifically designed for expression within XML documents.
The design of a new, XML-optimal phoneme alphabet is currently
under study.

A future incarnation of the "audio" element
could include a "mode" attribute. If equal to
"insertion"(the default), the speech output is
temporarily paused, the audio is played then speech is resumed.
If equal to "background", the audio is played
along with speech output. Currently unresolved are the mechanics
of how to specify audio playback behaviors like playback
termination, etc.
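
If adopted, this might look like the following sketch (the "mode"
attribute is under future study and is not part of the current
markup; the file names are placeholders):

<audio src="waves.wav" mode="background"/>
Thank you for calling. All of our agents are currently busy.
<audio src="chime.wav" mode="insertion"/>
Please hold.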

There has been discussion that the "mark"
element should be an XML identifier ("id" attribute) with values
being unique within the scope of the document. In addition,
future study needs to ensure that events generated by a mark
element are consistent with existing event models in other
specifications (e.g. DOM, SMIL and the dialog markup
language).

The "lowlevel" element is a container for a sequence of
phoneme and pitch controls: "ph" and "f0" elements respectively.
The attributes of the "lowlevel" container element are:

Optional "alt" attribute that provides a human-readable
string that is equivalent to the contained phonemic
sequence.

Optional "pitch" attribute with values of "absolute",
"relative" and "percentage" that indicate how to interpret the
values on the contained pitch elements.

Tentative: Optional "alphabet" attribute with a default value
of "ipa" and alternative values of "sampa" and "worldbet". This
indicates which phonetic alphabet is used for the phonetic string
values.

The "ph" and "f0" elements may be interleaved or placed in
separate sequences (as in the example below).

"ph" Element: Phoneme with Duration

A "lowlevel" element may contain a sequence of zero or more
"ph" elements. The "ph" element is empty. The "p" attribute is
required and has a value that is a phoneme symbol. The optional
"d" attribute is the duration in seconds or milliseconds (seconds
as default) for the phoneme. If the "d" attribute is omitted a
platform-specific default is used.

"f0" Element: Timed Pitch Targets

A "lowlevel" element may contain a sequence of zero or more
"f0" elements. The "f0" element is empty. The "v" (value)
attribute is required and should be in the form of an integer or
simple floating point number (no exponentials). The value
attribute is interpreted according to the value of the "pitch"
attribute of the enclosing "lowlevel" element. The optional "t"
attribute indicates the time offset from the preceding "f0"
element and has a value of seconds or milliseconds (seconds as
default). If the "t" attribute is omitted on the first "f0"
element in a "lowlevel" container, the specified "f0" target
value is aligned with the start of the first non-silent
phoneme.
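
The following sketch shows a "lowlevel" container with the "ph" and
"f0" elements placed in separate sequences, as referred to above; the
phoneme symbols, durations and pitch values are arbitrary
placeholders:

<lowlevel alt="hello" pitch="absolute" alphabet="worldbet">
<ph p="h" d="80ms"/> <ph p="E" d="60ms"/> <ph p="l" d="70ms"/> <ph p="oU" d="180ms"/>
<f0 v="110"/> <f0 v="140" t="150ms"/> <f0 v="95" t="300ms"/>
</lowlevel>

The first "f0" element omits the "t" attribute, so its target is
aligned with the start of the first non-silent phoneme.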

Usage note 1: It is anticipated that low-level markup
will be generated by automated tools, so compactness is given
priority over readability.

Issues:

There is an unresolved request to require that the "f0" and
"ph" elements be interleaved within the "lowlevel" element so
that they are in exact temporal order. This change is simple to
require but requires that the duration attributes be interpreted
consistently. It has been proposed that for the "ph" element the
"d" attribute be an offset from the prior "ph" element but that
for the "f0" element it should be an offset from the previous
"ph" or "f0" element. A diagram would help here.

The attribute names for this element set need to be similar,
identical, or somehow consistent with those of the "prosody"
element.

Would "pi" or "fr" be preferable to "f0": i.e. pitch or
frequency vs. the technical abbreviation for fundamental
frequency.

The "phoneme" element and "lowlevel" are inconsistent in that
the phone string is an attribute in "phoneme" and part of the
content for "lowlevel". Also, the alternative text is the
contents of the "phoneme" element but an attribute of "lowlevel".
Perhaps these inconsistencies are unavoidable?

This element should track changes in the "phoneme" element.
e.g. if "phoneme" adds an "alphabet" attribute that allows the
specification of IPA, WorldBet or possibly other phonemic
alphabets, then a similar attribute should be added to the
"lowlevel" element.

The existing specification supports many ways by which a
document author can affect the intonational rendering of speech
output. In part, this reflects the broad communicative role of
intonation in spoken language: it reflects document structure
(see the paragraph and sentence elements),
prominence (see the emphasis element), and
prosodic boundaries (see the break element).
Intonation also reflects emotion and many less definable
characteristics that are not planned for inclusion in this
specification.

The specification could be enhanced to provide specific
intonational controls at boundaries and at points of
emphasis. In both cases there are existing elements to
which intonational attributes could be added. The issues that
need to be addressed are:

- Determining the form that the attributes should take,

- Ensuring that the attributes are applicable to a wide set of
languages,

- Ensuring that use of the attributes does not require
specialized knowledge of intonation theory.

Intonational boundaries: The existing specification
allows a document to mark major boundaries and structures using
the paragraph and sentence elements and the
break element. The break element explicitly
marks a boundary whereas boundaries implicitly occur at both the
start and end of paragraphs and sentences. For each of these
boundary locations we could specify intonational patterns such as
a rise, fall, flat, low-rising, high-falling and some more
complex patterns. Proposals received to date include use of
labeling systems from intonational theory or use of punctuation
symbols such as '?', '!' and '.'.

Emphasis tones: The emphasis
element can be used to explicitly mark any word or word
sequence as emphasized. Each spoken language has patterns by
which emphasis is marked intonationally. For example, for
English, the more common emphasis tones are high, low,
low-rising, and high-downstep. Our challenge is to determine a
set of tones that has sufficient coverage of the tones of many
spoken languages to be useful, but which does not require
extensive theoretical knowledge.

A "value" element has been proposed that permits substitution
of a variable into the text stream. The variable's value must be
defined separately, either by a "set" element (not yet defined)
earlier in the document or in the host environment (e.g. in a
voice browser). The value is a plain text string (markup may be
ignored).

name: the name of the variable to be inserted in the
text stream.

type: same format as the "type" attribute of the
"say-as" element allowing the text to be marked as a phone
number, date, time etc.

The time is <value name="currentTime"/>.

Issues:

The "value" element is equivalent to the "value" element of
the VoiceXML specification. Unlike the Voice Browser which
interprets VoiceXML, a speech synthesizer does not typically have
persistent variables and would not normally have access to the
internal variable of a Voice Browser. One proposal is for the
Dialog ML to define a "value" element in its namespace and to
convert that element to normal speech synthesis markup before
passing the document to a speech synthesizer. This is consistent
with the spirit of the speech synthesis markup language as a
"final form" representation. A downside of this approach is that
the DTD for speech synthesis in the Dialog ML would be
inconsistent with this specification.

The following is an example of reading headers of email
messages. The paragraph and sentence elements
are used to mark the text structure. The say-as
element is used to indicate text constructs such as the time
and proper name. The break element is placed
before the time and has the effect of marking the time as important
information for the listener to pay attention to. The prosody element is used to slow the speaking
rate of the email subject so that the user has extra time to
listen and write down the details.
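
An illustrative reconstruction of such a document follows; the
"say-as" type values "name" and "time" are assumed here, and the
message details are invented for the example:

<?xml version="1.0"?>
<speak>
<paragraph>
<sentence>You have 4 new messages.</sentence>
<sentence>The first is from <say-as type="name">Stephanie Williams</say-as>
and arrived at <break/> <say-as type="time">3:45pm</say-as>.</sentence>
<sentence>The subject is <prosody rate="-20%">ski trip</prosody>.</sentence>
</paragraph>
</speak>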

The following example combines audio files and different
spoken voices to provide information on a collection of
music.

<?xml version="1.0"?>
<speak>
<paragraph><voice gender="male">
<sentence>Today we preview the latest romantic music
from the W3C.</sentence>
<sentence>Hear what the Software Reviews said about
Tim Lee's newest hit.</sentence>
</voice></paragraph>
<paragraph><voice gender="female">
He sings about issues that touch us all.
</voice></paragraph>
<paragraph><voice gender="male">
Here's a sample. <audio src="http://www.w3c.org/music.wav"/>
Would you like to buy it?</voice></paragraph>
</speak>

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this conformance section are to be interpreted as
described in RFC 2119.

A speech synthesis markup document fragment is a
Conforming XML Document Fragment if it adheres to the
specification described in this document including the DTD (see
Document Type Definition) and also:

if all non-speech synthesis namespace elements and attributes
and all xmlns attributes which
refer to non-speech synthesis namespace elements are removed from
the given document, and if an appropriate XML declaration (i.e.,
<?xml...?>) is included at the top of the
document, and if an appropriate document type declaration which
points to the speech synthesis DTD is included immediately
thereafter, the result is a valid XML
document.

if any namespaces other than the speech synthesis markup
namespace are used in the document, the document also conforms to
Namespaces in XML.

Neither the Speech Synthesis Markup Language nor these conformance
criteria provide designated size limits on any aspect of
speech synthesis markup documents. There are no maximum values on
the number of elements, the amount of character data, or the
number of characters in attribute values.

A Speech Synthesis Markup Language processor is a program that
can parse and process Speech Synthesis Markup Language
documents.

In a Conforming Speech Synthesis Markup Language
Processor, the XML parser must be able to parse and process
all XML constructs defined within XML 1.0 and XML
Namespaces.

A Conforming Speech Synthesis Markup Language Processor
must correctly understand and apply the command logic
defined for each markup element as described by this document.
Exceptions to this requirement are allowed when an xml:lang
attribute is utilized to specify a language not present on a
given platform, and when a non-enumerated attribute value is
specified that is out-of-range for the platform. The response of
the Conforming Speech Synthesis Markup Language Processor in both
cases would be platform-dependent.

A Conforming Speech Synthesis Markup Language Processor
should inform its hosting environment if it encounters an
element, element attribute, or syntactic combination of elements
or attributes that it is unable to support. A Conforming Speech
Synthesis Markup Language Processor should also inform its
hosting environment if it encounters an illegal speech synthesis
document or unknown XML entity reference.

(http://www.research.att.com/~rws/Sable.v1_0.htm)
SABLE is a markup language for controlling text to speech
engines. It has evolved out of work on combining three existing
text to speech languages: SSML, STML and JSML. Implementations
are available for the Bell Labs synthesizer and in the Festival
speech synthesizer. The following are two of the papers written
about SABLE and its applications:

(http://www.voicexml.com/)
The Voice XML specification for dialog systems development
includes a set of prompt elements for generating speech synthesis
and other audio output that are very similar to elements of JSML
and SABLE.

8. Acknowledgements

This document was written with the participation of the
members of the W3C Voice Browser Working Group (listed in
alphabetical order):