This document defines syntax for representing grammars for use
in speech recognition so that developers can specify the words and
patterns of words to be listened for by a speech recognizer. The
syntax of the grammar format is presented in two forms, an
Augmented BNF Form and an XML Form. The specification makes the two
representations mappable to allow automatic transformations between
the two forms.

Status of this Document

This section describes the status of this document at
the time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest
revision of this technical report can be found in the
W3C technical reports index
at http://www.w3.org/TR/.

This document has been reviewed by W3C Members and other
interested parties, and it has been endorsed by the Director
as a W3C
Recommendation. W3C's role in making the Recommendation is to
draw attention to the specification and to promote its widespread
deployment. This enhances the functionality and interoperability
of the Web.

1. Introduction

This document defines the syntax for grammar representation. The
grammars are intended for use by speech recognizers and other
grammar processors so that
developers can specify the words and patterns of words to be
listened for by a speech recognizer.

The syntax of the grammar format is presented in two forms, an
Augmented BNF (ABNF) Form and an XML Form. The
specification ensures that the two representations are semantically mappable to allow automatic
transformations between the two forms.

Augmented BNF syntax (ABNF): this is a plain-text
(non-XML) representation which is similar to traditional BNF
grammar and to many existing BNF-like representations commonly used
in the field of speech recognition including the JSpeech Grammar
Format [JSGF] from which this
specification is derived. Augmented BNF should not be confused with
Extended BNF which is used in DTDs for XML and SGML.

XML: This syntax uses XML elements to represent the
grammar constructs and adapts designs from the PipeBeach grammar,
TalkML [TALKML] and a
research XML variant of the JSpeech Grammar Format [JSGF].

Both the ABNF Form and XML Form have the expressive power of a
Context-Free Grammar (CFG). A grammar
processor that does not support recursive grammars has the
expressive power of a Finite State Machine (FSM) or regular
expression language. For definitions of CFG, FSM, regular
expressions and other formal computational language theory see, for
example, [HU79]. This form of
language expression is sufficient for the vast majority of speech
recognition applications.

This W3C standard is known as the Speech Recognition Grammar
Specification and is modelled on the JSpeech Grammar Format
specification [JSGF], which is
owned by Sun Microsystems, Inc., California, U.S.A.

A grammar processor is any entity that accepts as input
grammars as described in this specification.

A user agent is a grammar processor that accepts user
input and matches that input against a grammar to produce a
recognition result that represents the detected input.

As the specification title implies, speech recognizers are an
important class of grammar processor. Another class of grammar
processor anticipated by this specification is a Dual-Tone Multi-Frequency (DTMF)
detector. The type of input accepted by a user agent is determined
by the mode or modes
of grammars it can process: e.g. speech input for "voice" mode grammars and DTMF input for "dtmf" mode grammars.

For simplicity, throughout this document references to a speech
recognizer apply to other types of grammar processor unless
explicitly stated otherwise.

A speech recognizer is a user agent with the following
inputs and outputs:

Input: A grammar or multiple grammars as defined by this
specification. These grammars inform the recognizer of the words
and patterns of words to listen for.

Input: An audio stream that may contain speech content that
matches the grammar(s).

Output: Descriptions of results that indicate details
about the speech content detected by the speech recognizer. The
format and details of the content of the result are outside the
scope of this specification. For informative purposes, most
practical recognizers will include at least a transcription of any
detected words.

Output: Error and other performance information may be provided
to the host environment: e.g. to a voice browser that incorporates
a grammar processor. The method of interaction with the host
environment is outside the scope of this document. The
specification does, however, require that a conformant grammar processor inform the environment of
errors in parsing and other processing of grammar documents.

The primary use of a speech recognizer grammar is to permit a
speech application to indicate to a recognizer what it should
listen for, specifically:

Words that may be spoken,

Patterns in which those words may occur,

Spoken language of each word.

Speech recognizers may also support the Stochastic Language
Models (N-Gram) Specification [NGRAM]. Both specifications define ways to set up a
speech recognizer to detect spoken input but define the word and
patterns of words by different and complementary means. Some
recognizers permit cross-references between grammars in the two
formats. The rule reference
element of this specification describes how to reference an N-gram
document.

The grammar specification does not address a number of
other issues that affect speech recognition performance. Most of
the following capabilities are addressed by the context in which a
grammar is referenced or invoked: for example, through VoiceXML 2.0
[VXML2] or through a speech
recognizer API.

Speaker adaptation data: Some speech recognizers
support the ability to dynamically adjust to the voice of a speaker
and often the ability to store adaptation data for that voice for
future use. The speaker data may also include lists of words more
often spoken by the user. The grammar format does not explicitly
address these capabilities.

Speech recognizer configuration: The grammar format
does not incorporate features for setting recognizer features such
as timeouts, recognition thresholds, search sizes or N-best result
counts.

Lexicon: The grammar format does not address the
loading of lexicons or the pronunciation of words referenced by the
grammar. The W3C Voice Browser Working Group is considering the
development of a standard lexicon format. If and when a format is
developed appropriate updates will be made to this grammar
specification.

Other speech processing capabilities: Speech
processing technology exists for language identification, speaker
verification (also known as voice printing), speaker recognition
(also known as speaker identification) amongst many other
capabilities. Although these technologies may be associated with a
speech recognizer they are outside the scope of this
specification.

The ABNF Form and XML Form are specified to ensure
that the two representations are semantically mappable. It should
be possible to automatically convert an ABNF Form grammar to an XML
Form grammar (or the reverse) so that the semantic performance of
the two grammars is identical. Equivalence of semantic performance
implies that:

Both grammars accept the same language as input and reject the
same language as input

Both grammars parse any input string identically

The XSL Transformation document in Appendix F demonstrates automatic conversion from XML to
ABNF. The reverse conversion requires an ABNF parser and a
transformational program.

There are inherent limits to automatic conversion between the
ABNF Form and the XML Form.

Formatting white
space cannot be preserved so a pretty-printable grammar in one
Form cannot guarantee automatic conversion to a pretty-printable
grammar in the other Form. Note: syntactically significant white
space is preserved.

Some XML constructs have no equivalent in ABNF: XML Schema, DTD,
character and entity declarations and references, processing
instructions, namespaces. The XML parser in a conforming grammar processor should expand all
character and entity references as defined in XML 1.0 [XML] prior to conversion to ABNF;
other constructs are lost. RDF [RDF-SYNTAX] represents metadata as XML within XML Form
grammar but could not be effectively utilized in ABNF Form grammars
and so is not supported.

A speech recognizer is capable of matching audio input against a
grammar to produce a raw text transcription (also known as
literal text) of the detected input. A recognizer may be
capable of, but is not required to, perform subsequent processing
of the raw text to produce a semantic interpretation of
the input.

For example, the natural language utterance "I want to book
a flight from Prague to Paris" could result in the following
XML data structure. To perform this additional interpretation step
requires semantic processing instructions that may be contained
within a grammar that defines the legal spoken input or in an
associated document.

The Speech Recognition Grammar Specification provides syntactic
support for limited semantic interpretation. The tag construct and the tag-format and tag declarations provide a
placeholder for instructions to a semantic processor.

The W3C Voice
Browser Working Group is presently developing the Semantic
Interpretation for Speech Recognition specification [SEM]. That specification defines a
language that can be embedded in tags within SRGS grammars to
perform the interpretation process. The semantic processing is
defined with respect to the logical parse structure for grammar processing
(see Appendix H). Other tag
formats could be used but are outside the scope of the W3C
activities.

For examples of semantic interpretation in the latest working
draft see [SEM].

The output of the semantic interpretation processor may be
represented using the Natural Language Semantics Markup
Language [NLSML]. This
XML representation of interpreted spoken input can be used to
transmit the result, as input to VoiceXML 2.0 [VXML2] processing or in other
ways.

The semantic interpretation carried out in the speech
recognition process is typically characterized by:

Restricted context: the interpretation does not
resolve deictic or anaphoric references or other language forms
that span more than a single utterance. Example: if the utterance
"I want to book a flight from Prague to Paris" were
followed later by "I want to continue from there to
London" the reference to "there" could be resolved to
"Paris". This requires analysis spanning more than one
utterance and is typically outside the scope of the speech
recognizer, but in scope for a dialog manager (e.g. a VoiceXML
application).

Domain-specific: a speech recognition grammar is
typically restricted to a narrow domain of input (e.g. collect
flight booking data). Within this domain semantic interpretation is
an achievable task whereas semantic interpretation for an entire
language is an extraordinarily complex task.

Language-specific: because each language has unique
linguistic structures the process of converting from a raw text to
a semantic result is necessarily language-specific.

It is this restricted form of semantic interpretation that this
approach is intended to support. A VoiceXML application that
receives a speech result with semantic interpretation will
typically process the user input to carry out a dialog. The
application may also perform deeper semantic analysis, for example
resolving deictic or anaphoric references.

The Speech Recognition Grammar Specification is designed to
permit ABNF Form and XML Form grammars to be embedded into other
documents. For example, VoiceXML 1.0 [VXML1] and VoiceXML 2.0 [VXML2] permit inline
grammars [VXML2 §3.1.1.1] in which an ABNF Form grammar or XML Form
grammar is contained within a VoiceXML document.

Embedding an XML Form grammar within an XML document can be
achieved with XML namespaces [XMLNS] or by incorporating the grammar XML Schema definition or DTD
into the enclosing document's schema or DTD.

An ABNF Form grammar may be embedded into any XML document as
character data. ABNF grammars will often contain angle brackets
which require special handling within XML. A CDATA section [XML
§2.7] or the escape sequences of "&lt;"
and "&gt;" may be required to create well-formed
XML. Note: angle brackets ('<' and '>') are used in ABNF to
delimit any URI, media type or repeat operator.
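
The two embedding options described above can be compared with a minimal Python sketch. The host element name, the rule name and the example.com URI are illustrative assumptions and are not defined by this specification.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A small ABNF Form grammar (hypothetical content). Its rule reference
# is delimited by angle brackets, which are markup characters in XML.
abnf = '#ABNF 1.0;\n$place = <http://www.example.com/places.gram#city>;\n'

# Option 1: escape '<', '>' and '&' so the text is plain character data.
escaped_doc = "<grammar>" + escape(abnf) + "</grammar>"

# Option 2: wrap the text in a CDATA section. This works as long as the
# grammar text does not itself contain the sequence "]]>".
cdata_doc = "<grammar><![CDATA[" + abnf + "]]></grammar>"

# Both documents are well-formed and carry the same character data.
text_from_escaped = ET.fromstring(escaped_doc).text
text_from_cdata = ET.fromstring(cdata_doc).text
```

Either way, an XML parser hands the original ABNF text back to the grammar processor unchanged.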

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL"
in this document are to be interpreted as described in [RFC2119]. However, for
readability, these words do not appear in all uppercase letters in
this specification.

A URI is a unifying syntax for the expression of names and
addresses of objects on the network as used in the World Wide Web.
A URI is defined as any legal 'anyURI' primitive as defined in XML Schema Part
2: Datatypes [SCHEMA2 §3.2.17]. The XML Schema definition follows [RFC2396] and [RFC2732]. The syntax
representation of a URI differs between the ABNF Form and the XML
Form. Any relative URI reference must be resolved according to the
rules given in Section 4.9.1.

ABNF URI: in the ABNF Form of this specification a URI
is delimited by angle brackets ('<' '>'). For example,
<http://www.example.com/file-path>

XML URI: in the XML Form of this specification any URI
is provided as an attribute to an element; for example the ruleref and lexicon elements.

A media type (defined in [RFC2045] and [RFC2046]) specifies the nature of a linked resource.
Media types are case insensitive. A list of registered media types
is available for download [TYPES]. Wherever a URI can be specified, a media
type may be provided to indicate the content type of the resource that the URI identifies.

[See Appendix G for information
on media types for the ABNF and XML Forms of the Speech Recognition
Grammar Specification.]

ABNF URI with Media Type: in the ABNF Form a media type
may be attached as a postfix to any URI. The media type is
delimited by angle brackets ('<' '>') and the URI and media
type are separated by a tilde character ('~') without intervening
white space. For
example, <http://example.com/file-path>~<media-type>

XML URI with Media Type: in the XML Form any element
that carries a URI attribute may carry a type
attribute.
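
As an informal illustration of the ABNF syntax for a URI with an optional media type, the following Python sketch splits such a reference into its parts. It is a simplification: it does not validate the URI against the 'anyURI' datatype, and the function name is hypothetical.

```python
import re

# '<' URI '>' optionally followed, without intervening white space,
# by '~' '<' media-type '>'.
ABNF_URI = re.compile(r"<([^<>\s]+)>(?:~<([^<>\s]+)>)?")


def parse_abnf_uri(text):
    """Return (uri, media_type); media_type is None when absent."""
    match = ABNF_URI.fullmatch(text)
    if match is None:
        raise ValueError("not an ABNF URI: %r" % text)
    uri, media_type = match.groups()
    # Media types are case insensitive, so normalize them to lower case.
    return uri, media_type.lower() if media_type else None
```

In the XML Form no such parsing is needed, because the URI and the media type arrive as separate attribute values.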

A language identifier labels information content as
being of a particular human language variant. Following the
XML specification for language identification [XML §2.12], a legal
language identifier in ABNF Form grammars and XML Form grammars is
identified by an RFC 3066 [RFC3066] code. RFC 3066 requires a language code;
a country code or other subtag identifier is optional.
A grammar's language declaration
declares the language of a grammar. Additionally a legal rule
expansion may be labeled by its language content.
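
A rough syntactic check for a language identifier can be sketched in Python as follows. It tests only the shape of an RFC 3066 tag (a primary subtag of one to eight letters followed by optional subtags of one to eight letters or digits); it does not consult any registry of language or country codes, and the function name is an assumption.

```python
import re

# RFC 3066 shape: primary subtag of 1 to 8 letters, then zero or more
# hyphen-separated subtags of 1 to 8 letters or digits.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_language_identifier(tag):
    """True if the string has the syntactic form of an RFC 3066 tag."""
    return LANGUAGE_TAG.fullmatch(tag) is not None
```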

A token (a.k.a. a terminal symbol) is the part of a
grammar that defines words or other entities that may be spoken.
Any legal token is a legal
expansion.

For speech recognition, a token is typically an orthographic
entity of the language being
recognized. However, a token may be any string that the speech
recognizer can convert to a phonetic representation.

XML Form only: a <token> element may contain character
data only. The character data is treated as a single unnormalized
token. The character data must not contain any double quote
characters.

Any token in ABNF Form or XML Form (except within a <token>
element in XML Form) may be delimited by double quotes. The text
contained within the double quotes is an unnormalized token. The
text must not contain any double quote characters. A token
delimited by double quotes may contain white space.

Any token content not delimited by a <token> element or
double quotes is treated as a sequence of white-space-delimited
tokens. Each token contained in the token content is delimited at
the start and at the end by any white space character or any
syntactic construct that delimits a token content span. The
syntactic constructs that delimit token content are different for
the ABNF Form and XML Form. These tokens cannot contain white space
characters.

Token type                                   Form         Example
Single unquoted token                        ABNF & XML   hello
Single unquoted token: non-alphabetic        ABNF & XML   2
Single quoted token: including white space   ABNF & XML   "San Francisco"
Single quoted token: no white space          ABNF & XML   "hello"
Two tokens delimited by white space          ABNF & XML   bon voyage
Four tokens delimited by white space         ABNF & XML   this is a test
Single XML token in <token>                  XML only     <token>San Francisco</token>
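
The token forms listed above can be illustrated with a short Python sketch that splits plain token content into tokens. It handles only double-quoted tokens and white-space-delimited tokens; the Form-specific syntactic constructs that also delimit token content (and the XML <token> element) are deliberately ignored, and the function name is hypothetical.

```python
import re

# Either a double-quoted span (which cannot contain a double quote)
# or a run of characters containing no white space and no double quote.
TOKEN = re.compile(r'"([^"]*)"|([^\s"]+)')


def split_tokens(content):
    """Split token content into a list of (still unnormalized) tokens."""
    tokens = []
    for match in TOKEN.finditer(content):
        quoted, bare = match.groups()
        # A quoted token keeps its internal white space as written.
        tokens.append(quoted if quoted is not None else bare)
    return tokens
```

For instance, `split_tokens('fly to "San Francisco"')` yields three tokens, the last of which contains a space.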

White Space Normalization: White space must be
normalized when contained in any token delimited by a <token>
element or by double quotes. Leading and trailing white space
characters are stripped. Any token-internal white space character
or sequence is collapsed to a single space character (#x20). For
example, the following are all normalized to the same string, "San
Francisco".

"San Francisco"
" San Francisco "
"San
Francisco"
" San Francisco "

Because the presence of white space within a token is
significant the following are distinct tokens.

"San Francisco"
"SanFrancisco"
"San_Francisco"

Token Normalization: Other normalization processes are
applied to the white space normalized token according to the
language and the capabilities of the speech recognizer.
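
The white space normalization step that precedes these other processes is simple enough to sketch in Python. The function name is an assumption. Note that Python's `str.split()` recognizes a somewhat broader set of white space characters than XML does, which makes no difference for the space, tab and line-break characters shown here.

```python
def normalize_white_space(token):
    """Strip leading and trailing white space and collapse each internal
    run of white space to a single space character (#x20)."""
    # str.split() with no argument splits on runs of white space and
    # discards empty leading and trailing fields.
    return " ".join(token.split())
```

Tokens that differ only in white space layout normalize to the same string, while "SanFrancisco" and "San_Francisco" remain distinct from "San Francisco".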

Pronunciation Lookup: To match spoken (audio) input to
a grammar a speech recognizer must be capable of modelling the
audio patterns of any token in a grammar. Speech recognizers employ
a diverse set of techniques for performing this key recognition
process. The following is an informative description of techniques
that a speech recognizer may apply based on conventional large
vocabulary speech recognition technology.

A large vocabulary speech recognizer converts each normalized
token to a phoneme sequence or a set of possible phoneme sequences.
Conversion of an orthographic form (token) to the spoken form
(phonemes) is a highly language-specific process. In many cases the
conversion is even specific to a national variant, regional dialect
or other variant of the language. For example, for some tokens
Parisian French, Quebec French and Swiss French will each convert
to different pronunciations.

The text-to-phoneme conversion in a large vocabulary speech
recognizer may involve some or all of the following
sub-processes.

Pronunciation lexicon lookup: One of possibly many
lexicons available to a recognizer can provide the phoneme sequence
for a token. Both the ABNF Form and XML Form permit a grammar to
specify one or more lexicon
documents. Recognizers typically provide a built-in lexicon for
each supported language though the coverage will vary between
recognizers. The algorithm by which the lookup resolves a token to
a pronunciation is defined by the lexicon format and/or the speech
recognizer and may be language-specific. Case-insensitive string
matching is recommended.

Morphological analysis: a recognizer may be capable of
determining the transformation from a base token and phoneme string
to a morphological variant and its pronunciation. For example given
the pronunciation for "Hyundai" a rule could infer the
pronunciation for the pluralized form "Hyundai's".

Automatic text-to-phoneme conversion: for many, but
not all, languages and scripts there are rules that automatically
convert a token into a phoneme sequence. For example, in English
most but not all words ending with the letter sequence "ise" end
with the phoneme sequence "ai z". A speech recognizer may use
automated conversion to infer pronunciations for tokens that cannot
be looked up in a lexicon.

Any language is likely to have other specialized processes for
determining a pronunciation for a token. For example, for Japanese
special techniques are required for Kanji and each Kana form.

For any language and recognizer there may be variation in
coverage and completeness of the language's tokens.

When a grammar processor handles a grammar containing a token
that it cannot convert to phonemic form or otherwise use in the
speech recognition processing of audio it should inform the hosting
environment.

Limitations of token handling: the following is
informative guidance to grammar developers.

The Pronunciation Lexicon activity [LEX] of the W3C Voice Browser Working Group will
provide guidance on the token-handling processes outlined
above.

Token handling will vary between recognizers and will vary
between languages.

Grammar authors can improve document portability by avoiding
characters and forms in tokens that do not have obvious
pronunciations in the language. For English, the following are ways
to handle some orthographic forms:

Acronyms should be avoided. Alphabetic characters should be
widely available. For example, replace "USA" by "u s a"; replace
"W3C" by "w three c"; replace "IEEE" by "i triple e".

Abbreviations should be replaced by the unabbreviated form. For
example, replace "Dr." by "drive" or "doctor".

Most punctuation should be expanded to a spelled form. For
example replace "&" by "ampersand" or "and"; replace "+" by
"plus"; replace "<" by "less than" or "open angle bracket".

A grammar processor should support digits (e.g. "0" through "9"
for European scripts). Other natural numbers should be replaced by
spelled forms. For example, for US English replace "10" by "ten"
and "1000" by "thousand".

Grammar authors should consider the possibility that a grammar
will be used to interpret input in a non-speech recognition device.
For example, grammars can be used to process text strings from
keyboard input, text telephone services, pen input and other text
modalities. To facilitate text input a grammar should contain
standard orthographic tokens of the language. That is, to
facilitate non-speech recognition input the grammar should contain
standard spellings of natural language words to the greatest extent
possible.

A language attachment may
be provided for any token. When attached to a token the language
modifies the handling of that token only.

Informative

The rule expansion of a rule
definition is delimited at the start and end by equals sign
('=') and semicolon (';') respectively. Any leading plain text of
the rule expansion is delimited by ('=') and similarly any final
plain text is closed by semicolon.

Within a rule expansion the following symbols have syntactic
function and delimit plain text.

Rulenames: Every rule
definition has a local name that must be unique within the
scope of the grammar in which it is defined. A rulename must match
the "Name" Production of XML 1.0 [XML §2.3] and be a legal XML ID. Section 3.1 documents the rule definition
mechanism and the legal naming of rules.

This table summarizes the various forms of rule reference that
are possible within and across grammar documents.

Note: an XML Form grammar document must provide one and only one
of the uri or special attributes on a
ruleref element. There is no equivalent constraint in
ABNF since the syntactic forms are distinct.
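
The XML Form constraint in the note above can be checked mechanically. The following Python sketch uses the standard library's ElementTree; the function name is hypothetical and the check covers only this one constraint, not full schema validation.

```python
import xml.etree.ElementTree as ET


def has_exactly_one_target(ruleref):
    """True if a ruleref element carries exactly one of the 'uri' and
    'special' attributes."""
    has_uri = "uri" in ruleref.attrib
    has_special = "special" in ruleref.attrib
    # Exactly one of the two must be present (exclusive or).
    return has_uri != has_special
```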

When referencing rules defined locally (that is, in the same
grammar that contains the reference), always use a simple rulename
reference which consists of the local rulename only. The ABNF Form
and XML Form have a different syntax for representing a simple
rulename reference.

ABNF Form

The simple rulename reference is prefixed by a "$"
character.

$city
$digit

XML Form

The ruleref element is an empty element with a
uri attribute that specifies the rule reference as a
same-document reference URI [RFC2396]: that is, the attribute consists only of the
number sign ('#') and the fragment identifier that indicates the
locally referenced rulename.

References to rules defined in other grammars are legal under
the conditions defined in Section 3.
The external reference must identify the external grammar by
URI and may identify a
specific rule within that grammar. If the fragment identifier that
would indicate a rulename is omitted, then the reference
implicitly targets the root rule of the external grammar.
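
As an informal sketch of how a processor might interpret such a reference, the following Python code separates the grammar URI from the optional fragment identifier using the standard library. The function name and the example URIs are assumptions; a missing fragment is reported as None to signal an implicit reference to the root rule.

```python
from urllib.parse import urldefrag


def split_rule_reference(uri):
    """Return (grammar_uri, rulename); rulename is None when the
    reference implicitly targets the root rule."""
    grammar_uri, fragment = urldefrag(uri)
    return grammar_uri, fragment or None
```

A same-document reference such as "#city" produces an empty grammar URI, which distinguishes a local rule reference from an external one.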

Any externally-referenced rule may be activated for
recognition. That is it may define the top-level syntax of spoken
input. For instance, VoiceXML [VXML2] grammar activation may explicitly reference one
or more public rules (see Section
3.2) and/or implicitly reference the root rule (see Section 4.7).

A URI reference is illegal if the referring document and
referenced document have different modes. For instance, it is
illegal to reference a "dtmf" grammar from a "voice" grammar. (See
Section 4.6 for additional detail
on modes).

A resource indicated by a URI reference may be available in one or more media types. The grammar author
may specify the preferred media-type via the type
attribute (XML Form) or in angle brackets following the URI (ABNF
Form). When the content represented by a URI is available in
many data formats, a grammar processor may use the preferred
media-type to influence which of the multiple formats is
used. For instance, on a server implementing HTTP content
negotiation, the processor may use the preferred
media-type to order the preferences in the negotiation.

The resource representation delivered by dereferencing the URI
reference may be considered in terms of two types. The
declared media-type is the asserted value for the resource
and the actual media-type is the true format of its content.
The actual media-type should be the same as the declared
media-type, but this is not always the case (e.g. a misconfigured
HTTP server might return text/plain for an
application/srgs+xml document). A specific URI scheme
may require that the resource owner always, sometimes, or never
return a media-type. The declared media-type is the value returned
by the resource owner or, if none is returned, the preferred media
type given in the grammar. There may be no declared media-type if
the resource owner does not return a value and no preferred type is
specified. Whenever specified, the declared media-type is
authoritative.

Three special cases may arise. The declared media-type may not
be supported by the processor; this is an error. The declared
media-type may be supported but the actual media-type may not
match; this is also an error. Finally, there may be no declared
media-type; the behavior depends on the specific URI scheme and the
capabilities of the grammar processor. For instance, HTTP 1.1
allows document introspection (see RFC 2616, section 7.2.1), the data scheme falls back to
a default media type, and local file access defines no guidelines.
The following table provides some informative examples:

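Case 1: HTTP 1.1 request

Media-type returned by the resource owner: text/plain
Preferred media-type appearing in the grammar: not applicable; the returned type takes precedence
Declared media-type: text/plain
Behavior if the actual media-type is application/srgs+xml: error; the declared and actual types do not match

Case 2: HTTP 1.1 request

Media-type returned by the resource owner: application/srgs+xml
Preferred media-type appearing in the grammar: not applicable; the returned type takes precedence
Declared media-type: application/srgs+xml
Behavior if the actual media-type is application/srgs+xml: the declared and actual types match; success if application/srgs+xml is supported by the grammar processor; otherwise an error

Case 3: HTTP 1.1 request

Media-type returned by the resource owner: <none>
Preferred media-type appearing in the grammar: application/srgs+xml
Declared media-type: application/srgs+xml
Behavior if the actual media-type is application/srgs+xml: the declared and actual types match; success if application/srgs+xml is supported by the grammar processor; otherwise an error

Case 4: Local file access

Media-type returned by the resource owner: <none>
Preferred media-type appearing in the grammar: <none>
Declared media-type: <none>
Behavior if the actual media-type is application/srgs+xml: scheme specific; the grammar processor might introspect the document to determine the type.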
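
The rules for the declared media-type and the three special cases can be summarized in a short Python sketch. The function names and the returned status strings are illustrative assumptions; the set of supported types is assumed to be given in lower case.

```python
def declared_media_type(returned, preferred):
    """Return the declared media type, or None if there is none.

    The type returned by the resource owner takes precedence over the
    preferred type given in the grammar.
    """
    if returned is not None:
        return returned.lower()
    if preferred is not None:
        return preferred.lower()
    return None


def check_resource(returned, preferred, actual, supported):
    """Classify a fetched resource as 'ok', 'error' or 'scheme-specific'."""
    declared = declared_media_type(returned, preferred)
    if declared is None:
        # No declared type: behavior depends on the URI scheme and on the
        # capabilities of the grammar processor.
        return "scheme-specific"
    if declared not in supported:
        return "error"  # declared type not supported by the processor
    if declared != actual.lower():
        return "error"  # declared and actual types do not match
    return "ok"
```

Applied to the four informative cases above with a processor that supports only application/srgs+xml, the sketch reports an error, success, success and scheme-specific behavior respectively.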

See Appendix G for a summary of
the status for media types for ABNF Form and XML Form grammars.

Note: the media type of "application/srgs" has been
requested for ABNF Form grammars. See Appendix G for details.

XML Form

An XML rule reference is represented by a ruleref
element with a uri attribute that defines the URI of the referenced grammar and rule
within it. If a fragment identifier is appended then the identifier
indicates a specific rulename being referenced. If the fragment
identifier is omitted then the reference is
(implicitly) to the root rule of the referenced
grammar.

The optional type attribute specifies the media type of the grammar
containing the reference.

Several rulenames are defined to have specific interpretation
and processing by a speech recognizer. A grammar must not redefine
these rulenames.

In the ABNF Form a special rule reference is syntactically
identical to a local rule
reference. However, the names of the special rules are
reserved to prevent a rule
definition with the same name.

In the XML Form a special rulename is represented with the
special attribute on a ruleref element.
It is illegal to provide both the special and the
uri attributes.

NULL

Defines a rule that is automatically matched: that is, matched
without the user speaking any word.

ABNF Form: $NULL
XML Form: <ruleref special="NULL"/>

VOID

Defines a rule that can never be spoken. Inserting VOID into a
sequence automatically makes that sequence unspeakable.

ABNF Form: $VOID
XML Form: <ruleref special="VOID"/>

GARBAGE

Defines a rule that may match any speech up until the next rule
match, the next token or until the end of spoken input. A grammar
processor must accept grammars that contain special references to
GARBAGE. The behavior of the GARBAGE rule is implementation-specific. A
user agent should be capable of matching arbitrary spoken input up
to the next token but may treat GARBAGE as equivalent to NULL
(match no spoken input).

ABNF Form: $GARBAGE
XML Form: <ruleref special="GARBAGE"/>
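
Because a grammar must not redefine the special rulenames, a tool that authors or validates XML Form grammars might check for clashes before deployment. The following Python sketch is one possible way to do that; the function name is hypothetical and the check is not a substitute for full validation.

```python
import xml.etree.ElementTree as ET

# The special rulenames described above.
SPECIAL_RULENAMES = {"NULL", "VOID", "GARBAGE"}


def redefined_special_rules(grammar_xml):
    """Return the ids of rule elements that reuse a special rulename."""
    root = ET.fromstring(grammar_xml)
    clashes = []
    for element in root.iter():
        # Drop any '{namespace}' prefix that ElementTree adds to the tag.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "rule" and element.get("id") in SPECIAL_RULENAMES:
            clashes.append(element.get("id"))
    return clashes
```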

Informative example: given suitable definitions of US cities
and states, a speech recognizer may implement the following ABNF
and XML rule definitions to match "Philadelphia in the great
state of Pennsylvania" as well as simply "Philadelphia
Pennsylvania".

The W3C Voice Browser Working Group has released a Working Draft
for the Stochastic Language Models (N-Gram) Specification [NGRAM]. These two specifications
represent different and complementary ways of informing a speech
recognizer of which words and patterns of words to listen for.

A speech recognizer may choose to support the Speech Recognition
N-Gram Grammar Specification in addition to the speech recognition
grammar defined in this document.

If a speech recognizer supports both grammar representations it
may optionally support references between the two formats. Grammars
defined in the ABNF Form or XML Form may reference start symbols of
N-Gram documents and vice versa.

The syntax for referencing an N-Gram is the same as referencing
externally defined ABNF Form or XML Form grammar documents. A media
type is recommended on a reference to an N-gram document. The
Working Group has not yet applied for a type on N-gram documents so
no example is given. The fragment identifier (a rulename when
referencing ABNF Form and XML Form grammars) identifies a start
symbol as defined by the N-Gram specification. If the start
symbol is absent the N-Gram, as a whole, is referenced as defined
in the N-Gram specification.

ABNF Form

URI references to N-Gram
documents follow the same syntax as references to other ABNF or XML
Form grammar documents. The following are examples of references to
an N-Gram document via an explicit
rule reference and an implicit reference to the
root rule.

XML Form

URI references to N-Gram
documents follow the same syntax as references to other ABNF Form
and XML Form grammar documents. The following are examples of
references to an N-Gram document via an explicit rule reference and an implicit
reference to the root rule.

The sequence of rule expansions implies the temporal order in
which the expansions must be detected by the user agent. This constraint applies to sequences of
tokens, sequences of rule references, sequences of tags,
parentheticals and all combinations of these rule expansions.

ABNF Form

A sequence of legal expansions separated by white space is a legal
expansion.

A legal expansion surrounded by parentheses ('(' and ')') is a
legal expansion.

this is a test // sequence of tokens
$action $object // sequence of rule references
the $object is $color // sequence of tokens and rule references
(fly to $city) // parentheses for encapsulation

Special cases

An empty parenthetical is legal as is a parenthetical containing
only white space; e.g.
'()' or '( )'. Both forms are equivalent to $NULL and a grammar processor will behave as if
the parenthetical were not present.

// equivalent sequences
phone home
phone ( ) home

XML Form

A sequence of XML rule expansion elements (<ruleref>, <item>,
<one-of>, <token>, <tag>) and CDATA sections containing
space-separated tokens must be recognized in temporal sequence. (The only
exception is where one or more "item" elements appear within a
one-of element.)

An item element can surround any expansion to
permit a repeat attribute or
language identifier to be
attached. The weight attribute of item is
ignored unless the element appears within a one-of
element.

Any set of alternative legal rule expansions is itself
a legal rule expansion. For input to
match a set of alternative rule expansions it must match one of the
set of alternative expansions. A set of alternatives must contain
one or more alternatives.

A weight may be optionally provided for any number of
alternatives in an alternative expansion. Weights are simple
positive floating point values without exponentials. Legal formats
are "n", "n.", ".n" and
"n.n" where "n" is a sequence of one or
many digits.
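
Informative sketch: the legal weight formats listed above can be checked with a single regular expression. The function name is_legal_weight is hypothetical and not defined by this specification. The check is purely syntactic and does not verify that the value is greater than zero.

```python
import re

# Legal weight formats from the text above: "n", "n.", ".n" and "n.n",
# where "n" is one or more digits. Signs and exponents are not permitted.
_WEIGHT = re.compile(r"[0-9]+(?:\.[0-9]*)?|\.[0-9]+")


def is_legal_weight(text: str) -> bool:
    """Return True if text is written in one of the legal weight formats."""
    return _WEIGHT.fullmatch(text) is not None
```

Under this sketch "10", "3.", ".25" and "3.1415" are accepted, while "1e3", "-1" and "." are rejected.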

A weight is nominally a multiplying factor in the likelihood
domain of a speech recognition search. A weight of 1.0 is
equivalent to providing no weight at all. A weight greater than
"1.0" positively biases the alternative and a weight less than
"1.0" negatively biases the alternative.

[JEL98] and [RAB93] are informative references on
the topic of speech recognition technology and the underlying
statistical framework within which weights are applied.

Grammar authors and speech recognizer developers should be aware
of the following limitations upon the definition and application of
weights as outlined above.

The application of weights to a speech recognition search is
under the internal control of the recognizer. There is no normative
or informative algorithm for applying weights. Furthermore, speech
recognition is a statistical process so consistent behavior cannot
be guaranteed.

Appropriate weights are difficult to determine for any specific
grammar and recognizer. Guessing weights does not always improve
speech recognition performance.

Effective weights are best obtained by study of real speech
input to a grammar. For example, a reasonable technique for
developing portable weights is to use weights that are correlated
with the occurrence counts of a set of alternatives.

Tuning weights for a particular recognizer does not guarantee
improved recognition performance on other speech recognizers.
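
Informative sketch: one way to derive weights that are correlated with occurrence counts is to use the counts themselves as weights. The function name weighted_alternatives and the sample counts are hypothetical. The output uses the ABNF Form weight syntax, in which a weight is surrounded by forward slashes and placed before each alternative. Whether such weights improve recognition remains recognizer-specific.

```python
def weighted_alternatives(counts: dict[str, int]) -> str:
    """Build an ABNF alternatives list weighted by occurrence counts.

    counts maps each alternative (a token or token sequence) to the number
    of times it was observed in real speech input (hypothetical data).
    """
    parts = []
    for alternative, count in counts.items():
        if count <= 0:
            # Weights are positive values, so unseen alternatives need
            # separate treatment; this sketch simply refuses them.
            raise ValueError(f"count for {alternative!r} must be positive")
        parts.append(f"/{count}/ {alternative}")
    return " | ".join(parts)
```

For example, counts of 10, 2 and 1 for the tokens small, medium and large yield the expansion "/10/ small | /2/ medium | /1/ large".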

ABNF Form

A set of alternative choices is identified as a list of legal
expansions separated by the vertical bar symbol. If necessary, the
set of alternative choices may be delimited by parentheses.

Michael | Yuriko | Mary | Duke | $otherNames
(1 | 2 | 3)

A weight is
surrounded by forward slashes and placed before each item in the
alternatives list.

Special Cases

It is legal for an alternative to be a reference to $NULL, an empty parenthetical or a
single tag. In each case the input is equivalent to matching $NULL
and as a result the other alternatives are optional.

XML Form

The one-of element identifies a set of alternative
elements. Each alternative expansion is contained in an
item element. There must be at least one
item element contained within a one-of
element. Weights are optionally
indicated by the weight attribute on the
item element.

Operators are provided that define a legal rule expansion as
being another sub-expansion that is optional, that is repeated zero
or more times, that is repeated one or more times, or that is
repeated some range of times.

| ABNF Form (example) | XML Form (example) | Behavior |
| --- | --- | --- |
| <n> (<6>) | repeat="n" (repeat="6") | The contained expansion is repeated exactly "n" times. "n" must be "0" or a positive integer. |
| <m-n> (<4-6>) | repeat="m-n" (repeat="4-6") | The contained expansion is repeated between "m" and "n" times (inclusive). "m" and "n" must both be "0" or a positive integer and "m" must be less than or equal to "n". |
| <m-> (<3->) | repeat="m-" (repeat="3-") | The contained expansion is repeated "m" times or more (inclusive). "m" must be "0" or a positive integer. For example, "3-" declares that the contained expansion can occur three, four, five or more times. |
| <0-1> ([...]) | repeat="0-1" | The contained expansion is optional. |

Common Repeats

As indicated in the table above, an expansion that can occur 0-1
times is optional. Because optionality is such a common form the
ABNF syntax provides square brackets as a special operator for
representing optionality.

A repeat of "0-" indicates that an expansion can occur zero
times, once or any number of multiple times. In regular expression
languages this is often represented by the Kleene star
('*') which is reserved but not used in ABNF.

A repeat of "1-" indicates that an expansion can occur once or
any number of multiple times. In regular expression languages this
is often represented by the positive closure ('+') which
is reserved but not used in ABNF.

Although both ABNF and XML support a grammar that permits an
unbounded number of input tokens it is not the case that users will
speak indefinitely. Speech recognition can perform more effectively
if the author indicates a more limited range of repeat
occurrences.

Special Cases

Where a number of possible repetitions (e.g. <m-> or
<m-n> (n > 0) but not <0>) is expressed on a
construct whose only content is one or more tag elements, the behavior of the grammar processor is
not defined and will be specific to individual implementations.

Any number of non-optional repetitions (e.g., <m-n>;
m>0) of VOID is equivalent to
a single VOID.

The behavior of a grammar processor in handling any number of
repetitions of NULL is not
defined and will be specific to individual implementations.

If the number of repetitions for any expansion can be only zero
(i.e. <0> or <0-0>) then the expansion is equivalent to
NULL.
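
Informative sketch: the repeat forms "n", "m-n" and "m-" can be parsed into a minimum and a maximum count, which makes the constraints above easy to check. The function names parse_repeat and is_equivalent_to_null are hypothetical. The input is the bare value as it would appear in the XML repeat attribute.

```python
import re
from typing import Optional, Tuple

_REPEAT = re.compile(r"([0-9]+)(?:(-)([0-9]*))?")


def parse_repeat(value: str) -> Tuple[int, Optional[int]]:
    """Parse "n", "m-n" or "m-" into (minimum, maximum).

    maximum is None when the range is unbounded ("m-").
    """
    match = _REPEAT.fullmatch(value)
    if match is None:
        raise ValueError(f"illegal repeat value: {value!r}")
    minimum = int(match.group(1))
    if match.group(2) is None:       # "n": exactly n times
        return minimum, minimum
    if match.group(3) == "":         # "m-": m times or more
        return minimum, None
    maximum = int(match.group(3))    # "m-n": between m and n times
    if minimum > maximum:
        raise ValueError(f"minimum exceeds maximum in {value!r}")
    return minimum, maximum


def is_equivalent_to_null(value: str) -> bool:
    """True when the expansion can only occur zero times ("0" or "0-0")."""
    return parse_repeat(value) == (0, 0)
```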

Any repeat operator may specify an optional repeat probability.
The value indicates the probability of successive repetition of the
repeated expansion.

A repeat probability value must be in the floating point
range of "0.0" to "1.0" (inclusive). Values outside this range are
illegal. The floating point format is one of "n", "n.", "n.nnnn",
".nnnn" (with any number of digits after the dot).

Note: repeat probabilities and weights are different logical entities and have a
different impact upon a speech recognition search.

Informative example: A simple example is an optional expansion
(zero or one occurrences) with a probability — say "0.6". The
grammar indicates that the chance that the expansion will be
matched is 60% and that the chance that the expansion will not be
present is 40%.

When no maximum is specified in a range (m-) the probabilities
decay exponentially.
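
Informative sketch: this specification does not define an algorithm for applying repeat probabilities. The code below assumes, purely for illustration, that once the minimum count has been reached each further repetition occurs with the stated probability. That assumption reproduces the 60%/40% split of the optional expansion above and gives exponentially decaying probabilities for an unbounded range. The function name repetition_distribution and the limit parameter are hypothetical.

```python
from typing import Dict, Optional


def repetition_distribution(minimum: int, maximum: Optional[int],
                            prob: float, limit: int = 10) -> Dict[int, float]:
    """Probability of each repetition count under the assumed model.

    Assumption (not normative): after `minimum` repetitions, each further
    repetition happens with probability `prob`. For an unbounded range
    (maximum is None) only the first `limit` counts are listed.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("repeat probability must be within 0.0 to 1.0")
    last = maximum if maximum is not None else minimum + limit - 1
    result = {}
    for count in range(minimum, last + 1):
        extra = count - minimum
        if maximum is not None and count == maximum:
            # At the maximum no further repetition is possible.
            result[count] = prob ** extra
        else:
            result[count] = prob ** extra * (1.0 - prob)
    return result
```

With a range of 0-1 and a probability of 0.6 the sketch assigns 0.4 to zero occurrences and 0.6 to one occurrence. A recognizer is free to apply repeat probabilities differently.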

Grammar authors and speech recognizer developers should be aware
of the following limitations upon the definition and application of
repeat probabilities as outlined above.

The application of repeat probabilities to a speech recognition
search is under the internal control of the recognizer. There is no
specified algorithm for applying repeat probabilities in a speech
recognition processor so consistent behavior cannot be
guaranteed.

Appropriate repeat probabilities are often difficult to
determine for any specific grammar and recognizer. Guessing repeat
probabilities does not always improve speech recognition
performance.

Appropriate repeat probabilities are best obtained by study of
statistical patterns of real speech input. Tuning repeat
probabilities for a particular recognizer does not guarantee
improved recognition performance on other speech recognizers.

Useful references on statistical models of speech recognition
include [JEL98] and [RAB93].

ABNF Form

The following are postfix operators: <m-n>
<m-> <m>. A postfix operator is logically
attached to the preceding expansion. Postfix operators have high
precedence and so are tightly bound to the immediately preceding
expansion (see Section 2.8).

Optional expansions may be delimited by square brackets:
[expansion]. Alternatively, an optional expansion is
indicated by the postfix operator "<0-1>".

The following symbols are reserved for future use in
ABNF: '*', '+', '?'. These symbols must not be used at any place in
a grammar where the syntax currently permits a repeat operator.

// the token "very" is optional
[very]
very <0-1>
// the rule reference $digit can occur zero, one or many times
$digit <0->
// the rule reference $digit can occur one or more times
$digit <1->
// the rule reference $digit can occur four, five or six times
$digit <4-6>
// the rule reference $digit can occur ten or more times
$digit <10->
// Examples of the following expansion
// "pizza"
// "big pizza with pepperoni"
// "very big pizza with cheese and pepperoni"
[[very] big] pizza ([with | and] $topping) <0->

Repeat probabilities are only supported in the range form. The
probability is delimited by slash characters and contained within
the angle brackets: <m-n /prob/> and
<m- /prob/>.

// the token "very" is optional and is 60% likely to occur
// and 40% likely to be absent in input
very <0-1 /0.6/>
// the rule reference $digit must occur two to four times
// with 80% probability of recurrence
$digit <2-4 /.8/>
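
Informative sketch: an ABNF repeat operator, with or without a repeat probability, can be parsed as follows. The function name parse_repeat_operator is hypothetical. The sketch assumes that white space separates the range from the probability, as in the examples above, and it rejects a probability on the non-range form <m>.

```python
import re
from typing import Optional, Tuple

_OPERATOR = re.compile(
    r"<([0-9]+)(?:(-)([0-9]*))?"                    # <m>, <m-n> or <m->
    r"(?:\s+/([0-9]+(?:\.[0-9]*)?|\.[0-9]+)/)?>"    # optional /prob/
)


def parse_repeat_operator(text: str) -> Tuple[int, Optional[int], Optional[float]]:
    """Return (minimum, maximum, probability) for an ABNF repeat operator.

    maximum is None for an unbounded range; probability is None if absent.
    """
    match = _OPERATOR.fullmatch(text)
    if match is None:
        raise ValueError(f"illegal repeat operator: {text!r}")
    minimum = int(match.group(1))
    is_range = match.group(2) is not None
    if not is_range:
        maximum = minimum
    elif match.group(3) == "":
        maximum = None
    else:
        maximum = int(match.group(3))
        if minimum > maximum:
            raise ValueError(f"minimum exceeds maximum in {text!r}")
    probability = None
    if match.group(4) is not None:
        if not is_range:
            raise ValueError("a repeat probability requires the range form")
        probability = float(match.group(4))
        if not 0.0 <= probability <= 1.0:
            raise ValueError("repeat probability must be within 0.0 to 1.0")
    return minimum, maximum, probability
```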

XML Form

The item element has a repeat
attribute that indicates the number of times the contained
expansion may be repeated. The following example illustrates the
accepted values of the attribute.

The repeat-prob attribute on the item element carries the
repeat probability. Repeat probabilities are supported on any item
element but are ignored if the repeat attribute is not also
specified.

<!-- The token "very" is optional and is 60% likely to occur. -->
<!-- Means 40% chance that "very" is absent in input -->
<item repeat="0-1" repeat-prob="0.6">very</item>
<!-- The rule reference to digit must occur two to four times -->
<!-- with 80% probability of recurrence. -->
<item repeat="2-4" repeat-prob=".8">
<ruleref uri="#digit"/>
</item>

Special Cases

ABNF Form

A tag is delimited by either a pair of opening and
closing curly brackets — '{' and '}' — or by the following
3-character sequences which are considered very unlikely to occur
within a tag — '{!{' and '}!}'. A tag delimited by single curly
brackets cannot contain the single closing curly bracket character
('}'). A tag delimited by the 3-character sequence cannot contain
the closing 3-character sequence ('}!}').

The tag content is all text between the opening and closing
character sequences including leading and trailing white space. The contents of the
tag are not parsed by the grammar processor.
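
Informative sketch: the two delimiter styles can be handled by checking for the 3-character opening sequence first and then requiring that the first matching closing sequence ends the tag. The function name tag_content is hypothetical, and the input is assumed to be exactly one complete tag.

```python
def tag_content(tag: str) -> str:
    """Return the content of one complete ABNF tag, white space included."""
    # Test for '{!{' first because it also begins with '{'.
    if tag.startswith("{!{"):
        opener, closer = "{!{", "}!}"
    elif tag.startswith("{"):
        opener, closer = "{", "}"
    else:
        raise ValueError("a tag must start with '{' or '{!{'")
    # The content cannot contain the closing sequence, so the first
    # occurrence after the opener must be the end of the tag.
    end = tag.find(closer, len(opener))
    if end == -1 or end != len(tag) - len(closer):
        raise ValueError("closing delimiter missing or not at the end of the tag")
    return tag[len(opener):end]
```

The content is returned unparsed, with leading and trailing white space preserved.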

Tag precedence is the same as for rule references and tokens. In
the first example below there is a sequence of six space-separated
expansions (3 tokens, a tag, a token and a tag). In the second
example, the alternative is a choice between a sequence containing
a token and a tag or a sequence containing a rule reference and a
tag.

The language declaration for a rule expansion affects only the
contained content. Moreover, the language declaration affects only
the handling of tokens in the
contained content and does not affect tags or rule
references. The application of language to token handling and
particularly to pronunciation lookup is described in Section 2.1.

In situations where applications target a multilingual user
community, grammars that contain words in more than one language
may be needed. For example, in response to a prompt such as:
"Do you want to talk to André Prévost?" (a
combination of an English sentence with a French name), the
response may be either "yes" or "oui".

The Speech Recognition Grammar Specification permits one grammar
to collect input from more than one language. The specification
also permits multiple grammars each with a separate single language
to be used in parallel. The specification also permits a single
input utterance to contain more than one language. Finally, the
specification permits any combination of the above: for example,
parallel grammars each with multi-lingual capability.

Not all user agents are required to support all languages, or
indeed any or all of the multi-lingual capabilities. The
conformance requirements regarding multi-lingual support for XML
Form grammar processors and ABNF Form grammar processors are the
same and are laid out in Section
5.4 and Section 5.6
respectively.

There is a related challenge for multilingual applications that
deal with proper names (people, streets, companies, etc.) that may
be spoken with different pronunciations or accents depending upon
the language of origin and the speaking language. It is often
impossible to predict the language that users will use to pronounce
certain tokens. In fact, users may actually use different languages
for different words in the same sentence, and in unpredictable
ways. For instance, the name "Robert Jones" might be pronounced by
a French-speaking user using the French pronunciation for "Robert"
but an English pronunciation for "Jones", whereas a mono-lingual
English speaker would use the English pronunciation for both
words.

Language scoping: language declarations are scoped
locally to a document and to a rule definition. In XML terminology,
the language attribute is inherited down the document tree. Where a
language change encompasses a reference to another grammar, the
referenced rule and its containing grammar define the language of
the reference expansion. The language in effect at the point of the
rule reference does not have any effect upon the referenced
rule.

Language and results: The language used in the
recognition of a token is not considered a part of the speech
result even in the case that a language declaration is associated
with a token.

XML 1.0 [XML §2.12]
defines the xml:lang attribute for language
identification. The attribute provides a single language identifier for the
content of the element on which it appears. The
xml:lang attribute may be attached to one-of, token and item. It applies to the token handling of
scoped tokens.

This section defines the precedence of the ABNF rule expansion
syntax. Because XML documents explicitly indicate structure there
is no ambiguity and thus a precedence definition is not required.
The precedence definitions for the ABNF Form are intended to
minimize the need for parentheses.

ABNF Form

The following is the ordering of precedence of rule expansions.
Parentheses may be used to explicitly control rule structure.

Repeat operator (e.g.
"<0-1>") and language attachment (e.g. "!en-AU") apply to the
tightest immediately preceding rule expansion. (To apply them to a
sequence or to alternatives, use '()' or '[]' for grouping.)

XML Form

A rule definition associates a legal rule expansion with a rulename. The rule
definition is also responsible for defining the scope of the rule definition: whether it
is local to the grammar in which it is defined or whether it may be
referenced within other grammars. Finally, the rule definition may
additionally include documentation comments and other
pragmatics.

The rulename for each rule definition must be unique within a
grammar. The same rulename may be used in multiple grammars.

A rule definition is referenced by a URI in a rule reference with the rulename being represented
as the fragment identifier.

Defined rulenames must be unique within a grammar. The schema enforces this by declaring the
rulename as an XML ID.

Rulenames are case-sensitive in both XML and ABNF grammars.
Exact string comparison is used to resolve rulename references.
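
Informative sketch: the uniqueness and case-sensitivity requirements, together with the restriction on the special rulenames stated in this section, can be checked over the list of rulenames defined by a grammar. The function name check_rulenames is hypothetical. The sketch does not check which characters a rulename may contain.

```python
SPECIAL_RULENAMES = {"NULL", "VOID", "GARBAGE"}


def check_rulenames(rulenames: list[str]) -> None:
    """Raise ValueError if a defined rulename is special or duplicated."""
    seen = set()
    for name in rulenames:
        if name in SPECIAL_RULENAMES:
            raise ValueError(f"{name!r} is a special rule and cannot be defined")
        if name in seen:
            # Exact string comparison: "city" and "City" are distinct names.
            raise ValueError(f"rulename {name!r} is defined more than once")
        seen.add(name)
```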

A legal rulename cannot be one of the special rules: specifically "NULL", "VOID" or
"GARBAGE".

ABNF Form

The rule definition consists of an optional scoping declaration
(explained in the next section) followed by a legal rule name, an
equals sign, a legal rule expansion and a closing semicolon. The
rule definition has one of the following legal forms:

XML Form

A rule definition is represented by the rule
element. The id attribute of the element indicates the
name of the rule and must be unique within the grammar (this is
enforced by XML). The contents of the rule element may
be any legal rule expansion defined in Section 2. The scope attribute is explained
in the next section.

Each defined rule has a scope. The scope is either "private" or
"public". If not explicitly declared in a rule definition then the
scope defaults to "private".

A public-scoped rule may be explicitly referenced (using the
fragment identifier syntax of a URI) in the rule definitions
of other grammars and in other non-grammar documents.
A private-scoped rule cannot be so referenced
and is directly accessible only within its containing
grammar. A private rule may be explicitly referenced only by other
rules within the same grammar.

Informative: grammar authors may consider the following guidance
when scoping the rules of a grammar.

Grammar authoring shares many properties of programming.
Establishing contracts of an API is analogous to defining a set of
grammars and defining the public rules of a grammar each with
defined language behavior.

Consistent design and implementation of public rules promotes
grammar re-use and facilitates creation of grammar libraries.

Natural language grammars often require creation of many
internal "working" rules to create a smaller number of useful
external rules. Hiding working rules with private scope allows
revision of those rules without affecting other grammars that
reference the grammar. Hiding working rules also prevents
accidental mis-use of a working rule.

It is often desirable to include examples of phrases that match
rule definitions along with the definition. Zero, one or many
example phrases may be provided for any rule definition. Because
the examples are explicitly marked, automated tools can be used for
regression testing and for generation of grammar documentation.

ABNF Form

A documentation comment is a
C/C++/Java comment that starts with the sequence of characters
/** and which immediately precedes the relevant rule
definition. Zero or more @example tags may be
contained at the end of the documentation comment. The syntax
follows the Tagged Paragraph of a documentation comment of
the Java Programming Language [JAVA §18.4]. The tokenization of the example
follows the tokenization and sequence rules defined in Section 2.1 and Section 2.3 respectively.

/**
* A simple directive to execute an action.
*
* @example open the window
* @example close the door
*/
public $command = $action $object;
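
Informative sketch: a tool that performs regression testing or generates documentation could collect the example phrases from a documentation comment as follows. The function name extract_examples is hypothetical, and the sketch assumes that each @example tag and its phrase occupy a single line.

```python
def extract_examples(doc_comment: str) -> list[str]:
    """Return the phrases of all @example tags in a documentation comment."""
    examples = []
    for line in doc_comment.splitlines():
        # Drop the white space and leading '*' that decorate comment lines.
        text = line.strip().lstrip("*").strip()
        parts = text.split(None, 1)
        if parts and parts[0] == "@example":
            examples.append(parts[1].strip() if len(parts) > 1 else "")
    return examples
```

Applied to the documentation comment above, the sketch returns the two phrases "open the window" and "close the door".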

XML Form

Any number of "example" elements may be provided as the initial
content within a "rule" element. The tokenization of the example
follows the tokenization and sequence rules defined in Section 2.1 and Section 2.3 respectively.

A conforming stand-alone grammar document consists of a legal header followed by a body consisting
of a set of legal rule definitions.
All rules defined within that grammar are scoped within the
grammar's rulename namespace and each rulename must be legal and unique.

It is legal for a grammar to define no rules. The grammar cannot
be used for processing input since it defines no patterns for
matching user input.

A legal stand-alone grammar header consists of a number of
required declarations and other optional declarations. In addition,
the ABNF Form and XML Form each have additional requirements and
capabilities of the header that are specific to each syntactic
form. The ordering of header declarations is also specific to the
two forms.

The table summarizes the information declared in a grammar
header and the appropriate representation in the ABNF Form and XML
Form.

A grammar that complies with this specification must declare the
version to be "1.0".

Note: the grammar version indicates the version of the
specification implemented by the grammar and is not for versioning
of the grammar content. A meta or metadata declaration may be used for
content versioning.

The ABNF self-identifying header must be present in any
legal stand-alone ABNF Form grammar document.

The first character of an ABNF document must be the "#" symbol
(x23) unless preceded by an optional XML 1.0 byte order mark [XML §4.3.3]. The ABNF byte order mark follows the
XML definition and requirements. For example, documents encoded in
UTF-16 must begin with the byte order mark.

The optional byte order mark and required "#" symbol must be
followed immediately by the exact string "ABNF" (x41 x42 x4e x46)
or the appropriate equivalent for the document's encoding (e.g. for
UTF-16 little-endian: x23 x00 x41 x00 x42 x00 x4e x00 x46 x00). If
the byte order mark is absent on a grammar encoded in UTF-16 then
the grammar processor should perform auto-detection of character encoding in a manner
analogous to auto-detection of character encoding in XML [XML §F].

Next follows a single space character (x20) and the required
version number which is "1.0" for this specification
(x31 x2e x30).

Next follows an optional character
encoding. Section 4.4 defines character encodings in more
detail. If present, there must be a single space character (x20)
between the version number and the character encoding.

The self-identifying header is finalized with a semicolon (x3b)
followed immediately by a newline. The semicolon must be the first
character following the version number or the character encoding if
it is present.

For the remaining declarations of the ABNF header white space is not
significant.
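
Informative sketch: the self-identifying header can be recognized in already-decoded text with one regular expression. The function names parse_self_identifying_header and header_magic are hypothetical. Accepting either a line feed or a carriage return plus line feed as the newline is an assumption of this sketch.

```python
import re
from typing import Optional

# "#ABNF", one space, the version "1.0", optionally one space and a
# character encoding, then a semicolon followed immediately by a newline.
_HEADER = re.compile(r"#ABNF 1\.0(?: ([^\s;]+))?;\r?\n")


def parse_self_identifying_header(document: str) -> Optional[str]:
    """Return the declared character encoding, or None if it is omitted."""
    # Skip the optional byte order mark if the decoder left it in place.
    text = document[1:] if document.startswith("\ufeff") else document
    match = _HEADER.match(text)
    if match is None:
        raise ValueError("missing or malformed ABNF self-identifying header")
    return match.group(1)


def header_magic(encoding: str) -> bytes:
    """Bytes of the required "#ABNF" string in the given encoding."""
    return "#ABNF".encode(encoding)
```

The helper header_magic shows the byte sequence a processor would look for: in UTF-16 little-endian each of the five characters is followed by a zero byte.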

The XML prolog in an XML Form grammar comprises the XML
declaration and an optional DOCTYPE declaration referencing the
grammar DTD. It is followed by the root grammar
element. The XML prolog may also contain XML comments, processing
instructions and other content permitted by XML in a prolog.

The version number of the XML declaration indicates which
version of XML is being used. The version number of the
grammar element indicates which version of the grammar
specification is being used — "1.0" for this
specification. The grammar version is a required attribute.

The grammar element must designate the grammar namespace. This
can be achieved by declaring an xmlns attribute or an
attribute with an "xmlns" prefix. See [XMLNS] for details. Note that when the xmlns attribute
is used alone, it sets the default namespace for the element on
which it appears and for any child elements. The namespace for XML
Form grammars is defined as http://www.w3.org/2001/06/grammar.

It is recommended that the grammar element also indicate the
location of the grammar schema (see Appendix C) via the xsi:schemaLocation
attribute from [SCHEMA1].
Although such indication is not required, to encourage it this
document provides such indication on all of the examples:

The character encoding declaration indicates the scheme used for
encoding character data in the document. For example, for US
applications it would be common to use US-ASCII, UTF-8 (8-bit
Unicode) or ISO-8859-1. For Japanese grammars, character encodings
such as EUC-JP and UTF-16 (16-bit Unicode) could be used.

Except for the different syntactic representation, the ABNF Form
follows the character encoding handling defined for XML. XML
grammar processors must accept both the UTF-8 and UTF-16 encodings
of ISO/IEC 10646 and may support other character encodings. This
follows from an XML grammar processor being a compliant XML
processor and thus required to support those character encodings.
For consistency, ABNF grammar processors must also accept both the
UTF-8 and UTF-16 encodings of ISO/IEC 10646 and may support other
character encodings.

For both XML Form and ABNF Form grammars the declaration of the
character encoding is optional but strongly recommended. XML
defines behavior for XML processors that receive an XML document
without a character encoding declaration. For consistency an ABNF
grammar processor must follow the same behavior (with adjustments
for the different syntax). (Note the character encoding declaration
is optional only in cases where it is optional for a legal XML
document.)

ABNF Form

The character encoding declaration is part of the
self-identifying grammar header defined in Section 4.1 and is processed in combination with the
byte order mark, if present, using the same procedure as XML 1.0
[XML §4.3.3].

The following are examples of ABNF self-identifying grammar
headers with and without the character encoding declaration.

Note: the ABNF Form syntax does not provide a character
reference syntax for entry of a specific character, for example,
one not directly accessible from available input devices. This
contrasts with XML 1.0 syntax for character references [XML §4.1]. For development
requiring character references the XML Form of the specification is
recommended.

#ABNF 1.0 ISO-8859-1;

#ABNF 1.0 EUC-JP;

#ABNF 1.0;

XML Form

XML declares character encodings as part of the document's XML
declaration on the first line of the document.

The following are examples of XML headers with and without the
character encoding declaration.

The language declaration of a grammar provides the language identifier that
indicates the primary language contained by the document and
optionally indicates a country or other variation. Additionally,
any legal rule expansion may be labeled with a language identifier.

The language declaration is required for all speech recognition
grammars: i.e. all grammars for which the mode is "voice". (Note that mode defaults
to voice if there is no explicit mode declaration in ABNF or
mode attribute in XML.)

If an XML Form grammar is incorporated within another XML
document — for example, as supported by VoiceXML 2.0 — then the
xml:lang attribute is optional on the
grammar element and the xml:lang
attribute must be inherited from the enclosing document.

The conformance definition in Section
5 defines the behavior of a grammar processor when it
encounters a language variant that it does not support.

ABNF Form

The ABNF header must contain zero or one language
declaration. It consists of the keyword
"language", white space, a legal language identifier, optional white space and a terminating semicolon character
(';').

The mode of a grammar indicates the type of input that the user
agent should be detecting. The default mode is "voice"
for speech recognition grammars. An alternative input mode defined
in Appendix E is
"dtmf" input.

The mode attribute indicates how to interpret the
tokens contained by the grammar.
Speech tokens are expected to be detected as speech audio that
sounds like the token. Behavior with DTMF input, if supported, is
defined in Appendix E.

It is often the case that a different user agent is used for
detecting DTMF tones than for speech recognition. The same may be
true for other modes defined in future revisions of the
specification.

The specification does not define a mechanism by which a single
grammar can mix modes: that is, a representation for a mixed
"voice" and "dtmf" grammar is not
defined. Moreover, it is illegal for a rule reference in one
grammar to reference any grammar with a different mode.

A user agent may, however, support the simultaneous activation
of more than one grammar including both "voice" and
"dtmf" grammars. This is necessary, for example, for
DTMF-enabled VoiceXML browsers [VXML2]. (Note: parallel activation implies disjunction
at the root level of the grammars rather than mixing of modes
within the structure of the grammars.)

ABNF Form

The ABNF header must contain zero or one mode
declaration. It consists of the keyword "mode",
white space, either
"voice" or "dtmf", optional white space and a terminating
semicolon character (';'). If the ABNF header does not declare the
mode then it defaults to voice.

mode voice;

mode dtmf;

XML Form

The mode declaration is provided as an optional
mode attribute on the root grammar
element. Legal values are "voice" and
"dtmf". If the mode attribute is omitted then the
value defaults to voice.

Both the XML Form and ABNF Form permit the grammar header to
optionally declare a single rule to be the root rule of the
grammar. The rule declared as the root rule must be defined within
the scope of the grammar. The rule declared as the root rule may be
scoped as either
public or private.

An implicit rule reference to the root rule of
a grammar is legal. The syntax for implicitly
referencing root rules is defined in Section 2.2. It is an error to reference a grammar
implicitly by its root if that grammar does not
declare a legal root rule.

Although a grammar is not required to declare a root rule it is
good practice to declare the root rule of any grammar.

ABNF Form

The ABNF header must contain zero or one root rule
declaration. It consists of the keyword "root",
white space, the
legal rulename of a rule defined
within the grammar prefixed by the dollar sign ('$'), optional
white space and a
terminating semicolon character (';'). If the ABNF header does not
declare the root rule then it is not legal to
implicitly reference the grammar by its root.

root $rulename;

XML Form

The root rulename declaration is provided as an
optional root attribute on the grammar
element. The root declaration must identify one rule
defined elsewhere within the same grammar. The value of the root
attribute is an XML IDREF (not a URI) and must not include the
number sign ('#').

The tag-format declaration is an optional
declaration of a tag-format identifier that indicates the
content type of all rule
tags and header tags
contained within a grammar.

The tag-format identifier is a URI. It is recommended that the tag format identifier
indicate both the content type and a version. Tags typically
contain content for a semantic
interpretation processor and in such cases the identifier, if
present, should indicate the semantic processor to use.

Tag-format identifier values beginning with the string
"semantics/x.y" (where x and y are digits) are reserved for use by
the W3C Semantic Interpretation for Speech Recognition
specification [SEM] or future
versions of the specification.

Grammar processor handling of tags is undefined if the tag
format declaration is omitted.

ABNF Form

The ABNF header must contain zero or one tag format
declaration. It consists of the keyword
"tag-format", white space, a tag format identifier (an ABNF URI), optional white space and a terminating
semicolon character (';').

Informative example ("semantics/1.0" is a reserved
identifier) :

tag-format <semantics/1.0>;

XML Form

The tag-format is an optional attribute of the
grammar element and contains a tag format
identifier.

Relative URIs are resolved according to a base URI,
which may come from a variety of sources. The base URI declaration
allows authors to specify a document's base URI explicitly. See
Section 4.9.1 for details on the
resolution of relative URIs.

The path information specified by the base URI declaration only
affects URIs in the document where the element appears.

The base URI declaration is permitted but optional in both the
XML Form and the ABNF Form.

Note: the base URI may be declared in a meta declaration but the explicit base declaration
is recommended for both the ABNF Form and XML Form.

ABNF Form

The ABNF header must contain zero or one base URI
declaration. It consists of the keyword "base",
white space, a legal
ABNF URI, optional white space and a terminating
semicolon character (';').

base <http://www.example.com/base-file-path>;

base <http://www.example.com/another-base-file-path>;
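
Informative sketch: the language, mode, root, tag-format and base declarations share one shape (a keyword, white space, a value, optional white space and a semicolon), and each may appear at most once. The code below parses only those five declarations from the part of the ABNF header that follows the self-identifying header. The function name parse_header_declarations is hypothetical and the pattern used for language identifiers is a simplification.

```python
import re

# One pattern per single-occurrence header declaration described above.
# ABNF URIs are written between angle brackets, as in the examples.
_DECLARATIONS = {
    "language": re.compile(r"language\s+([A-Za-z0-9-]+)\s*;"),
    "mode": re.compile(r"mode\s+(voice|dtmf)\s*;"),
    "root": re.compile(r"root\s+\$([^\s;]+)\s*;"),
    "tag-format": re.compile(r"tag-format\s+<([^<>]*)>\s*;"),
    "base": re.compile(r"base\s+<([^<>]*)>\s*;"),
}
_WHITE_SPACE = re.compile(r"\s*")


def parse_header_declarations(header: str) -> dict:
    """Return the declared values keyed by declaration keyword."""
    found = {}
    position = _WHITE_SPACE.match(header).end()
    while position < len(header):
        for keyword, pattern in _DECLARATIONS.items():
            match = pattern.match(header, position)
            if match is not None:
                break
        else:
            raise ValueError(f"unrecognized declaration at offset {position}")
        if keyword in found:
            raise ValueError(f"more than one {keyword} declaration")
        found[keyword] = match.group(1)
        # White space between declarations is not significant.
        position = _WHITE_SPACE.match(header, match.end()).end()
    found.setdefault("mode", "voice")  # the mode defaults to voice
    return found
```

Lexicon and meta declarations, which may occur more than once, are outside the scope of this sketch and are reported as unrecognized.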

XML Form

The base URI declaration follows [XML-BASE] and is indicated by an xml:base
attribute on the root grammar element.

The base URI is given by metadata discovered during a protocol
interaction, such as an HTTP header (see [RFC2616]).

By default, the base URI is that of the current document. Not
all grammar documents have a base URI (e.g., a valid grammar
document may appear in an email and may not be designated by a
URI). Such grammar documents are not valid if they contain relative
URIs and rely on a default base URI.
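The following Python sketch illustrates how a relative URI might be resolved against the possible base URI sources. It is not a normative algorithm: the function and parameter names are hypothetical, and the precedence shown (explicit base declaration, then protocol metadata, then the URI of the current document) is an assumption; Section 4.9.1 is authoritative.

```python
from urllib.parse import urljoin, urlsplit


def resolve_reference(ref, declared_base=None, protocol_base=None, document_uri=None):
    """Resolve a URI reference against the first available base URI.

    Assumed precedence: explicit base declaration, protocol metadata
    (e.g. an HTTP header), then the URI of the current document.
    """
    if urlsplit(ref).scheme:
        # The reference is already absolute, so no base URI is needed.
        return ref
    for base in (declared_base, protocol_base, document_uri):
        if base:
            return urljoin(base, ref)
    # A relative URI with no base at all cannot be resolved.
    raise ValueError("relative URI %r used but no base URI is available" % ref)


print(resolve_reference("politeness.gram#please",
                        declared_base="http://www.example.com/base-file-path"))
```

The final case mirrors the statement above: a grammar with no base URI of any kind cannot rely on relative URIs.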

4.10 Pronunciation Lexicon

A grammar may optionally reference one or more external
pronunciation lexicon documents. A lexicon document is identified
by a URI with an optional
media type.

The pronunciation information contained within a lexicon
document is used only for tokens defined within the enclosing
grammar.

The W3C Voice Browser Working Group is developing the
Pronunciation Lexicon Markup Language [LEX]. The specification will address the matching
process between tokens and lexicon entries and the mechanism by
which a speech recognizer handles multiple pronunciations from
internal and grammar-specified lexicons. Pronunciation handling
with proprietary lexicon formats will necessarily be specific to
the speech recognizer.

Pronunciation lexicons are necessarily language-specific.
Pronunciation lookup in a lexicon and pronunciation inference for
any token may use an algorithm that is language-specific. (See
Section 2.1 for additional
information on token handling and pronunciations.)

ABNF Form

The ABNF header may contain any number of pronunciation lexicon
declarations (zero, one or many). The lexicon declaration consists
of the "lexicon" keyword followed by white space, an ABNF URI or ABNF URI with media type, optional white space and
a closing semicolon (';'). (Note that a lexicon URI is not preceded
by a dollar sign as is the case for ABNF rule references.)
Example:

XML Form

Any number of lexicon elements may occur as
immediate children of the grammar element. The
lexicon element must have a uri attribute
specifying a URI that
identifies the location of the pronunciation lexicon document.

The lexicon element may have a type
attribute that specifies the media type of the pronunciation lexicon document.

Grammar documents let authors specify metadata — information
about a document rather than document content — in a number of
ways.

A meta
declaration in either the ABNF Form or XML Form may be used to
express metadata information in both XML Form and ABNF Form
grammars or to reference metadata available in an external
resource. The XML Form also supports a metadata element that provides a more
general and powerful treatment of metadata information than
meta. Since metadata requires an XML
metadata schema which cannot be expressed in ABNF, there is no
equivalent of metadata in the ABNF Form of
grammars.

A meta declaration in either ABNF Form or the XML
Form associates a string with a declared meta property or declares
"http-equiv" content.

The seeAlso property is the only defined meta
property name. It is used to specify a resource that might provide
additional metadata information about the containing grammar. This
property is modelled on the rdfs:seeAlso property of Resource
Description Framework (RDF) Schema Specification 1.0 [RDF-SCHEMA §2.3.4].

It is recommended that, for general metadata properties,
grammar authors follow the metadata properties defined in the
Dublin Core Metadata Initiative [DC]. Examples include "Creator" to identify the entity
primarily responsible for making the content of the grammar, "Date"
to indicate creation date, and "Source" to indicate the resource
from which a grammar is derived (e.g. when converting an XML Form
grammar to the ABNF Form, use "Source" to provide the URI for the
original document).

ABNF Form

The ABNF header may contain any number of meta declarations and
http-equiv declarations (zero, one or many). Each declaration
consists of the "meta" or "http-equiv"
keyword followed by white
space, the name string delimited by quotes, the keyword
"is", white
space, the content string delimited by quotes, optional white
space and a closing semicolon (';').

The name string and the content string must be delimited by
either a matching pair of double quotes ('"') or a matching pair of
single quotes ("'").
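As an informal illustration, the declaration syntax above can be recognized with a regular expression. The Python sketch below is not part of the grammar definition: parse_meta is a hypothetical helper, the sample property values are invented, and the pattern assumes that a quoted string cannot contain its own delimiter.

```python
import re

# A string delimited by a matching pair of double or single quotes.
QUOTED = r"""(?:"([^"]*)"|'([^']*)')"""

# keyword, white space, name, "is", white space, content, optional white space, ';'
META_DECL = re.compile(
    r"(meta|http-equiv)\s+" + QUOTED + r"\s*is\s+" + QUOTED + r"\s*;"
)


def parse_meta(declaration):
    """Return (keyword, name, content), or None if the declaration is malformed."""
    match = META_DECL.fullmatch(declaration.strip())
    if match is None:
        return None
    keyword, name_dq, name_sq, content_dq, content_sq = match.groups()
    name = name_dq if name_dq is not None else name_sq
    content = content_dq if content_dq is not None else content_sq
    return keyword, name, content


print(parse_meta('meta "Creator" is "Example Author";'))
print(parse_meta("http-equiv 'Date' is 'Fri, 10 Feb 2002 17:27:21 GMT';"))
```

A declaration whose opening and closing quotes do not match is rejected, in line with the requirement for a matching pair of delimiters.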

XML Form

A metadata property is declared with a meta
element. Either a name or http-equiv
attribute is required. It is illegal to provide both
name and http-equiv attributes. A
content attribute is required. The meta,
metadata and lexicon elements must occur
before all rule elements contained within the root
grammar element. There are no constraints on the
ordering of the meta, metadata and
lexicon elements.
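The constraints above on meta attributes and element ordering can be checked mechanically. The following Python sketch is one possible approach, not a conformance test: check_header is a hypothetical function, and the namespace URI is the grammar namespace defined elsewhere in this specification.

```python
import xml.etree.ElementTree as ET

# Grammar namespace (defined elsewhere in this specification).
NS = "{http://www.w3.org/2001/06/grammar}"
HEADER_ELEMENTS = {NS + "meta", NS + "metadata", NS + "lexicon"}


def check_header(xml_text):
    """Return a list of problems found among the children of the grammar element."""
    root = ET.fromstring(xml_text)
    problems = []
    seen_rule = False
    for child in root:
        if child.tag == NS + "rule":
            seen_rule = True
        elif child.tag in HEADER_ELEMENTS and seen_rule:
            # meta, metadata and lexicon must precede every rule element.
            problems.append(child.tag[len(NS):] + " appears after a rule")
        if child.tag == NS + "meta":
            has_name = "name" in child.attrib
            has_equiv = "http-equiv" in child.attrib
            if has_name == has_equiv:
                problems.append("meta needs exactly one of name / http-equiv")
            if "content" not in child.attrib:
                problems.append("meta needs a content attribute")
    return problems


doc = ('<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0">'
       '<meta name="Creator" content="Example Author"/>'
       '<rule id="answer">yes</rule></grammar>')
print(check_header(doc))
```

An empty list means that none of the checked constraints were violated; other requirements of the XML Form are not examined.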

The metadata element is a container in which
information about the document can be placed using a metadata
schema. Although any metadata schema can be used with
metadata, it is recommended that the Resource
Description Framework (RDF) schema [RDF-SCHEMA] be used in conjunction with the general
metadata properties defined in the Dublin Core Metadata Initiative
[DC].

RDF is a declarative language and provides a standard way for
using XML to represent metadata in the form of statements about
properties and relationships of items on the Web. Content creators
should refer to W3C metadata Recommendations [RDF-SYNTAX] and [RDF-SCHEMA] when deciding which
metadata RDF schema to use in their documents. Content creators
should also refer to the Dublin Core Metadata Initiative [DC], which is a set of generally
applicable core metadata properties (e.g., Title, Creator, Subject,
Description, Copyrights, etc.).

This specification only defines an XML representation for this
form of metadata declaration. There is no ABNF equivalent for
metadata. A conversion of an XML Form grammar to the
ABNF Form may extract the XML metadata into a separate document
that is referenced with a "seeAlso" meta declaration in the ABNF
document. Note: an agent that searches XML documents for metadata
represented with RDF would be unable to locate RDF even if it were
represented in ABNF. Thus, support for RDF in ABNF was considered
low utility.

XML Form

Document properties declared with the metadata element
can use any metadata schema. The metadata,
meta, and lexicon elements must occur
before all rule elements contained within the root
grammar element. There are no constraints on the
ordering of the metadata, meta and
lexicon elements.

Informative: This is an example of how metadata can
be included in an XML grammar document using the Dublin Core
version 1.0 RDF schema [DC]
describing general document information such as title, description,
date, and so on:

XML Form

The fetching and caching behavior of both ABNF Form and XML Form
grammar documents is defined primarily by the environment in which
the grammar processor operates. For instance, VoiceXML 1.0 and
VoiceXML 2.0 define certain fetching and caching behaviors that
apply to grammars activated by a VoiceXML browser. Similarly, any
API for a recognizer that supports ABNF Form or XML Form grammars
may apply fetching and caching behaviors.

Grammar processors are recommended to support the following
interpretation of "rendering" a grammar for the purpose of
determining document freshness.

Activation of a grammar is the point at which the
recognizer begins detection of user input matching the grammar and
is therefore analogous to the action of visual or audio rendering
of system output. As with output rendering, grammar freshness
should be checked close to the moment of grammar activation.

The XML Form grammar specification and these conformance
criteria provide no designated size limits on any aspect of grammar
documents. There are no maximum values on the number of elements,
the amount of character data, or the number of characters in
attribute values.

The grammar namespace may be used with other XML namespaces as
per the Namespaces in XML Recommendation [XMLNS]. Future work by W3C will address ways to
specify conformance for documents involving multiple
namespaces.

An XML Form grammar processor is a program that can parse and
process XML Form grammar documents. Examples include speech
recognizers and DTMF detectors that accept the XML Form.

In a Conforming XML Form Grammar Processor, the XML
parser must be able to parse and process all XML constructs defined
by XML 1.0 [XML] and Namespaces
in XML [XMLNS]. This XML
parser is not required to perform validation of a grammar document
as per its schema or DTD; this implies that during processing of an
XML Form grammar document it is optional to apply or expand
external entity references defined in an external DTD.

A Conforming XML Form Grammar Processor must correctly
understand and apply the semantics of each possible grammar feature
defined by this document.

A Conforming XML Form Grammar Processor must meet the following
requirements for handling of languages:

A Conforming Grammar Processor is required to parse all legal
language declarations successfully.

A Conforming Grammar Processor should inform its hosting
environment if it encounters a language that it cannot
support.

A Conforming Grammar Processor that can support a given
language must be able to activate the root, any single public
rule, or any set of public rules or roots of one or many grammars
where each rule or root and all directly or indirectly referenced
sub-rules are for this same given language.

A Conforming Grammar Processor may activate a part (i.e., the
root, any single public rule, or any set of public rules) of one or
many grammars where the parts contain multiple languages, with one or
more languages in each part. When a processor is able to support each
language in the set but is unable to handle them concurrently it
should inform the hosting environment. When the set includes one or
more languages that are not supported by the processor it should
inform the hosting environment.

A Conforming Grammar Processor may implement languages by
approximate substitutions according to a documented,
platform-specific behavior. For example, a US English speech
recognizer may be used to process British English input.

When a Conforming XML Form Grammar Processor encounters elements
or attributes in a non-grammar namespace it may:

ignore the non-standard elements and/or attributes

or, process the non-standard elements and/or attributes

or, reject the document containing those elements and/or
attributes

A Conforming XML Form Grammar Processor is not required to
support recursive grammars, that is, grammars in which rule references include direct or indirect
self-reference.

There is, however, no conformance requirement with respect to
performance characteristics of the XML Form Grammar Processor. For
instance, no statement is required regarding the accuracy, speed or
other characteristics of a speech recognizer or DTMF detector. No
statement is made regarding the size of grammar or size of grammar
vocabulary that an XML Form Grammar Processor must support.

An ABNF Grammar Processor is a program that can parse and
process ABNF grammar documents. Examples include speech recognizers
and DTMF detectors that accept the ABNF Form.

A Conforming ABNF Grammar Processor must correctly understand
and apply the semantics of each possible grammar feature defined by
this document.

A Conforming ABNF Grammar Processor must follow the same
language handling requirements as outlined in Section 5.4 for Conforming XML Form Grammar
Processors.

A Conforming ABNF Grammar Processor should inform its hosting
environment if it encounters an illegal grammar document or other
grammar content that it is unable to process.

A Conforming ABNF Grammar Processor is not required to support
recursive grammars, that is, grammars in which rule references include direct or indirect
self-reference.

There is, however, no conformance requirement with respect to
performance characteristics of the ABNF Grammar Processor. For
instance, no statement is required regarding the accuracy, speed or
other characteristics of a speech recognizer or DTMF detector. No
statement is made regarding the size of grammar or size of grammar
vocabulary that an ABNF Grammar Processor must support.

Is capable of determining when a sequence of user input exactly
matches a grammar,

Is capable of producing an output representation that indicates
how the input matches the grammar.

Current speech recognition technology is statistically based.
Since the output is not deterministic and cannot be guaranteed to
be a correct representation of the input there is no conformance
requirement regarding accuracy. A conformance test may, however,
require some examples of correct recognition of speech input to
determine conformance.

The lexical grammar defines the lexical tokens of the ABNF
format and has single characters as its terminal symbols. As a
consequence neither white space characters nor ABNF comments are allowed in lexical tokens unless
explicitly specified.

SelfIdentHeader ::=
'#ABNF' #x20 VersionNumber (#x20 CharEncoding)? ';'
[Additional constraints:
- The semicolon (';') must immediately be followed
by an end-of-line.
]
VersionNumber ::=
'1.0'
CharEncoding ::=
Nmtoken
BaseURI ::=
ABNF_URI
LanguageCode ::=
Nmtoken
[Additional constraints:
- The language code must be a valid language identifier.
]
RuleName ::=
'$' ConstrainedName
ConstrainedName ::=
Name - (Char* ('.' | ':' | '-') Char*)
TagFormat ::=
ABNF_URI
LexiconURI ::=
ABNF_URI | ABNF_URI_with_Media_Type
SingleQuotedCharacters ::=
''' [^']* '''
DoubleQuotedCharacters ::=
'"' [^"]* '"'
QuotedCharacters ::=
SingleQuotedCharacters | DoubleQuotedCharacters
Weight ::=
'/' Number '/'
Repeat ::=
[0-9]+ ('-' [0-9]*)?
[Additional constraints:
- A number to the right of the hyphen must not be
less than the number to the left of the hyphen.
]
Probability ::=
'/' Number '/'
[Additional constraints:
- The float value must be in the range of "0.0"
to "1.0" (inclusive).
]
Number ::=
[0-9]+ | [0-9]+ '.' [0-9]* | [0-9]* '.' [0-9]+
ExternalRuleRef ::=
'$' ABNF_URI | '$' ABNF_URI_with_Media_Type
[Additional constraints:
- The referenced grammar must have the same mode
("voice" or "dtmf") as the referencing grammar.
- If the URI reference contains a fragment
identifier, the referenced rule must be a
public rule of another grammar.
- If the URI reference does not contain a fragment
identifier, i.e. if it is an implicit root rule reference,
then the referenced grammar must declare a root
rule.
]
Token ::=
Nmtoken | DoubleQuotedCharacters
LanguageAttachment ::=
'!' LanguageCode
Tag ::=
'{' [^}]* '}'
| '{!{' (Char* - (Char* '}!}' Char*)) '}!}'
------------------------------------------------------------
ABNF_URI and ABNF_URI_with_Media_Type are defined
in Section 1.6 Terminology.
Name is defined by the XML Name production [XML §2.3].
Nmtoken is defined by the XML Nmtoken production [XML §2.3].
NameChar is defined by the XML NameChar production [XML §2.3].
Char is defined by the XML Char production [XML §2.2].

Note: As mentioned in Section
2.5 the symbols "*", "+" and "?", which are often used in
regular expression languages, are reserved for future use in ABNF
and must not be used at any place in a grammar where the syntax
currently permits a repeat operator.
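The Repeat, Number and Probability productions and their additional constraints can be illustrated with a short Python sketch. The function names are hypothetical. The sketch assumes that an upper bound, when present, must be at least the lower bound (as in repeats such as "0-3" used in this document), that a single number means an exact count, and that an open range such as "1-" has no upper bound.

```python
import re

REPEAT = re.compile(r"([0-9]+)(?:(-)([0-9]*))?")
NUMBER = re.compile(r"[0-9]+|[0-9]+\.[0-9]*|[0-9]*\.[0-9]+")


def parse_repeat(text):
    """Return (minimum, maximum); maximum is None for an open range such as '1-'."""
    match = REPEAT.fullmatch(text)
    if match is None:
        raise ValueError("not a legal repeat: %r" % text)
    low = int(match.group(1))
    if match.group(2) is None:
        return low, low            # a single number: exactly that many times
    if match.group(3) == "":
        return low, None           # "m-": m or more times
    high = int(match.group(3))
    if high < low:
        raise ValueError("upper bound %d is below lower bound %d" % (high, low))
    return low, high


def parse_probability(text):
    """Return the float value of '/Number/' if it lies in the range 0.0 to 1.0."""
    if len(text) < 3 or text[0] != "/" or text[-1] != "/":
        raise ValueError("not delimited by slashes: %r" % text)
    if NUMBER.fullmatch(text[1:-1]) is None:
        raise ValueError("not a legal number: %r" % text)
    value = float(text[1:-1])
    if not 0.0 <= value <= 1.0:
        raise ValueError("probability out of range: %r" % text)
    return value


print(parse_repeat("0-3"), parse_repeat("1-"), parse_probability("/0.5/"))
```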

Syntactic Grammar for ABNF

The syntactic grammar has lexical tokens defined by the lexical
grammar as its terminal symbols. Between two lexical tokens any
number of white spaces
or ABNF comments may appear.

This section defines a normative representation of a grammar
consisting of DTMF tokens. A
DTMF grammar can be used by a DTMF detector to determine sequences
of legal and illegal DTMF events. All grammar processors that
support grammars of mode "dtmf" must implement this
Appendix. However, not all grammar processors are required to
support DTMF input.

If the grammar mode is declared
as "dtmf" then tokens contained by
the grammar are treated as DTMF tones (rather than the default of
speech tokens).

There are sixteen (16) DTMF tones. Of these twelve (12) are
commonly found on telephone sets as the digits "0" through "9" plus
"*" (star) and "#" (pound). The four DTMF tones not typically
present on telephones are "A", "B", "C", "D".

Each of the DTMF symbols is a legal DTMF token in a DTMF
grammar. As in speech grammars, tokens must be separated by
white space in a DTMF
grammar. A space-separated sequence of DTMF symbols represents a
temporal sequence of DTMF entries.

In the ABNF Form the "*" symbol is reserved so double quotes
must always be used to delimit "*" when defining an ABNF DTMF
grammar. It is recommended that the "#" symbol also be quoted. As
an alternative the tokens "star" and "pound" are acceptable
synonyms.
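As an informal illustration of the DTMF token rules above, the following Python sketch normalizes a space-separated sequence of DTMF tokens. The name normalize_dtmf is hypothetical, and mapping the synonyms "star" and "pound" onto "*" and "#" is a choice made for this sketch rather than a requirement.

```python
# The sixteen DTMF symbols: digits, star, pound, and the letters A to D.
DTMF_SYMBOLS = set("0123456789*#ABCD")

# Acceptable synonyms for the symbols that are reserved or best quoted in ABNF.
SYNONYMS = {"star": "*", "pound": "#"}


def normalize_dtmf(sequence):
    """Return the list of DTMF symbols for a white-space-separated token sequence."""
    symbols = []
    for token in sequence.split():
        token = token.strip('"')            # tokens such as "*" may be quoted
        token = SYNONYMS.get(token, token)
        if token not in DTMF_SYMBOLS:
            raise ValueError("not a DTMF token: %r" % token)
        symbols.append(token)
    return symbols


print(normalize_dtmf('1 "*" pound 9'))
```

Because tokens must be separated by white space, an unseparated string such as 12 is rejected rather than split into two tones.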

This section defines an informative representation of a parsed
result of speech recognition or other user agent processing. This
representation may be used as the basis for subsequent processing
of user input, in particular, semantic
interpretation. For instance, the W3C Semantic Interpretation
for Speech Recognition specification [SEM] is defined around the logical parse structure.

This Appendix adopts the terminology and nomenclature of
Introduction to Automata Theory, Languages, and
Computation [HU79].

Denote the tokens of the alphabet of all tokens
accepted by a grammar as t1, t2....

An input or output token sequence is a space-separated string of
tokens. The logical parse structure contains white-space-normalized tokens. The tokens in the logical
parse structure are optionally delimited by double quotes so that
white space and other
characters can be parsed unambiguously, e.g. t1,t2,"t3 with
space". (For consistency, all examples in this Appendix include
double quotes.)

Let ε (epsilon) or "" denote the unique string of
length 0, also known as the empty string.

Denote the tags of the alphabet of all tags accepted by
a grammar as {tag1}, {tag2}, ....

The expressive power of a rule expansion is a Regular
Expression (see HU79) and has an equivalent Finite
Automaton (see HU79). [The handling of rule references
requires special treatment: see Section H.2.] The expressive power of the grammar
specification consists of:

Tokens: a finite automaton transition with symbol

Tag: a finite automaton transition on ε

Sequence: concatenation operation on finite automata

Alternatives: union operation on finite automata

Repeats: representable by combinations of concatenation,
closure and union.

We formalize the logical parse structure by creating a
Finite Automaton with Output (see HU79). This construct is
also referred to as a Finite State Transducer.

We define the transitions for tokens and tags as producing an
output symbol.

Token: transition that accepts token t and produces as
output token t.
In the notation of HU79: t/t

Tag: transition that accepts ε (no token) and produces
as output {!{tag}!}
In the notation of HU79: ε/{!{tag}!}

We represent parse output as an ordered array of output
entities: [e1,e2,e3,...].

Special Cases

A $NULL reference is equivalent to a transition that accepts as
input ε and produces as output ε. In the notation
of HU79: ε/ε.

A $VOID reference is logically equivalent to a missing
transition. It accepts no input and produces no output.

A $GARBAGE reference is equivalent to a transition that accepts
platform specific input and produces as output ε.

Ambiguity

An ambiguity occurs when for a specified sequence of
input tokens matched to a specified rule of a grammar there is more
than one distinct logical parse structure that can be produced.

An ambiguity can occur at points of disjunction
(choice) in a grammar. Disjunction exists with the use of alternatives and repeats.

A grammar processor may preserve any number of ambiguous logical
parse structures to create a set of alternative logical parse
structures for the input. It is legal for a grammar processor to
maintain all possible logical parse structures or to dispose of all
but one of the alternatives. This specification does not define how a grammar
processor selects among ambiguous possibilities. As a result, grammars that contain ambiguity do not
guarantee portability of performance. Developers and grammar tools
should be ambiguity-aware.

This Appendix does not illustrate all forms of ambiguous
expansions but provides examples of some of the more common
forms.
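The transducer view described above can be made concrete with a small enumerator that lists every distinct logical parse structure for an input. The Python sketch below is an illustration only: the tuple-based expansion encoding and the function names are hypothetical, rule references are not handled, and repeats must have a finite upper bound so that the enumeration terminates.

```python
def parses(exp, tokens, i=0):
    """Yield (next_position, output) for every way exp matches tokens from i."""
    kind = exp[0]
    if kind == "tok":                      # t/t: accept token t, output t
        if i < len(tokens) and tokens[i] == exp[1]:
            yield i + 1, ('"' + exp[1] + '"',)
    elif kind == "tag":                    # epsilon/{!{tag}!}
        yield i, ("{!{" + exp[1] + "}!}",)
    elif kind == "null":                   # epsilon/epsilon
        yield i, ()
    elif kind == "alt":                    # union of the alternatives
        for choice in exp[1]:
            yield from parses(choice, tokens, i)
    elif kind == "seq":                    # concatenation
        yield from _sequence(exp[1], tokens, i)
    elif kind == "rep":                    # bounded repeat <lo-hi>
        _, body, lo, hi = exp
        for count in range(lo, hi + 1):
            yield from _sequence([body] * count, tokens, i)


def _sequence(items, tokens, i):
    """Match the items one after another, concatenating their outputs in order."""
    if not items:
        yield i, ()
        return
    for j, head in parses(items[0], tokens, i):
        for k, tail in _sequence(items[1:], tokens, j):
            yield k, head + tail


def logical_parses(exp, text):
    """Return the distinct outputs for parses that consume the whole input."""
    tokens = text.split()
    found = {out for end, out in parses(exp, tokens) if end == len(tokens)}
    return sorted(found, key=lambda out: (len(out), out))


def is_ambiguous(exp, text):
    return len(logical_parses(exp, text)) > 1


# (t1 | {tag}) <0-3> matched against the input "t1"
expansion = ("rep", ("alt", [("tok", "t1"), ("tag", "tag")]), 0, 3)
for output in logical_parses(expansion, "t1"):
    print("[" + ",".join(output) + "]")
```

Because outputs are collected in a set, paths that produce identical output count as a single logical parse structure, so an expansion such as t1 | t1 | t2 is not reported as ambiguous.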

Examples

Matching a token to a token produces an array of 1 token.

Expansion

t1

Input

t1

Output

["t1"]

A $NULL reference is matched by an empty input sequence and
output is an empty array.

Expansion

$NULL

Input

""

Output

[]

A tag is matched by an empty input sequence and output is an
array of 1 tag.

Expansion

{tag} or {!{tag}!}

Input

""

Output

[{!{tag}!}]

Concatenation: An expansion consisting of a token and a tag is
matched by input containing the token and produces as output a
token, tag array.

Expansion

t1 {tag1}

Input

t1

Output

["t1",{!{tag1}!}]

Concatenation: an expansion consisting of a sequence of tokens,
tags and $NULLs is matched by input that consists of the contained
tokens. Output consists of the sequence of tokens and tags with
order preserved. e.g.

Expansion

t1 $NULL {tag1} t2 {tag2} t3

Input

t1 t2 t3

Output

["t1",{!{tag1}!},"t2",{!{tag2}!},"t3"]

Parenthetical structure is not preserved in the result. The
following is the same sequence as the previous example but with
parentheticals added to the expansion definition.

Expansion

((t1) $NULL) {tag1} (t2 {tag2} t3)

Input

t1 t2 t3

Output

["t1",{!{tag1}!},"t2",{!{tag2}!},"t3"]

Alternatives: a set of many alternative tokens is matched by
input of a single token and produces as output a single token.

Expansion

t1 | t2 | t3

Input

t2

Output

["t2"]

Alternatives: if any single expansion in a set of alternatives
can be matched by null input then the set of alternatives may be
matched by null input and the output is the output of the
null-accepting expansion. ($NULL, {tag} and repeat counts of zero
all permit null input.)

Expansion

t1 | t2 | $NULL

Input

""

Output

[]

With a different null-accepting expansion:

Expansion

t1 | t2 | {tag}

Input

""

Output

[{!{tag}!}]

Alternatives and ambiguity: several examples of ambiguous
expansions with the ambiguity arising from alternatives that accept
the same input but produce different output.

Expansion

t1 {tag1} | t1 {tag2} | t2

Input

t1

Output 1

["t1",{!{tag1}!}]

Output 2

["t1",{!{tag2}!}]

In this example null input is ambiguous.

Expansion

{tag1} | {tag2} | $NULL

Input

""

Output 1

[{!{tag1}!}]

Output 2

[{!{tag2}!}]

Output 3

[]

The following is not ambiguous because the different paths
through the expansion produce the same output.

Expansion

t1 | t1 | t2

Input

t1

Output 1

["t1"]

Output 2

["t1"]

Repeats: an optional expansion can be either matched by an empty
token sequence or by any token sequence that matches the expansion
contained within the optional.

Expansion

t1 <0-1>

Input 1

""

Output 1

[]

Input 2

t1

Output 2

["t1"]

Repeats: order is preserved upon multiple expansions.

Expansion

(t1 {tag1}) <0-3>

Input 1

""

Output 1

[]

Input 2

t1

Output 2

["t1",{!{tag1}!}]

Input 3

t1 t1 t1

Output 3

["t1",{!{tag1}!},"t1",{!{tag1}!},"t1",{!{tag1}!}]

Repeats and null input: If the contents of an optional expansion
can be matched by an empty input sequence AND the output of
matching the contained expansion is always an empty array then the
output of matching the optional expansion by an empty sequence is
also an empty array.

Expansion

$NULL <0-1>

Input

""

Output

[]

Ambiguous repeats: If a repeated or optional expansion can be
matched by an empty input sequence BUT the output of matching the
contained expansion may contain tags then the parse is ambiguous.
It is recommended that the parse be minimal: Output 1 is
preferred.

Expansion

{tag} <0->

Input

""

Output 1

[]

Output 2

[{!{tag}!}]

Output 3

[{!{tag}!},{!{tag}!}]

Output N

[{!{tag}!},{!{tag}!},{!{tag}!},...]

A similar ambiguity arises if the repeated expansion contains an
alternative expansion that has a null-accepting expansion.

Expansion

(t1 | {tag}) <0-3>

Input

t1

Output 1

["t1"]

Output 2

["t1",{!{tag}!}]

Output 3

[{!{tag}!},"t1"]

Output 4

["t1",{!{tag}!},{!{tag}!}]

Output 5

[{!{tag}!},"t1",{!{tag}!}]

Output 6

[{!{tag}!},{!{tag}!},"t1"]

A sequence with two repeat expansions can be ambiguous if the two
repeated expansions can accept the same input but produce different
output.

We denote output obtained by matching the token sequence
"t1,t2,..." against the expansion $rulename as
$rulename[e1,e2,...] where "e1,e2,..." is the entity
sequence obtained by matching that token sequence against the rule
expansion defined for $rulename. Where a rule reference to an
external rule is used the ABNF syntax for the rule reference is
used (without any media type). For example,
$<http://www.example.com/grammar.grxml#rulename>[e1,e2,...]
or an implicit root rule reference
$<http://www.example.com/grammar.grxml>[e1,e2,...].
For brevity, all the examples below use only local rule
references.

The rulename of the top-level rule should enclose the logical
parse structure.

A distinct structure for matching rule references maintains the
parse tree for the result. This structure may be utilized in the
semantic interpretation process or other computational processes
that derive from the parse output structure.
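The nested structure can be pictured as a tree in which each matched rule reference wraps the entities produced by its expansion. The Python sketch below renders such a tree in the $rulename[e1,e2,...] notation. It is only an illustration: the tuple encoding, the render function and the rule names "answer" and "polite" are hypothetical.

```python
def render(entity):
    """Render a parse entity in the $rulename[e1,e2,...] notation.

    A matched rule is a (rulename, entities) tuple; tokens and tags are
    plain strings that are already in their output form.
    """
    if isinstance(entity, tuple):
        name, children = entity
        return "$" + name + "[" + ",".join(render(child) for child in children) + "]"
    return entity


# The top-level rule encloses the whole structure; the nested rule keeps its own bracket.
parse = ("answer", ['"yes"', ("polite", ['"please"', "{!{tag1}!}"])])
print(render(parse))
```

Keeping each rule's entities inside its own bracket is what preserves the parse tree for later semantic interpretation.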

The following two grammars are XML Form grammars with Korean
yes/no content. The first represents the Korean symbols as Unicode
characters and has UTF-8 encoding. The second represents the same
Unicode characters using character escaping.

The following two grammars are XML Form grammars with Chinese
number content. The first represents the Chinese symbols as Unicode
characters with the UTF-8 encoding. The second represents the same
Unicode characters using character escaping.