Abstract

XML Schema: Datatypes is part 2 of the specification of the XML
Schema language. It defines facilities for defining datatypes to be used
in XML Schemas as well as other XML specifications.
The datatype language, which is itself represented in
XML 1.0, provides a superset of the capabilities found in XML 1.0
document type definitions (DTDs) for specifying datatypes on elements
and attributes.

Status of this Document

This section describes the status of this document at the
time of its publication. Other documents may supersede this document.
A list of current W3C publications and the latest
revision of this technical report can be found in the W3C technical reports index at
http://www.w3.org/TR/.

This is a W3C
Recommendation, which forms part of the Second Edition of XML
Schema. This document has been reviewed by W3C Members and
other interested parties and has been endorsed by the Director as a
W3C Recommendation. It is a stable document and may be used as
reference material or cited as a normative reference
from another document.
W3C's role in making the Recommendation is to draw attention
to the specification and to promote its widespread deployment. This
enhances the functionality and interoperability of the Web.

This document has been produced by the W3C XML Schema Working Group
as part of the W3C XML
Activity. The goals of the XML Schema language are discussed in
the XML Schema
Requirements document. The authors of this document are the
members of the XML Schema Working Group. Different parts of this
specification have different editors.

This second edition is not a new version; it merely incorporates
the changes dictated by the corrections to errors found in the first
edition, as agreed by the XML Schema Working Group, as a
convenience to readers. A separate list of all such corrections is
available at http://www.w3.org/2001/05/xmlschema-errata.

1 Introduction

1.1 Purpose

The [XML 1.0 (Second Edition)] specification defines limited
facilities for applying datatypes to document content in that documents
may contain or refer to DTDs that assign types to elements and attributes.
However, document authors, including authors of traditional
documents and those transporting data in XML,
often require a higher degree of type checking to ensure robustness in
document understanding and data interchange.

The table below offers two typical examples of XML instances
in which datatypes are implicit: the instance on the left
represents a billing invoice, the instance on the
right a memo or perhaps an email message in XML.

The invoice contains several dates and telephone numbers, the postal
abbreviation for a state
(which comes from an enumerated list of sanctioned values), and a ZIP code
(which takes a definable regular form). The memo contains many
of the same types of information: a date, telephone number, email address
and an "importance" value (from an enumerated
list, such as "low", "medium" or "high"). Applications which process
invoices and memos need to raise exceptions if something that was
supposed to be a date or telephone number does not conform to the rules
for valid dates or telephone numbers.

In both cases, validity constraints exist on the content of the
instances that are not expressible in XML DTDs. The limited datatyping
facilities in XML have prevented validating XML processors from supplying
the rigorous type checking required in these situations. The result
has been that individual applications writers have had to implement type
checking in an ad hoc manner. This specification addresses
the need of both document authors and applications writers for a robust,
extensible datatype system for XML which could be incorporated into
XML processors. As discussed below, these datatypes could be used in other
XML-related standards as well.

1.2 Requirements

The [XML Schema Requirements] document spells out
concrete requirements to be fulfilled by this specification,
which state that the XML Schema Language must:

allow creation of user-defined datatypes, such as
datatypes that are derived from existing datatypes and which
may constrain certain of their properties (e.g., range,
precision, length, format).

1.3 Scope

This portion of the XML Schema Language discusses datatypes that can be
used in an XML Schema. These datatypes can be specified for element
content that would be specified as #PCDATA and attribute values of
various types in a DTD. It is the intention of this specification
that it be usable outside of the context of XML Schemas for a wide range
of other XML-related activities such as [XSL] and
[RDF Schema].

1.4 Terminology

The terminology used to describe XML Schema Datatypes is defined in the
body of this specification. The terms defined in the following list are
used in building those definitions and in describing the actions of a
datatype processor:

match: (Of strings or names:) Two strings or names being compared must be
identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g.
characters with both precomposed and base+diacritic forms) match only if they have
the same representation in both strings. No case folding is performed. (Of strings and
rules in the grammar:) A string matches a grammatical production if it belongs to the
language generated by that production.

Constraints expressed by schema components which information
items ·must· satisfy to be schema-valid. Largely
to be found in Datatype components (§4).

2 Type System

This section describes the conceptual framework behind the type system
defined in this specification. The framework has been influenced by the
[ISO 11404] standard on language-independent datatypes as
well as the datatypes for [SQL] and for programming
languages such as Java.

The datatypes discussed in this specification are computer
representations of well known abstract concepts such as
integer and date. It is not the place of this
specification to define these abstract concepts; many other publications
provide excellent definitions.

2.3 Lexical space

In addition to its ·value space·, each datatype also
has a lexical space.

[Definition:] A
lexical space is the set of valid literals
for a datatype.

For example, "100" and "1.0E2" are two different literals from the
·lexical space· of float which both
denote the same value. The type system defined in this specification
provides a mechanism for schema designers to control the set of values
and the corresponding set of acceptable literals of those values for
a datatype.
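
The distinction between literals and values can be illustrated with a
minimal sketch. Python is used here only as a convenient notation; its
built-in float is a double-precision type and its parser accepts more
literals than this specification does, so the sketch is an analogy rather
than a conforming implementation.

```python
# Two different literals from the lexical space ...
literals = ["100", "1.0E2"]

# ... mapped to values (Python's float stands in for the datatype here).
values = [float(literal) for literal in literals]

distinct_literals = literals[0] != literals[1]
same_value = values[0] == values[1]
print(distinct_literals, same_value)
```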

Note:
The literals in the ·lexical space·s defined in this specification
have the following characteristics:

Interoperability:

The number of literals for each value has been kept small; for many
datatypes there is a one-to-one mapping between literals and values.
This makes it easy to exchange the values between different systems.
In many cases, conversion from locale-dependent representations will
be required on both the originator and the recipient side, both for
computer processing and for interaction with humans.

Basic readability:

Textual, rather than binary, literals are used.
This makes hand editing, debugging, and similar activities possible.

Ease of parsing and serializing:

Where possible, literals correspond to those found in common
programming languages and libraries.

2.3.1 Canonical Lexical Representation

While the datatypes defined in this specification have, for the most part,
a single lexical representation, i.e., each value in the datatype's
·value space· is denoted by a single literal in its
·lexical space·, this is not always the case. The
example in the previous section showed two literals for the datatype
float which denote the same value. Similarly, there
·may· be
several literals for one of the date or time datatypes that denote the
same value using different timezone indicators.

[Definition:] A canonical lexical representation
is a set of literals from among the valid set of literals
for a datatype such that there is a one-to-one mapping between literals
in the canonical lexical representation and
values in the ·value space·.

The facets of a datatype serve to distinguish those aspects of
one datatype which differ from other datatypes.
Rather than being defined solely in terms of a prose description
the datatypes in this specification are defined in terms of
the synthesis of facet values which together determine the
·value space· and properties of the datatype.

Facets are of two types: fundamental facets that define
the datatype and non-fundamental or constraining
facets that constrain the permitted values of a datatype.

A prototypical example of a ·union· type is the
maxOccurs attribute on the
element element
in XML Schema itself: it is a union of nonNegativeInteger
and an enumeration with the single member, the string "unbounded", as shown below.

The order in which the ·memberTypes· are specified in the
definition (that is, the order of the <simpleType> children of the <union>
element, or the order of the QNames in the memberTypes
attribute) is significant.
During validation, an element or attribute's value is validated against the
·memberTypes· in the order in which they appear in the
definition until a match is found. The evaluation order can be overridden
with the use of xsi:type.

Example

For example, given the definition below, the first instance of the <size> element
validates correctly as an integer (§3.3.13), the second and third as
string (§3.2.1).
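
The ordered evaluation of member types can be sketched as follows. The
sketch assumes a union whose member types are integer followed by string;
the names validate_union, is_integer and is_string are hypothetical, and
the two predicates are simplified stand-ins for real datatype validation.

```python
import re

def is_integer(literal):
    # Simplified stand-in for the integer lexical rules.
    return re.fullmatch(r"[+-]?[0-9]+", literal) is not None

def is_string(literal):
    # Every character sequence is accepted as a string in this sketch.
    return True

# Order is significant: integer is tried before string.
MEMBER_TYPES = [("integer", is_integer), ("string", is_string)]

def validate_union(literal, xsi_type=None):
    """Return the name of the member type that validates the literal, or None."""
    if xsi_type is not None:
        # An explicit xsi:type overrides the evaluation order.
        for name, accepts in MEMBER_TYPES:
            if name == xsi_type:
                return name if accepts(literal) else None
        return None
    for name, accepts in MEMBER_TYPES:
        if accepts(literal):
            return name
    return None
```

Under these assumptions, validate_union("1") selects integer,
validate_union("large") selects string, and
validate_union("1", xsi_type="string") selects string even though the
literal would also be a valid integer.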

Note:
A datatype which is ·atomic· in this specification
need not be an "atomic" datatype in any programming language used to
implement this specification. Likewise, a datatype which is a
·list· in this specification need not be a "list"
datatype in any programming language used to implement this specification.
Furthermore, a datatype which is a ·union· in this
specification need not be a "union" datatype in any programming
language used to implement this specification.

2.5.2 Primitive vs. derived datatypes

[Definition:] Primitive
datatypes are those that are not defined in terms of other datatypes;
they exist ab initio.

[Definition:] Derived
datatypes are those that are defined in terms of other datatypes.

For example, in this specification, float is a well-defined
mathematical concept that cannot be defined in terms of other
datatypes, while an integer is a special case of the more general
datatype decimal.

The datatypes defined by this specification fall into both
the ·primitive· and ·derived·
categories. It is felt that a judiciously chosen set of
·primitive· datatypes will serve the widest
possible audience by providing a set of convenient datatypes that
can be used as is, as well as providing a rich enough base from
which the variety of datatypes needed by schema designers can be
·derived·.

Note:
A datatype which is ·primitive· in this specification
need not be a "primitive" datatype in any programming language used to
implement this specification. Likewise, a datatype which is
·derived· in this specification need not be a
"derived" datatype in any programming language used to implement
this specification.

2.5.3 Built-in vs. user-derived datatypes

[Definition:] User-derived datatypes are those ·derived·
datatypes that are defined by individual schema designers.

Conceptually there is no difference between the
·built-in· ·derived· datatypes
included in this specification and the ·user-derived·
datatypes which will be created by individual schema designers.
The ·built-in· ·derived· datatypes
are those which are believed to be so common that if they were not
defined in this specification many schema designers would end up
"reinventing" them. Furthermore, including these
·derived· datatypes in this specification serves to
demonstrate the mechanics and utility of the datatype generation
facilities of this specification.

Note:
A datatype which is ·built-in· in this specification
need not be a "built-in" datatype in any programming language used
to implement this specification. Likewise, a datatype which is
·user-derived· in this specification need not
be a "user-derived" datatype in any programming language used to
implement this specification.

3 Built-in datatypes

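Each built-in datatype in this specification (both
·primitive· and
·derived·) can be uniquely addressed via a
URI Reference constructed as follows:

the base URI is the URI of the XML Schema namespace

the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

http://www.w3.org/2001/XMLSchema#int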

Additionally, each facet definition element can be uniquely
addressed via a URI constructed as follows:

the base URI is the URI of the XML Schema namespace

the fragment identifier is the name of the facet

For example, to address the maxInclusive facet, the URI is:

http://www.w3.org/2001/XMLSchema#maxInclusive

Additionally, each facet usage in a built-in datatype definition
can be uniquely addressed via a URI constructed as follows:

the base URI is the URI of the XML Schema namespace

the fragment identifier is the name of the datatype, followed
by a period (".") followed by the name of the facet

For example, to address the usage of the maxInclusive facet in
the definition of int, the URI is:

http://www.w3.org/2001/XMLSchema#int.maxInclusive
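
The two facet addressing rules above can be expressed as a short sketch.
The function names facet_uri and facet_usage_uri are hypothetical; only
the namespace URI and the construction rules come from the text.

```python
XML_SCHEMA_NS = "http://www.w3.org/2001/XMLSchema"

def facet_uri(facet):
    # Base URI plus the facet name as the fragment identifier.
    return f"{XML_SCHEMA_NS}#{facet}"

def facet_usage_uri(datatype, facet):
    # Fragment identifier: datatype name, a period, then the facet name.
    return f"{XML_SCHEMA_NS}#{datatype}.{facet}"

print(facet_uri("maxInclusive"))
print(facet_usage_uri("int", "maxInclusive"))
```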

3.1 Namespace considerations

The ·built-in· datatypes defined by this specification
are designed to be used with the XML Schema definition language as well as other
XML specifications.
To facilitate usage within the XML Schema definition language, the ·built-in·
datatypes in this specification have the namespace name:

http://www.w3.org/2001/XMLSchema

To facilitate usage in specifications other than the XML Schema definition language,
such as those that do not want to know anything about aspects of the
XML Schema definition language other than the datatypes, each ·built-in·
datatype is also defined in the namespace whose URI is:

3.2.2.2 Canonical representation

3.2.2.3 Constraining facets

3.2.3 decimal

[Definition:] decimal
represents a subset of the real numbers, which can be represented by decimal numerals.
The ·value space· of decimal
is the set of
numbers that can be obtained by multiplying an integer by a non-positive
power of ten, i.e., expressible as i × 10^-n
where i and n are integers
and n >= 0.
Precision is not reflected in this value space;
the number 2.0 is not distinct from the number 2.00.
The ·order-relation· on decimal
is the order relation on real numbers, restricted
to this subset.
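
As an informal illustration, the value i × 10^-n can be modelled with exact
rational arithmetic. The helper decimal_value is hypothetical; Python's
Fraction and Decimal types are used only to show that precision is not part
of the value.

```python
from decimal import Decimal
from fractions import Fraction

def decimal_value(i, n):
    """Return the value i × 10^-n for integers i and n with n >= 0."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return Fraction(i, 10 ** n)

# 2.0 is 20 × 10^-1 and 2.00 is 200 × 10^-2: the same value.
two_point_zero = decimal_value(20, 1)
two_point_zero_zero = decimal_value(200, 2)
same_value = two_point_zero == two_point_zero_zero

# Python's Decimal compares by numeric value in the same way.
same_decimal = Decimal("2.0") == Decimal("2.00")
```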

Note:
All ·minimally conforming· processors ·must·
support decimal numbers with a minimum of 18 decimal digits (i.e., with a
·totalDigits· of 18). However,
·minimally conforming· processors ·may·
set an application-defined limit on the maximum number of decimal digits
they are prepared to support, in which case that application-defined
maximum number ·must· be clearly documented.

3.2.3.1 Lexical representation

decimal has a lexical representation
consisting of a finite-length sequence of decimal digits (#x30-#x39) separated
by a period as a decimal indicator.
An optional leading sign is allowed.
If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional.
If the fractional part is zero, the period and following zero(es) can
be omitted.
For example: -1.23, 12678967.543233, +100000.00, 210.
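
One possible reading of these rules as a regular expression is sketched
below. The pattern and the name is_decimal_literal are assumptions made for
illustration; in particular, treating a missing leading zero (".5") and a
trailing period ("5.") as acceptable is an interpretation of "leading and
trailing zeroes are optional", not a statement taken from the text.

```python
import re

# Optional sign, then digits with an optional fractional part,
# or a fractional part with the leading zero omitted.
_DECIMAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

def is_decimal_literal(text):
    return _DECIMAL.fullmatch(text) is not None

examples = ["-1.23", "12678967.543233", "+100000.00", "210"]
all_valid = all(is_decimal_literal(e) for e in examples)
```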

3.2.3.2 Canonical representation

The canonical representation for decimal is defined by
prohibiting certain options from the
Lexical representation (§3.2.3.1). Specifically, the preceding
optional "+" sign is prohibited. The decimal point is required. Leading and
trailing zeroes are prohibited subject to the following: there must be at least
one digit to the right and to the left of the decimal point which may be a zero.
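
A minimal sketch of a canonicalization routine that follows these rules is
given below. The function name canonical_decimal is hypothetical, the
accepted input pattern is an assumed reading of the lexical rules, and
dropping the sign of a zero value is an assumption made so that zero has a
single canonical literal.

```python
import re

_DECIMAL = re.compile(r"([+-]?)(?:([0-9]+)(?:\.([0-9]*))?|\.([0-9]+))")

def canonical_decimal(literal):
    m = _DECIMAL.fullmatch(literal)
    if m is None:
        raise ValueError(f"not a decimal literal: {literal!r}")
    sign, whole, frac_a, frac_b = m.groups()
    # Strip leading zeroes, keeping at least one digit left of the point.
    whole = (whole or "").lstrip("0") or "0"
    # Strip trailing zeroes, keeping at least one digit right of the point.
    frac = (frac_a or frac_b or "").rstrip("0") or "0"
    is_zero = whole == "0" and frac == "0"
    # "+" is never emitted; "-" is kept only for non-zero values.
    prefix = "-" if sign == "-" and not is_zero else ""
    return f"{prefix}{whole}.{frac}"
```

For instance, the literals +100000.00 and 210 from the previous section
would be rewritten as 100000.0 and 210.0 respectively.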

3.2.3.4 Derived datatypes

3.2.4 float

[Definition:] float
is patterned after the IEEE single-precision 32-bit floating point type
[IEEE 754-1985]. The basic ·value space· of
float consists of the values
m × 2^e, where m
is an integer whose absolute value is less than
2^24, and e is an integer
between -149 and 104, inclusive. In addition to the basic
·value space· described above, the
·value space· of float also contains the
following three special values: positive and negative infinity and
not-a-number (NaN).
The ·order-relation· on float
is: x < y iff y - x is positive
for x and y in the value space.
Positive infinity is greater than all other non-NaN values.
NaN equals itself but is ·incomparable· with (neither greater than nor less than)
any other value in the ·value space·.
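
The bounds given for m and e can be checked against the IEEE
single-precision bit patterns. The following sketch uses Python's struct
module to decode the largest finite and smallest positive single-precision
values; the helper name float32_from_bits is hypothetical.

```python
import struct

def float32_from_bits(bits):
    # Reinterpret a 32-bit pattern as an IEEE single-precision value.
    return struct.unpack(">f", struct.pack(">I", bits))[0]

largest_finite = float32_from_bits(0x7F7FFFFF)
smallest_positive = float32_from_bits(0x00000001)

# Largest m (2^24 - 1) with largest e (104).
largest_matches = largest_finite == (2**24 - 1) * 2.0**104
# Smallest positive m (1) with smallest e (-149).
smallest_matches = smallest_positive == 2.0**-149
```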

Note:
"Equality" in this Recommendation is defined to be "identity" (i.e., values that
are identical in the ·value space· are equal and vice versa).
Identity must be used for the few operations that are defined in this Recommendation.
Applications using any of the datatypes defined in this Recommendation may use different
definitions of equality for computational purposes; [IEEE 754-1985]-based computation systems
are examples. Nothing in this Recommendation should be construed as requiring that
such applications use identity as their equality relationship when computing.

Any value ·incomparable· with the value used for the four bounding facets
(·minInclusive·, ·maxInclusive·,
·minExclusive·, and ·maxExclusive·) will be
excluded from the resulting restricted ·value space·. In particular,
when "NaN" is used as a facet value for a bounding facet, since no other
float values are ·comparable· with it, the result is a ·value space·
either having NaN as its only member (the inclusive cases) or that is empty
(the exclusive cases). If any other value is used for a bounding facet,
NaN will be excluded from the resulting restricted ·value space·;
to add NaN back in requires union with the NaN-only space.

This datatype differs from that of [IEEE 754-1985] in that there is only one
NaN and only one zero. This makes the equality and ordering of values in the data
space differ from that of [IEEE 754-1985] only in that for schema purposes NaN = NaN.
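
The difference between identity-based equality and IEEE comparison, and
the effect of a bounding facet on NaN, can be sketched as follows. The
names schema_equal and restrict_max_inclusive are hypothetical, and
Python's double-precision float is used as a stand-in for the datatype.

```python
import math

def schema_equal(x, y):
    """Equality as identity in the value space: NaN equals itself."""
    if math.isnan(x) and math.isnan(y):
        return True
    # Python treats 0.0 and -0.0 as equal, matching the single zero here.
    return x == y

def restrict_max_inclusive(values, bound):
    """Keep the values permitted by a maxInclusive facet."""
    if math.isnan(bound):
        # Only NaN itself is comparable (equal) to a NaN bound.
        return [v for v in values if math.isnan(v)]
    # NaN <= bound is False, so NaN is excluded by any other bound.
    return [v for v in values if v <= bound]

nan = float("nan")
ieee_equal = nan == nan
identity_equal = schema_equal(nan, nan)
```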

A literal in the ·lexical space· representing a
decimal number d maps to the normalized value
in the ·value space· of float that is
closest to d in the sense defined by
[Clinger, WD (1990)]; if d is
exactly halfway between two such values then the even value is chosen.

3.2.4.1 Lexical representation

float values have a lexical representation
consisting of a mantissa followed, optionally, by the character
"E" or "e", followed by an exponent. The exponent ·must·
be an integer. The mantissa must be a decimal number. The representations
for exponent and mantissa must follow the lexical rules for
integer and decimal. If the "E" or "e" and
the following exponent are omitted, an exponent value of 0 is assumed.

The special values positive and negative infinity and not-a-number
have lexical representations INF, -INF and NaN, respectively.
Lexical representations for zero may take a positive or negative sign.

For example, -1E4, 1267.43233E12, 12.78e-2, 12, -0, 0
and INF are all legal literals for float.
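
A regular expression that follows this description is sketched below. The
pattern and the name is_float_literal are assumptions made for
illustration. A direct call to a general-purpose parser would not be
equivalent, since such parsers commonly accept spellings (for example
"inf" or "nan") that are not in this lexical space.

```python
import re

_FLOAT = re.compile(
    r"(?:[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)"  # mantissa: a decimal
    r"(?:[Ee][+-]?[0-9]+)?"                       # optional exponent: an integer
    r"|-?INF|NaN)"                                # special values
)

def is_float_literal(text):
    return _FLOAT.fullmatch(text) is not None

examples = ["-1E4", "1267.43233E12", "12.78e-2", "12", "-0", "0", "INF"]
all_valid = all(is_float_literal(e) for e in examples)
```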

3.2.4.2 Canonical representation

The canonical representation for float is defined by
prohibiting certain options from the
Lexical representation (§3.2.4.1). Specifically, the exponent
must be indicated by "E". Leading zeroes and the preceding optional "+" sign
are prohibited in the exponent.
If the exponent is zero, it must be indicated by "E0".
For the mantissa, the preceding optional "+" sign is prohibited
and the decimal point is required.
Leading and trailing zeroes are prohibited subject to the following:
number representations must be normalized such that there is a single
digit which is non-zero to the left of the decimal point and at least a
single digit to the right of the decimal point unless the value being
represented is zero. The canonical representation for zero is 0.0E0.
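
Whether a literal has the canonical form can be tested with a pattern such
as the one sketched below. The pattern and the name is_canonical_float_form
are assumptions; the check concerns the form of the literal only and does
not verify that the literal denotes a value of the datatype.

```python
import re

_CANONICAL = re.compile(
    r"(?:-?[1-9]\.(?:0|[0-9]*[1-9])"  # one non-zero digit, point, no trailing zeroes
    r"E(?:0|-?[1-9][0-9]*)"           # "E", no "+" and no leading zeroes
    r"|0\.0E0|-?INF|NaN)"             # zero and the special values
)

def is_canonical_float_form(literal):
    return _CANONICAL.fullmatch(literal) is not None
```

Under this pattern 1.0E2 is in canonical form, whereas 100, 1.0e2, 1.0E+2,
1.0E02 and 1.50E0 are not.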

3.2.4.3 Constraining facets

3.2.5 double

[Definition:] The double datatype is patterned after the
IEEE double-precision 64-bit floating point
type [IEEE 754-1985]. The basic ·value space·
of double consists of the values
m × 2^e, where m
is an integer whose absolute value is less than
2^53, and e is an
integer between -1074 and 971, inclusive. In addition to the basic
·value space· described above, the
·value space· of double also contains
the following three special values: positive and negative infinity and
not-a-number (NaN).
The ·order-relation· on double
is: x < y iff y - x is positive
for x and y in the value space.
Positive infinity is greater than all other non-NaN values.
NaN equals itself but is ·incomparable· with (neither greater than nor less than)
any other value in the ·value space·.

Note:
"Equality" in this Recommendation is defined to be "identity" (i.e., values that
are identical in the ·value space· are equal and vice versa).
Identity must be used for the few operations that are defined in this Recommendation.
Applications using any of the datatypes defined in this Recommendation may use different
definitions of equality for computational purposes; [IEEE 754-1985]-based computation systems
are examples. Nothing in this Recommendation should be construed as requiring that
such applications use identity as their equality relationship when computing.

Any value ·incomparable· with the value used for the four bounding facets
(·minInclusive·, ·maxInclusive·,
·minExclusive·, and ·maxExclusive·) will be
excluded from the resulting restricted ·value space·. In particular,
when "NaN" is used as a facet value for a bounding facet, since no other
double values are ·comparable· with it, the result is a ·value space·
either having NaN as its only member (the inclusive cases) or that is empty
(the exclusive cases). If any other value is used for a bounding facet,
NaN will be excluded from the resulting restricted ·value space·;
to add NaN back in requires union with the NaN-only space.

This datatype differs from that of [IEEE 754-1985] in that there is only one
NaN and only one zero. This makes the equality and ordering of values in the data
space differ from that of [IEEE 754-1985] only in that for schema purposes NaN = NaN.

3.2.5.1 Lexical representation

double values have a lexical representation
consisting of a mantissa followed, optionally, by the character "E" or
"e", followed by an exponent. The exponent ·must· be
an integer. The mantissa must be a decimal number. The representations
for exponent and mantissa must follow the lexical rules for
integer and decimal. If the "E" or "e"
and the following exponent are omitted, an exponent value of 0 is assumed.

The special values positive and negative infinity and not-a-number
have lexical representations INF, -INF and NaN, respectively.
Lexical representations for zero may take a positive or negative sign.

For example, -1E4, 1267.43233E12, 12.78e-2, 12, -0, 0
and INF are all legal literals for double.

3.2.5.2 Canonical representation

The canonical representation for double is defined by
prohibiting certain options from the
Lexical representation (§3.2.5.1). Specifically, the exponent
must be indicated by "E". Leading zeroes and the preceding optional "+" sign
are prohibited in the exponent.
If the exponent is zero, it must be indicated by "E0".
For the mantissa, the preceding optional "+" sign is prohibited
and the decimal point is required.
Leading and trailing zeroes are prohibited subject to the following:
number representations must be normalized such that there is a single
digit which is non-zero to the left of the decimal point and at least a
single digit to the right of the decimal point unless the value being
represented is zero. The canonical representation for zero is 0.0E0.
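
One way to produce a literal of this form from an in-memory
double-precision value is sketched below. The function name
canonical_double is hypothetical, and the choice of digits relies on
Python's shortest round-tripping repr, which is an implementation choice
of this sketch and not something prescribed here.

```python
import math
from decimal import Decimal

def canonical_double(value):
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    if value == 0:
        return "0.0E0"  # a single zero, without sign
    sign, digits, exponent = Decimal(repr(value)).as_tuple()
    digits = list(digits)
    # Drop trailing zero digits, adjusting the power of ten accordingly.
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()
        exponent += 1
    # Normalize to one non-zero digit left of the decimal point.
    scientific_exponent = exponent + len(digits) - 1
    fraction = "".join(str(d) for d in digits[1:]) or "0"
    prefix = "-" if sign else ""
    return f"{prefix}{digits[0]}.{fraction}E{scientific_exponent}"
```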

3.2.5.3 Constraining facets

3.2.6 duration

[Definition:] duration represents a duration of time.
The ·value space· of duration is
a six-dimensional space where the coordinates
designate the Gregorian year, month, day, hour, minute, and second components defined in
§ 5.5.3.2 of [ISO 8601],
respectively. These components are ordered
in their significance by their order of appearance i.e. as year, month, day,
hour, minute, and second.

Note:

All ·minimally conforming· processors ·must·
support year values with a minimum of 4 digits (i.e., YYYY) and a minimum fractional second precision of milliseconds or three decimal digits (i.e. s.sss). However,
·minimally conforming· processors ·may·
set an application-defined limit on the maximum number of digits
they are prepared to support in these two cases, in which case that application-defined
maximum number ·must· be clearly documented.

3.2.6.1 Lexical representation

The lexical representation for duration is the
[ISO 8601] extended format PnYnMnDTnHnMnS, where
nY represents the number of years, nM the
number of months, nD the number of days, 'T' is the
date/time separator, nH the number of hours,
nM the number of minutes and nS the
number of seconds. The number of seconds can include decimal digits
to arbitrary precision.

The values of the
Year, Month, Day, Hour and Minutes components are not restricted but
allow an arbitrary unsigned integer, i.e., an integer that
conforms to the pattern [0-9]+.
Similarly, the value of the Seconds component
allows an arbitrary unsigned decimal.
Following [ISO 8601], at least one digit must
follow the decimal point if it appears. That is, the value of the Seconds component
must conform to the pattern [0-9]+(\.[0-9]+)?.
Thus, the lexical representation of
duration does not follow the alternative
format of § 5.5.3.2.1 of [ISO 8601].

An optional preceding minus sign ('-') is
allowed, to indicate a negative duration. If the sign is omitted a
positive duration is indicated. See also ISO 8601 Date and Time Formats (§D).

For example, to indicate a duration of 1 year, 2 months, 3 days, 10
hours, and 30 minutes, one would write: P1Y2M3DT10H30M.
One could also indicate a duration of minus 120 days as:
-P120D.

Reduced precision and truncated representations of this format are allowed
provided they conform to the following:

If the number of years, months, days, hours, minutes, or seconds in any
expression equals zero, the number and its corresponding designator ·may·
be omitted. However, at least one number and its designator ·must·
be present.

The designator 'T' must
be absent if and only if all of the time items are absent.
The designator 'P' must always be present.

For example, P1347Y, P1347M and P1Y2MT2H are all allowed;
P0Y1347M and P0Y1347M0D are allowed. P-1347M is not allowed although
-P1347M is allowed. P1Y2MT is not allowed.
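
The rules of this section can be combined into a small recognizer, sketched
below. The name parse_duration and the shape of the returned tuple are
hypothetical; the regular expression encodes the component patterns given
above, and the two additional checks enforce that at least one component is
present and that 'T' appears only together with a time item.

```python
import re
from decimal import Decimal

_DURATION = re.compile(
    r"(-?)P(?:([0-9]+)Y)?(?:([0-9]+)M)?(?:([0-9]+)D)?"
    r"(T(?:([0-9]+)H)?(?:([0-9]+)M)?(?:([0-9]+(?:\.[0-9]+)?)S)?)?"
)

def parse_duration(literal):
    """Return (negative, years, months, days, hours, minutes, seconds) or None."""
    m = _DURATION.fullmatch(literal)
    if m is None:
        return None
    sign, years, months, days, t, hours, minutes, seconds = m.groups()
    time_items = (hours, minutes, seconds)
    # At least one number and its designator must be present.
    if all(item is None for item in (years, months, days) + time_items):
        return None
    # 'T' must be absent if all of the time items are absent.
    if t is not None and all(item is None for item in time_items):
        return None
    return (
        sign == "-",
        int(years or 0), int(months or 0), int(days or 0),
        int(hours or 0), int(minutes or 0), Decimal(seconds or 0),
    )
```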

3.2.6.2 Order relation on duration

In general, the ·order-relation· on duration
is a partial order since there is no determinate relationship between certain
durations such as one month (P1M) and 30 days (P30D).
The ·order-relation·
of two duration values x and
y is x < y iff s+x < s+y
for each qualified dateTime s
in the list below. These values for s cause the greatest deviations in the addition of
dateTimes and durations. Addition of durations to time instants is defined
in Adding durations to dateTimes (§E).

1696-09-01T00:00:00Z

1697-02-01T00:00:00Z

1903-03-01T00:00:00Z

1903-07-01T00:00:00Z

The following table shows the strongest relationship that can be determined
between example durations. The symbol <> means that the order relation is
indeterminate. Note that because of leap-seconds, a seconds field can vary
from 59 to 60. However, because of the way that addition is defined in
Adding durations to dateTimes (§E), they are still totally ordered.

Relation

P1Y    > P364D    <> P365D    <> P366D    < P367D
P1M    > P27D     <> P28D     <> P29D     <> P30D     <> P31D     < P32D
P5M    > P149D    <> P150D    <> P151D    <> P152D    <> P153D    < P154D
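
The comparison against the four reference dateTimes can be sketched as
follows. For brevity the sketch represents a duration as a pair (months,
seconds) and adds it to a start instant by first advancing the month and
then adding the seconds; this shortcut is an assumption that holds only
because every reference instant falls on the first day of a month, and it
is not a substitute for the algorithm in Adding durations to dateTimes
(§E). The names add_duration and compare_durations are hypothetical.

```python
from datetime import datetime, timedelta

REFERENCE_INSTANTS = [
    datetime(1696, 9, 1), datetime(1697, 2, 1),
    datetime(1903, 3, 1), datetime(1903, 7, 1),
]
DAY = 86400  # seconds

def add_duration(start, months, seconds):
    # Advance the year/month fields, then add the remaining seconds.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    return start.replace(year=year, month=month_index + 1) + timedelta(seconds=seconds)

def compare_durations(a, b):
    """Return '<', '>', '=' or '<>' (indeterminate) for durations a and b."""
    outcomes = set()
    for s in REFERENCE_INSTANTS:
        end_a, end_b = add_duration(s, *a), add_duration(s, *b)
        outcomes.add("<" if end_a < end_b else ">" if end_a > end_b else "=")
    # The relation is determinate only if every reference instant agrees.
    return outcomes.pop() if len(outcomes) == 1 else "<>"

one_month_vs_30_days = compare_durations((1, 0), (0, 30 * DAY))
```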

Implementations are free to optimize the computation of the ordering relationship. For example, the following table can be used to
compare durations of a small number of months against days.

Months          1    2    3    4    5    6    7    8    9   10   11   12   13  ...
Days Minimum   28   59   89  120  150  181  212  242  273  303  334  365  393  ...
Days Maximum   31   62   92  123  153  184  215  245  276  306  337  366  397  ...
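
The minimum and maximum day counts in the table can be recomputed by
measuring a span of n months from every first-of-month start date over a
full 400-year Gregorian cycle. The choice of the years 1600 to 1999 and the
names days_spanned and day_bounds are assumptions of this sketch.

```python
from datetime import date

def days_spanned(start, months):
    # Number of days between start and the same day-of-month n months later.
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    return (date(year, month_index + 1, 1) - start).days

def day_bounds(months):
    spans = [
        days_spanned(date(year, month, 1), months)
        for year in range(1600, 2000)
        for month in range(1, 13)
    ]
    return min(spans), max(spans)

table = {n: day_bounds(n) for n in range(1, 14)}
```

A duration of n months is then less than a duration of d days if d exceeds
the maximum for n, greater if d is below the minimum, and indeterminate
otherwise.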

3.2.6.3 Facet Comparison for durations

3.2.6.4 Totally ordered durations

Certain derived datatypes of durations can be guaranteed to have a total
order. For this, they must have fields from only one row in the list below
and the time zone must either be required or prohibited.

year, month

day, hour, minute, second

For example, a datatype could be defined to correspond to the
[SQL] datatype Year-Month interval that required a four digit
year field and a two digit month field but required all other fields to be unspecified. This datatype could be defined as below and would have a total order.

3.2.6.5 Constraining facets

3.2.7 dateTime

[Definition:] dateTime values may be viewed as objects with integer-valued
year, month, day, hour and minute properties, a decimal-valued second property,
and a boolean timezoned property.
Each such object also has one decimal-valued
method or computed property, timeOnTimeline, whose value is always a decimal
number; the values are dimensioned in seconds, the integer 0 is
0001-01-01T00:00:00 and the value of timeOnTimeline for other dateTime
values is computed using the Gregorian algorithm as modified for leap-seconds.
The timeOnTimeline values form two related "timelines", one for timezoned
values and one for non-timezoned values.
Each timeline is a copy of the ·value space· of decimal,
with integers given units of seconds.

The ·value space· of
dateTime is closely related to the dates and times described in ISO 8601.
For clarity, the text above specifies a particular origin point for the
timeline.
It should be noted, however, that schema processors need not expose the
timeOnTimeline value to schema users, and there is no requirement that a
timeline-based implementation use the particular origin described here in
its internal representation.
Other interpretations of the ·value space· which lead to the
same results (i.e., are isomorphic) are of course acceptable.
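
A computation of timeOnTimeline relative to the origin described above can
be sketched with Python's datetime type. The sketch is subject to stated
limitations: it ignores leap seconds, it is restricted to the years 1
through 9999 and to microsecond precision, and it does not distinguish the
timezoned from the non-timezoned timeline. The function name
time_on_timeline is hypothetical.

```python
from datetime import datetime
from decimal import Decimal

ORIGIN = datetime(1, 1, 1)  # 0001-01-01T00:00:00 corresponds to 0

def time_on_timeline(value):
    """Seconds elapsed since the origin, as an exact decimal number."""
    delta = value - ORIGIN
    return (
        Decimal(delta.days) * 86400
        + Decimal(delta.seconds)
        + Decimal(delta.microseconds) / Decimal(1_000_000)
    )
```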

All timezoned times are Coordinated Universal Time (UTC, sometimes called
"Greenwich Mean Time"). Other timezones indicated in lexical representations
are converted to UTC during conversion of literals to values.
"Local" or untimezoned times are presumed to be the time in the timezone of some
unspecified locality as prescribed by the appropriate legal authority;
currently there are no legally prescribed timezones which are durations
whose magnitude is greater than 14 hours. The value of each numeric-valued property
(other than timeOnTimeline) is limited to the maximum value within the interval
determined by the next-higher property. For example, the day value can never be 32,
and cannot even be 29 for month 02 and year 2002 (February 2002).

Note:

The date and time datatypes described in this recommendation were inspired
by [ISO 8601]. '0001' is the lexical representation of the year 1 of the
Common Era (1 CE, sometimes written "AD 1" or "1 AD"). There is no year 0,
and '0000' is not a valid lexical representation. '-0001' is the lexical
representation of the year 1 Before Common Era (1 BCE, sometimes written
"1 BC").

Those using this (1.0) version of this Recommendation to
represent negative years should be aware that the interpretation of lexical
representations beginning with a '-' is likely to change in
subsequent versions.

[ISO 8601]
makes no mention of the year 0; in [ISO 8601:1998 Draft Revision]
the form '0000' was disallowed and this recommendation disallows it as well.
However, [ISO 8601:2000 Second Edition], which became available just as we were completing version
1.0, allows the form '0000', representing the year 1 BCE. A number of external commentators
have also suggested that '0000' be
allowed, as the lexical representation for 1 BCE, which is the normal usage in
astronomical contexts.
It is the intention of the XML Schema
Working Group to allow '0000' as a lexical representation in the
dateTime, date, gYear, and
gYearMonth datatypes in a subsequent version
of this Recommendation. '0000' will be the lexical representation of 1
BCE (which is a leap year), '-0001' will become the lexical representation of 2
BCE (not 1 BCE as in this (1.0) version), '-0002' of 3 BCE, etc.

Note: See the conformance note in (§3.2.6) which
applies to this datatype as well.

3.2.7.1 Lexical representation

The ·lexical space· of dateTime consists of finite-length
sequences of characters of the form:
'-'? yyyy '-' mm '-' dd 'T' hh ':' mm ':' ss ('.' s+)? (zzzzzz)?,
where

'-'? yyyy is a four-or-more digit optionally negative-signed
numeral that represents the year; if more than four digits, leading zeros
are prohibited, and '0000' is prohibited (see the Note above (§3.2.7); also note that a plus sign is not permitted);

the remaining '-'s are separators between parts of the date portion;

the first mm is a two-digit numeral that represents the month;

dd is a two-digit numeral that represents the day;

'T' is a separator indicating that time-of-day follows;

hh is a two-digit numeral that represents the hour; '24' is permitted if the
minutes and seconds represented are zero, and the dateTime value so
represented is the first instant of the following day (the hour property of a
dateTime object in the ·value space· cannot have
a value greater than 23);

':' is a separator between parts of the time-of-day portion;

the second mm is a two-digit numeral that represents the minute;

ss is a two-integer-digit numeral that represents the
whole seconds;

'.' s+ (if present) represents the
fractional seconds;

zzzzzz (if present) represents the timezone (as described below).

For example, 2002-10-10T12:00:00-05:00 (noon on 10 October 2002, Central Daylight
Savings Time as well as Eastern Standard Time in the U.S.) is 2002-10-10T17:00:00Z,
five hours later than 2002-10-10T12:00:00Z.
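
As an informal illustration of the lexical rules above, the following Python sketch parses a dateTime literal and normalizes timezoned values to UTC. The names parse_datetime and to_utc are hypothetical, and the sketch makes simplifying assumptions: only four-digit non-negative years are handled, fractional seconds are truncated to microseconds, and the 14-hour limit on timezones is not checked.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified pattern: four-digit years only (an assumption of this sketch).
DATETIME_RE = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})?",
    re.ASCII,
)

def parse_datetime(lexical):
    """Map a dateTime literal to a Python datetime (aware if timezoned)."""
    m = DATETIME_RE.fullmatch(lexical)
    if m is None:
        raise ValueError(f"not a dateTime literal: {lexical!r}")
    year, month, day, hour, minute, second = (int(g) for g in m.groups()[:6])
    fraction, zone = m.group(7) or "", m.group(8)
    roll_over = hour == 24
    if roll_over:
        # '24' is permitted only when minutes and seconds are zero.
        if minute or second or fraction.strip(".0"):
            raise ValueError("hour 24 requires zero minutes and seconds")
        hour = 0
    micro = int(fraction[1:7].ljust(6, "0"))
    tz = None
    if zone == "Z":
        tz = timezone.utc
    elif zone:
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
        tz = timezone(-offset if zone[0] == "-" else offset)
    # datetime() itself rejects impossible dates such as 2002-02-29.
    value = datetime(year, month, day, hour, minute, second, micro, tzinfo=tz)
    # 24:00:00 denotes the first instant of the following day.
    return value + timedelta(days=1) if roll_over else value

def to_utc(lexical):
    """Return the UTC instant of a timezoned dateTime literal."""
    value = parse_datetime(lexical)
    if value.tzinfo is None:
        raise ValueError("untimezoned values have no determinate UTC instant")
    return value.astimezone(timezone.utc)
```

With these helpers, the literal 2002-10-10T12:00:00-05:00 and the literal 2002-10-10T17:00:00Z map to the same instant.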

3.2.7.2 Canonical representation

Except for trailing fractional zero digits in the seconds representation,
'24:00:00' time representations, and timezone (for timezoned values), the mapping
from literals to values is one-to-one. Where there is more than
one possible representation, the canonical representation is as follows:

The 2-digit numeral representing the hour must not be '24';

The fractional second string, if present, must not end in '0';

For timezoned values, the timezone must be
represented with 'Z'
(all timezoned dateTime values are UTC).
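
The three canonicalization rules can be sketched in Python as follows. This is an illustration only: canonical_datetime is a hypothetical name, the input is assumed to be a Python datetime (naive for untimezoned values, aware for timezoned ones), and precision is limited to microseconds.

```python
from datetime import datetime, timedelta, timezone

def canonical_datetime(value):
    """Render a Python datetime as a canonical dateTime literal."""
    zoned = value.tzinfo is not None
    if zoned:
        # Timezoned values are always expressed in UTC.
        value = value.astimezone(timezone.utc)
    # A Python datetime never carries hour 24, so that rule holds automatically.
    text = (f"{value.year:04d}-{value.month:02d}-{value.day:02d}"
            f"T{value.hour:02d}:{value.minute:02d}:{value.second:02d}")
    if value.microsecond:
        # The fractional part must not end in '0'.
        text += "." + f"{value.microsecond:06d}".rstrip("0")
    return text + ("Z" if zoned else "")
```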

3.2.7.3 Timezones

Timezones are durations with (integer-valued) hour and minute properties
(with the hour magnitude limited to at most 14, and the minute magnitude
limited to at most 59, except that if the hour magnitude is 14, the minute
value must be 0); they may be both positive or both negative.

The lexical representation of a timezone is a string of the form:
(('+' | '-') hh ':' mm) | 'Z',
where

hh is a two-digit numeral (with leading zeros as required) that
represents the hours,

mm is a two-digit numeral that represents the minutes,

'+' indicates a nonnegative duration,

'-' indicates a nonpositive duration.

The mapping so defined is one-to-one, except that '+00:00', '-00:00', and 'Z'
all represent the same zero-length duration timezone, UTC; 'Z' is its canonical
representation.

When a timezone is added to a UTC dateTime, the result is the date
and time "in that timezone". For example, 2002-10-10T12:00:00+05:00 is
2002-10-10T07:00:00Z and 2002-10-10T00:00:00+05:00 is 2002-10-09T19:00:00Z.
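
The timezone rules above can be illustrated with a short Python sketch. The function names parse_timezone and format_timezone are hypothetical, and timezones are modelled as timedelta values, which is an assumption of this sketch rather than anything this specification prescribes.

```python
import re
from datetime import datetime, timedelta

TIMEZONE_RE = re.compile(r"([+-])(\d{2}):(\d{2})", re.ASCII)

def parse_timezone(lexical):
    """Map a timezone literal to a signed duration."""
    if lexical == "Z":
        return timedelta(0)
    m = TIMEZONE_RE.fullmatch(lexical)
    if m is None:
        raise ValueError(f"not a timezone literal: {lexical!r}")
    hours, minutes = int(m.group(2)), int(m.group(3))
    # Hour magnitude at most 14, minutes at most 59, and 14 only with 00.
    if hours > 14 or minutes > 59 or (hours == 14 and minutes != 0):
        raise ValueError(f"timezone out of range: {lexical!r}")
    offset = timedelta(hours=hours, minutes=minutes)
    return -offset if m.group(1) == "-" else offset

def format_timezone(offset):
    """Render a duration as a timezone literal, using 'Z' for zero."""
    if offset == timedelta(0):
        return "Z"
    sign = "-" if offset < timedelta(0) else "+"
    minutes = abs(offset) // timedelta(minutes=1)
    return f"{sign}{minutes // 60:02d}:{minutes % 60:02d}"
```

Adding the parsed timezone to the UTC value reproduces the local reading: datetime(2002, 10, 10, 7) + parse_timezone("+05:00") gives noon on the same day.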

3.2.7.4 Order relation on dateTime

dateTime value objects on either timeline are totally ordered by their timeOnTimeline
values; between the two timelines, dateTime value objects are ordered by their
timeOnTimeline values when their timeOnTimeline values differ by more than
fourteen hours, with those whose difference is a duration of 14 hours or less
being ·incomparable·.

In general, the ·order-relation· on dateTime
is a partial order since there is no determinate relationship between certain
instants. For example, there is no determinate ordering between
(a) 2000-01-20T12:00:00 and (b) 2000-01-20T12:00:00Z. Based on
timezones currently in use, (a) could vary from 2000-01-20T12:00:00+12:00 to
2000-01-20T12:00:00-13:00. It is, however, possible for this range to expand or
contract in the future, based on local laws. Because of this, the following
definition uses a somewhat broader range of indeterminate values: +14:00..-14:00.

The following definition uses the notation S[year] to represent the year
field of S, S[month] to represent the month field, and so on. The notation (Q
& "-14:00") means adding the timezone -14:00 to Q, where Q did not
already have a timezone. This is a logical explanation of the process. Actual
implementations are free to optimize as long as they produce the same results.

The ordering between two dateTimes P and Q is defined by the following
algorithm:

A. Normalize P and Q. That is, if there is a time zone present, but
it is not Z, convert it to Z using the addition operation defined in
Adding durations to dateTimes (§E). Thus 2000-03-04T23:00:00+03:00
normalizes to 2000-03-04T20:00:00Z.

B. If P and Q either both have a time zone or both do not have a time
zone, compare P and Q field by field from the year field down to the
second field, and return a result as soon as it can be determined. That is:

For each i in {year, month, day, hour, minute, second}

If P[i] and Q[i] are both not specified, continue to the next i

If P[i] is not specified and Q[i] is, or vice versa, stop and return
P <> Q

If P[i] < Q[i], stop and return P < Q

If P[i] > Q[i], stop and return P > Q

Stop and return P = Q

C. Otherwise, if P contains a time zone and Q does not, compare
as follows:

P < Q if P < (Q with time zone +14:00)

P > Q if P > (Q with time zone -14:00)

P <> Q otherwise, that is, if (Q with time zone +14:00) < P < (Q with time zone -14:00)

D. Otherwise, if P does not contain a time zone and Q does, compare
as follows:

P < Q if (P with time zone -14:00) < Q.

P > Q if (P with time zone +14:00) > Q.

P <> Q otherwise, that is, if (P with time zone +14:00) < Q < (P with time zone -14:00)

Examples:

Determinate:

2000-01-15T00:00:00 < 2000-02-15T00:00:00

2000-01-15T12:00:00 < 2000-01-16T12:00:00Z

Indeterminate:

2000-01-01T12:00:00 <> 1999-12-31T23:00:00Z

2000-01-16T12:00:00 <> 2000-01-16T12:00:00Z

2000-01-16T00:00:00 <> 2000-01-16T12:00:00Z
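
The comparison algorithm can be sketched in Python for the case where all fields are specified, as they are for dateTime itself. This is an informal illustration, not a normative implementation: compare_datetimes is a hypothetical name, untimezoned values are modelled as naive datetime objects and timezoned values as aware ones, and the field-by-field comparison is delegated to Python's own datetime ordering, which compares aware values by their UTC instant.

```python
from datetime import datetime, timedelta, timezone

FOURTEEN_HOURS = timedelta(hours=14)

def _with_zone(value, offset):
    """Read an untimezoned value as if it carried the given time zone."""
    return value.replace(tzinfo=timezone(offset))

def compare_datetimes(p, q):
    """Return '<', '>', '=' or '<>' (indeterminate) for two datetimes."""
    p_zoned, q_zoned = p.tzinfo is not None, q.tzinfo is not None
    if p_zoned == q_zoned:
        # Both timezoned or both untimezoned: a total order applies.
        return "<" if p < q else ">" if p > q else "="
    if p_zoned:
        if p < _with_zone(q, FOURTEEN_HOURS):
            return "<"
        if p > _with_zone(q, -FOURTEEN_HOURS):
            return ">"
        return "<>"
    # p is untimezoned and q is timezoned.
    if _with_zone(p, -FOURTEEN_HOURS) < q:
        return "<"
    if _with_zone(p, FOURTEEN_HOURS) > q:
        return ">"
    return "<>"
```

Applied to the examples above, the sketch reports the determinate pairs as '<' and the indeterminate pairs as '<>'.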

3.2.7.5 Totally ordered dateTimes

Certain types derived from dateTime
can be guaranteed to have a total order. To
do so, they must require that a specific set of fields are always specified, and
that remaining fields (if any) are always unspecified. For example, the date
datatype without time zone is defined to contain exactly year, month, and day.
Thus dates without time zone have a total order among themselves.

3.2.7.6 Constraining facets

3.2.8 time

[Definition:] time
represents an instant of time that recurs every day. The
·value space· of time is the space
of time of day values as defined in § 5.3 of
[ISO 8601]. Specifically, it is a set of zero-duration daily
time instances.

Since the lexical representation allows an optional time zone
indicator, time values are partially ordered because it may
not be possible to determine the order of two values one of which has a
time zone and the other does not. The order relation on
time values is the
Order relation on dateTime (§3.2.7.4) using an arbitrary date. See also
Adding durations to dateTimes (§E). Pairs of time values with or without time zone indicators are totally ordered.

Note: See the conformance note in (§3.2.6) which
applies to the seconds part of this datatype as well.

3.2.8.1 Lexical representation

The lexical representation for time is the left
truncated lexical representation for dateTime:
hh:mm:ss.sss with optional following time zone indicator. For example,
to indicate 1:20 pm for Eastern Standard Time which is 5 hours behind
Coordinated Universal Time (UTC), one would write: 13:20:00-05:00. See also
ISO 8601 Date and Time Formats (§D).

3.2.8.2 Canonical representation

The canonical representation for time is defined
by prohibiting certain options from the
Lexical representation (§3.2.8.1). Specifically, either the time zone must
be omitted or, if present, the time zone must be Coordinated Universal
Time (UTC) indicated by a "Z".
Additionally, the canonical representation for midnight is 00:00:00.
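
A minimal Python sketch of this canonicalization follows. The name canonical_time is hypothetical, fractional seconds are omitted for brevity, and the time zone is passed as a timedelta (or None for untimezoned values); these are assumptions of the sketch. An arbitrary date is used internally so that the shift to UTC can wrap around midnight.

```python
from datetime import datetime, timedelta, timezone

def canonical_time(hour, minute, second, offset=None):
    """Render a time of day canonically: untimezoned as-is, timezoned in UTC."""
    moment = datetime(2000, 1, 1, hour, minute, second)  # arbitrary date
    if offset is None:
        return moment.strftime("%H:%M:%S")
    utc = moment.replace(tzinfo=timezone(offset)).astimezone(timezone.utc)
    return utc.strftime("%H:%M:%S") + "Z"
```

For example, 13:20:00-05:00 is rendered as 18:20:00Z.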

3.2.8.3 Constraining facets

3.2.9 date

[Definition:]
The ·value space· of date
consists of top-open intervals of exactly one day in length on the timelines of
dateTime, beginning on the beginning moment of each day (in
each timezone), i.e. '00:00:00', up to but not including '24:00:00' (which is
identical with '00:00:00' of the next day). For nontimezoned values, the top-open
intervals disjointly cover the nontimezoned timeline, one per day. For timezoned
values, the intervals begin at every minute and therefore overlap.

A "date object" is an object with year, month, and day properties just like those
of dateTime objects, plus an optional timezone-valued
timezone property. (As with values of dateTime, timezones are a
special case of durations.)
Just as a dateTime object corresponds to a point on one of the
timelines, a date object corresponds to an interval on one
of the two timelines as just described.

Timezoned date values track the starting moment of their day, as
determined by their timezone; said timezone is generally recoverable for
canonical representations.
[Definition:]
The recoverable timezone is that duration which
is the result of subtracting the first moment (or any moment) of the timezoned
date from the first moment (or the corresponding moment) UTC on the
same date. ·recoverable timezone·s are
always durations between '+12:00' and '-11:59'. This "timezone normalization"
(which follows automatically from the definition of the date ·value space·) is explained more in
Lexical representation (§3.2.9.1).

For example: the first moment of 2002-10-10+13:00 is 2002-10-10T00:00:00+13:00,
which is 2002-10-09T11:00:00Z, which is also the first moment of 2002-10-09-11:00.
Therefore 2002-10-10+13:00 is 2002-10-09-11:00; they are the same interval.

Note:
For most timezones, either the first moment or last moment of the day (a
dateTime value, always UTC) will have a date portion
different from that of the date itself!
However, noon of that date (the midpoint of the interval) in that
(normalized) timezone will always have the same date portion as the
date itself, even when that noon point in time is normalized to
UTC. For example, 2002-10-10+05:00 begins during 2002-10-09Z and 2002-10-10-05:00
ends during 2002-10-11Z, but noon of both 2002-10-10-05:00 and 2002-10-10+05:00
falls in the interval which is 2002-10-10Z.

Note: See the conformance note in (§3.2.6) which
applies to the year part of this datatype as well.

3.2.9.1 Lexical representation

For the following discussion, let the "date portion" of a dateTime
or date object be an object similar to a dateTime or
date object, with similar year, month, and day properties, but no
others, having the same value for these properties as the original
dateTime or date object.

The ·lexical space· of date consists of finite-length
sequences of characters of the form:
'-'? yyyy '-' mm '-' dd zzzzzz?
where the date and optional timezone are represented exactly the
same way as they are for dateTime. The first moment of the
interval is that represented by:
'-'? yyyy '-' mm '-' dd 'T00:00:00' zzzzzz?
and the least upper bound of the interval is the timeline point represented
(noncanonically) by:
'-'? yyyy '-' mm '-' dd 'T24:00:00' zzzzzz?.

Note:
The ·recoverable timezone· of a date will always be
a duration between '+12:00' and '-11:59'. Timezone lexical representations, as
explained for dateTime, can range from '+14:00' to '-14:00'.
The result is that literals of dates with very large or very
negative timezones will map to a "normalized" date value with a
·recoverable timezone· different from that represented in the original
representation, and a matching difference of +/- 1 day in the date itself.

3.2.9.2 Canonical representation

Given a member of the date ·value space·, the
date portion of the canonical representation (the entire representation
for nontimezoned values, and all but the timezone representation for timezoned values)
is always the date portion of the dateTime canonical
representation of the interval midpoint (the dateTime representation,
truncated on the right to eliminate 'T' and all following characters).
For timezoned values, append the canonical representation of the ·recoverable timezone·.
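
The midpoint rule and the ·recoverable timezone· can be illustrated with the following Python sketch. The name canonical_date is hypothetical; the date is passed as a datetime.date and the timezone as a timedelta (or None for nontimezoned values), both of which are assumptions of this sketch.

```python
from datetime import date, datetime, time, timedelta, timezone

def _format_offset(offset):
    """Render a duration as a timezone literal, using 'Z' for zero."""
    if offset == timedelta(0):
        return "Z"
    sign = "-" if offset < timedelta(0) else "+"
    minutes = abs(offset) // timedelta(minutes=1)
    return f"{sign}{minutes // 60:02d}:{minutes % 60:02d}"

def canonical_date(day, offset=None):
    """Return a canonical date literal for a date and optional timezone."""
    if offset is None:
        return day.isoformat()
    # First moment of the interval, in the date's own timezone.
    start = datetime.combine(day, time(0), tzinfo=timezone(offset))
    # Date portion of the interval midpoint (noon), normalized to UTC.
    midpoint_utc = (start + timedelta(hours=12)).astimezone(timezone.utc)
    normalized_day = midpoint_utc.date()
    # Recoverable timezone: first moment UTC on that date minus the
    # first moment of the timezoned date.
    utc_start = datetime.combine(normalized_day, time(0), tzinfo=timezone.utc)
    recoverable = utc_start - start
    return normalized_day.isoformat() + _format_offset(recoverable)
```

Under this sketch, 2002-10-10+13:00 is rendered as 2002-10-09-11:00, while 2002-10-10-05:00 is unchanged.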

3.2.10 gYearMonth

[Definition:] gYearMonth represents a
specific gregorian month in a specific gregorian year. The
·value space· of gYearMonth
is the set of Gregorian calendar months as defined in § 5.2.1 of
[ISO 8601]. Specifically, it is a set of one-month long,
non-periodic instances
e.g. 1999-10 to represent the whole month of 1999-10, independent of
how many days this month has.

Since the lexical representation allows an optional time zone
indicator, gYearMonth values are partially ordered because it may
not be possible to unequivocally determine the order of two values one of
which has a time zone and the other does not. If gYearMonth
values are considered as periods of time, the order relation on
gYearMonth values is the order relation on their starting instants.
This is discussed in Order relation on dateTime (§3.2.7.4). See also
Adding durations to dateTimes (§E). Pairs of gYearMonth
values with or without time zone indicators are totally ordered.

Note:
Because month/year combinations in one calendar only rarely correspond
to month/year combinations in other calendars, values of this type
are not, in general, convertible to simple values corresponding to month/year
combinations in other calendars. This type should therefore be used with caution
in contexts where conversion to other calendars is desired.

Note: See the conformance note in (§3.2.6) which
applies to the year part of this datatype as well.

3.2.10.1 Lexical representation

The lexical representation for gYearMonth is the reduced
(right truncated) lexical representation for dateTime:
CCYY-MM. No left truncation is allowed. An optional following time
zone qualifier is allowed. To accommodate year values outside the range from 0001 to 9999, additional digits
can be added to the left of this representation and a preceding "-" sign is allowed.
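
As an informal illustration, the lexical form just described can be checked with a regular expression. The pattern below is a sketch and not part of this specification: the names are hypothetical, and the year and timezone rules are carried over from the dateTime lexical representation (no '0000', no leading zeros beyond four digits, no plus sign, timezone magnitude at most 14:00).

```python
import re

# Year: optional '-', four or more digits, no '0000', no extra leading zeros.
_YEAR = r"-?(?!0000)(?:[1-9][0-9]{4,}|[0-9]{4})"
_MONTH = r"(?:0[1-9]|1[0-2])"
# Optional timezone: 'Z' or a signed hh:mm of magnitude at most 14:00.
_ZONE = r"(?:Z|[+-](?:(?:0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"

G_YEAR_MONTH = re.compile(_YEAR + "-" + _MONTH + _ZONE)

def is_g_year_month(lexical):
    """Return True if the string has the gYearMonth lexical form."""
    return G_YEAR_MONTH.fullmatch(lexical) is not None
```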

3.2.10.2 Constraining facets

3.2.11 gYear

[Definition:] gYear represents a
gregorian calendar year. The ·value space· of
gYear is the set of Gregorian calendar years as defined in
§ 5.2.1 of [ISO 8601]. Specifically, it is a set of one-year
long, non-periodic instances
e.g. lexical 1999 to represent the whole year 1999, independent of
how many months and days this year has.

Since the lexical representation allows an optional time zone
indicator, gYear values are partially ordered because it may
not be possible to unequivocally determine the order of two values one of which has a
time zone and the other does not. If
gYear values are considered as periods of time, the order relation
on gYear values is the order relation on their starting instants.
This is discussed in Order relation on dateTime (§3.2.7.4). See also
Adding durations to dateTimes (§E). Pairs of gYear values with or without time zone indicators are totally ordered.

Note:
Because years in one calendar only rarely correspond to years
in other calendars, values of this type
are not, in general, convertible to simple values corresponding to years
in other calendars. This type should therefore be used with caution
in contexts where conversion to other calendars is desired.

Note: See the conformance note in (§3.2.6) which
applies to the year part of this datatype as well.

3.2.11.1 Lexical representation

The lexical representation for gYear is the reduced (right
truncated) lexical representation for dateTime: CCYY.
No left truncation is allowed. An optional following time
zone qualifier is allowed as for dateTime. To
accommodate year values outside the range from 0001 to 9999, additional
digits can be added to the left of this representation and a preceding
"-" sign is allowed.

3.2.11.2 Constraining facets

3.2.12 gMonthDay

[Definition:] gMonthDay is a gregorian date that recurs, specifically a day of
the year such as the third of May. Arbitrary recurring dates are not
supported by this datatype. The ·value space· of
gMonthDay is the set of calendar
dates, as defined in § 3 of [ISO 8601]. Specifically,
it is a set of one-day long, annually periodic instances.

Since the lexical representation allows an optional time zone
indicator, gMonthDay values are partially ordered because it may
not be possible to unequivocally determine the order of two values one of which has a
time zone and the other does not. If
gMonthDay values are considered as periods of time,
in an arbitrary leap year, the order relation
on gMonthDay values is the order relation on their starting instants.
This is discussed in Order relation on dateTime (§3.2.7.4). See also
Adding durations to dateTimes (§E). Pairs of gMonthDay values with or without time zone indicators are totally ordered.

Note:
Because day/month combinations in one calendar only rarely correspond
to day/month combinations in other calendars, values of this type do not,
in general, have any straightforward or intuitive representation
in terms of most other calendars. This type should therefore be
used with caution in contexts where conversion to other calendars
is desired.

3.2.12.1 Lexical representation

The lexical representation for gMonthDay is the left
truncated lexical representation for date: --MM-DD.
An optional following time
zone qualifier is allowed as for date.
No preceding sign is allowed. No other formats are allowed. See also ISO 8601 Date and Time Formats (§D).

This datatype can be used to represent a specific day in a month,
for example, to say that my birthday occurs on the 14th of September every year.

3.2.12.2 Constraining facets

3.2.13 gDay

[Definition:] gDay is a gregorian day that recurs, specifically a day
of the month such as the 5th of the month. Arbitrary recurring days
are not supported by this datatype. The ·value space·
of gDay is the space of a set of calendar
dates as defined in § 3 of [ISO 8601]. Specifically,
it is a set of one-day long, monthly periodic instances.

This datatype can be used to represent a specific day of the month,
for example, to say that I get my paycheck on the 15th of each month.

Since the lexical representation allows an optional time zone
indicator, gDay values are partially ordered because it may
not be possible to unequivocally determine the order of two values one of
which has a time zone and the other does not. If
gDay values are considered as periods of time,
in an arbitrary month that has 31 days,
the order relation
on gDay values is the order relation on their starting instants.
This is discussed in Order relation on dateTime (§3.2.7.4). See also
Adding durations to dateTimes (§E). Pairs of gDay
values with or without time zone indicators are totally ordered.

Note:
Because days in one calendar only rarely correspond
to days in other calendars, values of this type do not,
in general, have any straightforward or intuitive representation
in terms of most other calendars. This type should therefore be
used with caution in contexts where conversion to other calendars
is desired.

3.2.13.1 Lexical representation

The lexical representation for gDay is the left
truncated lexical representation for date: ---DD.
An optional following time
zone qualifier is allowed as for date. No preceding sign is
allowed. No other formats are allowed. See also ISO 8601 Date and Time Formats (§D).

3.2.13.2 Constraining facets

3.2.14 gMonth

[Definition:] gMonth is a gregorian month that recurs every year.
The ·value space·
of gMonth is the space of a set of calendar
months as defined in § 3 of [ISO 8601]. Specifically,
it is a set of one-month long, yearly periodic instances.

This datatype can be used to represent a specific month,
for example, to say that Thanksgiving falls in the month of November.

Since the lexical representation allows an optional time zone
indicator, gMonth values are partially ordered because it may
not be possible to unequivocally determine the order of two values one of which has a
time zone and the other does not. If
gMonth values are considered as periods of time, the order relation
on gMonth is the order relation on their starting instants.
This is discussed in Order relation on dateTime (§3.2.7.4). See also
Adding durations to dateTimes (§E). Pairs of gMonth
values with or without time zone indicators are totally ordered.

Note:
Because months in one calendar only rarely correspond
to months in other calendars, values of this type do not,
in general, have any straightforward or intuitive representation
in terms of most other calendars. This type should therefore be
used with caution in contexts where conversion to other calendars
is desired.

3.2.14.1 Lexical representation

The lexical representation for gMonth is the left
and right truncated lexical representation for date: --MM.
An optional following time
zone qualifier is allowed as for date. No preceding sign is
allowed. No other formats are allowed. See also ISO 8601 Date and Time Formats (§D).
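
The three recurring forms (--MM-DD for gMonthDay, ---DD for gDay and --MM for gMonth) can be checked with the following Python sketch. The names are hypothetical, and one point is an assumption of the sketch: the day of a gMonthDay is checked against the length of its month in a leap year, so that --02-29 is accepted and --02-30 is rejected.

```python
import re

# Optional timezone: 'Z' or a signed hh:mm of magnitude at most 14:00.
_ZONE = r"(?:Z|[+-](?:(?:0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"

G_MONTH_DAY = re.compile(r"--([0-9]{2})-([0-9]{2})" + _ZONE)
G_DAY = re.compile(r"---(?:0[1-9]|[12][0-9]|3[01])" + _ZONE)
G_MONTH = re.compile(r"--(?:0[1-9]|1[0-2])" + _ZONE)

# Month lengths in a leap year (assumption: February allows day 29).
_MAX_DAY = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
            7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}

def is_g_month_day(lexical):
    m = G_MONTH_DAY.fullmatch(lexical)
    if m is None:
        return False
    month, day = int(m.group(1)), int(m.group(2))
    return month in _MAX_DAY and 1 <= day <= _MAX_DAY[month]

def is_g_day(lexical):
    return G_DAY.fullmatch(lexical) is not None

def is_g_month(lexical):
    return G_MONTH.fullmatch(lexical) is not None
```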

3.2.14.2 Constraining facets

3.2.15 hexBinary

[Definition:] hexBinary represents
arbitrary hex-encoded binary data. The ·value space· of
hexBinary is the set of finite-length sequences of binary
octets.

3.2.15.1 Lexical Representation

hexBinary has a lexical representation where
each binary octet is encoded as a character tuple, consisting of two
hexadecimal digits ([0-9a-fA-F]) representing the octet code. For example,
"0FB7" is a hex encoding for the 16-bit integer 4023
(whose binary representation is 111110110111).

3.2.15.2 Canonical Representation

The canonical representation for hexBinary is defined
by prohibiting certain options from the
Lexical Representation (§3.2.15.1). Specifically, the lower case
hexadecimal digits ([a-f]) are not allowed.
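
A minimal Python sketch of the mapping between lexical forms and octet sequences follows; the function names are hypothetical.

```python
import re

# Zero or more pairs of hexadecimal digits.
HEX_BINARY = re.compile(r"(?:[0-9a-fA-F]{2})*")

def decode_hex_binary(lexical):
    """Map a hexBinary literal to the octets it represents."""
    if HEX_BINARY.fullmatch(lexical) is None:
        raise ValueError(f"not a hexBinary literal: {lexical!r}")
    return bytes.fromhex(lexical)

def canonical_hex_binary(octets):
    """Render octets canonically, using upper case hexadecimal digits only."""
    return octets.hex().upper()
```

Decoding "0FB7" yields the two octets 0x0F and 0xB7, which read as a big-endian integer give 4023.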

3.2.15.3 Constraining facets

3.2.16 base64Binary

[Definition:] base64Binary
represents Base64-encoded arbitrary binary data. The ·value space· of
base64Binary is the set of finite-length sequences of binary
octets. For base64Binary data the
entire binary stream is encoded using the Base64
Alphabet in
[RFC 2045].

The lexical forms of base64Binary values are limited to the 65 characters
of the Base64 Alphabet defined in [RFC 2045], i.e., a-z,
A-Z, 0-9, the plus sign (+), the forward slash (/) and the
equal sign (=), together with the characters defined in [XML 1.0 (Second Edition)] as white space.
No other characters are allowed.

For compatibility with older mail gateways, [RFC 2045] suggests that
base64 data should have lines limited to at most 76 characters in length. This
line-length limitation is not mandated in the lexical forms of base64Binary
data and must not be enforced by XML Schema processors.

The lexical space of base64Binary is given by the following grammar
(the notation is that used in [XML 1.0 (Second Edition)]); legal lexical forms must match
the Base64Binary production.

Note that this grammar requires the number of non-whitespace characters in the lexical
form to be a multiple of four, and for equals signs to appear only at the end of the
lexical form; strings which do not meet these constraints are not legal lexical forms
of base64Binary because they cannot successfully be decoded by base64
decoders.
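
The two constraints named in the preceding paragraph can be checked directly, as in the following Python sketch. It is an illustration under stated assumptions: decode_base64_binary is a hypothetical name, only the constraints described in the prose are enforced (alphabet, length a multiple of four, equals signs only at the end), and decoding is delegated to the Python standard library.

```python
import base64
import re

# Base64 alphabet characters followed by at most two padding characters.
_B64_BODY = re.compile(r"[A-Za-z0-9+/]*={0,2}")

def decode_base64_binary(lexical):
    """Decode a base64Binary literal after removing XML white space."""
    compact = re.sub(r"[ \t\r\n]", "", lexical)
    if len(compact) % 4 != 0 or _B64_BODY.fullmatch(compact) is None:
        raise ValueError(f"not a base64Binary literal: {lexical!r}")
    return base64.b64decode(compact, validate=True)
```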

Note: The above definition of the lexical space is more restrictive than that
given in [RFC 2045] as regards whitespace -- this is not an issue
in practice. Any string compatible with the RFC can occur in
an element or attribute validated by this type, because the ·whiteSpace· facet of this type is fixed
to collapse, which means that all leading and trailing whitespace
will be stripped, and all internal whitespace collapsed to single space
characters, before the above grammar is enforced.

The canonical lexical form of a base64Binary data value is the base64
encoding of the value which matches the Canonical-base64Binary production in the following
grammar:

Note: For some values the canonical form defined above does not conform to
[RFC 2045], which requires
breaking with linefeeds at appropriate intervals.

The length of a base64Binary value is the number of octets it contains.
This may be calculated from the lexical form by removing whitespace and padding characters
and performing the calculation shown in the pseudo-code below:
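
The calculation can be sketched in Python as follows. The name base64_binary_length is hypothetical, and the input is assumed to be a legal lexical form; the arithmetic rests on the fact that every four base64 characters encode three octets.

```python
import re

def base64_binary_length(lexical):
    """Return the number of octets encoded by a base64Binary literal."""
    compact = re.sub(r"[ \t\r\n]", "", lexical)   # remove white space
    data_chars = compact.rstrip("=")              # strip padding at the end
    return len(data_chars) * 3 // 4               # floor(n * 3 / 4)
```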

Note on encoding: [RFC 2045] explicitly references US-ASCII encoding. However,
decoding of base64Binary data in an XML entity is to be performed on the
Unicode characters obtained after character encoding processing as specified by
[XML 1.0 (Second Edition)].

3.2.16.1 Constraining facets

3.2.17 anyURI

[Definition:] anyURI represents a Uniform Resource Identifier Reference
(URI). An anyURI value can be absolute or relative, and may
have an optional fragment identifier (i.e., it may be a URI Reference). This
type should be used to specify the intention that the value fulfills
the role of a URI as defined by [RFC 2396], as amended by
[RFC 2732].

Note:
Section 5.4 Locator Attribute
of [XML Linking Language] requires that relative URI references be absolutized
as defined in [XML Base] before use. This is an XLink-specific
requirement and is not appropriate for XML Schema, since neither the
·lexical space· nor the ·value space·
of the anyURI type is restricted to absolute URIs. Accordingly,
absolutization must not be performed by schema processors as part of schema
validation.

Note:
Each URI scheme imposes specialized syntax rules for URIs in
that scheme, including restrictions on the syntax of allowed
fragment identifiers. Because it is
impractical for processors to check that a value is a
context-appropriate URI reference, this specification follows the
lead of [RFC 2396] (as amended by [RFC 2732])
in this matter: such rules and restrictions are not part of type validity
and are not checked by ·minimally conforming· processors.
Thus in practice the above definition imposes only very modest obligations
on ·minimally conforming· processors.

3.3.1 normalizedString

[Definition:] normalizedString
represents white space normalized strings.
The ·value space· of normalizedString is the
set of strings that do not
contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters.
The ·lexical space· of normalizedString is the
set of strings that do not
contain the carriage return (#xD),
line feed (#xA)
nor tab (#x9) characters.
The ·base type· of normalizedString is string.

3.3.1.2 Derived datatypes

3.3.2 token

[Definition:] token
represents tokenized strings.
The ·value space· of token is the
set of strings that do not
contain the
carriage return (#xD),
line feed (#xA) nor tab (#x9) characters, that have no
leading or trailing spaces (#x20) and that have no internal sequences
of two or more spaces.
The ·lexical space· of token is the
set of strings that do not contain the
carriage return (#xD),
line feed (#xA) nor tab (#x9) characters, that have no
leading or trailing spaces (#x20) and that have no internal sequences
of two or more spaces.
The ·base type· of token is normalizedString.
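
The two value spaces just described can be expressed as simple membership tests. The following Python sketch is illustrative only, and the function names are hypothetical.

```python
def is_normalized_string(value):
    """True if the string contains no carriage return, line feed or tab."""
    return not any(ch in value for ch in "\r\n\t")

def is_token(value):
    """True if the string is normalized, has no leading or trailing
    spaces, and has no internal run of two or more spaces."""
    return (is_normalized_string(value)
            and not value.startswith(" ")
            and not value.endswith(" ")
            and "  " not in value)
```

Every token is also a normalizedString, which matches the derivation of token from normalizedString.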

3.3.13.1 Lexical representation

integer has a lexical representation consisting of a finite-length sequence
of decimal digits (#x30-#x39) with an optional leading sign. If the sign is omitted,
"+" is assumed. For example: -1, 0, 12678967543233, +100000.

3.3.13.2 Canonical representation

The canonical representation for integer is defined
by prohibiting certain options from the
Lexical representation (§3.3.13.1). Specifically, the preceding optional "+" sign is prohibited and leading zeroes are prohibited.
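
As an informal illustration, the mapping from any legal lexical form to the canonical one can be sketched in Python. The name canonical_integer is hypothetical. Rendering "-0" as "0" is an assumption of this sketch, since the text above addresses only the "+" sign and leading zeroes.

```python
import re

# Optional sign followed by one or more decimal digits.
INTEGER = re.compile(r"[+-]?[0-9]+")

def canonical_integer(lexical):
    """Return the canonical representation of an integer literal."""
    if INTEGER.fullmatch(lexical) is None:
        raise ValueError(f"not an integer literal: {lexical!r}")
    # Python integers are unbounded; str() emits no '+' and no leading zeroes.
    return str(int(lexical))
```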

3.3.14.1 Lexical representation

nonPositiveInteger has a lexical representation consisting of
an optional preceding sign
followed by a finite-length sequence of decimal digits (#x30-#x39).
The sign may be "+" or may be omitted only for
lexical forms denoting zero; in all other lexical forms, the negative
sign ("-") must be present.
For example: -1, 0, -12678967543233, -100000.

3.3.14.2 Canonical representation

The canonical representation for nonPositiveInteger is defined
by prohibiting certain options from the
Lexical representation (§3.3.14.1).
In the canonical form for zero, the sign must be
omitted. Leading zeroes are prohibited.
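
The sign rules for nonPositiveInteger can be sketched in Python as follows; the names are hypothetical and the pattern is an illustration, not a normative grammar.

```python
import re

# Either a '-' followed by digits, or zero with an optional '+'.
NON_POSITIVE_INTEGER = re.compile(r"-[0-9]+|\+?0+")

def canonical_non_positive_integer(lexical):
    """Return the canonical representation of a nonPositiveInteger literal."""
    if NON_POSITIVE_INTEGER.fullmatch(lexical) is None:
        raise ValueError(f"not a nonPositiveInteger literal: {lexical!r}")
    # str(int(...)) drops leading zeroes and any sign on zero.
    return str(int(lexical))
```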

3.3.16 long

3.3.16.1 Lexical representation

long has a lexical representation consisting
of an optional sign followed by a finite-length
sequence of decimal digits (#x30-#x39). If the sign is omitted, "+" is assumed.
For example: -1, 0,
12678967543233, +100000.

3.3.16.2 Canonical representation

The canonical representation for long is defined
by prohibiting certain options from the
Lexical representation (§3.3.16.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.

3.3.17 int

3.3.17.1 Lexical representation

int has a lexical representation consisting
of an optional sign followed by a finite-length
sequence of decimal digits (#x30-#x39). If the sign is omitted, "+" is assumed.
For example: -1, 0,
126789675, +100000.

3.3.17.2 Canonical representation

The canonical representation for int is defined
by prohibiting certain options from the
Lexical representation (§3.3.17.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.

3.3.18 short

3.3.18.1 Lexical representation

short has a lexical representation consisting
of an optional sign followed by a finite-length sequence of decimal
digits (#x30-#x39). If the sign is omitted, "+" is assumed.
For example: -1, 0, 12678, +10000.

3.3.18.2 Canonical representation

The canonical representation for short is defined
by prohibiting certain options from the
Lexical representation (§3.3.18.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.

3.3.19 byte

3.3.19.1 Lexical representation

byte has a lexical representation consisting
of an optional sign followed by a finite-length
sequence of decimal digits (#x30-#x39). If the sign is omitted, "+" is assumed.
For example: -1, 0,
126, +100.

3.3.19.2 Canonical representation

The canonical representation for byte is defined
by prohibiting certain options from the
Lexical representation (§3.3.19.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.

3.3.20.1 Lexical representation

nonNegativeInteger has a lexical representation consisting of
an optional sign followed by a finite-length
sequence of decimal digits (#x30-#x39). If the sign is omitted,
the positive sign ("+") is assumed.
If the sign is present, it must be "+" except for lexical forms
denoting zero, which may be preceded by a positive ("+") or a negative ("-") sign.
For example:
1, 0, 12678967543233, +100000.

3.3.20.2 Canonical representation

The canonical representation for nonNegativeInteger is defined
by prohibiting certain options from the
Lexical representation (§3.3.20.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.
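
The sign rules for nonNegativeInteger can be sketched in Python as follows. The names are hypothetical, and rendering "-0" as "0" is an assumption of this sketch, since the text above addresses only the "+" sign and leading zeroes.

```python
import re

# Either digits with an optional '+', or zero preceded by '-'.
NON_NEGATIVE_INTEGER = re.compile(r"\+?[0-9]+|-0+")

def canonical_non_negative_integer(lexical):
    """Return the canonical representation of a nonNegativeInteger literal."""
    if NON_NEGATIVE_INTEGER.fullmatch(lexical) is None:
        raise ValueError(f"not a nonNegativeInteger literal: {lexical!r}")
    # str(int(...)) drops the '+' sign, leading zeroes and any sign on zero.
    return str(int(lexical))
```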

3.3.25.1 Lexical representation

positiveInteger has a lexical representation consisting
of an optional positive sign ("+") followed by a finite-length
sequence of decimal digits (#x30-#x39).
For example: 1, 12678967543233, +100000.

3.3.25.2 Canonical representation

The canonical representation for positiveInteger is defined
by prohibiting certain options from the
Lexical representation (§3.3.25.1). Specifically, the
optional "+" sign is prohibited and leading zeroes are prohibited.

3.3.25.3 Constraining facets

4 Datatype components

The following sections provide full details on the properties and
significance of each kind of schema component involved in datatype
definitions. For each property, the kinds of values it is allowed to have are
specified. Any property not identified as optional is required to
be present; optional properties which are not present have
absent as their value.
Any property identified as having a set, subset or ·list·
value may have an empty value unless this is explicitly ruled out: this is
not the same as absent.
Any property value identified as a superset or a subset of some set may
be equal to that set, unless a proper superset or subset is explicitly
called for.

If {final} is the empty set then the type can be used
in deriving other types; the explicit values restriction,
list and union prevent further derivations
by ·restriction·, ·list· and
·union· respectively.

4.1.2 XML Representation of Simple Type Definition Schema Components

The XML representation for a Simple Type Definition schema component
is a <simpleType> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

As an example, taken from a typical display oriented text markup language,
one might want to express font sizes as an integer between 8 and 72, or with
one of the tokens "small", "medium" or "large". The ·union·
type definition below would accomplish that.

On every datatype, the operation Equal is defined in terms of the equality
property of the ·value space·: for any values
a, b drawn from the
·value space·, Equal(a,b) is
true if a = b, and false otherwise.

Note:
The fact that this specification does not define an
·order-relation· for some datatype does not
mean that some other application cannot treat that datatype as
being ordered by imposing its own order relation.

4.2.4 cardinality

[Definition:] Every
·value space· has associated with it the concept of
cardinality. Some ·value space·s
are finite, some are countably infinite while still others could
conceivably be uncountably infinite (although no ·value space·
defined by this specification is uncountably infinite). A datatype is
said to have the cardinality of its
·value space·.

It
is sometimes useful to categorize ·value space·s
(and hence, datatypes) as to their cardinality. There are two
significant cases:

Note:
For string and datatypes ·derived· from string,
length will not always coincide with "string length" as perceived
by some users or with the number of storage units in some digital representation.
Therefore, care should be taken when specifying a value for length
and in attempting to infer storage requirements from a given value for
length.
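The divergence described in this note can be observed with a short, non-normative Python sketch. It assumes that length for string is counted in characters (code points), which is what Python's len() reports; the function name and the sample strings are illustrative only.

```python
def length_views(value: str) -> dict:
    """Report three different notions of the 'length' of a string."""
    return {
        # Code points, the unit assumed here for the length facet.
        "characters": len(value),
        # Storage units in two common encodings.
        "utf8_bytes": len(value.encode("utf-8")),
        "utf16_units": len(value.encode("utf-16-le")) // 2,
    }


# "e" followed by COMBINING ACUTE ACCENT: one glyph to most users.
accented = "e\u0301"
# A character outside the Basic Multilingual Plane.
supplementary = "\U0001F600"
```

Here the accented sample counts as two characters although a user would likely perceive one, and the supplementary sample counts as one character although it occupies two UTF-16 code units.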

The following is the definition of a ·user-derived·
datatype to represent product codes which must be
exactly 8 characters in length. By fixing the value of the
length facet we ensure that types derived from productCode can
change or set the values of other facets, such as pattern, but
cannot change the length.

4.3.1.2 XML Representation of length Schema Components

The XML representation for a length schema
component is a <length> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

Note:
For string and datatypes ·derived· from string,
minLength will not always coincide with "string length" as perceived
by some users or with the number of storage units in some digital representation.
Therefore, care should be taken when specifying a value for minLength
and in attempting to infer storage requirements from a given value for
minLength.

4.3.2.2 XML Representation of minLength Schema Component

The XML representation for a minLength schema
component is a <minLength> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

Note:
For string and datatypes ·derived· from string,
maxLength will not always coincide with "string length" as perceived
by some users or with the number of storage units in some digital representation.
Therefore, care should be taken when specifying a value for maxLength
and in attempting to infer storage requirements from a given value for
maxLength.

4.3.3.2 XML Representation of maxLength Schema Components

The XML representation for a maxLength schema
component is a <maxLength> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

The following is the definition of a ·user-derived·
datatype which is a better representation of postal codes in the
United States, by limiting strings to those which are matched by
a specific ·regular expression·.

4.3.4.1 The pattern Schema Component

4.3.4.2 XML Representation of pattern Schema Components

The XML representation for a pattern schema
component is a <pattern> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

Note:
It is a consequence of the schema representation constraint
Multiple patterns (§4.3.4.3) and of the rules for
·restriction· that ·pattern·
facets specified on the same step in a type
derivation are ORed together, while ·pattern·
facets specified on different steps of a type derivation
are ANDed together.

Thus, to impose two ·pattern· constraints simultaneously,
schema authors may either write a single ·pattern· which
expresses the intersection of the two ·pattern·s they wish to
impose, or define each ·pattern· on a separate type derivation
step.
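The OR/AND combination described above can be sketched, non-normatively, in Python. The representation of a derivation as a list of steps, the function name and the sample patterns are all assumptions of this sketch; Python's re dialect also differs from the regular expression language of this specification, so only simple patterns are used.

```python
import re


def conforms(literal: str, derivation_steps: list) -> bool:
    """Check a literal against pattern facets grouped by derivation step.

    Patterns within one step are ORed; the steps themselves are ANDed.
    re.fullmatch is used because patterns match entire literals.
    """
    return all(
        any(re.fullmatch(pattern, literal) for pattern in step)
        for step in derivation_steps
    )


# Hypothetical derivation: the first step allows all-letters OR all-digits,
# a later step additionally requires exactly three characters.
steps = [[r"[A-Z]+", r"[0-9]+"], [r".{3}"]]
```

With these hypothetical steps, "ABC" and "123" conform, "AB" fails the second step, and "A1C" fails the first.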

The following example is a datatype definition for a
·user-derived· datatype which limits the values
of dates to the three US holidays enumerated. This datatype
definition would appear in a schema authored by an "end-user" and
shows how to define a datatype by enumerating the values in its
·value space·. The enumerated values must be
type-valid literals for the ·base type·.

4.3.5.1 The enumeration Schema Component

4.3.5.2 XML Representation of enumeration Schema Components

The XML representation for an enumeration schema
component is an <enumeration> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

After the processing implied by replace, contiguous
sequences of #x20's are collapsed to a single #x20, and leading and
trailing #x20's are removed.

Note:
The notation #xA used here (and elsewhere in this specification) represents
the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by
U+000A. This notation is to be distinguished from &#xA;,
which is the XML character reference
to that same UCS code point.
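The three whiteSpace behaviours can be sketched, non-normatively, in Python. The sketch assumes that replace maps each #x9 (tab), #xA (line feed) and #xD (carriage return) to #x20, as the facet is defined elsewhere in this specification; the function name is illustrative.

```python
import re


def normalize_whitespace(value: str, mode: str) -> str:
    """Apply the preserve, replace or collapse normalization to a string."""
    if mode == "preserve":
        return value
    if mode not in ("replace", "collapse"):
        raise ValueError(f"unknown whiteSpace value: {mode!r}")
    # replace: tab, line feed and carriage return each become #x20.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if mode == "replace":
        return replaced
    # collapse: squeeze runs of #x20, then trim both ends.
    return re.sub(r" +", " ", replaced).strip(" ")
```

Note that collapse is applied on top of replace, so a tab between two words ends up as a single #x20.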

whiteSpace is applicable to all ·atomic· and
·list· datatypes. For all ·atomic·
datatypes other than string (and types ·derived·
by ·restriction· from it) the value of whiteSpace is
collapse and cannot be changed by a schema author; for
string the value of whiteSpace is
preserve; for any type ·derived· by
·restriction· from
string the value of whiteSpace can
be any of the three legal values. For all datatypes
·derived· by ·list· the
value of whiteSpace is collapse and cannot
be changed by a schema author. For all datatypes
·derived· by ·union· whiteSpace does not apply directly; however, the
normalization behavior of ·union· types is controlled by
the value of whiteSpace on that one of the
·memberTypes· against which the ·union·
is successfully validated.

4.3.6.2 XML Representation of whiteSpace Schema Components

The XML representation for a whiteSpace schema
component is a <whiteSpace> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

4.3.7.2 XML Representation of maxInclusive Schema Components

The XML representation for a maxInclusive schema
component is a <maxInclusive> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

4.3.8.2 XML Representation of maxExclusive Schema Components

The XML representation for a maxExclusive schema
component is a <maxExclusive> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

4.3.9.2 XML Representation of minExclusive Schema Components

The XML representation for a minExclusive schema
component is a <minExclusive> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

4.3.10.2 XML Representation of minInclusive Schema Components

The XML representation for a minInclusive schema
component is a <minInclusive> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

4.3.11 totalDigits

[Definition:] totalDigits
controls the maximum number of values in
the ·value space·
of datatypes ·derived· from decimal,
by restricting it to numbers that are expressible as
i × 10^-n where i
and n are integers such that
|i| < 10^totalDigits and
0 <= n <= totalDigits.
The value of
totalDigits ·must· be a
positiveInteger.

The term totalDigits is chosen to reflect the fact that it
restricts the ·value space· to those values that can
be represented lexically using at most totalDigits
digits. Note that it does not restrict the ·lexical space·
directly; a lexical representation that adds
additional leading zero digits or trailing fractional zero digits is
still permitted.
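As a non-normative sketch, the value-space test above can be written in Python by reducing a literal to the smallest n for which the value equals i × 10^-n and then checking both bounds. The function name is an assumption of this sketch, and the literal is assumed to be a valid decimal lexical form already (Python's Decimal accepts a wider syntax).

```python
from decimal import Decimal


def satisfies_total_digits(literal: str, total_digits: int) -> bool:
    """Check that a decimal value is i * 10**-n within the totalDigits bounds."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    i = int("".join(str(digit) for digit in digits))
    # Trailing fractional zeros do not count: shrink n as far as possible.
    while exponent < 0 and i % 10 == 0:
        i //= 10
        exponent += 1
    # A positive exponent means an integer value, so n is zero.
    if exponent > 0:
        i *= 10 ** exponent
        exponent = 0
    n = -exponent
    return abs(i) < 10 ** total_digits and 0 <= n <= total_digits
```

Because the check is made on the value rather than the literal, "0123.4500" passes with totalDigits equal to 5 even though it is written with nine digits.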

4.3.11.2 XML Representation of totalDigits Schema Components

The XML representation for a totalDigits schema
component is a <totalDigits> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

The term fractionDigits is chosen to reflect the fact that it
restricts the ·value space· to those values that can be
represented lexically using at most fractionDigits digits
to the right of the decimal point. Note that it does not restrict the
·lexical space· directly; a
non-·canonical lexical representation· that adds additional
leading zero digits or trailing fractional zero digits is still permitted.

Example

The following is the definition of a ·user-derived·
datatype which could be used to represent the magnitude
of a person's body temperature on the Celsius scale.
This definition would appear in a schema authored by an "end-user"
and shows how to define a datatype by specifying facet values which
constrain the range of the ·base type·.

4.3.12.2 XML Representation of fractionDigits Schema Components

The XML representation for a fractionDigits schema
component is a <fractionDigits> element information item. The
correspondences between the properties of the information item and
properties of the component are as follows:

5 Conformance

This specification describes two levels of conformance for
datatype processors. The first is
required of all processors. Support for the other will depend on the
application environments for which the processor is intended.

Note:
By separating the conformance requirements relating to the concrete
syntax of XML schema documents, this specification admits processors
which validate using schemas stored in optimized binary representations,
dynamically created schemas represented as programming language data
structures, or implementations in which particular schemas are compiled
into executable code such as C or Java. Such processors can be said to
be ·minimally conforming·
but not necessarily in ·conformance to
the XML Representation of Schemas·.

[ISO 8601] "specifies the representation of dates in the
proleptic Gregorian calendar and times and representations of periods of time".
The proleptic Gregorian calendar includes dates prior to 1582 (the year it came
into use as an ecclesiastical calendar).
It should be pointed out that the datatypes described in this
specification do not cover all the types of data covered by
[ISO 8601], nor do they support all the lexical
representations for those types of data.

[ISO 8601] lexical formats are described using "pictures"
in which characters are used in place of decimal digits.
The allowed decimal digits are (#x30-#x39).
For the primitive datatypes
dateTime, time,
date, gYearMonth, gMonthDay,
gDay, gMonth and gYear,
these characters have the following meanings:

C -- represents a digit used in the thousands and hundreds components,
the "century" component, of the time element "year". Legal values are
from 0 to 9.

Y -- represents a digit used in the tens and units components of the time
element "year". Legal values are from 0 to 9.

M -- represents a digit used in the time element "month". The two
digits in a MM format can have values from 1 to 12.

D -- represents a digit used in the time element "day". The two digits
in a DD format can have values from 1 to 28 if the month value equals 2,
1 to 29 if the month value equals 2 and the year is a leap year, 1 to 30
if the month value equals 4, 6, 9 or 11, and 1 to 31 if the month value
equals 1, 3, 5, 7, 8, 10 or 12.

h -- represents a digit used in the time element "hour". The two digits
in a hh format can have values from 0 to
24.
If the value of the hour element is 24 then the values of the minutes
element and the seconds element must be 00 and 00.

m -- represents a digit used in the time element "minute". The two digits
in a mm format can have values from 0 to 59.

s -- represents a digit used in the time element "second". The two
digits in a ss format can have values from 0 to 60. In the formats
described in this specification the whole number of seconds ·may·
be followed by decimal seconds to an arbitrary level of precision.
This is represented in the picture by "ss.sss". A value of 60 or more is
allowed only in the case of leap seconds.

Strictly speaking, a value of
60 or more is not sensible unless the month and day could
represent March 31, June 30, September 30, or December 31 in UTC.
Because the leap second is added or subtracted as the last second of the day
in UTC time, the long (or short) minute could occur at other times in local
time. In cases where the leap second is used with an inappropriate month
and day it, and any fractional seconds, should be considered as added or
subtracted from the following minute.

For all the information items indicated by the above characters, leading
zeros are required where indicated.
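The range rules for the hh, mm and ss pictures can be sketched, non-normatively, in Python. The function name is an assumption, the time zone part of the lexical form is deliberately left out, and the upper bound on seconds (below 61, to admit a leap second with a fraction) is this sketch's reading of the text above.

```python
import re
from decimal import Decimal

# Two digits each for hh, mm and ss, so leading zeros are required.
_TIME = re.compile(r"([0-9]{2}):([0-9]{2}):([0-9]{2}(?:\.[0-9]+)?)")


def is_plausible_time(literal: str) -> bool:
    """Check an hh:mm:ss(.sss) literal against the field ranges."""
    match = _TIME.fullmatch(literal)
    if match is None:
        return False
    hour = int(match.group(1))
    minute = int(match.group(2))
    second = Decimal(match.group(3))
    if hour == 24:
        # Hour 24 is allowed only with zero minutes and zero seconds.
        return minute == 0 and second == 0
    return hour <= 23 and minute <= 59 and second < 61
```

The sketch does not check whether a value of 60 seconds falls on a date on which a leap second could actually occur.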

In addition to the above, certain characters are used as designators
and appear as themselves in lexical formats.

T -- is used as time designator to indicate the start of the
representation of the time of day in dateTime.

In the lexical format for duration the following
characters are also used as designators and appear as themselves in
lexical formats:

P -- is used as the time duration designator, preceding a data element
representing a given duration of time.

Y -- follows the number of years in a time duration.

M -- follows the number of months or minutes in a time duration.

D -- follows the number of days in a time duration.

H -- follows the number of hours in a time duration.

S -- follows the number of seconds in a time duration.

The values of the
Year, Month, Day, Hour and Minutes components are not restricted but
allow an arbitrary integer. Similarly, the value of the Seconds component
allows an arbitrary decimal. Thus, the lexical format for
duration and datatypes derived from it
does not follow the alternative
format of § 5.5.3.2.1 of [ISO 8601].
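The role of the designators can be illustrated with a non-normative Python sketch that splits a duration literal into its fields. The regular expression, the function name and the field names are assumptions of this sketch; the "T" designator, which separates the date part from the time part, is what distinguishes M for months from M for minutes.

```python
import re
from decimal import Decimal

_DURATION = re.compile(
    r"(?P<sign>-)?P"
    r"(?:(?P<year>[0-9]+)Y)?"
    r"(?:(?P<month>[0-9]+)M)?"
    r"(?:(?P<day>[0-9]+)D)?"
    r"(?:T"
    r"(?:(?P<hour>[0-9]+)H)?"
    r"(?:(?P<minute>[0-9]+)M)?"
    r"(?:(?P<second>[0-9]+(?:\.[0-9]+)?)S)?"
    r")?"
)
_FIELDS = ("year", "month", "day", "hour", "minute", "second")


def parse_duration(literal: str) -> dict:
    """Split a duration literal into signed fields; absent fields are zero."""
    match = _DURATION.fullmatch(literal)
    # A bare "P", or a "T" with nothing after it, carries no component.
    if match is None or literal.endswith(("P", "T")):
        raise ValueError(f"not a duration literal: {literal!r}")
    sign = -1 if match.group("sign") else 1
    return {name: sign * Decimal(match.group(name) or "0") for name in _FIELDS}
```

No upper bound is placed on any field, so a literal such as P400D is accepted as written.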

D.2 Truncated and Reduced Formats

[ISO 8601] supports a variety of "truncated" formats in
which some of the characters on the left of specific formats, for example,
the
century, can be omitted.
Truncated formats are, in
general, not permitted for the datatypes defined in this specification
with three exceptions. The time datatype uses
a truncated format for dateTime
which represents an instant of time that recurs every day.
Similarly, the gMonthDay and gDay
datatypes use left-truncated formats for date.
The datatype gMonth uses a right and left truncated format for
date.

[ISO 8601] also supports a variety of "reduced" or right-truncated
formats in which some of the characters to the right of specific formats,
such as the
time specification, can be omitted. Right truncated formats are also, in
general,
not permitted for the datatypes defined in this specification
with the following exceptions:
right-truncated representations of dateTime are used as
lexical representations for date, gMonth,
gYear.

D.3.4 Time zone permitted

E Adding durations to dateTimes

Given a dateTime S and a duration D, this
appendix specifies how to compute a dateTime E where E is the
end of the time period with start S and duration D, i.e. E = S + D. Such
computations are used, for example, to determine whether a dateTime
is within a specific time period. This appendix also addresses the addition of
durations to the datatypes date,
gYearMonth, gYear, gDay and
gMonth, which can be viewed as a set of dateTimes.
In such cases, the addition is made to the first or starting
dateTime in the set.

This is a logical explanation of the process.
Actual implementations are free to optimize as long as they produce the same
results. The calculation uses the notation S[year] to represent the year
field of S, S[month] to represent the month field, and so on. It also depends on
the following functions:

E.1 Algorithm

Essentially, this calculation is equivalent to separating D into <year,month>
and <day,hour,minute,second> fields. The <year,month> is added to S.
If the day is out of range, it is pinned to be within range. Thus April
31 turns into April 30. Then the <day,hour,minute,second> is added. This
latter addition can cause the year and month to change.

Leap seconds are handled by the computation by treating them as overflows.
Essentially, a value of 60
seconds in S is treated as if it were a duration of 60 seconds added to S
(with a zero seconds field). All calculations
thereafter use 60 seconds per minute.

Thus the addition of either PT1M or PT60S to any dateTime will always
produce the same result. This is a special definition of addition which
is designed to match common practice, and -- most importantly -- be stable
over time.

A definition that attempted to take leap-seconds into account would need to
be constantly updated, and could not predict the results of future
implementation's additions. The decision to introduce a leap second in UTC
is the responsibility of the [International Earth Rotation Service (IERS)]. They make periodic
announcements as to when
leap seconds are to be added, but this is not known more than a year in
advance. For more information on leap seconds, see [U.S. Naval Observatory Time Service Department].

The following is the precise specification. These steps must be followed in
the same order. If a field in D is not specified, it is treated as if it were
zero. If a field in S is not specified, it is treated in the calculation as if
it were the minimum allowed value in that field; however, after the calculation
is concluded, the corresponding field in E is removed (set to unspecified).

Months (may be modified additionally below)
    temp := S[month] + D[month]
    E[month] := modulo(temp, 1, 13)
    carry := fQuotient(temp, 1, 13)

Years (may be modified additionally below)
    E[year] := S[year] + D[year] + carry

Zone
    E[zone] := S[zone]

Seconds
    temp := S[second] + D[second]
    E[second] := modulo(temp, 60)
    carry := fQuotient(temp, 60)

Minutes
    temp := S[minute] + D[minute] + carry
    E[minute] := modulo(temp, 60)
    carry := fQuotient(temp, 60)

Hours
    temp := S[hour] + D[hour] + carry
    E[hour] := modulo(temp, 24)
    carry := fQuotient(temp, 24)

Days
    if S[day] > maximumDayInMonthFor(E[year], E[month])
        tempDays := maximumDayInMonthFor(E[year], E[month])
    else if S[day] < 1
        tempDays := 1
    else
        tempDays := S[day]
    E[day] := tempDays + D[day] + carry
    START LOOP
        IF E[day] < 1
            E[day] := E[day] + maximumDayInMonthFor(E[year], E[month] - 1)
            carry := -1
        ELSE IF E[day] > maximumDayInMonthFor(E[year], E[month])
            E[day] := E[day] - maximumDayInMonthFor(E[year], E[month])
            carry := 1
        ELSE EXIT LOOP
        temp := E[month] + carry
        E[month] := modulo(temp, 1, 13)
        E[year] := E[year] + fQuotient(temp, 1, 13)
        GOTO START LOOP
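The steps above can be transcribed into a non-normative Python sketch. The dictionary representation of S, D and E, the function names, and the default year used when S has no year are assumptions of this sketch. The helpers follow the usual definitions: fQuotient(a, b) is the greatest integer not exceeding a/b, modulo(a, b) is a - fQuotient(a, b) × b, and the three-argument forms shift the range to [low, high).

```python
import math
from decimal import Decimal


def f_quotient(a, low, high=None):
    """fQuotient(a, b), or fQuotient(a, low, high) with three arguments."""
    if high is not None:
        return f_quotient(a - low, high - low)
    return math.floor(Decimal(a) / Decimal(low))


def modulo(a, low, high=None):
    """modulo(a, b), or modulo(a, low, high) with three arguments."""
    if high is not None:
        return modulo(a - low, high - low) + low
    return a - f_quotient(a, low) * low


def maximum_day_in_month_for(year, month):
    m = modulo(month, 1, 13)
    y = year + f_quotient(month, 1, 13)
    if m in (4, 6, 9, 11):
        return 30
    if m != 2:
        return 31
    leap = modulo(y, 400) == 0 or (modulo(y, 100) != 0 and modulo(y, 4) == 0)
    return 29 if leap else 28


# Values substituted for fields missing from S (the year default is assumed).
_MINIMUM = {"year": 1, "month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def add_duration(start: dict, duration: dict) -> dict:
    """Compute E = S + D; fields absent from S are absent from E."""
    s = {**_MINIMUM, **start}
    d = {name: duration.get(name, 0) for name in _MINIMUM}
    e = {}
    # Months, then years.
    temp = s["month"] + d["month"]
    e["month"] = modulo(temp, 1, 13)
    e["year"] = s["year"] + d["year"] + f_quotient(temp, 1, 13)
    # Seconds, minutes, hours, each carrying into the next field.
    temp = s["second"] + d["second"]
    e["second"] = modulo(temp, 60)
    temp = s["minute"] + d["minute"] + f_quotient(temp, 60)
    e["minute"] = modulo(temp, 60)
    temp = s["hour"] + d["hour"] + f_quotient(temp, 60)
    e["hour"] = modulo(temp, 24)
    carry = f_quotient(temp, 24)
    # Days: pin the start day into range, then normalize month by month.
    max_day = maximum_day_in_month_for(e["year"], e["month"])
    e["day"] = min(max(s["day"], 1), max_day) + d["day"] + carry
    while True:
        if e["day"] < 1:
            e["day"] += maximum_day_in_month_for(e["year"], e["month"] - 1)
            carry = -1
        elif e["day"] > maximum_day_in_month_for(e["year"], e["month"]):
            e["day"] -= maximum_day_in_month_for(e["year"], e["month"])
            carry = 1
        else:
            break
        temp = e["month"] + carry
        e["month"] = modulo(temp, 1, 13)
        e["year"] += f_quotient(temp, 1, 13)
    if "zone" in start:
        e["zone"] = start["zone"]
    return {name: value for name, value in e.items() if name in start}
```

Seconds may be passed as Decimal values to keep fractional seconds exact; the other fields are plain integers.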

Examples:

dateTime                 duration             result
2000-01-12T12:13:14Z     P1Y3M5DT7H10M3.3S    2001-04-17T19:23:17.3Z
2000-01                  -P3M                 1999-10
2000-01-12               PT33H                2000-01-13

E.2 Commutativity and Associativity

Time durations are added by simply adding each of their fields, respectively,
without overflow.

The order of addition of durations to instants is significant.
For example, there are cases where:
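One way to observe this order dependence is the following non-normative Python sketch. It is a date-only reduction of the addition algorithm (months are added first and the day is pinned into range, then days are added); the function name and the sample date are illustrative assumptions.

```python
import calendar
from datetime import date, timedelta


def add_to_date(start: date, months: int = 0, days: int = 0) -> date:
    """Add months (pinning the day into range), then add days."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    # Pin the day to the last day of the target month if necessary.
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day) + timedelta(days=days)


start = date(2000, 3, 30)
# (start + P1D) + P1M: March 31, then April 31 is pinned to April 30.
day_then_month = add_to_date(add_to_date(start, days=1), months=1)
# (start + P1M) + P1D: April 30, then one more day.
month_then_day = add_to_date(add_to_date(start, months=1), days=1)
```

The two results differ by one day because the pinning of the day happens at different points in the two computations.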

F Regular Expressions

A ·regular expression· R is a sequence of
characters that denotes a set of strings L(R).
When used to constrain a ·lexical space·, a
regular expression R asserts that only strings
in L(R) are valid literals for values of that type.

Note:
Unlike some popular regular expression languages (including those
defined by Perl and standard Unix utilities), the regular
expression language defined here implicitly anchors all regular
expressions at the head and tail, as the most common use of
regular expressions in ·pattern· is to match entire literals.
For example, a datatype ·derived· from string such
that all values must begin with the character A (#x41) and end with the character
Z (#x5a) would be defined as follows:

In regular expression languages that are not implicitly anchored at the head and tail,
it is customary to write the equivalent regular expression as:

^A.*Z$

where "^" anchors the pattern at the head and "$" anchors at the tail.
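The difference can be seen in a non-normative Python sketch, where re.fullmatch behaves like the implicitly anchored matching described here and re.search behaves like an unanchored language. The function name is an assumption, and Python's re dialect is not the regular expression language of this specification, so only a simple pattern is used.

```python
import re


def matches_entire_literal(pattern: str, literal: str) -> bool:
    """Implicitly anchored matching: the whole literal must match."""
    return re.fullmatch(pattern, literal) is not None


def matches_somewhere(pattern: str, literal: str) -> bool:
    """Unanchored matching, as in languages that need explicit ^ and $."""
    return re.search(pattern, literal) is not None
```

With the pattern A.*Z, the literal "xABCZy" is rejected by the first function but accepted by the second; adding explicit anchors to the second brings the two back into agreement.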

In those rare cases where an unanchored match is desired, including
.* at the beginning and ending of the regular expression will
achieve the desired results. For example, a datatype ·derived· from string such that all values must contain at least 3 consecutive A (#x41) characters somewhere within the value could be defined as follows:

S*

All strings in L(S?) and all strings st
with s in L(S*)
and t in L(S). ( all concatenations
of zero or more strings from L(S) )

S+

All strings st with s in L(S)
and t in L(S*). ( all concatenations
of one or more strings from L(S) )

S{n,m}

All strings st with s in L(S)
and t in L(S{n-1,m-1}). ( All
sequences of at least n, and at most m, strings from L(S) )

S{n}

All strings in L(S{n,n}). ( All
sequences of exactly n strings from L(S) )

S{n,}

All strings in L(S{n}S*) ( All
sequences of at least n, strings from L(S) )

S{0,m}

All strings st with s in L(S?)
and t in L(S{0,m-1}). ( All
sequences of at most m, strings from L(S) )

S{0,0}

The set containing only the empty string

Note:
The regular expression language in the Perl Programming Language
[Perl] does not include a quantifier of the form
S{,m}, since it is logically equivalent to S{0,m}.
We have, therefore, left this logical possibility out of the regular
expression language defined by this specification.

[Definition:]
A quantifier
is one of ?, *, +,
{n,m} or {n,}, which have the meanings
defined in the table above.

[Definition:]
A
normal character is any XML character that is not a
metacharacter. In ·regular expression·s, a normal character is an
atom that denotes the singleton set of strings containing only itself.

F.1 Character Classes

[Definition:]
A
character class is an ·atom· R that identifies a set of characters C(R). The set of strings L(R) denoted by a
character class R contains one single-character string
"c" for each character c in C(R).

[Definition:]
A
character class expression is a ·character group· surrounded
by [ and ] characters. For all character
groups G, [G] is a valid character class
expression, identifying the set of characters
C([G]) = C(G).

[Definition:]
A positive character group consists of one or more
·character range·s or ·character class escape·s, concatenated
together. A positive character group identifies the set of
characters containing all of the characters in all of the sets identified
by its constituent ranges or escapes.

Note: The grammar for ·character range· as
given above is ambiguous, but the second and third bullets above
together remove the ambiguity.

A ·character range· ·may· also be written
in the form s-e, identifying the set that contains all XML characters
with UCS code points greater than or equal to the code point
of s, but not greater than the code point of e.

[Definition:] [Unicode Database] specifies a number of possible
values for the "General Category" property
and provides mappings from code points to specific character properties.
The set containing all characters that have property X,
can be identified with a category escape \p{X}.
The complement of this set is specified with the
category escape \P{X}.
([\P{X}] = [^\p{X}]).
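Python's re module has no \p{X} escape, but the membership test behind a category escape can be sketched, non-normatively, with the unicodedata module. The function name is an assumption, and the result reflects the Unicode version bundled with the Python interpreter, which may differ from the version a conforming processor is required to support.

```python
import unicodedata


def in_category(char: str, prop: str, complement: bool = False) -> bool:
    """Membership test for \\p{prop}, or for \\P{prop} when complement is set.

    A one-letter property such as "L" covers every two-letter
    General Category value that starts with it (Lu, Ll, ...).
    """
    inside = unicodedata.category(char).startswith(prop)
    return inside != complement
```

For example, "A" belongs to both Lu and L, while the digit "1" belongs to Nd.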

Note: [Unicode Database] is subject to future revision. For example, the
mapping from code points to character properties might be updated.
All ·minimally conforming· processors ·must·
support the character properties defined in the version of [Unicode Database]
that is current at the time this specification became a W3C
Recommendation. However, implementors are encouraged to support the
character properties defined in any future version.

The following table specifies the recognized values of the
"General Category" property.

Note:
The properties mentioned above exclude the Cs property.
The Cs property identifies "surrogate" characters, which do not
occur at the level of the "character abstraction" that XML instance documents
operate on.

[Definition:] [Unicode Database] groups code points into a number of blocks
such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul Jamo,
CJK Compatibility, etc.
The set containing all characters that have block name X
(with all white space stripped out),
can be identified with a block escape \p{IsX}.
The complement of this set is specified with the
block escape \P{IsX}.
([\P{IsX}] = [^\p{IsX}]).

Block Escape

[36]   IsBlock   ::=   'Is' [a-zA-Z0-9#x2D]+

The following table specifies the recognized block names (for more
information, see the "Blocks.txt" file in [Unicode Database]).

Start Code   End Code   Block Name
#x0000       #x007F     BasicLatin
#x0080       #x00FF     Latin-1Supplement
#x0100       #x017F     LatinExtended-A
#x0180       #x024F     LatinExtended-B
#x0250       #x02AF     IPAExtensions
#x02B0       #x02FF     SpacingModifierLetters
#x0300       #x036F     CombiningDiacriticalMarks
#x0370       #x03FF     Greek
#x0400       #x04FF     Cyrillic
#x0530       #x058F     Armenian
#x0590       #x05FF     Hebrew
#x0600       #x06FF     Arabic
#x0700       #x074F     Syriac
#x0780       #x07BF     Thaana
#x0900       #x097F     Devanagari
#x0980       #x09FF     Bengali
#x0A00       #x0A7F     Gurmukhi
#x0A80       #x0AFF     Gujarati
#x0B00       #x0B7F     Oriya
#x0B80       #x0BFF     Tamil
#x0C00       #x0C7F     Telugu
#x0C80       #x0CFF     Kannada
#x0D00       #x0D7F     Malayalam
#x0D80       #x0DFF     Sinhala
#x0E00       #x0E7F     Thai
#x0E80       #x0EFF     Lao
#x0F00       #x0FFF     Tibetan
#x1000       #x109F     Myanmar
#x10A0       #x10FF     Georgian
#x1100       #x11FF     HangulJamo
#x1200       #x137F     Ethiopic
#x13A0       #x13FF     Cherokee
#x1400       #x167F     UnifiedCanadianAboriginalSyllabics
#x1680       #x169F     Ogham
#x16A0       #x16FF     Runic
#x1780       #x17FF     Khmer
#x1800       #x18AF     Mongolian
#x1E00       #x1EFF     LatinExtendedAdditional
#x1F00       #x1FFF     GreekExtended
#x2000       #x206F     GeneralPunctuation
#x2070       #x209F     SuperscriptsandSubscripts
#x20A0       #x20CF     CurrencySymbols
#x20D0       #x20FF     CombiningMarksforSymbols
#x2100       #x214F     LetterlikeSymbols
#x2150       #x218F     NumberForms
#x2190       #x21FF     Arrows
#x2200       #x22FF     MathematicalOperators
#x2300       #x23FF     MiscellaneousTechnical
#x2400       #x243F     ControlPictures
#x2440       #x245F     OpticalCharacterRecognition
#x2460       #x24FF     EnclosedAlphanumerics
#x2500       #x257F     BoxDrawing
#x2580       #x259F     BlockElements
#x25A0       #x25FF     GeometricShapes
#x2600       #x26FF     MiscellaneousSymbols
#x2700       #x27BF     Dingbats
#x2800       #x28FF     BraillePatterns
#x2E80       #x2EFF     CJKRadicalsSupplement
#x2F00       #x2FDF     KangxiRadicals
#x2FF0       #x2FFF     IdeographicDescriptionCharacters
#x3000       #x303F     CJKSymbolsandPunctuation
#x3040       #x309F     Hiragana
#x30A0       #x30FF     Katakana
#x3100       #x312F     Bopomofo
#x3130       #x318F     HangulCompatibilityJamo
#x3190       #x319F     Kanbun
#x31A0       #x31BF     BopomofoExtended
#x3200       #x32FF     EnclosedCJKLettersandMonths
#x3300       #x33FF     CJKCompatibility
#x3400       #x4DB5     CJKUnifiedIdeographsExtensionA
#x4E00       #x9FFF     CJKUnifiedIdeographs
#xA000       #xA48F     YiSyllables
#xA490       #xA4CF     YiRadicals
#xAC00       #xD7A3     HangulSyllables
#xE000       #xF8FF     PrivateUse
#xF900       #xFAFF     CJKCompatibilityIdeographs
#xFB00       #xFB4F     AlphabeticPresentationForms
#xFB50       #xFDFF     ArabicPresentationForms-A
#xFE20       #xFE2F     CombiningHalfMarks
#xFE30       #xFE4F     CJKCompatibilityForms
#xFE50       #xFE6F     SmallFormVariants
#xFE70       #xFEFE     ArabicPresentationForms-B
#xFEFF       #xFEFF     Specials
#xFF00       #xFFEF     HalfwidthandFullwidthForms
#xFFF0       #xFFFD     Specials

Note:
The blocks mentioned above exclude the HighSurrogates,
LowSurrogates and HighPrivateUseSurrogates blocks.
These blocks identify "surrogate" characters, which do not
occur at the level of the "character abstraction" that XML instance documents
operate on.

Note: [Unicode Database] is subject to future revision.
For example, the
grouping of code points into blocks might be updated.
All ·minimally conforming· processors ·must·
support the blocks defined in the version of [Unicode Database]
that is current at the time this specification became a W3C
Recommendation. However, implementors are encouraged to support the
blocks defined in any future version of the Unicode Standard.

For example, the ·block escape· for identifying the
ASCII characters is \p{IsBasicLatin}.
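A block escape reduces to a range test on the code point, as the following non-normative Python sketch shows. Only a handful of rows from the table above are copied into the lookup dictionary; the dictionary and function names are assumptions of this sketch.

```python
# A few rows copied from the block table: name -> (start code, end code).
BLOCKS = {
    "BasicLatin": (0x0000, 0x007F),
    "Latin-1Supplement": (0x0080, 0x00FF),
    "Greek": (0x0370, 0x03FF),
    "Cyrillic": (0x0400, 0x04FF),
    "Hiragana": (0x3040, 0x309F),
}


def in_block(char: str, block_name: str, complement: bool = False) -> bool:
    """Membership test for \\p{IsX}, or for \\P{IsX} when complement is set."""
    start, end = BLOCKS[block_name]
    inside = start <= ord(char) <= end
    return inside != complement
```

A complete implementation would carry every row of the table and would have to decide how to treat the two separate ranges that share the name Specials.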

[Definition:]
A
multi-character escape provides a simple way to identify
a commonly used set of characters:

\w

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
(all characters except the set of "punctuation",
"separator" and "other" characters)

\W

[^\w]
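The subtraction above can be sketched, non-normatively, with Python's unicodedata module: a character is a word character in this sense unless its General Category begins with P, Z or C. The function name is an assumption, and the categories come from the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def is_word_char(char: str) -> bool:
    """True unless the character is punctuation, a separator or 'other'."""
    return unicodedata.category(char)[0] not in "PZC"
```

This differs from the \w of some other regular expression languages: the underscore is connector punctuation (Pc) and is therefore excluded, while a symbol such as "+" (Sm) is included.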

Note:
The ·regular expression· language defined here does not
attempt to provide a general solution to "regular expressions" over
UCS character sequences. In particular, it does not easily provide
for matching sequences of base characters and combining marks.
The language is targeted at support of "Level 1" features as defined in
[Unicode Regular Expression Guidelines]. It is hoped that future versions of this
specification will provide support for "Level 2" features.

G Glossary (non-normative)

The listing below is for the benefit of readers of a printed version of this
document: it collects together all the definitions which appear in the
document above.

A canonical lexical representation
is a set of literals from among the valid set of literals
for a datatype such that there is a one-to-one mapping between literals
in the canonical lexical representation and
values in the ·value space·.

Every
·value space· has associated with it the concept of
cardinality. Some ·value space·s
are finite, some are countably infinite while still others could
conceivably be uncountably infinite (although no ·value space·
defined by this specification is uncountably infinite). A datatype is
said to have the cardinality of its
·value space·.

In this specification,
a datatype is a 3-tuple, consisting of
a) a set of distinct values, called its ·value space·,
b) a set of lexical representations, called its
·lexical space·, and c) a set of ·facet·s
that characterize properties of the ·value space·,
individual values or lexical items.

I Acknowledgements (non-normative)

The following have contributed material to the first edition of this specification:

Asir S. Vedamuthu, webMethods, Inc

Mark Davis, IBM

Co-editor Ashok Malhotra's work on this specification from March 1999 until
February 2001 was supported by IBM. From February 2001 until May 2004 it
was supported by Microsoft.

The editors acknowledge the members of the XML Schema Working Group, the members of other W3C Working Groups, and industry experts in other
forums who have contributed directly or indirectly to the process or content of
creating this document. The Working Group is particularly grateful to Lotus
Development Corp. and IBM for providing teleconferencing facilities.

At the time the first edition of this
specification was published, the members of the XML Schema Working Group
were:

The XML Schema Working Group has benefited in its work from the
participation and contributions of a number of people not currently
members of the Working Group, including
in particular those named below. Affiliations given are those current at
the time of their work with the WG.

Paula Angerstein, Vignette Corporation

David Beech, Oracle Corp.

Gabe Beged-Dov, Rogue Wave Software

Greg Bumgardner, Rogue Wave Software

Dean Burson, Lotus Development Corporation

Mike Cokus, MITRE

Andrew Eisenberg, Progress Software

Rob Ellman, Calico Commerce

George Feinberg, Object Design

Charles Frankston, Microsoft

Ernesto Guerrieri, Inso

Michael Hyman, Microsoft

Renato Iannella, Distributed Systems Technology Centre (DSTC Pty Ltd)

Dianne Kennedy, Graphic Communications Association

Janet Koenig, Sun Microsystems

Setrag Khoshafian, Technology Deployment International (TDI)

Ara Kullukian, Technology Deployment International (TDI)

Andrew Layman, Microsoft

Dmitry Lenkov, Hewlett-Packard Company

John McCarthy, Lawrence Berkeley National Laboratory

Murata Makoto, Xerox

Eve Maler, Sun Microsystems

Murray Maloney, Muzmo Communication, acting for Commerce One

Chris Olds, Wall Data

Frank Olken, Lawrence Berkeley National Laboratory

Shriram Revankar, Xerox

Mark Reinhold, Sun Microsystems

John C. Schneider, MITRE

Lew Shannon, NCR

William Shea, Merrill Lynch

Ralph Swick, W3C

Tony Stewart, Rivcom

Matt Timmermans, Microstar

Jim Trezzo, Oracle Corp.

Steph Tryphonas, Microstar

The lists given above pertain to the first edition.
At the time work on this second edition was completed,
the membership of the Working Group was:

We note with sadness the accidental death of Mario Jeckle
shortly after the completion of work on this document.
In addition to those named above, several
people served on the Working Group during the development
of this second edition: