Schemas

In This Chapter

What's wrong with DTDs?

What is a schema?

The W3C XML Schema Language

Hello schemas

Complex types

Grouping

Simple types

Deriving Simple Types

Empty elements

Attributes

Namespaces

Annotations

Schemas are documents that define the valid contents of particular classes of XML documents. The schema language discussed in this chapter, the W3C XML Schema Language, has a number of useful characteristics, most notably the ability to specify data types for text content and attribute values. For example, a schema can state that a PRICE element has type double or that a YEAR attribute contains a number between 1966 and 2012. However, schemas have a number of other useful characteristics including namespace awareness and the ability to validate complex structures built up out of many different elements of many types.

What's Wrong with DTDs?

Document type definitions (DTDs) are an outgrowth of XML's heritage in the Standard Generalized Markup Language (SGML). SGML was always intended for narrative-style documents: books, reports, technical manuals, brochures, web pages, and the like. DTDs were designed to serve the needs of these sorts of documents, and indeed they serve them very well. DTDs let you state very simply and straightforwardly that every book must have one or more authors, that every song has exactly one title, that every PERSON element has an ID attribute, and so forth. Indeed, for narrative documents that are intended for human beings to read from start to finish, that are more or less composed of words in a row, there's really no need for anything beyond a DTD. However, XML has gone well beyond the uses envisioned for SGML. XML is being used for object serialization, stock trading, remote procedure calls, vector graphics, and many more things that look nothing like traditional narrative documents; and it is in these new arenas that DTDs are showing some limits.

The limitation most developers notice first is the almost complete lack of data typing, especially for element content. DTDs can't say that a PRICE element must contain a number, much less a number that's greater than zero with two decimal digits of precision and a dollar sign. There's no way to say that a MONTH element must be an integer between 1 and 12. There's no way to indicate that a TITLE must contain between 1 and 255 characters. None of these are particularly important things to do for the narrative documents SGML was aimed at; but they're very common things to want to do with data formats intended for computer-to-computer exchange of information rather than computer-to-human communication. Humans are very good at handling fuzzy systems where expected data is missing, or perhaps is not in quite the right format; computers are not. Computers need to know that when they expect an element to contain an integer between 1 and 12, the element really contains an integer in that range and nothing else.

The second problem is that DTDs have an unusual non-XML syntax. The same parsers and APIs that read an XML document can't read a DTD. For example, consider this common element declaration:

<!ELEMENT TITLE (#PCDATA)>

This is not a legal XML element. You can't begin an element name with an exclamation point. TITLE is not an attribute. Neither is (#PCDATA). This is a very different way of describing information than is used in XML document instances. One would expect that if XML were really powerful enough to live up to all its hype, it would be powerful enough to describe itself. You shouldn't need two different syntaxes: one for the information and one for the meta-information detailing the structure of the information. XML element and attribute syntax should suffice for both info and meta-info.

The third problem is that DTDs are only marginally extensible and don't scale very well. It's difficult to combine independent DTDs together in a sensible way. You can do this with parameter entity references. Indeed, SMIL 2.0 and modular XHTML are based on this idea. However, the modularized DTDs are very messy and very hard to follow. The largest DTDs in use today are in the ballpark of 10,000 lines of code, and it's questionable whether much larger XML applications can be defined before the entire DTD becomes completely unmanageable and incomprehensible. By contrast, the largest computer programs in existence today, which are much more intrinsically complex than even the most ambitious DTDs, easily reach sizes of 1,000,000 lines of code or more.

Perhaps most annoyingly, DTDs are only marginally compatible with namespaces. The first principle of namespaces is that only the URI matters. The prefix does not. The prefix can change as long as the URI remains the same. However, validation of documents that use namespace prefixes works only if the DTD declares the prefixed names. You cannot use namespace URIs in a DTD. You must use the actual prefixes. If you change the prefixes in the document but don't change the DTD, the document immediately ceases to be valid. There are some tricks that you can perform with parameter entity references to make DTDs less dependent on the actual prefix, but they're complicated and not well understood in the XML community. And even when they are understood, these tricks simply feel far too much like a dirty hack rather than a clean, maintainable solution.

Finally, there are a number of annoying minor limitations where DTDs don't allow you to do things that it really feels like you ought to be able to do. For example, DTDs cannot enforce the order or number of child elements in mixed content. That is, you can't enforce constraints such as each PARAGRAPH element must begin with exactly one SUMMARY element that is followed by plain text. Similarly, you can't enforce the number of child elements without also enforcing their order. For example, you cannot easily say that a PERSON element must contain a FIRST_NAME child, a MIDDLE_NAME child, and a LAST_NAME child, but that you don't care what order they appear in. Again, there are workarounds, but they grow combinatorially complex with the number of possible child elements.

Schemas are an attempt to solve all these problems by defining a new XML-based syntax for describing the permissible contents of XML documents that includes the following:

Powerful data typing including range checking

Namespace-aware validation based on namespace URIs rather than on prefixes

Extensibility and scalability

However, schemas are not a be-all and end-all solution. In particular, schemas do not replace DTDs! You can use both schemas and DTDs in the same document. DTDs can do several things that schemas cannot do, most importantly declaring entities. And DTDs still work very well for the classic sort of narrative documents they were originally designed for. Indeed, for these types of documents, a DTD is often considerably easier to write than an equivalent schema. Parsers and other software will continue to support DTDs for as long as they support XML.

What Is a Schema?

The word schema derives from the Greek word σχημα, meaning form or shape. It was first popularized in the Western world by Immanuel Kant in the late 1700s. According to the 1933 edition of the Oxford English Dictionary, Kant used the word schema to mean, "Any one of certain forms or rules of the productive imagination through which the understanding is able to apply its categories to the manifold of sense-perception in the process of realizing knowledge or experience." (And you thought computer science was full of unintelligible technical jargon!)

Schemas remained the province of philosophers for the next 200 years until the word schema entered computer science, probably through database theory. Here, schema originally meant any document that described the permissible content of a database. More specifically, a schema was a description of all the tables in a database and the fields in the table. A schema also described what type of data each field could contain: CHAR, INT, CHAR[32], BLOB, DATE, and so on.

The word schema has grown from that source definition to a more generic meaning of any document that describes the permissible contents of other documents, especially if data typing is involved. Thus, you'll hear about different kinds of schemas from different technologies, including vocabulary schemas, RDF schemas, organizational schemas, X.500 schemas, and of course, XML schemas.

You say schemas, I say schemata

Probably no single topic has been more controversial in the schema world than the proper plural form of the word schema. The original Greek plural is σχηματα, schemata in Latin transliteration; and this is the form which Kant used and which you'll find in most dictionaries. This was fine for the 200 years when only people with Ph.D.s in philosophy actually used the word. However, as often happens when words from other languages are adopted into popular English, its plural changed to something that sounds more natural to an anglophone ear. In this case, the plural form, schemata, seems to be rapidly dying out in favor of the simpler schemas. In fact, the three World Wide Web Consortium (W3C) schema specifications all use the plural form schemas. I follow this convention in this book.

Because schemas is such a generic term, it shouldn't come as any surprise that there's more than one schema language for XML. In fact, there are many, each with its own unique advantages and disadvantages. These include Murata Makoto and James Clark's RELAX NG (http://relaxng.org/), Rick Jelliffe's Schematron (http://www.ascc.net/xml/resource/schematron/schematron.html), and the W3C's misleadingly, generically titled XML Schema Language. In addition, traditional XML DTDs can be considered to be simply another schema language.

This chapter focuses almost exclusively on the W3C XML Schema Language. Nonetheless, RELAX NG and Schematron are definitely worthy of your attention as well. In particular, if you find W3C schemas to be excessively complex (and many people do) and if you want a simpler schema language that still offers a complete set of extensible data types, you should consider RELAX NG. RELAX NG adopts the less controversial data types half of the W3C XML Schema Recommendation, but replaces the much more complex and much less popular structures half with a much simpler language.

Note

There are also several dead XML schema languages that have been abandoned by their manufacturers in favor of other languages. These include Document Content Description (DCD), Commerce One's Schema for Object-Oriented XML (SOX), and Microsoft's XML-Data Reduced (XDR). None of these is worth your time or investment at this point. They never achieved broad adoption, and their vendors are now moving to the W3C XML Schema Language instead.

Most schema languages, including W3C schemas, RELAX NG, and DTDs, take the approach that you must carefully specify what is allowed in the document. They are conservative: Everything not permitted is forbidden. If, on the other hand, you're looking for a less-restrictive schema language in which everything not forbidden is permitted, you should consider Schematron. Schematron is based on XPath, which allows it to make statements none of the other major schema languages can, such as "An a element cannot have another a element as a descendant, even though an a element can contain a strong element that can contain an a element if it itself is not a descendant of an a element." This isn't a theoretical example. This is a real restriction in XHTML that has to be made in the prose of the specification because neither DTDs nor the W3C XML Schema Language is powerful enough to say it. What it means is that links can't nest; that is, a link cannot contain another link.

From this point forward, I will use the unqualified word schema to refer to the W3C's XML Schema Language; but please keep in mind that alternatives that are equally deserving of the appellation do exist.

The W3C XML Schema Language

The W3C XML Schema Language was created by the W3C XML Schema Working Group based on many different submissions from a variety of companies and individuals. It is a very large specification designed to handle a broad range of use cases. In fact, the schema specification is considerably larger and more complex than the XML 1.0 specification. It is an open standard, free to be implemented by any interested party. There are no known patent, trademark, or other intellectual property restrictions that would prevent you from doing anything you might reasonably want to do with schemas. (This, unfortunately, is not quite the same thing as saying that there are no known patent, trademark, or other intellectual property restrictions that would prevent you from doing anything you might reasonably want to do. The U.S. Patent Office has been a little out of control lately, granting patents left and right for inventions that really don't deserve it, including a lot of software and business processes. I would not be surprised to learn of an as yet unnoticed patent that at least claims to cover some or all of the W3C XML Schema Language.)

Hello Schemas

Lets begin our exploration of schemas with the ubiquitous Hello World example. Recall, once again, the code from Listing 3-2 (greeting.xml) in Chapter 3. It is shown here:

<?xml version="1.0"?>
<GREETING>
Hello XML!
</GREETING>

This XML document contains a single element, GREETING. (Remember that <?xml version="1.0"?> is the XML declaration, not an element.) This element contains parsed character data. A schema for this document has to declare the GREETING element. It may declare other elements too, including ones that aren't present in this particular document, but it must at least declare the GREETING element.

The greeting schema

Listing 20-1 is a very simple schema for GREETING elements. By convention it would be stored in a file with the three-letter extension .xsd (greeting.xsd, for example), but that's not required. It is an XML document, so it has an XML declaration. It can be written and saved in any text editor that knows how to save Unicode files. As always, you can use a different character set if you declare it in an encoding declaration. Schema documents are XML documents and have all the privileges and responsibilities of other XML documents. They can even have DTDs, DOCTYPE declarations, and style sheets if that seems useful to you, although in practice most do not.

The root element of this and all other schemas is schema. This must be in the http://www.w3.org/2001/XMLSchema namespace. Normally, this namespace is bound to the prefix xsd or xs, although this can change as long as the URI stays the same. The other common approach is to make this URI the default namespace, although that generally requires a few extra attributes to help separate the names from the XML application the schema describes from the names of the schema elements themselves. You'll see this when namespaces are discussed at the end of this chapter.

Elements are declared using xsd:element elements. Listing 20-1 includes a single such element declaring the GREETING element. The name attribute specifies which element is being declared, GREETING in this example. This xsd:element element also has a type attribute whose value is the data type of the element. In this case the type is xsd:string, a standard type for elements that can contain any amount of text in any form but not child elements. It's equivalent to a DTD content model of #PCDATA. That is, this xsd:element says that a valid GREETING element must look like this:

<GREETING>
various random text but no markup
</GREETING>

Theres no restriction on what text the element can contain. It can be zero or more Unicode characters with any meaning. Thus, a GREETING element can also look like this:

Each GREETING element must consist of nothing more and nothing less than parsed character data between a <GREETING> start-tag and a </GREETING> end-tag.
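Putting the pieces described in this section together, a schema of this kind can be very short. The following is a minimal sketch assembled from the description above rather than a copy of the original listing; it assumes the conventional xsd prefix.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Declares GREETING as an element that holds text but no child elements -->
  <xsd:element name="GREETING" type="xsd:string"/>

</xsd:schema>
```

Any other prefix, or none at all, would work equally well as long as it is bound to the same namespace URI.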

Validating the document against the schema

Before a document can be validated against a DTD, the document itself must contain a document type declaration pointing to the DTD it should be validated against. You cannot easily receive a document from a third party and validate it against your own DTD. You have to validate it against the DTD that the document's author specified. This is excessively limiting.

For example, imagine youre running an e-commerce business that accepts orders for products using SOAP or XML-RPC. Each order comes to you over the Internet as an XML document. Before accepting that order, the first thing you want to do is check that its valid against a DTD youve defined to make sure that it contains all the necessary information. However, if DTDs are all you have to validate with, theres nothing to prevent a hacker from sending you a document whose DOCTYPE declaration points to a different DTD. Then your system may report that the document is valid according to the hacked DTD, even though it would be invalid when compared to the correct DTD. If your system accepts the invalid document, it could introduce corrupt data that crashes the system or lets the hacker order goods they havent paid for, all because the person authoring the document got to choose which DTD to validate against rather than the person validating the document.

Schemas are more flexible. The schema specification specifically allows for a variety of different means for associating documents with schemas. For example, one possibility is that both the name of the document to validate and the name of the schema to validate it against could be passed to the validator program on the command line, like this:

C:\>validator greeting.xml greeting.xsd

Parsers could also let you choose the schema by setting a SAX property or an environment variable. Many other approaches are possible. The schema specification does not mandate any one way of doing this. However, it does define one particular way to associate a document with a schema. As with DOCTYPE declarations and DTDs, this requires modifying the instance document to point to the schema. The difference is that with schemas, unlike with DTDs, this is not the only way to do it. Parser vendors are free to develop other mechanisms if they want to.

To attach a schema to a document, add an xsi:noNamespaceSchemaLocation attribute to the document's root element. (You can also add it to the first element in the document that the schema applies to, but most of the time adding it to the root element is simplest.) The xsi prefix is mapped to the http://www.w3.org/2001/XMLSchema-instance URI. As always, the prefix can change as long as the URI stays the same. Listing 20-2 demonstrates.
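As a rough sketch of this technique, the greeting document could be marked up as follows. The relative URL greeting.xsd is an assumption: it presumes the schema is saved under that name in the same directory as the document.

```xml
<?xml version="1.0"?>
<GREETING xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="greeting.xsd">
Hello XML!
</GREETING>
```

The xmlns:xsi declaration is needed so that the xsi prefix on the attribute resolves to the schema instance namespace.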

You can now run the document through any parser that supports schema validation. One such parser is Xerces Java from the XML Apache Project (http://xml.apache.org/xerces2-j/). It includes a simple command line program named sax.Counter that can validate against schemas as well as DTDs. When you set the -v and -s flags, sax.Counter validates the document against its schema as specified by the xsi:noNamespaceSchemaLocation attribute. Assuming sax.Counter finds no errors, it simply returns the amount of time that was required to parse the document, as in the following example:

Note

To install sax.Counter, copy the JAR archives bundled with the Xerces distribution into your jre/lib/ext directory. With the latest versions of the JDK, this may actually be named something like j2re1.4.2/lib/ext instead. On Windows with a default installation, you'll find the appropriate directory in C:\Program Files\Java or perhaps C:\Program Files\Javasoft. (The exact names tend to change from one version of Java to the next.) You will need to have Java 1.2 or later installed. If necessary, you can download the latest version from http://java.sun.com/.

Now, suppose you have a document that's not valid, such as Listing 20-3. This document uses a P element that hasn't been declared in the schema.

The problem is that the GREETING element is declared to have type xsd:string, one of several "simple" types that cannot have any child elements. However, in this case, the GREETING element does contain a child element: the P element.

Complex Types

The W3C XML Schema Language divides elements into complex and simple types. A simple type element is one such as GREETING that can only contain text and does not have any attributes. It cannot contain any child elements. It may, however, be more limited in the kind of text it can contain. For example, a schema can say that a simple element contains an integer, a date, or a decimal value between 3.76 and 98.24. Complex type elements can have attributes and can have child elements.

Most documents need a mix of both complex and simple elements. For example, consider Listing 20-4. This document describes the song "Yes I Am" by Melissa Etheridge. The root element is SONG. This element has a number of child elements giving the title of the song, the composer, the producer, the publisher, the duration of the song, the year it was released, the price, and the artist who sang it. Except for SONG itself, these are all simple elements that can have type xsd:string. You might see documents like this used in CD databases, MP3 players, Gnutella clients, or anything else that needs to store information about songs.

The root element of this schema is once again xsd:schema, and once again the prefix xsd is mapped to the namespace URI http://www.w3.org/2001/XMLSchema. This will be the case for all schemas in this chapter, and indeed all schemas that you write.

This schema declares a single top-level element. That is, there is exactly one element declared in an xsd:element declaration that is an immediate child of the root xsd:schema element. This is the SONG element. Only top-level elements can be the root elements of documents described by this schema, though in general they do not have to be the root element.

The SONG element is declared to have type SongType. The W3C Schema Working Group wasn't prescient. They built a lot of common types into the language, but they didn't know that I was going to need a song type, and they didn't provide one. Indeed, they could not reasonably have been expected to predict and provide for the numerous types that schema designers around the world were ever going to need. Instead, they provided facilities to allow users to define their own types. SongType is one such user-defined type. In fact, you can tell it's not a built-in type because it doesn't begin with the prefix xsd. All built-in types are in the http://www.w3.org/2001/XMLSchema namespace.

The xsd:complexType element defines a new type. The name attribute of this element names the type being defined. Here that name is SongType, which matches the type previously assigned to the SONG element. Forward references (for example, xsd:element using the SongType type before it's been defined) are perfectly acceptable in schemas. Circular references are okay, too. Type A can depend on type B, which depends on type A. Schema processors sort all this out without any difficulty.

The contents of the xsd:complexType element specify what content a SongType element must contain. In this example, the schema says that every SongType element contains a sequence of eight child elements: TITLE, COMPOSER, PRODUCER, PUBLISHER, LENGTH, YEAR, ARTIST, and PRICE. Each of these is declared to have the built-in type xsd:string. Each SongType element must contain exactly one of each of these in exactly that order. The only other content it may contain is insignificant white space between the tags.
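A schema matching this description might look like the following sketch. It is reconstructed from the prose above and is not the original listing; the element order follows the sequence just named.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- The single top-level element; its type is defined below -->
  <xsd:element name="SONG" type="SongType"/>

  <!-- A user-defined complex type: eight children, once each, in this order -->
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"     type="xsd:string"/>
      <xsd:element name="COMPOSER"  type="xsd:string"/>
      <xsd:element name="PRODUCER"  type="xsd:string"/>
      <xsd:element name="PUBLISHER" type="xsd:string"/>
      <xsd:element name="LENGTH"    type="xsd:string"/>
      <xsd:element name="YEAR"      type="xsd:string"/>
      <xsd:element name="ARTIST"    type="xsd:string"/>
      <xsd:element name="PRICE"     type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

Note that the SONG declaration refers to SongType before the type is defined, which is the kind of forward reference schemas permit.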

minOccurs and maxOccurs

You can validate Listing 20-4, yesiam.xml, against the song schema, and it does indeed prove valid. Are you done? Is song.xsd now an adequate description of legal song documents? Suppose you instead wanted to validate Listing 20-6, a song document that describes Hot Cop by the Village People. Is it valid according to the schema in Listing 20-5?

The answer is no, it is not. The reason is that this song was a collaboration between three different composers and the existing schema only allows a single composer. Furthermore, the price is missing. If you looked at other songs, you'd find similar problems with the other child elements. Under Pressure has two artists, David Bowie and Queen. We Are the World has dozens of artists. Many songs have multiple producers. A garage band without a publisher might record a song and post it on Gnutella in the hope of finding one.

The song schema needs to be adjusted to allow for varying numbers of particular elements. This is done by attaching minOccurs and maxOccurs attributes to each xsd:element element. These attributes specify the minimum and maximum number of instances of the element that may appear at that point in the document. The value of each attribute is an integer greater than or equal to zero. The maxOccurs attribute can also have the value unbounded to indicate that an unlimited number of the particular element may appear. Listing 20-7 demonstrates.

The adjusted schema says that, among its children, every SongType element must have:

At least one, and possibly a great many, COMPOSERs (minOccurs="1" maxOccurs="unbounded")

Any number of PRODUCERs, although possibly no producer at all (minOccurs="0" maxOccurs="unbounded")

Either one PUBLISHER or no PUBLISHER at all (minOccurs="0" maxOccurs="1")

Exactly one LENGTH (minOccurs="1" maxOccurs="1")

Exactly one YEAR (minOccurs="1" maxOccurs="1")

At least one ARTIST, possibly more (minOccurs="1" maxOccurs="unbounded")

An optional PRICE (minOccurs="0" maxOccurs="1")

This is much more flexible and easier to use than the limited ?, *, and + that are available in DTDs. It is very straightforward to say, for example, that you want between four and seven of a given element. Just set minOccurs to 4 and maxOccurs to 7.
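The occurrence constraints listed above can be written out as in the following sketch. It is a reconstruction, not the original listing. The TITLE declaration, with exactly one occurrence, is an assumption, since it does not appear in the list above.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="SONG" type="SongType"/>

  <xsd:complexType name="SongType">
    <xsd:sequence>
      <!-- Assumed: exactly one TITLE -->
      <xsd:element name="TITLE"     type="xsd:string"
                   minOccurs="1" maxOccurs="1"/>
      <!-- One or more composers -->
      <xsd:element name="COMPOSER"  type="xsd:string"
                   minOccurs="1" maxOccurs="unbounded"/>
      <!-- Zero or more producers -->
      <xsd:element name="PRODUCER"  type="xsd:string"
                   minOccurs="0" maxOccurs="unbounded"/>
      <!-- Optional publisher -->
      <xsd:element name="PUBLISHER" type="xsd:string"
                   minOccurs="0" maxOccurs="1"/>
      <xsd:element name="LENGTH"    type="xsd:string"
                   minOccurs="1" maxOccurs="1"/>
      <xsd:element name="YEAR"      type="xsd:string"
                   minOccurs="1" maxOccurs="1"/>
      <!-- One or more artists -->
      <xsd:element name="ARTIST"    type="xsd:string"
                   minOccurs="1" maxOccurs="unbounded"/>
      <!-- Optional price -->
      <xsd:element name="PRICE"     type="xsd:string"
                   minOccurs="0" maxOccurs="1"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

A range such as four to seven occurrences would be expressed the same way, with minOccurs="4" and maxOccurs="7" on the relevant xsd:element.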

If minOccurs and maxOccurs are not present, the default value of each is 1. Taking advantage of this, the song schema can be written a little more compactly, as shown in Listing 20-8.

Listing 20-8: Taking Advantage of the Default Values of minOccurs and maxOccurs

Element content

The examples so far have all been relatively flat. That is, a SONG element contained other elements; but those elements only contained character data, not child elements of their own. Suppose, however, that some child elements do contain other elements, as in Listing 20-9. Here the COMPOSER and PRODUCER elements each contain NAME elements.

Because the COMPOSER and PRODUCER elements now have complex content, you can no longer use one of the built-in types such as xsd:string to declare them. Instead, you have to define a new ComposerType and ProducerType using top-level xsd:complexType elements. Listing 20-10 demonstrates.
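The new types might be defined along the lines of the following sketch. The name NameType and its GIVEN and FAMILY children are assumptions made for illustration, and the enclosing SongType is omitted for brevity.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Inside SongType, COMPOSER and PRODUCER would now be declared with
       type="ComposerType" and type="ProducerType" instead of xsd:string -->

  <xsd:complexType name="ComposerType">
    <xsd:sequence>
      <xsd:element name="NAME" type="NameType"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="ProducerType">
    <xsd:sequence>
      <xsd:element name="NAME" type="NameType"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Hypothetical type for NAME; its children are assumed -->
  <xsd:complexType name="NameType">
    <xsd:sequence>
      <xsd:element name="GIVEN"  type="xsd:string"/>
      <xsd:element name="FAMILY" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```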

Sharing content models

You may have noticed that PRODUCER and COMPOSER are very similar. Each contains a single NAME child element and nothing else. In a DTD, you'd take advantage of this shared content model via a parameter entity reference. In a schema, it's much easier. Simply give them the same type. While you could declare that the PRODUCER has ComposerType or vice versa, it's better to declare that both have a more generic PersonType. Listing 20-11 demonstrates.

Listing 20-11: Using a Single PersonType for Both COMPOSER and PRODUCER

However, the NAME element is only used inside PersonType elements. Perhaps it shouldn't be a top-level definition. For example, you might not want to allow NAME elements to be used as root elements, or to be children of things that aren't PersonType elements. You can prevent this by defining NAME with an anonymous type. To do this, instead of assigning the NAME element a type with a type attribute on the corresponding xsd:element element, you give it an xsd:complexType child element to define its type. Listing 20-12 demonstrates.
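A minimal sketch of the anonymous-type approach follows. The GIVEN and FAMILY children are assumed for illustration, and the rest of the song schema is left out.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <!-- No type attribute here: the type is defined inline and has no name,
           so nothing outside PersonType can refer to it -->
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

Because NAME is declared only inside PersonType, it is not a top-level element and so cannot serve as a document's root.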

Defining the element types inside the xsd:element elements that are themselves children of xsd:complexType elements is a very powerful technique. Among other things, it enables you to give elements with the same name different types when used in different elements. For example, you can say that the NAME of a PERSON contains GIVEN and FAMILY child elements, while the NAME of a MOVIE contains an xsd:string, and the NAME of a VARIABLE contains a string containing only alphanumeric characters from the ASCII character set.
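To illustrate, the following sketch declares two different NAME elements in one schema. MovieType is a hypothetical type introduced only for this example, as are the GIVEN and FAMILY children of a person's NAME.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Inside a person, NAME has element content -->
  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Inside a movie (hypothetical type), NAME is plain text -->
  <xsd:complexType name="MovieType">
    <xsd:sequence>
      <xsd:element name="NAME" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```

The two declarations do not conflict because each is local to its own complex type.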

Mixed content

Schemas offer much greater control over mixed content than DTDs do. In particular, schemas let you enforce the order and number of elements appearing in mixed content. For example, suppose you wanted to allow extra text to be mixed in with the names to provide middle initials, titles, and the like as shown in Listing 20-13.

Caution

The format used here is purely for illustrative purposes. In practice, I'd recommend that you make the middle names and titles separate elements as well.

Its very easy to declare that an element has mixed content in schemas. First, set up the xsd:complexType exactly as you would if the element only contained child elements. Then add a mixed attribute to it with the value true. Listing 20-14 demonstrates. It is almost identical to Listing 20-12 except for the addition of the mixed="true" attribute.

Grouping

So far, all the schemas youve seen have held that order mattered; for example, that it would be wrong to put the COMPOSER before the TITLE or the PRODUCER after the ARTIST. Given these schemas, the document shown in Listing 20-15 is clearly invalid. But should it be? Element order often does matter in narrative documents such as books and web pages. However, its not nearly as important in record-like documents such as the examples in this chapter. Do you really care whether the TITLE comes first or not, as long as there is a TITLE? After all, if the documents going to be shown to a human being, it will probably first be transformed with an XSLT style sheet that can easily place the contents in any order it likes.

Listing 20-15: A Song Document That Places the Elements in a Different Order

The W3C XML Schema Language provides three grouping constructs that specify whether and how ordering of individual elements is important:

The xsd:all group requires that each element in the group must occur at most once, but that order is not important.

The xsd:choice group specifies that any one element from the group should appear. It can also be used to say that between N and M elements from the group should appear in any order.

The xsd:sequence group requires that each element in the group appear exactly once, in the specified order.

Unfortunately, these constructs are not everything you might desire. In particular, you can't specify constraints such as those that would be required to really handle Listing 20-15. For example, you can't specify that you want a SONG to have exactly one TITLE, one or more COMPOSERs, zero or more PRODUCERs, and one or more ARTISTs, but that you don't care in what order the individual elements occur.

The xsd:all Group

You can specify that you want each NAME element to have exactly one GIVEN child and one FAMILY child, but that you don't care what order they appear in. The xsd:all group accomplishes this, as in the following example:

Unfortunately, the W3C XML Schema Language restricts the use of minOccurs and maxOccurs inside xsd:all elements. In particular, each one's value must be 0 or 1. You cannot set it to 4 or 7 or unbounded. Therefore, the preceding type definition is invalid. Furthermore, xsd:all can only contain individual element declarations. It cannot contain xsd:choice or xsd:sequence elements. xsd:all offers somewhat more expressiveness than DTDs do, but probably not as much as you want.
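Within those limits, the NAME case mentioned at the start of this section can still be expressed. The following sketch assumes a PersonType that wraps NAME, and relies on the default value of 1 for minOccurs and maxOccurs.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="PersonType">
    <xsd:sequence>
      <xsd:element name="NAME">
        <xsd:complexType>
          <!-- GIVEN and FAMILY must each appear exactly once,
               but either may come first -->
          <xsd:all>
            <xsd:element name="GIVEN"  type="xsd:string"/>
            <xsd:element name="FAMILY" type="xsd:string"/>
          </xsd:all>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

</xsd:schema>
```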

Choices

The xsd:choice element is the schema equivalent of the | in DTDs. When xsd:element elements are combined inside an xsd:choice, exactly one of those elements must appear in instance documents. For example, the choice in this xsd:complexType requires either a PRODUCER or a COMPOSER, but not both.
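A type of this kind can be sketched as follows. The type name CreditType is hypothetical, and xsd:string is used for both children to keep the example self-contained.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: exactly one of PRODUCER or COMPOSER, never both -->
  <xsd:complexType name="CreditType">
    <xsd:choice>
      <xsd:element name="PRODUCER" type="xsd:string"/>
      <xsd:element name="COMPOSER" type="xsd:string"/>
    </xsd:choice>
  </xsd:complexType>

</xsd:schema>
```

This corresponds to the DTD content model (PRODUCER | COMPOSER).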

The xsd:choice element itself can have minOccurs and maxOccurs attributes that establish exactly how many selections may be made from the choice. For example, setting minOccurs to 1 and maxOccurs to 6 would indicate that between one and six elements listed in the xsd:choice should appear. Each of these can be any of the elements in the xsd:choice. For example, you could have six different elements, three of the same element and three of another, or up to six of the same element. This next xsd:choice allows for any number of artists, composers, and producers. However, in order to require that there be at least one ARTIST element and at least one COMPOSER element, rather than allowing all spaces to be filled by PRODUCER elements, it's necessary to place xsd:element declarations for these two outside the choice. This has the unfortunate side effect of locking in more order than is really needed.
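The arrangement just described might look like the following sketch. The type name CreditsType is hypothetical, and all three elements are assumed to be simple strings.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- CreditsType is a hypothetical name used only in this sketch -->
  <xsd:complexType name="CreditsType">
    <xsd:sequence>
      <!-- Declared outside the choice so that at least one of each is required;
           this is what fixes their position at the front -->
      <xsd:element name="ARTIST"   type="xsd:string"/>
      <xsd:element name="COMPOSER" type="xsd:string"/>
      <!-- Any number of further credits, in any order -->
      <xsd:choice minOccurs="0" maxOccurs="unbounded">
        <xsd:element name="ARTIST"   type="xsd:string"/>
        <xsd:element name="COMPOSER" type="xsd:string"/>
        <xsd:element name="PRODUCER" type="xsd:string"/>
      </xsd:choice>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```

The first ARTIST and the first COMPOSER must come first and in that order, which is exactly the extra ordering the text complains about.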

Sequences

An xsd:sequence element requires each member of the sequence to appear in the same order in the instance document as in the xsd:sequence element. I've used this frequently as the basic group for xsd:complexType elements in this chapter so far. The number of times each element is allowed to appear can be controlled by the xsd:element's minOccurs and maxOccurs attributes. You can also add minOccurs and maxOccurs attributes to the xsd:sequence element itself to specify the number of times the whole sequence should repeat.
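Both kinds of repetition can be sketched in one fragment. The type names SongType and NameListType are hypothetical, and every element is assumed to be a string; the occurrence counts on SongType follow the SONG constraints mentioned earlier, but in a fixed order.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Occurrence counts on the individual elements of a sequence -->
  <xsd:complexType name="SongType">
    <xsd:sequence>
      <xsd:element name="TITLE"    type="xsd:string"/>
      <xsd:element name="COMPOSER" type="xsd:string" maxOccurs="unbounded"/>
      <xsd:element name="PRODUCER" type="xsd:string" minOccurs="0" maxOccurs="unbounded"/>
      <xsd:element name="ARTIST"   type="xsd:string" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- Occurrence counts on the sequence itself:
       one or more GIVEN/FAMILY pairs, always in that order -->
  <xsd:complexType name="NameListType">
    <xsd:sequence minOccurs="1" maxOccurs="unbounded">
      <xsd:element name="GIVEN"  type="xsd:string"/>
      <xsd:element name="FAMILY" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
```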

Simple Types

Until now Ive focused on writing schemas that validate the element structures in an XML document. However, theres also a lot of non-XML structure in the song documents. The YEAR element isnt just a string. Its an integer, and maybe not just any integer either, but a positive integer with four digits. The PRICE element is some sort of money. The LENGTH element is a duration of time. DTDs have absolutely nothing to say about such non-XML structures that are inside the parsed character data content of elements and attributes. Schemas, however, do let you make all sorts of statements about what forms the text inside elements may take and what it means. Schemas provide much more sophisticated semantics for documents than DTDs do.

Listing 20-16 is a new schema for song documents. It's based on Listing 20-8, but read closely and you should notice that a few things have changed.

Did you spot the changes? The values of the type attributes of the LENGTH and YEAR declarations are no longer xsd:string. Instead, LENGTH has the type xsd:duration and YEAR has the type xsd:gYear. These declarations say that it's no longer okay for the YEAR and LENGTH elements to contain just any old string of text. Instead, they must contain strings in particular formats. In particular, the YEAR element must contain a year, and the LENGTH element must contain a recognizable length of time. When you check a document against this schema, the validator will check that these elements contain the proper data. It's not just looking at the elements. It's looking at the content inside the elements!

Lets actually validate hotcop.xml against this schema and see what we get:

Thats unexpected! The problem is that 6:20 is not in the proper format for time durations, at least not the format that the W3C XML Schema Language uses and that schema validators know how to check. Schema validators expect that time types are expressed in the format defined in ISO standard 8601, Representations of dates and times (http://www.iso.ch/iso/en/prods-services/popstds/datesandtime.html). This standard says that time durations should have the form PnYnMnDTnHnMdS, where n is an integer and d is a decimal number. P stands for "period." nY gives the number of years; the first nM gives the number of months; and nD gives the number of days. T separates the date from the time. Following the T, nH gives the number of hours; the second nM gives the number of minutes; and dS gives the number of seconds. If d has a fraction part, the duration can be specified to an arbitrary level of precision.

In this format, a duration of 6 minutes and 20 seconds should be written as P0Y0M0DT0H6M20S. If you prefer, the zero pieces can be left out, so you can write this more compactly as PT6M20S. Listing 20-17 shows the fixed version of hotcop.xml with the LENGTH in the right format.
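As a small illustration of the format (not the full listing), the fragment below assumes a LENGTH element declared with the xsd:duration type, as in the schema discussed above. The instance values come straight from the text and are shown in a comment.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="LENGTH" type="xsd:duration"/>
  <!-- Instance content that fits this declaration:
         <LENGTH>P0Y0M0DT0H6M20S</LENGTH>
         <LENGTH>PT6M20S</LENGTH>
       Instance content that does not:
         <LENGTH>6:20</LENGTH>
  -->
</xsd:schema>
```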

Admittedly, the ISO 8601 format for time durations is a little obtuse, if precise. You may well be asking whether there's a type that you can specify for the LENGTH that would make lengths such as 6:20 and 4:24 legal. In fact, there's no such type built in to the W3C XML Schema Language, but you can define one yourself. You'll learn how to do that soon, but first let's explore some of the other data types that are built in to the W3C XML Schema Language.

There are 44 built-in simple types in the W3C XML Schema Language. These can be unofficially divided into seven groups:

Numeric types

Time types

XML types

String types

The boolean type

The URI reference type

The binary types

Numeric data types

The most obvious data types, and the ones most familiar to programmers, are the numeric data types. Among computer scientists, there's quite a bit of disagreement about how numbers should be represented in computer systems. The W3C XML Schema Language tries to make everyone happy by providing almost every numeric type imaginable, including the following:

Integer and floating point numbers

Finite-size numbers similar to those in Java and C, and infinitely precise, unlimited-size numbers similar to those in Eiffel and Java's java.math package

Signed and unsigned numbers

Youll probably only use a subset of these. For example, you wouldnt use both the arbitrarily large xsd:integer type and the four-byte-limited xsd:int type. Table 20-1 summarizes the different numeric types.

Table 20-1: Schema Numeric Types

Name | Type | Examples
xsd:float | IEEE 754 32-bit floating-point number, or as close as you can get using a base 10 representation; same as Java's float type | -INF, -1E4, -0, 0, 12.78E-2, 12, INF, NaN
xsd:double | IEEE 754 64-bit floating-point number, or as close as you can get using a base 10 representation; same as Java's double type |

Time data types

The next set of simple types the W3C XML Schema Language provides are more familiar to database designers than to procedural programmers; these are the time types. These can represent times of day, dates, or durations of time. The formats, shown in Table 20-2, are all based on the ISO standard 8601, Representations of Dates and Time. Time zones are given as offsets from Coordinated Universal Time (Greenwich Mean Time to lay people) or as the letter Z to indicate Coordinated Universal Time.

Table 20-2: XML Schema Time Types

Name | Type | Examples
xsd:dateTime | A particular moment in Coordinated Universal Time, up to an arbitrarily small fraction of a second |
xsd:duration | A length of time, without fixed endpoints, to an arbitrary fraction of a second | P2000Y10M31DT09H32M7.4312S

Notice, in particular, that in all the date formats the year comes first, followed by the month, the day, the hour, and so on. The largest unit of time is on the left, and the smallest unit is on the right. This helps avoid questions such as whether 20040211 is February 11, 2004, or November 2, 2004.

XML data types

The next batch of schema data types should be quite familiar. These are the types related to XML constructs themselves. Most of these types match attribute types in DTDs such as NMTOKENS or IDREF. The difference is that with schemas these types can be applied to both elements and attributes. These also include four new types related to other XML constructs: xsd:language, xsd:Name, xsd:QName, and xsd:NCName. Table 20-3 summarizes the different types.

For more details on the permissible values for elements and attributes declared to have these types, see Chapters 9 and 11.

String data types

Youve already encountered the xsd:string type. Its the most generic simple type. It requires a sequence of Unicode characters of any length, but this is what all XML element content and attribute values are. There are also two very closely related types: xsd:token and xsd:normalizedString. These are the same as xsd:string, except that a schema aware processor may eliminate some white space from the value before reporting it to the client application. Table 20-4 summarizes the string data types.

Table 20-4: XML Schema String Types

Name | Type | Examples
xsd:string | A sequence of zero or more Unicode characters that are allowed in an XML document; essentially the only forbidden characters are most of the C0 controls, surrogates, and the noncharacters U+FFFE and U+FFFF | p1, p2, 123 45 6789, ^*&^*&_92, red green blue, NT-Decl, seventeen; Mary had a little lamb, The love of money is the root of all Evil., Would you paint the lily? Would you gild gold?
xsd:normalizedString | A string in which all tabs, carriage returns, and linefeeds are replaced by spaces | PIC1, PIC2, PIC3, cow_movie, MonaLisa, Hello World , Warhol, red green
xsd:token | A string in which all tabs, carriage returns, and linefeeds are replaced by spaces, consecutive spaces are compressed to a single space, and leading and trailing white space is trimmed |

Its important to note that none of these three types impose any limits on what values may appear in the instance document. Elements with type xsd:strring, xsd:normalizedString, and xsd:token can all contain tabs, linefeeds, consecutive spaces, and so on. The difference is that for xsd:normalizedString and xsd:token the parser may throw away some of this white space, while it wont for an xsd:string,.

Binary types

Its impossible to include arbitrary binary files in XML documents, because they might contain illegal characters such as a form feed or a null that would make the XML document malformed. Therefore, any such data must first be encoded in legal characters. The W3C XML Schema Language supports two such encodings, xsd:base64Binary and xsd:hexBinary.

Hexadecimal binary encodes each byte of the input as two hexadecimal digits: 00, 01, 02, 03, 04, 05, 06, 07, 08, 09, 0A, 0B, 0C, 0D, 0E, 0F, 10, 11, 12, and so on. Thus, an entire file can be encoded using only the digits 0 through 9 and the letters A through F. (Lowercase letters are also allowed, but uppercase letters are customary.) On the other hand, each byte is replaced by at least two bytes, so this encoding at least doubles the size of the data. In an encoding such as UTF-16, which uses two bytes for each of these characters, it quadruples the size of the data. Clearly, this is not a very efficient encoding. Hexadecimal binary encoded data tends to look like this:

A4E345EC54CC8D52198000FFEA6C807F41F332127323432147A89979EEF3

Base 64 encoding uses a more complex algorithm and a larger character set, 65 ASCII characters chosen for their ability to pass through almost all gateways, mail relays, and terminal servers intact, as well as their existence with the same code points in ASCII, EBCDIC, and most other common character sets. Base 64 encodes every three bytes as four characters, typically only increasing file size by a third in a character set such as UTF-8, so it's somewhat more efficient than xsd:hexBinary. Base-64-encoded data tends to look something like this:

XML Digital Signatures use Base 64 encoding to encode the binary signatures before wrapping them in an XML element.
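To compare the two encodings side by side, the sketch below declares one element of each type. The element names HEXDATA and B64DATA are hypothetical, and the sample values in the comment encode the same five bytes, the ASCII text Hello.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- HEXDATA and B64DATA are hypothetical element names -->
  <xsd:element name="HEXDATA" type="xsd:hexBinary"/>
  <xsd:element name="B64DATA" type="xsd:base64Binary"/>
  <!-- The same five bytes (the ASCII text Hello) in each encoding:
         <HEXDATA>48656C6C6F</HEXDATA>   ten characters for five bytes
         <B64DATA>SGVsbG8=</B64DATA>     eight characters for five bytes
  -->
</xsd:schema>
```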

Caution

I really discourage you from using either of these if at all possible. If you have binary data, it's much more efficient and much less obtuse to link to it using XLink or unparsed entities rather than encoding it in Base 64 or hexadecimal binary.

Miscellaneous data types

There are two types left over that don't fit neatly into the previous categories: xsd:boolean and xsd:anyURI. The xsd:boolean type represents something similar to C++'s bool data type. It has four legal values: 0, 1, true, and false. 0 is considered to be the same as false, and 1 is considered the same as true.

The final schema simple type is xsd:anyURI. An element of this type contains a relative or absolute URI, possibly a URL, such as urn:isbn:0764547607, http://www.w3.org/TR/2000/WD-xmlschema-2-20000407/#timeDuration, /javafaq/reports/JCE1.2.1.htm, /TR/2000/WD-xmlschema-2-20000407/, or ../index.html.

Deriving Simple Types

Youre not limited to the 44 simple types that the W3C XML Schema Language defines. As in object-oriented programming languages, you can create new data types by deriving from the existing types. The most common such derivation is to restrict a type to a subset of its normal values. For example, you can define an integer type that only holds numbers between 1 and 20 by deriving from xsd:positiveInteger. You can create enumerated types that only allow a finite list of fixed values. You can create new types that join together the ranges of existing types through a union. For example, you can derive a type that can hold either an xsd:date or an xsd:int.

New simple types are created by xsd:simpleType elements, just as new complex types are created by xsd:complexType elements. The name attribute of xsd:simpleType assigns a name to the new type by which it can be referred to in xsd:element type attributes. The allowed content of elements and attributes with the new type can be specified by one of three child elements:

xsd:restriction to select a subset of the values allowed by the base type

xsd:union to combine multiple types

xsd:list to specify a list of elements of an existing simple type

Deriving by restriction

To create a new type by restricting from an existing type, give the xsd:simpleType element an xsd:restriction child element. The base attribute of this element specifies what type you're restricting. For example, this xsd:simpleType element creates a new type named phonoYear that's derived from xsd:gYear:
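A minimal sketch of such a declaration, using only the names given in the text (phonoYear and xsd:gYear) and adding no facets yet:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- A restriction with no facets: same values as the base type -->
  <xsd:simpleType name="phonoYear">
    <xsd:restriction base="xsd:gYear"/>
  </xsd:simpleType>
</xsd:schema>
```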

With this declaration, any legal xsd:gYear is also a legal phonoYear, and any illegal year is also an illegal phonoYear. You can limit phonoYear to a subset of the normal year values by using facets to specify which values are and are not allowed. For example, the minInclusive facet defines the minimum legal value for a type. This facet is added to a restriction as an xsd:minInclusive child element. The value attribute of the xsd:minInclusive element sets the minimum allowed value for the year:
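Assuming the phonoYear type and the 1877 lower bound described in the surrounding text, the restriction with a single facet might be sketched like this:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="phonoYear">
    <xsd:restriction base="xsd:gYear">
      <!-- Smallest legal value; the bound itself is allowed -->
      <xsd:minInclusive value="1877"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```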

Here the value of xsd:minInclusive is set to 1877, the year Thomas Edison invented the phonograph. Thus, 1877 is a legal phonoYear, 1878 is a legal phonoYear, 2001 is a legal phonoYear, and 3005 is a legal phonoYear. However, 1876, 1875, 1874, and earlier years are not legal phonoYears, even though they are legal xsd:gYears.

After the phonoYear type has been defined, you can use it just like one of the built-in types. For example, in the SONG schema, you'd declare that the YEAR element has the type phonoYear, like this:

<xsd:element name="YEAR" type="phonoYear"/>

minInclusive is not the only facet you can apply to xsd:gYear. Other facets of xsd:gYear are as follows:

xsd:minExclusiveThe minimum value that all instances must be strictly greater than

xsd:maxInclusiveThe maximum value that all instances must be less than or equal to

xsd:maxExclusiveThe maximum value that all instances must be strictly less than

xsd:enumerationA list of all legal values

xsd:whiteSpaceHow white space is treated within the element

xsd:patternA regular expression to which the instance is compared

Each facet is represented as an empty element inside an xsd:restriction element. Each facet has a value attribute giving the value of that facet. One restriction can contain more than one facet. For example, this xsd:simpleType element defines a phonoYear as any year between 1877 and 2100, inclusive:
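A sketch of a restriction carrying two facets, using the phonoYear name and the 1877 and 2100 bounds given in the text:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="phonoYear">
    <xsd:restriction base="xsd:gYear">
      <!-- Both bounds are inclusive, so 1877 and 2100 are themselves legal -->
      <xsd:minInclusive value="1877"/>
      <xsd:maxInclusive value="2100"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```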

Its possible that multiple facets may conflict. For example, the minInclusive value could be 2100 and the maxInclusive value could be 1877. While this is probably a design mistake, it is syntactically legal. It would just mean that the set of phonoYears was the empty set, and phonoYear type elements could not actually be used in instance documents.

Facets

Facets are shared among many types. For example, the minInclusive facet can constrain essentially any well-ordered type, including not only xsd:gYear, but also xsd:byte, xsd:unsignedByte, xsd:integer, xsd:positiveInteger, xsd:negativeInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:int, xsd:unsignedInt, xsd:long, xsd:unsignedLong, xsd:short, xsd:unsignedShort, xsd:decimal, xsd:float, xsd:double, xsd:time, xsd:dateTime, xsd:duration, xsd:date, xsd:gMonth, xsd:gYearMonth, and xsd:gMonthDay. The complete list of constraining facets that can be applied to different types is as follows:

xsd:minInclusiveThe value that all instances must be greater than or equal to

xsd:minExclusiveThe value that all instances must be strictly greater than

xsd:maxInclusiveThe value that all instances must be less than or equal to

xsd:maxExclusiveThe value that all instances must be strictly less than

xsd:enumerationA list of all legal values

xsd:whiteSpaceHow white space is treated within the element

xsd:patternA regular expression to which the instance is compared

xsd:lengthThe exact number of characters a string, items in a list, or bytes in binary data

xsd:minLengthThe minimum length

xsd:maxLengthThe maximum length

xsd:totalDigitsThe maximum number of digits allowed in the element

xsd:fractionDigitsThe maximum number of digits allowed in the fractional part of the element

Not all facets apply to all types. For example, it doesn't make much sense to talk about the minimum value of an xsd:NMTOKEN or the number of fraction digits in an xsd:gYear. However, when the same facet is shared by different types, it has the same syntax and basic meaning for all the types.

Facets for strings: length, minLength, maxLength

The three length facetsxsd:length, xsd:minLength, and xsd:maxLengthspecify the number of units allowed in a value. For xsd:string and its subtypesxsd:normalizedString, xsd:token, xsd:hexBinary, xsd:base64Binary, xsd:QName, xsd:NCName, xsd:ID, xsd:IDREF, xsd:IDREFS, xsd:language, xsd:anyURI, xsd:ENTITY, xsd:NOTATION, xsd:NOTATIONS, xsd:NMTOKEN, and xsd:NMTOKENSthe units are characters and these facets specify the number of characters allowed in the element or attribute value. For list typesxsd:ENTITIES, xsd:NOTATIONS, and xsd:NMTOKENSthese facets control the number of instances in the list. And finally for the two binary typesxsd:base64Binary and xsd:hexBinarythese control the number of bytes in the decoded value. The value attribute of each of these facets must contain a nonnegative integer. xsd:length sets the exact number of units in the value, whereas xsd:minLength sets the minimum length and xsd:maxLength sets the maximum length.

For example, the schema in Listing 20-18 uses the xsd:minLength and xsd:maxLength facets to derive a new Str255 data type from xsd:string. Whereas xsd:string allows strings of any length from zero on up, Str255 requires each string to have a minimum length of 1 and a maximum length of 255. The schema then assigns this data type to all the names and titles to indicate that each must contain between 1 and 255 characters.

Listing 20-18: A Schema That Derives a Str255 Data Type from xsd:string
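The core of such a schema can be sketched as follows. This is not the full song schema; the Str255 name and the two bounds come from the text, and the TITLE declaration is included only as one assumed example of use.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="Str255">
    <xsd:restriction base="xsd:string">
      <!-- At least 1 and at most 255 characters -->
      <xsd:minLength value="1"/>
      <xsd:maxLength value="255"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- One assumed use of the new type -->
  <xsd:element name="TITLE" type="Str255"/>
</xsd:schema>
```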

The whiteSpace facet

The whiteSpace facet is unusual. Unlike the other 11 facets, xsd:whiteSpace does not in any way constrain the allowed content of elements. Instead, it suggests what the application should do with any white space that it finds in the instance document. It says how significant that white space is. However, it does not say that any particular kind of white space is legal or illegal.

The xsd:whiteSpace facet has three possible values:

preserveThe white space in the input document is unchanged.

replaceEach tab, carriage return, and linefeed is replaced with a single space.

collapseEach tab, carriage return, and linefeed is replaced with a single space. Furthermore, after this replacement is performed, all runs of multiple spaces are condensed to a single space. Leading and trailing white space is deleted.

Again, these are all just hints to the application. None of them have any effect on validation.

The whiteSpace facet can only be applied to xsd:string, xsd:normalizedString, and xsd:token types. Furthermore, it only fully applies to elements. XML 1.0 requires that parsers replace all white space in attributes, and collapse white space in attributes whose DTD type is anything other than CDATA, regardless of what the schema says.

The schema in Listing 20-19 uses the xsd:whiteSpace facet to derive a new CollapsedString data type from xsd:string. Then it assigns this data type to all the names and titles to indicate that white space should be collapsed in these elements.
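The type derivation at the heart of such a schema might be sketched like this. The CollapsedString name comes from the text; the TITLE declaration is only an assumed example of use.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="CollapsedString">
    <xsd:restriction base="xsd:string">
      <!-- One of the three legal values: preserve, replace, collapse -->
      <xsd:whiteSpace value="collapse"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- One assumed use of the new type -->
  <xsd:element name="TITLE" type="CollapsedString"/>
</xsd:schema>
```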

Facets for decimal numbers: totalDigits and fractionDigits

When you are formatting numbers, it's useful to be able to specify how many digits should be used in the entire number, the integer parts, and the fraction parts. Schemas don't go as far in this regard as the printf() function in C or the java.text.DecimalFormat class in Java, but they do offer you some control.

The xsd:totalDigits facet specifies the maximum number of decimal digits in a number. It applies to most numeric types including xsd:byte, xsd:unsignedByte, xsd:integer, xsd:positiveInteger, xsd:negativeInteger, xsd:nonNegativeInteger, xsd:nonPositiveInteger, xsd:int, xsd:unsignedInt, xsd:long, xsd:unsignedLong, xsd:short, xsd:unsignedShort, and xsd:decimal. The only exceptions are the IEEE 754 types that occupy a fixed number of bytes; that is, xsd:float and xsd:double. The value of this facet must be a positive integer.

The xsd:fractionDigits facet specifies the maximum number of decimal digits to the right of the decimal point. (There is no facet that allows you to specify the minimum number of digits or fraction digits.) This only really applies to xsd:decimal. Technically, it applies to all the integer types too, but for those types it's fixed to the value zero; that is, no fraction digits at all. You're only allowed to change it for xsd:decimal. The value of this facet must be a nonnegative integer.
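As a sketch of the two digit facets working together, the following derives a hypothetical AmountType from xsd:decimal. The type name and the limits of six total digits and two fraction digits are assumptions chosen for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- AmountType and its limits are hypothetical -->
  <xsd:simpleType name="AmountType">
    <xsd:restriction base="xsd:decimal">
      <xsd:totalDigits value="6"/>     <!-- at most six digits overall -->
      <xsd:fractionDigits value="2"/>  <!-- at most two after the point -->
    </xsd:restriction>
  </xsd:simpleType>
  <!-- Fits: 1.25, 1234.56, 1.1
       Does not fit: 1.999 (three fraction digits), 12345.67 (seven digits) -->
</xsd:schema>
```

Because both facets are maximums, a value such as 1.1 is still accepted; these facets alone cannot demand exactly two decimal places.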

The enumeration facet

Rather than setting some sort of range on legal values, the xsd:enumeration facet simply lists all allowed values. It applies to every simple type except xsd:boolean. The syntax is a little unusual. Each possible value gets its own xsd:enumeration element as a child of the xsd:restriction element.

Listing 20-20 uses an enumeration to derive a PublisherType from xsd:string. It requires that the publisher be one of the oligopoly that controls 90 percent of all U.S. music (Warner-Elektra-Atlantic, Universal Music Group, Sony Music Entertainment, Inc., Capitol Records, Inc., and BMG Music).

Listing 20-20: A Schema That Uses an Enumeration to Derive a Type from xsd:string
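The enumerated type itself can be sketched as follows, using the PublisherType name and the five publishers named in the text. The PUBLISHER element declaration is an assumed example of use, and the rest of the song schema is omitted.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="PublisherType">
    <xsd:restriction base="xsd:string">
      <!-- One xsd:enumeration child per legal value -->
      <xsd:enumeration value="Warner-Elektra-Atlantic"/>
      <xsd:enumeration value="Universal Music Group"/>
      <xsd:enumeration value="Sony Music Entertainment, Inc."/>
      <xsd:enumeration value="Capitol Records, Inc."/>
      <xsd:enumeration value="BMG Music"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- One assumed use of the new type -->
  <xsd:element name="PUBLISHER" type="PublisherType"/>
</xsd:schema>
```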

xsd:string is far from the only type you can derive from via enumeration. You can derive from xsd:int, xsd:NMTOKEN, xsd:date, and, indeed, from all simple types except xsd:boolean. Of course, the enumerated values all have to be legal instances of the base type.

The pattern facet

Theres one element in the song examples that clearly deserves a data type, but so far doesnt have onePRICE. However none of the built-in data types really match the format for prices. Recall that PRICE elements look like this:

<PRICE>$1.25</PRICE>

This isnt an integer of any kind, because it has a decimal point. It could be a floating-point number, but that wouldnt account for the currency sign. You could drop off the currency sign, like this:

<PRICE>1.25</PRICE>

However, then youd have to assume you were working in dollars. What if you wanted to sell songs priced in pounds or yen or euros? Perhaps you could make the currency sign part of a separate element, like this:

<PRICE>
<CURRENCY>$</CURRENCY>
<AMOUNT>1.25</AMOUNT>
</PRICE>

AMOUNT could be an xsd:float, and CURRENCY could be an xsd:string. However, this still isn't perfect. You want to limit the CURRENCY to exactly one character, and that character must be a currency sign. You don't want to allow it to contain any arbitrary string. Furthermore, you'd like to limit the precision of the AMOUNT to exactly two decimal places. You probably don't want to sell songs that cost $1.1 or $1.99999.

The solution to this problem, and to many similar problems where the values you want to allow don't quite fit any of the existing types, is to use the xsd:pattern facet, whose value attribute contains a regular expression that matches all legal values and doesn't match any illegal values.

The regular expressions used in schemas are similar to the regular expressions you might be familiar with from Perl, grep, or other languages. You use statements like [A-Z]+ to mean "a string containing one or more of the capital letters from A to Z" or (club)* to mean "a string composed of zero or more repetitions of the word club."

Table 20-5 summarizes the grammar of XML schema regular expressions. In this table A and B represent some string or another regular expression particle from elsewhere in the table; that is, they will be replaced by something else when actually used in a regular expression. n and m represent some integer that will be replaced by a specific number.

Table 20-5: Regular Expression Symbols for XML Schema

Symbol | Meaning
A? | Zero or one occurrences of A
A* | Zero or more occurrences of A
A+ | One or more occurrences of A
A{n,m} | Between n and m occurrences of A
A{n} | Exactly n occurrences of A
A{n,} | At least n occurrences of A
A|B | Either A or B
AB | A followed by B
. | Any one character
\p{A} | One character from Unicode character class A
[abcdefg] | A single occurrence of any of the characters contained in the brackets
[^abcdefg] | A single occurrence of any of the characters not contained in the brackets
[a-z] | A single occurrence of any character from a to z inclusive
[^a-z] | A single occurrence of any character except those from a to z inclusive
\n | Linefeed
\r | Carriage return
\t | Tab
\\ | Backward slash \
\| | Vertical bar |
\. | Period .
\- | Hyphen -
\^ | Caret ^
\? | Question mark ?
\* | Asterisk *
\+ | Plus sign +
\{ | Open brace {
\} | Closing brace }
\( | Open parenthesis (
\) | Closing parenthesis )
\[ | Open bracket [
\] | Closing bracket ]

For the most part, these symbols have exactly the same meanings that they have in Perl. The schema regular expression syntax is somewhat weaker than Perl's, but then whose isn't? In any case, this should be sufficient power to meet any reasonable needs that schemas have.

Schema regular expressions do have one important feature that isn't available prior to Perl 5.6 and is unfamiliar to most developers: you can use \p{} to stand in for a character in a particular Unicode character class. For example, N is the Unicode character class for numbers. This doesn't just include the European digits 0 through 9, but also the Arabic-Indic digits, the Devanagari digits, the Thai digits, and many more besides. Therefore, \p{N} represents any digit defined anywhere in Unicode. \p{N}+ represents a string consisting of one or more Unicode digits. Table 20-6 lists the various Unicode character classes you can take advantage of in regular expressions. For the money regular expression, you need the Sc class for currency indicators and the Nd class for decimal digits. This is a little more restrictive than the N class, which includes nondecimal digits, such as the Roman numerals and the Han ideograph representing 100,000,000.

Table 20-6: Unicode Character Classes

Abbreviation | Includes | Examples
Letters
L | All letters | a, b, c, A, B, C, ü, Ü, ç, Ç, ζ, θ, Ζ, Θ, а, б, в, А, Б, В, א, ב, ג, dz, Dz, DZ
Lu | Uppercase letters | A, B, C, Ü, Ç, Ζ, Θ, А, Б, В, DZ
Ll | Lowercase letters | a, b, c, ü, ç, ζ, θ, а, б, в, dz
Lt | Title case letters | Dz
Lm | Modifier letters; letters that are attached to the previous characters somehow | ʰ, ʲ, ʳ, ʷ
Lo | Other letters; typically ones from languages that don't distinguish upper- and lowercase | א, ב, ג, Japanese Katakana and Hiragana, most Han ideographs
Marks
M | All marks |
Mn | Nonspacing marks; mostly accent marks that are attached to the previous character on the top or bottom, and thus do not change the amount of space the character occupies | `, ', ¨, ¯
Mc | Spacing combining marks; accent marks that are attached to the previous character on the left or right, and thus do change the amount of space the character occupies | Gurmukhi vowel sign AA
Me | Enclosing marks that completely surround a character | The Cyrillic hundred-thousands and millions signs
Numbers
N | All numbers | 0, 1, 2, 3, ٠, ٩, I, II, III, IV, V, 〡, 〢, 〣, 〤
Nd | Decimal digits; characters that represent one of the numbers 0 through 9 |
Cf | | The left-to-right and right-to-left marks used to indicate change of direction in bidirectional text
Co | Private use characters; code points that may be used for a program's internal purposes |
Cn | Unassigned; code points that, while legal in XML, the Unicode specification has not yet assigned a character to |

Youre now ready to put together a regular expression that describes money strings such as $1.25. What you want to say is that each such string contains the following:

A currency symbol

One or more decimal digits

An optional fractional part, which, if present at all, consists of a decimal point and two decimal digits

Heres the regular expression that says that

\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?

It begins with \p{Sc} to indicate a currency symbol such as $, ¥, £, or €.

This is followed by \p{Nd}+. \p{Nd} represents any decimal digit character. The + indicates one or more of these characters.

Next theres a parenthesized expression followed by a question mark, (\.\p{Nd}\p{Nd})?. The question mark indicates the parenthesized expression is optional. However, if it does appear, its entire contents must be present, not just part. In other words, the question mark stands for zero or one, just as it does in DTDs. The contents of the parentheses are \.\p{Nd}\p{Nd}, which represents a period followed by two decimal digits, for example .35. Normally a period in a regular expression means any character at all, so here its escaped with a preceding backslash to indicate that we really do want the actual period character.

Now that you have a regular expression that represents money, you're ready to define a money type. As for the other facets, this is done with the xsd:simpleType and xsd:restriction elements. Putting these together with the regular expression produces this type definition:
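A sketch of such a definition, assuming the type is named money and is derived from xsd:string; the PRICE declaration shows one assumed use.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="money">
    <xsd:restriction base="xsd:string">
      <!-- A currency sign, then one or more decimal digits,
           then optionally a period and exactly two decimal digits -->
      <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- One assumed use of the new type; content such as $1.25 fits the pattern -->
  <xsd:element name="PRICE" type="money"/>
</xsd:schema>
```

A schema pattern has to match the whole value, so no explicit start or end anchors are needed.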

Listing 20-21 provides the complete song schema, including this type definition. Take special note of the XML comment used to elucidate the regular expression. Regular expressions can be quite opaque, and a comment like this one can go a long way toward making the schema more comprehensible.

Unions

Restriction is not the only way to create a new simple type, although it is the most common way. You can also combine types using unions. For example, you could combine the built-in xsd:decimal type with the money type just defined to create a type that could contain either a decimal or a money value. To do this, give the xsd:simpleType element an xsd:union child element instead of an xsd:restriction child element. The xsd:union element contains more xsd:simpleType elements identifying the types you're combining in the union. For example, this is the previously described money/xsd:decimal combined type:
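Such a union might be sketched as follows. The name MoneyOrDecimal is hypothetical, and the money type is assumed to be the pattern-based type described earlier, repeated here so the fragment stands on its own.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Assumed definition of the money type from the pattern discussion -->
  <xsd:simpleType name="money">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\p{Sc}\p{Nd}+(\.\p{Nd}\p{Nd})?"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- MoneyOrDecimal is a hypothetical name for the combined type -->
  <xsd:simpleType name="MoneyOrDecimal">
    <xsd:union>
      <xsd:simpleType>
        <xsd:restriction base="xsd:decimal"/>
      </xsd:simpleType>
      <xsd:simpleType>
        <xsd:restriction base="money"/>
      </xsd:simpleType>
    </xsd:union>
  </xsd:simpleType>
</xsd:schema>
```

An element of this type could then hold either 1.25 or $1.25.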

A list type such as YearList, derived from xsd:gYear with an xsd:list element, requires that elements with type YearList contain a white space-separated list of legal xsd:gYear values.
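A minimal sketch of a list type like YearList, assuming it is built with xsd:list and an item type of xsd:gYear. The YEARS element name and the sample years are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="YearList">
    <!-- Each white space-separated item must be a legal xsd:gYear -->
    <xsd:list itemType="xsd:gYear"/>
  </xsd:simpleType>

  <!-- YEARS is a hypothetical element name; example content:
         <YEARS>1987 1992 1999 2002</YEARS>
  -->
  <xsd:element name="YEARS" type="YearList"/>
</xsd:schema>
```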

Caution

I must admit that I'm not very fond of list types, especially for elements. It seems to me that if you're going to have a list of different items, each of those items should be a separate element, possibly a child element of some parent element, but still its own element. Lists make a little more sense for attributes, but if there's a lot of substructure in the text, you should probably be using an element instead of an attribute anyway.

You can derive another list type from an existing list type. When so doing, you can restrict it according to the length, minLength, maxLength, and enumeration facets. In this case, the values of the three length facets refer to the number of items in the list rather than the number of characters in the content. For example, this xsd:simpleType element derives a DoubleYear list type that must hold exactly two years from the YearList type previously defined:
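A sketch of this derivation follows. The YearList definition is included so the fragment stands alone; it assumes YearList is a plain xsd:list whose itemType is xsd:gYear.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Assumed definition: a white space-separated list of years -->
  <xsd:simpleType name="YearList">
    <xsd:list itemType="xsd:gYear"/>
  </xsd:simpleType>

  <!-- length counts list items here, not characters -->
  <xsd:simpleType name="DoubleYear">
    <xsd:restriction base="YearList">
      <xsd:length value="2"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>
```

With these definitions, the content 1999 2001 would be a legal DoubleYear, while 1999 alone or 1999 2000 2001 would not.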

Empty Elements

Empty elements are those that cannot contain any child elements or parsed character data. This is the same as using the EMPTY content model in a DTD. As an example of this technique, I'll define an empty PHOTO element. This will be used in the next section when attributes are introduced.

To create an empty element, you define it as a type but don't give it an xsd:sequence, xsd:all, or xsd:choice child. Thus, you don't actually provide any child elements. For example:
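A minimal sketch of such a declaration, assuming the type is named PhotoType as in the attribute examples, might look like this:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- No xsd:sequence, xsd:all, or xsd:choice child, so no content is allowed -->
  <xsd:complexType name="PhotoType">
  </xsd:complexType>

  <xsd:element name="PHOTO" type="PhotoType"/>

</xsd:schema>
```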

This does not require the PHOTO element to be defined with an empty-element tag such as <PHOTO/>. The start-tag-end-tag pair <PHOTO></PHOTO> is also acceptable. In fact, the XML 1.0 specification says these two forms are equivalent. Schemas change nothing about XML 1.0. An XML 1.0 parser that knows nothing about schemas will have no trouble reading a document that uses schemas.

Attributes

In the examples so far, two XML constructs have been conspicuous by their absence: entities and attributes. The omission of entities was quite deliberate. Schemas cannot declare entities. If you need entities, you must use a DTD. (Of course, you can use a schema as well as the DTD.) However, schemas are fully capable of declaring attributes. Indeed, they do a much better job of it than DTDs do because schemas can use the full set of data types like xsd:float and xsd:anyURI.

Note

You may not have noticed my avoidance of attributes, because the examples all used xmlns:xsi and xsi:noNamespaceSchemaLocation attributes on the root element. However, as far as a schema validator is concerned, attributes used to declare namespaces, or to attach documents to schemas, "don't count." You do not have to, and indeed should not, declare these attributes. However, you do have to declare all the other attributes you use.

As a concrete example, lets consider how you might add an empty PHOTO element to the SONG documents. This element would be similar to the IMG element in HTML and would have an SRC attribute that contained a URL pointing to the photos location, an ALT attribute containing some text in the event that the PHOTO cant be displayed, and WIDTH and HEIGHT attributes that together give the size of the image in pixels. Listing 20-22 demonstrates.

Listing 20-22: The PHOTO Element Has Several Attributes of Different Types

Even though the PHOTO element is empty, because it has attributes, it has a complex type. You define a PhotoType just as you previously defined a PersonType and a SongType. However, where those types used xsd:element to declare child elements, this type will use xsd:attribute to declare attributes.

Because the SRC attribute should contain a URL, it's been given the type xsd:anyURI. Because the HEIGHT and WIDTH attributes should each be an integer greater than zero, they're given the type xsd:positiveInteger. Finally, because the ALT attribute can contain essentially any string of text of any length, it's set to the most general type, xsd:string.
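The attribute declarations just described can be sketched as the following fragment. It shows only the PHOTO element and its type, not the rest of the song schema.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="PHOTO" type="PhotoType"/>

  <!-- An empty element with attributes still needs a complex type -->
  <xsd:complexType name="PhotoType">
    <xsd:attribute name="SRC"    type="xsd:anyURI"/>
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger"/>
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger"/>
    <xsd:attribute name="ALT"    type="xsd:string"/>
  </xsd:complexType>

</xsd:schema>
```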

In this particular example, all the elements either have child elements or attributes, not both. However, that's certainly not required. In general, elements can have both child elements and attributes. Just use both xsd:element and xsd:attribute in the same xsd:complexType element. The xsd:attribute elements must come after the xsd:sequence, xsd:choice, or xsd:all group that forms the body of the element. For example, this xsd:element says that a PERSON element can have an optional attribute named ID with type ID:
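A sketch of such a declaration follows. The single NAME child is an assumption standing in for whatever content model PERSON really has; the point is the position of xsd:attribute after the xsd:sequence group.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="PERSON">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Assumed child element, for illustration only -->
        <xsd:element name="NAME" type="xsd:string"/>
      </xsd:sequence>
      <!-- Attributes are declared after the content group; optional by default -->
      <xsd:attribute name="ID" type="xsd:ID"/>
    </xsd:complexType>
  </xsd:element>

</xsd:schema>
```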

Attributes can also be attached to elements that can only contain text such as an xsd:string or an xsd:gYear. The details are a little more complex, because an element with attributes by definition has a complex type. To make this work, you derive a new complex type from a simple type by giving the xsd:complexType element an xsd:simpleContent child element instead of an xsd:sequence, xsd:choice, or xsd:all. The xsd:simpleContent element itself has an xsd:extension child element whose base attribute identifies the simple type to extend such as xsd:string. The xsd:attribute elements are placed inside the xsd:extension element.

For example, suppose you want to allow the TITLE elements to have ID attributes, like this:

<TITLE ID="test">Yes I Am</TITLE>

Previously, TITLE was defined with type xsd:string. Instead, let's derive a new type called StringWithID from xsd:string, like this:
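Following the recipe given above (xsd:simpleContent containing an xsd:extension), the derivation might be sketched like this. Giving the ID attribute the type xsd:ID is an assumption.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="StringWithID">
    <xsd:simpleContent>
      <!-- The text content is still an xsd:string -->
      <xsd:extension base="xsd:string">
        <!-- Assumed attribute type -->
        <xsd:attribute name="ID" type="xsd:ID"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

</xsd:schema>
```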

The StringWithID type can then be applied to the TITLE element in the usual way, like this:

<xsd:element name="TITLE" type="StringWithID"/>

By default, attributes declared in schemas are optional (#IMPLIED in DTD terminology). However, an xsd:attribute can have a use attribute with the value required to indicate that the attribute must occur. In this case, you probably do want to insist that each of the four attributes be present. Therefore, the declaration of PhotoType becomes this:
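A sketch of PhotoType with all four attributes made mandatory might read:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:complexType name="PhotoType">
    <!-- use="required" plays the role of #REQUIRED in a DTD -->
    <xsd:attribute name="SRC"    type="xsd:anyURI"          use="required"/>
    <xsd:attribute name="WIDTH"  type="xsd:positiveInteger" use="required"/>
    <xsd:attribute name="HEIGHT" type="xsd:positiveInteger" use="required"/>
    <xsd:attribute name="ALT"    type="xsd:string"          use="required"/>
  </xsd:complexType>

</xsd:schema>
```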

The use attribute can also have the value optional to indicate that the attribute may or may not be present. (This is also the default if there is no use attribute.) If optional, xsd:attribute may also have a default attribute giving the value the parser will provide if it doesn't find one in the instance document. If there is no default attribute, this is the same as #IMPLIED in ATTLIST declarations in DTDs. Instead of a use attribute, xsd:attribute can have a fixed attribute whose value is the constant value for the attribute, whether present in the instance document or not. This has the same effect as #FIXED in DTDs. Listing 20-23 puts this all together in a complete schema for songs, including a PHOTO element with several required attributes.
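The default and fixed attributes can be sketched as follows. The type name, the attribute names, and the values are all hypothetical, chosen only to show the syntax.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type illustrating default and fixed values -->
  <xsd:complexType name="ThumbnailType">
    <!-- Optional; the validator supplies 100 when WIDTH is omitted -->
    <xsd:attribute name="WIDTH" type="xsd:positiveInteger" default="100"/>
    <!-- May be omitted, but if present its value must be JPEG -->
    <xsd:attribute name="FORMAT" type="xsd:string" fixed="JPEG"/>
  </xsd:complexType>

</xsd:schema>
```

A single xsd:attribute cannot carry both default and fixed, and a default value only makes sense on an optional attribute.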

Namespaces

So far, the example song documents have been blissfully namespace-free. Adding namespaces to the documents and designing a schema that applies to the namespace-qualified documents is not particularly difficult. Namespaces add some important features, such as the ability to write schemas and validate documents that use elements and attributes from multiple XML applications. However, the terminology is a little confusing. Some words, such as qualified, don't mean quite the same thing in schemas as they do in other XML technologies, so you do need to pay close attention and read what follows carefully.

Schemas for default namespaces

Lets begin with a simple example in which the XML application described by the schema uses a single default, nonprefixed namespace. Most of the time each namespace URI maps to exactly one schema (though later youll learn several techniques to break large schemas into parts using xsd:import and xsd:include).

The schema for elements that are not in any namespace is identified by an xsi:noNamespaceSchemaLocation attribute. The schemas for elements that are in namespaces are identified by an xsi:schemaLocation attribute. This attribute contains a list of namespace URI/schema URI pairs. Each namespace URI is followed by one schema URI. The namespace URI is almost always absolute, but the schema URI is almost always a URL and often a relative URL.

Listing 20-24 demonstrates. This is the familiar hotcop.xml document that you've seen several times already, though it's been simplified a bit to keep the examples smaller. All the elements in this document are in the http://ns.cafeconleche.org/song namespace defined by the xmlns attribute on the root element. The attributes in this document are not in any namespace because they don't have prefixes. There are two things you need to remember here:

Attributes without prefixes are never in any namespace, no matter what namespace their parent element is in, and no matter what default namespace the document uses.

For purposes of schema validation, namespace declaration attributes, such as xmlns and xmlns:xsi, and schema attachment attributes, such as xsi:schemaLocation, don't count. You do not need to declare these in your schema.

In this case, all the elements are in the http://ns.cafeconleche.org/song namespace, so an xsi:schemaLocation attribute is needed to associate this namespace with a URL where the schema can be found, namespace_song.xsd for this example.

Listing 20-24: A SONG Document in the http://ns.cafeconleche.org/song Namespace
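A minimal sketch of such a document might look like the following. The child elements, attribute values, and file name hotcop.jpg are assumptions for illustration; what matters is the default namespace declaration and the namespace URI/schema URI pair in xsi:schemaLocation.

```xml
<?xml version="1.0"?>
<SONG xmlns="http://ns.cafeconleche.org/song"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://ns.cafeconleche.org/song namespace_song.xsd">
  <!-- Illustrative content only -->
  <TITLE>Hot Cop</TITLE>
  <!-- Unprefixed attributes such as SRC are in no namespace -->
  <PHOTO SRC="hotcop.jpg" ALT="Album cover" WIDTH="100" HEIGHT="200"/>
</SONG>
```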

The first xmlns attribute establishes the default namespace for this schema, which is, after all, an XML document itself. It sets the namespace to http://ns.cafeconleche.org/song, the same as in the instance documents you're trying to model. This says that the unprefixed names used in this schema, such as the type name PhotoType, are in the http://ns.cafeconleche.org/song namespace.

The second attribute, targetNamespace, says that this schema applies to documents in the http://ns.cafeconleche.org/song namespace; that is, the elements identified by name attributes such as SONG, PHOTO, and TITLE are in the http://ns.cafeconleche.org/song namespace.

The third attribute, elementFormDefault, has the value qualified. This means that the elements being described in this document are in fact in a namespace; specifically, they're in the target namespace given by the targetNamespace attribute. This does not mean that the elements being modeled necessarily have prefixes, merely that they are in some namespace.

Finally, the fourth attribute, attributeFormDefault, has the value unqualified. This means that the attributes described by this schema are not in a namespace.
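The four attributes just described might appear on the xsd:schema root element as sketched below. The xmlns:xsd declaration that binds the schema vocabulary itself is listed first, and only the PHOTO declaration is shown in the body, as an assumed excerpt.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://ns.cafeconleche.org/song"
            targetNamespace="http://ns.cafeconleche.org/song"
            elementFormDefault="qualified"
            attributeFormDefault="unqualified">

  <!-- type="PhotoType" is unprefixed, so it resolves via the default namespace -->
  <xsd:element name="PHOTO" type="PhotoType"/>

  <xsd:complexType name="PhotoType">
    <!-- Unqualified: SRC and ALT are in no namespace in instance documents -->
    <xsd:attribute name="SRC" type="xsd:anyURI" use="required"/>
    <xsd:attribute name="ALT" type="xsd:string" use="required"/>
  </xsd:complexType>

</xsd:schema>
```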

Schemas have one major advantage over DTDs when you are working with documents with namespaces. They validate against the local name and the namespace URIs of the elements and attributes, not the prefix and the local name as DTDs do. This means the prefixes do not have to match in the schema and in the instance documents. Indeed, one might use prefixes and the other might use the default namespace.

For example, consider Listing 20-26. This is the same as Listing 20-24 except that it uses the song prefix rather than the default namespace to indicate the http://ns.cafeconleche.org/song namespace. However, it can use the exact same schema! The schema does not need to change just because the prefix (or lack thereof) has changed. As long as the namespace URI stays the same, the schema is happy.

Listing 20-26: A SONG Document in the http://ns.cafeconleche.org/song Namespace with Prefixes
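A sketch of the prefixed form, with assumed content, might look like this. The elements carry the song prefix, but the attributes stay unprefixed and the schema location pair is unchanged.

```xml
<?xml version="1.0"?>
<song:SONG xmlns:song="http://ns.cafeconleche.org/song"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://ns.cafeconleche.org/song namespace_song.xsd">
  <!-- Illustrative content only -->
  <song:TITLE>Hot Cop</song:TITLE>
  <song:PHOTO SRC="hotcop.jpg" ALT="Album cover" WIDTH="100" HEIGHT="200"/>
</song:SONG>
```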

Multiple namespaces, multiple schemas

Now, consider the case in which one document mixes markup from different vocabularies. In particular, suppose that you want to use XLink to connect the PHOTO element to the actual JPEG image rather than application-specific markup such as SRC. You need to set xlink:type, xlink:href, xlink:show, and xlink:actuate attributes on the PHOTO element to give it the proper meaning and behavior, like this:
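A PHOTO element marked up this way might be sketched as follows. The attribute values are typical XLink settings for embedding an image, and the file name and dimensions are assumptions; the namespace declarations are repeated on the element only so the fragment is well-formed on its own.

```xml
<PHOTO xmlns="http://ns.cafeconleche.org/song"
       xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:show="embed"
       xlink:actuate="onLoad"
       xlink:href="hotcop.jpg"
       ALT="Album cover" WIDTH="100" HEIGHT="200"/>
```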

Now the document uses two main namespaces, the http://ns.cafeconleche.org/song namespace for songs and the http://www.w3.org/1999/xlink namespace for XLinks. Thus, it needs two schemas. However, because the root element can have only one xsi:schemaLocation attribute, it has to serve double duty and declare both. Listing 20-27 demonstrates.
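A sketch of a root element whose xsi:schemaLocation attribute lists both pairs follows. The schema file names are assumptions, as is the document content.

```xml
<?xml version="1.0"?>
<SONG xmlns="http://ns.cafeconleche.org/song"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://ns.cafeconleche.org/song namespace_song.xsd
                          http://www.w3.org/1999/xlink xlink.xsd">
  <!-- Illustrative content only -->
  <TITLE>Hot Cop</TITLE>
  <PHOTO xlink:type="simple" xlink:show="embed" xlink:actuate="onLoad"
         xlink:href="hotcop.jpg"
         ALT="Album cover" WIDTH="100" HEIGHT="200"/>
</SONG>
```

The attribute value is simply a white space-separated list, so each namespace URI is followed immediately by the location of its schema.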

Listing 20-28 shows the XLink schema. It only declares attributes, no elements at all. You haven't seen an example of this yet, but it's not hard. Just use xsd:attribute elements at the top level, that is, as direct children of the xsd:schema element. The other difference between these top-level xsd:attribute elements and the ones you've seen before is that three of the attributes have fixed values and don't even need to be explicitly included in the instance document. Only the xlink:href attribute asks the author to supply a value. However, this is rather specific to this particular use of XLink. Almost anything else you'd do with an XLink other than embedding an image or other non-XML content into the document would require a different schema that used different defaults.

This schema doesn't actually apply these attributes to any elements. Therefore, the schema that does describe the PHOTO element needs to import xlink.xsd in order to reference these declarations. This is done with an xsd:import element. The xsd:import's schemaLocation attribute tells the processor where to find the schema to import. The namespace attribute says which elements and attributes the schema declares. After this schema has been imported, you can add those attributes to any xsd:complexType by giving it an xsd:attribute child whose ref attribute identifies the attribute to be attached. Listing 20-29 demonstrates.
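An attribute-only schema of this kind might be sketched as follows. The fixed values and the xsd:string types are assumptions suited to embedding an image.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.w3.org/1999/xlink">

  <!-- Top-level attributes belong to the target namespace -->
  <xsd:attribute name="type"    type="xsd:string" fixed="simple"/>
  <xsd:attribute name="show"    type="xsd:string" fixed="embed"/>
  <xsd:attribute name="actuate" type="xsd:string" fixed="onLoad"/>

  <!-- The only attribute whose value the author must choose -->
  <xsd:attribute name="href" type="xsd:anyURI"/>

</xsd:schema>
```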

Annotations

At some point in this chapter, it's likely to have occurred to you that schemas can get rather large and complex. If that hasn't occurred to you yet, just imagine a schema not for the very small and simple song documents demonstrated in this chapter, but for much larger XML applications such as Scalable Vector Graphics or XHTML.

You can certainly use regular XML comments to describe schemas, and I encourage you to do so, especially when you're doing something less than obvious in the schema. The W3C XML Schema Language also provides a more formal mechanism for annotating schemas. Both the top-level xsd:schema element itself and the various other schema elements (xsd:complexType, xsd:all, xsd:element, xsd:attribute, and so on) can contain xsd:annotation child elements that describe that part of the schema for human readers or for other computer programs. This element has two kinds of child elements:

The xsd:documentation child element describes the schema for human readers. It often contains copyright and similar information.

The xsd:appinfo child element (note the all-lowercase spelling) describes the schema for computer programs. For example, it might contain instructions about what style sheets to apply to the schema.

Each xsd:annotation element can contain any number of either of these. However, no special syntax has been defined for the content of these elements. You can put anything in there you find convenient, including other XML markup, subject only to the usual well-formedness constraints. Thus, an xsd:documentation element might contain XHTML, and an xsd:appinfo element might contain XSLT. Then again, either or both might simply contain plain, unmarked-up text. For example, this annotation could be added to the song schemas developed in this chapter:
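A sketch of such an annotation follows. The wording of the documentation is purely illustrative.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:annotation>
    <xsd:documentation>
      Illustrative text: this schema describes SONG documents,
      including the TITLE and PHOTO elements used in the examples.
    </xsd:documentation>
  </xsd:annotation>

  <xsd:element name="TITLE" type="xsd:string"/>

</xsd:schema>
```

At the top level, xsd:annotation may appear among the children of xsd:schema; inside other schema elements such as xsd:element or xsd:complexType it must come first.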

Summary

Schemas address a number of perceived limitations of DTDs, including a strange, non-XML syntax, namespace incompatibility, lack of data typing, and limited extensibility and scalability.

There are multiple XML schema languages, including RELAX NG and the W3C XML Schema Language (described in this chapter).

An XML document can indicate the schema that applies to its non-namespace-qualified elements via an xsi:noNamespaceSchemaLocation attribute, which is normally placed on the root element.

An XML document can indicate the schema that applies to its namespace-qualified elements via an xsi:schemaLocation attribute, which is normally placed on the root element.

Schemas declare elements with xsd:element elements.

The type attribute of xsd:element specifies the data type of that element.

Elements with complex types can have attributes and child elements.

Elements with simple types only contain character data.

The xsd:complexType element defines a new type for an element that can contain child elements, attributes, and/or mixed content.

The xsd:group, xsd:all, xsd:choice, and xsd:sequence elements let you specify particular combinations of elements in an element's content model.

The minOccurs and maxOccurs attributes of xsd:element determine how many of a given element are allowed in the instance document at that point. The default for each is 1. maxOccurs can be set to unbounded to indicate that any number of the element may appear.

There are 44 built-in simple types, including many numeric, string, time, binary, URI, and XML types.

The xsd:simpleType element defines a new type for an element or attribute that can only contain character data.

You can define your own simple types by restricting an existing type such as xsd:string with the xsd:restriction element. The base attribute of the xsd:restriction child specifies what type you're deriving from.

An xsd:simpleType element can create a new type by unifying the value spaces of existing types. Such a type is defined by an xsd:union child element, which contains one xsd:simpleType element for each existing type combined into the new type.

A list type can hold one or more white-space-separated instances of an existing type. Such a type is defined by the xsd:list child of an xsd:simpleType element.

Schemas declare attributes with xsd:attribute elements.

The xsd:import element imports declarations for elements and attributes in a different namespace from another schema document.

The xsd:include element imports declarations for elements and attributes in the same namespace from another schema document.

Adding xsd:annotation elements helps make your schemas more readable.

The xsd:documentation child of an xsd:annotation element provides information for human readers.

The xsd:appinfo child of an xsd:annotation element provides information for software programs reading the schema, though schema validators ignore it.

This completes your training in core XML technologies. The next part begins several case studies of different XML applications in different vertical domains. First out of the gate is the Extensible Hypertext Markup Language (XHTML). XHTML 1.0 is an XMLized form of HTML. XHTML 1.1 is a modularized form of XHTML 1.0 that can be mixed with other XML applications.