Chapter 2: Complexity of XML Schemas

In May of 2001, the W3C (finally) published XML Schemas, a new way of
defining XML documents that was more flexible and powerful than the
classical DTDs of the day. XML Schemas overcame a lot of
the limitations of DTDs, allowing for much more reusability and
scalability. A handful of "native" simple datatypes were introduced,
user-defined complex datatypes were allowed, and, borrowing a page from
Object Oriented Programming, types could be derived from other types,
inheriting the structures and definitions of their ancestors.

Yes, the W3C XML Schema language is a much more powerful way to define an XML document,
but with greater power comes greater complexity. While DTDs are
a lot more primitive in what they can define, for the most part they
are not too difficult to understand at a glance. A simple, basic XML
Schema is not too difficult to understand either, but once you add
the full power of Schemas to your document definition, it quickly
becomes hard to read.

Power and Flexibility of XML Schemas

XML Schemas added quite a bit of power and flexibility to how you can
define your XML document. Previously, documents were defined only
through DTDs, which allowed for only primitive definitions.
They let you define nodes, their names, and any children they
could have, but that was about it. All nodes were declared at the 'top'
level, so there was no nesting of declarations and thus all node names had
to be unique. Nothing but PCDATA was allowed for element content
(one could not, for example, limit a node to being only a number).
And to top it off, DTDs were written in a different syntax than XML,
so a developer had to learn two languages in order to effectively code in XML.

When XML Schemas were introduced, they were received with great enthusiasm.
Because Schemas are written in XML, the developer did not have to learn another
syntax in order to define his or her documents, and declarations can easily be nested.
Element content is no longer limited to PCDATA: you can use predefined simple
types such as string, integer, and dateTime, or define your own. Nodes can be
declared at a global level and reused, or declared locally so that the same name
can carry a different definition in a different context. Also, being XML, schemas
have a hierarchical format that can somewhat mimic the structure of the XML
Document that's being defined.

XML Schemas vs. DTD

DTDs could define XML documents in the following ways:

Constraints on allowable elements and attributes

Limited control over element occurrence

Choice of elements in a sequence

All elements globally declared

XML Schemas allowed all of the above, but could in addition do the following:

Support for built-in datatypes (string, int, etc.)

Greater context support

More detailed occurrence control

Default values

Nested (local) element declarations

AddressType example

The best way to show how XML Schemas improved upon DTDs is by example.
We will use the common example of an Address to show these differences.
The XML snippet shown below is a sample of the Address element that we
are trying to define.
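
As a minimal sketch of such an Address, the document below carries a DTD
for itself in its internal subset. Only Address and StreetAddress are
named in this chapter; Recipient, City, State, PostalCode, and all of the
sample values are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sample: element names other than Address and
     StreetAddress, and all values, are assumed for illustration. -->
<!DOCTYPE Address [
  <!-- DTD: every element is declared at the top level,
       and leaf content can only be #PCDATA -->
  <!ELEMENT Address (Recipient, StreetAddress+, City, State, PostalCode)>
  <!ELEMENT Recipient (#PCDATA)>
  <!ELEMENT StreetAddress (#PCDATA)>
  <!ELEMENT City (#PCDATA)>
  <!ELEMENT State (#PCDATA)>
  <!ELEMENT PostalCode (#PCDATA)>
]>
<Address>
  <Recipient>Jane Doe</Recipient>
  <StreetAddress>123 Main Street</StreetAddress>
  <StreetAddress>Suite 400</StreetAddress>
  <City>Springfield</City>
  <State>IL</State>
  <PostalCode>62701</PostalCode>
</Address>
```

Note that the DTD can only say "one or more" StreetAddress elements (the +
indicator), and that it has no way to say PostalCode must be numeric.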

With the Schema, we already have the ability to define with more precision
how our XML document should look. Not only do we name the nodes, we also
say what types they are, where they appear in the sequence, and even
how many can appear in that sequence. Note also the difference
in where the definitions appear. In the DTD, they are all
at the top level of the document; in the Schema, the layout is very similar to
that of the XML document itself (i.e. the StreetAddress element node is a child
of the Address element node, through a couple of XSD nodes, of course).
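
The Schema side of the comparison can be sketched as follows. This is an
illustration, not the chapter's original listing: the child element names
other than StreetAddress, the chosen types, and the occurrence limits are
all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: names other than Address and StreetAddress,
     the types, and the occurrence limits are assumed for illustration. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Recipient" type="xs:string"/>
        <!-- at least one and at most two street lines -->
        <xs:element name="StreetAddress" type="xs:string"
                    minOccurs="1" maxOccurs="2"/>
        <xs:element name="City" type="xs:string"/>
        <xs:element name="State" type="xs:string"/>
        <!-- a numeric type, used here only to show datatyping -->
        <xs:element name="PostalCode" type="xs:positiveInteger"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

The child declarations sit inside the Address declaration, separated from
it only by the xs:complexType and xs:sequence nodes.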

Another major difference between these two schema languages can be seen
even in this simplistic example. The number of times an element
can repeat in a DTD is very limited: exactly once, zero or one (?), zero or
more (*), or one or more (+). With XML Schemas, you can define a very
specific number of occurrences with the minOccurs and maxOccurs attributes.

Power begets confusion

The above example is XSD in its simplest and easiest-to-understand state.
If that were all we were planning to use Schemas for, there would be very little
reason to upgrade from DTDs. The real power behind Schemas lies in their
ability to let the definitions be quite extensive and precise. However,
this power comes at a great price.

Powerful XML Schema definitions

In addition to what was listed above, XML Schemas can do the following:

Derivation of complex and simple types

Substitution groups for complex schemas

Greater detail of restrictions on simple types

Built-in support for documentation

Namespace support

Reference external schemas

Etc.
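
Two of the items above, derivation of complex types and detailed
restrictions on simple types, can be sketched in a few lines. All type and
element names here (USZipCodeType, AddressType, USAddressType, ZipCode) are
hypothetical, chosen only to continue the address theme.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: all type and element names are illustrative. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Restriction of a simple type: a string limited to five digits -->
  <xs:simpleType name="USZipCodeType">
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-9]{5}"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- A base complex type -->
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="StreetAddress" type="xs:string" maxOccurs="2"/>
      <xs:element name="City" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- A type derived by extension: it inherits StreetAddress and City,
       then appends State and ZipCode -->
  <xs:complexType name="USAddressType">
    <xs:complexContent>
      <xs:extension base="AddressType">
        <xs:sequence>
          <xs:element name="State" type="xs:string"/>
          <xs:element name="ZipCode" type="USZipCodeType"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="Address" type="USAddressType"/>
</xs:schema>
```

To learn what an Address may contain, a reader now has to follow three
separate definitions instead of one.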

The more precise you make your document definition, the less legible it becomes.

Flexible yet complex AddressType

Taking the above example of an Address, let's extend it even further,
defining a "global" address, one that could be used for a few different
countries in the world.
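
A sketch of what such a "global" AddressType might look like is given
below. The three countries, the element names, and the documentation text
are assumptions made for illustration, not a definitive design.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: countries, element names, and documentation
     text are assumed for illustration. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Address" type="AddressType"/>

  <xs:complexType name="AddressType">
    <xs:annotation>
      <xs:documentation>
        A postal address usable for several countries.
      </xs:documentation>
    </xs:annotation>
    <xs:sequence>
      <xs:element name="Recipient" type="xs:string"/>
      <xs:element name="StreetAddress" type="xs:string" maxOccurs="2"/>
      <xs:element name="City" type="xs:string"/>
      <!-- exactly one of the country-specific groups must follow -->
      <xs:choice>
        <xs:sequence>
          <xs:annotation>
            <xs:documentation>United States</xs:documentation>
          </xs:annotation>
          <xs:element name="State" type="xs:string"/>
          <xs:element name="ZipCode" type="xs:string"/>
        </xs:sequence>
        <xs:sequence>
          <xs:annotation>
            <xs:documentation>Canada</xs:documentation>
          </xs:annotation>
          <xs:element name="Province" type="xs:string"/>
          <xs:element name="PostalCode" type="xs:string"/>
        </xs:sequence>
        <xs:sequence>
          <xs:annotation>
            <xs:documentation>United Kingdom</xs:documentation>
          </xs:annotation>
          <xs:element name="County" type="xs:string" minOccurs="0"/>
          <xs:element name="Postcode" type="xs:string"/>
        </xs:sequence>
      </xs:choice>
      <xs:element name="Country" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```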

Believe it or not, this defines a relatively simple XML document.
The complexity begins with the introduction of choice nodes and
documentation, and this is only the beginning. We could make it
even more complex by adding additional countries, adding
enumerations for states, and so on.

The more complex a schema becomes, the more difficult
it becomes to understand.

Necessity to remove complexity

As you can see from the AddressType example above, this particular document
is no longer easy to understand at a glance. A lot of power comes from
reusability and delegation, but confusion comes with it.
The need to clarify schemas becomes more obvious the more
complex our schemas become.

Paradox of defining XML Documents

As I define my XML Documents, I tend to find myself first writing an example
of my Document before hopping into DTDs or Schema Creation. I find it a lot
more natural to think of what I want the final document to look like, before
I create the definition for the document. This retro-creation doesn't work
well with current XML Schema editors. There is no aid for showing the final
output of the Schema as you create it. It's a 'hit-and-miss' tactic,
particularly for complex schemas: you write the schema, then
validate a document against it, and if it doesn't work, go back and edit the
schema. This is very similar to the archaic development methodology of
first-generation languages, and it does not appeal to anyone who's familiar
with the modern visual development tools of our day.