Filling in the DTD Gaps with Schematron

Many XML developers, just when they've gotten used to DTDs, are hearing about alternatives and wondering what to do with them. W3C schemas, RELAX NG,
Schematron -- which should they go with? What will each buy
them? What software support does each have? How much of their current systems
will they still be able to use? The feeling of unease behind these questions
can be summed up with one question: if I leave DTDs behind to use one of the others, will I regret it?

One nice thing about Schematron
is its ability to work as an adjunct to the others, including DTDs, so you
don't have to leave DTDs behind to take advantage of Schematron. If you want to
combine Schematron with RELAX NG, Sun's msv validator has an add-on
library that lets you check a document against a RELAX NG schema with
embedded Schematron rules. But you don't need a combination validator like msv
to take advantage of Schematron. There's nothing wrong with checking a
document against one type of schema and then checking it against a set of
Schematron rules as well. In fact, more and more XML developers are realizing
that a pipeline of specialized processes that each check for one class of
problems can serve their needs better than a monolithic processor that does
most of what they need and several more things that they don't need.

This turns out to be the answer to the prayers of many developers
wondering about the best way to move on from DTDs. If you have a working
system built around DTDs and want to take advantage of the Schematron features
that are unavailable in DTDs, you can go ahead and write the Schematron rules
that fill in those gaps and continue using your DTD-based system.

While writing a DTD for PRISM ("Publishing Requirements for Industry
Standard Metadata"), I was frustrated to realize that several constraints described in the PRISM
specification could not be expressed using XML 1.0 DTDs. RELAX NG could express all of the PRISM spec's constraints, but that wouldn't help me create valid XML documents using Emacs with PSGML, which is driven by DTDs. I then realized that a DTD
expressing most of what I needed would let me create the documents, and a
straightforward Schematron schema less
than half the length of the DTD could ensure that the documents met the
additional constraints. Together they would give me everything I needed.

Exclusive ORs

"Exclusive or" is programmer talk for ensuring that one and only one of
a number of conditions is true. For example, the PRISM spec says that the
dc:identifier element must have a value as content between the
dc:identifier start- and end-tags or as the value of an
rdf:resource attribute, but cannot have both. This is easy to express
in RELAX NG, where the choice element can specify that one or the
other must be there.
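
A minimal sketch of such a choice pattern, in RELAX NG's XML syntax, might look like this. The namespace URIs are the standard Dublin Core and RDF ones. Using a data pattern with a minLength of 1 instead of a plain text pattern is an assumption made here, so that an empty element with no attribute is rejected.

```xml
<!-- Sketch only: dc:identifier takes either non-empty content
     or an rdf:resource attribute, but not both. -->
<element name="dc:identifier"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <choice>
    <!-- Branch 1: character content at least one character long. -->
    <data type="string">
      <param name="minLength">1</param>
    </data>
    <!-- Branch 2: the attribute alone, with no content. -->
    <attribute name="rdf:resource"/>
  </choice>
</element>
```

Because the two branches sit inside a single choice, an element that supplies both content and the attribute matches neither branch and fails validation.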

A DTD cannot express this either-or constraint, so both the element's content and the rdf:resource attribute have to be declared as optional. The content model can be #PCDATA, because an empty element will still validate against it, and the attribute must be #IMPLIED to show that it's optional.
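
As a rough sketch, the declarations could look like the following. They are wrapped in an internal DTD subset so that the sample stands on its own. The instance's rdf:resource value is a made-up placeholder, and the #FIXED namespace attributes are there only so that the sample instance is valid.

```xml
<!DOCTYPE dc:identifier [
  <!-- Content is optional: an empty element satisfies (#PCDATA). -->
  <!ELEMENT dc:identifier (#PCDATA)>
  <!-- rdf:resource is optional too, so the DTD cannot require that
       exactly one of content and attribute is present. -->
  <!ATTLIST dc:identifier
            rdf:resource CDATA #IMPLIED
            xmlns:dc     CDATA #FIXED "http://purl.org/dc/elements/1.1/"
            xmlns:rdf    CDATA #FIXED "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
]>
<dc:identifier rdf:resource="http://example.com/doc1"/>
```

These declarations accept all four combinations of content and attribute, including the two that the PRISM spec forbids. That is the gap left for Schematron to fill.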

A Schematron pattern can contain assertions, which declare a condition
that must be true if there is to be no error message, and reports, which
describe problems that, if found, should trigger error messages. A pattern
for dc:identifier elements needs two report checks for potential problems. The first report checks whether the element has both content of more than zero characters and an rdf:resource attribute. The second checks whether the element has neither content nor an rdf:resource attribute. Both report elements include the appropriate error message to output.
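
The two checks described above could be sketched as the pattern below. It assumes the ISO Schematron namespace (schemas written for Schematron 1.5 use a different namespace URI). The pattern id and the message wording are hypothetical, and the dc and rdf prefixes are assumed to be bound by sch:ns elements in the enclosing schema.

```xml
<!-- Fragment of a Schematron schema; the enclosing sch:schema is
     assumed to declare the dc and rdf prefixes with sch:ns. -->
<sch:pattern id="identifier-xor"
             xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:rule context="dc:identifier">
    <!-- Problem 1: both content and an rdf:resource attribute. -->
    <sch:report test="string-length(.) &gt; 0 and @rdf:resource"
                diagnostics="resourceAttrVal">
      A <sch:name/> element cannot have both content and an rdf:resource attribute.
    </sch:report>
    <!-- Problem 2: neither content nor an rdf:resource attribute. -->
    <sch:report test="string-length(.) = 0 and not(@rdf:resource)">
      A <sch:name/> element must have either content or an rdf:resource attribute.
    </sch:report>
  </sch:rule>
</sch:pattern>
```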

These two report elements demonstrate the simple,
straightforward way that Schematron lets you pair a machine-checkable description of a condition (the XPath expression in the test attribute) with a natural language description
of the condition that can be used for intuitive error output. These natural language messages include the Schematron name element, which inserts
the name of the element being checked.

The first report element's diagnostics attribute names a diagnostic element that provides further information about the problem. The "resourceAttrVal" diagnostic outputs a message that includes the value of the
rdf:resource attribute.
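
A sketch of such a diagnostic, again assuming the ISO Schematron namespace and with hypothetical message wording, could read:

```xml
<!-- Declared once in the schema, after the patterns; the id is what
     a report's diagnostics attribute refers to. -->
<sch:diagnostics xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:diagnostic id="resourceAttrVal">
    The rdf:resource attribute has the value
    "<sch:value-of select="@rdf:resource"/>".
  </sch:diagnostic>
</sch:diagnostics>
```

The value-of element is evaluated against the same context node as the report that triggered the diagnostic, so the message shows the attribute value of the offending element.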

This makes it easier to find which dc:identifier element has
both content and an rdf:resource value. (There's no point in using a
diagnostic with the other report, because an empty element with no
rdf:resource value has no useful information to pass along.)

The PRISM spec actually designates many more elements that may have either content or an rdf:resource attribute value. Changing the rule to account for them all merely means adding them to the context
attribute in the rule element's start-tag.
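
For illustration, a widened rule might look like the sketch below. The extra element names (dc:creator and dc:publisher) are illustrative picks and not the list from the PRISM spec. The ISO Schematron namespace and the message wording are assumptions as well.

```xml
<!-- The context is an XPath pattern, so alternatives are joined with "|". -->
<sch:rule context="dc:identifier | dc:creator | dc:publisher"
          xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:report test="string-length(.) &gt; 0 and @rdf:resource"
              diagnostics="resourceAttrVal">
    A <sch:name/> element cannot have both content and an rdf:resource attribute.
  </sch:report>
  <sch:report test="string-length(.) = 0 and not(@rdf:resource)">
    A <sch:name/> element must have either content or an rdf:resource attribute.
  </sch:report>
</sch:rule>
```

Because the messages use the name element and do not hard-code "dc:identifier", the report elements need no changes: each error message names whichever element matched the context.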