The Fundamentals of DTD Design

Ever tried to read a DTD, and failed miserably? Ever wondered what all those
symbols and weird language constructs meant? Well, fear not – this crash course
will get you up to speed with the basics of DTD design in a hurry.

If you’ve been playing with XML for a while, you probably already know that XML
documents come in two flavours: well-formed and valid.

A well-formed document is one which meets the specifications laid down in the
XML recommendation – that is, it follows the rules for element and attribute names,
contains all essential declarations, and has properly-nested elements.

A valid document is one which, in addition to being well-formed, adheres to the
rules laid out in a DTD or XML Schema. By imposing some structure on an XML document,
a DTD makes it possible for documents to conform to some standard rules, and for
applications to avoid nasty surprises in the form of incompatible or invalid data.

If you’re serious about developing your XML skill set, you’re going to bump your
head up against DTDs sooner or later – and the arcane commands and symbols you
find will make you want to weep and beg for Mommy. Unless, of course, you’re armed
with your own secret weapon…

This article.

Over the course of the next few pages, I’m going to find out just what makes
a DTD tick, with examples, explanations and illustrations that will demystify
this simple yet surprisingly-scary piece of the XML puzzle. Strap yourself in,
and prepare to meet the beast!

DTD Who?

Let’s start with the basics: what’s a DTD when it’s home, and why do you care?

The first part of the question is easy enough to answer. A DTD, or document type
definition, is a lot like a blueprint. Unlike most blueprints, however, it doesn’t
tell you where the kitchen goes or how the capsule containing the plutonium is
to be wired up. Nope, this blueprint is a lot more boring – it tells you exactly
how an XML document should be structured, complete with lists of allowed values,
permitted element and attribute names, and predefined entities.

DTDs are essential when managing a large number of XML documents, as they immediately
make it possible to apply a standard set of rules to different documents and thereby
demand conformance to a common standard. However, for smaller, simpler documents,
a DTD can often be overkill, adding substantially to download and processing time.

Most XML documents start out as well-formed data – they meet the basic syntactical
rules described in the XML specification, and are correctly structured (no overlapping,
badly-nested elements or illegal values). However, an XML document which additionally
meets all the rules, conditions and structural guidelines laid down in a DTD qualifies
for the far cooler “valid” status. Think of it like a free airline upgrade from
business to first…except, of course, without the complimentary drinks.

Why do you need to know about this? Well, you don’t.

If your day job involves carrying out covert operations for an unnamed intelligence
agency or building houses, you’d be better off studying the other sort of blueprint.
If, on the other hand, your job involves developing and using XML applications
and data, you need to have at least a working knowledge of how DTDs are constructed,
so that you can roll your own whenever required.

How’s The Weather Up There?

In order to illustrate how DTDs work, consider this simple XML document:
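
Something along these lines will do. Treat it as a sketch: the "city", "high" and "forecast" element names appear later in this article, while the "weather" root element, the "low" element and all the values are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!-- Hypothetical weather report; the root element, "low" and all values are assumed -->
<weather>
    <city>New York</city>
    <high>82</high>
    <low>68</low>
    <forecast>Rain</forecast>
</weather>
```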

As of now, this file is merely well-formed – it hasn’t yet been compared to a
DTD and declared valid. In order to perform this comparison, I need to link it
to a DTD – which I can do by adding a document type declaration referencing the
DTD.
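
As a sketch, the declaration goes right after the XML declaration and names the root element plus the location of the DTD. The root element name "weather" and the document contents are assumptions; "weather.dtd" is the file name used later in this article.

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE names the root element and points to an external DTD file -->
<!DOCTYPE weather SYSTEM "weather.dtd">
<weather>
    <city>New York</city>
    <high>82</high>
    <low>68</low>
    <forecast>Rain</forecast>
</weather>
```

A "weather.dtd" to match that hypothetical document might look something like this:

```xml
<!-- weather.dtd: an assumed content model, not necessarily the original listing -->
<!ELEMENT weather (city, high, low, forecast)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT high (#PCDATA)>
<!ELEMENT low (#PCDATA)>
<!ELEMENT forecast (#PCDATA)>
```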

To the untrained eye – gibberish. But give it a couple of minutes…

Rainy Days

Once you have an XML document and a DTD linked together, a validating XML parser
can verify the document against the DTD and let you know if it finds errors. A
number of tools are available online to perform this validation – my favourite
is the XML Spy editor, available at http://www.xmlspy.com/ , although you can
also try out rxp, at http://www.cogsci.ed.ac.uk/~richard/rxp.html . Another
popular parser, expat, at http://sourceforge.net/projects/expat/, is worth a
look too, but note that it is a non-validating parser: it checks only for
well-formedness, and will not validate a document against a DTD.
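
To see a validating parser complain, try breaking the rules. Here is a hypothetical version of the document, assuming "weather.dtd" declares the content model (city, high, low, forecast); the "humidity" element is made up for the occasion.

```xml
<?xml version="1.0"?>
<!DOCTYPE weather SYSTEM "weather.dtd">
<weather>
    <city>New York</city>
    <!-- "high" and "low" are missing, and "humidity" is not declared in the DTD -->
    <humidity>80</humidity>
    <forecast>Rain</forecast>
</weather>
```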

While this version of the document is still well-formed, it no longer follows
the rules laid down in “weather.dtd” and hence cannot be considered valid – which
is why rxp barfs and generates a list of errors.

Incidentally, it’s also possible to place the DTD within the XML document itself.
Although this is quite rare – the DTD is usually stored in a central place so
that it can be referenced by different XML documents – you should know how to
do it in case you’re ever home on a Saturday evening and feel like experimenting.
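
A minimal sketch of such an internal DTD follows: the declarations move inside square brackets in the document type declaration itself. The element names and values are the same assumed ones as before.

```xml
<?xml version="1.0"?>
<!-- Internal DTD subset: declarations sit between [ and ] instead of in a separate file -->
<!DOCTYPE weather [
    <!ELEMENT weather (city, high, low, forecast)>
    <!ELEMENT city (#PCDATA)>
    <!ELEMENT high (#PCDATA)>
    <!ELEMENT low (#PCDATA)>
    <!ELEMENT forecast (#PCDATA)>
]>
<weather>
    <city>New York</city>
    <high>82</high>
    <low>68</low>
    <forecast>Rain</forecast>
</weather>
```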

Simply Elementary

Now that you know the basics of linking and validating XML data against DTDs,
let’s focus in on the different components that actually go into a DTD.

All XML documents consist of some combination of elements, attributes, entities
and character data. In case you’ve forgotten what these are, here are some quick
definitions:

An element, which is the basic unit of XML, consists of textual content (character
data), enhanced with descriptive tags.

<dinosaur>Stegosaurus</dinosaur>

An attribute is a name-value pair which provides additional descriptive parameters
or default values to an element.

<person sex="male">Spiderman</person>

An entity is an XML construct, referenced by name, which stores text, images
and file references; it is primarily used as a mechanism to store and reuse content
which appears in multiple places within an XML document.

Each of these basic constructs can be defined in a DTD. I’ll begin with element
declarations, which typically look like this:

<!ELEMENT elementName (contentType)>

As an example, consider the “forecast” element from the previous example:

<!ELEMENT forecast (#PCDATA)>

In English, this declares an element with name “forecast” and content of the
form “parsed character data” (in case you’re wondering, this means that the
element holds text which the parser will parse, expanding any entity references
it finds along the way; an element declared this way cannot contain child
elements).

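Note that there is no equivalent “regular character data” content type for
elements: CDATA is a type for attributes, not elements, so a declaration like
<!ELEMENT greeting (#CDATA)> is not legal. If you want the parser to treat a
chunk of text as literal text without any further processing, wrap it in a
CDATA section within the XML document itself; the element is still declared
as #PCDATA. Here’s an example of such a section:

<greeting><![CDATA[Hello <b>world</b> & welcome]]></greeting>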

In case you don’t want to specify a content type, you can escape without making
a decision by allowing any content.

<!ELEMENT address ANY>

Of course, doing this kinda negates the purpose of having a DTD in the first
place…

If an element contains nested child elements, it’s necessary to specify these
element names within the declaration. For example, a “book” element with three
child elements nested within it – “author”, “title” and “price” – has an element
declaration in the DTD that looks like this:

<!ELEMENT book (author, title, price)>
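
As an illustration, here is a hypothetical document fragment that satisfies this declaration (the values are made up). The commas in the declaration mean the child elements must appear in exactly this order.

```xml
<!-- Children appear once each, in the order given in the declaration -->
<book>
    <author>J. Doe</author>
    <title>A Hypothetical Title</title>
    <price>19.95</price>
</book>
```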

XML also allows for so-called empty elements – essentially, elements which have
no content and therefore do not require a closing tag. Such elements are closed
by adding a slash (/) to the end of their opening tag, and are declared in a DTD
with the EMPTY keyword.
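
Here is a sketch of an XML snippet with an empty element. The "para" and "br" names are assumptions chosen for illustration.

```xml
<!-- "br" has no content, so it is closed with a trailing slash -->
<para>First line<br />second line</para>
```

The matching declarations would then use the EMPTY keyword for the empty element:

```xml
<!ELEMENT para (#PCDATA | br)*>
<!ELEMENT br EMPTY>
```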

What’s The Frequency, Bobby?

A number of special symbols can be added to an element declaration in order to
define its frequency and order, or the frequency and order of its child
elements. Here’s a quick list:

symbol    description
---------------------------------------------------------
+         one or more occurrence(s)
*         zero or more occurrence(s)
?         zero or one occurrence(s)
|         choice

If you’re familiar with regular expressions, you’ll feel right at home with these
symbols – they’re almost identical to the symbols used to build regular expression
patterns.

Let’s take this for a quick spin. Consider the following revised XML document
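
As a sketch, suppose the weather report is revised so that the "low" element is optional, at least one "forecast" or "advisory" must appear, and any number of "note" elements may follow. The "advisory" and "note" elements are assumptions added purely to exercise the symbols.

```xml
<?xml version="1.0"?>
<!DOCTYPE weather SYSTEM "weather.dtd">
<weather>
    <city>New York</city>
    <high>82</high>
    <!-- no "low" element: allowed by "?" -->
    <forecast>Rain</forecast>
    <advisory>Flood watch</advisory>
    <!-- no "note" elements: allowed by "*" -->
</weather>
```

A DTD permitting this structure could read:

```xml
<!-- low? = zero or one; (forecast | advisory)+ = one or more of either; note* = zero or more -->
<!ELEMENT weather (city, high, low?, (forecast | advisory)+, note*)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT high (#PCDATA)>
<!ELEMENT low (#PCDATA)>
<!ELEMENT forecast (#PCDATA)>
<!ELEMENT advisory (#PCDATA)>
<!ELEMENT note (#PCDATA)>
```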

Obviously, all these symbols can also be combined to create weird and wonderful
rules for the document to follow. An example awaits you at the end of the article…but
first, attributes.

Turning Up The Heat

Just as you can declare elements, a DTD also allows you to define the attributes
attached to each element. An attribute declaration typically looks like this:

<!ATTLIST elementName attributeName contentType modifier>

In order to demonstrate, consider the following XML document, which adds a couple
of attributes to the previously declared XML elements.
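
A sketch of such a document follows. The "state" and "units" attribute names come from this article, while the values and the remaining elements are assumptions.

```xml
<?xml version="1.0"?>
<!DOCTYPE weather SYSTEM "weather.dtd">
<weather>
    <!-- "state" qualifies the city; "units" qualifies the temperature -->
    <city state="NY">New York</city>
    <high units="fahrenheit">82</high>
    <low>68</low>
    <forecast>Rain</forecast>
</weather>
```

The declaration for the "state" attribute would then look like this:

```xml
<!ATTLIST city state CDATA #REQUIRED>
```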

In English, this declares that the element “city” has an attribute named “state”
containing character data. The additional #REQUIRED modifier indicates that this
is a required attribute – failure to include it will render the XML document invalid.

A number of different content types are available for attributes. You’re already
familiar with character data – here’s a quick list of the others:
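
As a sketch, here are the main attribute types defined in the XML 1.0 recommendation, followed by a declaration that uses one of each. Every attribute name except "state" is hypothetical. There is also a NOTATION type, which requires a matching notation declaration and is omitted here.

```xml
<!--
  CDATA     any character data
  ID        a name unique within the document (at most one ID attribute per element)
  IDREF     the ID of another element in the document
  IDREFS    a space-separated list of IDs
  NMTOKEN   a single name token
  NMTOKENS  a space-separated list of name tokens
  ENTITY    the name of an unparsed entity declared in the DTD
  ENTITIES  a space-separated list of such entity names
-->
<!ATTLIST city
    state   CDATA     #IMPLIED
    code    ID        #IMPLIED
    twin    IDREF     #IMPLIED
    nearby  IDREFS    #IMPLIED
    zone    NMTOKEN   #IMPLIED
    tags    NMTOKENS  #IMPLIED
    map     ENTITY    #IMPLIED
    photos  ENTITIES  #IMPLIED>
```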

You can also specify a list of allowed attribute values by enclosing them in
parentheses and separating them with the | operator. The following attribute declaration
does just that, limiting the list of allowed values for the “units” attribute
to either “celsius” or “fahrenheit”.

<!ATTLIST high units (celsius | fahrenheit) #REQUIRED>

A number of modifiers are available, each applying a special characteristic to
the attribute. For example, you can specify a default value for the attribute
by enclosing it in quotes; this default value is used if the attribute is absent.

<!ATTLIST high units (celsius | fahrenheit) "fahrenheit">

The #IMPLIED modifier is used to declare a particular attribute as optional.

<!ATTLIST high units (celsius | fahrenheit) #IMPLIED>

And finally, the #FIXED keyword is used to fix an attribute value to something
specific, allowing the XML document author no choice in the matter.

<!ATTLIST high units (celsius | fahrenheit) #FIXED "fahrenheit">

If an element has more than one attribute, you can declare them all within the
same attribute declaration. Consider the following XML document,
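
For instance, assume the "city" element carries a second, hypothetical "country" attribute alongside "state":

```xml
<city state="NY" country="US">New York</city>
```

Both attributes can then be declared in a single attribute declaration, one name-type-modifier triple per attribute:

```xml
<!-- "country" is an assumed attribute, optional here; "state" stays required -->
<!ATTLIST city
    state   CDATA #REQUIRED
    country CDATA #IMPLIED>
```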

Let’s move on to entities.

An Entity In The Attic

XML entities are a bit like variables in programming languages – they’re XML
constructs which are referenced by a name and store text, images and file
references. Once an entity has been defined, XML authors may call it by its name
at different places within an XML document, and the XML parser will replace the
entity reference with its actual value.

XML entities come in particularly handy if you have a piece of text which recurs
at different places within a document – examples would be a name, an email address
or a standard header or footer. By defining an entity to hold this recurring data,
XML allows document authors to make global alterations to a document by changing
a single value.
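
Here is a minimal sketch of an entity at work. The entity is declared with <!ENTITY name "value"> in the DTD and referenced as &name; in the document. The "bureau" entity, the "source" element and all the values are assumptions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE weather [
    <!ELEMENT weather (city, forecast, source)>
    <!ELEMENT city (#PCDATA)>
    <!ELEMENT forecast (#PCDATA)>
    <!ELEMENT source (#PCDATA)>
    <!-- Declare the entity once... -->
    <!ENTITY bureau "The Hypothetical Weather Bureau">
]>
<weather>
    <city>New York</city>
    <!-- ...and reference it wherever the text is needed -->
    <forecast>Rain, according to &bureau;</forecast>
    <source>&bureau;</source>
</weather>
```

Changing the value in the entity declaration changes both places at once.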

Note, however, that an entity cannot reference itself, either directly or indirectly,
as this would result in an infinite loop (most parsers will warn you about this).
And now that I’ve said it, I just know that you’re going to try it out to see
how much damage it causes.

The Old Popcorn Trick

And that just about covers everything I have to say on the topic. Before I go,
though, I’d like to run through a composite example illustrating everything
you’ve learned thus far.

Take a look at the following XML document

<?xml version="1.0"?>
<!DOCTYPE review SYSTEM "movie.dtd">
<review id="42">
    <header>
        <title>Pearl Harbor</title>
        <cast>Ben Affleck, Josh Hartnett and Kate Beckinsale</cast>
        <director>Michael Bay</director>
        <duration units="m">167</duration>
        <genre>Drama</genre>
        <slug>War Games</slug>
        <author>J. Doe</author>
        <date>2001-08-08</date>
    </header>
    <body>
        <para>On December 7, 1941, Japan unexpectedly attacked the American
        naval base at Pearl Harbor, hoping to gain the initiative in the war
        against Europe. As it turned out, the attack had the effect of
        galvanizing the <quote>sleeping American giant</quote>, resulting in
        the utter rout of the Japanese and German armies and bolstering
        America's dominant role in world politics. </para>
        <para>While <title>Pearl Harbor</title>'s love story may seem
        unbelievably trite, the effects are most certainly not. The Japanese
        attack on the naval port is described in tremendous detail, and is
        perhaps the most compelling reason to watch this film. With over forty
        minutes of reel time devoted to the attack, you've probably never seen
        anything like it before; it's a visual spectacle that hits home more
        than any written description ever will. Bay's direction is superb - he
        knows just where to put the camera, and he always gets the money shot -
        and the cinematography and visuals - especially those shot in the train
        station, with steam billowing out in the background - simply gorgeous.
        </para>
        <para>While I think the love story embedded within <title>Pearl
        Harbor</title> isn't really all that compelling - <title>Moulin
        Rouge</title> did it better - this is still a film worth watching, if
        only to understand a little bit of history! </para>
    </body>
</review>
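
A "movie.dtd" that this document would validate against could look like the sketch below. It is one possible DTD, reconstructed from the document above, and not necessarily the original. In particular, the allowed values for "units" (m or h) are an assumption, and "id" is declared as CDATA because a value like "42" starts with a digit and so is not a legal ID.

```xml
<!-- movie.dtd: a hypothetical DTD matching the review document above -->
<!ELEMENT review (header, body)>
<!ATTLIST review id CDATA #REQUIRED>

<!-- header children must appear once each, in this order -->
<!ELEMENT header (title, cast, director, duration, genre, slug, author, date)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT cast (#PCDATA)>
<!ELEMENT director (#PCDATA)>
<!ELEMENT duration (#PCDATA)>
<!ATTLIST duration units (m | h) #REQUIRED>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT slug (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT date (#PCDATA)>

<!-- one or more paragraphs, each mixing text with quote and title elements -->
<!ELEMENT body (para+)>
<!ELEMENT para (#PCDATA | quote | title)*>
<!ELEMENT quote (#PCDATA)>
```

Note how the one "title" declaration covers both the title in the header and the titles embedded in the paragraphs: an element is declared once, no matter how many places it appears in.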

And that’s about it from me. In case you’re interested in finding out about the
more arcane aspects of DTDs – notations, parameter entities and overrides – you
should consider checking out the following links.

If, on the other hand, all you were looking for was a working knowledge of DTDs
to get you through your day, I hope you found it here. I’ll be back soon with another
article on a related technology, XML Schema, which is quickly gaining followers
on account of its ease of use and powerful data-validation capabilities (think
of it as DTDs on steroids, but without the nasty symbols). Until then, though…be
good!