B

DocBook and XML

$Revision: 546 $

$Date: 2001-08-02 06:27:50 -0400 (Thu, 02 Aug 2001) $

XML, the Extensible
Markup Language, is a simple dialect of SGML. In the words of the
XML specification, “the goal [of XML] is to enable generic SGML to be
served, received, and processed on the Web in the way that is now possible
with HTML.”

XML raises two issues with respect to DocBook:

Are DocBook SGML instances valid XML instances?

Can the DocBook DTD be made into a valid XML DTD?

If you have an existing SGML system, and your primary goal is
to serve DocBook documents over the Web as XML, only the first of
these issues is relevant. As the popularity of XML grows, we will
see more and more XML-aware tools that don't implement full
ISO 8879 SGML. If your goal is to author DocBook
documents with one of this new generation of tools, you will only be
able to achieve validity with an XML DocBook DTD.

Although not yet officially adopted by the OASIS DocBook Technical
Committee, an XML version of DocBook is available now and
provided on the CD-ROM.

DocBook Instances as XML

Most DocBook documents can be made into well-formed XML documents very
easily. With few exceptions, valid DocBook SGML instances are also well-formed
XML instances. The following areas may need to be addressed.

System Identifiers

It is common for SGML instances to use only a public identifier, with no
system identifier, in document type and parameter entity declarations.

If you're used to using catalog files to resolve system identifiers,
you may be dismayed to learn that system identifiers are required. Because most
tools favor system identifiers over public identifiers, all of the portability
that was gained by the use of catalog files seems to have been lost. In the
long run, it'll be regained by the fact that XML system identifiers can be
URNs, which will have a resolution scheme like catalogs, but what about the
short run?

Luckily, there are a couple of options. First, you can tell your tools to use the public identifiers even
though system identifiers are present. Simply add:

OVERRIDE YES

to your catalog files. Alternatively, you can remap system identifiers
with the SYSTEM catalog directive. If you are faced with
documents that don't use public identifiers at all, this is probably your
only option.
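
As a minimal sketch of what XML requires, a document type declaration can
carry both identifiers. The public identifier is the one used in this
appendix's example; the system identifier "docbook.dtd" is a hypothetical
local path chosen for illustration, not a location given by the source.

```xml
<!-- Public identifier followed by the system identifier that XML requires.
     "docbook.dtd" is a placeholder path, assumed for this sketch. -->
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" "docbook.dtd">
<chapter><title>Chapter Title</title>
<para>
With OVERRIDE YES in the catalog, the public identifier is used
even though a system identifier is present.
</para>
</chapter>
```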

XML also forbids tag omission, along with the half dozen or so more
exotic forms of minimization that you may have used. They're all illegal.
The easiest way to remove these minimizations is probably with a tool like
sgmlnorm (included in the SP and Jade distributions on the CD-ROM).
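
To make the difference concrete, here is a hand-written illustration (not
actual sgmlnorm output) of a fragment before and after minimization is
removed. The SGML form appears in the comment; the element content is
invented for the example.

```xml
<!-- SGML with an empty end tag and an omitted end tag (not well-formed XML):
       <para>Some <emphasis>important</> text
       <para>Next paragraph
-->
<!-- The same content with every tag written out in full: -->
<section>
<para>Some <emphasis>important</emphasis> text</para>
<para>Next paragraph</para>
</section>
```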

Address expresses
that its content is line-specific with an attribute.

Some XML processing environments are going to ignore the doctype declaration
in your document, even if it's present. This is relevant when your instance
uses elements that have attributes with default values. The default values
are expressed in the DTD, but may not be expressed in your instance. In the
case of DocBook, there are relatively few of these, and your stylesheet can
probably be constructed to do the right thing in either case. (It essentially
treats the attributes as if they had implied values.)
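
If you would rather not rely on the stylesheet, the default can be written
into the instance. The sketch below assumes the Address element's format
attribute with its linespecific default; the address text itself is a
placeholder.

```xml
<!-- The DTD would supply format="linespecific" by default. Spelling it out
     keeps the value visible to a processor that never reads the DTD. -->
<address format="linespecific">1 Example Street
Anytown</address>
```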

Character and SDATA Entities

<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN">
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>

The DocBook DTD defines all of the standard ISO
entities automatically, but the ISO definitions use
SDATA, which is not allowed in XML. Eventually,
ISO (or someone else) will release official
ISO standard entity sets that make reference to the
appropriate Unicode character for each entity. Until then, the XML
version of DocBook is
distributed with an unofficial set.

If you use entities in your document, it may be wise to put declarations
for them in the internal subset of each instance, because some
XML browsers are going to parse the internal subset but not the external subset.
If the entity declarations are in your DTD, and the browser does not parse
the external subset, the browser won't know how to display the entities in
your document.
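
A minimal sketch of such a declaration follows, using the trade entity from
this appendix's example. The system identifier "docbook.dtd" is a
hypothetical path; U+2122 is the Unicode TRADE MARK SIGN.

```xml
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook V3.1//EN" "docbook.dtd" [
<!-- Declared in the internal subset so that a parser which skips the
     external subset still knows how to expand the entity. -->
<!ENTITY trade "&#x2122;">
]>
<chapter><title>Chapter Title</title>
<para>
This book was published by O'Reilly&trade;.
</para>
</chapter>
```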

With the standard DocBook SGML declaration, DocBook instances are not
case-sensitive with respect to element and attribute names. XML is always
case-sensitive. As long as you have used the same case consistently, your
XML instances will be well-formed, but it may still be advantageous to do some
case-folding because it will simplify the construction of stylesheets.

Keywords in XML are case-sensitive,
and must be in uppercase.

The name declared in the document
type declaration, like all other names, is case-sensitive.

Start and end tags must use the same
case.

In XML, Para is not the
same as PARA. Note that using the wrong case is a validity error (against
the XML version of DocBook), but it is not an XML well-formedness error. The use of
para and PARA as distinct names is as legitimate
as using foo and bar, as long as they
are properly nested.
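
The fragment below illustrates the distinction. It is well-formed because
each start tag is matched by an end tag in the same case, although a DTD
that declares only the lowercase name would reject the uppercase element.

```xml
<chapter>
  <!-- Well-formed: start and end tags match exactly. -->
  <para>A paragraph.</para>
  <!-- Also well-formed, but PARA is a different name from para. -->
  <PARA>Another paragraph.</PARA>
  <!-- Writing <Para>text</para> would not be well-formed,
       because the two tags differ in case. -->
</chapter>
```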

XML instances cannot use CDATA or SUBDOC
external entities. One option for integrating external
CDATA content into a document is to employ a pre-processing pass
that inserts the content inline, wrapped in a CDATA marked
section.
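
The result of such a pass might look like the following sketch. The listing
content is invented for illustration; the point is that markup characters
inside the marked section need no escaping.

```xml
<!-- External CDATA content inserted inline by a pre-processing pass. -->
<programlisting><![CDATA[
if (a < b && b < c) return a;
]]></programlisting>
```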

SUBDOC entities may be more problematic. If you do
not require validation, it may be sufficient to simply put them inline. XML
namespaces may offer another possible solution.

No Data Attributes on Notations

They're not allowed in XML, so don't add any.

No Attribute Value Specifications on Entity Declarations

They're not allowed in XML, so don't add any.

The DocBook DTD as XML

Converting the DocBook DTD to XML is much more challenging
than converting the instances. It is probably not possible to
construct an XML DTD that is identical in validation power
to DocBook. The list below identifies most of the issues that
must be addressed, and describes how the DocBook XML DTD deals with
them:

Comments are not allowed inside markup declarations

Most of them have been moved to comment declarations preceding the markup
declaration that used to contain them. A few small, inline comments that seemed
like they would be out of context if moved before the declaration were simply
deleted.
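
In SGML, a comment delimited by pairs of hyphens may sit inside the
declaration itself. The XML form, sketched below, puts it in a separate
comment declaration ahead of the markup declaration. The content model
shown is a simplified placeholder, not the real DocBook model.

```xml
<!-- A paragraph. This comment used to live inside the declaration. -->
<!ELEMENT para (#PCDATA)>
```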

Name groups are not allowed in element or attribute list
declarations

The small number of places in which DocBook uses name groups have
been expanded.

There's one downside: DocBook uses %admon.class; in a name
group to define the content model and attribute lists for elements in the
admonitions class. In DocBook XML, this convenience cannot be expressed. If additional
admonitions are added, the element and attribute list declarations will have
to be copied for them.
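
A rough sketch of the expansion follows, showing three of the admonition
elements. The content model and the id attribute are simplified
placeholders assumed for illustration, not the real DocBook declarations.

```xml
<!-- SGML allowed one declaration for the whole class:
       <!ELEMENT (%admon.class;) ...>
     XML needs one element declaration, and one attribute list
     declaration, per element. -->
<!ELEMENT caution (title?, para+)>
<!ATTLIST caution id ID #IMPLIED>
<!ELEMENT note (title?, para+)>
<!ATTLIST note id ID #IMPLIED>
<!ELEMENT tip (title?, para+)>
<!ATTLIST tip id ID #IMPLIED>
```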

No CDATA or RCDATA
declared content

Graphic and InlineGraphic have
been made EMPTY. The content model for SynopFragmentRef,
the only RCDATA element in DocBook, has been
changed to (arg | group)+.
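
Written as XML element declarations, these changes would look roughly like
this. The lowercase element names are an assumption made for the sketch.

```xml
<!-- Formerly CDATA declared content; now empty elements. -->
<!ELEMENT graphic EMPTY>
<!ELEMENT inlinegraphic EMPTY>
<!-- Formerly RCDATA declared content; now an ordinary content model. -->
<!ELEMENT synopfragmentref (arg | group)+>
```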

No exclusions or inclusions on element declarations

They had to be removed.

In DocBook, exclusions are used to exclude the following:

Ubiquitous elements (indexterm
and BeginPage) from a number of contexts in which they
should not occur (such as metadata, for example).

Removing these exclusions from DocBook XML means that it is now valid, in
the XML sense, to do some things that don't make a lot of sense (like putting
a Footnote in a Footnote). Be careful.

Inclusions in DocBook are used to add the ubiquitous elements
(indexterm and BeginPage) unconditionally to a
large number of contexts. In order to make these elements available in
DocBook XML,
they have been added to most of the parameter entities that include
#PCDATA. If new locations are discovered where these terms are desired, DocBook XML
will be updated.

Elements with mixed content must have #PCDATA
first.

The content models of many elements have been updated to make them a
repeatable OR group beginning with #PCDATA.
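
The required shape is sketched below. The list of alternatives is a
shortened placeholder, not the real DocBook content model for the element.

```xml
<!-- XML mixed content: #PCDATA comes first, alternatives are separated by
     "|", and the whole group is repeatable with "*". -->
<!ELEMENT para (#PCDATA | emphasis | indexterm)*>
```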

Many declared attribute types (NAME,
NUMBER, NUTOKEN, and so on) are not allowed