Ruminations on DocBook V.next

Volume 6, Issue 19; 21 May 2003 (modified 08 Oct 2010)

Any fool can write code that a computer can understand. Good
programmers write code that humans can understand.

—Martin Fowler

The DocBook TC has been kicking the idea of DocBook V5.0 around
for a long time. I think I've figured out why.

There comes a point in the life cycle of any system when adding
one more patch is the wrong solution to every problem. Eventually,
it's time to rethink, refactor, and rewrite. For DocBook, I think that time
has come.

Considering the Past

These are my recollections of how DocBook developed. I do
not claim that these are all facts, only that they are the
most factual memories that I have.

It Was a Long Time Ago...

DocBook is more than ten years old; its design stretches back to
the early '90s. Back then, men were real men, women were real women,
and SGML applications were really rare and
expensive. (What about XML, you ask? I'm not just talking
pre-XML here, I'm talking pre-HTML.)

Hampered by the dearth and cost of commercial SGML applications,
I eventually built my first publishing system with baling wire and
duct tape instead (SP output and beta versions of Perl 5).
I recall struggling to get SP through
gcc so that I could get at the ESIS output of the
parser.

The Tools Were Weak

The limitations of tools, and the limitations of
SGML DTDs, were a constant influence on our
design.

DocBook Was for Exchange

The original vision for DocBook was that it would be principally
an exchange DTD. Different vendors (of things like Unix and X Windows)
would all use DocBook to share content and build common documentation
libraries.

DocBook Is a Victim of Its Own Success

Over the years, DocBook has experienced “growth by accretion.”
Decisions that were made early on (like allowing some elements to have
<title>s both inside and outside of the info
wrappers) seemed fine at the time, when there were probably only a
handful of elements that had titles. But now those choices seem
like inconsistent warts.

We Stumbled Once Before

We're also suffering from the consequences of an earlier refactoring
attempt. The first refactoring of DocBook occurred between the
2.4.1 and 3.1 releases. Eve Maler rationalized the parameter entity
structure and applied the methodology she developed with
Jeanne El Andaloussi for developing SGML DTDs.[1]

This refactoring was necessary and valuable, but it was never
entirely complete. It left us with some pretty awkward content
models.

In the intervening years, we've talked many times about
“reworking the parameter entities”, but we've postponed
it indefinitely as we've fixed bugs and added features.

Considering the Present

Today, HTML exists. A lot more developers have gotten used to the idea
of writing structured documentation. (Say what you want about the structure of most
HTML; it did expose people to the idea of putting elements and attributes in their
documents and separating structure from presentation, at least a little bit.)

Today, XML exists. XML has supplanted
SGML in every significant way. XML parsers
are nearly ubiquitous. The state of the art in tools for manipulating
XML includes powerful tools like SAX, StAX,
various flavors of DOM, and things like
JAXB. On top of that platform, we have
XSLT, XSL-FO,
and support for transformation and rendering of XML in the
browser.

Today, a lot of people author in DocBook.
They do this for many reasons, and one of them is exchange, but
they aren't principally writing in some private tag set, or a deep
customization of DocBook, and then converting to the standard to pass
documents to interchange partners. They're writing directly in
standard DocBook.

A Modern Approach

If we were starting over, I think we'd approach the problem much
differently:

We'd use XML.

We'd use RELAX NG.

We'd design for the web.

We'd design for regularity and consistency at the current scale.
(Designing a schema of roughly 400 elements is different than
designing a schema of roughly 100.)

Design Principles

A good place to start would be some design principles. If 100
people are going to ask you to make 100 different changes, it's nice
to have some rules for sorting out which ones make sense and which
ones don't.

Whatever we do, it should still look and feel like DocBook. In
all fairness, when I said “starting over”, I wasn't
really thinking of going back to first principles and reinventing all
the elements and content models. I think one of the goals should be
that most valid DocBook documents can be transformed into new valid
V.next documents with XSLT.
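
As a rough illustration of how mechanical such a transformation could be, here is a minimal sketch. It uses Python's standard library rather than XSLT, and it assumes (hypothetically) that V.next moves a directly nested <title> into an <info> wrapper; the set of affected elements in TITLED is an assumption made for the example, not a proposal.

```python
import xml.etree.ElementTree as ET

# Hypothetical: elements whose direct <title> child moves into <info>.
TITLED = {"book", "chapter", "section", "figure", "example"}


def move_titles_into_info(root):
    """Move each direct <title> child of a TITLED element into its <info>."""
    # Snapshot the elements first, because the tree is modified in the loop.
    for elem in list(root.iter()):
        if elem.tag not in TITLED:
            continue
        title = elem.find("title")  # direct children only
        if title is None:
            continue
        info = elem.find("info")
        if info is None:
            # No wrapper yet: create one as the first child.
            info = ET.Element("info")
            elem.insert(0, info)
        elem.remove(title)
        info.insert(0, title)
    return root


legacy = ET.fromstring(
    "<chapter><title>Intro</title><para>Hello.</para></chapter>"
)
print(ET.tostring(move_titles_into_info(legacy), encoding="unicode"))
```

A real conversion would have more cases to handle, but most of them should be of this shape: local, rule-driven rewrites.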

There are only a few kinds of elements: <set> and
<book>; divisions (<part> and
<reference>); components (<preface>,
<chapter>, etc.); formal blocks (<figure>,
<example>, etc.); and blocks (<para>,
<blockquote>, etc.); and inlines.

There are only three kinds of inlines: “just text”,
general inlines, and domain-specific inlines.

All the metadata goes in an <info> wrapper. RELAX NG lets
us have different content models for <info> depending on the context.
(So it can have a required title for some elements, an optional title for others, and
a forbidden title for yet others.)
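
To make the context-dependent <info> idea concrete, here is a small Python sketch that models the rule outside of any schema language. RELAX NG would express this declaratively; the code only imitates the effect. The policy table is entirely hypothetical and is not a statement about which elements would get which rule.

```python
import xml.etree.ElementTree as ET

# Hypothetical title policy for <info>, keyed by the parent element.
TITLE_POLICY = {
    "chapter": "required",
    "figure": "required",
    "blockquote": "optional",
    "informalfigure": "forbidden",
}


def check_info_titles(root):
    """Return messages for elements whose <info> breaks the title policy."""
    errors = []
    for parent in root.iter():
        policy = TITLE_POLICY.get(parent.tag)
        if policy is None:
            continue  # no rule assumed for this element
        info = parent.find("info")
        has_title = info is not None and info.find("title") is not None
        if policy == "required" and not has_title:
            errors.append(f"{parent.tag}: title required")
        elif policy == "forbidden" and has_title:
            errors.append(f"{parent.tag}: title not allowed")
    return errors
```

The point of the sketch is only that one element name, <info>, can carry different rules depending on where it appears.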

Some Open Questions

I expect this section to get longer as I fiddle with
instantiating an experimental V.next. These questions are in no
particular order.

Is the distinction between formal/informal useful anymore? I think it's
a holdover from the days when building a “list of titles” based on
whether or not the elements actually had titles was considered too hard. That's
hardly the case these days.
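
For a sense of how little work this is today, here is a minimal Python sketch that builds a list of titles by checking which elements actually have one. The element names in BLOCKS, and the choice to look for a title either directly or inside <info>, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed set of block elements that may appear in a list of titles.
BLOCKS = ("figure", "example", "table")


def list_of_titles(root):
    """Collect (element name, title text) for blocks that have a title."""
    titles = []
    for elem in root.iter():
        if elem.tag not in BLOCKS:
            continue
        # Accept a title as a direct child or inside an <info> wrapper.
        title = elem.find("title")
        if title is None:
            title = elem.find("info/title")
        if title is not None:
            titles.append((elem.tag, "".join(title.itertext()).strip()))
    return titles
```

Untitled blocks simply drop out of the list, with no need for a separate "informal" element to say so.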

Are varying content models, such as described above for <info>,
harder for users to understand? My intuition is no. I don't think most users envision
things in terms of content models (“Oh, this is an <info>
wrapper so it must (or must not) have a title”). I think they envision
things in terms of more semantic structures (“figures must have titles;
titles go in the <info> wrapper”).

Ubiquitous linking is a no-brainer, at least on inlines. Does it make sense
on blocks too? If we're going to allow <phrase href="...">,
is there any reason not to allow
<chapter href="...">? And if you say “yes”,
what is the design principle that you use to distinguish between the two cases?

Are inlines cheap? This is more of a long-term maintenance question, but
we have a large pool of inlines in DocBook and enough elements to make it hard for
new users to see what goes where. So, on one hand, adding new inlines gives better
semantic markup for the users that need those inlines. On the other hand, it's yet
more tags for new users to learn.

Brace yourself. We're just about to slam squarely into the character entity
problem. No DTD means no named character entities. I think this just adds fuel to
the fire that says the right answer here is to publish some normative entity sets
separately from the DTD. Then you can include the sets you need directly.

Frankly, I Like the Timing

The imminent release of XSLT 2.0 is an ideal opportunity to
refactor the DocBook XSL Stylesheets. Supporting a refactored DocBook
schema at the same time makes good engineering sense.

Considering the Future

The XML world will continue to evolve. We should bear that in
mind. Designing so we can add new features incrementally will keep
DocBook stable and useful for another 10 years. Until the next
refactoring.

Herewith, some things to bear in mind.

Using Schematron assertions with existing RELAX NG grammars gives us
the ability to validate conditions that aren't easily modeled with grammar-based
languages. Typed links are one example: <glossterm>s should only point
to <glossentry>s, and so on.
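
The typed-link condition is easy to state procedurally, which is roughly what a Schematron assertion would do. Here is a minimal Python sketch of the same check, not Schematron itself. It assumes the link is carried by a linkend attribute that refers to an id attribute, as in existing DocBook; V.next might name these differently.

```python
import xml.etree.ElementTree as ET


def bad_glossterm_links(root):
    """Return linkend values on <glossterm> that do not hit a <glossentry>."""
    # Index every element that carries an id (assumed attribute name).
    by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
    problems = []
    for term in root.iter("glossterm"):
        target_id = term.get("linkend")
        if target_id is None:
            continue  # an unlinked glossterm is not checked here
        target = by_id.get(target_id)
        if target is None or target.tag != "glossentry":
            problems.append(target_id)
    return problems
```

A grammar can say that linkend must be an ID reference, but not what kind of element it must reach. That is the gap an assertion layer would fill.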

A future version of RELAX NG might give us back our exclusions.

Are You On Crack?

It's certainly fair to ask: should we do this at all?

There's a lot of legacy out there. Of
course, nothing that's suggested here will ever break that legacy.
It'll still be valid DocBook and there will still be tools that
process it. The only concern I really have on this front is how painful
it will be for users of legacy systems to move forward.

Maybe it would be better to just declare DocBook finished and
move on? I've pretty well convinced myself that piling on yet more
fixes is not practical.

[1] Developing SGML DTDs: From Text to Model to Markup,
published by Prentice-Hall PTR (1996, ISBN: 0-13-309881-8).
Out of print, but still a valuable resource if you can get your hands on one.