Position Paper on Compound Documents

This paper represents the personal position of Micah Dubinko, with
special thanks to Sanjay Kshetramade, Yatin Vasavada, and Danny Tom,
all of Verity, Inc.

Abstract

Several areas related to compound documents and web applications would
benefit from greater standardization and industry consensus. In
particular, the Web needs a consistent model for compound documents, a
broadly applicable linking vocabulary, and additional work on
suitability for hand-authoring.

The Web needs a consistent model for compound documents

Compound documents are a fact of modern life. From emails with
attachments to recent office file formats, compound documents are
already commonplace, but the lack of standardized packaging makes
accessing and processing such documents harder than necessary. We
need a standardized way of addressing individual components of a
compound document without a priori knowledge of its structure or
schema.

For file formats, several successful commercial products, including
Verity LiquidOffice, use the open zip format, containing XML and
related files, including images. As useful as this is, however, it is
not a full solution.
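
As a rough illustration (not drawn from LiquidOffice or any other
product), the sketch below builds a tiny zip package in memory and then
enumerates its parts without knowing the layout in advance. The part
names content.xml and images/logo.png are hypothetical, and classifying
a part by attempting an XML parse is an assumption made for the
example. The zip container supplies names and bytes, and everything
else has to be guessed.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict


def build_package() -> bytes:
    """Create a small zip package in memory: one XML part and one binary part."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Hypothetical part names; a real package defines its own layout.
        zf.writestr("content.xml", "<doc><title>Example</title></doc>")
        # Only the 8-byte PNG signature, used as placeholder binary data.
        zf.writestr("images/logo.png", b"\x89PNG\r\n\x1a\n")
    return buf.getvalue()


def list_parts(package: bytes) -> Dict[str, str]:
    """Map each part name to 'xml' or 'binary' by trying to parse it as XML."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            try:
                ET.fromstring(data)
                parts[name] = "xml"
            except ET.ParseError:
                parts[name] = "binary"
    return parts


PARTS = list_parts(build_package())
```

Nothing in the archive itself says which part is the root document, how
the parts refer to one another, or what media type each one has.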

Another difficult issue with compound documents is deciding how to
validate and label them. For example, is an XHTML file containing
inline SVG and MathML still just application/xhtml+xml? What is the
proper DTD or schema against which to validate it? In the short term,
hand-assembled profiles (like the XHTML+MathML+SVG profile) are working,
mostly, but the combinatorial explosion of possible MIME and document
types is daunting. We need a better way.
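
To put a number on that explosion, here is a small sketch under
simplifying assumptions: the four vocabularies listed are an
illustrative set, each combination is treated as unordered, and
versions and host-language choice are ignored. With n vocabularies
there are 2^n - 1 non-empty combinations, each a candidate for its own
hand-assembled profile.

```python
from itertools import combinations

# Illustrative set of vocabularies; real deployments may mix many more.
VOCABULARIES = ["XHTML", "SVG", "MathML", "XForms"]


def profiles(vocabs):
    """Return every non-empty unordered combination, joined with '+'."""
    return [
        "+".join(combo)
        for size in range(1, len(vocabs) + 1)
        for combo in combinations(vocabs, size)
    ]


ALL_PROFILES = profiles(VOCABULARIES)
PROFILE_COUNT = len(ALL_PROFILES)  # 2**4 - 1 combinations
```

Four vocabularies already give 15 combinations, and each additional
vocabulary roughly doubles the count.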

Crawling and Indexing Compound Documents

Within compound documents, it is currently difficult to determine:

Where boundaries exist between one document type and another
(namespaces help in some cases, but not all)

Whether a particular document fragment is valid
(for highly modular validation problems, RELAX NG has been useful)

What special processing is needed at any given node, including rules
for word-separation and hyphenation, metadata, and security issues

What mechanisms are in place to indicate hyperlinks

Additional standardization work is necessary to resolve these issues.
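
The first item above, in the case where namespaces do help, can be
sketched as follows. This is a minimal example rather than a crawler
design: the sample document and the helper names namespace_of and
find_boundaries are made up for illustration, and a boundary is taken
to be any element whose namespace differs from its parent's.

```python
import xml.etree.ElementTree as ET

# Made-up sample: XHTML hosting inline MathML and SVG.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p>Area of a circle:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML"><mi>r</mi></math>
    <svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg>
  </body>
</html>"""


def namespace_of(tag):
    """Return the namespace URI from an ElementTree '{uri}local' tag, or ''."""
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def find_boundaries(elem, parent_ns=None, found=None):
    """Collect (local_name, namespace) wherever the namespace changes."""
    if found is None:
        found = []
    ns = namespace_of(elem.tag)
    if ns != parent_ns:
        local = elem.tag.rsplit("}", 1)[-1]
        found.append((local, ns))
    for child in elem:
        find_boundaries(child, ns, found)
    return found


BOUNDARIES = find_boundaries(ET.fromstring(SAMPLE))
```

A walk like this finds the html, math, and svg elements, but it says
nothing about embedded content that shares the host namespace or that
lives in attributes or text, which is the "not all" part of the item.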

Event flow and style cascading across document type boundaries are
also challenging subjects. Additional use cases need to be gathered
in order to determine the "correct", standards-compliant behavior.

Tool-authored vs. Hand-authored Documents

Namespace proliferation is a problem. Even fairly modest documents now
require a huge raft of declarations at the top. As the author of an
O'Reilly book on XForms, I can report that 90% of the technical
questions from readers involve confusion related to namespaces.

For purely machine-generated and machine-processed XML, namespace
proliferation is a minimal concern. On the other hand, it is
increasingly common for humans to directly read, or in some cases,
write XML. If XML becomes so complicated that it is only possible to
work with it through custom applications, that effectively gives
proprietary formats an extra advantage in displacing standards.

Putting it together: Web Applications

XForms, in combination with CSS3, provides a solid
foundation for the next generation of interactive applications. XHTML
version 2.0, with an increased emphasis on structure and a declarative
approach, is likewise a good direction. With properly defined
abstraction, multimodality and accessibility fall out naturally.

One technically nonstandard but wildly useful API is called XMLHTTP,
which is more-or-less equally supported across IE, Netscape/Mozilla,
and Safari. This important technology would benefit from
standardization, as long as the core functionality, as already
deployed, doesn't get changed significantly.

XForms offers a very restricted technique for client-side storage, but
it is currently difficult to use due to file system differences across
platforms, within the constraints necessitated by security.

I tend to look favorably upon the work to standardize an XBL-like
layer that works with SVG (and hopefully other vocabularies). I am
concerned, however, about that layer receiving sufficient community
review.

In broader terms, extensions to specifications are fruitful, provided
that sufficient community review (possibly W3C, possibly not) is
possible, and that the extensions are available on IP terms comparable
to the main standard.