Extensible Markup Language (XML)

C. M. Sperberg-McQueen

University of Illinois at Chicago
u35395@uicvm.uic.edu

Tim Bray

Textuality
tbray@textuality.com

Keywords: text encoding, WWW

Extensible Markup Language (XML for short) is being designed under
the auspices of the World-Wide Web Consortium (W3C); the larger goal of
this effort is "to enable future Web user agents to receive and process
generic SGML in the way that they are now able to receive and process
HTML. As in the case of HTML, the implementation of SGML on the Web
will require attention not just to structure and content (the domain of
SGML per se) but also to link semantics and display semantics." (See
http://www.w3.org/pub/WWW/MarkUp/SGML/Activity for the W3C's
description of this activity.) As a subgoal, we are creating an SGML
application profile, XML, that is designed to provide many of the
benefits of SGML in a lightweight, easy-to-use, easy-to-implement
dialect that omits many of the difficult or problematic features of the
full standard. This paper is a report on the XML specification; if time
allows, some information will also be provided on the progress of the
work toward a typology of links and link behaviors. At the time this
abstract is prepared, the XML specification has been made public, but is
still officially a working draft.

Motivation

The Standard Generalized Markup Language (SGML) is the most fully
developed specification of the use of descriptive markup languages for
electronic documents. The idea of descriptive markup is simple and
powerful, and in fact has proved to be a basic requirement for many
advanced information processing applications.

Unfortunately, the adoption of SGML has proved surprisingly
difficult, expensive and slow, given that the underlying ideas are
simple and self-evidently good. In particular, there is very little use
of SGML on the World-Wide Web, which is the world's most popular
electronic information delivery mechanism. Some of the perceived
reasons have included:

The SGML standard itself is large, complex, and difficult to
understand.

The standard specifies several optional and advanced markup
features, some of which are almost never implemented.

Some of the features of SGML have proven counter-productive in
practical use.

Practical use of SGML requires learning several other
languages, including the language used to write DTDs, various
stylesheeting and formatting languages, and the SGML/Open Entity
Catalog language.

The design of SGML takes little account of the contemporary theory
of formal languages and finite automata. One practical result is that
SGML parsers are unable to make use of some advanced tools and
techniques made possible by that theory. Consequently, they are large
and complex pieces of computer software; as such they (a) suffer from
reliability problems, (b) have in practice proven difficult to integrate
into applications, and (c) change slowly in response to advances in
software and document processing technology.

Nonetheless, there remains a consensus that SGML's basic design
partition into entities, elements, and attributes is correct and useful.
One result is a common tendency, in strategic projects involving SGML,
to avoid using many advanced features and operate within the bounds of a
highly restricted subset. This approach has generally met with success.
However, this restricted subset has been re-invented by each successive
group that has attacked the problem.

The SGML standard itself identifies two subsets of its features,
intended to simplify implementation: Minimal SGML (defined in ISO 8879,
clause 15.1.2) and Basic SGML (ISO 8879, 15.1.1). These have not
had any practical significance, however, both because the
choice of SGML features they include is not a happy one and because they
have no free-standing definition, which means they cannot be implemented
by anyone who has not first studied and understood the full text of ISO
8879.

There has been informal discussion for years on the subject of a
further-simplified version of the standard. In recent times, there have
been a substantial number of formal proposals for such a simplification.
They include:

A lexical analyzer for HTML by Dan Connolly of W3C, as presented in
"A Lexical Analyzer for HTML and Basic SGML: W3C Working Draft
15-Jun-96"
(http://www.w3.org/pub/WWW/TR/WD-sgml-lex); this is a slight
simplification of Basic SGML, with no entity handling.

the Minimal Generalized Markup Language (MGML) defined by Tim Bray
in "MGML -- an SGML Application for Describing Document Markup
Languages", unpublished draft paper for SGML '96
(http://www.textuality.com/mgml/index.html). MGML is unique among
the contributions in that it proposes using instance syntax for markup
declarations.

Normalised SGML (NSGML), an invention of Henry Thompson, David
McKelvie, and Steve Finch, presented in "The Normalised SGML Library
(NSL)", NSL Version 1.4.4, Documentation version Fri Aug 2 14:13:40 BST
1996
(http://www.ltg.ed.ac.uk/corpora/nsldoc/nsldoc.html). NSGML includes
not just a language definition but a suite of software modules for
parsing and handling documents in an efficient pipelined fashion.

the TEI Interchange Format defined in Association for Computers and
the Humanities (ACH), Association for Computational Linguistics (ACL),
and Association for Literary and Linguistic Computing (ALLC), Guidelines
for Electronic Text Encoding and Interchange (TEI P3), edited by C. M.
Sperberg-McQueen and Lou Burnard (Chicago, Oxford: Text Encoding
Initiative, 1994). The portions of the TEI Guidelines relevant to the
interchange format have been extracted and are available separately at
http://www-tei.uic.edu/orgs/tei/ml/tif.html.

These simplified application profiles of SGML all take advantage of
the fact that SGML exhibits an extreme case of the `80-20 syndrome';
that is to say, 80% of the benefit is gained by applying only 20% of the
machinery. The W3C SGML Activity has formalized the definition of a
useful subset in the form of the Extensible Markup Language, or XML.

Structure, Membership, and Mechanisms

The current work was initiated by Jon Bosak of Sun Microsystems, who,
in co-operation with Tim Berners-Lee and Dan Connolly of the
World-Wide Web Consortium, brought about the formation of the Consortium's
SGML Editorial Review Board and Working Group, which labor under the
unwieldy acronyms W3C SGML ERB and W3C SGML WG. The mandate for this
effort may be found at
http://www.w3.org/pub/WWW/MarkUp/SGML/Activity; it includes SGML
simplification and work on hyperlink semantics and display processing
(presumably via a DSSSL profile). This paper describes the SGML
simplification work.

The work is co-ordinated by the Editorial Review Board. Its members
are: Jon Bosak (Sun, Chair), Tim Bray (Textuality, XML Co-Editor),
James Clark (Independent, Technical Lead), Steve DeRose (EBT), Dave
Hollander (HP), Eliot Kimber (Passage), Tom Magliery (NCSA), Eve Maler
(ArborText), Jean Paoli (Microsoft), Peter Sharpe (SoftQuad), and
Michael Sperberg-McQueen (University of Illinois at Chicago, XML
Co-Editor); Dan Connolly serves as liaison with W3C. The main functions
of the ERB are to steer the design and discussion activities, and to
resolve issues by voting. There is a well-defined voting procedure
designed to maximize the chances of reaching consensus and to exercise
majority rule rapidly when consensus is not possible.

The main work is done in the Working Group, which has over 60
members, including those of the ERB. The Working Group provides
technical input, design proposals, and design critiques. It includes
many people who have published significant papers on SGML or played a
visible role in the design, evolution, and implementation of SGML; in
particular Charles Goldfarb and James Mason from WG8. As a result of
this overlap, it is likely that XML will avoid taking any directions
fundamentally incompatible with the future development of SGML; in fact,
the debate on XML is apt to have some influence on the next SGML
revision.

Design Goals

Prior to the commencement of discussion in the WG, the ERB developed
a `strawman' set of design goals to guide this discussion. While these
remain open for challenge and revision, they have been fairly stable and
thus presumably represent a reasonably large-scale consensus among those
involved in this work. The design goals are:

1. XML shall be straightforwardly usable over the Internet.
This does not mean users can feed it to, for example, the
Netscape of today, but that the design will have regard at all times to
the needs of distributed applications working on large-scale networks.

2. XML shall support a wide variety of applications.
No design elements shall be adopted which would impair the usability of
XML documents in other contexts such as print or CD-ROM, or in
applications other than network browsing, such as validating editors,
batch validators, simple filters which understand XML document
structure, normalizers, formatting engines, translators to render XML
documents into other languages, and specialized browsers for specialized
markup.

3. XML shall be compatible with SGML.
That is: (1) Existing SGML tools will be able to read and write XML data.
(2) XML document instances are SGML document instances as they are,
without changes. (3) For any XML document, a DTD can be generated such
that SGML will produce the same parse as would an XML processor. (4)
XML should have essentially the same expressive power as SGML.
Note: (1) and (2) describe our goal in its ideal form. If this
goal is not achievable in its fullest form, then we may back out to a
weaker form: it shall be simple to transform XML documents into
equivalent SGML documents, and vice versa. Our intention, however, is
to bite the bullet and ensure if we can that no transformation is needed
to allow SGML tools to read and write XML document instances.
(3) and (4) indicate our intentions accurately, but it is not yet clear
how best to formalize and explain the phrase "the same parse", or the
phrase "essentially the same expressive power". These remain open
questions; see point 8 also.

4. It shall be easy to write programs which process XML
documents.
In particular, it shall be straightforwardly possible to construct
useful XML applications which do not read, or need to read, the DTD of
the XML document.
Note: For this purpose, easy means that the holder of a
bachelor's degree in computer science should be able to construct basic
processing (parsing, if not validating) machinery in less than a week,
and that the major difficulty in the application should be the
application-specific functions; XML should not add to the inherent
difficulties of writing such applications.

5. The number of optional features in XML is to be kept to the
absolute minimum, ideally zero.
As a result of this, any XML document has a high probability of being
handled successfully by any XML processor.

6. XML documents should be human-legible and reasonably
clear.

7. The XML design should be prepared quickly.
A first draft of the XML design should be ready for distribution and
comment by end of 1996; a version should be ready for production use by
the end of March 1997.

8. The design of XML shall be formal and concise.
XML should be simple and easy for implementors to grasp; its reference
documentation should not exceed 20 pages, which should contain mostly
formal grammar and very little normative text, if any. Note:
normative text is not the same as descriptive or explanatory text.
XML shall specify clearly what characteristics of the input must be
represented in the parse tree of an XML document, and what
characteristics need not be captured by XML processors. This means the
property sets `significant' in an XML application will be defined both
formally and informally. Which properties are significant and which are
insignificant remains an open question.

9. XML documents shall be easy to create.
It should be a straightforward task (though possibly labor-intensive) to
create valid XML documents by hand (i.e. without a validating authoring
tool). It should be a straightforward task (though possibly
labor-intensive) to create a validating XML authoring system.

10. Terseness is of minimal importance.
Minimizing keystrokes is not deemed important in achieving any of the
above goals, but other things being equal a concise notation should be
preferred to a verbose one.

A Snapshot of XML

At the time this paper is submitted, an initial public draft of the
XML specification has been distributed, but like all working drafts it
is subject to change. The broad outlines of XML, however, are clear
enough to be summarized here.

XML omits a large number of SGML features often left unused in
practice: DATATAG, OMITTAG, RANK, LINK, CONCUR, SHORTREF, SUBDOC, and
FORMAL are all dropped. SHORTTAG, which defines several ways in which
SGML documents may abbreviate their tags, is entirely disallowed except
that attributes need not be specified if a default is specified for them
when they are declared.

Most of these features are rarely used in any case; the most visible
change is the absolute abandonment of SGML techniques of
markup minimization. In XML documents, all tags are always present in
full (except that attributes may be omitted if they have their default
values). This will make no difference to those who use SGML or XML
editors; others may choose to write their documents using standard SGML
tag omission and then run the document through a normalizer like James
Clark's spam.

In order to ensure that XML processors can, under certain
circumstances, skip the document's DTD and still process the document
correctly, empty elements (like the TEI's PTR element or like the HTML
BR element) are required to be self-identifying: instead of the form
<e>, they must take the form <e/>. This
simple innovation radically reduces the complexity of parsing XML
documents.
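
The effect of self-identifying empty elements can be illustrated with a
small sketch. The helper names classify_tags and is_balanced and the
simplified tag pattern are assumptions made for this illustration, not
part of the XML draft; comments, processing instructions, and declarations
are ignored. The point is only that each tag can be classified as a
start-tag, end-tag, or empty element from its own spelling, with no DTD
consulted.

```python
import re

# Simplified, hypothetical tag pattern: optional '/', a name, any attribute
# text, an optional trailing '/', all enclosed in '<' and '>'.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)([^<>]*?)(/?)>")


def classify_tags(text):
    """Yield (kind, name) per tag; kind is 'start', 'end', or 'empty'."""
    for match in TAG.finditer(text):
        closing, name, _attrs, self_closing = match.groups()
        if closing:
            yield ("end", name)
        elif self_closing:
            yield ("empty", name)
        else:
            yield ("start", name)


def is_balanced(text):
    """Check that every start-tag is closed by a matching end-tag."""
    open_elements = []
    for kind, name in classify_tags(text):
        if kind == "start":
            open_elements.append(name)
        elif kind == "end":
            if not open_elements or open_elements.pop() != name:
                return False
        # 'empty' elements open and close at once, so the stack is untouched.
    return not open_elements
```

With this sketch, is_balanced accepts '<p>one<br/>two</p>' but rejects
'<p>one<br>two</p>': without the DTD there is no way to know that the
second br was meant to be empty, so it is treated as an unclosed
start-tag.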

Comments and processing instructions are retained; XML uses a number
of specialized processing instructions of its own as declarations.
Comments are simplified, however, to try to minimize user errors.

In order to ensure the widest possible use, XML requires conforming
processors to support usage of the characters from ISO 10646 (Unicode)
in both markup and data. For the convenience of those still working
without Unicode editors (currently the majority of users), processors
are encouraged to accept other character-set encodings as well.

XML also restricts, in some ways, the normal SGML syntax for
declaring elements and attributes. In particular, the AND connector is
dropped, inclusion and exclusion exceptions are dropped, and the set
of data types for attributes is simplified and rationalized
(within the limits set by the design goal of compatibility with
SGML).
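
To give a sense of what dropping the AND connector means for DTD authors,
the following sketch rewrites an SGML AND group such as (a & b), which
permits its members in any order, using only the sequence (,) and
alternative (|) connectors. The function name expand_and_group is
hypothetical, and this rewriting is merely one possible approach, not
something prescribed by the XML draft.

```python
def expand_and_group(names):
    """Rewrite the AND group (n1 & n2 & ...) using only ',' and '|'.

    Each branch fixes one element as the first and recurses on the rest,
    so no two alternatives at the same level begin with the same element.
    """
    names = list(names)
    if len(names) == 1:
        return names[0]
    branches = []
    for index, first in enumerate(names):
        rest = names[:index] + names[index + 1:]
        branches.append("(" + first + ", " + expand_and_group(rest) + ")")
    return "(" + " | ".join(branches) + ")"
```

For two elements, expand_and_group(["a", "b"]) yields
"((a, b) | (b, a))". The expansion grows factorially with the number of
elements, so this rewriting is practical only for small groups.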

Conditional marked sections are allowed in the DTD, but not in the
document instance. In DTDs, conditional sections allow easy
customization of the DTD; they appear unnecessary in document instances,
since most practitioners agree that variant text is better handled with
specialized elements and style-sheets.
CDATA marked sections, in which markup characters need not be escaped,
are allowed only in the document itself, and only in a restricted
form.
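
As a brief illustration of a CDATA marked section in a document instance,
the sketch below uses the parser in Python's standard library. It assumes
the <![CDATA[ ... ]]> spelling found in XML as later standardized; the
restricted form in the working draft described here may differ in detail.
The function name cdata_text is hypothetical.

```python
import xml.etree.ElementTree as ET


def cdata_text(document):
    """Parse a small document and return the root element's text."""
    return ET.fromstring(document).text
```

Given "<p><![CDATA[a < b & c]]></p>", cdata_text returns the string
"a < b & c": inside the marked section the characters < and & are taken
as data and need not be escaped.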

In the interests of simplicity, XML abandons SGML's notion of
abstract syntax and defines only a single concrete syntax,
modeled on SGML's reference concrete syntax but extended to
handle polyglot documents and large documents better. In XML, all tags
will be enclosed in < and >, all entity
references between & and ;, and all
attribute values quoted. Unlike SGML, XML provides no mechanism for
changing the default delimiters.
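
Because the delimiters are fixed, simple pattern matching goes a long way.
The sketch below finds entity references (opened by & and closed by a
semicolon) and reports attributes whose values are not quoted. The
patterns and the names entity_names and unquoted_attributes are
assumptions made for this illustration, and the name syntax is
deliberately simplified; it is not the grammar of the XML draft.

```python
import re

# Hypothetical, simplified patterns for illustration only.
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")
ATTRIBUTE = re.compile(
    r"""([A-Za-z_][\w.\-]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)"""
)


def entity_names(text):
    """Return the names of all entity references found in the text."""
    return ENTITY_REF.findall(text)


def unquoted_attributes(tag):
    """Return names of attributes in a tag whose values lack quotes."""
    return [
        name
        for name, value in ATTRIBUTE.findall(tag)
        if value[0] not in "\"'"
    ]
```

For example, unquoted_attributes('<a href="x.html" width=10>') returns
["width"], flagging a form that full SGML can accept under SHORTTAG but
that XML, as described above, does not.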

Future Plans

At the time this paper is submitted, it is the intention of
the ERB to revise the draft XML spec and to turn, in early 1997, to the
topic of hyperlink typology. In late 1997, the third phase of the
project will see the specification of a subset of the Document Style
Semantics and Specification Language (DSSSL) intended for use in network
browsers (DSSSL-Online). The XML specification may change as a result
of work in the two later phases; when it appears stable, steps will be
taken to move it through the normal W3C processes to make it a technical
report, then a proposed recommendation, and finally a specification of
recommended practice.