XML: A Second Chance for Web Markup

HTML gave up a lot of SGML's power. XML brings back the power but keeps it simple

By Neil Randall

November 4, 1997

As far back as the sixties, IBM scientists were working on a Generalized Markup Language (GML) for describing documents and their formatting. In 1986 the International Organization for Standardization (ISO) adopted a version of this standard called Standard Generalized Markup Language (SGML). SGML offers a highly sophisticated system for marking up documents so that their appearance is independent of specific software applications. It is big, powerful, filled with options, and well suited for large organizations that need exacting document standards.

But early in the game, it became apparent that SGML's sophistication made the language quite unsuitable for quick and easy Web publishing. For that, we needed a simplified markup system, one at which practically anyone could quickly gain proficiency. Enter HyperText Markup Language (HTML), which is little more than one specific SGML document type, or Document Type Definition (DTD). Because it is so easy to learn and to implement, and because early Web browsers supported it, HTML quickly became the basis of the burgeoning Web. In fact, if SGML had been the Web's essential markup language, the Web probably wouldn't have attained the enormous popularity it now enjoys.

The problem with HTML, however, was that it quickly proved to be too simple. It was superb for the early days of the Web--with text-based documents that featured headings, bulleted lists, and hyperlinks to other documents--but as soon as Web authors started demanding multimedia and page design capabilities, the language started experiencing severe growing pains. Straightforward in-line graphics were fine, but you couldn't do much to place them, so page design suffered. And image maps (images with embedded hyperlinks) created new problems and needed new solutions. Then came blinking text, tables, frames, and dynamic HTML. Every time we turn around, it seems, someone's trying to add something new to HTML, and every time that happens we end up with new incompatibilities and the need for new standards.

Why is this happening? Because, quite simply, HTML isn't extensible. Over the years Microsoft has added tags that work only in Internet Explorer, and Netscape has added tags that work only in Navigator, but you, as a Web author, have no way to add your own.

Undoubtedly, you've experienced the frustration of HTML's limitations as a page layout system, and just as undoubtedly you've eagerly embraced new tags and elements as they've been introduced. But to design serious sites, you keep needing more. Hence Java and JavaScript, hence Active Server Pages, hence all the continuing developments that are making HTML more powerful. Recent HTML developments such as Cascading Style Sheets (CSS) and dynamic HTML offer some of the necessary strength for customizing your Web designs more completely, but these additions simply highlight the growing problem. Full customization of Web page design remains at the whim of the people who make the browsers.

The Web's many developers have recognized for quite some time the irony of all this: Whereas HTML offers no extensibility, its parent system, SGML, is fully extensible. To create a fully customized set of documents in SGML, authors develop a DTD that will control all documents in that set. This is time-consuming and can be extremely complex, but it works. The question, then, is how to capture SGML's extensibility, which serious HTML authors require, without retaining the complexity, which almost nobody wants. In other words, the issue is how to bridge the gap between SGML and HTML.

Enter XML

The answer is Extensible Markup Language, better known as XML. Proposed in late 1996 to the World Wide Web Consortium (W3C), XML currently exists as a pair of draft documents at www.w3.org/pub/WWW/MarkUp/SGML/Activity. Its intention is to offer some of SGML's power while avoiding the language's complexity, enabling Web authors to produce fully customized documents with a high degree of design consistency. It can offer these things because XML is SGML. Whereas HTML is merely one SGML document type, XML is a simplified version of the parent language itself.

XML is more than a markup language. Like SGML, it's a metalanguage, or a language that allows you to describe languages. HTML and other markup languages let you define how the information in a document will appear in an application that displays it, but SGML and XML let you define the markup language itself. In this sense, XML can actually control HTML documents. Think of HTML as a description system and XML (like SGML) as a system for defining description systems and you get the idea. One benefit of SGML is that you can use it to define and control an unlimited variety of description systems, HTML being only one of these, and XML offers this advantage as well.

Like HTML and SGML, XML will require viewing software that will interpret it according to the author's instructions. In all likelihood, future versions of Microsoft Internet Explorer and Netscape Navigator will include XML interpretation, but in the meantime, you might want to check out JUMBO, an experimental browser originally designed to display chemical industry documents. JUMBO is available at www.venus.co.uk/omf/cml/, where it displays an SGML/XML implementation called Chemical Markup Language, or CML (see Figure 1).

One of XML's greatest strengths is that it lets entire industries, academic disciplines, and professional organizations develop sets of DTDs that will standardize the presentation of information within those disciplines. To an extent this works against the much-ballyhooed universality of the Web and HTML, but if you work in a specialized area, you're probably aware of the need for systems that let you produce documents enabling you to communicate efficiently with your colleagues. Specialists often need to display formulas, hierarchies, mathematical and scientific notations, and other elements, all within well-defined parameters. SGML's DTD system lets you do so, and XML picks up on the DTD system without all the complexity.

One example of XML's advantage over HTML lies in its linking possibilities. HTML's linking, even though it is the basis of the entire Web, is extremely limited. You can link to internal or external documents, but the links are unidirectional and always connected to a hard-coded address. That's why you get so many "Document not found" errors.

HTML's redirection capabilities--which automatically forward the browser to another location--take care of some of these issues, but the linking portion of the proposed XML standard (www.w3.org/pub/WWW/TR/WDxml-link-970406.html) takes linking much further. With XML, Web authors can establish multidirectional links, which not only link to a destination location but also provide information about links to the current location from other locations. As an example, an author can provide a link that will take users to a particular resource; a cross-reference link will then show all the links that lead to that resource, and the user can follow these links to their sources. XML authors can also specify what happens when a link is found, such as whether or not the link will be followed automatically, and even whether or not the linked document will be displayed within the original document. As XML linking options find their way into general use, the Web will become a much more capable hypermedia system.
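
To make the cross-reference idea concrete, here is a minimal sketch in Python. It does not use the syntax of the draft XML linking proposal; the links element, the link element, and its from and to attributes are names invented for this illustration, as are the file names. It simply shows how a table of links, once recorded as data, can answer the question "what points here?" as easily as "where does this go?"

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical link table: element and attribute names are invented
# for this sketch and are not taken from the XML linking draft.
LINKS = """
<links>
  <link from="intro.xml" to="glossary.xml"/>
  <link from="chapter2.xml" to="glossary.xml"/>
  <link from="intro.xml" to="chapter2.xml"/>
</links>
"""

def backlinks(xml_text):
    """Map each destination to the sorted list of documents that link to it."""
    incoming = defaultdict(list)
    for link in ET.fromstring(xml_text.strip()).iter("link"):
        incoming[link.get("to")].append(link.get("from"))
    return {dest: sorted(sources) for dest, sources in incoming.items()}

table = backlinks(LINKS)
print(table["glossary.xml"])   # ['chapter2.xml', 'intro.xml']
```

A viewer holding such a table could show readers of glossary.xml every document that refers to it, which is something a one-way HTML anchor cannot do by itself.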

Valid and Well-formed

The DTD system is only one method of creating XML documents. DTDs offer the greatest possible flexibility and extensibility, but one of the XML team's design goals has been to eliminate the need for building them. As a result, there are two types of XML document, those with DTDs and those without. Those with DTDs that conform to the SGML standard are called valid files. Documents that exclude DTDs must be well-formed; that is, they must conform to a specific set of standards. Valid files must be well-formed too.

A valid XML document, like a valid SGML document, opens with a Document Type Declaration, written as a <!doctype ...> declaration. In addition, the document might have an XML Declaration before the Document Type Declaration to specify the version of XML in use, but this isn't strictly needed. If present, it takes the form

<?XML Version="1.0"?>

with 1.0 replaced, of course, by whatever version is in effect. The XML version must be available locally or over the Net, and the XML Declaration will state its location.

The Document Type Definition's purpose is to specify the structure for the content of all documents of a certain type; thus the Document Type Declaration represents the core of SGML. It might seem strange or even impossible, therefore, that XML could let you dispense with the <!doctype ...> declaration completely. It does so by demanding that files be well-formed, which lets the viewer interpret them as SGML. Instead of having a DTD, the XML document must follow a series of rules, none of which are difficult for authors to master.

First, a document must begin with a Required Markup Declaration (RMD) stating that the document lacks a DTD (the code is "NONE"). This RMD occurs in the same line as the XML Declaration, in the form

<?XML VERSION="1.0" RMD="NONE"?>

Second, all values for attributes must be enclosed in quotation marks. Third, all elements must have opening and closing tags, unlike some elements in HTML. Other requirements dictate the type of attributes available, as well as some restrictions on the data itself. As long as you adhere to these rules you may omit the DTD, and that simple fact goes a long way toward making XML more accessible than SGML.
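
The second and third rules are easy to test for yourself. The sketch below uses the XML parser in Python's standard library, which follows the XML 1.0 specification as later finalized rather than the 1997 draft, so the draft-era XML Declaration with its RMD is left out. The memo vocabulary and the is_well_formed helper are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Quoted attribute value, every element closed.
good = '<memo priority="high"><to>Staff</to><body>Meeting at noon.</body></memo>'

# Attribute value without quotation marks (tolerated in HTML, not in XML).
unquoted = '<memo priority=high><to>Staff</to></memo>'

# <to> is opened but never closed, much like a stray <p> or <li> in HTML.
unclosed = '<memo><to>Staff<body>Meeting at noon.</body></memo>'

for label, sample in [("good", good), ("unquoted", unquoted), ("unclosed", unclosed)]:
    print(label, is_well_formed(sample))
# good True
# unquoted False
# unclosed False
```

Note that the parser needs no DTD to reach these verdicts; the rules for well-formed documents are enough for it to work out the structure.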

So what about HTML? Are your current HTML documents invalid or poorly formed or both? Not necessarily. Remember that HTML is simply one SGML DTD; as long as a document conforms to the HTML 3.2 standard, it will already be largely well-formed. All you have to do is bring it in line with the XML rules for well-formed documents, chiefly by quoting every attribute value and closing every element, and you're set. You can also run your HTML files through an SGML-aware authoring tool such as SoftQuad's HoTMetaL Pro (www.sq.com/) or turn to parsing software such as Lark (www.textuality.com/Lark/).

Types of XML Applications

Unless XML offers the ability to produce new kinds of applications, it won't be of great value to the Web authoring community. Much of the early development work is still in progress, but XML appears to be extremely well suited to several advanced application types.

First, because of its data structures, XML provides a good way to develop applications that let the user view data from various perspectives. Such applications can make documents more useful by sorting data according to various criteria (by name, by number, and so on) or by providing a way to toggle different information on and off. For instance, a listing that contains program information for all flavors of Windows could display only the user's version at the click of a mouse.
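
As a rough sketch of such an application, the Python fragment below filters and sorts a small program listing. The programs and program elements, their name, platform, and size attributes, and the sample entries are all invented for this illustration; a real listing would follow whatever vocabulary its authors defined.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: element names, attributes, and entries are
# made up for this sketch.
LISTING = """
<programs>
  <program name="Writer" platform="Windows 95" size="120"/>
  <program name="Painter" platform="Windows NT" size="300"/>
  <program name="Backup" platform="Windows 95" size="80"/>
  <program name="Writer" platform="Windows 3.1" size="95"/>
</programs>
"""

root = ET.fromstring(LISTING.strip())

def programs_for(root, platform):
    """Names of the programs available for one platform, sorted by name."""
    return sorted(p.get("name") for p in root.iter("program")
                  if p.get("platform") == platform)

def programs_by_size(root):
    """(name, platform) pairs for every program, smallest first."""
    ordered = sorted(root.iter("program"), key=lambda p: int(p.get("size")))
    return [(p.get("name"), p.get("platform")) for p in ordered]

print(programs_for(root, "Windows 95"))   # ['Backup', 'Writer']
print(programs_by_size(root)[0])          # ('Backup', 'Windows 95')
```

The same document serves both views. Because the markup says what each piece of data is rather than how it should look, the viewing software is free to rearrange it.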

XML can also be applied to an intranet (a site restricted to users inside an organization) or an extranet (a site restricted to select users outside an organization). If an organization needs to present extensive amounts of data in particular formats, complete with strong database linkages, XML offers a solution. From an extranet standpoint, organizations can make their information available to clients through XML browsers, and entire industries can band together to produce an XML standard for information presentation.

XML is also much better than HTML at drawing data from heterogeneous database types and displaying that data in a consistent format. Of course, SGML already makes all of this possible, but XML is easier to use and faster to implement.

The first high-profile application of XML will be Microsoft's Channel Definition Format (CDF), included in Internet Explorer 4.0. Microsoft has based CDF on XML standards, and you can see the DTD at www.microsoft.com/standards/cdf.htm. This DTD shows the value of XML quite clearly: Microsoft has defined the XML elements specifically for push technology, with element names such as channel, item, schedule, and tracking. The push providers need only fit their data to the appropriate element types and their applications will be consistent with IE 4.0's display capabilities. This is the kind of standardization that just can't be achieved with HTML.
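
To get a feel for what fitting data to element types means in practice, consider the sketch below. It borrows the element names channel, item, and schedule mentioned above, but the title element, the href and interval attributes, and the example.com addresses are assumptions made for this illustration; it is not a reproduction of Microsoft's actual CDF DTD.

```python
import xml.etree.ElementTree as ET

# Illustrative channel description only. Attribute names, the <title>
# element, and the URLs are invented; consult the real CDF DTD for
# the actual vocabulary.
CHANNEL = """
<channel href="http://www.example.com/news/">
  <title>Example News</title>
  <schedule interval="daily"/>
  <item href="http://www.example.com/news/today.html">
    <title>Today's Headlines</title>
  </item>
  <item href="http://www.example.com/news/sports.html">
    <title>Sports</title>
  </item>
</channel>
"""

def channel_items(xml_text):
    """Return (title, address) pairs for each item directly under the channel."""
    root = ET.fromstring(xml_text.strip())
    return [(item.findtext("title"), item.get("href"))
            for item in root.findall("item")]

for title, address in channel_items(CHANNEL):
    print(title, address)
```

Any software that knows the agreed element names can pull out the same items in the same way, whoever published the channel.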

Neil Randall is the author of The Soul of the Internet (ITCP) and coauthor of Special Edition Using Microsoft FrontPage.
Figure 1: The XML browser JUMBO here shows how clicking on a term yields an individual Java window in Chemical Markup Language.