An Introduction to XML

Every now and then, an idea comes along that in retrospect seems just so simple and obvious that everyone wonders why it hadn't been seen all along. Often when that happens, it turns out that the idea isn't really all that new after all. The Java revolution began by drawing on ideas from all the programming languages that came before it. Now, XML--the Extensible Markup Language--is doing for content what Java did for programming: providing a portable language for describing data.

XML is a simple, common format for representing structured information as text. The concept of XML follows the success of HTML as a universal document presentation format and generalizes it to handle any kind of data. In the process, XML has not only recast HTML but is transforming the way that businesses think about their information. In the context of a world driven more and more by documents and data exchange, XML's time has come.

A Bit of Background

XML and HTML are called markup languages because of the way they add structure to plain-text documents--by surrounding parts of the text with tags that indicate structure or meaning, much as someone with a pen might highlight a sentence and add a note. While HTML predefines a set of tags and their structure, XML is a blank slate in which the author gets to define the tags, the rules, and their meanings.

Both XML and HTML owe their lineage to Standard Generalized Markup Language (SGML)--the mother of all markup languages. SGML has been used in the publishing industry for many years (including at O'Reilly). But it wasn't until the Web captured the world that it came into the mainstream through HTML. HTML started as a very small application of SGML, and if HTML has done anything at all, it has proven that simplicity reigns.

HTML flourished but eventually showed its limitations. Documents using HTML have an unhealthy mix of structural information (such as <head> and <body>) and presentation information (for an egregious example, <blink>). Mixing the model and the user interface in this way limits the usefulness of HTML as a format for data exchange; it's hard for a machine to understand. XML documents consist purely of structure, and it is up to the reader of the document to apply meaning. As we'll see in this chapter, several related languages exist to help interpret and transform XML for presentation or further processing.

Text Versus Binary

When Tim Berners-Lee began postulating the Web back at CERN in the late 1980s, he wanted to organize project information using hypertext. When the Web needed a protocol, HTTP--a simple, text-based client-server protocol--was invented. So what exactly is so enchanting about the idea of plain text? Why, for example, didn't Tim turn to the Microsoft Word format as the basis for Web documents? Surely a binary, non-human-readable format and protocol would be more efficient? Since the Web's inception, there have now been trillions of HTTP transactions. Was it really a good idea for them to use (English) words like "GET" and "POST"?

The answer, as we've all seen, is yes! What humans can read, human developers can work with more easily. There is a time and place for a high level of optimization (and obscurity), but when the goal is universal acceptance and cross-platform portability, simplicity and transparency are paramount. This is the first, fundamental proposition of XML.

A Universal Parser

Using text to exchange data is not exactly a new idea, either, but historically, for every new document format that came along, a new parser would have to be written. A parser is an application that reads a document and understands its formatting conventions, usually enforcing some rules about the content. For example, the Java Properties class has a parser for the standard properties file format (Chapter 10). In our simple spreadsheet in Chapter 17, we wrote a parser capable of understanding basic mathematical expressions. As we've seen, depending on complexity, parsing can be quite tricky.

With XML, we can represent data without having to write this kind of custom parser. This isn't to say that it's reasonable to use XML for everything (e.g., typing math expressions into our spreadsheet), but for the common types of information that we exchange on the Net, we should no longer have to write parsers that deal with basic syntax and string manipulation. In conjunction with document-verifying components (DTDs or XML Schema), much of the complex error checking is also done automatically. This is the second fundamental proposition of XML.

The State of XML

The APIs we'll discuss in this chapter are powerful and well tested. They are being used around the world to build enterprise-scale systems today. Unfortunately, the current slate of XML tools bundled with Java only partially removes the burden of parsing from the developer. Although we have taken a step up from low-level string manipulation to a common, structured document format, the standard tools still generally require the developer to write low-level code to traverse the content and interpret the string data manually. The resulting program remains somewhat fragile, and much of the work can be tedious. The next step, as we'll discuss briefly later in this chapter, is to begin to use generating tools that read a description of an XML document (an XML DTD or Schema) and generate Java classes or bind existing classes to XML data automatically.

The XML APIs

As of Java 1.4, all the basic APIs for working with XML are bundled with Java. This includes the javax.xml standard extension packages for working with Simple API for XML (SAX), Document Object Model (DOM), and Extensible Stylesheet Language (XSL) transforms. If you are using an older version of Java, you can still use all these tools, but you will have to download the packages separately from http://java.sun.com/xml/.

XML and Web Browsers

Microsoft's Internet Explorer web browser was the first to support XML explicitly. If you load an XML document in IE 5.0 or greater, it is displayed as a tree using a special stylesheet. The stylesheet uses dynamic HTML to allow you to collapse and expand nodes while viewing the document. IE also supports basic XSL transformation directly in the browser. We'll talk about XSL later in this chapter.

Netscape 6.x and the latest Mozilla browsers also understand XML content and support the rendering of documents using XSL. At the time of this writing, however, they don't offer a friendly viewer by default. You can use the "view source" option to display an XML document in a nicely formatted way. But in general, if you load an XML document into either of these browsers, or any browser that doesn't explicitly transform it, it simply displays the text of the document with all the tags (structural information) stripped off. This is the prescribed behavior for working with XML.

XML Basics

The basic syntax of XML is extremely simple. If you've worked with HTML, you're already halfway there. As with HTML, XML represents information as text using tags to add structure. A tag is a name sandwiched between less-than (<) and greater-than (>) characters. Unlike HTML, XML tags must always be balanced; in other words, every opening tag must be matched by a closing tag. A closing tag looks just like the opening tag but starts with a less-than sign and a slash (</). An opening tag, closing tag, and any content in between are collectively referred to as an element of the XML document. Elements can contain other elements, but they must be properly nested (all tags started within an element must be closed before the element itself is closed). Elements can also contain plain text or a mixture of elements and text. Comments are enclosed between <!-- and --> markers.
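To make the vocabulary above concrete, here is a minimal sketch that parses a small document with the DOM parser bundled with Java and lists what the root element contains. The sample XML and the names ElementTour and describeChildren are our own inventions for illustration, not part of any API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ElementTour {
    // Hypothetical sample: a root element holding text, a nested element, and a comment.
    static final String XML =
        "<Sentence>This is <Emphasis>mixed</Emphasis> content.<!-- a comment --></Sentence>";

    /** Describes each direct child of the root element as "kind:value". */
    static List<String> describeChildren(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node n = children.item(i);
                switch (n.getNodeType()) {
                    case Node.ELEMENT_NODE: out.add("element:" + n.getNodeName()); break;
                    case Node.TEXT_NODE:    out.add("text:" + n.getNodeValue()); break;
                    case Node.COMMENT_NODE: out.add("comment:" + n.getNodeValue()); break;
                    default:                out.add("other:" + n.getNodeName());
                }
            }
            return out;
        } catch (Exception e) {
            // Malformed XML or a parser setup problem; rethrown unchecked for brevity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : describeChildren(XML)) {
            System.out.println(line);
        }
    }
}
```

The Sentence element here has mixed content: two runs of plain text, one nested Emphasis element, and a comment, each of which the parser hands back as a separate node.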

Attributes

An element can also carry attributes: name-value pairs placed inside the opening tag. The attribute value must always be enclosed in quotes. You can use double (") or single (') quotes. Single quotes are useful if the value contains double quotes.

Attributes are intended to be used for simple, unstructured properties or identifiers associated with the element data. It is always possible to make an attribute into a child element, so strictly speaking there is no real need for attributes. But they often make the XML easier to read and more logical. On a Document element, for example, attributes named type and ID represent metadata about the document. We might expect that a Java class representing the Document would have static identifiers for document types such as LEGAL. On an Image element, an attribute is simply a more compact way of including the filename. As a rule, attributes should be atomic, with no significant internal structure; by contrast, child elements can have arbitrary complexity.
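As a rough illustration of how attributes look to a program, the following sketch reads a few of them through the DOM API. The element names echo the discussion above, but the attribute values, the Note element, and the AttributeDemo class are assumptions made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeDemo {
    // Hypothetical data: the values are invented for illustration.
    // The Note element uses single quotes so its value can contain double quotes.
    static final String XML =
        "<Document type=\"LEGAL\" ID=\"42\">"
      + "<Image name=\"truffle.jpg\"/>"
      + "<Note text='She said \"sign here\"'/>"
      + "</Document>";

    /** Returns the named attribute of the first element with the given tag name. */
    static String attribute(String xml, String tag, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element element = (Element) doc.getElementsByTagName(tag).item(0);
            return element.getAttribute(name);
        } catch (Exception e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(attribute(XML, "Document", "type"));
        System.out.println(attribute(XML, "Image", "name"));
        System.out.println(attribute(XML, "Note", "text"));
    }
}
```

Note that every attribute value arrives as a plain string; any further interpretation (a number, a document type constant) is up to the application.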

XML Documents

An XML document begins with the following header and has one root element:

<?xml version="1.0" encoding="UTF-8"?>
<MyDocument></MyDocument>

The header identifies the version of XML and the character encoding used. The root element is simply the top of the element hierarchy, which can be considered a tree. If you omit this header or have XML text without a single root element, technically what you have is called an XML fragment.

Encoding

The default encoding for an XML document is UTF-8, the ASCII-friendly 8-bit Unicode encoding. But an XML document may specify an encoding using the encoding attribute of the XML header.
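The sketch below shows, under our own assumptions, how the header's encoding attribute guides the parser when it is handed raw bytes. The HeaderDemo class, its helper methods, and the sample text are hypothetical; getXmlVersion() and getXmlEncoding() are standard DOM Level 3 methods on Document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class HeaderDemo {
    /** Parses raw bytes, letting the parser work out the character encoding. */
    static Document parseBytes(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical document that is both declared and stored as ISO-8859-1. */
    static byte[] latin1Document() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<MyDocument>caf\u00e9</MyDocument>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** The same content with no encoding attribute, stored as UTF-8 (the default). */
    static byte[] utf8Document() {
        String xml = "<?xml version=\"1.0\"?><MyDocument>caf\u00e9</MyDocument>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Document doc = parseBytes(latin1Document());
        System.out.println("version:  " + doc.getXmlVersion());
        System.out.println("encoding: " + doc.getXmlEncoding());
        System.out.println("root:     " + doc.getDocumentElement().getTagName());
        System.out.println("text:     " + doc.getDocumentElement().getTextContent());
    }
}
```

Both byte arrays decode to the same text even though the accented character is stored differently in each, because the parser trusts the declaration in one case and falls back to UTF-8 in the other.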

Within an XML document, certain characters are necessarily sacrosanct: for example, the "<" and ">" characters that indicate element tags. When you need to include these in your text, you must encode them. XML provides an escape mechanism called "entities" that allows for encoding special structures. There are five predefined entities in XML, as shown in Table 23-1.

Table 23-1: XML entities

Entity    Encodes
&amp;     & (ampersand)
&lt;      < (less than)
&gt;      > (greater than)
&quot;    " (quotation mark)
&apos;    ' (apostrophe)

An alternative to encoding text in this way is to use a special "unparsed" section of text called a character data (CDATA) section. A CDATA section starts with <![CDATA[ and ends with ]]>, like this:

<![CDATA[ Learning Java, O'Reilly & Associates ]]>

The CDATA section looks a little like a comment, but the data is really part of the document, just opaque to the parser.
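To see that the two approaches are equivalent from the application's point of view, here is a small sketch that writes the same text once with entities and once in a CDATA section and reads both back. The escape() helper is our own simple illustration of Table 23-1, not a library method, and the Title element is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapeDemo {
    /** Replaces the five special characters with their predefined entities. */
    static String escape(String text) {
        return text.replace("&", "&amp;")   // ampersand first, or we would re-escape our own output
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;")
                   .replace("'", "&apos;");
    }

    /** Parses the XML and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = " Learning Java, O'Reilly & Associates ";
        String viaEntities = "<Title>" + escape(raw) + "</Title>";
        String viaCdata = "<Title><![CDATA[" + raw + "]]></Title>";
        System.out.println(viaEntities);
        System.out.println(viaCdata);
        // Both forms hand the original text back to the application.
        System.out.println(textOf(viaEntities).equals(textOf(viaCdata)));
    }
}
```

One caveat with this sketch: text that itself contains the sequence ]]> cannot be placed in a single CDATA section as is, so entity escaping is the more general of the two techniques.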

Namespaces

You've probably seen that HTML has a <body> tag that is used to structure web pages. Suppose for a moment that we are writing XML for a funeral home that also uses the tag <body> for some other, more macabre, purpose. This could be a problem if we want to mix HTML with our mortuary information.

If you consider HTML and the funeral home tags to be languages in this case, the elements (tag names) used in a document are really the vocabulary of those languages. An XML namespace is a way of saying whose dictionary you are using for a given element, allowing us to mix them freely. (Later we'll talk about XML Schema, which enforces the grammar and syntax of the language.)

A namespace is specified with the xmlns attribute, whose value is a Uniform Resource Identifier (URI) that uniquely defines the set (and usually the meaning) of tags from that namespace:

<element xmlns="namespaceURI">

Recall from Chapter 13 that a URI is not necessarily a URL. URIs are more general than URLs. In practical terms, a URI is simply to be treated as a unique string. Often, the URI is, in fact, also a URL for a document describing the namespace, but that is only by convention.

An xmlns namespace attribute can be applied to an element and all its children; this is called a default namespace for the element:

<body xmlns="http://funeral-procedures.org/">

But more often it is desirable to specify namespaces on a tag-by-tag basis. To do this, we can use the xmlns attribute to define a special identifier for the namespace and then use that identifier as a prefix on the tags in question. For example:

<funeral xmlns:fun="http://funeral-procedures.org/">
    <fun:body>...</fun:body>
</funeral>

In the above snippet of XML, we've qualified the body tag with the prefix "fun:" that we defined in the <funeral> tag. In this case, we should qualify the root tag as well, reflexively:

<fun:funeral xmlns:fun="http://funeral-procedures.org/">

In the history of XML, support for namespaces is relatively new. Not all parsers support them. To accommodate this, the XML parser factories that we discuss later have a switch to specify whether you want a parser that understands namespaces:

factory.setNamespaceAware(true);

We'll talk more about parsing in the sections on SAX and DOM later in this chapter.
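As a preview, here is a hedged sketch of what the namespace switch changes. It parses one document twice, with and without namespace awareness, and looks at how the funeral home's body element is reported. The NamespaceDemo class, its helpers, and the sample content are assumptions of ours; setNamespaceAware() and getElementsByTagNameNS() are standard JAXP and DOM calls.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NamespaceDemo {
    static final String FUNERAL_NS = "http://funeral-procedures.org/";

    // Hypothetical document mixing an unqualified HTML-style body with a qualified one.
    static final String XML =
        "<fun:funeral xmlns:fun=\"http://funeral-procedures.org/\">"
      + "<html><body><fun:body>mortuary data</fun:body></body></html>"
      + "</fun:funeral>";

    /** Parses the XML with or without namespace support. */
    static Document parse(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Rethrown unchecked to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    /** Counts elements with the given local name that belong to the given namespace. */
    static int countInNamespace(Document doc, String namespaceUri, String localName) {
        return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
    }

    public static void main(String[] args) {
        Document aware = parse(XML, true);
        Element root = aware.getDocumentElement();
        System.out.println(root.getPrefix() + " / " + root.getLocalName()
            + " / " + root.getNamespaceURI());
        System.out.println("funeral bodies: " + countInNamespace(aware, FUNERAL_NS, "body"));

        // Without namespace support the prefix is just part of the tag name.
        Element plainRoot = parse(XML, false).getDocumentElement();
        System.out.println(plainRoot.getTagName() + " / " + plainRoot.getNamespaceURI());
    }
}
```

With the switch on, the two body elements are distinguishable by namespace URI rather than by the spelling of a prefix, which is the whole point of the mechanism.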

Validation

A document that conforms to the basic rules of XML, with proper encoding and balanced tags, is called a well-formed document. Just because a document is syntactically correct doesn't mean that it makes sense, however. Two related specifications, Document Type Definitions (DTDs) and XML Schema, define ways to provide a grammar for your XML elements. This allows you to create syntactic rules, such as "a City element can appear only once inside an Address element." XML Schema goes further to provide a flexible language for describing the validity of data content of the tags, including both simple and compound data types made of numbers and strings. Although XML Schema is the ultimate solution (it includes data validation and not just rules about elements), it is more theory than practice at present, at least in terms of its integration with Java. (We hope that will change soon.)

A document that is checked against a DTD or XML Schema description and follows the rules is called a valid document. A document can be well-formed without being valid, but not vice versa.
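The difference between well-formed and valid can be sketched with a validating parser and a tiny DTD embedded in the document itself. The DTD below is our own invention, modeled on the "City inside Address" rule mentioned earlier; the ValidationDemo class and its validate() helper are likewise hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationDemo {
    // Hypothetical grammar: an Address holds exactly one Street followed by one City.
    static final String DTD =
        "<!DOCTYPE Address ["
      + " <!ELEMENT Address (Street, City)>"
      + " <!ELEMENT Street (#PCDATA)>"
      + " <!ELEMENT City (#PCDATA)>"
      + "]>";

    /** Returns the validity problems found; an empty list means the document is valid. */
    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check against the DTD, not just well-formedness
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return problems;
        } catch (Exception e) {
            // A fatal error means the text was not even well-formed.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String valid = DTD + "<Address><Street>1 Main St</Street><City>Springfield</City></Address>";
        String twoCities = DTD + "<Address><Street>1 Main St</Street>"
                         + "<City>Springfield</City><City>Shelbyville</City></Address>";
        System.out.println("valid document problems:    " + validate(valid).size());
        System.out.println("two-city document problems: " + validate(twoCities).size());
    }
}
```

The second document is perfectly well-formed, so the parser reads it to the end; it is only the validity check that objects to the extra City element.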

HTML to XHTML

To speak very loosely, we could say that the most popular and widely used form of XML in the world today is HTML. The terminology is loose because HTML is not even well-formed XML. HTML violates XML's rule that every opening tag be closed; the common <p> tag is typically used without a closing tag, for example. HTML attributes also don't require quotes. And XML tags are case-sensitive; <P> and <p> are two different tags in XML, whereas HTML treats them as the same. We could generously say that HTML is "forgiving" with respect to details like this, but as a developer, you know that sloppy syntax results in ambiguity. XHTML is a version of HTML, recast as well-formed XML, that is clear and unambiguous. Fortunately, you don't have to manually clean up all your HTML documents; Tidy (http://tidy.sourceforge.net) is an open source program that automatically converts HTML to XHTML, validates it, and corrects common mistakes.
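The HTML habits described above can be tried against a real XML parser. This is a minimal sketch, assuming only the standard JAXP parser; the WellFormedCheck class, its isWellFormed() helper, and the sample markup are our own.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    /** Returns true if the text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to the console.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the syntax
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<html><body><p>one<p>two</body></html>",      // unclosed <p> tags
            "<a href=index.html>home</a>",                 // unquoted attribute value
            "<P>text</p>",                                 // mismatched case
            "<html><body><p>one</p><p>two</p></body></html>" // XHTML-style markup
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the last sample, written in the XHTML style, gets past the parser; a browser would happily render all four.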