Learn the rules of XML syntax that are stated or implied in the XML 1.0 Recommendation from the W3C. Kenneth Sall introduces a considerable amount of XML terminology, including discussions of parsing, well-formedness, validation, XML document structure, legal XML Names, and CDATA.

This sample chapter is excerpted from XML Family of Specifications: A Practical Guide, by Kenneth Sall.

This chapter is from the book

In this chapter, we cover the rules of XML syntax that are stated or implied
in the XML 1.0 Recommendation from the W3C. A considerable amount of XML
terminology is introduced, including discussions of parsing, well-formedness,
and validation. XML document structure, legal XML Names, and CDATA are also
among the topics. The XML 1.0 specification also discusses rules for Document
Type Definitions (DTDs), which we present in chapter 4. The material in chapters
3 and 4 is closely interrelated.

Elements, Tags, Attributes, and Content

To understand XML syntax, we must first be familiar with several basic terms
from HTML (and SGML) terminology. XML syntax, however, differs in some important
ways from both HTML and SGML, as we'll see.

Elements are the essence of document structure. They represent pieces of
information and may or may not contain nested elements that represent even more
specific information, attributes, and/or textual content. In our employee
directory example from chapter 2 (Listing 2-2), some of the elements were
Employees, Employee, Name, First,
Last, Project, and PhoneNumbers.

Tags are the way elements are indicated or marked up in a document. For each
element, there is typically a start tag that
begins with < (less than) and ends with > (greater
than), and an end tag that begins with </ and ends with
>. Some of the start tags in our example were
<Employees>, <Employee>, <Name>,
and so forth. The corresponding end tags for these elements were
</Employees>, </Employee>, and
</Name>.
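
The pairing of start and end tags can be checked mechanically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the element names come from the employee example, while the data values (Ann, Lee) and the trimmed-down document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down employee directory. Only the element names
# come from the text; the data values are made up for this sketch.
doc = """<Employees>
  <Employee>
    <Name><First>Ann</First><Last>Lee</Last></Name>
  </Employee>
</Employees>"""

# Every start tag has a matching end tag, so the document parses.
root = ET.fromstring(doc)
names = [element.tag for element in root.iter()]  # document order
print(names)

# Removing the </Name> end tag leaves <Name> unclosed, so the parser
# rejects the document instead of guessing where the element ends.
broken = doc.replace("</Name>", "")
try:
    ET.fromstring(broken)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

Unlike many HTML browsers, an XML parser is not expected to recover from a missing end tag; it reports an error and stops.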

If an element has one or more attributes, they must appear between the
< and > delimiters of the start tag. Attributes are
qualifying pieces of information that add detail and further define an instance
of an element. They are typically details that the language designer feels do
not need to be nested elements themselves; the assumption is that the attributes
will generally be accessed less often than the elements that contain them, but
this tends to be application dependent. In our
employee example, the only element that had an attribute was Employee,
and the attribute was sex, with two kinds of instances:

<Employee sex="male">

or

<Employee sex="female">.

Each attribute has a value, the quoted text to the right of the equal sign.
In the preceding examples, the values of the two instances of the sex attribute
are "male" and "female". Although in this case the value is
a single word, values can be any amount of text, enclosed in single or double
quotes. HTML permits attributes that do not require values (e.g., the
selected attribute to denote a default choice in a form, as in
<OPTION selected>), but this so-called attribute
minimization is expressly not permitted in XML.
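
These attribute rules can be tried out with a short sketch, again assuming Python's standard xml.etree.ElementTree parser. The helper name is_well_formed and the element content "Red" are hypothetical, chosen only for this illustration.

```python
import xml.etree.ElementTree as ET

# Attribute values may be enclosed in double or single quotes.
double_quoted = ET.fromstring('<Employee sex="female"></Employee>')
single_quoted = ET.fromstring("<Employee sex='male'></Employee>")
sex_values = [double_quoted.get("sex"), single_quoted.get("sex")]
print(sex_values)


def is_well_formed(text):
    """Hypothetical helper: True if the parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style attribute minimization is rejected by an XML parser ...
minimized_ok = is_well_formed("<OPTION selected>Red</OPTION>")
# ... whereas giving the attribute an explicit value is accepted.
explicit_ok = is_well_formed('<OPTION selected="selected">Red</OPTION>')
print(minimized_ok, explicit_ok)
```

Repeating the attribute name as its value, as in the last line, is one common way to carry a minimized HTML attribute over to XML.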

Content is whatever an element contains. Sometimes element content is simply
text. In other cases, elements contain nested elements; the inner (child)
elements are called the content of the outer (parent) element. For example,
in this fragment:

<Address>
  <Street>123 Milky Way</Street>
  <City>Columbia</City>
  <State>MD</State>
  <Zip>20777</Zip>
</Address>

"123 Milky Way" is the text content of the Street element,
"Columbia" is the text content of the City element, and
Street, City, State, and Zip are all nested
element content of the parent Address element. Taken together, the text
content of Address is "123 Milky Way Columbia MD 20777". (The space
preceding each of the last three words is due to newlines, as we'll see.)
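
The difference between text content and element content can be seen in a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The fragment below is an assumption: it is rebuilt from the element names and values quoted in the text, and its line breaks and indentation are a guess.

```python
import xml.etree.ElementTree as ET

# Assumed fragment, reconstructed from the values quoted in the text.
address = ET.fromstring("""<Address>
  <Street>123 Milky Way</Street>
  <City>Columbia</City>
  <State>MD</State>
  <Zip>20777</Zip>
</Address>""")

# Text content of a single child element.
street = address.find("Street").text

# Element content of the parent: its nested child elements, in order.
children = [child.tag for child in address]

# All text inside Address, including the newlines and indentation that
# separate the child elements in the source.
raw = "".join(address.itertext())

# Collapsing each run of whitespace to one space gives the readable form.
normalized = " ".join(raw.split())
print(street, children, normalized)
```

The raw string keeps the newlines between the child elements, which is why whitespace appears between the words once the pieces are joined.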

Notice that the content of Zip is the text string "20777".
Why do we not say that this is a number or, better yet, an example of some zip
code datatype (constrained to either the valid five-digit or
five-plus-four-digit ddddd-dddd values for zip codes)? Because there is nothing
about the Zip element that conveys that its content is numeric! We could,
however, denote the element's datatype explicitly by means of an
attribute:

<Zip type="integer">20777</Zip>

We'll eventually see how an alternative to DTDs called XML Schema makes
data typing easier and far more flexible.
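
A type attribute like the one on Zip means nothing to the parser itself; an application has to interpret it. Here is a minimal sketch of one way to do that, assuming Python's standard xml.etree.ElementTree parser. The CONVERTERS table, the typed_content function, and the "string" default are hypothetical conventions invented for this illustration, not part of XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical application-level convention: map the value of a "type"
# attribute to a Python conversion function. XML itself defines no such
# mapping; the parser always hands back text.
CONVERTERS = {"integer": int, "string": str}


def typed_content(element):
    """Convert an element's text according to its type attribute."""
    declared = element.get("type", "string")  # assume string by default
    return CONVERTERS[declared](element.text)


typed = typed_content(ET.fromstring('<Zip type="integer">20777</Zip>'))
untyped = typed_content(ET.fromstring("<Zip>20777</Zip>"))
print(repr(typed), repr(untyped))
```

Without the attribute, the application has no basis for treating "20777" as anything other than a string.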

Another possibility, called mixed content, in which both text and elements
may appear as the content of a parent element, was illustrated in chapter 2
in the section "Document-Centric vs. Data-Centric." We'll see how to handle
this in chapter 4.
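
As a preview, here is a minimal sketch of how mixed content looks to a program, assuming Python's standard xml.etree.ElementTree parser. The Note element and its text are hypothetical and do not come from the chapter 2 example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content element: text and a child element side by side.
note = ET.fromstring("<Note>Call <Name>Ann</Name> before noon.</Note>")

leading_text = note.text      # text before the first child element
child = note[0]               # the nested Name element
trailing_text = child.tail    # text that follows the child's end tag
full_text = "".join(note.itertext())
print(repr(leading_text), child.tag, repr(trailing_text), repr(full_text))
```

In this particular API, text that follows a child element is stored on that child as its tail; other XML APIs model mixed content differently.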