XML in C

Status

Personal thoughts on what the XML syntax should
be. Compare with my earlier notes.

Abstract

XML is a base-language for expressing arbitrary
structured data in text form. It consists of several modules: core
syntax, meta-syntax, linking, style-bindings, and maybe more. Of these
only the core syntax is common to all XML applications. Applications
can choose to omit the other modules if they don't need them.

This text describes one possible core syntax, using flex/bison
specifications. The most important additions relative to the XML-lang draft of 30 June
1997 include: automatically ignored newlines, attribute defaults,
and boolean attributes.

Why this specification?

Split the core syntax and the meta-syntax

The core syntax of XML specifies a very general
language within which all XML applications have to stay. Most
applications will want to restrict this language, and XML provides an
optional module with a meta-syntax for writing those restrictions. The
restrictions are often referred to as a DTD, Document Type Definition,
after SGML, where this term was introduced.

In the XML-lang
draft of 30 June 1997 the core syntax and the meta-syntax are
combined into a single draft. The linking module and the planned
style-sheet binding modules are kept in separate drafts. There are good
reasons for keeping the core syntax and the meta-syntax separate:

For consistency.

Because the current meta-syntax is not very good and there are
other proposals in separate documents (e.g., XML-data).

Because it makes the draft easier to read.

To allow one to be changed without the other.

"RE delenda est"

This Latin phrase means "RE is to be deleted." It
refers to a rule from SGML that specifies in which contexts an "RE"
(Record-End, SGML-speak for a newline) is to be ignored. The precise
rules in SGML are very complicated, but in general a newline is
ignored after a start tag and before an end tag. This allows SGML
documents to be somewhat pretty-printed, by starting tags on a new
line.

XML also has start and end tags, but none of the exceptions of
SGML, so the "RE delenda" rule can be applied without any
problems.

In fact, looking at how people write XML and HTML, the rule could be
generalized a bit, to say that a newline before any "<" and after
any ">" is to be ignored, whether that "<" is part of a start
tag or not.

There is a lot of confusion over this issue. The first applications
that are based on XML seem to assume that not only one newline is
ignored, but that all whitespace, even multiple lines, is to be
ignored. While this allows even more "pretty printing", it also means
that a lot of meaningful spaces have to be escaped (as &#32;).

The 30 June draft of XML on the other hand says that no
whitespace is to be ignored, not even a single newline.

Most people meanwhile seem to agree that ignoring one newline is a
good compromise. It allows tags to be put on separate lines, while not
requiring meaningful whitespace to be escaped. That is therefore what
the syntax below describes.
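
The generalized rule (drop one newline directly before any "<" and
one directly after any ">") can be sketched in plain C as follows.
This is only an illustration, not the scanner itself: the function
names are made up for this example, and the sketch works on raw text,
so it does not know about strings, comments, or other constructs
inside mark-up where a ">" or "<" might not be a delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the newline at s: 2 for CR LF, 1 for a lone CR or LF,
 * 0 if s does not start with a newline. (Hypothetical helper.) */
static size_t newline_len(const char *s)
{
    if (strncmp(s, "\r\n", 2) == 0)
        return 2;
    if (*s == '\r' || *s == '\n')
        return 1;
    return 0;
}

/* Copy in to out, dropping one newline directly before each "<" and
 * one directly after each ">". All other whitespace is kept.
 * out must have room for strlen(in) + 1 bytes. */
void strip_markup_newlines(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        size_t n = newline_len(in);
        if (n > 0 && in[n] == '<') {
            in += n;                /* newline just before "<" */
            continue;
        }
        *out++ = *in;
        if (*in++ == '>')
            in += newline_len(in);  /* newline just after ">" */
    }
    *out = '\0';
}
```

With this sketch, the input "<p>", newline, "Hello", newline, "</p>",
newline comes out as "<p>Hello</p>", while a second newline after
"<p>" would be kept as data.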

Default attributes

Especially if an application uses the linking module,
it will benefit a lot from being able to specify defaults for
attributes. The 30 June draft relies on the meta-syntax to provide
default attribute values. This is not a good idea, for several
reasons:

The meta-syntax is not very good and may change.

Many applications that could benefit from attribute defaults
have no use for the rest of the meta-syntax (or cannot afford the cost
and complexity of parsing the meta-syntax).

Restricting the syntax and setting defaults are logically two
very different things and should not be mixed so easily.

The syntax below therefore includes an attribute defaulting
mechanism that is part of the core syntax.

Boolean attributes

All attributes in XML are by default string valued,
although the meta-syntax should be able to restrict that. There are
different proposals for doing that. One interesting one is Tim
Bray's proposal.

But there is one very simple type that is useful in almost all
applications and that can be added to the core syntax without
complicating it, and that is booleans. The syntax below therefore
includes boolean attributes as well as string-valued ones.

The code

The code is in two parts: a flex tokenizer and a
bison grammar. Also included are a test program and a makefile. Below
is some documentation for each of them. To download all of them
together, download this tarfile.

Flex scanner

The actual scanner code is very short. After all, there are only 12
tokens to be recognized. The code relies on a few macros that keep the
code clear:

nl

A newline can be either a carriage return, a line feed, or
both.

ws

Whitespace is any sequence of one or more spaces, tabs, carriage
returns or line feeds.

open

The rule that a newline is to be ignored just before a "<" is
expressed by this macro, which combines an optional newline and a
"<".

close

Same for the delimiter that signals the end of mark-up: a ">"
optionally followed by a newline.

namestart

This represents all the characters that can start a name (element
name, attribute name). This code doesn't try to deal with character
encodings (most 8-bit encodings, as well as UTF-8 should work fine,
though), and so it simply accepts all non-ASCII characters as name
start characters. This is probably too lenient, but since all the
delimiters in XML are from the ASCII set, it doesn't really
matter.

namechar

All the characters that are allowed in a name, after the first
character. The same leniency as for namestart above.
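
As a rough illustration of this leniency, the two character classes
could be written as C predicates. The exact ASCII sets used here
(letters and "_" to start a name; additionally digits, "." and "-"
inside a name) are an assumption for the sake of the example, as are
the function names; only the treatment of non-ASCII bytes is taken
from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed name-start set: ASCII letters, "_", and every byte >= 0x80
 * (so bytes of UTF-8 or 8-bit encoded characters pass unchecked). */
static bool is_namestart(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || c == '_' || c >= 0x80;
}

/* Assumed name-char set: name-start characters plus digits, "." and "-". */
static bool is_namechar(unsigned char c)
{
    return is_namestart(c) || (c >= '0' && c <= '9')
        || c == '.' || c == '-';
}

/* True if s is one namestart character followed by zero or more
 * namechar characters. The empty string is not a name. */
bool is_name(const char *s)
{
    assert(s != NULL);
    if (!is_namestart((unsigned char)s[0]))
        return false;
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!is_namechar((unsigned char)s[i]))
            return false;
    return true;
}
```

Because every delimiter is an ASCII character that is excluded from
both sets, a name can never swallow a "<", ">", "=" or quote, which
is why the leniency is harmless for tokenizing.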

data

The data in an XML file, i.e., the characters between a start and
end tag, are matched by this regular expression, which accepts all
characters except a "<", and only accepts a newline if it is not
immediately followed by a "<". There may be escaped characters in
this data, of the form "&#[0-9]+;" or "&#x[0-9a-f]+;". This
program doesn't expand them. To do that would require implementing the
character encodings and the program currently doesn't do
that.
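
For illustration, the two escape forms can be recognized (but, like
the program, not expanded) with a small C function. This is a sketch
under the assumption that only lowercase hexadecimal digits are
allowed, as in the pattern quoted above; the name charref_len is
invented for this example.

```c
#include <assert.h>
#include <stddef.h>

static int is_digit(char c) { return c >= '0' && c <= '9'; }
static int is_hex(char c)   { return is_digit(c) || (c >= 'a' && c <= 'f'); }

/* If s starts with "&#[0-9]+;" or "&#x[0-9a-f]+;", return the length
 * of that escape; otherwise return 0. The value is not decoded. */
size_t charref_len(const char *s)
{
    assert(s != NULL);
    if (s[0] != '&' || s[1] != '#')
        return 0;

    size_t i = 2;
    int (*accept)(char) = is_digit;
    if (s[i] == 'x') {              /* hexadecimal form */
        accept = is_hex;
        i++;
    }

    size_t first = i;
    while (accept(s[i]))
        i++;
    if (i == first || s[i] != ';')  /* need one digit and a ";" */
        return 0;
    return i + 1;
}
```

Turning the matched digits into an actual character would still
require choosing an output encoding, which is the part the program
leaves out.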

string

A string is something between double or single quotes. Like data,
it may include escaped characters.

The scanner works in one of two modes (start
conditions). The INITIAL mode ignores white space and recognizes
names, strings, and most of the other tokens. It is active as the
program starts and every time the tokenizer is in between "<" and
">".

The CONTENT mode is entered after the ">" of any start or end
tag. In this mode only data, "<", comments, and the start of an
attribute defaults declaration are recognized.
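
The switching between the two modes can be pictured as a very small
state machine. The C sketch below is a simplification and not the
flex code: it only tracks ">" and "<", ignores strings, comments,
attribute defaults declarations and the newline rule, and the names
next_mode and extract_data are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum mode { INITIAL, CONTENT };

/* Mode after reading c: ">" ends mark-up and enters CONTENT,
 * "<" starts mark-up and returns to INITIAL. */
enum mode next_mode(enum mode m, char c)
{
    if (m == INITIAL && c == '>')
        return CONTENT;
    if (m == CONTENT && c == '<')
        return INITIAL;
    return m;
}

/* Copy to out only the characters that are read as data, i.e. while
 * the machine is in CONTENT mode and stays there. Returns the number
 * of characters copied. out needs room for strlen(in) + 1 bytes. */
size_t extract_data(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    enum mode m = INITIAL;          /* the scanner starts in INITIAL */
    size_t n = 0;
    for (; *in != '\0'; in++) {
        enum mode next = next_mode(m, *in);
        if (m == CONTENT && next == CONTENT)
            out[n++] = *in;
        m = next;
    }
    out[n] = '\0';
    return n;
}
```

For the input "<p>Hello <b>world</b></p>" this copies the 11
characters "Hello world"; everything between "<" and ">" is handled
in INITIAL mode and is not treated as data.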