SXML: an XML document as an S-expression

Kirill Lisovskiy, Dmitry Lizorkin

Institute for System Programming RAS,
Moscow State University

This article is the first in the series of the articles dedicated to application
of functional techniques for XML data management, which may be effectively
used for implementation of XML-based digital libraries.

We consider SXML --- a representation of the XML Information Set in the form
of S-expressions (nested lists) and Scheme functional programming language
which naturally processes S-expressions and thus SXML. This approach avoids
the impedance mismatch problem that currently affects many XML-based applications,
including digital libraries.

Textual formats for representation of semistructured data, such as SGML and now
XML, support the representation of a text as an "ordered hierarhy of content objects",
and are very popular as a technology for digital libraries [1].

The "XML revolution" is a fait accompli now, and XML technologies are used in
virtually every area of information technologies now. Digital libraries (DL) are
not an exception here, and XML is generally considered as a key technology for
modern digital libraries [2].

XML data processing languages suggested by the W3 Consortium, such as XPath, XSLT,
XQuery etc., are not general-purpose programming languages and are thus not sufficient
for implementing complete applications. That's why most XML applications are implemented
by means of traditional programming languages, such as C and Java; or scripting
languages, for example, Perl, JavaScript and Python.

An attempt to combine two different languages (for example, XPath and Java) leads
to a problem known as impedance mismatch. Impedance mismatch problem
consists of two aspects:

Different data models. E.g. XPath models an XML document as a tree,
while most general purpose programming languages have no native data types
for a tree.

Different computational paradigms. Say, XSLT is a functional language, while
Java is object-oriented, and Perl is a procedural one.

Impedance mismatch requires complicated converters and APIs (for example, DOM)
to be used for combining such two languages.

Impedance mismatch problem can be reduced and even eliminated using the Scheme
functional programming language for XML data processing:

Nested lists (S-expressions in Scheme) provide a natural representation
for nested XML documents. Scheme represents its code and data as nested lists
of dynamically typed elements. XML document, being a hierarchical structure
of nested XML elements, can be thought of as a hierarchical nested Scheme
list (so-called S-expression).

Scheme is a functional language, as most XML-languages (e.g. XSLT and XQuery)
are. Scheme processes (nested) lists in a recursive manner which can be thought
of as traversing/transforming the document tree.

Scheme, being a Lisp dialect, is a widely recognized scripting language [3].
It is one of the most elegant and compact programming languages practically used:
the description of Scheme standard [4] consists
of 40 pages only. Scheme is a high-level programming language, suitable for fast
prototyping. Moreover, Scheme programs are generally several times more compact
than the equivalent C programs.

This paper introduces SXML --- a representation for an XML document in the form
of an S-expression. XML and SXML textual notations are much alike: informally,
SXML replaces XML start/end tags with opening/closing brackets. SXML, being an
S-expression and thus the main data structure for Scheme programming language,
is easily and naturally processed via Scheme.

Section 2 discusses the relationship between XML
and SXML from the point of view of the XML Information Set W3C recommendation.
Section 3 gives a simplified description of the SXML
specification. Section 4 considers some important
SXML features.

An XML document [5]
is essentially a tree structure. The start and the end tags of the root element
enclose the whole content of the document, which may include other elements or
arbitrary character data. Text with familiar angular brackets is an external representation
of an XML document. Applications ought to deal with an internalized form: an XML
Information Set, or its specializations (such as the DOM). This internalized form
lets an application locate specific data or transform an XML tree into another
tree.

The W3 Consortium defines the XML Information Set [6]
(Infoset) as an abstract data set that describes information available in a well-formed
XML document. An XML document's information set consists of a number of information
items, which denote elements, attributes, character data, processing instructions,
and other components of the document. Each information item has a number of associated
properties, e.g., name, namespace, URI. Some properties ---
for example, "children" and "attributes" --- are collections of other information
items. Although technically Infoset is specified for XML it largely applies
to other semi-structured data formats, in particular, HTML.

XML document parsing is just one of possible ways to create an instance of XML
Infoset.

It's worth a note that XML Information Set recommendation does not attempt to
be exhaustive, nor does it constitute a minimum set of information items
and properties. Its purpose is to provide a consistent set of definitions
for use in other specifications that need to refer to the information in a well-formed
XML document.

The abstract data model defined in the XML Information Set Recommendation is applicable
to every XML-related specification of the W3 Consortium. Namely, the Document
Object Model [7] can be considered the application
program interface (API) for dealing with information items; the XPath data model [8]
uses the concept of nodes which can be derived from information items, etc. The
DOM and the XPath data model are thus two instances of XML Information Set.

XML Information Set Recommendation itself imposes no restrictions on data structures
or interfaces for accessing information items. Different interpretations are thus
possible for the XML Information Set abstract data model. For example, it is convenient
to consider an XML Information Set a tree structure, and the terms "information
set" and "information item" are then similar in meaning to the generic terms "tree"
and "node" respectively.

An information item may be also considered as a container for its properties,
either text strings (e.g. name, namespace URI) or containers themselves (e.g.
child elements for an XML element). The information set is thus a hierarchy
of nested containers. As noted by Oleg Kiselyov [9],
such a hierarchy of containers comprised of text strings and other containers
greatly lends itself to be described by an S-expression, because the latter is
recursively defined as a list whose members are either atomic values or S-expressions
themselves. S-expressions [10] are easy
to parse into an internal representation suitable for traversal; they also have
a simple external notation, which is relatively easy to compose even by hand.

SXML is a concrete instance of the XML Infoset in the form of S-expressions. Infoset's
goal is to present in some form all relevant pieces of data and their abstract,
container-slot relationships with each other. SXML gives the nest of containers
a concrete realization as S-expressions, and provides means of accessing items
and their properties. SXML is a "relative" of XPath and the DOM, whose data models
are two other instances of the XML Infoset. SXML is particularly suitable for
Scheme-based XML/HTML authoring, XPath queries, and tree transformations.

XML and SXML can thus be considered two syntactically different representations
for the XML Information Set.

SXML is the concrete instance of the XML Infoset in the
form of S-expressions. Further discussion on SXML in this section is based on
the SXML specification [9].

A simplified SXML grammar in EBNF notation is presented on figure 1.
An SXML <name> is a single Scheme symbol.

[1]

<TOP>

::=

( *TOP* <PI>*
<Element>)

[2]

<Element>

::=

( <name> <attributes-list>?
<child-of-element>* )

[3]

<attributes-list>

::=

( @ <attribute>*
)

[4]

<attribute>

::=

( <name> "value"? )

[5]

<child-of-element>

::=

<Element> | "character data" |
<PI>

[6]

<PI>

::=

( *PI* pi-target "processing
instruction content string" )

Table 1: A simplified SXML grammar

Since Infoset's information item is a sum of its properties, a list is
a particularly suitable data structure to represent an item. The head of the list,
a Scheme identifier, names the item. For many items this is their (expanded) item
name. For an information item that denotes an XML element, the corresponding list
starts with element's name, optionally followed by a collection of attributes.
The rest of the element item list is an ordered sequence of element's children
--- character data, processing instructions, and other elements in turn. Every
child is unique; items never share their children even if the latter have the
identical content.

Figure 2 illustrates the XML element and its SXML
form (which satisfies the <Element> production from figure 1).

The value of an attribute is normally a string; it may be omitted (in case of
HTML) for a boolean attribute, e.g., a "certified" attribute on figure 2.

A collection of attributes is considered an information item in its own right,
tagged with a special name @ . The character @ may not occur
in a valid XML name; therefore an <attributes-list> cannot be mistaken
for a list that represents an element. An XML document renders attributes, processing
instructions and other meta-data differently from the element markup. In contrast,
SXML represents element content and meta-data uniformly --- as tagged lists. RELAX NG [11]
-- a schema language for XML -- also aims to treat attributes as uniformly as
possible with elements. This uniform treatment, argues James Clark [12],
is a significant factor in simplifying the language. SXML takes advantage of the
fact that every XML name is also a valid Scheme identifier, but not every Scheme
identifier is a valid XML name. This observation lets us introduce administrative
names such as @, *PI*, *TOP* without worrying about
potential name clashes. The observation also makes the relationship between XML
and SXML well-defined. An XML document converted to SXML can be reconstructed
into an equivalent (in terms of the Infoset) XML document. Moreover, due to the
implementation freedom given by the Infoset specification, SXML itself is an instance
of the Infoset.

The XML Recommendation specifies that processing instructions (PI) are
distinct from elements and character data; processing instructions must be passed
through to applications. In SXML, PIs are therefore represented by nodes of a
dedicated type *PI*. XPath and the DOM Level 2 treat processing instructions
in a similar way.

A sample XML document and its SXML representation are both shown on figure 3,
thus providing an illustrative comparison between nested XML tags and nested SXML
lists. Note that the SXML document is a bit more compact than its XML counterpart.

SXML can also be considered an abstract syntax tree for a parsed XML document.
An XML document or a well-formed part of it can automatically be converted into
the corresponding SXML form via a functional Scheme XML parsing framework SSAX [13].

It's worth a note that SXML represents all the information contained in XML documents,
including comments, namespaces and external entities. These are omitted in this
article for the sake of simplicity, but they are considered in the SXML specification [9].

Since an SXML document is essentially a tree structure, it can be described in
a more uniform way by introducing the term of an SXML node for nodes
of this tree.

An SXML node can be defined on the basis of SXML grammar (figure 1)
as a single production --- [N] on figure 4. Alternatively,
an SXML node can be defined as a set two mutually-recursive datatypes --- [N1],
[N2] and [N3] on figure 4. In the latter case, a Node
is constructed by adding a name to the Nodelist as its leftmost member;
a Nodelist is itself a (sorted) list of Nodes.

The uniformity of the SXML representation for elements, attributes, and processing
instructions simplifies queries and transformations. For the SXML data model,
attributes and processing instructions look like regular elements with a distinguished
name. Therefore, query and transformation functions dedicated to attributes become
redundant, because ordinary functions with distinquished names can be used.

The uniform representation for SXML elements and attributes is especially convenient
for practical tasks [14]. Differences between
elements and attributes in XML are blurred. Choosing either an element or an attribute
for representing concrete practical information is often a question of style [15],
and such a choice can later be changed. Such a change in a data structure is expressed
in SXML as simply an addition/removal of one hierarchy level, namely an attribute-list.
This requires the minimal modification of an SXML application. For the SXML notation,
the only difference between an attribute and an element is that the former is
contained within the attribute-list (which is a special SXML node) and
cannot have nested elements.

For example, if data restructuring requires that the weight of the delivered load,
initially represented as a nested element, to be represented as an attribute,
the SXML element

(delivery
...
(weight "789")))
will change to:

(delivery
(@ (weight "789"))
...)
Such a notation for elements and attributes simplifies SXML data restructuring
and allows uniform queries to be used for data processing.

The syntax of LISP family programming languages, in particular, Scheme, is based
on S-expressions used for both data and code representation. This makes it possible
and convenient for Scheme programs to be treated as a semi-structured data [16]
and vice versa.

Since an SXML document and its nodes are S-expressions, they can be used for representing
a Scheme program. For making this possible, it is required (and sufficient) that
the first member of every list contained in the SXML tree is a function. The rest
members of the list are than the arguments, which are passed to that function.
In accordance with SXML grammar, attribute and element names and special names
must be bound to functions.

An SXML document or an SXML node that fulfils these requirements may be considered
a Scheme program which can be evaluated, for example, by means of eval
function.

(para "plain"
(bold "highlighted")
"plain")
can be treated as a program, and the result of its evaluation is the SXML element:

(p "plain"
(b "highlighted")
"plain")
Note that the result of evaluating such a program is not necessary an SXML element.
Namely, a program may return a textual representation for the source data in XML
or HTML [17]; or even have a side
effect, such as saving the SXML data in a relational database.

Traditional XML applications involve XML-language and general-purpose programming
languages, which leads to an impedance mismatch problem. Impedance mismatch can
be successfully overcome by using the Scheme functional programming language which
is widely approved as a scripting language. Scheme operates S-expressions which
are particularly suitable for representing hierarchical data structures.

Information available in a well-formed XML document is described by the XML Information
Set. XML, the DOM, XPath data model, etc. are different instances of the XML Infoset.
This paper suggested the other concrete instance of the XML Infoset --- SXML ---
that has the form of an S-expression and can thus be easily and naturally processed
via Scheme.

The technology considered may be employed for implementation of XML-based digital
libraries, and is especially suitable for implementation of light-weight digital
libraries [18].

About Authors

Dmitry Lizorkin - a Ph.D. student in the Moscow State University. His
M.Sc. thesis, defended in 2002, was dedicated to implementation of XML Linking
Language (XLink) using functional methods.
e-mail: lizorkin@hotbox.ru

Kirill Lisovskiy - PhD, IT Consultant and Senior Researcher Institute
for System Programming Russian Academy of Science. His primary area of research
interests is functional and logic techniques for semistructured data management.
Since 1999 he had participated in a number of research and development projects
related to implementation and application of XML data management techniques
based on the Scheme programming language.
e-mail: lisovsky@acm.org http://pair.com/lisovsky/