On XML Languages…

Norman Walsh

Some XML languages have an XML syntax, some have a non-XML
syntax, and some have both. This paper explores the intersection of these
languages and syntaxes. What are the advantages of an XML syntax? What are
the advantages of a non-XML syntax? After discussing the general issues, the
paper presents two alternative non-XML syntaxes for XProc as a case study
to further explore the issues.

Norman Walsh is a Lead Engineer at MarkLogic Corporation where he
works with the Application Services team. Norm is also an active
participant in a number of standards efforts worldwide: he is chair of
the XML Processing Model Working Group at the W3C where he is also
co-chair of the XML Core Working Group. At OASIS, he is chair of the
DocBook Technical Committee.

With more than a decade of industry experience, Norm is well known for
his work on DocBook and a wide range of open source projects. He is the
author of DocBook: The Definitive Guide.


Balisage: The Markup Conference 2012, August 7 - 10, 2012

The Desperate Perl Hacker featured often in the early days of
XML. Designing a markup format that could be processed easily by
ordinary programmers using their chosen languages was an explicit
goal of XML: “4. It
shall be easy to write programs which process XML documents.”

This goal was achieved, at least for XML itself, if not all of the
subsequent specifications in the broader ecosystem, and as a consequence
there are no significant, mainstream languages which are incapable of processing
XML. There are probably none for which there isn't a choice of XML
parsers. Any language built on top of the Java VM includes such a choice.
Modern languages like Scala include features for the specific purpose of
writing domain-specific language parsers. These allow XML,
or subsets of XML, to be incorporated directly into the language itself.

It is straightforward to parse XML with more-or-less any programming
language you care to use. The way, and the extent to which, XML coexists with
those languages is largely a question of their design and the full range
of language design is outside the scope of this paper.
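
As a small illustration of how little code is involved, here is a minimal sketch in Python using the standard library's ElementTree parser. The sample document, its element names, and the unit_prices helper are hypothetical, chosen only for the example; any other mainstream language would serve equally well.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element and attribute names are
# illustrative only and do not come from any particular schema.
SAMPLE = """<catalog>
  <item sku="a1"><unitprice>4.50</unitprice></item>
  <item sku="b2"><unitprice>12.00</unitprice></item>
</catalog>"""


def unit_prices(xml_text):
    """Return a mapping from each item's sku attribute to its unit price."""
    root = ET.fromstring(xml_text)
    return {
        item.get("sku"): float(item.findtext("unitprice"))
        for item in root.iter("item")
    }


prices = unit_prices(SAMPLE)
```

The parser does all of the syntactic work; the program itself only navigates the resulting tree.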

Within the XML community, many XML
languages have been designed specifically for the
purpose of processing XML. These include all of the usual suspects:
validation languages, transformation languages, query languages, etc.
These are languages designed by XML users for XML users to process XML.
These are the languages that are the focus of this paper.

We are concerned mostly with the syntax of these
languages, not their semantics. Of course, syntax and semantics are not
wholly separable. A language whose semantics are nothing more than the
expression of a single boolean value needs at most two tokens and so
can be vastly simpler syntactically than a language with Turing complete
semantics. Nevertheless, we'll focus mostly on the syntax for syntax's
sake.

The first, perhaps most obvious, question to ask about the syntax
of an XML language is: to what extent is it XML itself? A brief survey
of XML languages reveals that there is considerable variety on this point.

On one end of the spectrum,
RELAX NG Compact Syntax has nothing that resembles XML to the untrained
eye. See Figure 1.

Let's look a little more closely at the distinction between
XQueryX and XSLT. On the one hand,
XQueryX provides improved machine readability: there are no
semantic elements not manifest in the XML. On the other hand, it gains this benefit
by sacrificing human readability. These are two possible axes on which we can
analyze a language syntax; we'll revisit them later.

In the meantime, we distinguish a “practical” XML
syntax as one that is concise enough for human comprehension
(even if it relies on some non-XML syntax to aid readability).

There may be room for debate about some cells in that table.
Evan Lenz's work on [carrot], for example, is
moving in the direction of a more compact, non-XML syntax for XSLT.
One could argue that TeX is a non-XML syntax for MathML. We might debate
whether attribute-based languages like XLink are XML.
And, in addition,
there may be other syntaxes for these languages of which the author is
unaware. However, at a coarse level of granularity, what we can see is
that there are languages all across the spectrum.

Syntactically: XML or not?

Seeing languages spread across a spectrum like this invites the
question: why? What motivates a language designer to choose an XML
syntax, or not? When both are provided, what motivates a user to
choose an XML syntax, or not?

The case for XML syntaxes

Why choose XML?

“Eat your own dogfood”/“Fly your own airplanes.” One school of thought
says that XML languages should be expressed in XML simply because they are XML
languages. Some XML developers find XML to be a clear and precise format for
the expression of ideas.

Extensibility. The XML syntax has natural extension points,
attributes on start tags, for example, and namespaces. At a
syntactic level, extending an XML language is an
easily solved problem. Conversely, non-XML languages sometimes suffer from
a dearth of extension points. Keeping a grammar for a complex language like
XQuery free from ambiguity while simultaneously adding language features
can be a real challenge.

Whether the accretion of language features through this form of ad-hoc
extension, in either the XML or non-XML cases, produces a coherent and
regular language over time, is a separate question.

Accessibility to XML tools. The fact that an XSLT stylesheet can be used
to produce an XSLT stylesheet is not a feature that every
XSLT user needs, but there are circumstances when it is a great boon.

Documentation. The ability to inline documentation in an XML language
is considered a great benefit in some environments. Expressing XML documentation
in a non-XML language can have a deleterious effect on readability. Compare,
for example, the non-XML representation of the unitprice pattern,
Figure 5, with the equivalent XML representation,
Figure 6.

Syntactic conformance. Operating on XML with a language that has
an XML syntax provides certain minimum assurances about the outputs. An XSLT
stylesheet, which must itself be well formed, guarantees[2] that
the resulting document will be well formed, by virtue of the nature of XSLT.

Learnability? There's certainly anecdotal evidence that
non-programmers can be taught to be productive with XSLT in ways that
don't have parallels in non-XML languages. This may be because the
structure of the XSLT stylesheet has a strong surface resemblance to
the documents that are to be transformed. This is true both at the
level of the surface syntax (they're both XML) and at a deeper level
in that templates contain fragments of the documents in a very obvious
and direct way.

Declarativeness? There's a tendency for XML languages to have a more
declarative nature than their non-XML counterparts. This can be seen particularly
in the case of XSLT as compared to XQuery. The XSLT stylesheet in
Figure 3 was written in a very “pull” fashion in order to
have as much surface similarity to the XQuery example, Figure 4,
as possible[3].

A more idiomatically natural XSLT solution for the problem is shown in
Figure 7.

In the idiomatic, or “push”, style, separate templates are declared
for each component. This greatly increases the flexibility and reusability of
XSLT.

Familiarity. For users whose principal tasks involve editing, validating,
transforming, or otherwise working with XML, a language that is itself expressed
in XML has a certain familiarity. Languages like XSLT or RELAX NG can be edited
in the same comfortable, understood environment used for other XML editing tasks.

The case for non-XML syntaxes

Why choose a non-XML syntax?

Conciseness. One of the principal attractions of a non-XML syntax is that
it's more compact, more concise. A concise syntax allows more information to
fit on a screen or page and consequently provides the reader with a greater
perspective on the language.

The compact schema in Figure 1 fits easily on a
single page or screen and is completely straightforward to understand,
assuming you're familiar with RELAX NG and its compact syntax.

The same schema expressed in the XML syntax, Figure 8, is twice as long as its
compact counterpart. It's not manifestly more difficult to understand,
assuming you're familiar with RELAX NG and its XML syntax, but it
doesn't fit on a single page and contains a lot of syntactic “clutter” that
one must learn to “look through”.

Familiarity. For tasks, such as programming, that are most
typically performed with non-XML languages, using a non-XML syntax for
an XML language makes it more familiar and approachable for users that come
from other backgrounds.

XQuery is arguably far more familiar, and consequently less threatening,
more approachable, and easier to learn, for a programmer with a background
in SQL or any of a host of common scripting languages.

Accessibility to non-XML tools.
Both familiarity and conciseness play into another strength for non-XML languages:
support in tools and environments that programmers are used to. An XQuery or
RELAX NG Compact Syntax plugin for the programmer's favorite IDE makes editing those
files part of a comfortable, understood environment. Using an XML syntax may require
a new editing tool.

Syntactic expressiveness. An XML syntax imposes constraints on what characters
may appear unescaped. Some of the characters that must be escaped are common in
other contexts. For example, it's easy to argue that “$a <= 5”
is easier to read and understand than “$a &lt;= 5”.
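
The escaping burden can be seen mechanically. The following is a minimal sketch in Python, not something taken from any of the languages discussed here; it uses the standard library's xml.sax.saxutils helpers to show what the expression must become before it can appear in XML character data, and that the original text is recoverable.

```python
from xml.sax.saxutils import escape, unescape

# The expression as a programmer would naturally write it.
expression = "$a <= 5"

# escape() replaces &, < and > with the corresponding entity references,
# which is what an XML syntax requires for text such as this.
escaped = escape(expression)

# unescape() reverses the substitution and recovers the original text.
restored = unescape(escaped)
```

Here escaped holds the string “$a &amp;lt;= 5” as it would be typed in an XML document, that is, $a &lt;= 5, while restored is identical to the original expression. Nothing is lost, but the author and every subsequent reader pay the cost of the escaped form.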

Syntactically: Both?

Why choose if you can have both? RELAX NG is widely praised for having both
an XML syntax and a compact syntax. Why not always take that approach?

One critical metric by which the success or failure of a
dual-syntax approach will be judged is semantic compatibility.
Arguably, the RELAX NG Compact Syntax has been successful not simply
because it has the advantages of a non-XML syntax, but also because it
describes exactly the same language as the XML
syntax. There are no constructs that can be represented in the compact
syntax that cannot be represented in the XML syntax, and vice-versa.
It is possible to translate every valid schema
losslessly from one format to the other and back again.

In practice, this is a remarkably high bar.
RELAX NG is a purely declarative language with no semantics for
iteration or transformation. As such, it is burdened with far fewer
semantics to express than a programming language like XSLT or XQuery.
It is difficult to imagine finding a useful alternative syntax for either
of those languages that expressed precisely the same
underlying semantics.

Yet, the absolute syntactic isomorphism of the two syntaxes is
considered in this paper to be an absolute requirement. Devising alternate
syntaxes for subsets of a language is both much easier and much less
useful. Every instance of the language that uses a construct not available
in the alternate syntax is unavailable to the users who prefer the alternative,
and to tools that are designed to work best with it.

It's also worth noting that even in the RELAX NG case, there are
unusual artifacts in the non-XML syntax: square bracketed notations
placed in front of the constructs that they modify and a somewhat
torturous representation of XML markup in such annotations. Luckily,
and by design, these annotations are uncommon; the simplest of them
are the most common and the most complicated are quite
rare. Also, because of the syntactic isomorphism, it is possible to
switch back-and-forth between the syntaxes, editing XML annotations in
the XML syntax, and content models in the compact syntax, for
example.

Case studies: compact syntaxes for XProc

To explore these ideas further, for the balance of this paper,
we will consider two alternative, compact syntaxes for
XProc: An XML Pipeline Language.

XProc, for those unfamiliar with it, is a language “for
describing operations to be performed on XML documents.” A pipeline
accepts XML documents as input, performs an arbitrary series of
operations on them, and produces XML documents as output. In the
context of an XProc pipeline, an “operation” is one of a set of
discrete steps. These steps perform tasks such as adding an attribute,
counting nodes, deleting nodes, inserting nodes, performing XInclude,
XSLT, or XQuery, and various forms of validation. XProc has about 40 such
operations built in and may be extended with additional operations.

This pipeline takes a single input document, performs XInclude processing,
styles it using the “dbslides.xsl” stylesheet, and then
produces as its output the result of that transformation. If the XProc processor
serializes the result, it does so as indented XHTML.

Case study 1: A compact syntax for XProc

How might the pipeline in Figure 9 be
represented in a compact, non-XML syntax? Where might we look for
inspiration?

Python? With significant whitespace?

Pascal? With BEGIN/END and :=?

Scheme? Because everything looks better with parentheses?

Something from the C/Java/JavaScript family?

For our first attempt, we'll take the last option. Translating
Figure 9 into a compact syntax along these lines produces
Figure 10.

This is in many ways a very direct translation. Like RELAX NG's
compact syntax and XQuery, we use curly braces to delimit the bodies
of our semantic constructs. Each new construct is introduced by a new
token. There are two syntactic extension points in the XML syntax that
we must accommodate: the presence of arbitrary extension attributes on what are
elements in the XML syntax, and the presence of arbitrary XML
fragments.

The “with” keyword is used at the end of each
construct in the compact syntax to introduce an unbounded list of
name/value pairs. These map back to extension attributes in the XML
syntax.

Where additional namespaces are required, as in the pipeline library
in Figure 11, they're introduced in the compact syntax and
CNames are allowed as tokens. The equivalent library in
this compact syntax is shown in Figure 12.

This example shows the use of an extension attribute, cx:type,
represented in the compact syntax.

The other challenge is representing arbitrary XML. In RELAX NG,
arbitrary XML fragments are always annotations of one sort or another;
they're both relatively uncommon and, to some extent, unimportant to
the core grammar. Not so in XProc where they appear both in annotations,
like p:documentation, Figure 13,
but also as inline document
content in the pipeline. Using a syntax as awkward as the approach in
RNC seems like a bad choice.

However, in the context of parsing a non-XML syntax, it must be
possible to recognize both where the XML begins and where it
ends. The presence of, for example, a fragment of
XProc compact syntax in a program listing in some XML must not be
accidentally parsed as XProc. One approach would be to build a
complete XML parser into the grammar of the compact syntax. But even
this is tricky because a p:inline might include
several consecutive sibling elements that each have to be recognized.

If only there were some string of tokens that can't appear in
XML…

In fact, such a sequence exists. Almost. The sequence “]]>”
is forbidden in XML except when it ends a CDATA section.
We can leverage this fact in our compact syntax to form delimiters for
arbitrary XML: “<![xml[” and “]]>”.
See Figure 14.
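
To make the idea concrete, here is a minimal sketch in Python of how a pre-processor might separate delimited XML from the surrounding non-XML text. It is an illustration only, not the implementation described in this paper: the split_xml_blocks function, the regular expression, and the sample input are all assumptions made for the example.

```python
import re

# Matches "<![xml[", then as little text as possible, then "]]>".
# DOTALL lets the enclosed XML span several lines.
XML_BLOCK = re.compile(r"<!\[xml\[(.*?)\]\]>", re.DOTALL)


def split_xml_blocks(source):
    """Split source into ("text", s) and ("xml", s) chunks, in order."""
    chunks = []
    pos = 0
    for match in XML_BLOCK.finditer(source):
        if match.start() > pos:
            chunks.append(("text", source[pos:match.start()]))
        chunks.append(("xml", match.group(1)))
        pos = match.end()
    if pos < len(source):
        chunks.append(("text", source[pos:]))
    return chunks


# Hypothetical input: neutral text around one delimited XML block that
# contains two sibling elements.
chunks = split_xml_blocks("before <![xml[<a/><b>text</b>]]> after")
```

Because the scanner looks only for the closing delimiter, it needs no knowledge of XML and copes with several consecutive sibling elements. The “almost” matters, though: in this sketch an XML fragment that itself contains a CDATA section would end the block early at the CDATA section's own “]]>”.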

It's arguably a hack, but it allows us to satisfy the
requirement that each syntax represent exactly the same underlying
constructs.

This syntax has been implemented. The implementation strategy is
to transform the compact syntax into the XML syntax as a
pre-processing step and then process the resulting XML as usual.

How does this syntax stand up to the suggested benefits of
non-XML syntaxes?

Conciseness? A wash. It's not clearly shorter in terms of absolute number of lines.

Familiarity? Not clear. It has the advantage of
less visual clutter, but doesn't draw from the C/Java/JavaScript family in
any significant regard beyond curly braces.

Accessibility to non-XML tools? Probably an improvement. It's likely that a modern
IDE could be customized with the EBNF (see Appendix A).

Syntactic expressiveness? An improvement; outside of XML blocks, there are
no characters that need to be explicitly escaped.

Case study 2: An alternate compact syntax for XProc

When I presented the first compact syntax in a lightning talk last year,
Jeni Tennison
observed that it could be made more compact, and perhaps more useful
if it were more idiomatically like other programming languages. She
subsequently produced most of the “second compact syntax” language
design.

This is clearly a non-XML syntax, but it retains all of the semantic
flavor of the original. In the second XProc compact syntax, a choose statement
is represented using an if/then/else construct that's likely to be more familiar
to programmers; see Figure 18.