XPath

XPath is the W3C standard language for traversal and navigation in XML trees.

For navigation, we use the location path to identify nodes or content. A location path is a sequence of location steps separated by a /:

1

child::chapter/descendant::section/child::para

Every location step has an axis, :: and then a node test. Starting from a context node, a location returns a node-set. Every selected node in the node-set is used as the context node for the next step.

You can start an XPath expression with / start from the root, which is known as an absolute path.

XPath defines 13 axes allowing navigation, including self, parent, child, following-sibling, ancestor-or-self, etc. There is a special attribute axis to select attributes of the context node, which are not really in the child hierarchy. Similarly, namespace selects namespace nodes.

A nodetest filters nodes:

Test

Semantics

node()

let any node pass

text()

select only text nodes

comment()

preserve only comment nodes

name

preserves only elements/attributes with that name

*

* preserves every element/attribute

At each navigation step, nodes can be filtered using qualifiers.

1

axis::nodetest[qualifier][qualifier]

For instance:

1

following-sibling::para[position()=last()]

A qualifier filters a node-set depending on the axis. Each node in a node-set is kept only if the evaluation of the qualifier returns true.

Qualifiers may include comparisons (=, <, <=, …). The comparison is done on the string-value(), which is the concatenation of all descendant text nodes in document order. But there’s a catch here! Comparison between node-sets is under existential semantics: there only needs to be one pair of nodes for which the comparison is true. Thus, when negating, we can get universal quantification.

XPaths can be a union of location paths separated by |. Qualifiers can include boolean expressions (or, not, and, …).

XPath also has support for for variables, denoted $x (which are more like constants):

👉 This gives us the email addresses of reporters who may deal with events in the canton of Fribourg. See exercises 01 for more context.

There are a few basic functions: last(), position(), count(node-set), concat(string, string, ...string), contains(str1, str2), etc. These can be used within a qualifier.

XPath also supports abbreviated syntax. For instance, child:: is the default axis and can be omitted, @ is a shorthand for attribute::, [4] is a shorthand for [position()=4] (note that positions start at 1).

XPath is used in XSLT, XQuery, XPointer, XLink, XML Schema, XForms, …

Evaluation

To evaluate an XPath expression, we have in our state:

The context node

Context size: number of nodes in the node-set

Context position: index of the context node in the node-set

A set of variable bindings

XML Schemas

There are three classes of languages that constraint XML content:

Constraints expressed by a description of each element, and potentially related attributes (DTD, XML Schema)

To declare an element, we do as follows; by default, the author element as defined below may only contain string values:

1

<xsd:elementname="author"/>

But we can define other types of elements, that aren’t just strings. Types include string,
boolean, number, float, duration, time, date, AnyURI, … The types are still string-encoded and must be extracted by the XML application, but this helps verify the consistency.

1

<xsd:elementname="year"type="xsd:date"/>

We can bound the number of occurrences of an element. Below, the character element may be repeated 0 to ∞ times (this is equivalent to something like character* in a regex). Absence of minOccurs and maxOccurs implies exactly once (like in a regex).

This defines a Prolog type containing a sequence of a series, author, and Characters, which is character+.

Using the mixed="true" attribute on an xsd:complexType allows for mixed content: attributes, elements, and text can be mixed (like we know in HTML, where you can do <p>hello <em>world</em>!</p>).

There are more type constructor primitives that allow to do much of what regexes do: xsd:sequence, which we’ve seen above, but also xsd:choice (for enumerated elements) and xsd:all (for unordered elements).

Criticism

It requires a high level of expertise to avoid surprises, as there are many complex and unintuitive behaviors

The choice between element and attribute is largely a matter of the taste of the designer, but XML Schema provides separate functionality for them, distinguishing them strongly

There is only weak support for unordered content. In SGML, there was support for the & operator. A & B means that we must have A followed by B or vice-versa (order doesn’t matter). But we could enforce A & B* such that there would have to be a sequence of B which would have to be grouped. XML Schema is too limited to enforce such things.

In addition to datatypes, we can also express admissible XML content using regexes, but (and this is important!) we cannot exprain cardinality constraints or uniqueness constraints.

If we need to express those, we can make use of Schematron.

Schematron

Schematron is an assertion language making use of XPath for node selection and for encoding predicates. It is often used in conjunction with Relax NG to express more complicated constraints, that aren’t easily expressed (or can’t be expressed at all) in Relax NG. The common pattern is to build the structure of the schema in Relax NG, and the business logic in Schematron.

They can be combined in the same file by declaring different namespaces. For instance, the example below allows us to write a Relax NG schema as usual, and some Schematron rules rules under the sch namespace.

As we can see in the example below, a Schematron schema is built from a series of assertions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18

<schemaxmlns="http://purl.oclc.org/dsdl/schematron"><title>A Schema for Books</title><nsprefix="bk"uri="http://www.example.com/books"/><patternid="authorTests"><rulecontext="bk:book"><asserttest="count(bk:author)!= 0">
A book must have at least one author
</assert></rule></pattern><patternid="onLoanTests"><rulecontext="bk:book"><reporttest="@on-loan and not(@return-date)">
Every book that is on loan must have a return date
</report></rule></pattern></schema>

A short description of the different Schematron elements follows:

<ns>: specifies to which namespace a prefix is bound. In the above example, the bk prefix, used as bk:book, is bound to http://www.example.com/books. This prefix is used by XPath in the elements below.

<pattern>: a pattern contains a list of rules, and is used to group similar assertions. This isn’t just for better code organization, but also allows to execute groups at different stages in the validation

<rule>: a rule contains <assert> and <report> elements. It has a context attribute, which is an XPath specifying the element on which we’re operating; all nodes matching the XPath expression are tested for all the assertions and reports of a rule

<assert>: provides a mechanism to check if an assertion is true. If it isn’t, a validation error occurs

<report>: same as an assertion, but the validation doesn’t fail; instead, a warning is issued.

XML Information Set

The purpose of XML Information Set, or Infoset, is to “purpose is to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document1”.

It specifies a standardized, abstract model to represent the properties of XML trees. The goal is to provide a standardized viewpoint for the implementation and description of various XML technologies.

It functions like an AST for XML documents. It’s abstract in the sense that it abstract away from the concrete encoding of data, and just retains the meaning. For instance, it doesn’t distinguish between the two forms of the empty element; the following are considered equivalent (pairwise):

The Information Set is described as a tree of information items, which are simply blocks of information about a node in the tree; every information item is an abstract representation of a component in an XML document.

As such, at the root we have a document information item, which, most importantly, contains a list of children, which is a list of information items, in document order. Information items for elements contain a local name, the name of the namespace, a list of attribute information items, which contain the key and value of the attribute, etc.

XSLT

Motivation

XSLT is part of a more general language, XSL. The hierarchy is as follows:

XSL: eXtensible Stylesheet Language

XSLT: XSL Transformation

XLS-FO: XSL Formatting Objects

An XSLT Stylesheet allows us to transform XML input into other formats. An XSLT Processor takes an XML input, and an XSLT stylesheet and produces a result, either in XML, XHTML, LaTeX, …

XSLT is a declarative and functional language, which uses XML and XPath. It’s a W3C recommendation, often used for generating HTML views of XML content.

The XSLT Stylesheet consists of a set of templates. Each of them matches specific elements in the XML input, and participates to the generation of data in the resulting output.

e is an XPath expression that selects the nodes the XSLT processor will apply the template to

result specifies the content to be produces in the output for each node selected by e

xsl:apply-templates indicates that templates are to be applied on the selected nodes, in document order; to select nodes, it may have a select attribute, which is an XPath expression defaulting to child::node().

Default templates

XSLT Stylesheets contain default templates:

1
2
3

<xsl:templatematch="/ | *"><xsl:apply-templates/></xsl:template>

This recursively drives the matching process, starting from the root node. If templates are associated to the root node, then this default template is overridden; if the overridden version doesn’t contain any <xml: > elements, then the matching process is stopped.

In the above, the xsl:sort element has the following possible attributes:

select: here, the attribute is ., which refers to the title in this context

data-type: gives the kind of order (e.g. text or number)

order: ascending or descending

XQuery

XQuery is a strongly typed and functional language that offers features to operate on XML input for searching, selecting, filtering, transforming, restructuring information, etc. It is an SQL-like language for XML. It wasn’t defined with the same goals as XSLT, but has some overlap that we’ll discuss later.

It does not use the XML syntax. Instead, it offers a general purpose (Turing-complete) language that can be used for developing XML based applications.

XQuery is a W3C Recommendation, and is therefore closely linked to XML Schema, as it uses the XML Schema type system. Note that there is for no support for XQuery with Relax NG or other non-W3C schema languages. A nice book on XQuery is available at O’Reily.

A query takes some kind of XML content: an XML file, an XML fragment retrieved online, a native XML database, etc. The output is a sequence of values, which are often XML elements (this is important: not a document, but elements). But it could also be an XML Schema type, such as a string, a list of integers, etc.

The output can be serialized to a document, or just kept in-memory in the application for further processing.

Queries are evaluated by an XQuery processor, which works in two phases. First, the analysis phase may raise errors (that do not depend on the input, only on the query). Then, there is an evaluation phase, which may raise dynamic errors (e.g. missing input).

A query consists of one or more comma-separated XQuery expressions, which are composed of the following:

Creating XML content

To build XML content, we can embed “escaped” XQuery code using curly brackets, within our template file, as follows:

1
2
3

<reportyear="2018">
The value is {round (3.14)}
</report>

Sequences

A sequence is an ordered collection of items, which may be of any type (atomic value, node, etc). Duplicates are allowed. A sequence can contain zero (empty), one (singleton) or many items. Sequences are comma-separated. We can add parentheses for clarity, but not for nesting; a sequence is always flat (even if we nest parentheses):

As we can see above, there is some overlap between XQuery and XPath; the where condition can also be written as an XPath selection condition. Which to use is a question of style; there is no difference in performance.

The order by and where keywords work just like in SQL, so I won’t go into details here.

Conditional expressions

Like in any templating language, we can create conditional statements. It is mandatory to specify an else to every if, but if we do not want to return anything, we can return the empty sequence ().

The condition of an if must be a boolean or a sequence. Empty sequences are falsey, and sequences of one or more elements are truthy.

The types are sequence types, with both the number and types of items. For instance, xs:string? means a sequence of zero or one string. The return type is optional, but is strongly encouraged for readability, error checking and optimization.

Functions can be overloaded with a different number of parameters.

The body is enclosed in curly braces. It does not have to contain a return clause, it just needs to be an XQuery expression.

Modules

Functions can be grouped into modules, which declare the target namespace and bind it to a prefix (here, the strings prefix):

1

module namespace strings = "https://example.com/strings"

Anything declared under that prefix can be accessed from the outside, when importing the module.

Updating XML Content

Unlike SQL, standard XQuery only offers ways of querying data, and not of inserting, deleting or updating data. That’s why the W3C developed an extension to XQuery called the XQuery Update Facility.

Like SQL, the implementation of this Update Facility is often tied to specific database systems. In this course, we will use the eXist-db variant. Updates are executed by specifying the update keyword in the return clause.

Advanced features

As we mentionned earlier, XQuery is Turing complete. You can define your own functions, which may be grouped into modules, and may be higher-order functions.

Schema awareness is an optional feature; if it is supported, the validate expression may be used, which is useful for optimization and error checking. However, as we mentioned earlier, there is only support for W3C standardized schemas, not Relax NG.

While XQuery is mainly associated with XML, it is possible in newer versions to deal with text documents (like CSV, name/value config files, etc. since 3.0) and even JSON (since 3.1).

Coding guidelines

For robustness, it is important to handle missing values (empty sequences) and data variations.

XML Based Webapps

We’ve now learned to model (with schemas), transform (with XSLT), and query and process (with XQuery). How can we develop an XML based webapp combining these?

We will take a look at the Oppidum framework, which targets the development of XML-REST-XQuery (XRX) applications, using the eXist-db XML database.

XML Databases

An XML database looks quite a lot like a normal database; for instance, it uses a traditional, B-tree based indexing system, has a querying language, etc. The main difference is simply that data is XML instead of a table, and that we use XQuery instead of SQL.

REST

REST stands for REpresentational State Transfer. It’s an architectural style created by Roy Fieding in [his PhD thesis](https://www.ics.uci.edu/~fielding/

In REST, we have resources, located by a URL on Web-based REST, that may be processed by a client. A collection is simply a set of resources. Interaction with a REST API happens with classical CRUD (Create, Read, Update, Delete) on URLs, which in HTTP are the POST, GET, PUT and DELETE requests.

Oppidum

Oppidum is an open source framework to build XML Web-based applications with an MVC approach. The documentation is only in French, but the core idea is as follows: HTTP requests are handed to Oppidum by eXist. The application logic is then detailed in a pipeline consisting of:

Model: XQuery script (*.xql) returning relevant XML content

View: XSLT transformation (*.xsl)

Epilogue: XQuery script (epilogue.xql) for templating common content in HTML pages; this works using tags with the site namespace

To specify the REST architecture, Oppidum has a DSL that allows us to define the set of resources and actions, determine the URLs and associated HTTP verbs (GET, POST, etc) recognized by the application, and so on.

By convention, capitalized variables are type variables (non-terminals), and non-capitalized are terminals.

A tree grammar defines a set of legal trees. As any grammar, tree grammars are defined within an alphabet Σ, with a set of type variables X:={X1::=T1,…,XN::=Tn}. A tree grammar is defined by the pair (E,X), where E represents the starting type variable. Each Ti is a tree type expression:

To ensure that our tree grammar remains regular, we must introduce a syntactic restriction: every recursive use of a type variable X (unless it is within the content model) must be in the tail. For instance, the following grammars are not acceptable:

{X=a,X,b}{X=a,Y,b;Y=X}

But the following are fine:

{X=a,c[X],b}{X=a,Y;Y=b,X|ϵ}

A small reminder on regular vs. context-free grammars: regular grammars are decidable (we can check for inclusion with a DFA), while context-free grammars are undecidable (we cannot check for inclusion in anbn with a DFA, for instance).

Within the class of regular grammars, there are three subclasses of interest, in order of specificity (each of these is a subset of the classes above):

Context-free

Regular

Single Type

Local

Each subclass is defined by additional restrictions compared to its parent. The more restrictions we add, the more expressive power we lose. It turns out that these classes correspond to different XML technologies:

Context-free: ?

Regular: Relax NG

Single Type: XML Schema

Local: DTD

DTD & Local tree grammars

As we said previously, the expressive power of a grammar class is defined by which restriction have been imposed. In DTD, the restriction is that each element name is associated with a regex. This means that for each a[T1] and a[T1] occuring in X, the content models are identical: T1=T2.

In other words, in DTDs, the content of an XML tag cannot depend on the context of the tag. This removes some expressive power.

To construct a DTD validator, we just use a word automaton associated with each terminal. This automaton is a DFA, as DTD requires regular expressions to be deterministic. That is, the matched regexp must be able to be determined without lookahead to the next symbol. a(bc | bb) is not deterministic, but ab(c | b) is.

As a corollary, the union of two DTDs may not be a DTD. Indeed, the two DTDs could define different content models for the same terminal, which would be illegal. We say that the class is not closed composition (here, we showed that it isn’t closed under union).

XML Schema & Single-Type tree grammars

In XML Schema, it is possible to have different content models for elements of the same name when they are in different contexts (unlike for DTD). But still, for each a[T1] and a[T2] occuring under the same parent, the content models must be identical (T1=T2).

Still, this bring us more expressive power, so we have LDTD⊂Lxmlschema. This inclusion is strict, as we can construct grammars that are single-type (and not local) in XML Schema:

1
2
3
4
5
6

Dealer = dealer[Used, New]
Used = used[UsedCar]
New = new[NewCar]
UsedCar = car[Model, Year] // here, car can have different content models
NewCar = car[Model] // this is allowed as they have different parents
...

But XML schemas also have weaknesses: we cannot encode more advanced restrictions in it. For instance, with our car dealership example, we cannot encode something like “at least one car has a discount”, as it is not a single-type; we would require two different content models for a car within the same parent.

Consequently, this class is still not closed under union.

Relax NG & Regular tree grammars

Relax NG does not have any of the previously discussed restrictions. The content model does not have to depend on the label of the parent; it can also depend on the ancestor’s siblings, for instance. This allows us to have much more expressive power. Relax NG places itself in the class of regular tree grammars, and Lxmlschema⊂Lr.

Regular tree grammars are more robust (closed under set operations like union and intersection), give us high expressive power, while still remaining simply defined and well-characterized (inclusion can still be verified in linear time by DFA).

Tree automata

Definition

A tree automaton (plural automata) is a state machine dealing with tree structure instead of strings (like a word automaton would). Introducing these will allow us to provide a general framework for XML type languages by giving us a tool with which we can reason about regular tree languages.

A ranked tree can be thought of as the AST representation of a function call. For instance, f(a, b) can be represented as a tree with parent node f and two children a and b (in that order). We can also represent more complex trees with these notations (f(g(a, b, c), h(i)) gives us the full structure of a tree, for instance).

We define a ranked alphabet symbol as a formalization of a function call. It is a symbol a, associated with an integer representing the number of children, arity(a). We write a(k) for the symbol a with arity(a)=k.

This allows us to fix an arity to different tree symbols. Our alphabet could then be, for instance, {a(2),b(2),c(3),♯(0)}. In this alphabet, # would always be the leaves.

A ranked tree automaton A consists of:

F, a finite ranked alphabet of symbols

Q, a finite set of states

Δ, a finite set of transition rules

Qf⊆Q, a finite set of final states

In a word automaton, we write transitions as even1→odd. In a (bottom-up) tree automaton, the transitions are from the children’s state to the parents’ state. If a tree node has arity 2, a transition could be (q0,q1)a→q0. If the arity is k=0, we write ϵa(0)→q.

Example

As an example, we can think of a tree of boolean expressions. Let’s consider the following:

((0∧1)∨(1∨0))∧((0∨1)∧(1∧1))

We can construct this as a binary tree by treating the logical operators as infix notation of a function call:

∧(∨(∧(0,1),∨(1,0)),∧(∨(0,1),∧(1,1)))

In this case, our alphabet is F={∧,∨,0,1}. Our states are Q={q0,q1} (either true or false). The accepting state is Qf={q1}. Our transition rules are:

With these rules in place, we can evaluate binary expressions with a tree automaton.

Properties

The language of A is the set of trees accepted by A. For a tree automaton, the language is a regular tree language.

A tree automaton is deterministic as long as there aren’t too rules pointing us to different states:

(q1,…qk)a(k)→q,(q1,…qk)a(k)→q′q≠q′

With word automata, we know that we can build a DFA from any NFA. The same applies to tree automata: from a given non-deterministic (bottom-up) tree automaton, we can build a deterministic tree automaton.

As a corollary, this tells us that non-deterministic tree automata do not give us more expressive power; deterministic and non-deterministic automata recognize the same languages. However, non-deterministic automata tend to allow us to represent languages more compactly (conversion can turn a non-deterministic tree automaton of size N into a deterministic tree automaton of size O(2N), so we’ll use those freely.

Validation

Inclusion

Given a tree automaton A and a tree t, how do we check t∈Language(A)?

What we do is to just mechanically apply the transition rules. If the automaton is non-deterministic, we can keep track of the set of possible states, and see if the root of the tree contains a finishing state.

This mechanism of membership checking is linear in the size of the tree.

Closure

Tree automata are closed under set theoretic operations (we can just compute the union/intersection/product of the tuples defining the trees).

Emptiness

We can also do emptiness checking with tree automata (that is, checking if Language(A)=∅). To do so, we compute the set of reachable states, and see if any of them are in Qf. This process is linear in the size of the automaton.

Type inclusion

Given two automata A1 and A2, how can we check Language(A1)⊆Language(A2)?

Containment of a non-deterministic automata can be decided in exponential time. We do this by checking whether Language(A1∩¯A2)=∅. For this, we must make A2 deterministic (which is an exponential process).