An algorithm for RELAX NG validation

James Clark (jjc@thaiopensource.com)

2002-02-13

This document describes an algorithm for validating an XML document
against a RELAX NG schema. The algorithm is based on the idea of
a derivative (sometimes called a residual).
It is not the only possible algorithm for RELAX NG validation. This
document does not describe any algorithms for transforming a RELAX NG
schema into simplified form, nor for determining whether a RELAX NG
schema is correct.

We use Haskell to describe the
algorithm. Do not worry if you don't know Haskell; we use only a tiny
subset which should be easily understandable.

Basics

First, we define the datatypes we will be using. URIs and local
names are just strings.
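
The definitions themselves are not reproduced in this text. A minimal sketch of what they might look like follows. The constructor names Element, After, Choice, Interleave, Group, OneOrMore, Text, NotAllowed and the Context type come from the surrounding discussion; the remaining names (ParamList, Datatype, the NameClass constructors, the shape of Context) are assumptions made for illustration.

```haskell
type Uri = String
type LocalName = String
type Prefix = String

-- Assumed shape: a base URI plus the in-scope namespace bindings.
type Context = (Uri, [(Prefix, Uri)])

-- Assumed: a datatype is named by a library URI and a local name,
-- and its parameters are name/value pairs.
type Datatype = (Uri, LocalName)
type ParamList = [(LocalName, String)]

data NameClass
  = AnyName
  | AnyNameExcept NameClass
  | Name Uri LocalName
  | NsName Uri
  | NsNameExcept Uri NameClass
  | NameClassChoice NameClass NameClass

data Pattern
  = Empty
  | NotAllowed
  | Text
  | Choice Pattern Pattern
  | Interleave Pattern Pattern
  | Group Pattern Pattern
  | OneOrMore Pattern
  | List Pattern
  | Data Datatype ParamList
  | DataExcept Datatype ParamList Pattern
  | Value Datatype String Context
  | Attribute NameClass Pattern
  | Element NameClass Pattern
  | After Pattern Pattern
```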

Note that there is an Element pattern rather than a
Ref pattern. In the simplified XML representation of
patterns, every ref element refers to an
element pattern. In the internal representation of
patterns, we can replace each reference to a ref pattern
by a reference to the element pattern that the
ref pattern references, resulting in a cyclic data
structure. (Note that even though Haskell is purely functional it can
handle cyclic data structures because of its laziness.)
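
To illustrate the cyclic representation, here is a small sketch using a reduced pattern type in which the name class is simplified to a plain string. The names section and nameAtDepth are hypothetical and exist only for this example.

```haskell
-- Reduced pattern type with only the constructors this example needs;
-- the name class is simplified to a plain local name (an assumption).
data Pattern
  = Empty
  | Choice Pattern Pattern
  | Element String Pattern

-- A "section" element whose content is either empty or a nested "section".
-- The definition refers to itself, so the value is a cyclic structure that
-- laziness builds only as far as it is inspected.
section :: Pattern
section = Element "section" (Choice Empty section)

-- Follow the nested alternative k times and report the element name found.
nameAtDepth :: Int -> Pattern -> Maybe String
nameAtDepth 0 (Element n _) = Just n
nameAtDepth k (Element _ (Choice _ inner)) | k > 0 = nameAtDepth (k - 1) inner
nameAtDepth _ _ = Nothing
```

In the simplified XML form the nested occurrence would be a ref element; here it is simply the same Haskell value.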

In the instance, elements and attributes are labelled with QNames;
a QName is a URI/local name pair.

An XML document is represented as a ChildNode. There are
two kinds of child node:

a TextNode containing a string;

an ElementNode containing a name (of type
QName), a Context, a set of attributes
(represented as a list of AttributeNodes), and a list of children
(represented as a list of ChildNodes).
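
The instance representation described above might be sketched as follows. The content of an AttributeNode (a QName plus a string value) and the shape of Context are assumptions; the text does not spell them out.

```haskell
type Uri = String
type LocalName = String
type Prefix = String

-- Assumed shape: a base URI plus the in-scope namespace bindings.
type Context = (Uri, [(Prefix, Uri)])

-- A QName is a URI/local name pair.
data QName = QName Uri LocalName
  deriving (Eq, Show)

data ChildNode
  = ElementNode QName Context [AttributeNode] [ChildNode]
  | TextNode String
  deriving (Eq, Show)

-- Assumed: an attribute carries its name and its string value.
data AttributeNode = AttributeNode QName String
  deriving (Eq, Show)

-- The document <doc id="x">hello</doc> with no namespace bindings.
example :: ChildNode
example =
  ElementNode (QName "" "doc") ("", [])
    [AttributeNode (QName "" "id") "x"]
    [TextNode "hello"]
```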

The key concept used by this validation technique is the concept of
a derivative. The derivative of a pattern p with
respect to a node x is a pattern for what's left of
p after matching x; in other words, it is a
pattern that matches any sequence that when appended to x
will match p.

If we can compute derivatives, then we can determine whether a
pattern matches a node: a pattern matches a node if the derivative of
the pattern with respect to the node is nullable.
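
To make the idea concrete, here is a minimal sketch over a deliberately reduced pattern language whose inputs are plain string tokens rather than XML nodes. The Token constructor and the functions deriv and matches are illustrative assumptions, not part of the algorithm described in this document.

```haskell
data Pattern
  = Empty
  | NotAllowed
  | Token String            -- matches exactly one token with this value
  | Choice Pattern Pattern
  | Group Pattern Pattern
  | OneOrMore Pattern
  deriving (Eq, Show)

-- A pattern is nullable if it matches the empty sequence.
nullable :: Pattern -> Bool
nullable Empty = True
nullable NotAllowed = False
nullable (Token _) = False
nullable (Choice p1 p2) = nullable p1 || nullable p2
nullable (Group p1 p2) = nullable p1 && nullable p2
nullable (OneOrMore p) = nullable p

-- deriv p x is a pattern for what is left of p after matching token x.
deriv :: Pattern -> String -> Pattern
deriv (Token t) x | t == x = Empty
deriv (Choice p1 p2) x = Choice (deriv p1 x) (deriv p2 x)
deriv (Group p1 p2) x =
  let d = Group (deriv p1 x) p2
  in if nullable p1 then Choice d (deriv p2 x) else d
deriv (OneOrMore p) x = Group (deriv p x) (Choice (OneOrMore p) Empty)
deriv _ _ = NotAllowed

-- A sequence matches if the derivative with respect to every token in
-- turn is nullable.
matches :: Pattern -> [String] -> Bool
matches p = nullable . foldl deriv p
```

For example, matches (Group (Token "a") (OneOrMore (Token "b"))) accepts ["a", "b", "b"] and rejects ["a"].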

It is desirable to be able to compute the derivative with respect to a
node in a streaming fashion, making a single pass over the tree. In order
to do this, we break down an element into a sequence of components:

a start-tag open containing a QName

a sequence of zero or more attributes

a start-tag close

a sequence of zero or more children

an end-tag

We compute the derivative of a pattern with respect to an element
by computing its derivative with respect to each component in turn.

We can now explain why we need the After pattern. A
pattern After x y is a pattern that
matches x followed by an end-tag followed by
y. We need the After pattern in
order to be able to express the derivative of a pattern with respect
to a start-tag open.

The central function is childDeriv, which computes the
derivative of a pattern with respect to a ChildNode and a
Context:
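
The sketch below shows how such a function might sequence the per-component derivatives. It is called childDeriv here, following the naming of the other derivative functions. So that the block stays self-contained, the component functions are passed in through a hypothetical record, Derivs, and the pattern type is left abstract; neither is part of the original presentation.

```haskell
type Uri = String
type LocalName = String
type Prefix = String
type Context = (Uri, [(Prefix, Uri)])

data QName = QName Uri LocalName
data AttributeNode = AttributeNode QName String
data ChildNode
  = ElementNode QName Context [AttributeNode] [ChildNode]
  | TextNode String

-- Hypothetical record bundling the per-component derivative functions,
-- abstracted over the pattern type p.
data Derivs p = Derivs
  { textD          :: Context -> p -> String -> p
  , startTagOpenD  :: p -> QName -> p
  , attD           :: Context -> p -> AttributeNode -> p
  , startTagCloseD :: p -> p
  , childrenD      :: Context -> p -> [ChildNode] -> p
  , endTagD        :: p -> p
  }

-- Derivative with respect to a child node: a text node is handled
-- directly; an element is broken into its components, each consumed in turn.
childDeriv :: Derivs p -> Context -> p -> ChildNode -> p
childDeriv d cx p (TextNode s) = textD d cx p s
childDeriv d _ p (ElementNode qn cx atts children) =
  let p1 = startTagOpenD d p qn          -- start-tag open
      p2 = foldl (attD d cx) p1 atts     -- attributes
      p3 = startTagCloseD d p2           -- start-tag close
      p4 = childrenD d cx p3 children    -- children
  in endTagD d p4                        -- end-tag
```

Note that an element is processed in its own Context, not the one passed in for its parent.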

Perhaps the trickiest part of the algorithm is in computing the
derivative with respect to a start-tag open. For this,
we need a helper function; applyAfter takes
a function and applies it to the second operand of
each After pattern.
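
A minimal sketch of such a helper, over a reduced pattern type, might look like this. The final clause, which reports a violated invariant, is an addition for illustration.

```haskell
data Pattern
  = Empty
  | NotAllowed
  | Text
  | Choice Pattern Pattern
  | Group Pattern Pattern
  | After Pattern Pattern
  deriving (Eq, Show)

-- Apply f to the second operand of every After pattern reachable
-- through Choice patterns.
applyAfter :: (Pattern -> Pattern) -> Pattern -> Pattern
applyAfter f (After p1 p2)  = After p1 (f p2)
applyAfter f (Choice p1 p2) = Choice (applyAfter f p1) (applyAfter f p2)
applyAfter _ NotAllowed     = NotAllowed
-- Any other pattern here means the caller broke the After invariant.
applyAfter _ _ = error "applyAfter: pattern is not a Choice/After/NotAllowed tree"
```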

We rely here on the fact that After patterns are
restricted in where they can occur. Specifically, an
After pattern cannot be the descendant of any pattern
other than a Choice pattern or another After
pattern; also the first operand of an After pattern can
neither be an After pattern nor contain any
After pattern descendants.

For Interleave, OneOrMore, Group or After we compute the derivative in a
similar way to textDeriv but with an important twist.
The twist is that instead of applying interleave,
group and after directly to the result of
recursively applying startTagOpenDeriv, we instead use
applyAfter to push the interleave,
group or after down into the second operand
of After. Note that the following definitions ensure
that the invariants on where After patterns can occur are
maintained.

We make use of the standard Haskell function flip
which flips the order of the arguments of a function of two arguments.
Thus, flip applied to a function of two arguments
f and an argument x returns a function of one
argument g such that g(y) =
f(y, x).
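
Putting these pieces together, a sketch of startTagOpenDeriv over a reduced pattern type might look like the following. The name class of an Element is simplified to a single local name, and the smart constructors choice, group, interleave and after recognize only the basic NotAllowed and Empty identities; both simplifications are assumptions made to keep the example short.

```haskell
data Pattern
  = Empty | NotAllowed | Text
  | Choice Pattern Pattern
  | Interleave Pattern Pattern
  | Group Pattern Pattern
  | OneOrMore Pattern
  | Element String Pattern   -- name class reduced to one local name
  | After Pattern Pattern
  deriving (Eq, Show)

nullable :: Pattern -> Bool
nullable Empty = True
nullable Text = True
nullable (Choice p1 p2) = nullable p1 || nullable p2
nullable (Interleave p1 p2) = nullable p1 && nullable p2
nullable (Group p1 p2) = nullable p1 && nullable p2
nullable (OneOrMore p) = nullable p
nullable _ = False           -- NotAllowed, Element, After

-- Smart constructors that recognize NotAllowed and Empty identities.
choice, group, interleave, after :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice p1 p2 = Choice p1 p2
group _ NotAllowed = NotAllowed
group NotAllowed _ = NotAllowed
group p Empty = p
group Empty p = p
group p1 p2 = Group p1 p2
interleave _ NotAllowed = NotAllowed
interleave NotAllowed _ = NotAllowed
interleave p Empty = p
interleave Empty p = p
interleave p1 p2 = Interleave p1 p2
after _ NotAllowed = NotAllowed
after NotAllowed _ = NotAllowed
after p1 p2 = After p1 p2

applyAfter :: (Pattern -> Pattern) -> Pattern -> Pattern
applyAfter f (After p1 p2) = after p1 (f p2)
applyAfter f (Choice p1 p2) = choice (applyAfter f p1) (applyAfter f p2)
applyAfter _ _ = NotAllowed  -- only NotAllowed gets here if the invariant holds

startTagOpenDeriv :: Pattern -> String -> Pattern
startTagOpenDeriv (Choice p1 p2) qn =
  choice (startTagOpenDeriv p1 qn) (startTagOpenDeriv p2 qn)
startTagOpenDeriv (Element n p) qn =
  if n == qn then after p Empty else NotAllowed
startTagOpenDeriv (Interleave p1 p2) qn =
  choice (applyAfter (flip interleave p2) (startTagOpenDeriv p1 qn))
         (applyAfter (interleave p1) (startTagOpenDeriv p2 qn))
startTagOpenDeriv (OneOrMore p) qn =
  applyAfter (flip group (choice (OneOrMore p) Empty)) (startTagOpenDeriv p qn)
startTagOpenDeriv (Group p1 p2) qn =
  let x = applyAfter (flip group p2) (startTagOpenDeriv p1 qn)
  in if nullable p1 then choice x (startTagOpenDeriv p2 qn) else x
startTagOpenDeriv (After p1 p2) qn =
  applyAfter (flip after p2) (startTagOpenDeriv p1 qn)
startTagOpenDeriv _ _ = NotAllowed
```

In the Group case, flip group p2 is the function that takes the remainder r of p1 and builds group r p2, so the rest of the group ends up after the end-tag of the element just opened.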

Computing the derivative with respect to an attribute is done in a
similar way to computing the derivative with respect to a text node. The
main difference is in the handling of Group, which has to
deal with the fact that the order of attributes is not significant.
Computing the derivative of a Group pattern with respect
to an attribute node works the same as computing the derivative of an
Interleave pattern.
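
The following sketch isolates the Group case. It assumes a heavily reduced setting: an Attribute pattern is identified by a local name only and accepts any value, and the context argument is omitted.

```haskell
data Pattern
  = Empty
  | NotAllowed
  | Choice Pattern Pattern
  | Group Pattern Pattern
  | Attribute String        -- simplified: local name only, any value
  deriving (Eq, Show)

choice, group :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice p1 p2 = Choice p1 p2
group _ NotAllowed = NotAllowed
group NotAllowed _ = NotAllowed
group p Empty = p
group Empty p = p
group p1 p2 = Group p1 p2

-- Derivative with respect to an attribute, given here by name alone.
attDeriv :: Pattern -> String -> Pattern
attDeriv (Choice p1 p2) n = choice (attDeriv p1 n) (attDeriv p2 n)
-- Group is treated like Interleave: the attribute may be consumed by
-- either operand, because attribute order is not significant.
attDeriv (Group p1 p2) n =
  choice (group (attDeriv p1 n) p2) (group p1 (attDeriv p2 n))
attDeriv (Attribute a) n | a == n = Empty
attDeriv _ _ = NotAllowed
```

With p = Group (Attribute "x") (Attribute "y"), consuming "y" and then "x" leaves Empty, just as consuming "x" and then "y" does.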

The case where the list of children is empty is treated as if there
were a text node whose value were the empty string. See rule
(weak match 3) in the RELAX NG spec.

childrenDeriv cx p [] = childrenDeriv cx p [(TextNode "")]

In the case where the list of children consists of a single text
node and the value of the text node consists only of whitespace, the
list of children matches if the list matches either with or without
stripping the text node. Note the similarity with
valueMatch.
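
A sketch of how the cases for the list of children might fit together is shown below. So that the block stands alone, the derivative with respect to a single child (childD) and the choice constructor (choiceP) are passed in as parameters and the pattern type is left abstract. The context argument is omitted, ElementNode is reduced to a name and children, and the rule that whitespace-only text nodes are skipped in longer lists is an assumption based on RELAX NG whitespace handling.

```haskell
data ChildNode
  = TextNode String
  | ElementNode String [ChildNode]   -- simplified element node

-- XML whitespace: space, tab, carriage return, line feed.
whitespace :: String -> Bool
whitespace = all (`elem` " \t\r\n")

-- A child is stripped if it is a whitespace-only text node.
strip :: ChildNode -> Bool
strip (TextNode s) = whitespace s
strip _ = False

childrenDeriv :: (p -> ChildNode -> p) -> (p -> p -> p) -> p -> [ChildNode] -> p
-- No children: treat as a single empty text node.
childrenDeriv childD choiceP p [] =
  childrenDeriv childD choiceP p [TextNode ""]
-- A single text node: if it is all whitespace, the content matches
-- either with or without the node.
childrenDeriv childD choiceP p [TextNode s] =
  let p1 = childD p (TextNode s)
  in if whitespace s then choiceP p p1 else p1
-- Otherwise: skip whitespace-only text nodes, consume the rest in order.
childrenDeriv childD _ p children =
  foldl (\q c -> if strip c then q else childD q c) p children
```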

Optimizations

Computing nullable

The nullability of a pattern can be determined straightforwardly as
the pattern is being constructed. Instead of computing
nullable repeatedly, it should be computed once when the
pattern is constructed and stored as a field in the pattern.
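
One way this might be arranged in Haskell is to pair each pattern with a Boolean field that the constructor functions fill in. The record layout and the lower-case constructor functions below are assumptions made for illustration.

```haskell
-- A pattern carries its nullability alongside its structure.
data Pattern = Pattern { nullable :: Bool, shape :: Shape }

data Shape
  = Empty
  | NotAllowed
  | Text
  | Choice Pattern Pattern
  | Group Pattern Pattern
  | OneOrMore Pattern
  | Element String Pattern   -- name class reduced to one local name

empty, notAllowed, text :: Pattern
empty      = Pattern True Empty
notAllowed = Pattern False NotAllowed
text       = Pattern True Text

-- Each constructor function derives the flag from its operands' flags,
-- so no later traversal is needed.
choice, group :: Pattern -> Pattern -> Pattern
choice p1 p2 = Pattern (nullable p1 || nullable p2) (Choice p1 p2)
group p1 p2  = Pattern (nullable p1 && nullable p2) (Group p1 p2)

oneOrMore :: Pattern -> Pattern
oneOrMore p = Pattern (nullable p) (OneOrMore p)

-- An element is never nullable; the flag does not inspect the content,
-- so cyclic element definitions remain safe.
element :: String -> Pattern -> Pattern
element n p = Pattern False (Element n p)
```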

Interning patterns

Additional optimizations become possible if we can
efficiently determine whether two patterns are equal. We don't want to
have to completely walk the structure of both patterns to determine
equality. To make efficient comparison possible, we intern
patterns in a hash table. Two interned patterns are equal if and only
if they are the same object (i.e. == in Java terms).
(This is similar to the way that Strings are interned to
make Symbols which can be compared for equality using
==.) To make interning possible, there are two notions
of identity defined on patterns each with a corresponding hash
function:

interned identity is simply object identity (i.e. ==
or Object.equals in Java); for a hash function, we can
use Object.hashCode in Java or the address of the object in
C/C++

uninterned identity uses the type of the pattern, the
interned identity of subpatterns, and the identity of any other
parts of the pattern; similarly, the uninterned hash function calls
the interned hash function on subpatterns

To intern patterns, we maintain a set of patterns implemented as a
hash table. The hash table uses uninterned identity and the
corresponding uninterned hash function. When a new pattern is
constructed, any subpatterns must first be interned. The pattern is
interned by looking it up in the hash table. If it is found, we throw
the new pattern away and instead return the existing entry in the hash
table. If it is not found, we store the pattern in the hash table
before returning it. (This is basically hash-consing.)
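
The procedure can be sketched in a purely functional style as follows. This is an illustration under stated assumptions, not the approach of any particular implementation: an interned pattern is represented by an integer id, the table is a Data.Map ordered map (from the containers package) standing in for a hash table, and the table is threaded through explicitly.

```haskell
import qualified Data.Map.Strict as Map

-- Interned identity: an interned pattern is just its id.
type Id = Int

-- Uninterned identity: a node's constructor plus the ids of its
-- (already interned) subpatterns. The derived Eq and Ord compare one
-- level only and never walk the whole structure.
data Node
  = Empty
  | NotAllowed
  | Text
  | Choice Id Id
  | Group Id Id
  | OneOrMore Id
  deriving (Eq, Ord, Show)

type Table = Map.Map Node Id

-- Look the node up; reuse the existing id if present, otherwise
-- allocate the next id and record it.
intern :: Node -> Table -> (Id, Table)
intern n t =
  case Map.lookup n t of
    Just i  -> (i, t)
    Nothing -> let i = Map.size t in (i, Map.insert n i t)
```

Interning the same Choice twice returns the same id, so equality of interned patterns reduces to comparing two integers.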

Avoiding exponential blowup

In order to avoid exponential blowup with some patterns, it is
essential for the choice function to eliminate redundant
choices. Define the choice-leaves of a pattern to be the
concatenation of the choice-leaves of its operands if the pattern
is a Choice pattern and the list containing just the pattern itself otherwise.
Eliminating redundant choices means ensuring that the list of
choice-leaves of the constructed pattern contains no duplicates. One
way to do this is for choice to walk the choice-leaves
of one operand building a hash-table of the set of choice-leaves of
that operand; then walk the other operand using this hash-table to
eliminate any choice-leaf that has occurred in the other operand.
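
A sketch of such a choice function follows. It treats a pattern that is not a Choice as its own single choice-leaf, uses a Data.Set (from the containers package) with structural comparison in place of a hash table over interned patterns, and works on a reduced pattern type; all three are assumptions made for illustration.

```haskell
import qualified Data.Set as Set

data Pattern
  = NotAllowed
  | Token String            -- stand-in for any non-Choice pattern
  | Choice Pattern Pattern
  deriving (Eq, Ord, Show)

-- The leaves of a tree of nested Choice patterns, left to right.
choiceLeaves :: Pattern -> [Pattern]
choiceLeaves (Choice p1 p2) = choiceLeaves p1 ++ choiceLeaves p2
choiceLeaves p = [p]

-- Build a choice, dropping any leaf of the second operand that already
-- occurs in the first. Assumes each operand is itself free of
-- duplicates, which holds if it was built with this function.
choice :: Pattern -> Pattern -> Pattern
choice p NotAllowed = p
choice NotAllowed p = p
choice p1 p2 =
  let seen  = Set.fromList (choiceLeaves p1)
      fresh = [l | l <- choiceLeaves p2, not (Set.member l seen)]
  in foldl Choice p1 fresh
```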

Memoization

Memoization is an optimization technique that can be applied to any
pure function that has no side-effects and whose return value depends
only on the value of its arguments. The basic idea is to remember
function calls. A table is maintained that maps lists of argument
values to previously computed return values for those arguments. When
a function is called with a particular list of arguments, that list of
arguments is looked up in the table. If an entry is found, then the
previously computed value is returned immediately. Otherwise, the
value is computed as usual and then stored in the table for future
use.
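
In a pure setting the table can be passed in and returned explicitly. The helper below, memoCall, is a hypothetical name; it is a minimal sketch that uses a Data.Map (from the containers package) keyed on a single argument.

```haskell
import qualified Data.Map.Strict as Map

-- Call f on x through a memo table: return the stored result if there
-- is one, otherwise compute it and record it in the returned table.
memoCall :: Ord a => (a -> b) -> a -> Map.Map a b -> (b, Map.Map a b)
memoCall f x table =
  case Map.lookup x table of
    Just y  -> (y, table)
    Nothing -> let y = f x in (y, Map.insert x y table)
```

A function of several arguments can be memoized the same way by using a tuple of the arguments as the key, which is practical when each argument can be compared cheaply (for example, an interned pattern).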

The functions startTagOpenDeriv,
startTagCloseDeriv and endTagDeriv defined
above can be memoized efficiently.

Memoizing textDeriv is suboptimal because, although
textDeriv takes the string value of the text node and the
context as arguments, in many cases the result does not depend on
these arguments. Instead we can distinguish two different cases for
the content of an element. One case is that the content contains no
elements (i.e. it's empty or consists of just a string). In this case,
we can first simplify the pattern using a textOnlyDeriv function that
replaces each Element pattern by NotAllowed.
This can be efficiently memoized.

In this case, textOnlyDeriv will always be followed
by endTagDeriv, so we can fold the functionality
of endTagDeriv into textOnlyDeriv.

In the other case, the content of the element contains one or more
child elements. In this case, any text nodes can match only
Text patterns (because of the restrictions in section 7.2
of the RELAX NG specification). The derivative of a Text
pattern with respect to a text node does not depend on either the
value of the text node or the context. We therefore introduce a
mixedTextDeriv function, which can be efficiently
memoized, for use in this case.

Another important special case of textDeriv that can
be memoized efficiently is when we can determine statically that a
pattern is consistent with some datatype. More precisely, we can
define a pattern p to be consistent with a datatype
d if and only if for any two strings s1 and s2, and any two
contexts c1 and c2, if
datatypeEqual d s1 c1 s2 c2
holds, then
textDeriv c1 p s1
is the same as
textDeriv c2 p s2.
In this case, we can combine the string and context arguments into a
single argument representing the value of the datatype that the string
represents in the context; this can be much more efficiently memoized
than the general case.

The attDeriv function can be memoized more efficiently
by splitting it into two functions. The first is a
startAttributeDeriv function that works like
startTagOpenDeriv and depends just on the
QName of the attribute. The second works in the
same way as the case where the children of an element contain a single
string.

Error handling

So far, the algorithms presented do nothing more than compute
whether or not the node is valid with respect to the pattern.
However, a user will not appreciate a tool that simply reports that
the document is invalid, without giving any indication of where the
problem occurs or what the problem is.

The most important thing is to detect invalidity as soon as
possible. If an implementation can do this, then it can tell the user
where the problem occurs and it can protect the application from
seeing invalid data. If we consider the XML document to be a sequence
of SAX-like events, then detecting the error as soon as
possible means that the implementation must detect when an
initial sequence s of events is such that there is no valid
sequence of events that starts with s.

This is straightforward with the algorithm above. Detecting the
error as soon as possible is equivalent to detecting when the current
pattern becomes NotAllowed. Note that this relies on the
choice, interleave, group and
after functions recognizing the algebraic identities
involving NotAllowed. The current pattern immediately
before it becomes NotAllowed describes what was expected
and can be used to diagnose the error.
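
As a sketch of this check, the function below folds a derivative function over a list of events and stops at the first event that drives the pattern to NotAllowed. The name firstError and the abstraction over the pattern and event types are assumptions made so that the block stands alone.

```haskell
-- Feed events to the pattern one at a time. If some event turns the
-- pattern into NotAllowed, return that event's index together with the
-- pattern as it was immediately before, which describes what was expected.
firstError
  :: (p -> e -> p)   -- derivative with respect to one event
  -> (p -> Bool)     -- is this pattern NotAllowed?
  -> p               -- starting pattern
  -> [e]             -- events in document order
  -> Maybe (Int, p)
firstError deriv isNotAllowed = go 0
  where
    go _ _ [] = Nothing
    go i p (e : es) =
      let p' = deriv p e
      in if isNotAllowed p' then Just (i, p) else go (i + 1) p' es
```

Because the list of events is consumed lazily, events after the first error are never examined.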

In some scenarios it may be sufficient to produce a single error
message for an invalid document, and to cease validation as soon as it
is determined that the document is invalid. In other scenarios, it
may be desirable to attempt to recover from the error and continue
validation so as to find subsequent errors in the document. Jing
recovers from validation errors as follows:

If startTagOpenDeriv causes an error, then Jing first
tries to recover on the assumption that some required elements have
been omitted. In effect, it transforms the pattern by making the
first operand of each Group optional and then retries
startTagOpenDeriv. If this still causes an error, then, for
the purposes of validating following siblings, it ignores the
element. For the purpose of validating the element itself, it searches
the whole schema for element patterns with a name class
that contains the name of the start-tag open. If it finds one or more
such element patterns, then it uses a choice
of the content of all element patterns that have a
name-class that contains the name of the start-tag open with maximum
specificity. A name-class that contains the name by virtue of a
name element is considered more specific than one that
contains the name by virtue of an nsName or
anyName element; similarly, a name-class that contains
the name by virtue of an nsName element is considered more
specific than one that contains the name by virtue of an
anyName element. If there is no such element pattern,
then it validates only the maximal subtrees rooted in an element for
which the schema does contain an element
pattern. Anything outside the maximal subtrees is ignored.

If startAttributeDeriv causes an error, then it
recovers by ignoring the attribute.

If startTagCloseDeriv causes an error, it recovers by
replacing all attribute patterns by
empty.

If textDeriv (used only for an attribute value or for
an element that contains no child elements) causes an error, then it
recovers by replacing the first operands of all top-level
After patterns (i.e. After patterns not
inside another After pattern) by empty.

If mixedTextDeriv causes an error, it recovers by
ignoring the text node.

If endTagDeriv causes an error, it recovers by using
a choice of the second operands of all top-level
After patterns.