We've seen some VM dumps recently that
showed evidence of contention problems when multiple threads
(created, for example, using <xsl:for-each>
with the saxon:threads
attribute) were attempting lazy evaluation of the same local
variable. So I've been looking at the lazy evaluation code in Saxon
to try and understand all the permutations of how it works. A blog
posting is a good way to try and capture that understanding before I
forget it all again. But I won't go into the extra complexities of
parallel execution just yet: I'll come back to that at the end.

Lazy evaluation applies when a variable
binding, for example "let
$v := //x[@y=3]" isn't evaluated immediately when the
variable declaration is encountered, but only when the variable is
actually referenced. This is possible in functional languages because
evaluating an expression has no side-effects, so it doesn't matter
when (or how often) it is done. In some functional languages such as
Scheme, lazy evaluation happens only if you explicitly request it. In
others, such as Haskell, lazy evaluation is mandated by the language
specification (which means that a variable can hold an infinite
sequence, so long as you don't try to process its entire value). In
XSLT and XQuery, lazy evaluation is entirely at the discretion of the
compiler, and in this post I shall try to summarize how Saxon makes
use of this freedom.

Internally, when a local variable is
evaluated lazily, Saxon does not put the variable's value in
the relevant slot on the stack; instead it puts there a data structure
that contains all the information needed to evaluate the variable:
that is, the expression itself, and any part of the evaluation
context on which it depends. In Saxon this data structure is called a
Closure. The terminology isn't quite right, because it's not quite
the same thing as the closure of an inline function, but the concepts
are closely related: in some languages, lazy evaluation is
implemented by storing, as the value of the variable, not the
variable's actual value, but a function which delivers that value
when invoked, and the data needed by this function to achieve that
task is correctly called a closure. (If higher-order functions had
been available in Saxon a few years earlier, we might well have
implemented lazy evaluation this way.)
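In Java terms, that function-plus-closure idea can be sketched with a memoizing supplier. This is illustrative only, not Saxon's actual code (and it leans on Java 8's Supplier interface, which postdates the design being described):

```java
import java.util.function.Supplier;

// A sketch of lazy evaluation via a memoizing function: the "closure"
// captures whatever it needs to compute the value, computes it at most
// once on first access, and then releases the captured context.
final class Lazy<T> implements Supplier<T> {
    private Supplier<T> compute;   // non-null until first evaluation
    private T value;

    Lazy(Supplier<T> compute) {
        this.compute = compute;
    }

    @Override
    public T get() {
        if (compute != null) {
            value = compute.get();
            compute = null;        // drop the saved context once evaluated
        }
        return value;
    }
}
```

A variable bound to `new Lazy<>(() -> expensiveComputation())` costs nothing until the first reference, and the computation runs at most once however many references follow.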

We can distinguish two levels of lazy
evaluation. We might use the term "deferred evaluation" to
indicate that a variable is not evaluated until it is first
referenced, and "incremental evaluation" to indicate that
when it is referenced, it is only evaluated to the extent necessary.
For example, if the first reference is the function call head($v),
only the first item in the sequence $v will be evaluated; remaining
items will only be evaluated if a subsequent reference to the
variable requires them.

Lazy evaluation can apply to global
variables, local variables, parameters of templates and functions,
and return values from templates and functions. Saxon handles each
case slightly differently.

We should mention some static
optimizations which are not directly related to lazy evaluation, but
are often confused with it. First, a variable that is never
referenced is eliminated at compile-time, so its initializing
expression is never evaluated at all. Secondly, a variable that is
only referenced once, and where the reference is not in any kind of
loop, is inlined: that is, the variable reference is replaced by the
expression used to initialize the variable, and the variable itself
is then eliminated. So when someone writes "let
$x := /a/b/c return $x[d=3]", Saxon turns this into the
expression "(/a/b/c)[d=3]".
(Achieving this of course requires careful attention to the static
and dynamic context, but we won't go into the details here.)

Another static optimization that
interacts with variable evaluation is loop-lifting. If an expression
within a looping construct (for example the content of xsl:for-each,
or of a predicate, or the right-hand-side of the "/"
operator) will have the same value for every iteration of the loop,
then a new local variable bound to this expression is created outside
the loop, and the original expression is replaced by a reference to
the variable. In this situation we need to take care that the
expression is not evaluated unless the loop is executed at least once
(both to avoid wasted evaluation cost, and to give the right
behaviour in the event that evaluating the expression fails with a
dynamic error.) So lazy evaluation of such a variable becomes
mandatory.

The combined effect of these static
optimizations, together with lazy evaluation, is that the order of
evaluation of expressions can be quite unintuitive. To enable users
to understand what is going on when debugging, it is therefore normal
for some of these rewrites to be suppressed if debugging or tracing
are enabled.

For global variables, Saxon uses
deferred evaluation but not incremental evaluation. A global variable
is not evaluated until it is first referenced, but at that point it
is completely evaluated, and the sequence representing its value is
held in memory in its entirety.

For local variables, evaluation is
generally both deferred and incremental. However, the rules are quite
complex.

If the static type shows that the
value will be a singleton, then it will be evaluated eagerly. [It's
not at all clear that this rule makes sense. Certainly, incremental
evaluation makes no sense for singletons. But deferred evaluation
could still be very useful, for example if the evaluation is
expensive and the variable is only referenced within a branch of a
conditional, so the value is not always needed.]

Eager evaluation is used when the
binding expression is very simple: in particular when it is a
literal or a reference to another variable.

Eager evaluation is used for
binding expressions that depend on position()
or last(), to
avoid the complexities of saving these values in the Closure.

There are some optimizations which
take precedence over lazy evaluation. For example if there are
variable references using predicates, such as $v[@x=3],
then the variable will not only be evaluated eagerly, but will also
be indexed on the value of the attribute @x.
Another example: if a variable is initialized to an expression such
as ($v, x) -
that is, a sequence that appends an item to another variable -
then we use a "shared append expression", a data
structure that allows a sequence to be constructed by appending to
an existing sequence without copying that sequence; this is a
common pattern in algorithms that use head-tail recursion.
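The shared-append idea can be sketched like this (an illustrative structure with invented names, not Saxon's actual class):

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

// Sketch of a "shared append" sequence: a view over an existing base
// sequence plus a private tail of appended items. The base is shared,
// never copied, so appending one item is O(1) regardless of its length.
final class AppendedSequence<T> extends AbstractList<T> {
    private final List<T> base;
    private final List<T> tail = new ArrayList<T>();

    AppendedSequence(List<T> base) {
        this.base = base;
    }

    void append(T item) {
        tail.add(item);
    }

    @Override
    public T get(int i) {
        // Items up to base.size() come from the shared base sequence.
        return i < base.size() ? base.get(i) : tail.get(i - base.size());
    }

    @Override
    public int size() {
        return base.size() + tail.size();
    }
}
```

In a head-tail recursive function that computes `($acc, x)` on every call, this turns what would be quadratic copying into linear-time accumulation.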

Lazy evaluation (and inlining)
need special care if the variable is declared outside a try/catch
block, but is referenced within it. In such a case a dynamic error
that occurs while evaluating the initialization expression must not
be caught by the try/catch; it is logically outside its scope.
(Writing this has made me realise that this is not yet implemented
in Saxon; I have written a test case and it currently fails.)

If none of these special circumstances
apply, lazy evaluation is chosen. There is one more choice to be
made: between a Closure
and a MemoClosure.
The common case is a MemoClosure,
and in this case, as the variable is incrementally evaluated, the
value is saved for use when evaluating subsequent variable
references. A (non-memo) closure is used when it is known that the
value will only be needed once. Because most such cases have been
handled by variable inlining, the main case where a non-memo closure
is used is for the return value of a function. Functions, like
variables, are lazily evaluated, so that the value returned to the
caller is not actually a sequence in memory, but a closure containing
all the information needed to materialize the sequence. (Like most
rules in this story, there is an important exception: tail-call
optimization, where the last thing a function does is to call itself,
takes precedence over lazy evaluation).

So let's look more closely at the
MemoClosure. A
MemoClosure is a
data structure that holds the following information:

The Expression itself (a pointer
to a node in the expression tree). The Expression object also holds
any information from the static context that is needed during
evaluation, for example namespace bindings.

A copy of the dynamic context at
the point where the variable is bound. This includes the context
item, and values of any local variables referenced by the
expression.

The current evaluation state: one
of UNREAD (no access to the variable has yet been made), MAYBE_MORE
(some items in the value of the variable are available, but there
may be more to come), ALL_READ (the value of the variable is fully
available), BUSY (the variable is being evaluated), or EMPTY
(special case of ALL_READ in which the value is known to be an empty
sequence).

An InputIterator:
an iterator over the results of the expression, relevant when
evaluation has started but has not yet finished.

A reservoir: a list containing the
items delivered by the InputIterator so far.

Many variable references, for example
count($v), or
index-of($v, 'z')
result in the variable being evaluated in full. If this is the first
reference to the variable, that is if the state is UNREAD, the logic
is essentially

inputIterator = expression.iterate(savedContext);
for item in inputIterator {
    reservoir.add(item);
}
state = ALL_READ;
return new SequenceExtent(reservoir);

(However, Saxon doesn't optimize this
case, and it occurs to me on writing this that it could.)

Other variable references, such as
head($v), or
$v[1], or
subsequence($v, 1, 5),
require only partial evaluation of the expression. In such cases
Saxon creates and returns a ProgressiveIterator,
and the requesting expression reads as many items from the
ProgressiveIterator
as it needs. Requests to get items from the ProgressiveIterator
fetch items from the reservoir to the extent they are available; on
exhaustion of the reservoir, they then attempt to fetch items from
the InputIterator until either enough items are available, or the
InputIterator is exhausted. Items delivered from the InputIterator
are copied to the reservoir as they are found.
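Ignoring thread-safety for the moment, the reservoir scheme can be sketched as follows (illustrative names, not Saxon's actual classes):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch of incremental evaluation with a shared reservoir: each
// variable reference gets its own "progressive" iterator, but they all
// read from one reservoir, which is extended from the underlying input
// iterator only when a caller reads past what is already cached.
final class MemoSequence<T> {
    private final Iterator<T> input;            // the InputIterator
    private final List<T> reservoir = new ArrayList<T>();

    MemoSequence(Iterator<T> input) {
        this.input = input;
    }

    Iterator<T> iterate() {
        return new Iterator<T>() {
            private int pos = 0;                // position in the reservoir

            @Override
            public boolean hasNext() {
                return pos < reservoir.size() || input.hasNext();
            }

            @Override
            public T next() {
                if (pos == reservoir.size()) {
                    reservoir.add(input.next()); // pull one more item in
                }
                return reservoir.get(pos++);
            }
        };
    }
}
```

A reference such as head($v) reads one item and stops; a later reference replays that item from the reservoir and only then resumes pulling from the input, so no item of the underlying expression is ever computed twice.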

So far so good. This has all been in
place for years, and works well. We have no evidence that it is in
any way optimal, but it has been carefully tweaked over the years to
deal with particular cases where it was performing badly. What has
changed recently is that local variables can be referenced from
multiple threads. There are two particular cases where this happens
today: when xsl:result-document
is used in Saxon-EE, it executes by default asynchronously in a new
thread; and when the extension attribute saxon:threads
is used on xsl:for-each,
the items selected by the xsl:for-each
are processed in parallel rather than sequentially.

The effect of this is that the
MemoClosure object needs to be thread-safe: multiple requests to
access the variable can come simultaneously from different threads.
To achieve this a number of methods are synchronized. One of these is
the next() method
of the ProgressiveIterator:
if two threads reference the variable at the same time, each gets its
own ProgressiveIterator,
and the next()
method on one of these iterators is forced to wait until the other
has finished.

This works, but it is risky. Brian
Goetz in his excellent book Java Concurrency in Practice
recommends that a method should not be synchronized unless (a) its
execution time is short, and (b) as the author of the method, you
know exactly what code will execute while it is active. In this case
neither condition is satisfied. The next()
method of ProgressiveIterator
calls the next()
method of the InputIterator,
and this may perform expensive computation, for example retrieving
and parsing a document using the doc()
function. Further, we have no way of analyzing exactly what code is
executed: in the worst case, it may include user-written code (for
example, an extension function or a URIResolver).
The mechanism can't deadlock with itself (because there cannot be a
cycle of variable references) but it is practically impossible to
prove that it can't deadlock with other subsystems that use
synchronization, and in the face of maliciously-written user code,
it's probably safe to assume that deadlock can occur. We
haven't seen deadlock happen in practice, but it's unsatisfactory
that we can't prove its impossibility.

So what should we do about it?

I think the answer is, add yet another
exception to the list of cases where lazy evaluation is used:
specifically, don't use it for a variable that can be referenced from
a different thread. I'm pretty sure it's possible to detect such
cases statically, and they won't be very common. In such cases, use
eager evaluation instead.

We must be careful not to do this in
the case of a loop-lifted variable, where the correct error semantics
depend on lazy evaluation. So another tweak to the rules is, don't
loop-lift code out of a multithreaded execution block.

This investigation also suggests a few
other refinements we might make.

It seems worth optimizing for the
case where the entire value of a variable is needed, since this case
is so common. The problem is, it's not easy to detect this case: a
calling expression such as count($v)
will ask for an iterator over the variable value, without giving any
indication that it intends to read the iterator to completion.

We need to reassess the rule that
singleton local variables are evaluated eagerly.

We currently avoid using lazy
evaluation for expressions with certain dependencies on the dynamic
context (for example, position()
and last()). But
in the course of implementing higher-order functions, we have
acquired the capability to hold such values in a saved copy of the
dynamic context.

We could look at a complete
redesign that takes advantage of higher-order functions and their
closures. This might be much simpler than the current design; but it
would discard the benefits of years of fine-tuning of the current
design.

I'm not convinced that it makes
sense for a MemoClosure
to defer creation of the InputIterator
until the first request for the variable value. It would be a lot
simpler to call inputIterator
= Expression.iterate(context) at the point of variable
declaration; in most cases the implementation will defer evaluation
to the extent that this makes sense, and this approach saves the
cost of the elaborate code to save the necessary parts of the
dynamic context. It's worth trying the other approach and making
some performance measurements.

As explained in my previous post, the NamePool in Saxon is a potential problem for scalability, both because access can cause contention, and also because it has serious limits on the number of names it can hold: there's a maximum of one million QNames, and performance starts getting seriously bad long before this limit is reached.

Essentially, the old NamePool is a home-grown hash table. It uses a fixed number of buckets (1024), and when hash collisions occur, the chains of hash duplicates are searched serially. The fact that the number of buckets is fixed, and entries are only added to the end of a chain, is what makes it (reasonably) safe for read access to the pool to occur without locking.

One thing I have been doing over a period of time is to reduce the amount of unnecessary use of the NamePool. Most recently I've changed the implementation of the schema component model so that references from one schema component to another are no longer implemented using NamePool fingerprints. But this is peripheral: the core usage of the NamePool for comparing names in a query against names in a source document will always remain the dominant usage, and we need to make this scaleable as parallelism increases.

Today I've been exploring an alternative design for the NamePool (and some variations on the implementation of the design). The new design has at its core two Java ConcurrentHashMaps, one from QNames to fingerprints, and one from fingerprints to QNames. The ConcurrentHashMap, which was introduced in Java 5, doesn't just offer safe multi-threaded access; it also offers very low contention: it uses fine-grained locking to ensure that multiple writers, and any number of readers, can access the data structure simultaneously.

Using two maps, one of which is the inverse of the other, at first seemed a problem. How can we ensure that the two maps are consistent with each other, without updating both under an exclusive lock, which would negate all the benefits? The answer is that we can't completely, but we can get close enough.

Now, there are several things slightly unsafe about this. We might find that the QName doesn't exist in the map on our first look, but by the time we get to the "putIfAbsent" call, someone else has added it. The worst that happens here is that we've used up an integer from the "unique" sequence unnecessarily. Also, someone else doing concurrent read access might see the NamePool in a state where one map has been updated and the other hasn't. But I believe this doesn't matter: clients aren't going to look for a fingerprint in the map unless they have good reason to believe that fingerprint exists, and it's highly implausible that this knowledge comes from a different thread that has only just added the fingerprint to the map.
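The two-map allocation logic described above can be sketched like this, using putIfAbsent (illustrative code with invented names; QNames are represented here as plain strings):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the two-map design: reads never lock, and the only cost of
// a race on allocation is a wasted integer from the "unique" sequence.
// The reverse map is updated after the winner is known, so a concurrent
// reader may briefly see one map updated and not the other.
final class TwoWayPool {
    private final ConcurrentHashMap<String, Integer> nameToCode =
            new ConcurrentHashMap<String, Integer>();
    private final ConcurrentHashMap<Integer, String> codeToName =
            new ConcurrentHashMap<Integer, String>();
    private final AtomicInteger next = new AtomicInteger();

    int allocate(String qName) {
        Integer existing = nameToCode.get(qName);
        if (existing != null) {
            return existing;
        }
        int candidate = next.incrementAndGet();      // wasted if we lose the race
        existing = nameToCode.putIfAbsent(qName, candidate);
        if (existing != null) {
            return existing;                         // another thread got there first
        }
        codeToName.put(candidate, qName);            // reverse map updated second
        return candidate;
    }

    String getName(int code) {
        return codeToName.get(code);
    }
}
```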

There's another ConcurrentHashMap involved as well, which is a map from URIs to lists of prefixes used in conjunction with that URI. I won't go into that detail.

The external interface to the NamePool doesn't change at all by this redesign. We still use 20-bit fingerprints plus 10-bit prefix codes, so we still have the limit of a million distinct names. But performance no longer degrades when we get close to that limit; and the limit is no longer quite so hard-coded.

My first attempt at measuring the performance of this found the expected benefits in scalability as the concurrency increases and as the size of the vocabulary increases, but the performance under more normal conditions was worse than the existing design: execution time of 5s versus 3s for executing 100,000 cycles each of which performed an addition (from a pool of 10,000 distinct names so 90% of the additions were already present) followed by 20 retrievals.

I suspected that the performance degradation was caused by the need to update two maps, whereas the existing design only uses one (it's cleverly done so that the fingerprint generated for a QName is closely related to its hash key, which enables us to use the fingerprint to navigate back into the hash table to reconstruct the original QName).

But it turned out that the cause was somewhere else. The old NamePool design was hashing QNames by considering only the local part of the name and ignoring the namespace URI, whereas the new design was computing a hash based on both the local name and the URI. Because URIs are often rather long, computing the hash code is expensive, and in this case it adds very little value: it's unusual for the same local name to be associated with more than one URI, and when it happens, the hash table is perfectly able to cope with the collision. By changing the hashing on QName objects to consider only the local name, the costs for the new design came down slightly below the current implementation (about 10% better, not enough to be noticeable).
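The hashing change can be illustrated with a hypothetical QName class (not Saxon's actual one) whose hash code considers only the local name, while equality still compares the URI:

```java
// Sketch: hash on the cheap local name only; equals() still checks the
// (potentially long) URI. Two names that share a local name land in the
// same hash bucket, which the table handles as an ordinary collision.
final class QName {
    final String uri;
    final String local;

    QName(String uri, String local) {
        this.uri = uri;
        this.local = local;
    }

    @Override
    public int hashCode() {
        return local.hashCode();       // never touches the URI
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof QName)) {
            return false;
        }
        QName q = (QName) o;
        return q.local.equals(local) && q.uri.equals(uri);
    }
}
```

The trade-off is exactly as described: hashing becomes cheap, and the rare case of one local name in several namespaces degrades gracefully into a bucket collision rather than a correctness problem.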

So I feel comfortable putting this into production. There are a dozen test cases failing (out of 20,000) which I need to sort out first, but it all looks very promising.

I've been looking again at the
implementation of some of the parallel processing features in Saxon
(see my XMLPrague 2015 paper) and at how best to make use of the
facilities in the Java platform to support them. In the course of
this I've been studying Brian Goetz's excellent book Java Concurrency
in Practice, which although dated, is still probably the best text
available on the subject; and the fact that it only takes you up to
Java 6 is an advantage in our case, because we still want Saxon to
run on Java 6.

Reading the book has made me think
again about the venerable design of Saxon's NamePool, which is the
oldest thing in the product where multithreading is relevant. The
NamePool is basically a shared data structure holding QNames.

The design of the NamePool hasn't
changed much over the years. On the whole it works well, but there
are a number of limitations:

Updating the NamePool requires an
exclusive lock, and on occasions there has been heavy contention on
that lock, which reduces the effectiveness of running more threads.

We read the NamePool without
acquiring a lock. All the textbooks say that's bad practice because
of the risk of subtle bugs. We've been using this design for over a
dozen years without a single one of these subtle bugs coming to the
surface, but that doesn't mean they aren't there. It's very hard to
prove that the design is thread-safe, and it might only take an
unusual JVM, or an unusual hardware architecture, or a massively
parallel application, or pathological data (such as a local name
that appears in hundreds of namespaces) for a bug to suddenly
appear: which would be a serious embarrassment.

The fact that we read the NamePool
with no lock means that the data structure itself is very
conservative. We use a fixed number of hash buckets (1024), and a
chain of names within each bucket. We only ever append to the end of
such a chain. If the vocabulary is large, the chains can become
long, and searching then takes time proportional to the length of
the chains. Any attempt to change the number of buckets on the fly
is out of the question so long as we have non-locking readers. So
performance degrades with large vocabularies.

We've got a problem coming up with
XSLT 3.0 packages. We want packages to be independently compiled,
and we want them to be distributed. That means we can't bind names
to fingerprints during package construction, because the package
will have to run with different namepools at run-time. We can
probably solve this problem by doing the binding of names at package
"load" or "link" time; but it's a change of
approach and that invites a rethink about how the NamePool works.

Although the NamePool design hasn't
changed much over the years, we've changed the way it is used: which
essentially means, we use it less. Some while ago we stopped using
the NamePool for things such as variable names and function names: it
is now used only for element names, attribute names, and type names.
Around Saxon 9.4 we changed the Receiver interface (Saxon's
ubiquitous interface for passing push-mode events down a pipeline) so
that element and attribute names were represented by a NodeName
object instead of an integer fingerprint. The NodeName can hold a
name either in string form or as an integer code, or both, so this
change meant that we didn't have to allocate a NamePool fingerprint
just to pass events down this pipeline, which in turn meant that we
didn't have to allocate fingerprints to constructed elements and
attributes that were going to be immediately serialized. We also
stopped using the NamePool to allocate codes to (prefix, uri) pairs
representing namespace bindings.

These changes have been pretty
effective and it's a while since we have seen a workload suffer from
NamePool contention. However, we want to increase the level of
parallelism that Saxon can support, and the NamePool remains a
potential pinch point.

There are a number of things we can do
that would make a difference. We could for example use a Java
ReadWriteLock to allow either a single writer or multiple readers; this would allow
us to introduce operations such as reconfiguring the hash table as
the size of the vocabulary increases, without increasing contention
because of the high frequency of read access.

But let's first try and remind
ourselves what the NamePool is actually for. It is there, first and
foremost, to allow fast run-time testing of whether a particular node satisfies a
NameTest. Because we use the same NamePool when constructing a source
tree and when compiling a stylesheet or query, the node on the tree
and the NameTest in the compiled code both know the integer fingerprint of
the name, and testing the node against the NameTest therefore reduces
to a fast integer comparison. This is undoubtedly far faster than
doing a string comparison, especially one involving long URIs.

If that was the only thing we used the
NamePool for, then we would only need a single method,
allocateFingerprint(namespace, localName). What are all the other
methods there for?

Well, firstly, the NamePool knows the
mapping of fingerprints to names, so we have read-only get methods to
get the fingerprint corresponding to a name, or the name
corresponding to a fingerprint. These are a convenience, but it seems
that they are not essential. The client code that calls the NamePool
to allocate a fingerprint to a name could retain the mapping
somewhere else, so that there is no need to go back to a shared
NamePool to rediscover what is already known.

The most obvious case where this
happens is with the TinyTree. The TinyTree holds the names of
elements and attributes as fingerprints, not as strings, so
operations like the XPath local-name() and namespace-uri() functions
get the fingerprint from the TinyTree and then call on the NamePool
to translate this back to a string. We could avoid this by keeping a
map from integers to strings within the TinyTree itself. This could
potentially have other benefits: we could make fewer calls on the
NamePool to allocate fingerprints during tree construction; and
retargeting a TinyTree to work with a different NamePool would be
easier.

Secondly, there's a lot of code in the
NamePool to manage prefixes. This isn't needed for the core function
of matching a node against a NameTest, since that operation ignores
namespace prefixes. The detail here is that when we call
NamePool.allocate(), we actually supply prefix, uri, and local-name,
and we get back a 32-bit nameCode which uniquely represents this
triple; the bottom 20 bits uniquely represent the local-name/uri
pair, and it is these 20 bits (called the fingerprint) that are used
in QName comparisons. The purpose of this exercise has nothing to do
with making name comparisons faster; rather it is mainly concerned
with saving space in the TinyTree. By packing the prefix information
into the same integer as the local-name and URI, we save a few useful
bits. But there are other ways of doing this without involving the
NamePool; we could use the same few bits to index into a table of
prefixes that is local to the TinyTree itself. There are of course a
few complications; one of the benefits of the NamePool knowing about
prefixes is that it can provide a service of suggesting a prefix to
use with a given URI when the system is required to invent one: users
like it when the prefix that emerges is one that has previously been
associated with that URI by a human being. But there are probably
less expensive ways of achieving this.

Let's suppose that we reduced the
functionality of the NamePool to a single method, allocate(QName) →
int. How would we then implement it to minimize contention? A simple
and safe implementation might be

HashMap<QName, Integer> map;
int next = 0;

public synchronized int allocate(QName q) {
    Integer n = map.get(q);
    if (n == null) {
        int m = ++next;
        map.put(q, m);
        return m;
    } else {
        return n;
    }
}

This still serializes all allocate
operations, whether or not a new fingerprint is allocated. We can
almost certainly do better by taking advantage of Java's concurrent
collection classes, though it's not immediately obvious what the best
way of doing it is. But in any case, if we can achieve this then
we've reduced the NamePool to something much simpler than it is
today, so optimization becomes a lot easier. It's worth noting that
the above implementation still gives us the possibility to discover
the fingerprint for a known QName, but not to (efficiently) get the
QName for a known fingerprint.
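One plausible refinement, sketched here under the same assumptions (illustrative code, not Saxon's), replaces the synchronized method with a ConcurrentHashMap, so that reads never block and writers contend only when they race on the same name:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: lock-free reads via ConcurrentHashMap. If two threads race to
// allocate the same name, both compute a candidate fingerprint and
// putIfAbsent picks a single winner; the loser's candidate is wasted,
// which costs nothing but a gap in the fingerprint sequence.
final class SimplePool {
    private final ConcurrentHashMap<String, Integer> map =
            new ConcurrentHashMap<String, Integer>();
    private final AtomicInteger next = new AtomicInteger();

    int allocate(String qName) {
        Integer n = map.get(qName);            // fast path: no locking at all
        if (n != null) {
            return n;
        }
        int candidate = next.incrementAndGet();
        Integer previous = map.putIfAbsent(qName, candidate);
        return previous != null ? previous : candidate;
    }
}
```

Like the synchronized version, this still supports looking up the fingerprint for a known QName, but not (efficiently) the reverse.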

To get here, we need to start doing two
things:

(a) get prefixes out of the NamePool,
and handle them some other way.

(b) stop using the NamePool to discover
the name associated with a known fingerprint.

As I explained in my previous post, I've been re-examining the way functions work in Saxon. In particular, over the last week or two, I've been changing the way system functions (such as fn:string-length) work. There's a terrific amount of detail and complexity here, but I thought it might be interesting to take one simple function (fn:string-length) as an example, to see where the complexity comes from and how it can be reduced.

At first sight, fn:string-length looks pretty simple. How long is a (piece of) string? Just ask Java to find out: surely it should just map to a simple call on java.lang.String.length(). Well, no actually.

If we look to the specification, there are two complications we have to deal with. Firstly we are counting the number of Unicode characters, not (as Java does) the number of 16-bit UTF16 codepoints. In the case of surrogate pairs, one character occupies two codepoints, and that means that a naïve implementation of string-length() takes time proportional to the length of the string.
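In Java terms, the distinction is between String.length(), which counts UTF-16 code units, and String.codePointCount(), which counts Unicode characters:

```java
// 'A', then MATHEMATICAL DOUBLE-STRUCK CAPITAL A (U+1D538), which needs
// a surrogate pair in UTF-16, then '!'.
String s = "A\uD835\uDD38!";
int utf16Units = s.length();                        // 4: UTF-16 code units
int characters = s.codePointCount(0, s.length());   // 3: Unicode characters
```

It is the codePointCount-style answer (3) that string-length() must return, and computing it means examining the string for surrogates rather than just reading a stored length field.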

Secondly, there are two forms of the string-length() function. With zero arguments, it's defined to mean string-length(string(.)). That's different from nearly all other functions that have 0-argument and 1-argument forms, where (for example) name() means name(.). Saxon handles functions like name() by converting them statically to name(.), and that conversion doesn't work in this case. To illustrate the difference, consider an attribute code="003", defined in the schema as an xs:integer. The function call string-length(@code) returns 1 (it atomizes the attribute to produce an integer, converts the integer to the string "3", and then returns the length of this string). But @code!string-length() returns 3 - the length of the string value of the attribute node.

The other complexity applies specifically to string-length#0 (that is, the zero-argument form). Dynamic calls to context-dependent functions bind the context at the point where the function is created, not where it is called. Consider, for example, a stylesheet that, three hundred times over, binds a variable $f := string-length#0 while the context item is a one-digit integer, and then evaluates $f() with some quite different context item.

This will print the value "1" three hundred times. In each case the context item at the point where $f is bound is a one-digit integer, so $f() returns the length of that integer, which is always one. The context item at the point where $f() is evaluated is irrelevant.

Now let's take a look at the Saxon implementation. There's a Java class StringLength which in Saxon 9.6 is about 200 lines of code (including blank lines, comments, etc), and this does most of the work. But not all: in the end all it does is to call StringValue.getStringLength(), which is what really does the work. Atomic values of type xs:string are represented in Saxon by an instance of the class StringValue, which encapsulates a Java CharSequence: often, but not always, a String. The reason for the encapsulating class is to provide type safety on methods like Function.call() which returns a Sequence; StringValue implements AtomicValue which implements Item which implements Sequence, so the XDM data model is faithfully represented in the Java implementation classes.

In addition there's a class StringLengthCompiler which generates a bytecode implementation of the string-length function. This is another 60 or so lines.

Some functions also have a separate streaming implementation to accept streamed input, and one or two (string-join() and concat(), for example) have an implementation designed to produce streamed output. That's designed to ensure that an instruction such as xsl:value-of, which compiles down to a call on string-join() internally, doesn't actually assemble the whole output in memory, but rather writes each part of the result string to the output stream as it becomes available.

Since the introduction of dynamic function calls, many system functions have two separate implementations, one for static calls and one for dynamic calls. That's the case for string-length: the evaluateItem() method used for static calls is almost identical to the call() method used for dynamic calls. One reason this happened was a fear of the performance regression that might occur if the existing code for static calls were generalized, so a parallel path was introduced instead.

In 9.6, the implementation of dynamic calls to context-dependent functions like string-length#0 is rather fudged. In fact, the expression string-length#0 compiles into a call on function-lookup(xs:QName("fn:string-length"), 0). The implementation of function-lookup() keeps a copy of both the static and dynamic context at the point where it is called, and this is then used when evaluating the resulting function. This is vastly more expensive than it needs to be: for functions like string-length#0, where there are no arguments other than the context, the function can actually be pre-evaluated at the point of creation. In the new 9.7 implementation, the result of the expression string-length#0 is a function implemented by the class ConstantFunction, which encapsulates its result and returns this result when it is called. (It's not quite as simple as this, because the constant function also has to remember its name and arity, just in case the user asks.)
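
The idea behind ConstantFunction can be sketched as follows (hypothetical names and simplified types, not the real Saxon class):

```java
// Sketch: a function item whose result was pre-evaluated at the point where the
// function was created. Calling it just returns the remembered value; the name
// and arity are retained only in case the user asks for them.
final class ConstantFunction {
    private final Object result;
    private final String name;
    private final int arity;

    ConstantFunction(Object result, String name, int arity) {
        this.result = result;
        this.name = name;
        this.arity = arity;
    }

    Object call() { return result; }       // no context needed at call time
    String getName() { return name; }
    int getArity() { return arity; }
}
```

So evaluating string-length#0 while the context item is "12345" would construct something like new ConstantFunction(5, "string-length", 0); however often, and under whatever context, the function is later called, it returns 5.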

The method StringValue.getStringLength() attempts to recognize cases where walking through the codepoints of the string to look for surrogate pairs is not actually necessary. In previous releases there was an extra bit kept in StringValue, set when the string was known to contain no surrogate pairs: so having walked the string once, it would never be done again. In 9.6 this mechanism is replaced with a different approach: Saxon includes several implementations of CharSequence that maintain the value as an array of fixed-size integers (8-bit, 16-bit, or 32-bit, as necessary). If the CharSequence within a StringValue is one of these classes (known collectively as UnicodeString), then the length of the string is simply the length of the array. And the first time getStringLength() is called on a string, the value is converted to this form and retained, in the hope that future operations on the string will benefit. Of course, this will in some cases be counter-productive (and there's a further refinement in the implementation, which I won't go into, that's designed to overcome this).
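
The principle behind those fixed-width representations can be illustrated with a sketch (an invented method, much simpler than the real UnicodeString classes):

```java
// Sketch: expand a string into an array of whole codepoints, choosing the
// narrowest width that fits. Once in this form, the XPath string length is
// just the array length, with no need to scan for surrogate pairs.
static Object toUnicodeForm(String s) {
    int[] codepoints = s.codePoints().toArray();
    int max = 0;
    for (int cp : codepoints) {
        max = Math.max(max, cp);
    }
    if (max < 1 << 8) {               // Latin-1: one byte per codepoint
        byte[] bytes = new byte[codepoints.length];
        for (int i = 0; i < codepoints.length; i++) bytes[i] = (byte) codepoints[i];
        return bytes;
    } else if (max < 1 << 16) {       // BMP only: two bytes per codepoint
        char[] chars = new char[codepoints.length];
        for (int i = 0; i < codepoints.length; i++) chars[i] = (char) codepoints[i];
        return chars;
    } else {
        return codepoints;            // full 32-bit codepoints needed
    }
}
```

The trade-off mentioned above is visible here: the conversion itself costs a full scan and a copy, which only pays off if the string is measured or indexed again later.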

There are a few other optimizations in the implementation of string-length() that are worth mentioning. Firstly, it's quite common for users to write

<xsl:if test="string-length($x) != 0">

Here we don't need to count surrogate pairs in the string: the string is zero-length if and only if the underlying CharSequence is zero-length. Saxon therefore does a static rewrite of such an expression to boolean(string($x)). (If $x is statically known to be a string, the call string($x) will then be further rewritten as $x.)

If string-length#1 is applied to a value that can be computed statically, then the string-length function is itself computed statically. (This optimization, for odd historical reasons, is often called "constant folding". It's possible only when there are no context dependencies.)

During type-checking, the implementation of string-length#0 keeps a note of whether a context item is known to exist. This is used during bytecode generation: if it's known that the context item won't be absent, then there is no need to generate code to check for this error condition. It's through tiny optimizations like this that generated bytecode ends up being faster than interpreted code.

In my current exercise refactoring the implementation of system functions such as string-length, I've been looking at how much of the logic is duplicated either across the different implementations of a single function (streamed and unstreamed, static and dynamic, bytecode and interpreted) or across the implementations of functions that have a lot in common (such as string(), string-length(), and normalize-space()). I've found that, with the exception of the core code in StringValue.getStringLength and the optimization of string-length()=0, everything else can be vastly reduced. In place of the original StringLength class, there are now two (inner) classes, StringLength_0 and StringLength_1, each of which consists of a single one-line method. The code for generating bytecode can also be considerably simplified by achieving more reuse across different functions.

The essence of the reorganization is that the class StringLength (or rather, its two variants) is no longer an Expression; it is now a Function. Previously a call on string-length($x) compiled to an expression, held as a node on the expression tree. Now it compiles into two objects: a StringLength object, which is a pure function, and a SystemFunctionCall object, which is an expression that calls the function. The SystemFunctionCall object is generic across all functions, while the implementations of SystemFunction contain all the code that is specific to one function. This change was motivated primarily by the need to handle dynamic function calls (and hence first-class function objects) properly, but it has provided a stimulus for a refactoring that achieves much more than this.

So, how long is a piece of string? At least we now know how to work it out more efficiently. Sorry this little yarn wasn't shorter.

XSLT and XQuery, in their 3.0
incarnations, have become fully functional programming languages,
where functions are first class values that can be manipulated in the
same way as other values, for example they can be bound to variables
and passed as arguments to other functions. So you would expect that
functions play a pretty central role in the Saxon implementation. In
this article I shall review how functions work within Saxon, and how
I think this needs to change.

One of the things that happens when
features are added to a complex piece of software is that they tend
to be added around the edges, rather than in the core. You can tell
when something was added by how central it is to the class hierarchy.
It shouldn't be that way, but there are good reasons why it happens.
And when we look at how functions work in Saxon, that's what we see.
For example, if we had always had dynamic function calls, then we
would probably find that static function calls (where the function to
be called is known at compile time) were treated simply as a special
case of dynamic calls, much as ($x+1) is treated as a special case of
($x+$y). But because they came along later, dynamic calls are
actually handled very differently.

I've been doing work recently on the
design of Saxon's expression tree (the data structure produced by the
compilation phase for use by the execution phase of stylesheet and
query evaluation). One aspect of that has been working out how to
write the expression tree to persistent storage, and then load it
back again for execution. In doing this, I've been struck by the
sheer number of different classes that are somehow related to
functions, and by the lack of coherence between them.

The first thing we notice, which is in
itself a bit smelly, is that there is no class called Function. In
fact, for many functions, there is no object in Saxon that represents
the function itself. For example, with system functions (recall that
in XSLT 1.0 and XPath 1.0 that was the only kind there was), we have
two things: a data table containing general information about
functions, such as their type signatures and context dependencies,
and a class SystemFunctionCall which actually represents a call to
the function, not the function itself. The implementation of
functions such as name() or substring() is in a class called Name or
Substring which is a subclass of SystemFunctionCall. This in turn is
a subclass of Expression, and as such it forms a node in the
expression tree.

This works fine for static function
calls, but what happens to an expression such as substring#3 or
true#0 or name#0? Answer: different things in each of these three
cases.

For substring#3, this is a pure
function: there are no context dependencies. We create something
called a SystemFunctionItem, which is an Item (and therefore a
Sequence), we wrap this inside a Literal, and we put the Literal on
the expression tree. This is much the same as the way we handle a
constant such as "London" or 39. Internally, the
SystemFunctionItem contains a reference to an instance of the
Substring class, which is where things get a bit peculiar, because
Substring is designed to be a function call, not a function. In fact,
Substring has a dual personality, it tries to act in both roles
depending which methods you call. That's a hack, and like all hacks,
it comes back to bite you.

For true#0 there's a further bit of
hackery, because the function call true() doesn't actually generate a
SystemFunctionCall, it generates a BooleanValue: it's treated as a
constant, just like "London" or 39. But we have to support
dynamic calls on true(), so we introduced a SystemFunctionCall called
True to handle this case, even though it doesn't have the dual
personality: it acts only as a function, never as a function call.

The name#0 function is different again,
because it encapsulates context. It's defined to return the name of
the node that was the context node at the point where the name#0
expression was evaluated. So this should remind us that name#0 is not
a value, it is an expression: the function it represents has no
static existence, it can only be created dynamically by evaluating
the expression with a particular dynamic context. We solve this with
another hack: for context-dependent function "literals"
like name#0 or position#0, the compiler actually generates a call on
function-lookup("name", 0), which is an expression rather
than a value, and which has to be evaluated at run-time.

The SystemFunctionItem class implements
an internal interface called FunctionItem, and as one might expect, a
FunctionItem is an Item and therefore a Sequence: that is, it's a
run-time value. Other subclasses of FunctionItem are used for calls
to user defined functions, Java extension functions, or constructor
functions. But they are only ever used for dynamic calls, never for
static calls.

Although there is no Function class,
some functions are in fact represented as objects in their own right.
The most important is UserFunction, which represents a user-written
function in XSLT or XQuery. Another is InlineFunction (representing
an anonymous inline function), and another is
ExtensionFunctionDefinition, which represents a function whose
implementation is provided in Java code: this is used both for
user-written extension functions, and for many (but not all) vendor
supplied extension functions in the Saxon namespace. But these
classes have nothing in common with each other; there is no common
superclass. This has the consequence that not only is there one set
of machinery for static function calls and another quite different
set for dynamic calls, but that in each case, there are many
different flavours depending on what kind of function you happen to
be calling.

Getting this right involves a cleaner
data model. As always, a clean data model leads to cleanly structured
code, and the data model should be designed to accurately reflect the
specification, not for the convenience of the implementation. The
specification says we have two objects of interest, a function, and a
function call. Quite rightly, the spec no longer uses the term
"function item" as something distinct from the function: a
function is an item. There's also something called the "function
implementation" which we should recognize, because two functions
may share an implementation. For example, the expression name#0
returns a different function each time it is evaluated: these
functions differ in the data they hold (a snapshot of the dynamic
context), but they share the same implementation.

We should be trying to move towards a
structure where we have a hierarchy of subclasses of Function (for
system functions, constructor functions, user functions, Java
extension functions, etc); where the Function is an Item; where a
Function may reference a FunctionImplementation to provide for
sharing; and where a FunctionCall is as far as possible independent
of what kind of function it is calling, but simply has two flavours
for static function calls and dynamic function calls.
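
That target structure could be sketched like this (hypothetical interfaces and names, far simpler than anything that would appear in Saxon itself):

```java
// Sketch: a Function is an Item (a run-time value). Two functions produced by
// evaluating name#0 at different context nodes share one implementation but
// hold different context snapshots.
interface Item { }

interface FunctionImplementation {
    Object invoke(Object contextSnapshot, Object[] args);
}

final class Function implements Item {
    private final FunctionImplementation impl;   // shared between functions
    private final Object contextSnapshot;        // per-evaluation data, may be null

    Function(FunctionImplementation impl, Object contextSnapshot) {
        this.impl = impl;
        this.contextSnapshot = contextSnapshot;
    }

    Object call(Object... args) {
        return impl.invoke(contextSnapshot, args);
    }
}
```

With this shape, a static function call and a dynamic function call differ only in when the Function object is identified, not in what they end up calling.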

I've been embarking on an exercise to redesign the Saxon expression tree. It's got a number of problems that it would be nice to fix; and there's major work ahead in being able to save and restore the tree as part of a compiled XSLT 3.0 package, so it would be nice to get the structure into better shape first.

One of the problems is that there are 150-odd different implementation classes for nodes on the expression tree, and that's before you start counting individual function calls; and there's far too much duplication between these classes. Implementing a new kind of expression involves far too much code. In addition, there's far too much scope to get this code wrong, leading to obscure bugs when a new kind of expression gets involved in some particular optimization rewrite, such as function or variable inlining. There's a steady flow of bugs, perhaps a couple a month, caused by tree corruptions of one kind or another, and it would be nice to make the whole structure more robust.

One thing I want to do is to be more systematic about the way in which static context information is held on the tree. At present the principle is that each expression saves that part of the static context that it thinks it might need. For example, some expressions need run-time access to the static base URI, some to the static namespace context, and some to the register of collation names. (This last case has been simplified in 9.6 because all collation names are now global to a Configuration, though the default collation can still vary from one expression to another.) A recent shock discovery is that in XPath 3.0, general comparisons (that is, the humble "=" operator) can depend on the namespace context: if one operand turns out to be untyped atomic and the other is a QName, then the untyped atomic value needs to be converted to a QName using the in-scope namespaces. This means the namespace context needs to be saved with every general comparison expression, which is heavy stuff. Given that the static context is almost always the same for all expressions in a module, it would be much better, rather than saving what each expression thinks it might need, to save the changes to the static context on the tree, so that each expression can discover the whole static context by processing a set of deltas held with its ancestors in the tree.
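
The delta idea might be sketched like this (an illustrative fragment with just two context components; the real static context has many more):

```java
// Sketch: each tree node holds only the parts of the static context that change
// at that node; a null field means "inherit from the nearest ancestor that sets it".
class ContextDelta {
    ContextDelta parent;          // ancestor delta in the expression tree
    String staticBaseUri;         // null = inherited
    String defaultCollation;      // null = inherited

    String getStaticBaseUri() {
        return staticBaseUri != null ? staticBaseUri
                : parent != null ? parent.getStaticBaseUri() : null;
    }

    String getDefaultCollation() {
        return defaultCollation != null ? defaultCollation
                : parent != null ? parent.getDefaultCollation() : null;
    }
}
```

A module-level node would set everything once; an individual expression would carry a delta only in the rare case where, say, its default collation differs from the module's.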

This leads to another point, which is that it would be nice to have parent pointers in the expression tree. At present you can navigate from a node to its children (that is, its subexpressions), but not in the other direction. Saxon gets around this by keeping a dynamic stack of expressions visited in some of its operations on the tree (such as type-checking and optimization) so the ancestor expressions are maintained dynamically during the recursive tree-walk rather than being maintained statically on the tree.

In 9.6, mainly to support XSLT 3.0 streamability analysis, we introduced a generalized mechanism for obtaining the subexpressions of any node: the operands() method. This returns a set of Operand objects, each of which contains a pointer to the child expression itself, together with properties, in effect, of the parent-child expression relationship, such as the XSLT 3.0 posture and sweep properties, and the operand usage, used in the 3.0 general streamability rules. This mechanism has proved very successful (and not just for streaming) in enabling more generalized operations on the tree. But a limitation is that modifications to the tree, such as substituting one child expression for another (which is very common during type-checking and optimization), are still entirely ad-hoc, and have to be managed independently by each class of expression node.

As a first step in redesigning the tree for 9.7, I have extended the way in which we use Operand objects. As well as being used to navigate to subexpressions, they are also now used to modify subexpressions: the Operand object has a method setChildExpression() which can be used to replace the existing child expression with another. All structural changes to the tree are required to go via this method, which is enforced by encapsulating the reference to the child expression within the Operand object. The Operand also holds a reference to its "owner" expression, so when a child expression is changed, the single setChildExpression() method can take responsibility for housekeeping such as making sure the child expression has location information for use in error reporting, and making sure that expression properties cached on the parent node (such as the inferred type) are invalidated and recomputed when the children of the expression change.
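
In outline, the arrangement might look like this (a simplified sketch with invented field names, not Saxon's real classes):

```java
// Sketch: all structural changes to the tree go through Operand.setChildExpression(),
// which keeps the housekeeping (location info, cached-property invalidation) in one place.
class Expression {
    int locationId = -1;          // for error reporting; -1 = not yet assigned
    Object cachedItemType;        // e.g. the inferred type, computed lazily

    void resetCachedProperties() { cachedItemType = null; }
}

final class Operand {
    private final Expression owner;   // the parent expression
    private Expression child;         // encapsulated: no direct external access

    Operand(Expression owner, Expression child) {
        this.owner = owner;
        setChildExpression(child);
    }

    Expression getChildExpression() { return child; }

    void setChildExpression(Expression newChild) {
        child = newChild;
        if (newChild.locationId < 0) {
            newChild.locationId = owner.locationId;  // inherit location information
        }
        owner.resetCachedProperties();  // cached inferred type must be recomputed
    }
}
```

Because the child reference is private to the Operand, a rewrite cannot splice in a new subexpression while forgetting the housekeeping: the one method that can change the tree does it all.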

This process is complicated by the fact that the nodes on the expression tree are highly diverse, in fact they don't even all represent expressions. The tree also has to cater, for example, for XSLT patterns and XQuery FLWOR expression clauses.

Making updates go through the Operand object enables many expressions to inherit a generic implementation of common rewrite methods such as typeCheck and optimize. For example, the default action of optimize is to call optimize() on each subexpression, and if any changes have occurred, replace the subexpression with its rewritten self. The redesign means that this "replace" operation can now be done in a generic way, meaning that the default optimize() method does not need to be tailor-made for each class of expression. The same is true of other methods such as promote().

I was hoping that channelling all updates through Operand.setChildExpression() would also make it easy to maintain parent pointers in the tree. Unfortunately this is not the case. It's easy enough, when B is set as a child of A, for B's parent pointer to be updated to point to A. The problem arises when B's parent pointer was previously set to C: what happens to C's children? Can B be a child of A and C simultaneously (making it not a tree at all)? It turns out that some rewrites on the tree involve creating a new structure over existing leaf nodes in the tree, which might then be discarded if not all the conditions for optimization are met. So we've updated the parent pointers in the leaf nodes to point to this new superstructure, which we then discarded, reverting to the original. It's difficult then to make sure that the parent pointers are reset properly when the rewrite is abandoned. It can be done in an ad-hoc way, of course, but we're looking for something more robust: an API for tree rewriting that doesn't allow the tree to become inconsistent. This is proving hard to achieve. We may have to resort to a different approach, doing a bulk sweep of the tree to set all parent pointers before the typecheck and optimize operations on each template or function. This is intellectually unsatisfactory because it means accepting that the tree could be temporarily inconsistent in the middle of such an operation, but it may be the best option available.

This is work in progress, but it's looking promising. I appreciate that for most users it's about as interesting as the repair works to the sewers beneath your street. Hopefully, though, the bottom line will be fewer product bugs, and more ability to take the product forward into new areas.

For ever and a day, Saxon has tried to ensure that when several transformations are run using the same stylesheet and the same source document(s), any indexes built for those documents are reused across transformations. This has always required some careful juggling of Java weak references to ensure that the indexes are dropped from memory as soon as either the executable stylesheet or the source document are no longer needed.

I've now spotted a flaw in this design. It wasn't reported by a user, and it didn't arise from a test case, it simply occurred to me as a theoretical possibility, and I have now written a test case that shows it actually happens. The flaw is this: if the definition of the key includes a reference to a global variable or a stylesheet parameter, then the content of the index depends on the values of global variables, and these are potentially different in different transformations using the same stylesheet.

Fixing this is not easy, and poses some interesting dilemmas, given that no-one has hit the problem in a dozen years and it's quite likely that no-one ever will. It would be nice to roll back my discovery of the problem and pretend that it doesn't exist, but that's not my style.

One solution would be to abandon the strategy of reusing indexes. It's likely that very few users have workloads that benefit from this strategy. On the other hand, those that do could see a big performance hit. Removing a powerful optimization because of a problem that doesn't apply to those workloads is not very friendly.

The right solution is to identify those key definitions that allow indexes to be shared, distinguish them from those whose indexes can't be shared, and then manage the two kinds of index differently. Both the identification and the management can be done, but they introduce more complexity and hence more risk of bugs, and the whole thing feels like overkill given that we don't know of any users who have the problem.

Civil engineers design structures to withstand storms that are likely to happen more than once in N years. It would be nice if we could adopt a similar approach here. But software doesn't seem to be like that; it seems to follow Murphy's law, which states that if something can go wrong, it will.

So with heavy heart, I think that there's no alternative to "doing the right thing", with all the complexity it entails.

In Saxon 9.5, a new regular expression
engine was introduced. This was obviously a risky thing to do, but
I'm not averse to such risks, and although there is often some
short-term inconvenience to users, I believe such things usually
bring long-term benefits. The risks relate to both functionality and
performance. In both cases we were able to minimize the risks thanks
to a good collection of test material, but we always knew that
contact with real life problems might throw up things that the test
suite didn't.

Why did we need the new engine? There
were several factors. On Saxon-CE, the GWT technology didn't offer
access to the JDK regex library, only to the Javascript regex engine,
which has different syntax and semantics, and no support for non-BMP
Unicode characters. We could have translated XPath regexes to
Javascript regexes the way we were doing for the JDK, but the
expansions to handle non-BMP characters are very cumbersome. (We
could also have abandoned W3C regex conformance, but that's not my
style.) For "server-side Saxon", I had been impatient with
the existing design for some time. There are bugs I reported in the
JDK regex engine 5 years ago which Sun/Oracle have shown no
inclination to fix; the approach of translating one regex dialect to
another is slow and clunky; and the JDK wasn't keeping pace with new
Unicode versions. I felt it was time we brought the regex engine
in-house.

I looked around for an open source
regex engine suitable for forking, and chose the Apache Jakarta
engine. Unlike some other options the code was clear and well
commented, the performance seemed reasonable, and the license was
compatible. I made all the changes needed to support the XPath regex
syntax and semantics, and a few other changes including some
optimizations, and it passed all the tests. In fact, it has proved
very reliable; there have been no bug reports against its
functionality.

But there have been two performance
glitches.

One is the way it handles integer
repetition bounds such as a{3,7}. The algorithm it uses for this is
the same as Saxon uses in its schema processor, which is described in
a paper by Thompson and Tobin [1]; it essentially translates such an
expression to aaaa?a?a?a?, and then generates a finite state machine
accordingly. This becomes problematic when the integer bounds grow
large; the size of the finite state machine
and the time taken to generate it both increase rapidly with size.
(For the technically minded, I'm not sure how rapidly, but even if
it's only linear, it's a significant problem.)
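
The expansion itself is easy to sketch; the cost lies in the states it implies. A toy version of the rewrite (illustrative only, working on regex source text rather than a state machine):

```java
// Sketch of the Thompson/Tobin-style expansion: atom{min,max} becomes
// min mandatory copies of the atom followed by (max - min) optional ones.
// Each copy contributes states to the generated machine, so the machine
// grows with the bounds.
static String expandBounded(String atom, int min, int max) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < min; i++) {
        sb.append(atom);                  // mandatory occurrences
    }
    for (int i = 0; i < max - min; i++) {
        sb.append(atom).append('?');      // optional occurrences
    }
    return sb.toString();
}
```

So a{3,7} becomes aaaa?a?a?a?, exactly as described above; for something like a{3,1000000} the expansion, and hence the state machine, is enormous.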

The second problem is the handling of
backtracking. The Jakarta engine, like most regex engines, relies on
backtracking to handle ambiguous expressions. Backtracking relies on
maintaining a stack so you can go back to where you were. In Jakarta,
this data was kept on the Java program stack; each checkpoint was
represented by a recursive function call. In many cases each
character position in the input string is a potential checkpoint, so
the depth of recursion was essentially equal to the length of the
input string. That puts a limit of around 1K to 2K on the size of the
input, which is not really acceptable.

For the 9.5 release I mitigated this
problem by introducing an optimization: in cases where repetition was
unambiguous, the recursive calls were eliminated. This catches many
cases like A*B where you know that you have to keep reading A's until
you hit a B, and when you do hit a B you know you won't have to try
again from somewhere in the middle of the sequence of A's. But
unfortunately, ambiguous expressions are common (an example would be
.*B), and we have to handle them.

So I did something that I very rarely
do; I studied how the JDK regex engine avoided this problem, by
following its activity through the debugger. I was surprised to see
that the JDK engine doesn't actually build a finite state machine.
Instead it uses a pure interpreter pattern: it builds an abstract
syntax tree representing the grammar of the regex, and then evaluates
the regex by direct interpretation of the constructs on this tree.
The maximum recursive depth with this approach is represented by the
depth of the abstract syntax tree (that is, it depends on the regex
but not on the input string), and the data needed for backtracking is
held on the heap rather than on the Java stack.

Over the last week I have been
rewriting the back end of the (ex-)Jakarta regex engine to use a
similar design. It's passing all the functionality tests; I still
have some performance testing to do, but it's looking good. The
front-end code, that is the regex parser, is largely unchanged, but
it now generates an operation tree rather than an FSM. I retained
many of the features of the existing design, such as the use of a
variety of implementations of integer sets and predicates to support
the different kinds of character classes, but the operations have all
been re-implemented to work interpretively. I departed from the JDK
design by using an approach similar to Saxon's XPath interpreter, in
that the implementation of each operation returns an iterator over
all the strings matched by the relevant expression. The key method
supported by each operation is one that is passed an REMatcher
(containing all necessary information about the match operation,
including the input string and regex flags, plus support for captured
groups and backreferences) and a start position for matching; it
returns an IntIterator representing a sequence of integers which are
the possible positions where a match can end.

The class handling a repeat operation
such as (a|aa)* implements this operation by repeated calls on the
underlying choice operation. The first choice operation will return
an iterator over the two possible end positions (1,2); the next time
round there are four possible positions (1,2,2,3) (there is no
attempt to eliminate duplicates), and so on. The method maintains a
stack of IntIterator objects representing these successive calls;
initially only the first hit from each iterator is used, but when
backtracking is necessary, the stack is popped, and the iterator now
at the top of the stack is called to deliver its next match.
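
A toy version of this scheme (invented classes, much simplified from the real engine) might look like this:

```java
import java.util.*;

// Toy sketch of backtracking via a stack of iterators: each operation, matched
// at a start position, yields an iterator over the positions where the match
// could end; a repeat explores them depth-first, popping the stack to backtrack.
interface Op {
    Iterator<Integer> matches(String input, int start);
}

// The choice (a|aa): from position p it can end at p+1, or at p+2 if two a's follow.
class ChoiceAorAA implements Op {
    public Iterator<Integer> matches(String input, int start) {
        List<Integer> ends = new ArrayList<>();
        if (start < input.length() && input.charAt(start) == 'a') {
            ends.add(start + 1);
            if (start + 1 < input.length() && input.charAt(start + 1) == 'a') {
                ends.add(start + 2);
            }
        }
        return ends.iterator();
    }
}

// (body)*: enumerate every end position reachable by zero or more repetitions,
// duplicates included, exactly as described in the text above.
class Repeat implements Op {
    private final Op body;
    Repeat(Op body) { this.body = body; }

    public Iterator<Integer> matches(String input, int start) {
        List<Integer> results = new ArrayList<>();
        Deque<Iterator<Integer>> stack = new ArrayDeque<>();
        results.add(start);                        // zero repetitions
        stack.push(body.matches(input, start));
        while (!stack.isEmpty()) {
            Iterator<Integer> top = stack.peek();
            if (top.hasNext()) {
                int end = top.next();
                results.add(end);                  // one more repetition ends here
                stack.push(body.matches(input, end));
            } else {
                stack.pop();                       // backtrack to the previous level
            }
        }
        return results.iterator();
    }
}
```

Matching (a|aa)* against "aa" from position 0 yields end positions 0, 1, 2, 2: position 2 is reached twice, once as a followed by a and once as aa, since no attempt is made to eliminate duplicates.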

Saxon continues to take advantage of
the optimization that recognizes unambiguous repetitions. With
patterns such as A*B, or A{5}A, there is no need for backtracking so
the stack of iterators is not used, and only the first match from
each iterator is used.

There is scope for further optimization
here. Some regular expression engines such as PCRE are known to
handle highly-ambiguous expressions much more efficiently than this
crude backtracking approach, mainly by eliminating the duplication
that arises when a backtrack attempt ends up doing the same matches
at the same positions as a previous attempt. Such improvements lie in
the future; for the moment, being as good as the Java engine in
performance, and better in features such as Unicode support and
buglessness, is enough to declare success.

The new code is currently operational
on the Saxon 9.6 development branch. In due course it will be
retrofitted to Saxon-CE, and probably also to a Saxon 9.5 maintenance
release. Probably only 10% of the original Jakarta code remains, but
that's probably enough that I can't get rid of the Apache license
that goes with it, for those who care about these things.

At the XML Summer School 2013, Tony Graham presented a lightning talk about life after libxslt 1.0. I was not present for this summer school, but it was clear from the feedback I received about the discussions that there is a major gap: there is no XSLT 2.0 support in the large developer community of the C/Perl/PHP/Python/Ruby world and the associated tools that rely on libxslt.

It is a known problem, which has never, to my knowledge, been addressed. At Saxonica, we wanted to try and plug this gap by porting the Saxon processor from Java to C/C++, which would enable us to communicate with the languages listed above. One of our goals, if possible, was to interface with libxml and libxslt. Providing such a bridge, or a cross-compiled version of a full-fledged Java application in C/C++, is always a daunting task. In this blog post I discuss the technical steps in our quest to achieve our goals and give some details of the experiences gained along the way. I will begin by detailing the various technologies that we tried, and how we ended up using a commercial Java native compiler after several failed attempts with tools that either did not work, were cumbersome, or were just too error-prone.

LLVM

At the summer school there were discussions suggesting that the tool LLVM could do the job of compiling Java to native code. As claimed on the project website, LLVM is a collection of modular and reusable compiler and toolchain technologies. The LLVM project seems very active, with many projects using it for various tasks, but I found it difficult to get anything working. In particular, I tried using VMKit, which relies on LLVM, to compile some simple 'Hello World' examples to machine code, but even that seemed cumbersome.

GCJ

Secondly, I looked at the GCJ technology. GCJ is a tool that I have used before, so I was confident that it would work. However, my past experience is that the tool can be error-prone and contains long-standing bugs; since the project has been dormant for several years, it seems unlikely that these bugs will ever be fixed. The other worrying fact is that GCJ only supports up to JDK 1.5. Nevertheless, for lack of other options, I persevered with GCJ, and I had much better success: I managed to compile Saxon-HE to native machine code and actually got it to execute my example stylesheets. I had some problems because of classes that were not present in the GCJ implementation of JDK 1.5, such as the packages java.math and javax.xml. Therefore, I had to include my own version of these packages.

The next step was to create a shared library of Saxon-HE, so that I could interface it with C/C++. This proved to be a real battle, but one in which I eventually succeeded. I decided to use the Compiled Native Interface (CNI), which presents a convenient way to write Java native methods in C++. The alternative was JNI (Java Native Interface), which may be viewed as more portable. Both interfaces have similar principles: you need a Java/CNI-aware C++ compiler (any recent version of G++ is capable), and you must include a header file for each Java class you use. These header files, if not generated automatically, can be produced using gcjh. In the end I gave up on GCJ: I stumbled upon a few known bugs, and reasoned that if I was having major issues with the setup and prerequisites, then users would surely have the same problems.

Excelsior JET

The Excelsior JET tool is the final technology we looked at, and thankfully it is what we have ended up using in the alpha release. JET is a commercial product that provides a Java native compiler for both Linux and Windows platforms. What is good about this tool is that it provides an easy-to-use graphical interface for building native executables and shared libraries from jar file(s). It can also package the software into an installer ready to be deployed onto its intended host machine. This was great for us!

There is a lot I could write about JET, but it would be a repeat of the plethora of information currently available on their website and forum. I will just mention that we started with their evaluation version, which offers 90 days' free usage before purchase of the professional edition. Another point of interest is that Excelsior offer a free-of-charge license for use in conjunction with open-source software.

We know that there will be some sections of the open-source community that dislike the dependency on a commercial tool, but it is not that dissimilar from the early years of Java, when the Sun compiler was freely available but not open source.

Implementation notes using JET

After creating the shared library, I used JNI to interface it with C/C++. It is possible to use JET's own Java interface to external functions, called xFunction, which is recommended if starting from scratch, but having already used JNI with GCJ I continued with that approach. To get started there are a few examples available of invoking a library from C/C++. In essence, you need to load the library and initialize the JET run-time before you can use it; see the code below (from the file xsltProcessor.cc):

In the constructor of XsltProcessor we see that once we have loaded the library and initialized the JET run-time, we can make calls on the JNI environment that has been created: obtaining class definitions and creating instances of those classes in the Java world, before making method calls on the resulting objects.
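The shape of that initialisation sequence, using the standard JNI invocation API, is roughly as follows. This is a hedged sketch, not the actual contents of xsltProcessor.cc: the class name and class-path option are illustrative guesses, and building it requires jni.h plus linking against the JET-generated shared library (which exports JNI_CreateJavaVM in place of a full JVM).

```cpp
#include <jni.h>
#include <cstdlib>
#include <iostream>

int main() {
    JavaVM *jvm;
    JNIEnv *env;
    JavaVMInitArgs vmArgs;
    JavaVMOption options[1];

    // Illustrative option only; a JET-built library typically needs no class path.
    options[0].optionString = (char *)"-Djava.class.path=.";
    vmArgs.version = JNI_VERSION_1_6;
    vmArgs.nOptions = 1;
    vmArgs.options = options;
    vmArgs.ignoreUnrecognized = JNI_FALSE;

    // Linking against the JET-generated library means this call starts
    // the JET run-time rather than launching a conventional JVM.
    if (JNI_CreateJavaVM(&jvm, (void **)&env, &vmArgs) != JNI_OK) {
        std::cerr << "Failed to initialise the JET run-time" << std::endl;
        return EXIT_FAILURE;
    }

    // Obtain the class definition and create an instance in the Java world.
    // The class name here is a guess, for illustration only.
    jclass procClass = env->FindClass("net/sf/saxon/option/cpp/XsltProcessor");
    jmethodID ctor = env->GetMethodID(procClass, "<init>", "()V");
    jobject proc = env->NewObject(procClass, ctor);

    // ... method calls on 'proc' via GetMethodID/Call*Method ...

    jvm->DestroyJavaVM();
    return EXIT_SUCCESS;
}
```

The essential point is the ordering: load the library, create the VM/run-time, look up classes, construct objects, and only then invoke methods.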

PHP Extension in C/C++

After successfully getting XSLT transformations to work within C/C++, the next step was to try to develop a PHP extension, which would operate like libxslt. There is a lot of material on the web and in books regarding PHP extensions, and I found the following guide very useful: http://devzone.zend.com/1435/wrapping-c-classes-in-a-php-extension/. I literally followed it step by step, adding a few steps of my own once I had worked out what I was doing.

Testing

As a proof of concept I wrote a test harness in PHP which makes use of the PHP extension (see: xslt30TestSuite.php in the download library). This is a test driver designed to run the public W3C XSLT test suite at https://dvcs.w3.org/hg/xslt30-test/. The test driver in its current form requires Saxon-EE, which is not yet available in this alpha release; nevertheless, the program may serve as a useful example of how the API can be used. Note that it is written to use libXML to read the test catalog, but to use Saxon for running the tests and assessing the results.

Performance Testing

I now draw some preliminary comparisons between running Saxon-HE (on Java) and running Saxon-HE/C from C++ and from PHP. I also compare these times with libxslt (C/C++). An important aim is to get a good measure of the cost of crossing the Java/C++ boundary using JNI, and to see what the effect is with the PHP extension.

I used Saxon-HE 9.5.1.3 as the baseline. The test machine was an Intel Core i5 430M laptop with 4GB memory, a 2.26GHz CPU and 3MB L3 cache, running Ubuntu 13.10 Linux, with Apache2 and PHP version 5.5.3-1ubuntu2. The Java compiler was Sun/Oracle Java 1.6.0.43.

The experiments were based on the XMark benchmark. I used query q8, converted into the stylesheet below. I chose q8 because the equijoins in the query should expose performance bottlenecks across the implementations:

The times for Saxon-HE/C exclude the cost of JET initialisation and loading the library, which accounted for only 4 seconds. We observe that there is not a big overhead between C++ and the PHP extension. The biggest cost, as expected, is in crossing between Java and C++, where we see a slowdown of about 2.2x on the C++/PHP platform. We also observe that Saxon-HE/C outperforms libxslt on C/C++ by about 40% on q8.

I discovered yesterday that Saxon-CE isn't applying xsl:strip-space directives to input documents. An unfortunate bug: not that many users seem too bothered by it, but conformance is always important.

A reminder: xsl:strip-space and xsl:preserve-space are designed to remove "ignorable" whitespace from the application's view. Because XML doesn't distinguish significant from insignificant whitespace (a bad design mistake), this can only be done under program control. The idea is to present the application (the stylesheet) with a view of the input tree in which the insignificant whitespace does not appear, and the xsl:strip-space and xsl:preserve-space directives allow the stylesheet author to say which whitespace is considered significant.
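To make that decision rule concrete, here is a minimal self-contained sketch (not Saxon's code; names are hypothetical, and the full XSLT rules also involve import precedence) of how a processor decides whether a whitespace-only text node is ignorable:

```cpp
#include <cassert>
#include <set>
#include <string>

// Nearest xml:space value in scope for the text node's parent element.
enum class XmlSpace { Default, Preserve };

// Decide whether a whitespace-only text node is ignorable.
// xml:space="preserve" always wins; otherwise xsl:preserve-space beats
// xsl:strip-space for a named element, and "*" in the strip set strips
// whitespace everywhere else. (Simplified: real precedence also depends
// on import precedence and default priority.)
bool ignorable(const std::string& parentName,
               XmlSpace nearestXmlSpace,
               const std::set<std::string>& stripNames,
               const std::set<std::string>& preserveNames) {
    if (nearestXmlSpace == XmlSpace::Preserve) return false;
    if (preserveNames.count(parentName) > 0) return false;
    return stripNames.count(parentName) > 0 || stripNames.count("*") > 0;
}
```

So with strip-space covering "*" and preserve-space naming "pre", whitespace under a pre element survives while whitespace-only text nodes elsewhere disappear from the stylesheet's view of the tree.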

How to fix the bug? Server-side Saxon (what I am starting to call "Big Saxon") traditionally provides two mechanisms: either it wraps the supplied DOM in a wrapping layer that hides the offending whitespace text nodes, or it bulk-copies the supplied DOM to an internal tree structure in which the whitespace is not copied across. Both methods have disadvantages. The "wrapping" approach imposes a significant extra cost on navigation of the tree (the cost is higher than it might appear, because whitespace stripping also has to take account of xml:space attributes in the document, which means a lot of searching for attributes-of-ancestors). The bulk-copy approach imposes a significant start-up cost to the transformation, often larger than the cost of doing the real transformation, and the cost is particularly high when processing large input documents because it doubles the memory requirement.

I'm considering using a third approach: in-situ modification of the supplied DOM. For Big Saxon, I have tended to treat this option as unthinkable: you don't muck with the user's data in this way. The same DOM, after all, might be input to multiple stylesheets, perhaps concurrently (though DOM is not thread-safe so this is a bad idea....) and they might have different space-stripping directives. A Java application might pass the DOM to an XSLT stylesheet to do some processing, but might want to continue operating on the same DOM afterwards.

In Saxon-CE, however, these considerations hardly apply. In the vast majority of cases the DOM is built solely in order to act as XSLT input. Unlike Big Saxon, where one can assume that if users supply a DOM they are doing so by choice, with Saxon-CE there is no alternative: all input XML arrives this way.

In-situ clean-up of the input DOM has other attractions; it can be used to solve the other two big problems with DOM processing (both arising from a mismatch between the DOM model and the XDM model): adjacent text nodes, and namespaces. The same pre-scan that strips unwanted whitespace nodes can also concatenate adjacent text nodes (eliminating entity references and CDATA sections in the process); presenting a view of adjacent text nodes as a single node accounts for a great deal of complexity (and lines of code, and execution time) in navigating a DOM.

Equally for namespaces. To determine the namespace part of the name of an element or attribute, Saxon can't simply call getNamespaceURI() on the underlying DOM node. That method is defined to have no effect if "the node was created using DOM level-1 interfaces". Saxon of course has no idea how the node was created, and DOM provides no way of finding out. So it does an optimistic call of getNamespaceURI() just in case, and if this returns null then it has no choice but to search the node's ancestors and attributes looking for namespace declarations - and since this is done every time you look at a node to see whether it satisfies an axis step, it's no wonder that Saxon can be ten times slower processing a DOM than when processing its native tree structure. We should be able to cut this cost dramatically by doing a pre-scan of the document and calling setNamespaceURI() on all element and attribute nodes.
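As a sketch of what the whitespace and text-node half of such a pre-scan might look like (this is not Saxon's code; the Node structure is a deliberately minimal stand-in for the DOM, and the per-element strip-space name tests are reduced to a single xml:space-style flag for brevity), both clean-ups fit in one recursive walk:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in for a DOM node: either a text run or an element
// with children. Not the W3C DOM API.
struct Node {
    bool isText = false;
    std::string text;            // used when isText
    std::string name;            // used when element
    bool preserveSpace = false;  // models xml:space="preserve"
    std::vector<std::unique_ptr<Node>> children;
};

static bool whitespaceOnly(const std::string& s) {
    return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Single in-situ pre-scan: drop ignorable whitespace-only text nodes and
// concatenate adjacent text nodes, the two clean-ups described above.
void prescan(Node& e, bool preserve) {
    preserve = preserve || e.preserveSpace;
    std::vector<std::unique_ptr<Node>> merged;
    for (auto& child : e.children) {
        if (child->isText) {
            if (!preserve && whitespaceOnly(child->text)) continue;  // strip
            if (!merged.empty() && merged.back()->isText) {
                merged.back()->text += child->text;                  // coalesce
                continue;
            }
        } else {
            prescan(*child, preserve);
        }
        merged.push_back(std::move(child));
    }
    e.children = std::move(merged);
}
```

A real implementation would additionally consult the stylesheet's strip-space/preserve-space element lists, and (as suggested above) could fix up namespace URIs in the same pass.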

The more I think about this, the more appealing the idea of doing the same thing on Big Saxon. The fact is, most people who supply a DOM as input to XSLT do so out of ignorance; they have no idea of the extra costs they are incurring, or of the fact that supplying SAX input would be ten times faster. It's important to deliver good performance not only to experts who know how to get the ultimate performance out of the system, but also to the average user who just copies JAXP code from somewhere on the web without any great thought or understanding. For that kind of user, in-situ modification of the supplied DOM might be the right answer. We can always continue to provide wrapping or bulk-copying as alternative configuration options, for people to invoke if they don't like their data being trampled upon.

Returning to Saxon-CE, the other complication is the HTML DOM. Running XPath against an HTML DOM is already sufficiently quirky (because of case-insensitive element names, namespace non-support, and so on) that it's probably wisest to keep things as simple as possible on this front. I don't think we need to apply xsl:strip-space directives to the HTML document; there's nothing in the spec that would make this a conformance issue, and there's no usability imperative. So we'll leave well alone here.