Unfortunately, one of the XQuery tests within the XQTS was causing eXist-db to exhaust all available memory and crash the JVM with the dreaded java.lang.OutOfMemoryError (#1971). The test in question cbcl-subsequence-011 executes the following unassuming little XQuery:

count(subsequence(1 to 3000000000, -2147483649))

To understand why this was using all of the available memory, I did a little bit of digging to understand how eXist-db evaluates the query. There are three main steps in evaluating the query: (1) a range expression, (2) a subsequence function, and (3) a count function. Ultimately, we will discover that it is the interaction of the first two of these steps that are most important to understand when devising a solution.

The Range Expression

The range expression1 to 3000000000 is implemented by org.exist.xquery.RangeSequence. Very sensibly it is lazily evaluated, which means that we don't have to realise all 3000000000 integers which would otherwise cost us ~11.18GB of RAM; it just holds the start and end of the sequence, and provides an iterator for moving forward through the sequence. Something like:

In the code above each call to iterator.nextItem() is going to call RangeSequenceIterator#nextItem() which is going to construct a new IntegerValue object and add a reference to it to the tmp value sequence.

The exact memory required for each IntegerValue object in eXist-db is dependent on the length of the integer that it has to represent. For numbers between 1 and 3000000000, the memory use is between 80 and 84 bytes per object. So if we call nextItem 3000000000 times and keep a reference, then the memory required is between 80 * 3000000000 and 84 * 3000000000, i.e.: between 223.52 GB and 234.69 GB! My MacBook Pro which I used for testing only has 16GB of RAM, and so hence the OutOfMemoryError.

So what can we do about this?

Actually, when considering just the subsequence function, there appears to be very little we can do on the surface. We could perhaps detect the amount of memory available to the JVM and impose a limit on the number of items that can be realized by the subsequence function, raising an error to the user if that is exceeded, rather than running out of memory and crashing. I personally consider that to be an option of last resort, and not a satisfactory solution for general use.

If, however, we consider the larger picture of how the subsequence function is used, it reveals two use-cases:

As the final expression of an XQuery, where subsequence is used to limit the results that are serialized by the query processor and returned to the user.

In eXist-db when a query returns its results to a user, it does so by returning the results one after another in a streaming fashion. The problem is that the subsequence function is collecting all the results in memory, before they are returned to the user one-by-one. But... we don't need to memorize the full subsequence of the range expression, and then return it to the user. Instead, we could just calculate the first subsequence item (of the range expression) return it to the user, and then repeat that process over and over for each item in the subsequence.

As a sub-expression within a larger XQuery. Just like in our original query at the top of this article, where subsequence is a sub-expression inside the count function.

When the subequence function is used as a sub-expression, we have no advance idea what the outer-expression will do with the subseqence. For example, if we have a (simplified) query like: subsequence(1, 2, subsequence(1, 1000, $data)) we calculate the first 1000 items of $data, but then it turned out that we actually only want 2 items. That initial calculation and memorisation of those 1000 items was very wasteful. Instead, if we lazily evaluated subsequence and had the outer-expression pull each item, one after another, in a stream like fashion, we would only need to calculate 2 items from $data.

So both use-cases identify that we can use some form of streaming and lazy evaluation to avoid storing all the intermittent items described by the subequence function into memory.

Super simple! eXist-db's fn:subsequence now returns a new sequence type which is a place holder for the actual (subsequence of items). The key parts of my implementation of org.exist.xquery.value.SubSequence look like:

Those readers who have been paying close attention, will also notice that we have implemented a pair of new methods SequenceIterator#skippable() and SequenceIterator#skip(long). These methods allow an expression which is operating on a sequence to reposition to the desired starting point in the sequence at no cost, i.e.: without having to call nextItem repeatedly which we know is expensive as it allocates an object!

The Count FunctionThe count function is implemented by org.exist.xquery.functions.fn.FunCount. The key part of its implementation is rather simple, it simply asks the sequence for the number of items it contains:

result = new IntegerValue(sequence.getItemCount());

Before our implementation of the SubSequence sequence type, the sequence would have been a ValueSequence, which calculates the number of items like so:

public int getItemCount() {
sortInDocumentOrder();
return size + 1;
}

One might ask, why are we (potentially) sorting the items of the sequence into document order when we just to count them? Well, sortInDocumentOrder() also potentially does some de-duplication of items! Remember also, that when we were previously using ValueSequence we could have many thousands of realised IntegerValue objects in memory, so any deduplication and sorting here is going to be expensive. The remaining simple integer arithmetic of size + 1 is perfectly fine.

Now, that we have implemented a SubSequence sequence type, we can do much better! As shown above, SubSequence holds the fromInclusive and toExclusive positions of the subsequence within the larger sequence. When the count function is operating on a SubSequence type, to calculate the item count, it is almost as simple as the formula toExclusive - fromInclusive. We don't need to perform any expensive deduplication or sorting. My implementation of SubSequence#getItemCount() looks like:

Conclusion

I was able to prepare a set of patches (#1977) for eXist-db which now allow for working with subsequences of unlimited length without running out of memory. This was achieved by switching eXist-db's subsequence function from eager evaluation to lazy evaluation, and having it return a new iterable sequence type (called SubSequence) which streams the sub-items of the original sequence (e.g. a range expression) one by one.

I believe that the new org.exist.xquery.value.SubSequence type also opens up some new avenues for further optimisations in eXist-db. For example, it could perhaps be used for optimising positional predicate expressions like: $data[12345].