Monday, July 23, 2007

I work a lot with large XML datasets that are arranged as thousands of 1-10 MB XML files. I spend most of my days writing transforms and transformation pipelines to process these files, which is where Kernow came from. I also like messing around with eXist (I have yet to use it commercially, but I hope to one day) and enjoying the speed a native XML database gives you.

Requirements that regularly come up are to generate indexes and reports for the dataset. This is nice and simple using XSLT 2.0's grouping, but it requires the whole dataset to be in memory unless you use saxon:discard-document(). It can also be quite slow, if only because you have to read gigabytes from disk and parse the whole of each and every XML input file just to get the snippet that you're interested in (such as the title, or say all of the elements).

Conversely, XQuery doesn't suffer from the dataset size but lacks XSLT 2.0's grouping features. It's perfectly possible (although a bit involved, you could say "a bit XSLT 1.0") to recreate the grouping in XQuery, but it's just so much nicer in XSLT 2.0. So to get the best of both, you can use eXist's fantastic REST-style interface to select the parts of the XML you're interested in, and then use XSLT 2.0's xsl:for-each-group to arrange the results.

In the example stylesheet below I create an index by getting the <title> for each XML document, grouping the titles by their first letter, and then sorting each group by the title itself. I use eXist to get the <title> elements, then XSLT 2.0 to do the sorting and grouping.
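To make the grouping step concrete on its own, here is a minimal sketch of the same logic in Python rather than XSLT. It is not the stylesheet itself, just an illustration of "group by first letter, sort by title"; the function name build_index and the sample titles are made up for the example.

```python
from collections import defaultdict


def build_index(titles):
    """Group titles by their upper-cased first letter, sorting each group by title.

    Hypothetical helper for illustration only; it mirrors what
    xsl:for-each-group with a nested sort would do in the stylesheet.
    """
    groups = defaultdict(list)
    for title in titles:
        title = title.strip()
        if title:  # skip empty titles, which have no first letter
            groups[title[0].upper()].append(title)
    # Letters in alphabetical order, titles sorted case-insensitively within each letter.
    return {
        letter: sorted(groups[letter], key=str.casefold)
        for letter in sorted(groups)
    }


if __name__ == "__main__":
    sample = ["XSLT Basics", "eXist Setup", "XQuery Notes", "Efficient Grouping"]
    for letter, group in build_index(sample).items():
        print(letter, group)
```

Upper-casing the grouping key puts "eXist Setup" under E alongside "Efficient Grouping", which is usually what you want in an index.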

I have an instance of eXist running on my local machine, fully populated with the XML dataset. The function fn:eXist() takes the collection I'm interested in and the XQuery to execute against that collection, constructs the correct URI for the REST interface and calls doc() with that URI. The result comes back in eXist's own XML wrapper format, containing each tuple, which I then group using xsl:for-each-group. Note the -1 value for the _howmany parameter on the query: without it, eXist returns only the first 10 results.
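The URI construction is the fiddly part, so here is a rough sketch of it in Python rather than XSLT. The base URL (http://localhost:8080/exist/rest), the collection path /db/books and the function name exist_rest_uri are all assumptions for illustration; your eXist install may expose the REST interface at a different path. The _query and _howmany parameters are the ones eXist's REST interface accepts.

```python
from urllib.parse import quote, urlencode


def exist_rest_uri(collection, query, base="http://localhost:8080/exist/rest"):
    """Build a REST URI that runs `query` against `collection`.

    `base` is an assumed default for a local eXist instance; adjust it
    to match your own installation.
    """
    # _howmany=-1 asks for every result; without it only the first 10 come back.
    params = urlencode({"_query": query, "_howmany": -1}, quote_via=quote)
    # Percent-encode the collection path but keep its slashes.
    path = quote(collection.strip("/"))
    return f"{base}/{path}?{params}"


if __name__ == "__main__":
    # Hypothetical collection name; //title selects every <title> element.
    print(exist_rest_uri("/db/books", "//title"))
```

The query has to be percent-encoded because XQuery is full of characters (slashes, brackets, spaces) that are not safe in a query string, which is the main thing a helper like this saves you from getting wrong by hand.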