Sunday, September 04, 2011

Writing an extension step for Calabash, to use BaseX

Introduction

Writing an extension for Calabash in Java involves three different things: 1/ the Java class itself, which has to implement the interface XProcStep, 2/ binding a step name to the implementation class, and 3/ declaring the step in XProc.

Java

Let's take, as an example, a step evaluating a query using the standalone BaseX processor. The goal is not to have a fully functional step, nor to have a best-quality-ever step with error reporting and such, but rather to emphasize how to glue all the things together. The step has one input port, named source, and one output port, named result. The step gets the string value of the input port (typically a c:query element) and evaluates it as an XQuery, using BaseX. The result is parsed as an XML document and sent to the output port (it is a parse error if the result of the query is not an XML document or element). Let's start with the Java class implementing the extension step:

An extension step has to implement the Calabash interface XProcStep. Calabash provides a convenient class, DefaultStep, that implements all the methods with default behaviour, which is good for most usages. The only things we have to do are to save the input and output for later use, to reset them in case the step object is reused, and of course to provide the main processing in run(). In the run() method, we read the document from the source port, get its string value, execute it as a query using the BaseX API, and parse the result as XML to write it to the result port.

As you can see, there is nothing in the class itself about the interface of the step: its type name, its inputs and outputs, its options, etc. This is done in two different places. First you link the step type to the implementation class, then you declare the step with XProc.

Tell Calabash about the class

Linking the step type to the implementation class is done in a Calabash config file. So you have to create a new config file, and pass it to Calabash on the command line with the option --config (or -c for short). The file itself is very simple, and links the step type (a QName) to the class (a fully qualified Java class name):
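A minimal sketch of such a config file is shown here. The prefix ex, its namespace URI, the step name basex-query and the class name org.example.BaseXQueryStep are all hypothetical placeholders, and the element and attribute names (xproc-config, implementation, type, class-name) should be checked against the documentation of the Calabash version in use:

```xml
<!-- Calabash config file: binds a step type to a Java class.
     All names in the ex namespace and the class name are illustrative only. -->
<xproc-config xmlns="http://xmlcalabash.com/ns/configuration"
              xmlns:ex="http://example.org/ns/basex">

   <!-- type: the QName of the step, as used in the pipeline.
        class-name: the fully qualified name of the implementation class. -->
   <implementation type="ex:basex-query"
                   class-name="org.example.BaseXQueryStep"/>

</xproc-config>
```

The QName in the type attribute is resolved using the namespace declarations in scope, so the prefix itself does not matter, only the namespace URI and the local name have to match the step declaration.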

Declare the step

Finally, declaring the step in XProc is done using the standard p:declare-step. If it contains no subpipeline (that is, if it contains only p:input, p:output and p:option children), then it is considered a declaration of a step whose implementation is somewhere else; if it contains a subpipeline, then it is a step type definition, with the implementation defined in XProc itself. The declaration can be copied and pasted into the main pipeline itself, but as with any other language, the best practice is rather to declare it in an XProc library (composed only of step declarations) and to import this library within the main pipeline using p:import. In our case, we define the step type to have an input port source and an output port result (both primary), and no options:
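As an illustration, a library holding such a declaration could look like the following sketch. The namespace URI and the step name ex:basex-query are hypothetical; they only have to match the step type bound to the Java class in the Calabash config file:

```xml
<!-- XProc library containing only the declaration of the extension step.
     The ex namespace and the step name are illustrative only. -->
<p:library xmlns:p="http://www.w3.org/ns/xproc"
           xmlns:ex="http://example.org/ns/basex"
           version="1.0">

   <!-- No subpipeline: this only declares the signature of the step,
        the implementation is provided by the processor. -->
   <p:declare-step type="ex:basex-query">
      <p:input  port="source" primary="true"/>
      <p:output port="result" primary="true"/>
   </p:declare-step>

</p:library>
```

A pipeline can then bring the step into scope with a p:import whose href points to this file.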

Packaging

Update: The mechanism described in this section has been implemented, see this blog entry.

If you want to publicly distribute your extension, you have to provide your users with 1/ the JAR file, 2/ the config file and 3/ the library file. The user then needs to correctly configure Java with the JAR file, to correctly configure Calabash with the config file, and to use a suitable URI in the p:import/@href in his/her pipeline. That makes a lot of different places where the user can make a mistake.

The EXPath Packaging open-source implementation for Calabash does not support Java extension steps yet, but it is planned to support them, in order to handle that configuration part automatically. The goal is to have the library author define an absolute URI for the XProc library (declaring the steps), which the user uses in p:import, regardless of where the library is actually installed (it will be resolved automatically). The details (classpath setting, XProc library resolving, and Calabash config) should then be handled by the packaging support. Once the package of the extension step has been installed in the repository, one can then execute the following pipeline (note the import URI has changed):
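A pipeline of that kind might look like the sketch below. The import URI http://example.org/ns/basex/lib.xpl, the ex namespace and the step name are hypothetical; the actual values would be whatever the author of the package has chosen:

```xml
<!-- Main pipeline: imports the step library by its absolute URI
     and calls the extension step with an inline query.
     The import URI and the ex names are illustrative only. -->
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:c="http://www.w3.org/ns/xproc-step"
                xmlns:ex="http://example.org/ns/basex"
                version="1.0">

   <p:output port="result" primary="true"/>

   <!-- Absolute URI, to be resolved by the packaging support
        rather than dereferenced directly. -->
   <p:import href="http://example.org/ns/basex/lib.xpl"/>

   <ex:basex-query>
      <p:input port="source">
         <p:inline>
            <!-- The string value of this element is the query to evaluate. -->
            <c:query>
               &lt;res&gt; { 1 + 1 } &lt;/res&gt;
            </c:query>
         </p:inline>
      </p:input>
   </ex:basex-query>

</p:declare-step>
```

The query is escaped so that the step receives it as text, which matches the way the step reads the string value of its source port. The query returns an element, so that its serialized result can be parsed back as an XML document.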

1 Comments:

Dear Florent, thanks for this blog entry. In the following, I have listed two quick alternatives for evaluating XQuery expressions in BaseX. The first version directly communicates with the XQuery processor of BaseX and caches the serialized byte stream (bypassing the string conversion):

In both variants, the result is completely serialized before it is passed on to Saxon's node builder. If the intermediate result gets very large, we could try in a second step to merge the serializer and input stream.