Thursday, June 25, 2009

Divide and Conquer, or XPath, XSLT, XQuery and XProc packaging

Packaging of the various X* technologies seems to interest a lot of
people right now. And of course it interests me. But it seems everyone
comes with their own idea of packaging, as well as a different scope.
So, to add yet more complexity, I will present here my own ideas on
the matter. Hopefully, I will manage to tidy up the different concepts
and to identify the different needs. And as always, I like to talk
about concrete things, if only to ease further discussions. So I will
introduce a prototype of a packaging system for X* libraries and
extensions for Saxon.

Packaging is nothing in itself. It is always related to something
else (a language, a technology, a framework...). Packaging is just a
means to ease sharing and delivering something in the scope of that
"something else." The several files in an ODF document are packaged
in a single ZIP file, with a pre-defined structure, to make it
possible for an application to use its content. The important point
is not the structure in itself, but rather the information it gathers.

I have followed some very interesting discussions about X* packaging
during the last few weeks, with very interesting people. I quickly
saw that everyone was talking about slightly (or not so slightly)
different things. The most important point on which people have
different views, IMHO, is the scope of packaging.

As with most modern languages, an XML developer may have to deliver
different pieces of software, depending on the project: libraries,
standalone applications, or web applications built for a specific
framework. If you look at Java for instance, this is reflected quite
clearly in its various packaging formats: JAR files for libraries and
applications, WAR files for web applications, EAR files for entire
enterprise applications...

WAR files contain Java classes, as JAR files do. But the structure is
quite different, and there are a few other files, describing what is
in the package: "that class is a servlet class, conforming to the
definition of servlet and coded to live in a servlet container, with a
precise lifecycle," or "the package depends on this JAR file."

In the same way, you can package XSLT libraries or XQuery modules,
telling a processor that when a stylesheet or a module imports a
specific URI, some functions are available (provided as plain XSLT
stylesheets, XQuery modules, or extension functions). Or you can
package an entire web application using XProc to control the overall
processes, XQuery to query XML databases and XSLT for the presentation
layer (sounds very MVC, doesn't it?). But those packages are really
different beasts: where the first example just needs to package some
XSLT, XQuery, Java, or whatever code, alongside a simple cataloging
system, the second example requires defining a complete web framework,
its lifecycle, and how scripts can plug into it and exchange information
with it ("this XProc pipeline has to be evaluated on an HTTP GET on
http://www.example/app/theuri, it knows you will provide it with
request information as a wa:http-request element, as we agreed upon,
and that XSLT stylesheet has to be applied to its result; by the way
it will access runtime information by using the extension functions
you provide.")

There has been some work on XRX frameworks, and clearly it would be
beneficial for everybody (users, but also implementors) to have such a
standard packaging format for entire applications following their
rules (as WAR and EAR files are for Java). They would also benefit
from a lower-level packaging format dedicated to X* libraries, and
would build upon it. But the two really are at
different levels, and I think it is fundamental to make the
distinction between both concepts.

As part of the EXPath project, and because I think this is the first
step X* technologies have needed for several years to enable the
delivery of
libraries, I am particularly interested in a library packaging
format.

To illustrate that, I've built a very simple prototype of a package
manager for Saxon. On the one hand you have a simple GUI to install
and delete packages in a repository, and on the other hand you have a
shell script to launch Saxon (setting the classpath for extension
functions and setting catalogs to resolve XSLT imports referring to
libraries). If those tools are built around a well-defined, open
package format, other implementations could be written (for eXist, for
MarkLogic, XQilla, Zorba... but also for oXygen, providing a one-click
way to install a package and then to enable it in some scenarios).

You can find the manager at http://www.fgeorges.org/purl/20090624/.
You should be able to run it simply by clicking on one of the links on
the launch.html page (through Java Web Start), but you can also
download the JAR file (look also in the lib/ sub-directory), put
both JAR files in the classpath and run Java the usual way, with
the main class org.expath.pkg.saxon.PackageManagerGUI (there is also a
text interface with org.expath.pkg.saxon.PackageManagerTextUI). You
first have to set up an environment variable EXPATH_REPO, pointing to
a directory (that will be your EXPath Packaging repository; just
create an empty directory). The interface is very simple: choose the
install item in the file menu, and select the package file you want to
install. To remove a package, select it in the list of installed
modules and select delete in the menu.
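
To make the install and delete operations more concrete, here is a
minimal sketch in Python of what a repository manager could do with a
package file. It is not the code of the prototype: the layout (one
sub-directory per package, named after the ZIP file) and the function
names install_package, list_packages and remove_package are
assumptions made for the illustration only.

```python
import shutil
import zipfile
from pathlib import Path


def install_package(zip_path, repo_dir):
    """Unpack a package ZIP into its own sub-directory of the repository.

    Assumption: one directory per package, named after the ZIP file.
    """
    target = Path(repo_dir) / Path(zip_path).stem
    if target.exists():
        raise FileExistsError(f"package already installed: {target.name}")
    target.mkdir(parents=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


def list_packages(repo_dir):
    """Return the names of the installed packages, sorted."""
    return sorted(p.name for p in Path(repo_dir).iterdir() if p.is_dir())


def remove_package(name, repo_dir):
    """Delete an installed package and everything it contains."""
    shutil.rmtree(Path(repo_dir) / name)
```

With EXPATH_REPO pointing to an empty directory, calling
install_package on a package file and then list_packages on the
repository would show the new entry, and remove_package would undo it.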

Once a module is installed, you can use it via Saxon by adding the
additional JARs to the classpath as needed (for extension functions)
and by setting up the XML Catalogs support. The following script does
that for you: http://www.fgeorges.org/purl/20090624/saxon. It needs a
few environment variables: EXPATH_REPO as explained above,
APACHE_XML_RESOLVER_JAR must point to the Apache XML Commons Resolver
(see http://xml.apache.org/commons/, and be sure to pick the resolver
JAR) and SAXON_HOME must point to the directory containing the Saxon
JARs.
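
For those who prefer to see the idea rather than read the script, here
is a rough sketch in Python of how such a launcher could assemble the
Java command line from the three environment variables. It is not a
transcription of the script above. It assumes that every JAR found in
the repository goes on the classpath, that each installed package has
a catalog file named catalog.xml (a hypothetical name), and that Saxon
is told to use the Apache resolver through its -r option while the
resolver reads the xml.catalog.files system property.

```python
import os
from pathlib import Path


def saxon_command(repo_dir, saxon_home, resolver_jar, saxon_args):
    """Build the command line to run Saxon with the repository enabled."""
    repo = Path(repo_dir)
    # Saxon itself, then the resolver, then the JARs of installed packages.
    jars = sorted(Path(saxon_home).glob("*.jar"))
    jars.append(Path(resolver_jar))
    jars += sorted(repo.rglob("*.jar"))
    # Assumption: each installed package provides a file named catalog.xml.
    catalogs = sorted(repo.rglob("catalog.xml"))
    return [
        "java",
        "-cp", os.pathsep.join(str(j) for j in jars),
        # The Apache resolver expects a semicolon-separated list of catalogs.
        "-Dxml.catalog.files=" + ";".join(str(c) for c in catalogs),
        "net.sf.saxon.Transform",
        "-r:org.apache.xml.resolver.tools.CatalogResolver",
        *saxon_args,
    ]
```

The returned list can be passed as is to subprocess.run, with the
values of EXPATH_REPO, SAXON_HOME and APACHE_XML_RESOLVER_JAR as the
first three arguments.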

But what about the package format itself? In this prototype, this is
a simple ZIP file containing expath-pkg.xml, the package descriptor,
and expath-http-client, the directory containing one module (here the
EXPath HTTP Client module). This module is implemented as a Java
extension, alongside a frontend XSLT stylesheet that takes care of the
Saxon specifics to bind to the Java functions. During the install, an
XML Catalogs file is created, to resolve the URI
http://www.expath.org/mod/http-client.xsl to that stylesheet, in the
local repository. A stylesheet can then
simply import that URI and use the functions of the module. The real
package for the HTTP Client can be downloaded at the same place:
http://www.fgeorges.org/purl/20090624/expath-http-client-saxon-0.3.zip.
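
The catalog created at install time can be pictured with the following
sketch in Python, which writes an OASIS XML Catalog mapping public
URIs to files in the repository. The function name build_catalog and
the local path used in the usage note are assumptions made for the
illustration, not what the prototype actually generates.

```python
import xml.etree.ElementTree as ET

# Namespace of the OASIS XML Catalogs format.
CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"


def build_catalog(mappings):
    """Serialize a catalog with one <uri> entry per (public URI, local file)."""
    ET.register_namespace("", CATALOG_NS)
    root = ET.Element(f"{{{CATALOG_NS}}}catalog")
    for public_uri, local_path in mappings.items():
        ET.SubElement(
            root,
            f"{{{CATALOG_NS}}}uri",
            {"name": public_uri, "uri": local_path},
        )
    return ET.tostring(root, encoding="unicode")
```

For instance, calling build_catalog with
http://www.expath.org/mod/http-client.xsl mapped to a hypothetical
expath-http-client/http-client.xsl yields a catalog with a single uri
entry, which is enough for a catalog-aware resolver to redirect the
import to the local file.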

There is of course still a lot of work to do: defining exactly the
package format, how to handle dependencies, improving the
implementation...
But I think that gives the big picture. If you are interested, here
is what the package descriptor looks like:

We can see the package contains one module, namely "EXPath HTTP
Client," version 0.3. The URIs are used to create an XML catalog.
This version of the package contains all the dependencies (the JARs
used by the Java implementation of the extension functions), but they
can also be left out, and configured with the following element:

The GUI does not take them into account yet, but it should propose to
automatically download JARs when possible, and give the user a list of
libraries and their homepage when a manual download is required. But
of course, the same format can be used to package standard XSLT
stylesheets, without any Java features, just by mapping the main entry
point files to their public URIs.

Of course, this format will be particularly useful once precisely
defined in an open spec, and if several processors support it (either
natively, or through external managers.)

To end this post, I would like to introduce an idea from Jim Fuller:
CXAN. I am sure most of you know CTAN for TeX, or CPAN for Perl.
They are central, organized repositories of libraries for those
languages, accessible through HTTP. With a proper packaging format,
it would be possible to set up such a web repository gathering XPath,
XSLT, XQuery and XProc libraries and applications, installable
automatically with a manager that would install a package from its
name, handling dependencies and the like. But for sure, that is one
more step down the road.

4 Comments:

When I read this part "but it should propose to automatically download JARs when possible" I actually started thinking of Maven. Perhaps we could actually reuse Maven in some way for the dependency management?

About dependencies, that's clearly the part that needs more work. I do not want to rely on the user having Maven installed, but it would be interesting to add the Maven info *in addition* to the homepage, name and version of the dependency (and maybe a direct link to the JAR, if available).

The point is to have in any case enough info to install by hand, but we can add optional info for any dependency manager, *in addition*.