Abstract

This document contains requirements for the development of an XML Processing Model and Language, which are intended to describe and specify the processing relationships between XML resources.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.


1 Introduction

A large and growing set of specifications describes processes operating on XML documents. Many applications will depend on the use of more than one of these specifications. Considering how implementations of these specifications might interact raises many issues related to interoperability. This specification contains requirements on an XML Pipeline Language for the description of XML process interactions in order to address these issues. This specification is concerned with the conceptual model of XML process interactions, the language for the description of these interactions, and the inputs and outputs of the overall process. This specification is not generally concerned with the implementations of actual XML processes participating in these interactions.

2 Terminology

An XML Information Set or "Infoset" is the name we give to any implementation of a data model for XML which supports the vocabulary as defined by the XML Information Set recommendation [xml-infoset-rec].

An XML Pipeline is a conceptualization of a flow of a configuration of steps and their parameters. The XML Pipeline defines a process in terms of order, dependencies, or iteration of steps over XML information sets.

A parameter is input to a Step or an XML Pipeline in addition to the Input and Output Document(s) that it may access. Parameters are most often simple, scalar values such as integers, booleans, and URIs, and they are most often named, but neither of these conditions is mandatory. That is, we do not (at this time) constrain the range of values a parameter may hold, nor do we (at this time) forbid a Step from accepting anonymous parameters.

Streaming is the ability to parse an XML document and pass information items between components without building a full document information set.
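The pipeline, step, and parameter terms defined above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Step` and `Pipeline` classes, the two example step functions, and the use of `xml.etree.ElementTree` elements as a stand-in for information sets are all hypothetical and are not prescribed by these requirements.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List
import xml.etree.ElementTree as ET

# Hypothetical signature: a step takes a document and its parameters
# and returns the resulting document.
StepFunction = Callable[[ET.Element, Dict[str, Any]], ET.Element]


@dataclass
class Step:
    name: str
    run: StepFunction
    # Parameters are inputs in addition to the documents themselves.
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Pipeline:
    steps: List[Step]

    def execute(self, document: ET.Element) -> ET.Element:
        # Steps run in declaration order; each consumes the previous output.
        for step in self.steps:
            document = step.run(document, step.parameters)
        return document


def rename_root(doc: ET.Element, params: Dict[str, Any]) -> ET.Element:
    # Build a new document element with a different name, keeping the content.
    renamed = ET.Element(params["new_name"], doc.attrib)
    renamed.text = doc.text
    renamed.extend(list(doc))
    return renamed


def set_attribute(doc: ET.Element, params: Dict[str, Any]) -> ET.Element:
    # Set one attribute on the document element.
    doc.set(params["name"], params["value"])
    return doc


pipeline = Pipeline([
    Step("rename", rename_root, {"new_name": "article"}),
    Step("stamp", set_attribute, {"name": "status", "value": "draft"}),
])
result = pipeline.execute(ET.fromstring("<doc><title>Hi</title></doc>"))
print(ET.tostring(result, encoding="unicode"))
```

Here both parameters are named scalar values, which matches the most common case described above; nothing in the sketch prevents other kinds of values.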

3 Design Principles

The design principles described in this document are requirements, compliance with which is an overall goal for the specification. It is not necessarily the case that each specific feature meets a given requirement. Instead, the whole set of specifications related to this requirements document should meet the overall goal specified in the design principle.

Technology Neutral

Applications should be free to implement XML processing using appropriate technologies such as SAX, DOM, or other infoset representations.

Platform Neutral

Application computing platforms should not be limited to any particular class of platforms such as clients, servers, distributed computing infrastructures, etc. In addition, the resulting specifications should not be swayed by the specifics of use on those platforms.

Small and Simple

The language should be as small and simple as practical. It should be "small" in the sense that simple processing should be able to be stated in a compact way and "simple" in the sense that the specification of more complex processing steps does not require arduous specification steps in the XML Pipeline Specification Document.

Infoset Processing

At a minimum, an XML document is represented and manipulated as an XML Information Set. The use of supersets, augmented information sets, or data models that can be represented or conceptualized as information sets should be allowed, and in some instances, encouraged (e.g. for the XPath 2.0 Data Model).

Straightforward Core Implementation

It should be relatively easy to implement a conforming implementation of the language but it should also be possible to build a sophisticated implementation that implements its own optimizations and integrates with other technologies.

Address Practical Interoperability

An XML Pipeline must be able to be exchanged between different software systems with a minimum expectation of the same result for the pipeline given that the XML Pipeline Environment is the same. A reasonable resolution to platform differences for binding or serialization of resulting infosets should be expected to be addressed by this specification or by re-use of existing specifications.

XML Pipelines need to support existing XML specifications and reuse common design patterns from within them. In addition, there must be support for the use of future specifications as much as possible.

Arbitrary Components

The specification should allow the use of any component technology that can consume or produce XML Information Sets.

Control of Inputs and Outputs

An XML Pipeline must allow control over specifying both the inputs and outputs of any process within the pipeline. This applies to the inputs and outputs of both the XML Pipeline and the steps it contains. It should also allow for the case where there might be multiple inputs and outputs.

Control of Flow and Errors

An XML Pipeline must allow control over the explicit and implicit handling of the flow of documents between steps. When errors occur, it must be possible to handle them explicitly to allow alternate courses of action within the XML Pipeline.

4.7 Error Handling and Fall-back [req-error-handling-fallback]

4.8 Support for the XPath 2.0 Data Model [req-xdm]

XML Pipelines must support the XPath 2.0 Data Model to allow support for XPath 2.0, XSLT 2.0, and XQuery as steps.

Note:

At this point, there is no consensus in the working group that minimal conforming implementations are required to support the XPath 2.0 Data Model.

4.9 Allow Optimization [req-allow-optimization]

An XML Pipeline should not inhibit a sophisticated implementation from performing parallel operations, lazy or greedy processing, and other optimizations. [xml-core-wg]

4.10 Streaming XML Pipelines [req-streaming-pipes]

An XML Pipeline should allow for the existence of streaming pipelines in certain instances as an optional optimization. [xml-core-wg]

5 Use cases

This section contains a set of use cases that support our requirements and will inform our design. While we want to address all the use cases listed in this document, in the end, the first version of those specifications may not solve all the following use cases. Those unsolved use cases may be addressed in future versions of those specifications.

To aid navigation, the requirements can be mapped to the use cases of this section as follows:

5.11 Make Absolute URLs [use-case-make-absolute-urls]

For all elements or attributes whose type is xs:anyURI, resolve the value against the base URI to create an absolute URI. Replace the value in the document with the resulting absolute URI.

This example assumes preservation of infoset ([base URI]) and PSVI ([type definition]) properties from step to step. Also, there is no way to reorder these steps as the schema doesn't accept xml:base attributes but the expansion requires xs:anyURI typed values.
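The resolution described in this use case can be sketched in Python with `urllib.parse.urljoin`. This is a minimal sketch under stated assumptions: the standard library has no PSVI, so the hypothetical `URI_ATTRIBUTES` set stands in for "attributes whose type is xs:anyURI", a single base URI is passed in rather than read from the infoset [base URI] property, and only attributes (not element content) are handled.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Assumption: a stand-in for schema type information. In the use case these
# names would come from the xs:anyURI type definitions in the PSVI.
URI_ATTRIBUTES = {"href", "src"}


def make_absolute(root: ET.Element, base_uri: str) -> ET.Element:
    """Replace each URI-typed attribute value with its absolute form."""
    for element in root.iter():
        for name in URI_ATTRIBUTES & set(element.attrib):
            element.set(name, urljoin(base_uri, element.get(name)))
    return root


doc = ET.fromstring('<page><link href="../a/b.xml"/><img src="pic.png"/></page>')
make_absolute(doc, "http://example.org/docs/guide/index.xml")

print(doc.find("link").get("href"))  # http://example.org/docs/a/b.xml
print(doc.find("img").get("src"))    # http://example.org/docs/guide/pic.png
```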

5.16 XQuery and XSLT 2.0 Collections [use-case-collections]

In XQuery and XSLT 2.0 there is the idea of an input and output collection. A pipeline must be able to consume or produce collections of documents, both as inputs and outputs of steps and as inputs and outputs of whole pipelines.

For example, for input collections:

Accept a collection of documents.

Apply a single XSLT 2.0 transformation that processes the collection and produces another collection.
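The shape of such a step, a collection in and a collection out, can be sketched as follows. The Python standard library has no XSLT 2.0 processor, so the hypothetical `split_sections` function below is a plain-Python stand-in for the transformation; the `Collection` alias and the sample documents are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: a collection is modelled as a list of document elements.
Collection = List[ET.Element]


def split_sections(inputs: Collection) -> Collection:
    """Stand-in for a single transformation over a whole collection.

    Produces one output document per <section> element found anywhere in
    the input collection, so input and output sizes need not match.
    """
    outputs: Collection = []
    for document in inputs:
        for section in document.iter("section"):
            result = ET.Element("result", {"source": document.get("id", "")})
            result.append(section)
            outputs.append(result)
    return outputs


inputs = [
    ET.fromstring('<doc id="a"><section>one</section><section>two</section></doc>'),
    ET.fromstring('<doc id="b"><section>three</section></doc>'),
]
outputs = split_sections(inputs)
print(len(inputs), "documents in,", len(outputs), "documents out")
```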

This document provides a test scenario that will be used to create validation management scripts using a range of existing techniques, including those used for program compilation, etc.

The steps required to validate our sample document are:

Use ISO 19757-4 Namespace-based Validation Dispatching Language (NVDL) to split out the parts of the document that are encoded using HTML, SVG and MathML from the bulk of the document, whose tags are defined using a user-defined set of markup tags.

Use a set of Schematron rules stored in check-metadata.xml to ensure that the metadata of the HTML elements defined using Dublin Core semantics conform to the information in the document about the document's title and subtitle, author, encoding type, etc.

Validate the SVG components of the file using the standard W3C schema provided in the SVG 1.2 specification.

Use the Schematron rules defined in SVG-subset.xml to ensure that the SVG file only uses those features of SVG that are valid for the particular SVG viewer available to the system.

Validate the MathML components using the latest version of the MathML schema (defined in RELAX-NG) to ensure that all maths fragments are valid. The schema will make use of the datatype definitions in check-maths.xml to validate the contents of specific elements.

Use MathML-SVG.xslt to transform the MathML segments to displayable SVG and replace each MathML fragment with its SVG equivalent.

Use the ISO 19757-8 Document Schema Renaming Language (DSRL) definitions in convert-mynames.xml to convert the tags in the local nameset to the form that can be used to validate the remaining part of the document using docbook.dtd.

Use the ISO 19757-7 Character Repertoire Definition Language (CRDL) rules defined in mycharacter-checks.xml to validate that the correct character sets have been used for text identified as being Greek and Cyrillic.

Convert the Docbook tags to HTML so that they can be displayed in a web browser using the docbook-html.xslt transformation rules.

Each validation script should allow the four streams produced by step 1 to be run in parallel without requiring the other validations to be carried out if there is an error in another stream. This means that steps 2 and 3 should be carried out in parallel to steps 4 and 5, and/or steps 6 and 7 and/or steps 8 and 9. After completion of step 10 the HTML (both streams), and SVG (both streams) should be recombined to produce a single stream that can be fed to a web browser. The flow is illustrated in the following diagram:
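Independently of the diagram, the parallelism described above can be sketched in Python with `concurrent.futures`. This is a rough sketch under stated assumptions: the hypothetical `require` helper produces trivial substring checks in place of the real NVDL, Schematron, and RELAX-NG validations, and the stream names and sample fragments are invented for illustration. The point shown is that a failure stops only the stream it occurs in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Assumption: a validator raises ValueError when its check fails.
Validator = Callable[[str], None]


def require(text: str) -> Validator:
    """Hypothetical validator factory: the fragment must contain `text`."""
    def validate(fragment: str) -> None:
        if text not in fragment:
            raise ValueError(f"missing {text!r}")
    return validate


def run_stream(fragment: str, validators: List[Validator]) -> Tuple[bool, str]:
    # Validators within one stream run in order; the first failure ends
    # this stream only and is reported instead of being raised.
    try:
        for validate in validators:
            validate(fragment)
    except ValueError as error:
        return (False, str(error))
    return (True, "")


def run_streams(
    streams: Dict[str, Tuple[str, List[Validator]]]
) -> Dict[str, Tuple[bool, str]]:
    # Each stream is submitted as an independent task.
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(run_stream, fragment, validators)
            for name, (fragment, validators) in streams.items()
        }
        return {name: future.result() for name, future in futures.items()}


streams = {
    "html": ("<html><title>t</title></html>", [require("<html"), require("<title")]),
    "svg": ("<svg/>", [require("<svg"), require("<circle")]),
    "mathml": ("<math/>", [require("<math")]),
}
results = run_streams(streams)
for name, (ok, message) in results.items():
    print(name, "ok" if ok else "failed: " + message)
```

The SVG stream fails its second check, yet the HTML and MathML streams still run to completion and report success.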

Running XSLT on a very large document isn't typically practical. In these cases, it is often a particular element, which may be repeated many times, that needs to be transformed. Conceptually, a pipeline could limit the transformation to a subtree by:

Limiting the transform to a subtree of the document identified by an XPath.

For each subtree, cache the subtree and build a whole document with the identified element as the document element and then run a transform to replace that subtree in the original document.

For any non-matches, the document remains the same and "streams" around the transform.

This allows the transform and the tree building to be limited to a small subtree and the rest of the process to stream. As such, an arbitrarily large document can be processed in a bounded amount of memory.
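A partial sketch of this idea uses `xml.etree.ElementTree.iterparse` from the Python standard library. Several things here are assumptions: the hypothetical `transform_subtrees` function selects subtrees by element name rather than by a general XPath, the transform is an ordinary Python callable rather than XSLT, and the sketch emits only the transformed subtrees. Streaming the non-matching content around the transform and reassembling the output document are left out.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Callable, Iterator


def transform_subtrees(
    source: BinaryIO,
    tag: str,
    transform: Callable[[ET.Element], ET.Element],
) -> Iterator[str]:
    """Yield the serialized result of transforming each matching subtree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the document element
    for event, element in context:
        if event == "end" and element.tag == tag:
            result = transform(element)
            # Any tail text belongs to the surrounding, untransformed content.
            result.tail = None
            yield ET.tostring(result, encoding="unicode")
            # Drop what has been built so far so the tree does not keep growing.
            root.clear()


def uppercase_name(item: ET.Element) -> ET.Element:
    # Hypothetical transform: replace an item by its upper-cased name.
    out = ET.Element("item")
    out.text = (item.findtext("name") or "").upper()
    return out


xml = "<list><item><name>a</name></item><other/><item><name>b</name></item></list>"
results = list(transform_subtrees(io.BytesIO(xml.encode("utf-8")), "item", uppercase_name))
print(results)  # ['<item>A</item>', '<item>B</item>']
```

Only one matching subtree needs to be fully built at a time, which is what keeps memory use tied to the size of the largest subtree rather than the size of the document.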

5.30 Adding Navigation to an Arbitrarily Large Document [use-case-add-nav]

For a particular website, every XHTML document needs to have navigation elements added to the document. The navigation is static text that surrounds the body of the document. This navigation is added by:

Matching the head and body elements using an XPath expression that can be streamed.

Inserting a stub for a transformation for including the style and surrounding navigation of the site.

For each of the stubs, transformations insert the markup using a subtree expansion that allows the rest of the document to stream.

In the end, the pipeline allows an arbitrarily large XHTML document to be processed with a near-constant cost.

(source: Alex Milowski)

5.31 Fallback to Choice of XSLT Processor [use-case-fallback-choice]

A step in a pipeline produces multiple output documents. In XSLT 2.0, this is a standard feature of all XSLT 2.0 processors. In XSLT 1.0, this is not standard.

A pipeline author wants to write a pipeline in which, at compile time, the implementation chooses XSLT 2.0 when possible and degrades to XSLT 1.0 when XSLT 2.0 is not supported. In the case of XSLT 1.0, the step will use XSLT extensions to support the multiple output documents, which again may fail. Fortunately, the XSLT 1.0 transformation can be written to test for this.

(source: Alex Milowski)
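The compile-time choice in this use case can be sketched as an ordered preference list. Everything named here is hypothetical: the `choose_processor` function, the `StepUnavailable` exception, and the processor labels are illustrations only, and the registry of available processors is an assumption about how an implementation might expose what it supports.

```python
from typing import Callable, Dict, List


class StepUnavailable(Exception):
    """Raised when none of the preferred processors is supported."""


def choose_processor(preferences: List[str], available: Dict[str, Callable]) -> str:
    # Walk the author's preferences in order and pick the first supported one.
    for name in preferences:
        if name in available:
            return name
    raise StepUnavailable("none of " + ", ".join(preferences) + " is supported")


# Assumption: this implementation only ships an XSLT 1.0 processor.
available = {"xslt-1.0": lambda document: document}

chosen = choose_processor(["xslt-2.0", "xslt-1.0"], available)
print(chosen)  # xslt-1.0
```

The choice is made before any document is processed, so the pipeline degrades to XSLT 1.0 without having attempted and failed an XSLT 2.0 step at run time.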

5.32 No Fallback for XQuery Causes Error [use-case-no-fallback-error]

As the final step in a pipeline, XQuery is required to be run. If the XQuery step is not available, the compilation of the pipeline needs to fail. Here the pipeline author has chosen that the pipeline must not run if XQuery is not available.
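A minimal sketch of this behaviour checks every step type before anything runs. The `compile_pipeline` function, the `PipelineCompilationError` exception, and the step type names are hypothetical and are used only to illustrate compilation failing when a required step has no fallback.

```python
from typing import List, Set


class PipelineCompilationError(Exception):
    """Raised when a pipeline requires a step type that is not supported."""


def compile_pipeline(step_types: List[str], available: Set[str]) -> List[str]:
    # Check the whole pipeline up front, so no step runs if any is missing.
    missing = [step for step in step_types if step not in available]
    if missing:
        raise PipelineCompilationError("unsupported step types: " + ", ".join(missing))
    return list(step_types)


# Assumption: this implementation has no XQuery step.
available = {"validate", "xslt"}

try:
    compile_pipeline(["validate", "xslt", "xquery"], available)
    message = ""
except PipelineCompilationError as error:
    message = str(error)

print(message)  # unsupported step types: xquery
```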