Revision as of 14:58, 1 October 2010

1 A gentle introduction to the Haskell XML Toolbox

The Haskell XML Toolbox (HXT) is a collection of tools for processing XML with Haskell. Its core component is a domain specific language consisting of a set of combinators for processing XML trees in a simple and elegant way. The combinator library is based on the concept of arrows. The main component is a validating and namespace-aware XML parser that supports the XML 1.0 standard almost completely. Extensions include a RelaxNG validator and an XPath evaluator.

2 Background

The Haskell XML Toolbox is based on the ideas of HaXml and HXML, but introduces a more general approach to processing XML with Haskell. HXT uses a generic data model for representing XML documents, including the DTD subset, entity references, CData sections and processing instructions. This data model makes it possible to use tree transformation functions as a uniform design for all XML processing steps: parsing, DTD processing, entity processing, validation, namespace propagation, content processing and output.

HXT has grown over the years. Components for XPath, XSLT, validation
with RelaxNG, picklers for conversion from/to native Haskell data,
lazy parsing with tagsoup, input via curl and native Haskell HTTP
and others have been added. This has led to a rather large package
with a lot of dependencies.

To make the toolbox more modular and to reduce the dependencies on
other packages, hxt has been split
into various smaller packages since version 9.0.0.

3 Resources

3.1 Home Page and Repository

3.2 Packages

The package hxt forms the core of the toolbox. It contains a validating XML parser, an HTML parser that tries to read any text as HTML, a DSL for processing, transforming and generating XML/HTML, and so-called picklers for conversion between XML and native Haskell data.

hxt-unicode contains decoding functions from various encoding schemes to Unicode. In contrast to most of the decoding functions available on Hackage, these functions are lazy even in the case of encoding errors (thanks to Henning Thielemann).

hxt-regex-xmlschema contains a lightweight and efficient regex library. It offers full Unicode support, implements the standard syntax defined in the XML Schema specification, and adds extensions for intersection, difference and exclusive OR. The package is self-contained; no other regex library is required.

3.3 Installation

When installing hxt with cabal, one does not have to deal with all the
basic packages. Just a

cabal install hxt

does the work for the core toolbox. When HTTP access is required, install at least one of
the packages hxt-curl or hxt-http. All other packages can be installed
on demand at any later time.

3.4 Upgrade from HXT versions < 9.0

HXT-9 is not backwards compatible. The split into smaller
packages required some internal reorganisation and changes to some type
declarations.
To use the main features of the core package, add an

import Text.XML.HXT.Core

to your sources, instead of the old Text.XML.HXT.Arrow.

The second major change concerns configuration and option handling.
Previously this was done with lists of key-value pairs implemented as strings.
The growing number of options and the untyped option values had led to
unreliable code. With HXT-9, options are represented by functions with
type-safe argument types instead of strings. This option handling has to be
adapted when switching to the new version.

4 The basic concepts

4.1 The basic data structures

Processing XML is a task of processing tree structures. This can be done in Haskell in a very elegant way by defining an appropriate tree data type, a Haskell DOM (document object model) structure. The tree structure in HXT is a rose tree with a special XNode data type for storing the XML node information.

The generally useful tree structure (NTree) is separated from the node type (XNode). This allows for reusing the tree structure and the tree traversal and manipulation functions in other applications.

4.2 The concept of filters

Selecting, transforming and generating trees often requires routines that compute not just a single result tree but a (possibly empty) list of (sub-)trees. This leads to the idea of XML filters as in HaXml: filters are functions that take an XML tree as input and compute a list of result trees.

type XmlFilter = XmlTree -> [XmlTree]

More generally we can define a filter as

type Filter a b = a -> [b]

We will do this abstraction later, when introducing arrows. Many of the functions in the following motivating examples can be generalised this way. But for getting the idea, the XmlFilter type is sufficient.
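As a self-contained sketch in plain Haskell (not HXT's actual definitions; the composition operator name o is chosen here for illustration), the filter type and filter composition look like this:

```haskell
-- Filters: functions from one input to a list of results.
type Filter a b = a -> [b]

-- Filter composition, analogous to (.): run f first, then g on every result.
o :: Filter b c -> Filter a b -> Filter a c
g `o` f = concatMap g . f

-- halves succeeds only on even numbers; succAndPred always yields two results.
halves :: Filter Int Int
halves n = if even n then [n `div` 2] else []

succAndPred :: Filter Int Int
succAndPred n = [n - 1, n + 1]
```

With these definitions, (succAndPred `o` halves) 10 first halves 10 to [5] and then yields [4, 6], while the composition applied to an odd number yields the empty list, because halves already fails.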

Filter functions are used so frequently that the idea of defining a domain specific language with filters as the basic processing units suggests itself. In such a DSL the basic filters are predicates, selectors, constructors and transformers, all working on the HXT DOM tree structure. For a DSL it becomes necessary to define an appropriate set of combinators for building more complex functions from simpler ones. Filter composition, like (.), becomes one of the most frequently used combinators. There are also more complex filters for traversing a whole tree and selecting or transforming several nodes. We will see a few first examples in the following part.

The first task is to build filters from pure functions, i.e. to define a lift operator. Pure functions are lifted to filters in the following way:

Predicates are lifted by mapping False to the empty list and True to the single-element list containing the input tree.

This is a rather comfortable situation: with these filters we don't have to deal with illegal-argument errors. Illegal arguments are simply mapped to the empty list.
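A minimal sketch of this lifting in plain Haskell (not HXT's definitions; HXT's arrow interface calls the predicate lifter isA, a name we borrow here, while arrF is a made-up name for the function lifter):

```haskell
type Filter a b = a -> [b]

-- Lift a pure function: always exactly one result.
arrF :: (a -> b) -> Filter a b
arrF f x = [f x]

-- Lift a predicate: True yields the input itself, False the empty list.
isA :: (a -> Bool) -> Filter a a
isA p x
  | p x       = [x]
  | otherwise = []
```

For example, isA even 4 yields [4], while isA even 3 yields [].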

When processing trees, it is often the case that no result, exactly one result, or more than one result is possible. Such functions, returning a set of results, are often somewhat imprecisely called nondeterministic functions. These functions, e.g. selecting all children of a node or all grandchildren, are exactly our filters. In this context lists rather than sets of values are the appropriate result type, because ordering in XML is important and duplicates are possible.

Working with filters is rather similar to working with binary relations, and working with relations is natural and comfortable, as database people know very well.

Two first examples of working with nondeterministic functions are selecting the children and the grandchildren of an XmlTree, which can be implemented by
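The original code for these two filters is not shown above; a self-contained sketch on a rose tree (mirroring HXT's getChildren, but not its actual implementation) could look like this:

```haskell
type Filter a b = a -> [b]

-- A rose tree, like HXT's NTree: a label and a list of subtrees.
data NTree a = NTree a [NTree a]

-- The children of a node, as a nondeterministic function.
getChildren :: Filter (NTree a) (NTree a)
getChildren (NTree _ cs) = cs

-- Grandchildren: the children of all children, via filter composition.
getGrandChildren :: Filter (NTree a) (NTree a)
getGrandChildren = concatMap getChildren . getChildren
```

Applied to a node with two children, one of which has one child of its own, getChildren yields two results and getGrandChildren one.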

Of course we need choice combinators. The first idea is an if-then-else filter built from three simpler filters. But often it's easier and more elegant to work with simpler binary combinators for choice, so we will introduce those first.

The meaning is the following: if f computes a non-empty list as its result, f succeeds and this list is the result; otherwise g is applied to the input and its result is taken. There are two other simple choice combinators, usually written in infix notation.
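The combinator just described can be sketched in plain Haskell filter terms (not HXT's actual code) as follows:

```haskell
type Filter a b = a -> [b]

-- f `orElse` g: take f's results if non-empty, otherwise fall back to g.
orElse :: Filter a b -> Filter a b -> Filter a b
(f `orElse` g) x =
  case f x of
    [] -> g x
    rs -> rs
```

So if the first filter fails (returns the empty list), the second one decides the result; otherwise the second is never consulted.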

These choice operators become useful when transforming and manipulating trees.

4.4 Tree traversal filter

A very basic operation on tree structures is the traversal of all nodes and the selection and/or transformation of nodes. These traversal filters serve as control structures for processing whole trees. They correspond to the map and fold combinators for lists.

The simplest traversal filter does a top-down search of all nodes with a special property. This filter, called deep, stops the traversal of a subtree as soon as a node is found for which the argument filter succeeds.
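Assuming the filter meant here is HXT's deep, a self-contained sketch on a rose tree (together with its sibling multi, which also searches inside successful subtrees; neither is HXT's real code) could be:

```haskell
type Filter a b = a -> [b]

data NTree a = NTree a [NTree a]

getChildren :: Filter (NTree a) (NTree a)
getChildren (NTree _ cs) = cs

-- deep f: top-down search; where f succeeds, stop descending that subtree.
deep :: Filter (NTree a) (NTree a) -> Filter (NTree a) (NTree a)
deep f t =
  case f t of
    [] -> concatMap (deep f) (getChildren t)
    rs -> rs

-- multi f: like deep, but keep searching inside successful subtrees too.
multi :: Filter (NTree a) (NTree a) -> Filter (NTree a) (NTree a)
multi f t = f t ++ concatMap (multi f) (getChildren t)
```

On a tree where an even-labelled node contains another even-labelled node, deep reports only the outer match, while multi reports both.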

4.5 Arrows

We've already seen that filters

a -> [b]

are a very powerful and sometimes more elegant way to process XML than pure functions. This is the good news. The bad news is that filters are not general enough. Of course we sometimes want to do some I/O while staying at the filter level. So we need something like

type XmlIOFilter = XmlTree -> IO [XmlTree]

for working in the IO monad.

Sometimes it's appropriate to thread some state through the computation, as in state monads. This leads to a type like

type XmlStateFilter state = state -> XmlTree -> (state, [XmlTree])
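A tiny self-contained illustration of such a state-threading filter type (with Int standing in for both the state and the simplified "tree" type; not HXT code):

```haskell
-- The state is threaded through and returned next to the result list.
type StateFilter s a b = s -> a -> (s, [b])

-- Example: keep even inputs, counting how many have been kept so far.
countEvens :: StateFilter Int Int Int
countEvens n x
  | even x    = (n + 1, [x])
  | otherwise = (n, [])
```

Each call returns both the updated counter and the (possibly empty) list of results, so the caller can feed the new state into the next filter application.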

And in real-world applications we need both extensions at the same
time. Of course I/O is necessary, but usually there are also some
global options and variables for controlling the computations. In HXT,
for instance, there are variables for controlling trace output, options
for setting the default encoding scheme for input data, and a base URI
for accessing documents that are addressed by relative URIs in the
content or in a DTD part. So we need something like

type XmlIOStateFilter state = state -> XmlTree -> IO (state, [XmlTree])

We want to work with all four filter variants, and in the future
perhaps with even more general filters, but of course not with four
sets of filter names, e.g.

deep, deepST, deepIO, deepIOST

This is the point where newtypes and classes come in. Classes are
needed for overloading names, and newtypes are needed to declare
instances. Furthermore, the restriction to XmlTree as argument and
result type is not necessary and hinders reuse in many cases.

The filters discussed above have all the features of an arrow. Arrows were
introduced to generalise the concept of functions and function
combination to more general kinds of computation than pure functions.

A basic set of combinators for arrows is defined in the classes of the Control.Arrow module, containing the above-mentioned (>>>), (<+>) and arr.

In HXT the additional classes for filters working with lists as result type are defined in Control.Arrow.ArrowList. The choice operators are in Control.Arrow.ArrowIf; tree filters like getChildren, deep, multi, ... in Control.Arrow.ArrowTree; and the elementary XML-specific filters in Text.XML.HXT.XmlArrow.

In HXT there are four types instantiated with these classes for
pure list arrows, list arrows with a state, list arrows with IO
and list arrows with a state and IO.

The mini XML document in file hello.xml is read and
a document tree is built. Then this tree is converted into a string
and written to standard output (filename: -). It is decorated
with an XML declaration containing the version and the output
encoding.
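The program described here is not shown above; a sketch of what it presumably looks like with the hxt package's Text.XML.HXT.Core API (the option choices are assumptions, the file names follow the description):

```haskell
module Main
where

import Text.XML.HXT.Core

main :: IO ()
main
  = do
    runX ( readDocument  [withValidate no] "hello.xml"  -- read file, build the tree
           >>>
           writeDocument [withIndent yes]  "-"          -- serialize to stdout ("-")
         )
    return ()
```

readDocument builds the document tree, and writeDocument converts it back to text, adding the XML declaration with version and output encoding as described.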

For processing HTML documents there is an HTML parser that tries to
parse and interpret almost anything as HTML. The HTML parser can be
selected by calling

readDocument [withParseHTML yes,...]

The available read and write options can be found in the hxt module Text.XML.HXT.Arrow.XmlState.SystemConfig.

5.2 Pattern for a main program

A more realistic pattern for a simple Unix-filter-like program has
the following structure:
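The pattern itself is not shown above; a sketch of its presumable shape with hxt follows. The numbered comments match the explanation below; the helper names application and processDocumentRootElement, and the exact option lists, are assumptions:

```haskell
module Main
where

import Text.XML.HXT.Core
import System.Environment ( getArgs )
import System.Exit        ( exitWith, ExitCode(..) )

main :: IO ()
main
  = do
    [src, dst] <- getArgs                 -- boring option evaluation, simplified
    [rc]       <- runX (application src dst)
    exitWith (if rc >= c_err              -- return code computation
                then ExitFailure 1
                else ExitSuccess)

application :: String -> String -> IOSArrow b Int
application src dst
  = configSysVars [withTrace 0]                                   -- (0) global config
    >>>
    readDocument [] src                                           -- (1) read input
    >>>
    processChildren (processDocumentRootElement `when` isElem)    -- (2) the real work
    >>>
    writeDocument [] dst                                          -- (3) write result
    >>>
    getErrStatus                                                  -- extract return code

processDocumentRootElement :: IOSArrow XmlTree XmlTree
processDocumentRootElement
  = this                                  -- identity for now; replaced in later examples
```

The arrow in (2) is the only application-specific part; everything else is reusable scaffolding.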

This program has the same functionality as our first example,
but it separates the arrow from the boring option evaluation and
return code computation.

In line (0) the system is configured with the list of options.
These options are then used as defaults for all read and write operations.
They can be overridden for single read/write calls
by putting config options into the parameter list of the
read/write function calls.

The interesting line is (1). readDocument generates a tree structure with a so-called extra root node. This root node is a node above the XML document root element. It is necessary because other nodes may occur on the same tree level as the XML root, for instance comments, processing instructions or whitespace.

Furthermore the artificial root node stores meta information about the document in its attribute list, such as the document name, the encoding scheme, the HTTP transfer headers and other data.

To process the real XML root element, we have to take the children of the root node, select the XML root element and process it, but leave all other children unchanged. This is done with processChildren and the when choice operator. processChildren applies a filter elementwise to all children of a node. All results from processing the list of children form the children of the result node.
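The semantics of when can be sketched in plain Haskell filter terms (not HXT's arrow code):

```haskell
type Filter a = a -> [a]

-- The identity filter.
this :: Filter a
this x = [x]

-- f `when` g: apply f only if the guard g succeeds, otherwise keep the input.
when :: Filter a -> Filter a -> Filter a
(f `when` g) x
  | null (g x) = this x
  | otherwise  = f x
```

So processChildren (processRootElement `when` isElem) transforms only the element child and passes comments, processing instructions and whitespace through unchanged.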

The structure of the internal document tree can be made visible, e.g. by adding the option withShowTree yes to the writeDocument arrow in (3). This will emit the tree in a readable text representation instead of the real document.

In the next section we will give examples for the processDocumentRootElement arrow.

5.3 Tracing

There are tracing facilities to observe the actions performed and to show intermediate results.

In (0) the system trace level is set to 1; at the default level 0
all trace messages are suppressed. The three trace messages (1)-(3)
will be issued, and readDocument and writeDocument will also
log their activities.

How a whole document and the internal tree structure can be traced is shown in the following example.

The whole tree is searched for text nodes (1) and for image elements (2). From the image elements
the alt attribute values are selected as plain text (3), and this text is transformed into a text node (4).

6.3 Selecting text and ALT attribute values (2)

Let's refine the above filter one step further. The text from the alt attributes shall be marked in the output
by surrounding double square brackets. Empty alt values shall be ignored.

The element constructor used here takes three arguments: the element name (or tag name), a list of arrows for the construction of the attributes (not empty in (3)), and a list of arrows for the contents. Text content is generated in (2) and (4).

(2) is the column from the previous example, but the URL has been made active
by embedding it in an A-element (2.1). In (3) there are two new combinators.
(&&&) (3.1) is an arrow for applying two arrows to the same input and combining
the results into a pair. arr2 works like arr, but it lifts a binary function
into an arrow accepting a pair of values; arr2 f is a shortcut for
arr (uncurry f). So width and height are combined into an X11-like
geometry spec. (4) adds the ALT-text.
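Both combinators exist for ordinary function arrows too, so their behaviour can be checked with plain Control.Arrow code (arr2 is HXT's name; here we define it ourselves exactly as described above, and the factors in widthAndHeight are a made-up example):

```haskell
import Control.Arrow

-- arr2 f = arr (uncurry f): lift a binary function to an arrow on pairs.
arr2 :: Arrow a => (b1 -> b2 -> c) -> a (b1, b2) c
arr2 = arr . uncurry

-- (&&&) feeds one input to both arrows and pairs the results.
widthAndHeight :: Int -> (Int, Int)
widthAndHeight = (* 4) &&& (* 3)

-- Combine a pair into an X11-like geometry string, e.g. "800x600".
geometry :: (Int, Int) -> String
geometry = arr2 (\w h -> show w ++ "x" ++ show h)
```

For instance, (widthAndHeight >>> geometry) 200 first builds the pair (800, 600) with (&&&) and then formats it with the lifted binary function.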

7.4 A page about all images within an HTML page: 2. Refinement

The generated HTML page is not yet very useful, because it usually
contains relative HREFs to the images, so the links do not work.
We have to transform the SRC attribute values into absolute URLs.
This can be done with the following code:

(1): This arrow uses the global system state of HXT, in which the base URL
of a document is stored. For editing the SRC attribute value, the attribute list
of the image elements is processed with processAttrl. With `when` hasName "src",
only SRC attributes are manipulated (3). The real work is done in (4): the URL,
a text node, is selected with getChildren and converted into a string (xshow);
then the URL is transformed into an absolute URL with mkAbsURI (6). This arrow
may fail, e.g. in the case of illegal URLs; in this case the URL remains
unchanged (`orElse` this).

The resulting String value is converted into a text node forming the new
attribute value node (7).

Because of the use of the global HXT state in mkAbsURI, mkAbsRef and imageTable2 need to have the more specialized signature

IOStateArrow s XmlTree XmlTree

8 Transformation examples

8.1 Decorating external references of an HTML document

In the following examples, we want to decorate the external references
in an HTML page by a small icon, like it's done in many wikis.
For this task the document tree has to be traversed, all parts
except the intersting A-Elements remain unchanged. At the end of the list of children of an A-Element we add an image element.

The element constructor takes an element name, a list of arrows for computing the attributes and a list of arrows for computing the contents. The content of the image element is empty (10). The attributes are constructed with sattr (9). sattr ignores the arrow input and builds an attribute from the name/value pair of arguments.
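As a sketch with hxt's element constructors (the icon path and alt text are made-up examples, not from the original program):

```haskell
import Text.XML.HXT.Core

-- Build an <img src="..." alt="..."/> element with empty content;
-- sattr builds a constant attribute from a name/value pair.
refIcon :: ArrowXml a => a b XmlTree
refIcon
  = mkelem "img"
      [ sattr "src" "/icons/ref.png"       -- hypothetical icon path
      , sattr "alt" "external reference"
      ]
      []                                   -- empty content list
```

Appending this arrow to the content arrows of an A-element adds the icon after the link text.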

8.2 Transform external references into absolute references

In the following example we will develop a program for
editing an HTML page such that all references to external documents
(images, hypertext refs, style refs, ...) become absolute references.
We will see some new but very useful combinators in the solution.

The task seems rather trivial: in a tree traversal
all references are edited with respect to the document base.
But in HTML there is a BASE element, allowed in the content of HEAD,
with an HREF attribute that defines the document base. Again, this
href can be a relative URL.

We start the development with the editing arrow. This gets
the real document base as argument.

Input to this arrow is the HTML element. (0) to (5) is the arrow for selecting
the BASE element's HREF value; in parallel the system base URL is read
with getBaseURI (6), as in the examples above. The resulting
pair of strings is piped into expandURI (7), the arrow version of
expandURIString. This arrow ((1) to (7)) fails in the absence
of a BASE element; in this case we take the plain document base (8).
The selection of the BASE elements is not yet very handy. We will define
a more general and elegant function later, allowing an element path as selection argument.

In the third step, we will combine the two arrows. For this we will use
a new combinator, ($<). The need for this new combinator
is the following: we need the arrow input (the document) twice,
once for computing the document base and a second time for editing the
whole document, and we want to compute the extra string parameter
for editing with the arrow defined above.
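The essence of ($<) can be sketched with plain list filters (not HXT's arrow implementation): the second filter computes the extra parameter(s) from the input, and the first filter is then applied to the same input for each parameter. The example names below are made up for illustration:

```haskell
type Filter a b = a -> [b]

-- f $< g: run g on the input, then run (f c) on the SAME input for every c.
($<) :: (c -> Filter a b) -> Filter a c -> Filter a b
f $< g = \x -> concat [ f c x | c <- g x ]

-- A parameterised filter: tag the whole input with a computed prefix.
tagWith :: String -> Filter String String
tagWith base doc = [base ++ ": " ++ doc]

-- Pretend the first four characters of the document are its "base".
docBase :: Filter String String
docBase doc = [take 4 doc]

example :: [String]
example = (tagWith $< docBase) "http://example.org"
```

Here docBase computes "http" from the input, and tagWith "http" is then applied to the very same input, which mirrors how the document base is computed from the document and then used to edit that document.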

Even the attribute selection can be expressed by XPath,
so (1) and (2) can be combined into

computeBaseRef
  = (( xshow (getXPathTrees "/html/head/base/@href")...

The extra xshow is required here to convert the XPath result, an XmlTree, into a string.

XPath defines a
full language for selecting parts of an XML document.
Sometimes it's rather comfortable to make selections of this
type, but the XPath evaluation in general is more expensive
in time and space than a simple combination of arrows, like we've