HXT/Conversion of Haskell data from/to XML

From HaskellWiki

1 Serializing and deserializing Haskell data to/from XML

With so-called pickler functions and arrows, it becomes rather easy
and straightforward to convert native Haskell values to XML and vice
versa. The module Text.XML.HXT.Arrow.Pickle and its submodules
contain a set of picklers (conversion functions) for simple data types
and pickler combinators for complex types.

2 The idea: XML pickler

For conversion of native Haskell data to and from external
representations, two functions are necessary: one for generating the external
representation and one for reading/parsing the representation. The Read and Show classes often provide such a pair of functions.

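A so-called pickler is a value with two such conversion functions, but it needs to keep track of the external representation during encoding and decoding, too. So the simplest form of a pickler converting between a value of type a and a sequence of Chars is a pair of functions that thread this sequence through as state: one adds the encoding of a value to it, the other consumes a value from it.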
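The idea can be sketched with a toy pickler over a plain String. This is only an illustration that uses base and nothing from HXT; the names St, PU, appPickle, appUnPickle, xpChar, xpPairToy, pickle and unpickle are chosen for this sketch and are not taken from the HXT sources.

```haskell
-- Toy pickler over a String. An illustration of the idea only,
-- not the definition used by HXT.
type St = String

data PU a = PU
  { appPickle   :: (a, St) -> St        -- add the encoding of a value to the state
  , appUnPickle :: St -> Maybe (a, St)  -- consume a value, return the remaining state
  }

-- Primitive pickler for a single character.
xpChar :: PU Char
xpChar = PU
  { appPickle   = \(c, st) -> c : st
  , appUnPickle = \st -> case st of
      (c : rest) -> Just (c, rest)
      []         -> Nothing
  }

-- Combinator: pickle a pair by running two picklers in sequence.
xpPairToy :: PU a -> PU b -> PU (a, b)
xpPairToy pa pb = PU
  { appPickle   = \((a, b), st) -> appPickle pa (a, appPickle pb (b, st))
  , appUnPickle = \st -> do
      (a, st1) <- appUnPickle pa st
      (b, st2) <- appUnPickle pb st1
      return ((a, b), st2)
  }

-- Run a pickler against an empty initial state.
pickle :: PU a -> a -> St
pickle p v = appPickle p (v, "")

-- Run an unpickler and drop whatever input is left over.
unpickle :: PU a -> St -> Maybe a
unpickle p = fmap fst . appUnPickle p
```

Both directions are defined in one value, so the encoder and the decoder cannot drift apart; combinators such as xpPairToy build larger picklers from smaller ones.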

In a programming pearl paper [1] Andrew Kennedy has described how to define primitive picklers plus a set of pickler combinators to (de-)serialize from and to (Byte-)Strings.

The HXT picklers are an adaptation of these pickler combinators. The difference from Kennedy's approach is that the external representation is not a list of Chars but a list of XmlTrees. The basic picklers for the primitive types (Int, Bool, ...) convert simple values into XML text nodes, and the picklers for creating XML element and attribute nodes are new.

In XML there are two places for storing information: The attributes and the element contents. Furthermore, the pickler contains a third component for
type information. This enables the derivation of a DTD from a set of picklers, but in the following examples we do not need this component. With the predefined picklers and pickler combinators, we don't have to look very much into these internals. Let's start with an example.

3 Example: Processing baseball league data

3.1 The XML data structure

In this first example we are dealing with baseball league data, taken from the so-called XML Bible. The complete source for this example is included in the
HXT distribution in directory examples/arrows/AGentleIntroductionToHXT/PicklerExample/. First let's get some idea about the structure of the XML data. The structure is not defined by a DTD or schema, so we have to guess some things. Here is a part of the example XML file:

3.2 The Haskell data model

Let's first analyze the underlying data model and then define an
appropriate set of Haskell data types for the internal representation.

The root type is a Season, consisting of a year and a set of Leagues.

The Leagues are all identified by a String and consist of a set of Divisions, so it's a Map.

The Divisions are also identified by a String and consist of a list of Teams, so it's again a Map.

A Team has three components: a teamName, a city, and a list of Players.

A Player has a lot of attributes. To keep the example simple, the internal model does not take all of them into account. Just six fields are included: the firstName, the lastName, the position, atBats, hits and era. All others are ignored.
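The model described above could be written down roughly as follows. The type and field names follow the text; the field types, in particular the use of Maybe for the numeric fields and the selector names sYear and sLeagues, are assumptions of this sketch, and exampleSeason contains made-up placeholder data.

```haskell
import           Data.Map (Map)
import qualified Data.Map as M

-- Root: a year plus the leagues of that season.
data Season = Season
  { sYear    :: Int
  , sLeagues :: Leagues
  } deriving (Show, Eq)

-- Leagues and divisions are both looked up by name.
type Leagues   = Map String Divisions
type Divisions = Map String [Team]

data Team = Team
  { teamName :: String
  , city     :: String
  , players  :: [Player]
  } deriving (Show, Eq)

data Player = Player
  { firstName :: String
  , lastName  :: String
  , position  :: String
  , atBats    :: Maybe Int    -- assumption: the value may be missing in the XML
  , hits      :: Maybe Int    -- assumption, as above
  , era       :: Maybe Float  -- assumption, as above
  } deriving (Show, Eq)

-- Placeholder data, not taken from the example file.
exampleSeason :: Season
exampleSeason =
  Season 2000 $
    M.fromList
      [ ( "Some League"
        , M.fromList [("Some Division", [Team "Some Team" "Some City" []])]
        )
      ]
```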

3.3 The predefined picklers

HXT contains a class XmlPickler that declares a single value, xpickle,
so that the name xpickle can be overloaded for every type that has a pickler:

class XmlPickler a where
    xpickle :: PU a

For simple data types there is an instance for XmlPickler which uses the primitive pickler xpPrim for conversion from and to XML text nodes. This primitive pickler is available for all types supporting Read and Show:
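As a small sketch of xpPrim in use, the following defines an instance for a made-up enumeration and round-trips a value in memory with pickleDoc and unpickleDoc. The type Colour, the element name "colour" and the helper roundTrip are hypothetical; the code assumes the hxt package is installed.

```haskell
import Text.XML.HXT.Core

-- Hypothetical enumeration; Read and Show are all that xpPrim needs.
data Colour = Red | Green | Blue
  deriving (Show, Read, Eq)

-- The text node holds whatever show produces; read parses it back.
instance XmlPickler Colour where
  xpickle = xpPrim

-- Put the text node into an element so there is markup around it.
xpColour :: PU Colour
xpColour = xpElem "colour" xpickle

-- Pickle to an in-memory XmlTree and unpickle it again.
roundTrip :: Colour -> Maybe Colour
roundTrip = unpickleDoc xpColour . pickleDoc xpColour
```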

A Season value is mapped onto an element SEASON with xpElem.
This constructs/reads the XML SEASON element. The two components of Season are wrapped into a pair with xpWrap. xpWrap needs a pair of functions for a 1-1 mapping between Season and (Int, Leagues).
The first component of the pair, the year, is mapped onto an attribute YEAR.
The attribute value is handled with the predefined pickler for Int.
The second component, the Leagues, is handled by xpLeagues.

xpLeagues has to deal with a Map value. This can't be done directly, so the
Map value is converted to/from a list of pairs with xpWrap and (fromList, toList).
Then xpList is applied to the list of pairs. Each pair is represented by a LEAGUE
element; the name is mapped to an attribute NAME, and the divisions are handled by xpDivisions.

(xpText is used to encode attribute or tag text, but note that you must use xpText0 instead wherever the empty string is a legal value, because xpText doesn't handle the case of unpickling 'nothing' from the XML.)
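The description of the Season and Leagues picklers can be sketched as follows. To keep the sketch self-contained, Divisions is reduced to a list of division names, and the DIVISION element with its NAME attribute, the selectors sYear and sLeagues, and the helper roundTrip are assumptions of this sketch rather than the article's definitions.

```haskell
import           Data.Map (Map)
import qualified Data.Map as M
import           Text.XML.HXT.Core

data Season = Season
  { sYear    :: Int
  , sLeagues :: Leagues
  } deriving (Show, Eq)

type Leagues = Map String Divisions

-- Simplification for this sketch: a division is reduced to its name.
type Divisions = [String]

-- <SEASON YEAR="..."> ... </SEASON>
xpSeason :: PU Season
xpSeason =
  xpElem "SEASON" $
  xpWrap (uncurry Season, \s -> (sYear s, sLeagues s)) $
  xpPair (xpAttr "YEAR" xpickle) xpLeagues

-- The Map is pickled as a list of (name, divisions) pairs,
-- one LEAGUE element per pair.
xpLeagues :: PU Leagues
xpLeagues =
  xpWrap (M.fromList, M.toList) $
  xpList $
  xpElem "LEAGUE" $
  xpPair (xpAttr "NAME" xpText) xpDivisions

-- Assumed representation: <DIVISION NAME="..."/>
xpDivisions :: PU Divisions
xpDivisions = xpList $ xpElem "DIVISION" $ xpAttr "NAME" xpText

-- Pickle to an in-memory XmlTree and unpickle it again.
roundTrip :: Season -> Maybe Season
roundTrip = unpickleDoc xpSeason . pickleDoc xpSeason
```

The attribute values here are never empty, so xpText is sufficient; a field that may legitimately be the empty string would need xpText0.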

This application reads in the complete data used in HXT/Practical/Simple2 from the file simple2.xml
and unpickles it into a Season value.
This value is processed by processSeason (a dummy that just prints it out)
and pickled again into new-simple2.xml.

A program is a statement, and four variants of statement are defined: assignments, sequences, branches and loops. The expressions have five variants, covering constants, identifiers, unary and binary expressions.
The operators are realized as enumeration types.

For developing the picklers, there are two new aspects. This example contains sum data types and it's a recursive structure.

The root pickler is xpProgram, which wraps the main statement in a program element.
The program element is decorated with a fixed attribute defining a namespace declaration,
just to demonstrate the use of xpAddFixedAttr.

For the operators two variants are shown. The UnOp is converted with read/show (xpPrim);
the Op is represented in XML by a number (xpWrap (toEnum, fromEnum)).

The Expr and Stmt picklers are a bit more interesting. We have to select a pickler for every
constructor of the data type. This is done by mapping each variant to a number and then indexing a list of picklers
with this number. For all variants the values are converted with xpWrap into simple values or tuples,
and then these values are mapped to XML elements. The simple fields are encoded in attributes, the complex
(and recursive) ones are encoded as child elements.
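A reduced sketch of this technique, using xpAlt to select a pickler by constructor number. The Expr type here has only four constructors, and the constructor names, element names ("int", "var", "unex", "binex") and attribute names are assumptions of this sketch, not necessarily those of the original example.

```haskell
import Text.XML.HXT.Core

data UnOp = UPlus | UMinus | Neg
  deriving (Eq, Read, Show)

data Op = Add | Sub | Mul
  deriving (Eq, Enum, Show)

data Expr
  = IntConst Int
  | Var      String
  | UnExpr   UnOp Expr
  | BinExpr  Op Expr Expr
  deriving (Eq, Show)

-- Variant 1: the operator name is written with show and parsed with read.
instance XmlPickler UnOp where
  xpickle = xpPrim

-- Variant 2: the operator is stored as its Enum number.
instance XmlPickler Op where
  xpickle = xpWrap (toEnum, fromEnum) xpPrim

-- tag maps each constructor to an index into the list ps.
xpExpr :: PU Expr
xpExpr = xpAlt tag ps
  where
    tag (IntConst _)    = 0
    tag (Var _)         = 1
    tag (UnExpr _ _)    = 2
    tag (BinExpr _ _ _) = 3
    -- Each partial lambda is only ever applied to the constructor
    -- that tag selected for its list position.
    ps =
      [ xpWrap (IntConst, \(IntConst i) -> i) $
          xpElem "int" (xpAttr "value" xpickle)
      , xpWrap (Var, \(Var n) -> n) $
          xpElem "var" (xpAttr "name" xpText)
      , xpWrap (uncurry UnExpr, \(UnExpr op e) -> (op, e)) $
          xpElem "unex" (xpPair (xpAttr "op" xpickle) xpExpr)
      , xpWrap (\(op, l, r) -> BinExpr op l r, \(BinExpr op l r) -> (op, l, r)) $
          xpElem "binex" (xpTriple (xpAttr "op" xpickle) xpExpr xpExpr)
      ]
```

When unpickling, the alternatives are tried in list order, which works here because every variant uses a distinct element name.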

The withRemoveWS configuration option is necessary because
the XML document was formatted and filled up with redundant
whitespace when written.
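A minimal sketch of such a write-then-read cycle, assuming the hxt package is installed: the document is written with indentation and read back with withRemoveWS switched on. The pickler xpEntry, its element and attribute names, and the helper writeThenRead are hypothetical.

```haskell
import Text.XML.HXT.Core

-- Hypothetical pickler: <entry name="..."><count>n</count></entry>
xpEntry :: PU (String, Int)
xpEntry =
  xpElem "entry" $
  xpPair (xpAttr "name" xpText) (xpElem "count" xpickle)

-- Write a value as an indented document, then read the file back.
writeThenRead :: FilePath -> (String, Int) -> IO [(String, Int)]
writeThenRead file v = do
  -- withIndent inserts whitespace text nodes between the elements.
  _ <- runX (constA v >>> xpickleDocument xpEntry [withIndent yes] file)
  -- withRemoveWS strips that whitespace again before unpickling.
  runX (xunpickleDocument xpEntry [withValidate no, withRemoveWS yes] file)
```

runX returns a list of results, so a failed unpickling shows up as an empty list rather than as an exception.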

5 A few words of advice

These picklers are a powerful tool for de-/serializing from/to XML.
Only a few lines of code are needed for serializing as well as for
deserializing.
But they are absolutely intolerant when dealing with invalid XML.
They are intended to read machine-generated XML, ideally generated by the same pickler.
When unpickling hand-written XML or XML generated by foreign tools, please validate the XML
before reading, preferably with RelaxNG or XML Schema, because these schema languages
are more expressive than DTDs.

When designing picklers, one must be careful to put enough markup
into the XML structure to read the XML back without the need
for lookahead and without any ambiguities. The simplest case of a pickler that does not work is a pair of primitive picklers, e.g. for some text. In this case
both texts are written out and concatenated into a single string. When parsing the XML, there will only be a single string, and the pickler will fail because of a missing value for the second component. So at the very least, every primitive pickler must be combined with an xpElem or xpAttr.
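The difference can be sketched by serializing to a string and parsing that string again, which is what happens when a document is written and read. The helper viaString and the picklers ambiguous and marked are hypothetical names for this sketch; the code assumes the hxt package is installed.

```haskell
import Text.XML.HXT.Core

-- Pickle a value, serialise the tree to a String, parse the String
-- again and unpickle the result.
viaString :: PU a -> a -> [Maybe a]
viaString p v =
  [ unpickleDoc p t
  | s <- runLA (xshow getChildren) (pickleDoc p v)
  , t <- runLA xread s
  ]

-- Two text picklers side by side: the two strings end up in one
-- text node once the document is serialised.
ambiguous :: PU (String, String)
ambiguous = xpElem "pair" $ xpPair xpText xpText

-- Each text is wrapped in its own element, so the boundary survives.
marked :: PU (String, String)
marked =
  xpElem "pair" $
  xpPair (xpElem "fst" xpText) (xpElem "snd" xpText)
```

With a value such as ("ab", "cd"), viaString marked should give the original pair back, while viaString ambiguous is expected to fail, because after parsing only the single text "abcd" is left for two picklers.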

It's possible to define several picklers per data type,
and picklers can be used one way only, just for serializing into XML/HTML.
So this approach can also be used to easily generate parts of an HTML document.

Please do not try to convert a whole large database into a single XML file
with this approach. This will run into memory problems when reading the data,
because of the DOM approach used in HXT. In the HXT distribution, there is
a test case in the examples directory performance, where the pickling and unpickling is done
with XML documents containing 2 million elements. This is the limit for a 1G Intel box (tested with ghc 6.8).

There are two strategies to overcome these limitations. The first is a SAX-like
approach: reading in simple tags and text elements without building a tree structure,
and writing the data directly into a database.
For this approach the TagSoup package can be useful. The disadvantage is the programming
effort for collecting and converting the data.

The second and recommended way is to split the whole bunch of data into smaller pieces, unpickle these and
link the resulting documents together by the use of hrefs.

6 More Examples

Examples dealing with direct conversion to/from XML without
the use of picklers can be found under HXT/Practical.

This is an example for reading and writing XML without the use of
picklers. It was developed before the picklers were added to HXT.
The code shows that it's much more effort to implement a conversion
than with the technique described above.