Using Perl with XML (part 1)

Converted your little black book into XML, and don’t know what to do next? This article gets you started on the path to being an XML guru, demonstrating how to use Perl’s SAX parser to parse and convert your XML into Web-friendly HTML.

Unless you’ve spent the last couple of years under a rock, you know about XML,
the Extensible Markup Language, and its extended family of related technologies
– XSLT, XPath, XLink, WDDX et al. You may even have used it a couple of times,
creating XML documents to describe your address book or your CD collection, and
marveled at how simple the process is.

It’s at this point that most
novice developers hit a brick wall. Sure, they know the theory of creating an
XML document, and they may even understand why the technology is so widely
discussed and praised. But when it comes to actually doing something with it –
converting that XML-encoded CD collection into something that can be read by a
browser, for example – there’s simply not enough information out there to
provide guidance on how to take the next step.

Over the next few pages, I
will be attempting to rectify this a little bit, with an explanation of how you
can convert your XML data into browser-readable HTML. Since XML is nothing but
plain text, it makes sense to use a language which specializes in parsing text
documents…which is where Perl, that granddaddy of text processing languages,
comes in. There are a couple of basic approaches to parsing XML data; this
two-part article will explore the Perl implementation of each, together with
examples and illustrations.

I’ll try and keep it simple – I’m going to
use very simple XML sources, so you don’t have to worry about namespaces, DTDs
and PIs – although I will assume that you know the basic rules of XML markup,
and of Perl scripting. So let’s get this show on the road.

Getting Down To Business

Before we get into the nitty-gritty of XML
parsing with Perl, I’d like to take some time to explain how all the pieces fit
together.

In case you don’t already know, XML is a markup language
created to help document authors describe the data contained within a document.
This description is accomplished by means of tags, very similar in appearance to
regular HTML markup. However, where HTML depends on pre-defined tags, XML allows
document authors to create their own tags, immediately making it more powerful
and flexible. There are some basic rules to be followed when creating an XML
file, and a file can only be processed if these rules are followed to the
letter.

Once a file has been created, it needs to be converted, or
“transformed”, from pure data into something a little more readable. XSL, the
Extensible Style Language, is typically used for such transformations; it’s a
powerful language that allows you to generate different output from the same XML
data source. For example, you could use different XSL transformations to create
an HTML Web page, a WML deck, and an ASCII text file…all from the same source
XML.

There’s only one catch here: most browsers don’t come with an XML
parser or an XSL processor. The latest versions of Internet Explorer and
Netscape Gecko do support XML, but older versions don’t. Which raises the
obvious question: how do you use an XML data source to generate HTML for these
older browsers?

The solution is to insert an additional layer between the
client and the server, which takes care of parsing the XML and returning the
rendered output to the browser. And that’s where Perl comes in – it supports XML
parsing, through add-on DOM and XML packages, and even has a package to handle
XSL transformations through the Sablotron processor.

As I’ve said
earlier, there are two methods to parse XML data with Perl, and each one has
advantages and disadvantages. I’ll explain both approaches, together with simple
examples to demonstrate how to use them in your own applications.

Let’s Talk About SAX

The first of these approaches is SAX, the Simple API
for XML. A SAX parser works by traversing an XML document and calling specific
functions as it encounters different types of tags. For example, I might call a
specific function to process a starting tag, another function to process an
ending tag, and a third function to process the data between them.

The
parser’s responsibility is simply to parse the document; the functions it calls
are responsible for processing the tags found. Once the tag is processed, the
parser moves on to the next element in the document, and the process repeats
itself.

Perl has an event-based, SAX-style parser built on the expat library created
by James Clark; it’s implemented as a Perl package named XML::Parser, and
currently maintained by Clark Cooper. The package is an add-on rather than part
of the core Perl distribution, so if you don’t already have it, you should
download and install it before proceeding further; you can get a copy
from CPAN (http://www.cpan.org/).

Once my data is in XML-compliant format, I need to decide
what I’d like the final output to look like.

Let’s say I want it to look like a simple
table containing columns for the book title, author, price and rating. (I’m not
using all the information in the XML file.) The title of the book is printed in
italics, while the numerical rating is converted into something more
readable.
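
The scripts in this article read a file named library.xml. As a stand-in, here is a hypothetical sample: the element names (library, book, title, author, publisher, price, rating) and every value in it are assumptions chosen to match the table columns described above, not the original data. This short script simply writes the sample to disk so that there is something to parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: element names and values are illustrative
# assumptions, chosen to match the table columns described in the text.
my $xml = <<'END_XML';
<?xml version="1.0"?>
<library>
  <book>
    <title>An Example Title</title>
    <author>A. N. Author</author>
    <publisher>Example Press</publisher>
    <price>9.99</price>
    <rating>4</rating>
  </book>
  <book>
    <title>Another Example</title>
    <author>J. Q. Writer</author>
    <publisher>Example Press</publisher>
    <price>24.50</price>
    <rating>2</rating>
  </book>
</library>
END_XML

# Write the sample to library.xml in the current directory
open(my $fh, '>', 'library.xml') or die "Cannot write library.xml: $!";
print {$fh} $xml;
close($fh) or die "Cannot close library.xml: $!";
```

The publisher element is there only to stand for the information the table leaves out.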

Next, I’ll write some Perl code to take care of this for
me.

Breaking It Down

The first order of business is to
initialize the XML parser, and set up the callback functions.

The parser is initialized in the ordinary way – by
instantiating a new object of the Parser class. This object is assigned to the
variable $xp, and is used in subsequent function calls.

# load the module and initialize the parser
use XML::Parser;
my $xp = XML::Parser->new();

The next step is to specify the functions to be executed when
the parser encounters the opening and closing tags of an element. The
setHandlers() method is used to specify these functions; it accepts a hash of
values, with keys containing the events to watch out for, and values indicating
which functions to trigger.

In this case, the user-defined functions start() and end()
are called when starting and ending element tags are encountered, while
character data triggers the cdata() function.
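
Here is a minimal sketch of that setHandlers() call. The Start, End and Char keys are XML::Parser’s own event names; the bodies of start(), end() and cdata() are placeholders of mine that just record what they were called with, and the inline one-book document is made up for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @events;    # records what the parser reports, in order

# Placeholder callbacks. XML::Parser passes the parser (Expat) object
# as the first argument to every handler.
sub start { my ($expat, $name, %attrs) = @_; push @events, "start:$name"; }
sub end   { my ($expat, $name)         = @_; push @events, "end:$name"; }
sub cdata { my ($expat, $text)         = @_; push @events, "text:$text"; }

my $xp = XML::Parser->new();

# Map event names (keys) to the functions that should handle them (values)
$xp->setHandlers(
    Start => \&start,
    End   => \&end,
    Char  => \&cdata,
);

# A tiny inline document, just to watch the callbacks fire
$xp->parse('<book><title>Example</title></book>');
print "$_\n" for @events;
```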

Obviously, these aren’t the
only types of events a parser can be set up to handle – the XML::Parser package
allows you to specify handlers for a diverse array of events; I’ll discuss these
briefly a little later.

The next step in the script above is to open the
XML file, read it and parse it via the parsefile() method. The parsefile()
method will iterate through the XML document, calling the appropriate handling
function each time it encounters a specific data type.

# parse XML
$xp->parsefile("library.xml");

In case your XML data is not stored in a file, but in a
string variable – quite likely if, for example, you’ve generated it dynamically
from a database – you can replace the parsefile() method with the parse()
method, which accepts a string variable containing the XML document, rather than
a filename.
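
A quick sketch of the string-based variant follows. The XML string here is made up; in practice it might come from a database query. The only handler I’ve set up is an anonymous one that notes the name of each element it sees.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# XML held in a string rather than a file (hypothetical data)
my $xml = '<library><book><title>Example</title></book></library>';

my @seen;    # element names, in document order
my $xp = XML::Parser->new();
$xp->setHandlers(
    Start => sub { my ($expat, $name) = @_; push @seen, $name; },
);

# parse() takes the document itself; parsefile() takes a filename
$xp->parse($xml);
print "Elements seen: @seen\n";
```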

Once the document has been completely parsed, the script will
proceed to the next line (if there is one), or terminate gracefully. A parse
error – for example, a mismatched tag or a badly-nested element – will cause the
script to die immediately.
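
If you’d rather report the problem than let the script die, one common Perl idiom is to wrap the call in an eval block. This is a sketch of that idiom, not something the original script does; the broken document is deliberately mismatched.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $xp = XML::Parser->new();

# Deliberately broken: <title> is never closed
my $broken = '<book><title>Example</book>';

# eval traps the die() raised by the parser; the trailing 1 makes
# the block return true only when parsing got all the way through
my $ok    = eval { $xp->parse($broken); 1 };
my $error = $@;

if ($ok) {
    print "Document parsed cleanly\n";
}
else {
    print "Could not parse document: $error\n";
}
```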

As you can see, this is fairly simple –
simpler, in fact, than the equivalent process in other languages like PHP or
Java. Don’t get worried, though – this simplicity conceals a fair amount of
power.

Call Me Back

As I’ve just explained, the start(),
end() and cdata() functions will be called by the parser as it progresses
through the document. We haven’t defined these yet – let’s do that next:

Each time the parser encounters a starting tag, it calls
start() with the parser object as the first argument, followed by the name of
the tag (and its attributes, if any). The
start() function then processes the tag, printing corresponding HTML markup in
place of the XML tag.

I’ve used an “if” statement, keyed on the tag name,
to decide how to process each tag. For example, since I know that <book>
indicates the beginning of a new row in my desired output, I replace it with a
<tr>, while other elements like <title> and <author>
correspond to table cells, and are replaced with <td> tags.

In case
you’re wondering, I’ve used the lc() function to convert the tag name to
lowercase before performing the comparison. XML tag names are case-sensitive,
so this isn’t strictly required; it’s a convenience that keeps the comparisons
consistent and lets the script work with XML documents that use
upper-case or mixed-case tags.

Finally, I’ve also stored the current tag
name in the global variable $currentTag – this can be used to identify which tag
is being processed at any stage, and it’ll come in useful a little further
down.
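
A minimal sketch of a start() function along these lines is below. The book, title and author mappings come from the description above. Treating price and rating as table cells, and opening the table on a library root element, are my assumptions. The two calls at the bottom stand in for the parser so the function can be tried on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $currentTag = '';    # name of the tag currently being processed

# Called for each opening tag. The first argument is the parser object,
# which this sketch does not need.
sub start {
    my ($expat, $name, %attributes) = @_;

    # Normalize case, then remember which tag we are inside
    my $tag = lc($name);
    $currentTag = $tag;

    if ($tag eq 'library') {          # assumed root element
        print "<table border=\"1\">\n";
    }
    elsif ($tag eq 'book') {          # each book becomes a table row
        print "<tr>";
    }
    elsif ($tag eq 'title' || $tag eq 'author'
        || $tag eq 'price' || $tag eq 'rating') {
        print "<td>";                 # each field becomes a table cell
    }
}

# Direct calls, standing in for the parser
start(undef, 'book');
start(undef, 'Title');
print "\n";
```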

The end() function takes care of closing tags, and looks similar to
start() – note that I’ve specifically cleaned up $currentTag at the end.
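
The matching end() might look something like this. It is a sketch under the same assumptions as before (a library root element, and price and rating treated as cells). The important bit is the last line of the function, which empties $currentTag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $currentTag = 'title';    # pretend start() has just seen <title>

# Called for each closing tag
sub end {
    my ($expat, $name) = @_;
    my $tag = lc($name);

    if ($tag eq 'library') {          # assumed root element
        print "</table>\n";
    }
    elsif ($tag eq 'book') {
        print "</tr>\n";
    }
    elsif ($tag eq 'title' || $tag eq 'author'
        || $tag eq 'price' || $tag eq 'rating') {
        print "</td>";
    }

    # We are no longer inside any tag we care about
    $currentTag = '';
}

# Direct calls, standing in for the parser
end(undef, 'title');
end(undef, 'book');
print "currentTag is now '$currentTag'\n";
```

Clearing the variable matters because the whitespace between one closing tag and the next opening tag is also reported as character data; with $currentTag empty, that whitespace isn’t mistaken for part of a field.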

The cdata() function is called whenever the parser encounters
character data between an XML tag pair. Note, however, that apart from the
parser object, the function is passed only the data itself; the arguments alone
don’t tell you which tags are around it.
However, since the parser works through the document in order, we can use the
$currentTag variable to identify which tag this data belongs
to.

Depending on the value of $currentTag, an “if” statement is used to
print data with appropriate formatting; this is the place where I add italics to
the title, a currency symbol to the price, and a text rating (corresponding to a
numerical index) from the @ratings array.
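
Here’s a sketch of what such a cdata() function could look like. The labels in @ratings are placeholders I’ve made up, and the dollar sign is an assumed currency symbol. As before, the calls at the bottom set $currentTag by hand to stand in for the parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $currentTag = '';

# Hypothetical rating labels, indexed by the numerical rating
our @ratings = ('Unrated', 'Poor', 'Fair', 'Good', 'Very good', 'Excellent');

# Called for character data; $currentTag says which element it belongs to
sub cdata {
    my ($expat, $data) = @_;

    if ($currentTag eq 'title') {
        print "<i>$data</i>";             # titles in italics
    }
    elsif ($currentTag eq 'author') {
        print $data;
    }
    elsif ($currentTag eq 'price') {
        print '$' . $data;                # assumed currency symbol
    }
    elsif ($currentTag eq 'rating') {
        # Look up a label if the rating is a known index,
        # otherwise print the value unchanged
        if ($data =~ /^\d+$/ && defined $ratings[$data]) {
            print $ratings[$data];
        }
        else {
            print $data;
        }
    }
}

# Direct calls, standing in for the parser
$currentTag = 'title';  cdata(undef, 'An Example Title'); print "\n";
$currentTag = 'price';  cdata(undef, '9.99');             print "\n";
$currentTag = 'rating'; cdata(undef, '4');                print "\n";
```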

Here’s what the finished script
(with some additional HTML, so that you can use it via CGI) looks like:
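
What follows is my reconstruction of such a script from the description so far, not the original listing. The library root element, the rating labels, the page title and the inline sample document are all assumptions. To keep the sketch self-contained it parses an inline string; to read the file instead, swap the parse() call for parsefile("library.xml").

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical rating labels, indexed by the numerical rating
our @ratings    = ('Unrated', 'Poor', 'Fair', 'Good', 'Very good', 'Excellent');
our %cells      = map { $_ => 1 } qw(title author price rating);
our $currentTag = '';

# Opening tags: print the HTML that starts a table, row or cell
sub start {
    my ($expat, $name) = @_;
    $currentTag = lc($name);
    if    ($currentTag eq 'library') { print "<table border=\"1\">\n"; }
    elsif ($currentTag eq 'book')    { print "<tr>"; }
    elsif ($cells{$currentTag})      { print "<td>"; }
}

# Closing tags: print the matching HTML, then forget the current tag
sub end {
    my ($expat, $name) = @_;
    my $tag = lc($name);
    if    ($tag eq 'library') { print "</table>\n"; }
    elsif ($tag eq 'book')    { print "</tr>\n"; }
    elsif ($cells{$tag})      { print "</td>"; }
    $currentTag = '';
}

# Character data: format it according to the element it belongs to
sub cdata {
    my ($expat, $data) = @_;
    if    ($currentTag eq 'title')  { print "<i>$data</i>"; }
    elsif ($currentTag eq 'author') { print $data; }
    elsif ($currentTag eq 'price')  { print '$' . $data; }
    elsif ($currentTag eq 'rating') {
        my $known = $data =~ /^\d+$/ && defined $ratings[$data];
        print $known ? $ratings[$data] : $data;
    }
}

# Print a CGI header and page wrapper around the parsed document
sub render_library {
    my ($xml) = @_;
    my $xp = XML::Parser->new();
    $xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

    print "Content-Type: text/html\n\n";
    print "<html><head><title>My Library</title></head><body>\n";
    $xp->parse($xml);    # or: $xp->parsefile("library.xml");
    print "</body></html>\n";
}

# Hypothetical one-book document
render_library(<<'END_XML');
<library>
  <book>
    <title>An Example Title</title>
    <author>A. N. Author</author>
    <publisher>Example Press</publisher>
    <price>9.99</price>
    <rating>4</rating>
  </book>
</library>
END_XML
```

Note that the sketch prints the data as-is; a real script would want to escape characters like < and & before dropping them into HTML.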

You can now add new items to your XML document, or edit
existing items, and your rendered HTML page will change accordingly. By
separating the data from the presentation, XML has imposed standards on data
collections, making it possible, for example, for users with no technical
knowledge of HTML to easily update content on a Web site, or to present data
from a single source in different ways.

Random Walk

In
addition to elements and CDATA, Perl also allows you to set up handlers for
other types of XML structures, most notably PIs, entities and notations (if you
don’t know what these are, you might want to skip this section and jump straight
into another, more complex example on the next page). As demonstrated in the
previous example, handlers for these structures are set up by specifying
appropriate callback functions via a call to the setHandlers() object
method.

Here’s a quick list of the types of events that the parser can
handle, together with a list of their key names (as expected by the
setHandlers() method) and a list of the arguments that the corresponding
callback function will receive.
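
The comment block in the sketch below lists some of the handler keys from the XML::Parser documentation, with the arguments each callback receives; it isn’t the complete set. The code underneath registers a handler for processing instructions. The function name pih() matches the discussion that follows, while the document, the "perl" target and the $greeting variable are assumptions of mine for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Some of the handler keys accepted by setHandlers(), with the arguments
# each callback receives (the first is always the parser/Expat object):
#   Start     => (Expat, Element, Attr, Val, ...)    opening tag
#   End       => (Expat, Element)                    closing tag
#   Char      => (Expat, String)                     character data
#   Proc      => (Expat, Target, Data)               processing instruction
#   Comment   => (Expat, Data)                       comment
#   Default   => (Expat, String)                     anything without a handler
#   Unparsed  => (Expat, Entity, Base, Sysid, Pubid, Notation)
#   Notation  => (Expat, Notation, Base, Sysid, Pubid)
#   ExternEnt => (Expat, Base, Sysid, Pubid)

our $greeting = '';    # set by the code inside the PI below

# Processing-instruction handler: run the PI's data as Perl code, but
# only when the target says it is Perl. Running eval() on document
# content is dangerous; never do this with XML you do not fully trust.
sub pih {
    my ($expat, $target, $data) = @_;
    if (lc($target) eq 'perl') {
        eval $data;
        warn "PI failed: $@" if $@;
    }
}

my $xp = XML::Parser->new();
$xp->setHandlers(Proc => \&pih);

# Hypothetical document containing one Perl processing instruction
$xp->parse(<<'END_XML');
<?xml version="1.0"?>
<doc>
  <?perl $main::greeting = "Hello from a processing instruction"; ?>
</doc>
END_XML

print "$greeting\n";
```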

In this case, the setHandlers() method knows that it has to
call the subroutine pih() when it encounters a processing instruction in the XML
data; this user-defined pih() function is automatically passed the PI target and
the actual command to be executed. Assuming the command is a Perl command – as
indicated by the target name – the function passes it on to eval() for
execution.

What’s For Dinner?

Here’s another, slightly more
complex example using the SAX parser, and one of my favourite meals.

This time, my Perl script won’t be using an “if” statement
when I parse the file above; instead, I’m going to be keying tag names to values
in a hash. Each of the tags in the XML file above will be replaced with
appropriate HTML markup.

In this case, I’ve set up two hashes, one for opening tags
and one for closing tags. When the parser encounters an XML tag, it looks up the
hash to see if the tag exists as a key. If it does, the corresponding value
(HTML markup) is printed. This method does away with the slightly cumbersome
branching “if” statements of the previous example, and is easier to read and
understand.
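
Here is a sketch of the two-hash technique. The recipe document and its element names (recipe, name, ingredients, item, method) are hypothetical stand-ins, as is the HTML I’ve mapped them to.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical element names, keyed to the HTML that opens each one
our %startTags = (
    recipe      => "<div>\n",
    name        => "<h1>",
    ingredients => "<ul>\n",
    item        => "<li>",
    method      => "<p>",
);

# ...and to the HTML that closes each one
our %endTags = (
    recipe      => "</div>\n",
    name        => "</h1>\n",
    ingredients => "</ul>\n",
    item        => "</li>\n",
    method      => "</p>\n",
);

# Look the tag up in the relevant hash; print the markup if it is there
sub start {
    my ($expat, $name) = @_;
    $name = lc($name);
    print $startTags{$name} if exists $startTags{$name};
}

sub end {
    my ($expat, $name) = @_;
    $name = lc($name);
    print $endTags{$name} if exists $endTags{$name};
}

# Character data passes straight through
sub cdata {
    my ($expat, $data) = @_;
    print $data;
}

my $xp = XML::Parser->new();
$xp->setHandlers(Start => \&start, End => \&end, Char => \&cdata);

# Hypothetical recipe, kept on one line to avoid stray whitespace
$xp->parse('<recipe><name>Example Curry</name><ingredients>'
         . '<item>Rice</item><item>Spices</item></ingredients>'
         . '<method>Simmer gently.</method></recipe>');
```

Supporting a new tag is now a matter of adding one entry to each hash; the handler functions don’t change.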

That’s about it for the moment. Over the last few
pages, I’ve discussed using Perl’s XML::Parser package to process an XML file
and mark up the data within it with HTML tags. However, just as there’s more
than one way to skin a cat, there’s more than one way to process XML data with
Perl. In the second part of this article, I’ll be looking at an alternative
technique of parsing an XML file, this time using the DOM. Make sure you come
back for that one!

Note: All examples in this article have been tested on
Linux/i586 with Perl 5.005. Examples are illustrative only, and are not meant
for a production environment. YMMV!