If you're a complete XML newbie and struggling with jargon like
'element', 'entity', 'DTD', 'well formed' etc, you could try
XML101.com. The site
has no Perl content and a strong Microsoft/IE bias but you can come
back here when you're finished :-)

On the other hand if you've worked with XML a bit and think you
pretty much know it, the tutorial at skew.org will
test the boundaries of your knowledge.

The reference documentation is embedded in the Perl code of
each module in 'POD' (Plain Old Documentation) format. There are
a number of ways you might gain access to this documentation:

The perldoc command will locate
the module file, extract the POD text and format it for reading on
screen. For example, if you want to read the documentation
for the XML::Parser module, you would type:
perldoc XML::Parser

Some Perl distributions (notably ActiveState Perl)
include the documentation in HTML format. Under Windows, you should find
this under: Start->Programs->ActiveState
ActivePerl. If your distribution does not
include the HTML files, you can create them using
pod2html

HTML documentation for many Perl modules is also
provided on various Internet sites. You can try searching for XML on
Perldoc.com or on search.cpan.org for a list of XML
documentation.

If all else fails, you can locate the module and open
it directly in a text editor. Once again, using
XML::Parser as an example, you would look for
a file called Parser.pm in a directory called
XML under lib. Once you
have opened the file you can search for '=head' to locate POD.

2. Selecting a Parser Module

2.1.

Don't select a parser module.

If you want to use Perl to solve a specific problem, it's
possible that someone has already solved it and published their module
on CPAN. This will allow you to ignore the details of the XML layer
and start working at a higher level. Here's a random selection of
CPAN modules which work with XML data but provide a higher level
API:

If you want to use XML to transmit data across a network
to use or provide 'web services', take a good look at
SOAP::Lite (forget about the 'Lite' moniker, this
is a serious piece of work).

Perhaps you've played around with the Glade GUI builder
and discovered it uses XML to store the interface definitions. The
Gtk2::GladeXML module already knows how to
read those files and turn them into a working GUI with only a few
lines of Perl code (see this
article for an intro).

Maybe you've had the brilliant idea that you could
serialise your Perl objects to XML format to support your new killer RPC
over SMTP protocol. Well, before you start coding, try installing SPOPS
(Simple Perl Object Persistence with Security) - you might find it already
does exactly what you need (and to go one step further, the
aforementioned SOAP::Lite actually supports SMTP
as a transport).

There are dozens of other examples of existing Perl modules which
work with XML data in domain-specific formats and allow you to get on
with the job of using that data. Remember, search.cpan.org is your friend.

2.2.

The Quick Answer

For general purpose XML processing with Perl, XML::LibXML
is usually the best choice. It is stable, fast and powerful. To make the most of the
module you need to learn and use XPath expressions. The documentation for XML::LibXML is
its biggest weakness.
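As a taste of what that looks like, here is a minimal sketch (the file name
and XPath expression are placeholders):

use XML::LibXML;

my $doc = XML::LibXML->new->parse_file('orders.xml');

# XPath does the searching - grab the text of every customer name
foreach my $node ($doc->findnodes('//order/customer/name')) {
    print $node->textContent, "\n";
}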

Other modules may be better suited to particular niches - as discussed below.

2.3.

Tree versus stream parsers

If you really do need to work with data in XML format, you need to
select a parser module and there are many to choose from. Most modules
can be classified as using either a 'tree' or a 'stream' model.

A tree based parser will typically parse the
whole XML document and return you a data structure made up of 'nodes'
representing elements, attributes, text content and other components of
the document.

A stream based parser on the other hand, sends
the data to your program in a stream of 'events' as the XML is
parsed.

A tree based module will typically provide you with an API of
functions for searching and manipulating the tree. The Document Object
Model (DOM) is a standard API implemented by a number of modules. Other
modules use non-standard APIs to take advantage of the many conveniences
available to Perl programmers.

To use a stream based module, you typically write some handler or
'callback' functions and register them with the parser. As each
component of the XML document is read in and recognised, the parser will
call the appropriate handler function to give you the data. SAX (the
Simple API for XML) is a standard object-oriented API implemented by all
the stream-based parsers (except parsers written before SAX
existed).

2.4.

Pros and cons of the tree style

Programmers new to XML may find it easier to get started with
a tree based parser - with one method call your document is parsed and
available to your code.

Portability may be an issue for tree style code. Even the modules
which support a DOM API differ enough that you will generally have to
change your code if you need to switch to another parser module. The DOM
itself is language-neutral, which may be an advantage if you're coming
from C or Java but Perl programmers may find some of the constructs
clumsy.

The memory requirements of a tree based parser can be surprisingly
high. Because each node in the tree needs to keep track of links to
ancestor, sibling and child nodes, the memory required to build a tree
can easily reach 10-30 times the size of the source document. You
probably don't need to worry about that though unless your documents are
multi-megabytes (or you're running on lower spec hardware).

Some of the DOM modules support XPath - a powerful expression
language for selecting nodes to extract data from your tree. The
full power of XPath simply cannot be supported by stream based parsers
since they only hold a portion of the document in memory.

2.5.

Pros and cons of the stream style

Stream based parsers can (but don't always) offer better memory
efficiency than tree based parsers, since the whole document does not
need to be held in memory at once.

SAX parsers also score well for portability. If you use the SAX
API with one parser module, you can almost certainly swap to another
SAX parser module without changing a line of your code.

The SAX approach encourages a very modular coding style. You can
chain SAX handlers together to form a processing pipeline - similar in
spirit to a Unix command pipeline. Each link (or 'filter') in the chain
performs a well-defined function. The individual components tend to have
a high degree of reusability.

SAX also has applications which are not tied to XML. Modules exist
that can generate SAX event streams from the results of database queries
or the contents of spreadsheet files. Downstream filter modules neither
know nor care whether the original source data was in XML format.

One 'gotcha' with the stream based approach is that you can't be
sure that a document is error-free until the end of the parse. For this
reason, you may want to confirm that a document is well-formed before
you pass it through your SAX pipeline.

2.6.

How to choose a parser module

Choice is a good thing - there are many parser modules to choose
from simply because no one solution will be appropriate in all cases.
Get stuck in; if you discover you made the wrong choice, it's
probably not going to be hard to change and you'll have some experience
on which to base your next choice. The following advice is one person's
view - your mileage may vary:

First of all, make sure you have XML::Parser
installed - but don't plan to use it. Other modules provide layers on
top of XML::Parser - use them. Further justification for this apparently
contradictory advice can be found in the XML::Parser description below.

If your needs are simple, try XML::Simple.
It's loosely classified as a tree based parser although the 'tree' is
really just nested Perl hashes and arrays. You may need to swot up on
Perl references (see: perldoc perlreftut) to take
advantage of this module.

If you're looking for a more powerful tree based approach, try
XML::LibXML for a standards compliant DOM or
XML::Twig for a more 'Perl-like' API. Both of
these modules support XPath.

If you've decided to use a stream based approach, head
directly for SAX. The XML::SAX distribution
includes a base class you can use for your filters as well as a very
portable parser module written entirely in Perl
(XML::SAX::PurePerl). You will probably also want
to install XML::SAX::Expat which uses the same
C-based parser library ('expat' by James Clark) as
XML::Parser, for faster parsing.

Finally, the latest trendy buzzword in Java and C# circles is
'pull' parsing (see www.xmlpull.org). Unlike SAX, which 'pushes' events at your
code, the pull paradigm allows your code to ask for the next bit when
it's ready. This approach is reputed to allow you to structure your code
more around the data rather than around the API. Eric Bohlman's
XML::TokeParser offers a simple but powerful
pull-based API on top of XML::Parser. There
are currently no Perl implementations of the XMLPULL API.

2.7.

Rolling your own parser

You may be tempted to develop your own Perl code for parsing XML.
After all, XML is text and Perl is a great language for working with
text. But before you go too far down that track, here are some points
to consider:

Smart people don't. (Actually a number of really smart
people have - that's why there's a range of existing parsers to choose
from).

It's harder than you think. The first major hurdle is
encodings. Then you'll have to handle
DTDs - even if you're not doing validation. The feature list will also
need to include numeric and named entities, CDATA sections, processing
instructions and well-formedness checks. You probably should support
namespaces too.

If you haven't done all that, it's not XML. It might
work for that subset of XML that you deem to be important, but if you
can't exchange documents with other parties, what's the
point?

Even if it works it will be slow.

If none of the existing modules have an API that suits your needs,
write your own wrapper module to extend the one that comes closest.

3. CPAN Modules

This section attempts to summarise the most commonly used XML modules
available on CPAN. Many of the modules require that you have certain
libraries installed and have a compiler available to build the Perl wrapper
for the libraries (binary builds are available for some platforms).
Except where noted, the parsers are non-validating.

You can find more in-depth comparisons of the modules and example
source code in the Ways to Rome articles
maintained by Michel Rodriguez.

3.1.

XML::Parser

XML::Parser is a Perl wrapper around James
Clark's 'expat' library - an XML parser written in C. This module was
originally written by Larry Wall and maintenance was taken over by Clark
Cooper. It is fast, complete, widely used and reliable.
XML::Parser offers both tree and stream
interfaces. The stream interface is not SAX, so don't use it for any new
code. The tree interfaces are not a lot of fun to work with either.
They're non-standard (no DOM support), not OO and offer no real API. The
reason you might want XML::Parser is because it
provides a solid base which is used by other modules with better
APIs.

Before you rush off and try to install
XML::Parser, make sure that you haven't got it
already - it comes standard with ActiveState's Perl on all supported
platforms. If you do need to install it, you'll need to install the
expat library first (which you can get from expat.sourceforge.net) and
you will need a compiler.

Most of the documentation you need for the
XML::Parser API can be accessed using
perldoc XML::Parser, however some of the low-level
functionality is documented in perldoc
XML::Parser::Expat.

3.2.

XML::LibXML

XML::LibXML provides a Perl wrapper around
the GNOME Project's libxml2 library. This module was originally written
by Matt Sergeant and Christian Glahn and is now actively maintained by
Petr Pajas. It is very fast, complete and stable. It can run in
validating or non-validating modes and offers a DOM with XPath support.
The DOM and associated memory management is implemented in C which offers
significant performance advantages over DOM trees built from Perl
datatypes. The XML::LibXML::SAX::Builder module
allows a libxml2 DOM to be constructed from SAX events.
XML::LibXML::SAX is a SAX parser based on the
libxml2 library.

XML::LibXML can also be used to parse HTML
files into DOM structures - which is especially useful when converting
other formats to XML or using XPath to 'scrape' data from web
pages.

The libxml2 library is not part of the
XML::LibXML distribution. Precompiled
distributions of the libxml2 library and the
XML::LibXML Perl wrapper are available for most
operating systems. The library is a standard package in most Linux
distributions; it can be compiled on numerous other platforms; and it is
bundled with PPM packages of XML::LibXML for
Windows.

For early access to upcoming features such as W3C Schema and RelaxNG
validation, you can access the CVS version of XML::LibXML at:

cvs -d :pserver:anonymous@axkit.org:/home/cvs co XML-LibXML

3.3.

XML::XPath

Matt Sergeant's XML::XPath module was the first
Perl DOM implementation to support XPath. It has largely been supplanted by
XML::LibXML which is better maintained and more
powerful.

3.4.

XML::DOM

Enno Derksen's XML::DOM implements the
W3C DOM Level
1 tree structure and API. It should not be your first choice of
DOM module however, since it lacks XPath and namespace support and it is
significantly slower than libxml.

TJ Mather is currently the maintainer of this package.

3.5.

XML::Simple

Grant McLean's XML::Simple was originally
designed for working with configuration files in XML format but it can be
used for more general purpose XML work too. The 'simple tree' data
structure is nothing more than standard Perl hashrefs and arrays - there
is no API for finding or transforming nodes. This module is not suitable
for working with 'mixed content'. XML::Simple has
its own frequently
asked questions document.
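A minimal sketch of typical usage (the file name and hash keys are
hypothetical):

use XML::Simple;

# Parse config.xml into nested hashes and arrays, then pull out one value
my $config = XMLin('config.xml');
print $config->{server}->{hostname}, "\n";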

Although XML::Simple uses a tree-style, the
module also supports building the tree from SAX events or using a simple
Perl data structure to drive a SAX pipeline.

3.6.

XML::Twig

Although DOM modules can be very convenient, they can also be
memory hungry. If you need to work with very large documents, you may
find XML::Twig by Michel Rodriguez to be a good
solution. Rather than parsing your whole document and returning one
large 'tree', this module allows you to define elements which can be
parsed as discrete units and passed to your code as 'twigs' (small
branches of a tree). You can also define whether the 'uninteresting
bits' between the twigs should be discarded or streamed to STDOUT as they
are processed.

Another advantage of XML::Twig is that it is
not constrained by the tyranny of DOM compliance. Instead, it offers a
number of conveniences to help the experienced Perl programmer feel right
at home. XML::Twig also supports XPath
expressions. The module's official home page is http://www.xmltwig.com/.

3.7.

Win32::OLE and MSXML.DLL

If you're using a Windows platform and you're not worried about
portability, Microsoft's MSXML provides a DOM implementation with
optional validation and support for both XPath and XSLT. MSXML is a
COM component and as such can be accessed from Perl using
Win32::OLE. Unfortunately this means you can't
get at the documentation with the usual perldoc
command, so here's a code snippet to get you started:
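(The sketch below assumes a reasonably recent MSXML release; the ProgID and
the XPath expression are placeholders.)

use Win32::OLE qw(in);

my $doc = Win32::OLE->new('MSXML2.DOMDocument')
    or die "Could not create an MSXML DOM object\n";

$doc->{async} = 0;
$doc->Load('file.xml')
    or die "Parse error: " . $doc->{parseError}->{reason};

foreach my $node (in $doc->selectNodes('//item')) {
    print $node->{text}, "\n";
}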

3.8.

XML::PYX

Although written in Perl, Matt Sergeant's
XML::PYX is really designed for working with XML
files using shell command pipelines. The PYX notation allows you to
apply commands like grep and sed to
specific parts of the XML document (eg: element names, attribute values,
text content). For example, this one-liner provides a report of how many
times each type of element is used in a document:

pyx doc.xml | sed -n 's/^(//p' | sort | uniq -c

This one creates a copy of an XML document with all
attributes stripped out:

pyx doc1.xml | grep -v '^A' | pyxw > doc2.xml

And this one spell checks the text content of a document skipping
over markup text such as element names and attributes:

pyx talk.xml | sed -ne 's/^-//p' | ispell -l | sort -u

Eric Bohlman's XML::TiePYX is an alternative
Perl PYX implementation which employs tied filehandles. One of the
aims of the design was to work around limitations of the Win9x
architecture which doesn't really support pipes. Using this module
you can print PYX format data to a filehandle
and well-formed XML will actually be written. Conversely, you can
read from an XML file and <FILEHANDLE> will
return PYX data.

Sean McGrath has written an article
introducing PYX on XML.com. PYX can be addictive - especially
if you're an awk or sed wizard, but if you find you're using Perl in
your pipelines you should consider switching to SAX.

3.9.

XML::SAX

The XML::SAX distribution includes a
number of things you're likely to need if you're working with SAX.

XML::SAX::ParserFactory is used to get a
parser object without having to know which parsers are installed.

Any SAX filters you develop should inherit from
XML::SAX::Base. This will save you time as well
as ensuring your filters 'play nicely' with other SAX components.

XML::SAX::PurePerl is a SAX parser module
written entirely in Perl. Its performance isn't great but if you
don't have a compiler on your system it may be your only option.

Authors of the Perl SAX spec and the modules which implement it
include Ken MacLeod, Matt Sergeant, Kip Hampton and Robin Berjon.

3.10.

XML::SAX::Expat

This module implements a SAX2 interface around the expat library,
so it's fast, stable and complete. XML::SAX::Expat doesn't link expat
directly but through XML::Parser. Hence, this
module requires XML::Parser, and doesn't compile
any XS code on installation. If you have XML::SAX
and XML::Parser installed, you'll want to install
this module to improve on the speed and encoding support offered by
XML::SAX::PurePerl.

Robin Berjon is the author of this module although he claims to
have stolen the code from Ken MacLeod.

3.11.

XML::SAX::ExpatXS

This module, analogous to XML::SAX::Expat,
implements a SAX2 interface around the expat library but it links
expat directly. This is why XML::SAX::ExpatXS
is even faster than XML::SAX::Expat. There is
no dependence on XML::Parser but either some XS
code must be compiled or a binary package installed.

This module was started by Matt Sergeant and completed by
Petr Cimprich.

3.12.

XML::SAX::Writer

The XML::SAX::Writer module is used to
generate XML output from a SAX2 pipeline. This module can receive
pluggable consumer and encoder objects. The default encoder is based
on Text::Iconv.

XML::SAX::Writer was created by
Robin Berjon.

3.13.

XML::SAX::Machines

Once you start accumulating SAX filter modules and fitting them
together into SAX pipelines, you may find writing the glue code is a
little tedious and error prone. Barrie Slaymaker found this and built
XML::SAX::Machines to solve the problem so you
don't have to. Using this module, a multi-stage pipeline can be wired up
in just a couple of lines of glue code:
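(A sketch; the filter class names are hypothetical, and the trailing
filehandle tells the machine to serialise the result to standard output.)

use XML::SAX::Machines qw(Pipeline);

Pipeline(
    'My::Filter::One',
    'My::Filter::Two',
    \*STDOUT,
)->parse_uri('doc.xml');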

3.14.

XML::XPathScript

XPathScript is a stylesheet language comparable to XSLT, for
transforming XML from one format to another (possibly HTML, but
XPathScript also shines for non-XML-like output).

Like XSLT, XPathScript offers a dialect to mix verbatim portions of
documents and code. Also like XSLT, it leverages the powerful
"templates/apply-templates" and "cascading stylesheets" design
patterns, which greatly simplify the design of stylesheets for
programmers. The availability of the XPath query language inside
stylesheets promotes the use of a purely document-dependent,
side-effect-free coding style. But unlike XSLT which uses its own
dedicated control language with an XML-compliant syntax, XPathScript
uses Perl which is terse and highly extendable.

As of version 0.13 of XML::XPathScript, the module can use either
XML::LibXML or XML::XPath
as its parsing engine. Transformations can be performed either using
a shell-based script or, in a web environment, within AxKit.

3.15.

How can I install XML::Parser under
Windows?

The ActiveState Perl distribution includes many CPAN modules in
addition to the core Perl module set. XML::Parser
is one of these additional modules, so you've already got it.

3.16.

How can I install other binary modules under Windows?

ActiveState Perl includes the 'Perl Package Manager' (PPM) utility
for installing modules. You use PPM from a command window (DOS prompt)
like this:

C:\> ppm
ppm> install XML::Twig

You must be connected to the Internet to use PPM as it connects
to ActiveState's web site to download the packages you select. Refer
to the HTML documentation which accompanies ActiveState Perl for
instructions on using PPM through a firewall.

One disadvantage of PPM is that you can only install packages
that have been packaged in PPM format. You don't have to wait for
ActiveState to package modules though as you can tell PPM to use other
package repositories. For example many of the XML modules are available
through Randy Kobes' archive, which can be accessed like this:
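(The exact syntax depends on your PPM version; with PPM3 it is along these
lines, and the repository name is arbitrary.)

C:\> ppm
ppm> rep add theoryx5 http://theoryx5.uwinnipeg.ca/ppms/
ppm> install XML-LibXML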

Many of the CPAN modules are written entirely in Perl and don't
require a compiler, so you can use the CPAN.pm
module/shell which comes with Perl. The only thing you need is
nmake - a Windows version of make which you can
download from:

It's a self extracting archive so run it and move the resulting
files into your windows (or winnt) directory.

Then go to a command window (DOS prompt) and run:

perl -MCPAN -e shell

The first time you run the CPAN shell, you will be asked a number
of questions by the automatic configuration process. Accepting the
default is generally pretty safe. You'll be asked where various programs
are on your system (eg: gzip, tar, ftp etc). Don't worry if you don't
have them since CPAN.pm will use the
Compress::Zlib,
Archive::Tar and Net::FTP
modules if they are installed - and they are part of the ActiveState Perl
distribution. Also don't worry if you make a mistake, you can repeat the
configuration process at any time by typing this command at the 'cpan>'
prompt:

o conf init

If you're behind a firewall, when you're asked for an FTP or HTTP
proxy, enter its URL like this:

http://your.proxy.address:port/

You can probably use http:// for both FTP and HTTP (depending on
your proxy).

After you've selected a CPAN archive near you, you will finally get
a 'cpan>' prompt. Then you can type:

install XML::SAX

and sit back while CPAN.pm downloads, unpacks, tests and installs
all the relevant code in all the right places.

3.18.

"could not find ParserDetails.ini"

A number of people have reported encountering the error "could not
find ParserDetails.ini in ..." when installing or attempting to use
XML::SAX. ParserDetails.ini
is used by XML::SAX::ParserFactory to determine
which SAX parser modules are installed. It should be created by the
XML::SAX installation script and should be updated
automatically by the install script for each SAX parser module.

If you are installing XML::SAX
manually you must run Makefile.PL. Unpacking the
tarball and copying the files into your Perl lib directory will not
work.

During the initial installation, if you are asked whether
ParserDetails.ini should be updated, always say yes.
If you say no, the file will not be created.

If you are using ActivePerl, the following command should
resolve the problem:

ppm install http://theoryx5.uwinnipeg.ca/ppms/XML-SAX.ppd

Once you have successfully installed
XML::SAX, you should consider installing a module
such as XML::SAX::Expat or
XML::LibXML to replace the slower pure-Perl parser
bundled with SAX.

If you are packaging XML::SAX in an
alternative distribution format (such as RPM), your post-install script
should check if ParserDetails.ini exists and if it
doesn't, run this command:
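(The one-liner suggested in the XML::SAX documentation is along these
lines.)

perl -MXML::SAX -e "XML::SAX->add_parser(q(XML::SAX::PurePerl))->save_parsers()"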

4. XSLT Support

4.1.

XML::LibXSLT

Matt Sergeant's XML::LibXSLT is a Perl
wrapper for the GNOME project's libxslt library. The XSLT
implementation is almost complete and the project is under active
development. The library is written in C and uses libxml2 for XML
parsing. Matt's testing found that it was about twice as fast as
Sablotron.

4.2.

XML::Sablotron

Sablotron
is an XML toolkit implementing XSLT 1.0, DOM Level2 and XPath 1.0. It is
written in C++ and developed as an open source project by Ginger Alliance. Since the XSLT
engine is written in C++ and uses expat for XML
parsing, it's pretty quick. XML::Sablotron
is a Perl module which provides full access to the Sablotron API
(including a DOM with XPath support).

4.3.

XML::XSLT

This module aims to implement XSLT in Perl, so as long as you have
XML::Parser working you won't need to compile
anything to install it. The implementation is not complete, but work is
continuing and you can join the fun at the project's SourceForge page. The
XML::XSLT distribution includes a script you can
use from the command line like this:

4.4.

XML::Filter::XSLT

Matt has also written XML::Filter::XSLT
which allows you to do XSLT transformations in a SAX pipeline. It
currently requires XML::LibXSLT but it is intended
to work with other XSLT processors in the future. In case you're
wondering, yes it does build a DOM of your complete document which it
transforms and then serialises back to SAX events. For this reason, it
might not be appropriate for multi-gigabyte documents.

4.5.

AxKit

If you're doing a lot of XML transformations (particularly for
web-based clients), you should take a long hard look at AxKit. AxKit is a Perl-based
(actually mod_perl-based) XML Application server for Apache. Here are some
of AxKit's key features:

Data can come from XML or any SAX data source (such as
a database query using XML::Generator::DBI)

Stylesheets can be selected
based on just about anything (file suffix, UserAgent, QueryString,
cookies, phase of the moon ...)

Transformations can be specified
using a variety of languages including XSLT (LibXSLT or Sablotron),
XPathScript (a Perl-based transformation language) and XSP (a tag-based
language)

Caching of transformed documents can be handled
automatically or using your own custom scheme

5. Encodings

5.1.

Why do we need encodings?

Text documents have long been encoded in ASCII - a 7 bit code
comprising 128 unique characters of which 32 are reserved for
non-printable control functions. The remaining 96 characters are barely
sufficient for variants of English, less than ideal for Western European
languages and totally inadequate for just about anything else. 'Point
solutions' have been applied with different 8 or 16 bit codes used in
different regions. One obvious limitation of such solutions is that a
single document cannot contain text in two or more languages from
different regions. The recognised solution to these problems is
Unicode/ISO-10646 - two separate but compatible standards which
aim to provide standard character mappings for all the world's languages
in a single character set.

All XML parsers are required to implement Unicode, but
we can't get away from the fact that most electronic documents in
existence today are not in Unicode. Even documents that have been
produced recently are unlikely to be Unicode. Therefore, XML
parsers are also able to work with non-Unicode documents - as long
as each document contains an encoding declaration which the parser
can use to map characters to the Unicode character set.

One particularly smart thing the Unicode designers did was to make
the first 128 characters of Unicode the same as ASCII, so pure ASCII
documents are also Unicode. The 'Latin 1' character set (ISO-8859-1) is
a popular 8 bit code which adds a further 96 printable characters to
ASCII and is commonly used in Western Europe. The extra 96 characters
are also mapped to the identical character numbers in Unicode (ie: ASCII
is a subset of ISO-8859-1 which is a subset of Unicode). Note however
that although Unicode provides a number of ways to encode characters
above 0x7F, none are quite the same as ISO-8859-1.

Since Unicode supports character positions higher than 255, a
representation of those characters will obviously require more than one
8-bit byte. There is more than one system for representing Unicode
characters as byte sequences. UTF-8 is one such system. It uses a
variable number of bytes (from 1 to 4 according to RFC3629) to represent
each character. This means that the most common characters (ie: 7 bit
ASCII) only require one byte.

In UTF-8 encoded data, the most significant bit of each byte will
be 0 for single byte characters and 1 for each byte of a multibyte
character. This is obviously not compatible with 8-bit codes such as
Latin1 in which all characters are 8 bits and all characters beyond 127
have the high bit set. Parsers assume their data is UTF-8 unless another
encoding is declared, so if you feed Latin1 data into an XML parser
without declaring an encoding, the parser will most likely choke on the
first character greater than 0x7F. If you are interested in the gory
details, read on...

The number of leading 1 bits in the first byte of a multi-byte
sequence is equal to the total number of bytes. Each of the follow-on
bytes will have the first bit set to 1 and the second to zero. All
remaining bits (shown as 'x' below) are used to represent the character
number.
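The standard byte patterns are:

0xxxxxxx                             1 byte  (characters 0x00-0x7F)
110xxxxx 10xxxxxx                    2 bytes (0x80-0x7FF)
1110xxxx 10xxxxxx 10xxxxxx           3 bytes (0x800-0xFFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes (0x10000-0x10FFFF)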

UTF-16 encoding is an alternative byte representation of Unicode
which for most cases amounts to a fixed-width 16 bit code. ASCII and
Latin1 characters (the first 256 characters) are represented as normal
but with a preceding 0x00 byte. Although UTF-16 is conceptually simpler
than UTF-8 (and is the native encoding used by Java), two major drawbacks
mean it is not the preferred format for C or Perl: the encoded data is
full of null bytes (which play havoc with C's string handling) and plain
ASCII data is not valid UTF-16.

You could obviously convert a UTF-8 encoded string to some other
encoding, but before we get on to that, let's look at what you can do
with it in its 'natural state'.

If you wish to display the string in a web browser, no conversion
is necessary. Modern browsers can understand UTF-8 directly, as can be
seen on this
page on the kermit project web site (some characters in the page
will not display correctly without the correct fonts installed but that's
a font issue rather than an encoding issue).
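If you're generating the page from a script, you can declare the encoding
with a 'charset' parameter on the Content-type header, for example:

Content-type: text/html; charset=utf-8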

Or if you can't control the headers, include this meta tag in the
document:

<meta http-equiv="Content-type" content="text/html; charset=utf-8">

For a 'low tech' alternative, you might find that UTF-8 text is
readable on your display if you simply print it to STDOUT. XFree86
version 4.0 supports Unicode fonts and Xterm supports UTF-8 multibyte
characters (assuming your locale is set correctly).

A growing number of editors support UTF-8 files and you can even
write your Perl scripts in UTF-8 (since 5.6) using your native language
for identifier names.

Perl versions prior to 5.6 had no knowledge of UTF-8 encoded
characters. You can still work with UTF-8 data in these older Perl
versions but you'll probably need the help of a module like
Unicode::String to deal with the non-ASCII
characters.

The built-in functions in Perl 5.6 and later are UTF-8 aware so for
example length will return the number of characters
rather than the number of bytes in a string, and ord
can return values greater than 255. The regular expression engine will
also correctly match against multi-byte characters and character classes
have been extended to include Unicode properties and block ranges.

None of this added functionality comes at the expense of support
for binary data. Perl's internal SV data structure (used to represent
scalar values) includes a flag to indicate whether the string value is
UTF-8 encoded. If this flag is not set, byte semantics will be used by
all functions that operate on the string, eg: length
will return the number of bytes regardless of the bit patterns in the
data.

You can include Unicode characters in your string literals
using the hex character number in an extended \x sequence. For example,
this declaration includes the Euro symbol:

my $price_label = "\x{20AC}9.99";

length reports that this string contains 5
characters and under 'use bytes' it will report a length of 7
bytes.

This regular expression will match a string which starts with
the Euro symbol:

my $euro = "\x{20AC}";
/^$euro/ && print;

And here's a regular expression that will match any upper case
character - not just A-Z, but any character with the Unicode upper case
property:

/\p{IsUpper}/

You can read more in perldoc perlunicode and
perldoc utf8.

5.5.

What can Perl 5.8 do with a UTF-8 string?

The Unicode support in Perl 5.6 had a number of omissions and bugs.
Many of the shortcomings were fixed in Perl 5.8 and 5.8.1. One major
leap forward in 5.8 was the move to Perl IO and 'layers' which allows
translations to take place transparently as file handles are read from or
written to. A built-in layer called ':encoding' will automatically
translate data to UTF-8 as it is read, or to some other encoding as it is
written. For example, given a UTF-8 string, this code will write it out
to a file as ISO-8859-1:
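(A minimal sketch; the file name is a placeholder.)

open(my $out, '>:encoding(iso-8859-1)', 'output.txt')
    or die "open failed: $!";
print $out $utf8_string;
close($out);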

The new core module 'Encode' can be used to translate between
encodings (but since that usually only makes sense during IO, you might
as well just use layers) and also provides the 'is_utf8' function for
accessing the UTF-8 flag on a string.

5.6.

How can I convert from UTF-8 to another encoding?

This being Perl, there's more than one way to do it ...

XML numeric character entities

If you are outputting XML, but for some reason do not wish to use
UTF-8 (perhaps your editor does not support it), you can convert all
characters beyond position 127 to numeric entities with a regular
expression like this:

use utf8; # Only needed for 5.6, not 5.8 or later
s/([\x{80}-\x{FFFF}])/'&#' . ord($1) . ';'/gse;

Andreas Koenig has supplied an alternative regular
expression:

s/([^\x20-\x7F])/'&#' . ord($1) . ';'/gse;

This version does not require 'use utf8' with Perl 5.6, does not
require a version of Perl which recognises \x{NN}, and handles characters
outside the 0x80-0xFFFF range.

Even if you are outputting Latin1, you will need to use a technique
like this for all characters beyond position 255 (eg: the Euro symbol)
since there is no other way to represent them in Latin1.

This technique can be used for the character content of elements
and attribute values. It cannot be used for the element or attribute
names since the result would not be well-formed XML.

Note

Remember, in XML the number in a numeric character entity
represents the Unicode character position regardless
of the document encoding.

Perl 5.8 IO layers

You can specify an encoding translation layer as you open a file
like this:
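(For example, to read a Latin1 file and get UTF-8 strings back; the file
name is a placeholder.)

open(my $fh, '<:encoding(iso-8859-1)', 'data.txt')
    or die "open failed: $!";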

You can also push an encoding layer onto an already open filehandle
like this:

binmode(STDOUT, ':encoding(windows-1250)');

Perl 5.6 tr/// operator (deprecated)

Perl 5.6 offers a way of converting between UTF-8 and Latin1 8 bit
byte strings using the 'tr' operator. This will no longer work in
Perl 5.8 and later. To quote from the 5.8 release notes:

The tr///C and tr///U features have been removed and will
not return; the interface was a mistake. Sorry about that.

--perldelta

Just to make quite sure that you know exactly which code to avoid
using, here is an example of translating from UTF-8 ('U') to 8 bit
Latin1 ('C'):

$string =~ tr/\0-\x{FF}//UC; # Don't do this

Perl 5.6 and later: pack()/unpack()

All versions of Perl from 5.6 on support 'U' and 'C' in
pack/unpack templates. You
can use this to split a UTF-8 string into characters and reassemble
them into a Latin1-style byte string. For example:
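(A sketch; the string itself is just an illustration with one non-ASCII
character.)

my $utf8_string   = "Mot\x{F6}rhead";
my $latin1_string = pack('C*', unpack('U*', $utf8_string));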

The first assignment creates a UTF-8 string 9 characters long
(but 10 bytes long). The second assignment creates a Latin-1 encoded
version of the string.

Unicode::String

The Unicode::String module pre-dates Perl
5.6 and works with older and newer Perl versions. You turn your
string into a Unicode::String object (which is
represented internally in UTF-16) and then call methods on the object
to convert it to alternative encodings. For example:
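(A sketch of going from UTF-8 bytes to Latin1.)

use Unicode::String qw(utf8);

my $u = utf8($utf8_string);        # create the object from UTF-8 data
my $latin1_string = $u->latin1;    # ask for a Latin1 version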

Text::Iconv

The Text::Iconv module is a wrapper around
the iconv library common on Unix systems (and available for Windows too).
You can use this module to create a converter object which maps from one
encoding to another and then pass the object a string to convert:
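(A sketch; see the note below about the portability of encoding names.)

use Text::Iconv;

my $converter = Text::Iconv->new('UTF-8', 'ISO8859-1');
my $latin1_string = $converter->convert($utf8_string);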

The biggest hurdle with using Text::Iconv is
knowing which conversions the iconv library on your system can handle
(the module offers no way to list supported encodings). Another problem
is that even if your system supports the encoding you require, it may
give it a non-standard name. For example, the code above worked on both
Linux and Solaris 8.0 but Solaris 2.6 required '8859-1' and Win32
required 'iso-8859-1'.

XML::SAX::Writer

If you're using SAX to generate or transform XML, you can tell
XML::SAX::Writer which output encoding to use
like this:

my $writer = XML::SAX::Writer->new(EncodeTo => 'ISO8859-1');

Internally, XML::SAX::Writer uses
Text::Iconv to do the conversion so the same
caveats about portability of encoding names apply here too.

5.7.

What does 'use utf8;' do?

In Perl 5.8 and later, the sole use of the 'use utf8;' pragma is to
tell Perl that your script is written in UTF-8 (ie: any non-ASCII or
multibyte characters should be interpreted as UTF-8). So if your code is
plain ASCII, you don't need the pragma.

The original UTF8 support in Perl 5.6 required the pragma to
enable wide character support for builtin functions (such as length)
and the regular expression engine. This is no longer necessary in 5.8
since Perl automatically uses character rather than byte semantics
with strings that have the utf8 flag set.

You can find out more about how Unicode handling changed in
Perl 5.8 from the perl58delta.pod
file that ships with Perl.

5.8.

What are some commonly encountered problems with encodings?

Surprise - it's UTF-8!

By far the most common problem people have is that they didn't
expect the parsing process to translate their data into UTF-8. Whether
this is an actual problem or merely a perceived problem is the subject of
some debate. Sure, you may need to change the encoding when you output
your data, but doing all your processing in Unicode will lead to less
pain in the long term. The preceding section gives you plenty of ways
to deal with UTF-8.

Windows code pages

Many Windows users assume that since they use Latin1 characters
they should specify an encoding of 'iso-8859-1'. It's more likely
however that their characters are encoded using Microsoft's code page
1252. This is an extension to ISO-Latin1 which replaces some of the
control characters with printable characters. For strict Latin1 text it
shouldn't matter, but if your text contains 'smart quotes', daggers,
bullet characters, the Trade Mark or the Euro symbols it's not
iso-8859-1. XML::Parser version 2.32 and later
include a CP1252 mapping which can be used with documents bearing this
declaration:

<?xml version='1.0' encoding='WINDOWS-1252' ?>

Web Forms

If your Perl script accepts text from a web form, you are at the
mercy of the client browser as to what encoding is used - if you save the
data to an XML file, random high characters in the data may cause the
file to not be 'well-formed'. A common convention is for browsers to
look at the encoding on the page which contains the form and to translate
data into that encoding before posting. You declare an encoding by using
a 'charset' parameter on the Content-type declaration, either in the
header:

print CGI->header('text/html; charset=utf-8');

or in a meta tag in the document itself:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If you find you've received characters in the range 0x80-0x9F, they
are unlikely to be ISO Latin1. This commonly results from users
preparing text in Microsoft Word and copying/pasting it into a web form.
If they have the 'smart quotes' option enabled, the text may contain
WinLatin1 characters. The following routine can be used to 'sanitise'
the data by replacing 'smart' characters with their common ASCII
equivalents and discarding other troublesome characters.
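(A rough sketch; the CP1252 code points handled here are the most common
offenders - extend the mapping to taste.)

sub sanitise_cp1252 {
    my $text = shift;

    $text =~ s/\x85/.../g;         # ellipsis
    $text =~ s/[\x91\x92]/'/g;     # 'smart' single quotes
    $text =~ s/[\x93\x94]/"/g;     # 'smart' double quotes
    $text =~ s/\x96/-/g;           # en dash
    $text =~ s/\x97/--/g;          # em dash
    $text =~ tr/\x80-\x9F//d;      # discard anything else in this range

    return $text;
}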

Note: It might be safer to simply reject any input with characters
in the above range since it implies the browser ignored your charset
declaration and guessing the encoding is risky at best.

6. Validation

The XML Recommendation says that an XML document is 'valid' if it
has an associated document type declaration and if the document complies
with the constraints expressed in it.

At the time the recommendation was written, the SGML Document
Type Definition (DTD) was the established method for defining a class
of documents; and validation was the process of confirming that a
document conformed to its declared DTD.

These days, there are a number of alternatives to the DTD and the
term validation has assumed a broader meaning than simply DTD
conformance. The most visible alternative to the DTD is the W3C's own
XML Schema.
Relax
NG is a popular alternative developed by OASIS.

If you design your own class of XML document, you are perfectly
free to select the system for defining and validating document
conformance which suits you best. You may even choose to develop your
own system. The following paragraphs describe Perl tools to
consider.

6.1.

DTD Validation Using XML::Checker

Enno Derksen's XML::Checker module implements DTD validation in
Perl on top of XML::Parser. The recommended way
to use XML::Checker is via one of the two
convenience modules included in the distribution:

XML::DOM::ValParser can be used anywhere
you would use XML::DOM. It works the same
way except that it performs DTD validation.

XML::Checker::Parser can be used anywhere
you would use XML::Parser. It works the same
way except that it performs DTD validation.

Here's a short example to get you going, using a DTD saved
in a file called /opt/xml/xcard.dtd:
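(A sketch of the validation code only; the document name is a placeholder
and the document is assumed to reference the DTD from its DOCTYPE
declaration.)

use XML::Checker::Parser;

my $parser = XML::Checker::Parser->new();

eval {
    $parser->parsefile('xcard.xml');
};
if ($@) {
    print "Validation failed:\n$@";
}
else {
    print "Document is valid\n";
}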

You can play around with adding and removing elements from the
document to get an idea of what happens when validation errors
occur. You'll also want to refer to the documentation for the
'SkipExternalDTD' option for more robust handling of external
DTDs.

6.2.

DTD Validation Using XML::LibXML

The libxml2 library supports DTD validation
although this is turned off by default in XML::LibXML. Once you have created an XML::LibXML object, you
can enable validation like this:

$parser->validation(1);

Validation using XML::LibXML is much faster
than with XML::Checker but if you want to know why
a document fails validation you'll find that
XML::LibXML's diagnostic messages are not as
helpful.

The libxml2 distribution (which underlies
XML::LibXML) includes a command line validation
tool, written in C, called xmllint. You can use it like this:

xmllint --valid --noout filename.xml

6.3.

W3C Schema Validation With XML::LibXML

XML::LibXML provides undocumented support
for validating against a W3C schema. Here's a sketch of how you might
use it (based on an example contributed by Dan Horne):
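(The file names match the po.xsd/po.xml sample from the W3C Schema Primer
mentioned below.)

use XML::LibXML;

my $schema = XML::LibXML::Schema->new(location => 'po.xsd');
my $doc    = XML::LibXML->new->parse_file('po.xml');

eval { $schema->validate($doc) };
print $@ ? "Not valid: $@" : "Valid\n";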

The referenced XSD schema file and sample XML document can be
downloaded from the W3C
Schema Primer.

The xmllint command line validator included in the libxml2
distribution can also do W3C schema validation. You can use it like
this:

xmllint --noout --schema po.xsd po.xml

6.4.

W3C Schema Validation With XML::Xerces

XML::Xerces provides a wrapper around the
Apache project's Xerces parser library. Installing Xerces can be
challenging and the documentation for the Perl API is not great, but it's
the most complete offering for Schema validation from Perl.

6.5.

W3C Schema Validation With XML::Validator::Schema

Sam Tregar's XML::Validator::Schema allows
you to validate XML documents against a W3C XML Schema. It does not
implement the full W3C XML Schema recommendation, but a useful
subset. Here's an example:
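(A sketch following the module's documented usage; the file names are
placeholders.)

use XML::SAX::ParserFactory;
use XML::Validator::Schema;

# The validator is a SAX filter: events flow through it and it dies
# if the document does not match the schema
my $validator = XML::Validator::Schema->new(file => 'schema.xsd');
my $parser    = XML::SAX::ParserFactory->parser(Handler => $validator);

eval { $parser->parse_uri('document.xml') };
die "Document is invalid: $@" if $@;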

Kip Hampton has written an
article describing how a combination of Perl and XPath can
provide a quick, lightweight solution for validating documents.

6.7.

XML::Schematron

Kip has also written the XML::Schematron
module which can be used with either XML::XPath or
XML::Sablotron to implement validation based on
Rick Jelliffe's Schematron.

7. Common Coding Problems

7.1.

How should I handle errors?

Most of the Perl parsing tools will simply call
die if they encounter an error (eg: an XML file
which is not well-formed). You can trap these failures using
eval. Once the eval has
completed, you can check the special variable '$@'. This will contain
the text of the error message if a failure occurred or will be undefined
otherwise. For example:
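(A sketch using XML::LibXML; any parser that dies on errors can be
substituted.)

use XML::LibXML;

my $doc = eval { XML::LibXML->new->parse_file($filename) };
if ($@) {
    warn "Parse failed: $@";
}
else {
    # safe to work with $doc here
}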

The '$@' variable contains the scalar value which was passed to
die. In some cases, this value will be a reference
to an 'exception' object. For simple error handling, you can still just
print the value out in a double quoted string, but
for more complex handling you might want to check the type of the
exception or in some cases call methods on it to get the error code etc.
XML::SAX::Exception is one such
implementation.

The reason is that parsers are not required to give you all of an
element's character data in one chunk. The number of characters you get
in each chunk may depend on the parser's internal buffer sizes, newline
characters in the data, or (as in our example) embedded entities. It
doesn't really matter what causes the data to be split - you just have
to be prepared to handle it.

The usual approach is to accumulate data in the character event
and defer processing it until the end element event. Here's a sample
implementation using XML::Parser:
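(A simplified sketch - it ignores nesting and attributes, but shows the
accumulate-then-process pattern.)

use XML::Parser;

my $text = '';

my $parser = XML::Parser->new(Handlers => {
    Start => sub { $text = '' },              # reset the accumulator
    Char  => sub { $text .= $_[1] },          # may fire several times per element
    End   => sub {
        my ($expat, $element) = @_;
        print "$element: $text\n";            # all the character data is now together
    },
});

$parser->parsefile('doc.xml');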

Of course you probably won't be coding to the
XML::Parser API. It's more likely you'll be using
SAX. In which case, the answer is much simpler. Just include Robin
Berjon's XML::Filter::BufferText in your pipeline
and stop worrying.

7.3.

How can I split a huge XML file into smaller chunks?

When your document is too large to slurp into memory, the DOM,
XPath and XSLT tools can't really help you. You could write your own SAX
filter fairly easily, but Michel Rodriguez has written a general
solution so you don't have to. You'll find it bundled with XML::Twig
from version 3.16.

8. Common XML Problems

The error messages and questions listed in this section are not really
Perl-specific problems, but they are commonly encountered by people new to
XML:

8.1.

'xml processing instruction not at start of external entity'

If you include an XML declaration, it must be the very first thing
in the document - it cannot even be preceded by whitespace or blank
lines. For example, this would be 'well formed' XML as long as the
'<' and the '?' are the first and second characters in the
file:
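(A trivial illustration; the element names are arbitrary.)

<?xml version="1.0"?>
<doc>
  <para>Hello World</para>
</doc>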

8.2.

'junk after document element'

A well formed XML document can contain only one root element. So,
for example, this would be well formed:

<para>Paragraph 1</para>

while this would not:

<para>Paragraph 1</para>
<para>Paragraph 2</para>

8.3.

'not well-formed (invalid token)'

There are a number of causes of this error, here are some common
ones:

Unquoted attributes

All attribute values must always be quoted in XML. For example,
this would be well formed:

<item name="widget"></item>

while this would not:

<item name=widget></item>

Bad encoding declaration

An incorrect or missing encoding declaration can cause this. By
default the encoding is assumed to be UTF-8 so if your data is (say)
ISO-8859-1 encoded then you must include an encoding declaration. For
example:

<?xml version='1.0' encoding='iso-8859-1'?>

8.4.

'undefined entity'

XML only pre-defines the following named character entities:

&lt; <
&gt; >
&amp; &
&quot; "
&apos; '

If your XML includes HTML-style named character entities (eg:
&nbsp; or &uuml;) you have two choices:

You could replace the named entities with numeric entities. For
example the non-breaking space character is at position 160 (hex A0) so
you could represent it with: &#160; (or &#x00A0;). Similarly,
you could represent a lower case U-umlaut as &#252; (or
&#x00FC;).

Alternatively, you could define your own named character entities
in your DTD or in an 'internal subset' of a DTD. For example:
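(A small illustration using an internal subset; the root element name is
arbitrary.)

<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ENTITY nbsp "&#160;">
  <!ENTITY uuml "&#252;">
]>
<doc>M&uuml;nchen is&nbsp;lovely in spring</doc>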

The XML spec defines legal
characters as tab (0x09), carriage return (0x0D), line feed
(0x0A) and the legal graphic characters of Unicode. This specifically
excludes control characters, so this would not be well-formed:

<char>&#3;</char>

There really is no easy or standard way to include control
characters in XML - binary data must be encoded (for example using
MIME::Base64).

8.6.

Embedding Arbitrary Text in XML

Any text you include in your XML documents must not contain the
angle brackets or ampersand characters in an unescaped form. Manually
escaping these characters can be tedious when you want to include a block
of program code or HTML. You can use a CDATA section to indicate to the
parser that the text within it should not be parsed for markup. For
example, consider an XML document like this one:
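(The element name and the embedded snippet are arbitrary; everything
between the CDATA markers reaches your application as plain text.)

<code-sample><![CDATA[
if ($count < 10 && $total > 0) {
    print "<p>still going</p>\n";
}
]]></code-sample>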

When you parse a document, your code has no way of knowing if a
particular piece of text came from a CDATA section - and you probably
shouldn't care.

Note

CDATA is for character data - not binary
data. If you need to include binary data in your document, you should
encode it (perhaps using MIME::Base64) when you
write the document and decode it during parsing.

8.7.

Using XPath with Namespaces

People often experience difficulty getting their XPath expressions
to match when they first use XML::LibXML to
process an XML document containing namespaces. For example, consider
this XHTML document with an embedded SVG section:
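(A cut-down stand-in rather than a real page; the namespace structure is
what matters.)

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Namespace example</title>
  </head>
  <body>
    <h1>An embedded drawing</h1>
    <s:svg xmlns:s="http://www.w3.org/2000/svg" width="100" height="100">
      <s:circle cx="50" cy="50" r="40" />
    </s:svg>
  </body>
</html>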

The elements in the SVG section each use the namespace prefix
's' which is bound to the URI 'http://www.w3.org/2000/svg'. The
prefix 's' is completely arbitrary and is merely a mechanism for
associating the URI with the elements. As a programmer, you will
perform matches against namespace URIs not prefixes.

The elements in the XHTML wrapper do not have namespace prefixes,
but are bound to the URI 'http://www.w3.org/1999/xhtml' by way of the
default namespace declaration on the opening <html> tag.

You might expect that you could match all the 'h1' elements using
this XPath expression ...

//h1

... however, that won't work since the namespace URI is effectively
part of the name of the element you're trying to match.

One approach would be to fashion an XPath query which ignored the
namespace portion of element names and matched only on the 'local name'
portion. For example:

//*[local-name() = 'h1']

A better approach is to match the namespace portion as well. To
achieve that, the first step is to use
XML::LibXML::XPathContext to declare a namespace
prefix. Then, the prefix can be used in the XPath expression:
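(A sketch; the file name is a placeholder, and the 'x' prefix exists only
in the XPath expressions - it need not match any prefix used in the
document itself.)

use XML::LibXML;
use XML::LibXML::XPathContext;

my $doc = XML::LibXML->new->parse_file('page.xhtml');

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

foreach my $node ($xpc->findnodes('//x:h1')) {
    print $node->textContent, "\n";
}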

Note

The XML::LibXML::XPathContext module
has been included in the XML::LibXML distribution
since version 1.61. Prior to that it was in its own separate
distribution on CPAN.

9. Miscellaneous

9.1.

Is there a mailing list for Perl and XML?

Yes, the perl-xml mailing list is kindly hosted by ActiveState.
The list info
page has links to the searchable list archive as well as a form
for subscribing. The list has moderate traffic levels (no messages some
days, a dozen messages on a busy day) with a knowledgeable and helpful
band of subscribers.

9.2.

How do I unsubscribe from the perl-xml mailing list?

The list info
page links through to an unsubscribe function. Every message
sent to the list also includes an 'unsubscribe' link which makes it all
the more mystifying that this really is a frequently asked
question.

9.3.

What happened to Enno?

This is one of the great mysteries of Perl/XML and no answer is
available here.

Enno Derksen wrote a number of XML related Perl modules (including
XML::DOM and
XML::Checker::Parser) which were packaged into a
distribution called 'libxml-enno'. No one has heard from Enno in quite a
while and TJ Mather has assumed the role of maintainer for some of the
modules. Many of us have benefitted from his work so if you're out
there Enno - thanks.

Corrections, Contributions and Acknowledgements

This document is a 'work in progress'. A number of questions are
still being worked on and will be added when they are complete.

If you wish to report an error or contribute information for
inclusion in this document, please email the author at:
<grantm@cpan.org>.

I wish to gratefully acknowledge the assistance of the community of
subscribers to the 'perl-xml' mailing list. Their knowledge and advice
has been invaluable in preparing this document.