Xerces Perl: The Perl API to the Apache Xerces XML parser

Current Release: XML::Xerces 2.7.0-0

XML::Xerces is the Perl API to the Apache project's Xerces XML
parser. It is implemented using the Xerces C++ API, and it provides
access to most of the C++ API from Perl.

Because it is based on Xerces-C, XML::Xerces provides a
validating XML parser that makes it easy to give your application the
ability to read and write XML data. Classes are provided for parsing,
generating, manipulating, and validating XML documents. XML::Xerces
is faithful to the XML 1.0 recommendation and associated standards
(DOM levels 1, 2, and 3, SAX 1 and 2, Namespaces, and W3C XML
Schema). The parser offers high performance, modularity, and
scalability, along with full support for Unicode.

XML::Xerces implements the vast majority of the Xerces-C API (if
you notice any discrepancies please mail the
list). The exceptions are functions in the C++ API that
either have better Perl counterparts (such as file I/O) or
manipulate internal C++ information that has no role in the Perl
module.

Support

The online users mailing list is the place for any
questions. It is at: p-dev@xerces.apache.org

Available Platforms

The code has been tested on the following platforms:

Linux

Cygwin

Windows

Mac OS X

BSD

Solaris

AIX

Tru64

Build Requirements

ANSI C++ compiler

Builds are known to work with the GNU C++ compiler (g++) and with other
platform-specific compilers (such as VC++ on Windows and Forte on
Solaris). Contributions in this area are always welcome :-).

Perl5

Note

Required version: 5.6.0

XML::Xerces now supports Unicode. Since Unicode support wasn't
added to Perl until 5.6.0, you will need to upgrade in order to use this
and future versions of XML::Xerces. Upgrading to at least 5.6.1 is
recommended.

If you plan on using Unicode, I strongly recommend upgrading
to Perl-5.8.x, the latest stable version. There have been significant
improvements to Perl's Unicode support.

The Apache Xerces C++ XML Parser

Note

Required version: 2.7.0

Xerces-C can be downloaded from the Apache archive. You'll need both
the library and the header files, and you must set any environment
variables that direct the XML::Xerces build to the directories where
they reside.

Unpack the archive

Getting Xerces-C

If the Xerces-C library and header files are installed on your system
directly, e.g. via an rpm or deb package, proceed to the directions for
building XML::Xerces.

Otherwise, you must download Xerces-C from www.apache.org. If
there is a binary available for your architecture, you may use it;
otherwise you must build it from source. If you wish to make
Xerces-C available to other applications, you may install it, but
this is not necessary in order to build XML::Xerces.
To build XML::Xerces from an uninstalled Xerces-C, set the
XERCESCROOT environment variable to the top-level directory of the
source tree (i.e. the same value it needs to have to build Xerces-C):

export XERCESCROOT=/home/jasons/xerces-2.7.0/

OPTIONAL: If you choose to install Xerces-C on your system, you
need to set the XERCES_INCLUDE and XERCES_LIB environment variables:

export XERCES_INCLUDE=/usr/include/xerces
export XERCES_LIB=/usr/lib

Build XML::Xerces

Go to the XML-Xerces-2.7.0-0 directory.

Build XML::Xerces as you would any perl package that you
might get from CPAN:

perl Makefile.PL

make

make test

make install

Using XML::Xerces

XML::Xerces implements the vast majority of the Xerces-C API (if you
notice any discrepancies please mail the list). Documentation for this API
is sadly not available in POD format, but the Xerces-C HTML documentation
is available online.

For more information, see the examples in the samples/ directory
and the test scripts located in the t/ directory.

Special Perl API Features

Even though XML::Xerces is based on the C++ API, it has been modified
in a few ways to make it fit typical Perl usage, primarily in how the
following are handled:

String I/O

The native string type of Xerces-C is XMLCh*, a UTF-16 encoded
string, whereas Perl strings are encoded in UTF-8. All conversion
back and forth between Perl and Xerces-C is handled automatically
by XML::Xerces.

In fact, a lot of effort is made to convert Perl variables
into strings before passing them to Xerces-C. Any method
that accepts an XMLCh* in Xerces-C will therefore accept any
non-undef value, which is converted using Perl's built-in
stringification mechanism.
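As a minimal sketch of what "Perl's built-in stringification mechanism" means in practice, the pure-Perl snippet below defines an object that stringifies itself through the overload pragma. The TagName class is hypothetical and is not part of XML::Xerces; the snippet only shows the string such a value yields when something asks for its string form.

```perl
use strict;
use warnings;

# Hypothetical helper class (NOT part of XML::Xerces): an object that
# turns into an element name whenever it is used as a string.
package TagName;
use overload '""' => sub { $_[0]{name} };

sub new {
    my ($class, $name) = @_;
    return bless { name => $name }, $class;
}

package main;

my $tag = TagName->new('person');

# Interpolation forces stringification, which calls the overloaded '""'.
my $as_string = "$tag";
print "Stringified value: $as_string\n";

# Plain numbers stringify as well.
my $number_as_string = "" . 42;
print "Stringified number: $number_as_string\n";
```

Given the behavior described above, a value like $tag or a plain number should be usable wherever a method expects a string argument, while undef should not be passed.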

List I/O

Any function that returns a DOMNodeList in the C++ API
(e.g. getChildNodes() and getElementsByTagName())
returns different types depending on whether it is called in a list
context or a scalar context. In a scalar context, these functions return a
reference to an XML::Xerces::DOMNodeList, just as in the C++
API. In a list context, however, they return a Perl list of
XML::Xerces::DOMNode references.
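The two calling contexts can be sketched as follows. This is an illustration under assumptions rather than an excerpt from the distribution: the sample document is made up, and the use of XML::Xerces::XercesDOMParser and of XML::Xerces::MemBufInputSource with a single string argument is assumed here.

```perl
use strict;
use warnings;
use XML::Xerces;

# Illustrative document; the element names are invented for this sketch.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<contributors>
  <person>Alice</person>
  <person>Bob</person>
</contributors>
XML

my $parser = XML::Xerces::XercesDOMParser->new();
$parser->setErrorHandler(XML::Xerces::PerlErrorHandler->new());

# Parse from memory (assumed API); report any exception via XML::Xerces::error().
eval { $parser->parse(XML::Xerces::MemBufInputSource->new($xml)) };
XML::Xerces::error($@) if $@;

my $doc = $parser->getDocument();

# Scalar context: a reference to an XML::Xerces::DOMNodeList.
my $node_list = $doc->getElementsByTagName('person');
print "DOMNodeList length: ", $node_list->getLength(), "\n";

# List context: a plain Perl list of DOM node objects.
my @nodes = $doc->getElementsByTagName('person');
print "Perl list size: ", scalar(@nodes), "\n";
print "Node name: ", $_->getNodeName(), "\n" for @nodes;
```

The list form is convenient for foreach loops and for grep or map, while the scalar form keeps the familiar item() and getLength() interface of the C++ API.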

Hash I/O

Any function that returns a DOMNamedNodeMap in the C++ API
(getEntities() and getAttributes(), for example) returns
different types depending on whether it is called in a list context
or a scalar context. In a scalar context, these functions return a
reference to an XML::Xerces::DOMNamedNodeMap, just as in the
C++ API. In a list context, however, they return a Perl hash.
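A minimal sketch of both contexts for getAttributes() follows. The sample element and its attributes are invented, and the parser setup (XML::Xerces::XercesDOMParser, XML::Xerces::MemBufInputSource) is assumed rather than taken from this text. The hash keys are expected to be the attribute names; what the hash values hold is not stated here, so the sketch only inspects the keys.

```perl
use strict;
use warnings;
use XML::Xerces;

# Illustrative element; attribute names are made up for this sketch.
my $xml = '<person name="Alice" role="editor"/>';

my $parser = XML::Xerces::XercesDOMParser->new();
$parser->setErrorHandler(XML::Xerces::PerlErrorHandler->new());

eval { $parser->parse(XML::Xerces::MemBufInputSource->new($xml)) };
XML::Xerces::error($@) if $@;

my $element = $parser->getDocument()->getDocumentElement();

# Scalar context: a reference to an XML::Xerces::DOMNamedNodeMap.
my $attr_map = $element->getAttributes();
print "DOMNamedNodeMap length: ", $attr_map->getLength(), "\n";

# List context: a Perl hash (keys assumed to be the attribute names).
my %attrs = $element->getAttributes();
print "Attribute names: ", join(', ', sort keys %attrs), "\n";
```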

Combined List/Hash classes (XMLAttDefList)

Any function that returns an XMLAttDefList in the C++ API
(getAttDefList() for SchemaElementDecl and DTDElementDecl)
always returns an instance of XML::Xerces::XMLAttDefList. However,
there are two Perl-specific API methods that can be invoked on the
object: to_list() and to_hash().

# get the XML::Xerces::XMLAttDefList.
my $attr_list = $element_decl->getAttDefList();
# return a list of XML::Xerces::XMLAttDef instances
my @list = $attr_list->to_list();
# returns a hash of the attributes, where the keys are the
# result of calling getFullName() on the attributes, and the
# values are the XML::Xerces::XMLAttDef instances.
my %attrs = $attr_list->to_hash();

Void* handling

Any function in the C++ API that accepts a void*, for example
setProperty() in DOMBuilder and SAX2XMLReader, must be handled
specially. Currently, all void* methods convert their arguments
to a string before passing them to Xerces-C. In the future, when
other data types are needed, this functionality will be
expanded. If you locate a case in which you need this support,
please alert the development team (p-dev@xerces.apache.org).

Serialize API

The DOMWriter class is used for serializing DOM hierarchies. See
t/DOMWriter.t or samples/DOMPrint.pl
for details.

For less complex usage, just use the serialize() method defined for all
DOMNode subclasses.
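A short sketch of the simple case is shown below. It assumes that serialize() takes no arguments and returns the markup as a Perl string; neither detail is spelled out in this text, and the parser setup and sample document are likewise assumptions.

```perl
use strict;
use warnings;
use XML::Xerces;

# Illustrative document for this sketch.
my $xml = '<greeting lang="en">hello</greeting>';

my $parser = XML::Xerces::XercesDOMParser->new();
$parser->setErrorHandler(XML::Xerces::PerlErrorHandler->new());

eval { $parser->parse(XML::Xerces::MemBufInputSource->new($xml)) };
XML::Xerces::error($@) if $@;

my $doc = $parser->getDocument();

# Assumption: serialize() returns the serialized node as a string.
my $xml_text = $doc->getDocumentElement()->serialize();
print $xml_text, "\n";
```

When control over encoding or formatting is needed, the DOMWriter route described above is the better fit.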

Implementing {Document,Content,Error}Handlers from Perl

Thanks to suggestions from Duncan Cameron, XML::Xerces now has a
handler API that matches the semantics currently used by other Perl XML
APIs. There are three classes available for application writers:

PerlErrorHandler (SAX 1/2 and DOM 1)

PerlDocumentHandler (SAX 1)

PerlContentHandler (SAX 2)

Using these classes is as simple as creating a Perl subclass of the
needed class and redefining any needed methods. For example, an
application can override the default fatal_error() method of the
PerlErrorHandler class by defining its own subclass.
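Overriding fatal_error() might look like the sketch below. The MyErrorHandler package name, the error message, and the deliberately malformed input are illustrative; the parser class and the in-memory input source are assumptions not taken from this text.

```perl
use strict;
use warnings;
use XML::Xerces;

# Subclass PerlErrorHandler and redefine only the method of interest.
package MyErrorHandler;
our @ISA = qw(XML::Xerces::PerlErrorHandler);

sub fatal_error {
    my ($self, $exception) = @_;
    die "Oops, I got a fatal error\n";
}

package main;

my $parser = XML::Xerces::XercesDOMParser->new();
$parser->setErrorHandler(MyErrorHandler->new());

# Deliberately malformed input (illustrative) to trigger fatal_error().
my $bad_xml = '<open><unclosed></open>';
eval { $parser->parse(XML::Xerces::MemBufInputSource->new($bad_xml)) };
my $caught = $@;
print "Caught: $caught" if $caught;
```

Methods that are not redefined keep the default behavior inherited from PerlErrorHandler.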

Handling exceptions ({XML,DOM,SAX}Exceptions)

Some errors occur outside parsing and are not caught by the parser's
ErrorHandler. XML::Xerces provides a way for catching these errors
using the PerlExceptionHandler class. Usually the following code
is enough for catching exceptions:

eval{$parser->parse($my_file)};
XML::Xerces::error($@) if $@;

Wrap any code that might throw an exception inside an eval{...} and
call XML::Xerces::error() passing $@, if $@ is set.

There is a default handler that prints an error message and calls
die(), but if more is needed, see the files t/XMLException.t,
t/SAXException.t, and t/DOMException.t for details on how to roll your own
handler.

XML::Xerces::XMLUni unicode constants

XML::Xerces uses many constant values for setting features and
properties, such as for XML::Xerces::SAX2XMLReader::setFeature(). You can
hard-code the strings or integers into your programs, but this makes
them vulnerable to API changes. Instead, use the constants defined in
the XML::Xerces::XMLUni class. If the API changes, the constants will be
updated to reflect that change. See the file docs/UMLUni.txt for a
complete listing of the constant names and their values.
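As a hedged illustration, the sketch below turns on validation, namespace, and schema processing for a SAX2 reader using XMLUni constants instead of literal feature URIs. The factory call XML::Xerces::XMLReaderFactory::createXMLReader() and the three constant names are assumptions based on the corresponding Xerces-C names; check them against the constant listing before relying on them.

```perl
use strict;
use warnings;
use XML::Xerces;

# Assumed factory call for obtaining a SAX2XMLReader.
my $reader = XML::Xerces::XMLReaderFactory::createXMLReader();
$reader->setErrorHandler(XML::Xerces::PerlErrorHandler->new());

# Named constants instead of hard-coded feature strings.
eval {
    $reader->setFeature("$XML::Xerces::XMLUni::fgSAX2CoreValidation", 1);
    $reader->setFeature("$XML::Xerces::XMLUni::fgSAX2CoreNameSpaces", 1);
    $reader->setFeature("$XML::Xerces::XMLUni::fgXercesSchema", 1);
};
XML::Xerces::error($@) if $@;

my $validating = $reader->getFeature("$XML::Xerces::XMLUni::fgSAX2CoreValidation");
print "Validation enabled: ", ($validating ? "yes" : "no"), "\n";
```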

Sample Code

XML::Xerces comes with a number of sample applications:

SAXCount.pl: Uses the SAX interface to
output a count of the number of elements in an XML document

SAX2Count.pl: Uses the SAX2 interface
to output a count of the number of elements in an XML document

DOMCount.pl: Uses the DOM interface to
output a count of the number of elements in an XML document

DOMPrint.pl: Uses the DOM interface to
output a pretty-printed version of an XML file to STDOUT

DOMCreate.pl: Creates a simple XML
document using the DOM interface and writes it to STDOUT

DOM2hash.pl: Uses the DOM interface to
convert the file to a simple hash of lists representation

EnumVal.pl: Parses an input XML document
and outputs the DTD information to STDOUT