Perl Parser Performance

A few years ago there was one dominant XML parser in Perl; parsing
an XML document was synonymous with using the XML::Parser module. The module,
written by Larry Wall and Clark Cooper, worked as an interface to James Clark's
expat XML parser, and it didn't leave much room for competitors. Traditional Perl
modules for XML processing were built on top of XML::Parser.

But times are changing. Other C/C++ parsers, such as libxml2 and Xerces C++,
have entered the scene, and so have their Perl extensions. Perl XML folks have
developed Perl SAX, a Perlish counterpart of the Java SAX interface.
CPAN now contains several parsing modules.
This article compares the performance of five PerlSAX 2 parsers freely available
from CPAN. Good old XML::Parser is also included to serve as a baseline.

I must state that this article isn't an independent study, as I'm a maintainer
of one of the modules involved in this test. The original purpose of this test
was to check whether the new parser performs as expected. But the methodology is,
to the best of my knowledge, neutral; all data and sources are available, and the
results are easily reproducible by anyone.

Modules Involved

Six Perl parsers are included in the test. All of them can be downloaded
from CPAN. The C or C++ libraries required by individual Perl modules
(expat, libxml2, Xerces C++) are also open source. Since Perl modules have
lengthy names with a good many colons, I use my own abbreviations to refer to
them within this article:

XML::Parser [PARS], v2.34
Expat wrapper with its own event-based API, which differs from PerlSAX 2.

XML::SAX::ExpatXS [EXPXS], v1.00
Expat wrapper forked from the XML::Parser sources; makes use of XML::SAX::Base.

XML::LibXML, v1.58
Contains two PerlSAX 2 parsers: XML::LibXML::SAX [LXML] and
XML::LibXML::SAX::Parser [LXMLP]. Both are interfaces to libxml2
and both make use of XML::SAX::Base. While LXML is a true streaming parser,
the older LXMLP builds a DOM tree internally. LXMLP has in fact been deprecated
in favor of LXML. Both are included in the test to compare the performance of
the two different approaches.

XML::Xerces [XERC], v2.5.0-0
Interface to Xerces C++. It works with PerlSAX 2 handlers, but its API differs
from the specs in some respects.

There is one more PerlSAX 2 compliant parser not included in this
test. XML::SAX::PurePerl belongs to the XML::SAX package and serves as
a pure-Perl fallback parser. Once you install XML::SAX you have a parser,
regardless of external libraries installed in your system. However, this parser
is considerably slower than those built around C/C++ libraries. I have
dropped it from the test as I don't want to compare apples to oranges.

Test Documents

I simplified the selection of test XML documents by reusing those created
by Clark Cooper for his benchmark of XML parsers, published on XML.com in
May 1999. REC.xml is the XML version of the XML 1.0 specification
(REC-xml-19980210.xml); the other documents (med.xml, chrmed.xml, big.xml,
chrbig.xml) are mechanically expanded versions of REC.xml that provide
various sizes and markup densities.

What I was missing in Clark's selection were smaller, denser
documents typical of the Web. Accordingly, I have added two real-world files:
gingerall.xml, an XHTML file downloaded from the gingerall.org site, and
rss10.xml, an RSS 1.0 file originating from the
recently hibernated xmlhack.com. Table 1
lists the complete set of XML documents I use in this test.

Test Method

All the modules are tested using a single Perl
script. EXP, EXPXS, LXML, and LXMLP are treated
in exactly the same way; this can be seen as a proof of the PerlSAX 2 concept.
XERC shares the same handler, but it requires special treatment in the
constructor and in the parsing method call. PARS has both an API and a handler
of its own.

Handlers are as simple as they can be; each callback function simply counts
how many times it has been called. Each parser parses each document 10 times
in succession; the parsing time is measured with the Time::HiRes module.
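
The setup described above can be sketched roughly as follows. This is not the original test script: CountingHandler, replay_events, and time_runs are hypothetical names, and replay_events merely stands in for a real parser so that the sketch runs with core Perl only. With a PerlSAX 2 parser installed, the parser object would call the same handler methods while reading a file.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical handler: each callback only counts its own invocations.
package CountingHandler;

sub new            { my ($class) = @_; return bless { count => {} }, $class }
sub start_document { $_[0]{count}{start_document}++ }
sub end_document   { $_[0]{count}{end_document}++ }
sub start_element  { $_[0]{count}{start_element}++ }
sub end_element    { $_[0]{count}{end_element}++ }
sub characters     { $_[0]{count}{characters}++ }
sub count          { my ($self, $event) = @_; return $self->{count}{$event} // 0 }

package main;

# Stand-in for a parser (an assumption made to keep the sketch self-contained):
# it replays a prepared list of [method name, data] pairs against the handler.
sub replay_events {
    my ($handler, $events) = @_;
    for my $event (@$events) {
        my ($method, $data) = @$event;
        $handler->$method($data);
    }
    return;
}

# Run $code $runs times in a row and return the elapsed wall-clock seconds.
sub time_runs {
    my ($runs, $code) = @_;
    my $t0 = [gettimeofday];
    $code->() for 1 .. $runs;
    return tv_interval($t0);
}

# Events a parser might report for <doc><p>text</p></doc> (illustrative only).
my @events = (
    [ start_document => {} ],
    [ start_element  => { Name => 'doc' } ],
    [ start_element  => { Name => 'p' } ],
    [ characters     => { Data => 'text' } ],
    [ end_element    => { Name => 'p' } ],
    [ end_element    => { Name => 'doc' } ],
    [ end_document   => {} ],
);

my $handler = CountingHandler->new;
my $elapsed = time_runs(10, sub { replay_events($handler, \@events) });
printf "start_element called %d times in %.6f s\n",
    $handler->count('start_element'), $elapsed;
```

Because the handler does almost nothing, the measured time is dominated by the parser and by the cost of crossing into Perl for each event, which is what the test is meant to compare.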

Results

The results are broken down by the markup density of the test documents. The
density makes much more difference than the size of the documents. This is not really
a surprise for streaming parsers. Even the DOM-based LXMLP keeps pace with
the others as long as there is enough memory available. Figures 1, 2, and 3 graph
the performance of the parser modules for medium, low, and high markup density,
respectively. The values shown in the figures are proportional processing times;
the fastest parser shows 100% for each document.
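
The normalization used in the figures amounts to dividing each parser's time by the fastest time for that document. A minimal sketch follows; proportional_times is a hypothetical helper, and the parser labels and timings are made up for illustration, not taken from this test.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Convert raw times (seconds) into percentages of the fastest time,
# so that the fastest entry comes out as exactly 100.
sub proportional_times {
    my (%seconds) = @_;
    my $fastest = min(values %seconds);
    return map { $_ => 100 * $seconds{$_} / $fastest } keys %seconds;
}

# Made-up timings for one document; A, B, and C are placeholder labels.
my %relative = proportional_times(A => 0.50, B => 0.75, C => 1.25);
printf "%-2s %5.0f%%\n", $_, $relative{$_} for sort keys %relative;
```

A value of 150% therefore means that the parser took one and a half times as long as the fastest one on the same document.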

Figure 1. Performance Comparison for Medium Markup Density Files

PARS leads by a significant margin. The streaming XS extensions
EXPXS, XERC, and LXML follow, some distance ahead of
LXMLP and EXP. Most of this can be explained by the architectural
approaches used. Event-based processing requires a lot of function calls, and
these calls are expensive in Perl. One more function call per event most likely
reduces the performance of EXP. LXMLP does pretty well,
demonstrating that libxml2 builds and accesses a DOM tree really fast. The modules using
XML::SAX::Base as a base class (EXPXS, EXP, LXML,
LXMLP) also carry the handicap of an additional Perl function call for each
event. This is a common tradeoff between performance and compatibility.
Since PARS and EXPXS have a comparable code base, most of the
performance difference between the two parsers is probably caused by object
overhead and subclassing XML::SAX::Base.
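
The cost of one extra Perl call per event can be observed in isolation with a sketch like the one below. It does not use XML::SAX::Base itself; ForwardingLayer is a hypothetical class that simply passes each event on to the real handler, which only approximates the kind of overhead discussed above. Absolute numbers depend on the machine and the Perl version, so none are shown.

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

package TallyHandler;      # hypothetical handler that only counts events
sub new        { my ($class) = @_; return bless { seen => 0 }, $class }
sub characters { $_[0]{seen}++ }

package ForwardingLayer;   # hypothetical stand-in for a base-class layer
sub new {
    my ($class, $handler) = @_;
    return bless { Handler => $handler }, $class;
}
# The one additional Perl-level call per event.
sub characters {
    my ($self, $data) = @_;
    return $self->{Handler}->characters($data);
}

package main;

# Deliver $n characters events to $target and return the elapsed seconds.
sub deliver_events {
    my ($target, $n) = @_;
    my $data = { Data => 'x' };
    my $t0   = [gettimeofday];
    $target->characters($data) for 1 .. $n;
    return tv_interval($t0);
}

my $n        = 200_000;
my $direct   = TallyHandler->new;
my $wrapped  = TallyHandler->new;
my $t_direct = deliver_events($direct, $n);
my $t_layer  = deliver_events(ForwardingLayer->new($wrapped), $n);

printf "direct: %.3f s, via extra layer: %.3f s\n", $t_direct, $t_layer;
```

Both handlers receive the same number of events; only the path by which each event reaches the handler differs.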

Figure 2. Performance Comparison for Low Markup Density Files

PARS (and EXP, which builds on PARS) performs
significantly worse on documents with higher proportions of text. The reason is
simple: expat reports one character event for each line and one more for each line
break. Hence it generates many more events than the other parsers. Here the
"function calls are expensive" mantra applies again: more callbacks mean lower
performance. The difference in the number of calls can be huge. For
example, PARS and EXP generate 142,415 character events
for chrbig.xml, while EXPXS and
LXMLP need as few as 3,981 events (EXPXS is also expat-based,
but it joins consecutive characters before entering Perl space).
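
The effect of joining consecutive character events can be illustrated in Perl. EXPXS does this in its XS layer before calling into Perl; the sketch below only shows the idea, using a hypothetical CoalescingFilter class that buffers text and forwards a single characters event once an element event arrives. EventLog is likewise a hypothetical handler that records what it receives.

```perl
use strict;
use warnings;

package EventLog;   # hypothetical downstream handler that records events
sub new           { my ($class) = @_; return bless { events => [] }, $class }
sub start_element { push @{ $_[0]{events} }, "start:$_[1]{Name}" }
sub end_element   { push @{ $_[0]{events} }, "end:$_[1]{Name}" }
sub characters    { push @{ $_[0]{events} }, "chars:$_[1]{Data}" }

package CoalescingFilter;   # hypothetical filter, not part of any CPAN module
sub new {
    my ($class, $handler) = @_;
    return bless { Handler => $handler, buffer => '' }, $class;
}

# Buffer text instead of forwarding it right away.
sub characters {
    my ($self, $data) = @_;
    $self->{buffer} .= $data->{Data};
    return;
}

# Forward buffered text, if any, as one characters event.
sub flush_text {
    my ($self) = @_;
    return if $self->{buffer} eq '';
    $self->{Handler}->characters({ Data => $self->{buffer} });
    $self->{buffer} = '';
    return;
}

sub start_element {
    my ($self, $el) = @_;
    $self->flush_text;
    return $self->{Handler}->start_element($el);
}

sub end_element {
    my ($self, $el) = @_;
    $self->flush_text;
    return $self->{Handler}->end_element($el);
}

package main;

my $log    = EventLog->new;
my $filter = CoalescingFilter->new($log);

# Text reported line by line, with each line break as a separate event.
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => $_ }) for 'first line', "\n", 'second line', "\n";
$filter->end_element({ Name => 'p' });

printf "%d events reached the handler\n", scalar @{ $log->{events} };
```

Four character events go in, but the handler sees only one, between the start and end of the element. A complete filter would also need to flush on every other event type (comments, processing instructions, end of document); that is omitted here for brevity.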

Figure 3. Performance Comparison for High Markup Density Files

XERC appears to be somewhat faster for small, dense documents; it
beats EXPXS in this category. My guess is that this is due to the initial
overhead of XML::SAX::Base being proportionally more significant for small
files. Real processing times (see details in Table 2) show that all of the
modules taking part in this test are perfectly serviceable for parsing web-sized
documents. Table 2 shows the overall results with links to the raw data
produced by the test script.

The average proportional time in the last column has no universal relevance
as it strongly depends on the selection of documents. This is simply a way to
express how the parsers have performed in this test with a single number, but
don't take it too seriously, please.

I would like to avoid making a final evaluation of the tested parsers. Anyone
can draw their own conclusions from the above facts. All the modules perform
well enough in common scenarios. Moreover, apart from pure performance, other
aspects must be taken into account when choosing a Perl parser, such as
compatibility, compliance, stability, and dependencies. Hopefully, the modules
on offer are sufficient for most of us.