The cElementTree Module

The cElementTree module is a C implementation of the ElementTree API,
optimized for fast parsing and low memory use. On typical documents,
cElementTree is 15-20 times faster than the Python version of ElementTree,
and uses 2-5 times less memory. On modern hardware, that means that
documents in the 50-100 megabyte range can be manipulated in memory, and
that documents of up to about 1 megabyte load in effectively zero time
(0.0 seconds). This allows you to drastically simplify many kinds of XML
applications.

If you’re using Linux or BSD systems, check your favourite package
repository for python-celementtree or py-celementtree
packages. Note that some distributors have included cElementTree in
their ElementTree package. Mac OS X users may want to check the Fink
repository.

To install binary distributions from effbot.org, download and run
the installer, and follow the instructions on the screen. If the
installer cannot find your Python interpreter, see this page.

To install from sources, simply unpack the distribution archive,
change to the distribution directory, and run the setup.py script
as follows:

$ python setup.py install

See the README and CHANGES files for more on installation, licensing
(BSD style), changes since the last version, etc.

cElementTree is designed to work with Python 2.1 and newer.
The iterparse mechanism is currently only supported for Python
2.2 and later.
Earlier Python versions are not supported (let me know if you need
support for 2.0 or 1.5.2). For best performance, use Python 2.4.

Note:
Mandriva Linux ships with broken Python configuration files, and cannot
be used to build Python extensions that rely on distutils feature
tests. For a workaround, see this thread.

The cElementTree module is designed to replace the ElementTree module
from the standard elementtree package. In theory, you should be able
to simply change:

from elementtree import ElementTree

to

import cElementTree as ElementTree

in existing code, and run your programs without any problems (note that
cElementTree is a module, not a package). (Let me know if you find
that something you rely on doesn’t work as expected.)
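If the same code has to run on machines where cElementTree may or may not be installed, the import swap can be wrapped in a fallback chain. The following is only a sketch: the last fallback, xml.etree.ElementTree from the standard library (Python 2.5 and later), is an assumption added here so the snippet runs on current interpreters, and the sample document is made up.

```python
# Prefer the C implementation, then the pure-Python package.
try:
    import cElementTree as ElementTree
except ImportError:
    try:
        from elementtree import ElementTree
    except ImportError:
        # Assumption: neither add-on is installed, so fall back to the
        # standard library module that exposes the same API.
        from xml.etree import ElementTree

# The rest of the program only refers to the name "ElementTree".
root = ElementTree.fromstring("<doc><item>spam</item></doc>")
print(root.find("item").text)
```

Since all three modules expose the same names, code after the import should not need to know which one it got.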

cElementTree contains one new function, iterparse, which
works like parse, but lets you track changes to the tree while it is
being built. You can also modify and remove elements during the parse;
for instance, you can process “record” elements as they arrive, and then
remove their contents from the tree.
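A minimal sketch of that pattern might look like this. The sample document, the process_records helper, and the "id" and "msg" names are made up for illustration, and the standard library's xml.etree.ElementTree is used as a stand-in for cElementTree (an assumption, so the snippet runs on current Python versions).

```python
import io
from xml.etree import ElementTree  # assumption: stand-in for cElementTree

# Hypothetical sample data; a real application would pass a file name
# or an open file instead.
XML = b"""<log>
  <record id="1"><msg>first</msg></record>
  <record id="2"><msg>second</msg></record>
</log>"""

def process_records(source):
    """Collect (id, message) pairs, clearing each record once handled."""
    seen = []
    # By default, iterparse reports an "end" event for each element,
    # that is, when the element and all its children are complete.
    for event, elem in ElementTree.iterparse(source):
        if elem.tag == "record":
            seen.append((elem.get("id"), elem.findtext("msg")))
            elem.clear()  # drop children, attributes, and text
    return seen

records = process_records(io.BytesIO(XML))
print(records)
```

Clearing each record after use keeps the tree from growing with the size of the file. The emptied record elements themselves are still attached to the root element.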

Here are some benchmark figures, using a number of popular XML toolkits
to parse a 3405k document-style XML file, from disk to memory:

library                         time      space     notes
xml.dom.minidom (Python 2.1)    6.3 s     80000k    (1)
gnosis.objectify                2.0 s     22000k    (5)
xml.dom.minidom (Python 2.4)    1.4 s     53000k    (1)
ElementTree 1.2                 1.6 s     14500k
ElementTree 1.2.4/1.3           1.1 s     14500k
cDomlette (C extension)         0.540 s   20500k    (1)
PyRXPU (C extension)            0.175 s   10850k    (2)
lxml.etree (C extension)        (4)       (4)       (3)
libxml2 (C extension)           0.098 s   16000k    (3)
readlines (read as utf-8)       0.093 s   8850k
cElementTree (C extension)      0.047 s   4900k
readlines (read as ascii)       0.032 s   5050k

The figures may of course vary somewhat depending on Python version,
compiler, and platform. The above was measured with Python 2.4, using
prebuilt Windows installers (as published by the maintainers) for all C
extensions. If you want further details about the tests,
drop me a line.

Several other toolkits were tested, but failed to parse the test
file (which uses both non-ASCII characters and namespaces). Toolkits
that parse namespaces but don’t handle them properly are included,
though (see notes 2 and 5, below).

For comparison, here are some benchmarks for event-based parsers
(using the same file as above, and enough dummy handlers to be able
to handle complete elements and their character data contents):

library                   time      throughput
xml.sax (Python 2.1)      0.330 s   10300 k/s
xml.sax (Python 2.4)      0.292 s   11700 k/s
xml.parsers.expat         0.184 s   18500 k/s
cElementTree XMLParser    0.124 s   27500 k/s
sgmlop                    0.092 s   37000 k/s
cElementTree iterparse    0.071 s   48000 k/s
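To give an idea of what "dummy handlers" means for an event-based parser, here is a rough sketch using xml.parsers.expat. This is not the actual benchmark code; the count_events helper and the sample input are invented for illustration, and the handlers just count what they see.

```python
import xml.parsers.expat

def count_events(data):
    """Parse data with do-nothing handlers and return simple counters."""
    counts = {"start": 0, "end": 0, "chars": 0}

    def start(tag, attrs):
        counts["start"] += 1

    def end(tag):
        counts["end"] += 1

    def chars(text):
        # expat may deliver character data in several chunks, so add up
        # the lengths rather than counting calls.
        counts["chars"] += len(text)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)  # True: this is the final chunk of input
    return counts

counts = count_events(b"<doc><item>spam</item><item>eggs</item></doc>")
print(counts)
```

A tree builder does real work in each of these three callbacks, which is why the tree-building figures in the first table are not directly comparable to the event-based ones.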

Note 1) For these toolkits, the looping variant of my
benchmark behaves very badly, resulting in unexpected memory growth and
wildly varying parsing times (typically 150-300% of the values in the
table). Strategic use of forced garbage collection (gc.collect())
will usually make things better. Be careful.
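As a rough illustration of what such a looping benchmark with forced collection could look like (this is not the benchmark code used for the table; the timed_parses helper and the sample data are made up, and the snippet is written for current Python 3):

```python
import gc
import io
import time
from xml.dom import minidom

# Hypothetical test data; the real benchmark used a 3405k file.
SAMPLE = b"<doc>" + b"<item>spam</item>" * 1000 + b"</doc>"

def timed_parses(data, runs=3):
    """Parse data several times, forcing a collection before each run."""
    timings = []
    for _ in range(runs):
        gc.collect()  # reclaim cyclic garbage left by the previous run
        start = time.perf_counter()
        doc = minidom.parse(io.BytesIO(data))
        timings.append(time.perf_counter() - start)
        doc.unlink()  # break the DOM's internal reference cycles
    return timings

timings = timed_parses(SAMPLE)
print(["%.4f s" % t for t in timings])
```

Whether the collection goes before or after the timed region changes what you measure, so be explicit about it when reporting figures.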

Note 2) Even with namespace handling enabled, PyRXPU
returns namespace prefixes instead of namespace URIs, which
makes it pretty much useless for namespace-aware XML processing. I’ve
included it anyway, since it’s often put forth as the fastest XML
parser you can get for Python.

Note 3) Tests on other platforms indicate that libxml2
is closer to cElementTree than this benchmark indicates. This is most
likely a compiler-related issue (I’m using “official” Windows binaries
for this benchmark, but so will most other users).

Note 4) There are no Windows binaries for lxml.etree yet, but
it uses libxml2’s parser and object model, so the timings for this
test should be very close to those for libxml2.

Note 5) An undocumented function (config_nspace_sep)
must be called to enable namespace parsing. With that in place, the library
parses the file without problems, but the resulting data structure depends
on the namespace prefixes used in the file, rather than the namespace URIs
(also see note 2).