About the format

A complete phyloXML document has a root node with the tag "phyloxml". Directly under the root is a sequence of "phylogeny" elements (phylogenetic trees), possibly followed by other arbitrary data not included in the phyloXML spec. The main structural element of these phylogenetic trees is the Clade: a tree has a clade attribute, along with other attributes, and each clade contains a series of clades (and other attributes), recursively.

The child nodes and attributes of each XML node are mapped onto classes in the PhyloXML module, keeping the names the same where possible; the XML document structure is closely mirrored in the Phyloxml objects produced by Bio.Phylo.PhyloXMLIO.read(), and the Phylogeny objects produced by Bio.Phylo.read() and parse().

Bio.Phylo.read() and parse() work with Phylogeny objects (derived from BaseTree.Tree) from the Bio.Phylo.PhyloXML module. This standard API is enough for most use cases.
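
As a minimal sketch of this API, the snippet below reads a small in-memory phyloXML document and walks its nested clades recursively. The XML content, the PHYLOXML constant, and the outline() helper are invented for illustration; only Phylo.read(), Phylo.parse(), and the clade, clades, and name attributes are taken from Bio.Phylo.

```python
from io import StringIO

from Bio import Phylo

# Hypothetical two-leaf document, invented for this example.
PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example</name>
    <clade>
      <clade><name>A</name><branch_length>0.1</branch_length></clade>
      <clade><name>B</name><branch_length>0.2</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def outline(clade, depth=0):
    """Return one indented line per clade, recursing into sub-clades."""
    lines = ["  " * depth + (clade.name or "(unnamed)")]
    for child in clade.clades:
        lines.extend(outline(child, depth + 1))
    return lines


# read() expects exactly one tree; parse() yields each tree in turn.
tree = Phylo.read(StringIO(PHYLOXML), "phyloxml")
trees = list(Phylo.parse(StringIO(PHYLOXML), "phyloxml"))

print(tree.name)
print("\n".join(outline(tree.clade)))
```

The recursion in outline() follows the structure described above: a tree has one root clade, and each clade holds a list of child clades.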

PhyloXMLIO

Within Bio.Phylo, the I/O functions for the phyloXML format are implemented in the PhyloXMLIO sub-module. For access to some additional functionality beyond the basic Phylo I/O API, or to skip specifying the 'phyloxml' format argument each time, this can be imported directly:

from Bio.Phylo import PhyloXMLIO

The read() function returns a single Bio.Phylo.PhyloXML.Phyloxml object representing the entire file's data. The phylogenetic trees are in the "phylogenies" attribute, and any other arbitrary data is stored in "other".

If you aren't interested in the "other" data, you can use parse() to iteratively construct just the phylogenetic trees contained in the file -- this is exactly the same as calling Phylo.parse() with the 'phyloxml' format argument.
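
The difference between read() and parse() might be sketched as follows. The document, the foreign "ex" namespace, and the variable names are made up for this example; the phylogenies and other attributes are the ones described above.

```python
from io import StringIO

from Bio.Phylo import PhyloXMLIO

# Hypothetical document: two trees plus one element from a foreign namespace.
PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>first</name>
    <clade><name>A</name></clade>
  </phylogeny>
  <phylogeny rooted="false">
    <name>second</name>
    <clade><name>B</name></clade>
  </phylogeny>
  <ex:note xmlns:ex="http://example.org/ns">not part of phyloXML</ex:note>
</phyloxml>"""

# read() loads the whole file into one Phyloxml container.
phx = PhyloXMLIO.read(StringIO(PHYLOXML))
read_names = [phylogeny.name for phylogeny in phx.phylogenies]
other_tags = [item.tag for item in phx.other]

# parse() yields the phylogenies one at a time and skips the "other" data.
parse_names = [phylogeny.name for phylogeny in PhyloXMLIO.parse(StringIO(PHYLOXML))]

print(read_names, other_tags, parse_names)
```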

PhyloXMLIO.write() is similar to Phylo.write(), but also accepts a Phyloxml object (the result of read() or to_phyloxml()) to serialize. Optionally, an encoding other than UTF-8 can be specified.

PhyloXMLIO also contains a utility called dump_tags() for printing all of the XML tags as they are encountered in a phyloXML file. This can be helpful for debugging, or used along with grep or sort -u on the command line to obtain a list of the tags a phyloXML file contains.
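
The same idea, collecting the distinct tags much as dump_tags() piped through sort -u would, can be sketched with the standard library alone. This is not the library's implementation and does not call dump_tags(); the unique_tags() helper and the sample document are hypothetical.

```python
from io import StringIO
from xml.etree import ElementTree

# Hypothetical document, invented for this example.
PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <name>example</name>
    <clade>
      <clade><name>A</name><branch_length>0.1</branch_length></clade>
      <clade><name>B</name><branch_length>0.2</branch_length></clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def unique_tags(handle):
    """Return the sorted set of local tag names found in an XML stream."""
    tags = set()
    for _event, elem in ElementTree.iterparse(handle, events=("start",)):
        # ElementTree reports tags as "{namespace}local"; keep the local part.
        tags.add(elem.tag.split("}", 1)[-1])
    return sorted(tags)


print(unique_tags(StringIO(PHYLOXML)))
```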

Core classes

Phyloxml

Container for phylogenies; not used by the top-level Bio.Phylo I/O functions

Phylogeny

Derived from Tree -- the global tree object

Clade

Derived from Subtree -- represents a node in the object tree, and local info

Other

Represents data included in the phyloXML file but not described by the phyloXML specification

Annotation types

(to do)

Integrating with the rest of Biopython

The classes used by this module inherit from the Phylo module's generalized BaseTree classes, and therefore have access to the methods defined on those base classes. Since the phyloXML specification is very detailed, these subclasses are kept in a separate module, Bio.Phylo.PhyloXML, and offer additional methods for converting between phyloXML and standard Biopython types.
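
To illustrate the inheritance, the sketch below parses a tree from Newick, converts it with the as_phyloxml() method of BaseTree.Tree, and then calls generic BaseTree methods on the result. The Newick string and variable names are made up for this example.

```python
from io import StringIO

from Bio import Phylo
from Bio.Phylo import BaseTree, PhyloXML

# Hypothetical Newick tree, invented for this example.
newick_tree = Phylo.read(StringIO("(A:0.1,B:0.2);"), "newick")

# Convert the generic tree into a PhyloXML.Phylogeny.
px_tree = newick_tree.as_phyloxml()

# The result is still a BaseTree.Tree, so the generic methods remain available.
is_phylogeny = isinstance(px_tree, PhyloXML.Phylogeny)
is_base_tree = isinstance(px_tree, BaseTree.Tree)
leaf_names = [leaf.name for leaf in px_tree.get_terminals()]
leaf_count = px_tree.count_terminals()
```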

The PhyloXML.Sequence class contains methods for converting to and from Biopython SeqRecord objects -- to_seqrecord() and from_seqrecord(). This includes the molecular sequence (mol_seq) as a Seq object, and the protein domain architecture as a list of SeqFeature objects. Likewise, PhyloXML.ProteinDomain objects have a to_seqfeature() method.
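
A conversion in both directions might look like the sketch below. The record contents are invented, and the example assumes a recent Biopython release in which from_seqrecord() accepts an is_aligned argument. The SeqRecord name is kept short and free of whitespace because it becomes the phyloXML sequence symbol.

```python
from Bio.Phylo.PhyloXML import Sequence
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical record, invented for this example.
record = SeqRecord(
    Seq("ACGTACGT"),
    id="X0001",
    name="geneX",
    description="hypothetical example gene",
)

# SeqRecord -> PhyloXML.Sequence; the sequence text ends up in mol_seq.
px_seq = Sequence.from_seqrecord(record, is_aligned=False)

# PhyloXML.Sequence -> SeqRecord.
round_trip = px_seq.to_seqrecord()

print(px_seq.mol_seq.value)
print(round_trip.seq)
```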

Performance

This parser is meant to be able to handle large files, meaning several thousand external nodes. (Benchmarks of relevant XML parsers for Python are here.) It has been tested with files of this size; for example, the complete NCBI taxonomy parses in about 100 seconds and consumes about 1.3 GB of memory. Provided enough memory is available on the system, the writer can also rebuild phyloXML files of this size.

The read() and parse() functions process a complete file in about the same amount of CPU time. Most of the underlying code is the same, and the majority of the time is spent building Clade objects (the most common node type). For small files (smaller than ncbi_taxonomy_mollusca.xml), the write() function serializes the complete object back to an equivalent file slightly slower than the corresponding read() call; for very large files, write() finishes faster than read().

Here are some times on a 2.00GHz Intel Xeon E5405 processor (only 1 CPU core used) with 7.7GB memory, running the standard Python 2.6.2 on Ubuntu 9.04, choosing the best of 3 runs for each function:

| File | Ext. Nodes | Size (uncompressed) | Read (s) | Parse (s) | Write (s) |
|---|---|---|---|---|---|
| apaf.xml | | 38 KB | 0.01 | 0.01 | 0.02 |
| bcl_2.xml | | 105 KB | 0.02 | 0.02 | 0.04 |
| ncbi_taxonomy_mollusca.xml | 5632 | 1.5 MB | 0.51 | 0.49 | 0.80 |
| tol_life_on_earth_1.xml | 57124 | 46 MB | 10.28 | 10.67 | 10.36 |
| ncbi_taxonomy_metazoa.xml | 73907 | 33 MB | 15.76 | 16.15 | 10.69 |
| ncbi_taxonomy.xml | 263691 | 31 MB (unindented) | 109.70 | 109.14 | 32.39 |

On 32-bit architectures, psyco might improve these times significantly, at the risk of increasing memory usage. (I haven't tested it.) For comparison, the Java-based parser used in Forester and ATV (see below) reads the same files about 3-5 times as quickly, or up to 15x for the largest file.

For Python 2.4, performance depends on which ElementTree implementation is used. Using the original pure-Python elementtree, reading/parsing takes about twice as much time for all file sizes, but writing is only significantly slower for very large files.