PhyloPhlAn

PhyloPhlAn: microbial Tree of Life using 400 universal proteins

PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved phylogenetic trees based on whole-genome sequence information.
The pipeline is scalable to thousands of genomes and uses the most conserved 400 proteins for extracting the phylogenetic signal. PhyloPhlAn also
implements taxonomic curation, estimation, and insertion operations.

The main features of PhyloPhlAn are:

completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides - not nucleotides)

very high topological accuracy and resolution because of the use of up to 400 previously identified most conserved proteins

the ability to integrate new genomes into the already reconstructed, most comprehensive tree of life (3,171 microbial genomes)
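Because the input must consist of peptide rather than nucleotide sequences, a quick sanity check before running the pipeline can save a failed run. The following is a minimal sketch and is not part of PhyloPhlAn: the function name looks_like_protein and the 90% threshold are assumptions chosen for illustration. It treats a FASTA file as nucleotide-like when almost all of its sequence letters are A, C, G, T, U, or N.

```shell
# Hypothetical helper (not part of PhyloPhlAn): guess whether a FASTA file
# holds peptide sequences. Returns 0 for protein-like input and 1 for
# nucleotide-like or empty input.
looks_like_protein() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            seq = toupper($0)
            gsub(/[^A-Z]/, "", seq)        # keep letters only
            total += length(seq)
            nuc = seq
            gsub(/[^ACGTUN]/, "", nuc)     # letters that could be nucleotides
            nucleotide += length(nuc)
        }
        END {
            if (total == 0) exit 1
            # Assumed threshold: more than 90% nucleotide letters means "not protein"
            if (nucleotide / total > 0.9) exit 1
            exit 0
        }
    ' "$1"
}
```

For instance, looping over input/my_genomes/*.faa and printing every file for which looks_like_protein fails would list the files worth a second look. The heuristic can be fooled by very short or unusual sequences, so treat a failure as a hint rather than proof.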

Common commands and examples

"De novo" phylogenetic tree building with any set of genomes

If you would like to build a phylogenetic tree using any set of private or public genomes, all you need to do is create a folder inside the input folder and copy into it one multifasta file (with extension ".faa") per genome, containing its peptide sequences. If you call this folder "my_genomes", the command to run is:

$ ./phylophlan.py -u my_genomes

When finished, the resulting tree will appear in the output/my_genomes folder.
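The folder preparation described above can be sketched as a small shell function. This is an illustration only: stage_genomes is a hypothetical helper, and its third argument (the PhyloPhlAn root directory, defaulting to the current directory) is an assumption about where the input folder lives.

```shell
# Hypothetical helper: copy every .faa file from a source directory into
# input/<project>/ under the PhyloPhlAn root (third argument, default ".").
# Fails without creating anything if the source holds no .faa files.
stage_genomes() {
    src_dir=$1
    project=$2
    root=${3:-.}
    set -- "$src_dir"/*.faa                # expand the list of genome files
    [ -e "$1" ] || { echo "no .faa files in $src_dir" >&2; return 1; }
    mkdir -p "$root/input/$project" &&
    cp "$@" "$root/input/$project/" &&
    echo "staged $# genome(s) in $root/input/$project"
}
```

Run from the PhyloPhlAn directory, a call such as stage_genomes /path/to/my/faa/files my_genomes would fill input/my_genomes, after which the -u command shown above can be launched.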

Example 1: Corynebacterium "de novo" phylogenetic tree building

You can try out this operation (-u) using the example called example_corynebacteria, which is included in the PhyloPhlAn package you downloaded and stored in the input folder. It contains a protein multifasta file for each of the 30 genomes available for the Corynebacterium genus as of February 2012, plus two Streptomyces genomes as a meaningful outgroup. As mentioned above, the command for obtaining the phylogenetic tree is:

$ ./phylophlan.py -u example_corynebacteria --nproc 4

Using 4 threads (specified with --nproc 4), this operation should take no more than 4-5 minutes; even with a single processor (the default) it should give you the results in 10 minutes or so.

The full tree of life reported above was originally generated in the same way. Note that the concatenated alignment used to generate the tree with FastTree is stored in data/example_corynebacteria/aln.fna and can be used as input for other phylogenetic reconstruction software, such as RAxML or MEGA, among many others.

Inserting new genomes into the tree of life

PhyloPhlAn lets you insert a genome (or a set of genomes) into the already built microbial tree of life (containing >3,000 genomes, see figure and tree files above). In this case too, you need to create a dedicated folder (e.g. my_genomes_to_insert) in the input folder to store the protein multifasta files of interest. The command is:

$ ./phylophlan.py -i my_genomes_to_insert --nproc 16

We recommend using as many threads as possible (--nproc), because this operation is computationally demanding: the alignments with the other 3,000 genomes must be updated and the full tree of life rebuilt.
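One way to follow this advice without hard-coding a number is to ask the system how many CPUs are online. The sketch below is illustrative only: cpu_count is a hypothetical helper, and getconf _NPROCESSORS_ONLN is assumed to be available (it is on Linux and macOS, but POSIX does not guarantee it), so the function falls back to 1.

```shell
# Hypothetical helper: print the number of online CPUs, falling back to 1
# when the value cannot be determined.
cpu_count() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    case $n in
        ''|*[!0-9]*) n=1 ;;                # not a plain number: fall back
    esac
    echo "$n"
}

# Example invocation, commented out so the snippet has no side effects:
# ./phylophlan.py -i my_genomes_to_insert --nproc "$(cpu_count)"
```

On a shared machine you may prefer to pass a smaller value than the one reported, leaving some CPUs free for other users.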

The resulting tree file output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk can be inspected with tree visualization software to check where the new genomes are placed and how they relate to already well-characterized strains.
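Before opening a viewer, it can be useful to confirm from the command line that the new genomes actually appear in the tree. The following is a rough sketch, not a PhyloPhlAn feature: newick_leaves and has_leaf are hypothetical helpers, and they assume unquoted Newick labels without spaces, commas, colons, semicolons, or parentheses. Which label each genome receives in the tree is also an assumption you should verify (for example, the .faa file name without its extension).

```shell
# Hypothetical helper: print one leaf label per line from a Newick file.
# Assumes unquoted labels (see the note above).
newick_leaves() {
    tr -d '\n' < "$1" |
        sed 's/)[^(),;]*/)/g' |    # drop internal-node labels and lengths after ")"
        tr '(),;' '\n\n\n\n' |     # put one token on each line
        sed 's/:.*//' |            # strip branch lengths from the leaves
        grep -v '^[[:space:]]*$'   # discard empty lines
}

# Hypothetical helper: succeed if the tree (first argument) contains a leaf
# whose label is exactly the second argument.
has_leaf() {
    newick_leaves "$1" | grep -Fx -- "$2" > /dev/null
}
```

For example, has_leaf output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk some_label succeeds only when some_label is a leaf of that tree, and newick_leaves piped into wc -l gives the total number of genomes in it.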

Example 2: inserting Lactobacillus and Sulfolobus genomes into the tree of life

As an example of insertion, the input folder of the PhyloPhlAn package includes three recently sequenced genomes that are not yet part of the PhyloPhlAn tree and repository. These are two Lactobacillus genomes and one Sulfolobus genome available in IMG (accessions 2511231185, 2519899592, and 2524023197, respectively).

$ ./phylophlan.py -i example_insertion --nproc 16

The resulting file example_insertion.tree.int.nwk now contains the thousands of genomes in the PhyloPhlAn repository as well as the three "new" genomes.

Imputing taxonomic labels for newly integrated genomes

You can also ask PhyloPhlAn to try to automatically assign taxonomic labels to the genomes integrated into the tree of life (-i option introduced above). To do so, simply add the -t flag (for taxonomic analysis) to the same command line:

$ ./phylophlan.py -i -t my_genomes_to_insert --nproc 16

In addition to the output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk file, you will obtain tab-separated text files with the most confident taxonomic predictions for your genomes in the output/my_genomes_to_insert/ folder.

Example 3: predicting the taxonomic labels of three "new" genomes

Suppose you don't know the taxonomic labels of the Lactobacillus and Sulfolobus genomes used as examples above, possibly because of insufficient phenotypic characterization or because you obtained them by metagenomic assembly. You can call the PhyloPhlAn taxonomic imputation pipeline as:

$ ./phylophlan.py -i -t example_insertion --nproc 16

As expected, all three genomes are assigned to the correct genera. The two lactobacilli are also assigned to the correct species (s__rhamnosus), whereas PhyloPhlAn does not find enough support to assign the Sulfolobus genome to the "acidocaldarius" species.

All command line options and parameters

$ ./phylophlan.py -h
usage: phylophlan.py [-h] [-i] [-u] [-t] [--tax_test TAX_TEST] [-c]
                     [--cleanall] [--nproc N] [-v]
                     [PROJECT NAME]

NAME AND VERSION:
    PhyloPhlAn version 0.99 (8 May 2013)

AUTHORS:
    Nicola Segata (nsegata@hsph.harvard.edu) and Curtis Huttenhower (chuttenh@hsph.harvard.edu)

DESCRIPTION
    PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved
    phylogenetic trees based on whole-genome sequence information. The pipeline is scalable
    to thousands of genomes and uses the most conserved 400 proteins for extracting the
    phylogenetic signal.
    PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.

positional arguments:
  PROJECT NAME          The basename of the project corresponding to the name of the input data folder inside
                        input/. The input data consist of a collection of multifasta files (extension .faa)
                        containing the proteins in each genome.
                        If the project already exists, the already executed steps are not re-ran.
                        The results will be stored in a folder with the project basename in output/
                        Multiple project can be generated and they safetely coexists.

optional arguments:
  -h, --help            show this help message and exit
  -i, --integrate       Integrate user genomes into the PhyloPhlAn tree
  -u, --user_tree       Build a phylogenetic tree using user genomes only
  -t, --taxonomic_analysis
                        Check taxonomic inconsistencies and refine/correct taxonomic labels
  --tax_test TAX_TEST   nerrors:type:taxl:tmin:tex:name (alpha version, experimental!)
  -c, --clean           Clean the final and partial data produced for the specified project.
                        (use --cleanall for removing general installation and database files)
  --cleanall            Remove all instalation and database file leaving untouched the initial compressed data
                        that is automatically extracted and formatted at the first pipeline run.
                        Projects are not remove (specify a project and use -c for removing projects).
  --nproc N             The number of CPUs to use for parallelizing the blasting
                        [default 1, i.e. no parallelism]
  -v, --version         Prints the current PhyloPhlAn version and exit

External Software Dependencies

muscle version v3.8.31 or higher must be present in the system path and called "muscle"

usearch version v5.2.32 (notice that version 6 is currently NOT supported) must be present in the system path and called "usearch"

FastTree version 2.1 or higher must be present in the system path and called "FastTree"
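A quick way to confirm that the three executables are reachable under the expected names is sketched below. It is only an illustration: missing_tools is a hypothetical helper, and it checks presence on the PATH only, not the version requirements listed above, which still need to be verified by hand.

```shell
# Hypothetical helper: print the name of each given tool that cannot be
# found on the PATH (one per line); prints nothing if all are present.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Executable names taken from the dependency list above.
missing=$(missing_tools muscle usearch FastTree)
if [ -n "$missing" ]; then
    echo "not found on PATH:" $missing >&2
fi
```

The names are case-sensitive, so an executable installed as "fasttree" would still be reported as missing until it is renamed or linked to "FastTree".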

Acknowledgements

The authors of PhyloPhlAn would like to thank Ashlee Earl and the Human Microbiome Project Strains Working Group for insightful suggestions, Morgan
Price for his helpful comments on applying FastTree, and Levi Waldron, Joshua Reyes and Timothy Tickle for their suggestions on methodology and tree
visualization.