Monday, July 29, 2013

Just submitted our first paper to PeerJ - the awesome new open access journal aimed at turning the entire publishing establishment on its head. I'm looking forward to a smooth review process -- hopefully as sleek and helpful as the submission process. <UPDATE: our paper was accepted. I will post a link to the paper once it's online!>

Our manuscript presents PhyBin, a computer program aimed at binning precomputed
sets of trees in Newick format, a file format produced by the majority of tree
building software. As we assert in the manuscript, PhyBin is a utility
rather than a complete solution; it can serve as a component in many genomics
pipelines, and provides a useful addition to the landscape of tools for
dissecting and visualizing large numbers of trees. After the user applies their chosen ortholog
prediction and tree-building algorithms, PhyBin offers a quick way to visualize
and browse the different evolutionary histories, either binned by topology and
sorted by bin size, or in the form of a full hierarchical clustering based on
Robinson-Foulds (RF) distance: i.e. a tree of
trees.

In the manuscript, we explore two functions of PhyBin: 1) the ability to bin trees with identical topologies and 2) the ability to cluster similar trees by RF distance. Lots of folks interested in the "landscape" of topologies produced by orthologous genes across a genome use RF distance as a measure of topological similarity. What is RF distance? It is essentially the number of steps you'd have to take to turn one tree into another -- it's the edit distance between two topologies. So, according to the original Robinson-Foulds publication, for example, the trees below (trees 1 and 2) are edit distance 2 apart because in order to convert one to the other, you must collapse a node and then re-form it.
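To make the idea concrete, here is a minimal Python sketch of RF distance. It is not PhyBin's code, just an illustration under a few assumptions of mine: trees are written as nested tuples of taxon names rather than parsed from Newick, they are treated as unrooted, and the function names (`leaf_set`, `bipartitions`, `rf_distance`) are made up for this post. Each internal branch splits the taxa into two groups, and the RF distance is the number of splits found in one tree but not the other.

```python
def leaf_set(tree):
    """Return the set of taxon names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits (internal branches) of an unrooted tree.

    Each split is stored as the side that does NOT contain an arbitrary
    anchor taxon, so the same branch always gets the same representation.
    """
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits that separate zero or one taxon from the rest.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Count the splits present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Two four-taxon trees that disagree about their single internal branch.
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(rf_distance(tree1, tree2))
```

Each of these toy trees has exactly one internal branch, and the two branches differ, so converting one tree into the other takes one collapse plus one re-formed node. Writing the same topology in a different order, such as `(("C", "D"), ("B", "A"))`, gives a distance of 0 from `tree1`.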

PhyBin does some pretty neat pre-processing of trees to facilitate comparisons. For example, you can set a branch length threshold to collapse branches that are essentially noise in your dataset (say, from very closely related taxa). It also checks the number of taxa in your dataset and is quite robust to file formatting. PhyBin then calculates the edit distances for a large group of trees (a distance matrix) and displays these distances as a tree of trees - a dendrogram that links the trees in the dataset to one another based on the edit distances between them (how you'd get from one to another). In our manuscript, we used a set of orthologs generated from 10 published Wolbachia genomes. Here's what the dendrogram looks like for those 508 trees without (A) and with (B) clustering by RF distance. In the figure, you can see that many of the trees in the Wolbachia ortholog set are similar: they fall into 9 large clusters, many of which support the monophyly of the Wolbachia supergroups, while others (in fact, a good fraction!) do not.
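For the curious, the binning and clustering steps can be sketched in a few lines of Python. This is a toy illustration, not how PhyBin is implemented. I am assuming each tree has already been reduced to its set of splits (one side of every internal branch, stored as a frozenset), and I picked naive average-linkage clustering for simplicity; the function names and the four example trees are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bin_by_topology(split_sets):
    """Group identical topologies; returns (topology, count), largest bin first."""
    return Counter(split_sets).most_common()


def rf_matrix(split_sets):
    """Pairwise RF distances: the number of splits found in only one tree."""
    n = len(split_sets)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = len(split_sets[i] ^ split_sets[j])
    return dist


def cluster(dist):
    """Naive average-linkage clustering over a distance matrix.

    Returns the merges in order as (members_a, members_b, distance),
    which is enough information to draw a dendrogram: a tree of trees.
    """
    def average(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        merges.append((a, b, average(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy data: four trees on taxa A-E, each given as its set of splits.
split = frozenset
trees = [
    split({split("CDE"), split("DE")}),
    split({split("CDE"), split("DE")}),  # same topology as tree 0
    split({split("CDE"), split("CD")}),
    split({split("BDE"), split("BD")}),
]

for members_a, members_b, distance in cluster(rf_matrix(trees)):
    print(members_a, members_b, distance)
```

Trees 0 and 1 share a topology, so they land in the same bin and merge first at distance 0. Tree 2 shares one split with them and joins next, and tree 3, which shares no splits with any of the others, joins last. A real run over hundreds of trees would want a proper clustering library rather than this quadratic-per-merge loop.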