Wednesday, March 30, 2011

I just added a minor update to my ltt() function. The only significant difference between this function and plotLtt() in Rabosky's {laser} package is that my function can also plot trees with non-contemporaneous (i.e., extinct) tips. The code for ltt() can be found on my R-phylogenetics page (direct link here).

Basically, in preparing for a talk that I will be giving tomorrow, I realized that my function relied on a "cladewise" order of the edges of the input "phylo" object. This is the order that is created by read.tree() (as well as my functions read.newick() and read.simmap()). However, some other functions in R return trees with edges in "pruningwise" order, which is more efficient for post-order tree traversal and pruning. [For more information about this, type ?reorder.phylo at the prompt.]

Luckily, {ape} has a function, reorder(), that can switch from "pruningwise" to "cladewise" order, or vice versa. So, the bug with ltt() was an easy fix. I simply added the line:

tree<-reorder(tree,"cladewise")

to the code before beginning the calculation.
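In case it is useful, here is a minimal sketch of switching edge orders with {ape}'s reorder() (using rtree() purely for illustration):

```r
library(ape)

set.seed(1)
tree <- reorder(rtree(10), "pruningwise")  # a tree in "pruningwise" order
tree <- reorder(tree, "cladewise")         # switch back to "cladewise"
attr(tree, "order")                        # the current order is stored here
```

reorder() leaves the tree itself unchanged; only the row order of $edge (and $edge.length) differs between the two representations.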

[In addition to this change, I also moved "tol" - the tolerance threshold for rounding errors in node heights - to be an optional input instead of fixed. This probably won't affect most users.]

Friday, March 25, 2011

For my class this semester on "Applied Phylogenetic Methods" (which, based on a chat today with Ron Etter's postdoc Rob Jennings, I've now concluded should more properly be called "Phylogenetics Demystified"), I recently programmed an implementation of phylogeny estimation using the unweighted least squares method (on my R-phylogenetics page, code here [update, v0.2]) - originally developed by Cavalli-Sforza & Edwards (1967), and described in detail by Felsenstein (2004, which is also our textbook for the class). Felsenstein (2004) calls the least squares methods the statistically "best-justified" distance matrix methods (p. 148). To my knowledge, there is no direct implementation of least squares in any R package [although it should be possible to call the FastME program from within R, and minimum evolution uses the least squares branch lengths].

As the title of this post suggests, this implementation is really just for fun, and should not be considered a serious function for phylogeny inference! This is because it is not programmed efficiently and is thus very, very slow for more than a modest number of taxa. It also uses only nearest neighbor interchange (NNI) to search treespace (capitalizing on the nni() function of the very useful {phangorn} package), and thus will tend to get stuck on "islands" of suboptimal trees (those not connected to better trees by a single NNI) if the starting tree is not very good. Nonetheless, the function does appear to work. If we give it a matrix containing the true evolutionary distances between species (i.e., the distances in the true tree), the program will tend to find the correct topology and branch lengths given a reasonable starting tree.

The basic principle of least squares phylogeny estimation is straightforward. We would like to find a tree and set of branch lengths that minimizes the summed squared deviations between our observed distances (i.e., the set of distances that serve as input for our analysis) and the hypothesized distances (i.e., the patristic distances in our inferred tree).

So, in other words, our model is as follows: D = Xv+e; in which D is a vector of distances between species; X is a design matrix based on a hypothesized topology (to be discussed below); v is a vector of edge lengths; and e, of course, is our residual error.

Given a topology, to solve for our estimated edge lengths we can just use the standard least squares estimating equation: v = (X'X)^(-1)X'D. We can then calculate the residual sum of squares (the quantity we want to minimize) for a given tree as: Q = (D-Xv)'(D-Xv).

So, there are two things we want to find. We want to find the v that minimizes Q conditioned on a tree topology (given by X), which we can easily do using least squares estimation; but we also want to find the tree that minimizes Q given our least squares estimates of v.

We first need to compute X given a tree. I have not described the design matrix X to this point, but X is an N x m matrix, where N = n(n-1)/2 (for n species) is just the number of distances between species, and m = 2n-3 (for a fully bifurcating unrooted tree) is the number of edges in the tree. We put a "1" in X[i,j] if the jth edge can be found on the path between the pair of species represented in the ith row of the matrix, and a "0" otherwise. To make this a little clearer, consider the unrooted, fully labeled tree at right and the design matrix calculated with my function phyloDesign() below:
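To make the algebra concrete, here is a hand-built example for a three-taxon unrooted tree (three edges, three pairwise distances); the matrix and distances below are my own toy numbers, not phyloDesign() output:

```r
# Design matrix for an unrooted 3-taxon tree: rows are the pairs
# (1,2), (1,3), (2,3); columns are the edges leading to tips 1, 2, 3
X <- matrix(c(1, 1, 0,
              1, 0, 1,
              0, 1, 1), nrow = 3, byrow = TRUE)

# "Observed" distances - here exactly additive, for edge lengths (1, 2, 3)
D <- c(3, 4, 5)

# Least squares estimate of the edge lengths: v = (X'X)^(-1) X'D
v <- solve(t(X) %*% X) %*% t(X) %*% D

# Residual sum of squares: Q = (D - Xv)'(D - Xv)
Q <- as.numeric(t(D - X %*% v) %*% (D - X %*% v))

round(as.numeric(v), 6)  # 1 2 3
```

Because the distances here are exactly additive on the tree, the least squares edge lengths recover them perfectly and Q is (numerically) zero.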

Now, with this design matrix we can compute our least squares branch lengths and Q, and it is then only a matter of finding the tree (and thus the design matrix, X) that minimizes Q. To do this, I have used nni() from the {phangorn} package to design a greedy (but nonetheless remarkably slow) algorithm for finding the least squares tree. nni() returns all the trees one NNI step away from a given tree. So, I just take all of these, compute X, v, and Q for each tree, pick the tree with the lowest Q, and repeat until Q no longer improves under NNI. Obviously, this algorithm will easily get stuck on low-lying "hills" in treespace. I'm not familiar with other, more aggressive tree rearrangement functions available in R (for instance, TBR), but perhaps my readers are - and if so, they can tell me!

To see how this works, let's take the distance matrix for 8 mammals given by Felsenstein (2004; p. 163), as follows:

which results in a tree that makes a fair bit of sense. Note that the zero-length branch in this tree was actually negative in the LS tree, but my program sets negative branch lengths to zero by default (although this can be changed by the user).

Tuesday, March 22, 2011

I am hiring a postdoctoral researcher for my lab - to start on or after Aug. 1, 2011. I am particularly interested in applicants with experience studying evolutionary ecology of lizards; and/or computational methods for evolutionary biology (particularly phylogenetic methods). The initial appointment is for just one year, but we would ideally renew this appointment for multiple years (contingent on funding).

If interested, please check out the official HR job listing here. For more information about my lab, see my lab webpage. Finally, please feel free to email me to get more information about the position (or post questions of general interest to the blog).

Monday, March 21, 2011

A couple of users (e.g., here) have identified an annoying property of both brownie.lite() and evol.vcv() [both available on my R-phylogenetics page]: the functions order the estimated evolutionary rates or matrices by the order in which the corresponding mapped states appear in the tree (not, for instance, in numeric or alphabetical order). For instance, consider the following code (which also gives a method to simulate Brownian rate variation on the tree):

We can see that the estimated rates are the same (as we would expect), but their order in $sig2.multiple and $vcv.multiple has flipped.

This property (ordering the rates by their temporal appearance, rather than alphanumerically) was by design, not chance; however, I can see how it might be annoying for someone analyzing a dataset containing many SIMMAP trees.

Fortunately, there is an easy fix. Just rearrange the columns of $mapped.edge in the order that you want the mapped states to be ordered. For instance:
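For instance, with a toy matrix standing in for a real $mapped.edge (my own construction, just for illustration), putting the mapped states into alphabetical order looks like this:

```r
# Toy $mapped.edge: one column per mapped state, in order of first
# appearance on the tree (here state "B" happened to appear first)
mapped.edge <- matrix(c(0.5, 1.0,
                        0.2, 0.8), nrow = 2, byrow = TRUE,
                      dimnames = list(NULL, c("B", "A")))

# Reorder the columns alphanumerically before fitting
mapped.edge <- mapped.edge[, order(colnames(mapped.edge)), drop = FALSE]
colnames(mapped.edge)  # "A" "B"
```

For a real SIMMAP tree the equivalent one-liner would be tree$mapped.edge<-tree$mapped.edge[,order(colnames(tree$mapped.edge))].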

Thursday, March 17, 2011

I have posted a new function to compute phylogenetic signal for continuous traits using two methods: the lambda method of Pagel (1999) and the K-statistic of Blomberg et al. (2003). Rich Glor already created a very helpful wiki page on how to do this in R using several different functions. My function, phylosig() (code here), should make this even easier. With phylosig() you can specify the method (presently either "K" or "lambda") and optionally return a P-value: for either the randomization test of Blomberg et al. or a likelihood ratio test against the null hypothesis that lambda=0.

For instance, after loading our function from source, to compute K and test the null hypothesis of no signal, we just type:

Tuesday, March 15, 2011

Dave Collar helpfully identified a bug in v0.2 of read.simmap() that was preventing the function from properly reading in a dataset of multiple trees. [Since this function was a complete rewrite of v0.1, this was not a problem in the earlier edition.] This bug has now been fixed and the code (v0.3) is available on my R-phylogenetics page (direct link to code here).

Are any of my readers aware of an existing function to do Matrix Representation Parsimony (MRP) supertree estimation (Baum 1992; Ragan 1992; also see Sanderson et al. 1998; Bininda-Emonds et al. 2007) in R? I'm not, but I was thinking it would be a fun - and easy - thing to program. Consequently I have just posted a function for this, called mrp.supertree(), on my R-phylogenetics page (direct link to code here). This function could also be used to get the MRP consensus tree (that is, if all the input phylogenies have the same set of taxa; e.g., see Felsenstein 2004 pp. 527-528).

I programmed this by using two handy functions: the function prop.part() in {ape} (which returns all the non-trivial bipartitions for a given tree) and pratchet() in {phangorn} (Schliep 2010), which estimates the MP tree using the parsimony ratchet method of Nixon (1999).

The way MRP supertree estimation works is by first constructing N (for N trees) n_i x m_i binary matrices (for n_i species and m_i non-trivial bipartitions in the ith tree). We score each tree's matrix by putting a "1" in the (j,k)th position if the jth taxon is found in the kth bipartition, and a "0" otherwise. We then accumulate these matrices into a supermatrix with union(species_i) rows and sum(m_i) columns, putting a "?" into the supermatrix wherever a species is missing from the corresponding tree. Finally, our MRP supertree is the MP tree estimated for this binary supermatrix.
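To make the scoring concrete, here is a toy base-R sketch of building the supermatrix; the bipartitions and taxon sets are invented for illustration (in mrp.supertree() itself, prop.part() supplies the bipartitions):

```r
# Two toy input trees, represented by their non-trivial bipartitions
# (each bipartition is the set of taxa on one side of an internal edge)
bipart1 <- list(c("a", "b"), c("a", "b", "c"))  # tree 1 contains a,b,c,d
bipart2 <- list(c("a", "b"))                    # tree 2 contains a,b,d

species <- c("a", "b", "c", "d")                # union of all taxa

# Score one tree: "1" if the taxon is in the bipartition, "0" if it is
# in the tree but not the bipartition, "?" if it is absent from the tree
score <- function(biparts, taxa) {
  M <- sapply(biparts, function(b) ifelse(species %in% b, "1", "0"))
  M[!(species %in% taxa), ] <- "?"
  rownames(M) <- species
  M
}

supermatrix <- cbind(score(bipart1, c("a", "b", "c", "d")),
                     score(bipart2, c("a", "b", "d")))
supermatrix  # 4 rows (taxa) by 3 columns (bipartitions)
```

The MP tree for this character matrix (e.g., from pratchet() in {phangorn}) would then be the MRP supertree.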

To see how well this function does, let's first generate a random reference tree with 50 species:

> true.tree<-rtree(n=50)

Now let's randomly prune 40-60% of the species from this tree ten times:
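One way to sketch that pruning step (the exact calls from my session are not shown here, so this is just an illustration using {ape}'s rtree() and drop.tip()):

```r
library(ape)

set.seed(10)
true.tree <- rtree(n = 50)   # random 50-species reference tree

# Ten subsampled trees, each missing a random 40-60% of the species
trees <- lapply(1:10, function(i) {
  p <- runif(1, 0.4, 0.6)    # fraction of taxa to drop
  drop.tip(true.tree, sample(true.tree$tip.label, round(p * 50)))
})
class(trees) <- "multiPhylo"
range(sapply(trees, Ntip))   # each tree keeps roughly 20-30 tips
```

These ten partially overlapping trees can then be handed to mrp.supertree() to see how well the reference topology is recovered.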

A few notes on this implementation. First, I set pratchet() to return all equally parsimonious trees (if several are found, mrp.supertree() will return a "multiPhylo" object) - but this is not the same as a guarantee of finding all most parsimonious trees, which could only be provided by an exhaustive search. Second, pratchet() does not always find the MP tree - so you might want to run mrp.supertree() multiple times if the parsimony score is greater than the theoretical minimum.

One teeny-tiny problem though. Although this function is much faster than sim.char() in {geiger} for a single simulation, e.g.:

> tree<-...
> system.time(x<-sim.char(...))
   user  system elapsed
  80.87    0.00   80.99

compared to:

> system.time(x<-fastBM(...))
   user  system elapsed
   0.18    0.00    0.19

[using v0.2 of fastBM() on an Intel Core i5 @ 3.20GHz], it can actually be much slower for multiple simulations. For instance, sim.char() takes basically the same time to run 100 simulations as it does to run 1:

> system.time(X<-sim.char(...))
   user  system elapsed
  80.76    0.02   80.87

whereas the only way to run multiple simulations on the same tree using fastBM() has been to loop the function many times, e.g.

Although fastBM() still does better in this example, it has gone from being 400 times faster to a mere 4 times faster! That is because when we just loop calls to fastBM(), computation time increases in direct proportion to the number of simulations (in this case ~0.2s x 100 = 20s). By contrast, sim.char() has only one very hard calculation to do - the eigen-decomposition of the matrix given by vcv(tree). Once this calculation is complete, the function needs very little additional time to run more simulations.
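The decompose-once idea can be sketched in base R - here with a small stand-in matrix for vcv(tree), and a Cholesky factorization playing the role of the one expensive decomposition:

```r
set.seed(1)

# A stand-in for vcv(tree): any symmetric positive-definite matrix
C <- matrix(c(1.0, 0.5, 0.2,
              0.5, 1.0, 0.2,
              0.2, 0.2, 1.0), nrow = 3)

L <- chol(C)    # the one expensive step, done a single time
nsim <- 100

# Every column of X is one correlated simulation at the tips:
# cov(t(L) %*% z) = t(L) %*% L = C for iid standard normal z
X <- t(L) %*% matrix(rnorm(nrow(C) * nsim), nrow(C), nsim)
dim(X)  # 3 100
```

Once L is in hand, each additional simulation costs only a cheap matrix multiplication, which is why the per-simulation cost of this approach is nearly flat.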

It is not difficult to see where this is going: at around 400 simulations sim.char() and fastBM() take the same amount of time, and for any more than that sim.char() becomes faster.

Today, I decided to add the capacity to run multiple simulations - without looping over the tree many times - into fastBM(). The new version (tagged v0.4) is available on my R-phylogenetics page (direct link to code here). This was really pretty easy to do. The only tricky parts involved evaluating boundary violations across all the layers of the simulated data array, and then reflecting any node values that fell outside the bounds. To do this, I first created an internal function reflect():
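I won't reproduce the function verbatim here, but the bouncing logic it implements looks something like this sketch (my own reconstruction, not the actual internals):

```r
# Fold a simulated value back inside [lower, upper], bouncing off each
# boundary by the amount the value would otherwise have exceeded it
reflect <- function(x, lower, upper) {
  while (x < lower || x > upper) {
    if (x < lower) x <- lower + (lower - x)
    if (x > upper) x <- upper - (x - upper)
  }
  x
}

reflect(1.3, 0, 1)   # 0.7
reflect(-2.5, 0, 1)  # 0.5 (bounces off both boundaries in turn)
```

The while() loop matters because a large excursion can overshoot one boundary and then the other before it finally lands inside the bounds.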

Saturday, March 12, 2011

I'm sad to report on this blog the news that Walter Fitch, pioneer in the field of molecular phylogenetics, long time professor and former chair of biology at UC Irvine, and member of the National Academy of Sciences, passed away yesterday morning. I learned of his passing today via the Evoldir email listserve. The original email message sent from the current chair of biology at UC Irvine can be read here. To learn more about the contributions of Fitch to molecular phylogenetics refer to one of many references online (e.g., 1, 2) or, especially, read Joe Felsenstein's book "Inferring Phylogenies." Among Walter Fitch's many lasting contributions to phylogenetic systematics is a fast algorithm for computing the parsimony score of a site pattern on a tree (Fitch 1971) - now known as the "Fitch parsimony algorithm."

Tuesday, March 8, 2011

I just made a few significant updates to my function to create the phylomorphospace() plots of Sidlauskas (2008). This new version can be downloaded from my R phylogenetics page (direct link to code here).

First, I removed the requirement that the user provide a matrix of the estimated states at internal nodes of the tree. Now, if this is left out the function will simply use the {ape} function ace() to estimate these values using maximum likelihood.
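For instance, estimating those ancestral states with ace() directly looks like this (random tree and data, purely for illustration):

```r
library(ape)

set.seed(3)
tree <- rtree(8)
x <- rnorm(8)
names(x) <- tree$tip.label

# ML estimates of the ancestral states at the internal nodes
anc <- ace(x, tree, type = "continuous", method = "ML")$ace
length(anc)  # one estimate per internal node
```

The $ace component is a named vector with one entry per internal node, which is exactly the matrix-of-node-states input that the old version of phylomorphospace() required the user to supply.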

Second, I liked the appearance of Luke's ecomorph plot a lot, so I decided to make a slightly modified version of this design the default for the function. This means that internal nodes are small filled circles, and terminal nodes are larger filled circles (although I did not use the same size disparity as in Luke's plot - see below). In addition, plotting tip labels is now optional and can be turned off by setting label=FALSE.

Third and finally, I added some basic error checking to make the function less buggy.

Monday, March 7, 2011

Sam Price has created a wiki tutorial for my brownie.lite() function that conducts the rate variation analysis described in O'Meara et al. (2006) and implemented in Brian's C++ program Brownie; and for my function to read stochastic character map formatted trees, read.simmap(). The tutorial is far more detailed than anything that I could have prepared and comes with an empirical example to test. The link to the tutorial is here. The R functions brownie.lite() and read.simmap() are available from my R phylogenetics page (direct links to code are here and here, respectively).

I just posted up a function to create a star phylogeny (that is, a completely unresolved multifurcating tree) on my R phylogenetics page (code here). Very simple, and it seems likely that this can already be done with an existing R function that I am not aware of.

Basically, recalling the structure of a "phylo" object in memory, we just need to create an edge matrix containing n+1 (for n species) in the left column and 1:n in the right column:

edge<-matrix(NA,n,2)
edge[,1]<-n+1
edge[,2]<-1:n

Then we can optionally assign branch lengths, if they have been provided by the user:

if(!is.null(branch.lengths)) edge.length<-branch.lengths

Finally we combine these elements into a list and assign the class "phylo":
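Putting the pieces together, the whole construction might look something like this sketch (star() here is just an illustrative name, not the posted function):

```r
# A minimal star-phylogeny builder: n tips all attached to a single root
star <- function(species, branch.lengths = NULL) {
  n <- length(species)
  edge <- matrix(NA, n, 2)
  edge[, 1] <- n + 1                 # the root is node n+1...
  edge[, 2] <- 1:n                   # ...and every tip is its direct child
  tree <- list(edge = edge, tip.label = species, Nnode = 1)
  if (!is.null(branch.lengths)) tree$edge.length <- branch.lengths
  class(tree) <- "phylo"
  tree
}

tr <- star(c("a", "b", "c", "d"), branch.lengths = rep(1, 4))
tr$Nnode  # 1 - a completely unresolved multifurcation
```

Because the result is an ordinary "phylo" object, it can be plotted or manipulated with {ape} like any other tree.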

Sunday, March 6, 2011

David Bapst was today interested in computing the number of tips (leaves) descended from each internal node in the tree. These can be obtained (for each node above the root) for the {ape} "phylo" object phy using the following command:

> require(geiger) # load the geiger package
> desc<-...
> Ndesc<-...

I thought it might be faster to count the number of descendants as we read our tree into memory. This is possible because by the time our Newick tree parser reaches a ")" in the string, all of the descendant edges and tips already exist in memory. If we are diligent about assigning counts to all internal nodes, to get the count for the current node all we have to do is add the counts for its two (or more) daughters.

To allow David to see if this is indeed faster, I have added this to my existing (and otherwise not particularly useful) function read.newick(). Now this function creates a "phylo" object with the additional vector $Ndesc containing the number of tips descended from each node above the root. The order of $Ndesc is the row order of $edge[,2]. I have posted this to my R phylogenetics page (direct link to code here).
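For comparison, the same counts can also be computed after the fact from the edge matrix alone - a base-R sketch, assuming {ape}-style node numbering (in which an internal child always has a higher number than its parent):

```r
# Count the tips descended from every internal node of a "phylo"-style
# edge matrix, accumulating counts from the tips toward the root
count.desc <- function(edge, n.tips) {
  counts <- numeric(max(edge))
  counts[1:n.tips] <- 1                     # every tip counts itself
  # visit edges from the deepest parents upward, so each internal
  # child's count is complete before it is added to its parent
  for (i in order(edge[, 1], decreasing = TRUE))
    counts[edge[i, 1]] <- counts[edge[i, 1]] + counts[edge[i, 2]]
  counts[(n.tips + 1):max(edge)]            # internal nodes only
}

# A 4-tip balanced tree, ((1,2)6,(3,4)7)5, as an edge matrix
edge <- matrix(c(5, 6,  6, 1,  6, 2,  5, 7,  7, 3,  7, 4),
               ncol = 2, byrow = TRUE)
count.desc(edge, 4)  # 4 2 2  (nodes 5, 6, 7)
```

This is the same add-up-the-daughters logic used during parsing, just run over the finished edge matrix instead of as the tree is read in.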

Doing this was easy. I just added a simple while() loop which runs after a descendant node phenotype is calculated. The while() condition evaluates whether the phenotype is on the interval given by bounds, and if not, then it returns the phenotype to within the bounds by bouncing it back by the amount that it would have otherwise exceeded the boundary.

Friday, March 4, 2011

Using the {ape} function drop.tip() we can easily excise a single taxon or a list of taxa from our "phylo" tree object in R. However, it is not immediately obvious how to prune the tree to include, rather than exclude, a specific list of tips. Trina Roberts (now at NESCent) shared a trick to do this with me some time ago, and I thought I'd pass it along to the readers of this blog.
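One straightforward way to do it is to hand drop.tip() the complement of the tips you want to keep (random tree here purely for illustration):

```r
library(ape)

set.seed(2)
tree <- rtree(10)            # tips labeled "t1" ... "t10"
keep <- c("t1", "t2", "t3")  # the tips we want to retain

# Drop everything that is NOT in our keep list
pruned <- drop.tip(tree, setdiff(tree$tip.label, keep))
sort(pruned$tip.label)  # "t1" "t2" "t3"
```

setdiff() on the tip labels turns an "include this list" problem into the "exclude this list" problem that drop.tip() already solves.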

About this blog

This web-log chronicles the development of new tools for phylogenetic analyses in the phytools R package. Unless you are reading a very recent page of the blog, I recommend that you install the latest CRAN version of phytools (or the latest beta release) before attempting to replicate any of the analyses on this site. That is because the linked functions may be archived, and have very likely been replaced by newer versions.