
Yesterday my paper [cite]10.1111/j.1558-5646.2012.01574.x[/cite] appeared in early view in Evolution, so I’d like to take this chance to share the back-story and highlight my own view on some of our findings and the associated package on CRAN. (As the open access copy doesn’t appear on PubMed for a while, you can access my author’s copy here. The package has just been submitted to CRAN; meanwhile, the code is always on GitHub.)

I didn’t set out to write this paper. I set out to write a very different paper, introducing a new phylogenetic method for continuous traits that estimates changes in evolutionary constraint. This adds even more parameters than are already present in rich models such as the multi-peak OU process, and I wanted to know if it could be justified: was there really enough information to play the game we already had, before I went and made the situation even worse? Trying to find something rigorous enough to hang my hat on, I ended up writing this paper.

The short of it

There are essentially three conclusions I draw from the paper.

AIC is not a reliable way to select models.

Certain parameters, such as \(\lambda\), a measure of “phylogenetic signal,” [cite]10.1038/44766[/cite] are going to be really hard to estimate.

BUT as long as we simulate extensively to test model choice and parameter uncertainty, we won’t be misled by either of these. So it’s okay to drink the koolaid [cite]10.1086/660020[/cite], but drink responsibly.

A few reflections

I really have two problems with AIC and other information criteria when it comes to phylogenetic methods. One is that it’s too easy to simulate data from one model and have the information criterion choose a ridiculously over-parameterized model instead. In one example, the wrong model has a \(\Delta\)AIC of 10 points over the correct model.

But a more basic problem is that it’s just not designed for hypothesis testing: it doesn’t care how much data you have, and it doesn’t give a notion of significance. If we’re ascribing biological meaning to different models as different hypotheses, we want a measure of uncertainty.
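To make the point about sample size concrete, here is a minimal sketch in Python (not the R code from the paper or the package) of the standard formula AIC = 2k - 2 ln L. The function names are my own, and the log-likelihood values in the last line are hypothetical numbers chosen only for illustration.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(loglik_simple, k_simple, loglik_complex, k_complex):
    """AIC of the simple model minus AIC of the complex model.

    A positive value means AIC prefers the complex model.
    """
    return aic(loglik_simple, k_simple) - aic(loglik_complex, k_complex)


# Hypothetical log-likelihoods, for illustration only.
# One extra parameter "pays for itself" as soon as it adds more than one
# log-likelihood unit, whether the tree has 10 tips or 10,000.
print(delta_aic(-100.0, 2, -98.0, 3))
```

The number of observations appears nowhere in the formula, and the resulting difference comes with no statement of how often a gap that large would arise by chance.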

When estimating parameters that scale branch length, I think we must be cautious because these are really data-hungry and don’t work well on small trees. Check out how few of these estimates of lambda on 100 replicate datasets land near the correct value, shown by the vertical line:

The package commands are explained in more detail in the package vignette, but the idea is simple. Running the pmc comparison between two models (for the model-choice step) looks like this:

The substantial overlap in the likelihood ratios after simulating under either model indicates that we cannot choose between BM and lambda in this case. I’ll leave the paper to explain this approach in more detail, but it’s just simulation and refitting.

You could just bootstrap the likelihoods or, for nested models, look at the parameter distributions, but you get the maximum statistical power from the ratio (so says the Neyman-Pearson lemma).
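The simulate-and-refit idea can be sketched in a few lines of Python. This is a toy stand-in, not the pmc package's API and not a phylogenetic model: as an assumption for illustration, the "simple" model is a normal distribution with mean fixed at zero and the "complex" model lets the mean vary. All function names here are hypothetical.

```python
import math
import random


def fit_simple(x):
    """Simple model: mean fixed at 0, variance estimated by maximum likelihood."""
    return {"mu": 0.0, "var": sum(v * v for v in x) / len(x)}


def fit_complex(x):
    """Complex model: mean and variance both estimated by maximum likelihood."""
    mu = sum(x) / len(x)
    return {"mu": mu, "var": sum((v - mu) ** 2 for v in x) / len(x)}


def loglik(x, p):
    """Normal log-likelihood of data x under parameters p."""
    ss = sum((v - p["mu"]) ** 2 for v in x)
    return -0.5 * len(x) * math.log(2 * math.pi * p["var"]) - ss / (2 * p["var"])


def lr_stat(x):
    """Likelihood ratio statistic after refitting both models to x."""
    return 2.0 * (loglik(x, fit_complex(x)) - loglik(x, fit_simple(x)))


def simulate(p, n, rng):
    """Draw n observations from a fitted model."""
    sd = math.sqrt(p["var"])
    return [rng.gauss(p["mu"], sd) for _ in range(n)]


def monte_carlo_lr(x, nboot=200, seed=1):
    """Simulate under each fitted model, refit both, and collect the ratios."""
    rng = random.Random(seed)
    n = len(x)
    p_simple, p_complex = fit_simple(x), fit_complex(x)
    observed = lr_stat(x)
    null_dist = [lr_stat(simulate(p_simple, n, rng)) for _ in range(nboot)]
    test_dist = [lr_stat(simulate(p_complex, n, rng)) for _ in range(nboot)]
    # Share of simple-model simulations with a ratio at least as large as observed.
    p_value = sum(d >= observed for d in null_dist) / nboot
    # Share of complex-model simulations beyond the 95% point of the null ratios.
    threshold = sorted(null_dist)[min(nboot - 1, int(0.95 * nboot))]
    power = sum(d > threshold for d in test_dist) / nboot
    return {"observed": observed, "p_value": p_value, "power": power,
            "null_dist": null_dist, "test_dist": test_dist}


if __name__ == "__main__":
    data = simulate({"mu": 0.3, "var": 1.0}, 25, random.Random(7))
    result = monte_carlo_lr(data)
    print(result["observed"], result["p_value"], result["power"])
```

Note that the simulation parameters come from fitting each model to the same data, rather than being chosen by hand. When the two collections of ratios overlap heavily, the data cannot distinguish the models.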

A technical note: mix and match formats

Many users don’t like going between ouch format and ape/phylo formats. The pmc package doesn’t care what you use, so feel free to mix and match. In case the conversion tools are useful, I’ve provided functions to move your data and trees back and forth between those formats too. See format_data() for data frames and convert() to toggle between tree formats.

Reproducible Research

The package is designed to make things easier. It comes with a vignette (written in Sweave) showing just what commands to run to replicate the results from the manuscript.

This entire project has been documented in my open lab notebook from its inception. Posts prior to October 2010 can be found on my OWW notebook, the rest in my current phylogenetics notebook (here on wordpress). Of course this project is interwoven with many notes on related and more recent work.

Additional methods and feedback

As we discuss in the paper, simulation and randomization-based methods have an established history in this field [cite]10.1371/journal.pbio.0040373[/cite], [cite]10.1111/j.1558-5646.2010.01025.x[/cite]. These are promising things to do, and we should do them more often, but I might make a few comments on these approaches.

We are not getting a real power test when we simulate data from different models whose parameters have been arbitrarily assigned. The parameters should instead be estimated from the same data, lest we overestimate the power. Of course we need to have a likelihood function to be able to estimate those parameters, which is not always available.

It is also common and very useful to assign some summary statistic whose value is expected to be very different under different models of evolution, and look at its distribution under simulation. This is certainly valid and has ties to cutting-edge approaches in ABC methods, but will be less statistically powerful than if we can calculate the likelihoods of the models directly and compare those, as we do here.

My treebase package is now up on the CRAN repository. (Source code is up; the binaries should appear soon.) Here are a few introductory examples to illustrate some of the functionality of the package. Thanks in part to new data deposition requirements at journals such as Evolution, Am Nat, and Sys Bio, and data management plan requirements from NSF, I hope the package will become increasingly useful for teaching by replicating results and for meta-analyses that can be automatically updated as the repository grows. Please contact me with any bugs or requests (or post in the issue tracker).

Basic tree and metadata queries

Downloading trees by different queries: by author, taxa, & study. More options are described in the help file.

How do the weeklies do on submissions to Treebase? We construct this in a way that gives us back the indices of the matches, so we can then grab those trees directly. Run the scripts yourself to see if they’ve changed!

Replicating results

A nice paper by Derryberry et al. appeared in Evolution recently on diversification in ovenbirds and woodcreepers [cite]10.1111/j.1558-5646.2011.01374.x[/cite]. The article mentions that the tree is on Treebase, so let’s see if we can replicate their diversification rate analysis:

Let’s grab the trees by that author and make sure we have the right one:

Yup, their result agrees with our analysis. Using the extensive toolset for diversification rates in R, we could now rather easily check if these results hold up in newer methods such as TreePar, etc.

Meta-Analysis

Of course one of the more interesting challenges of having an automated interface is the ability to perform meta-analyses across the set of available phylogenies in treebase. As a simple proof-of-principle, let’s check all the phylogenies in treebase to see if they fit a birth-death model or yule model better.

We’ll create two simple functions to help with this analysis. While these can be provided by the treebase package, I’ve included them here to illustrate that the real flexibility comes from being able to create custom functions. (These are primarily illustrative; I hope users and developers will create their own. In a proper analysis we would want a few additional checks.)

This week’s blog post (also posted on my blog) is a departure from fish, but is about a recent paper of mine that uses phylogenetic comparative methods to test hypotheses for body size and scale evolution among Sceloporus lizards.

This summer the lab has a reading group on phylogenetic comparative methods, where we are reading through some of the classic papers discussing the various methods. This past week we focused our attention on phylogenetic generalized least squares, or PGLS. This method was introduced by Grafen in 1989, and although it wasn’t initially a common phylogenetic comparative approach, it has seen more use in recent years. For those not familiar with this method, it uses a regression approach to account for phylogenetic relationships. In this method the phylogeny is converted to a variance-covariance matrix, where the diagonals in the matrix represent the “summed length of the path from the root of the tree to the species node in question” (Grafen 1992). That is, how far each tip is from the root; in an ultrametric tree the diagonals in the variance-covariance matrix will all be the same. The off-diagonals represent the “shared path length in the paths from the root to the two species” (Grafen 1992). In other words, the off-diagonals are the distance from the root to the last common ancestor of the two species.

Similar to independent contrasts, this method assumes Brownian motion evolution. However, PGLS assumes that the residuals are undergoing Brownian motion evolution, whereas independent contrasts assumes that the characters themselves are. The other main difference is that PGLS uses the raw data instead of computing independent contrasts. In short, the PGLS approach is similar to a weighted regression, where the weighting matrix is the variance-covariance matrix based on the phylogeny of the group; given the same phylogeny, it will produce the same results as independent contrasts.
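For readers who like to see the matrix built, here is a minimal Python sketch of the variance-covariance construction described above. The nested-tuple tree representation and the function name are assumptions made for this illustration; they do not come from any R package or from the paper.

```python
def vcv_from_tree(tree):
    """Phylogenetic variance-covariance matrix from a rooted tree.

    A tip is (name, branch_length); an internal node is
    ([child, child, ...], branch_length). Returns (tip_names, matrix).
    """
    cov = {}

    def visit(node, depth_above):
        label, length = node
        depth = depth_above + length  # distance from the root to this node
        if isinstance(label, str):
            cov[(label, label)] = depth  # diagonal: root-to-tip path length
            return [label]
        groups = [visit(child, depth) for child in label]
        # Tips in different child subtrees share the path from the root
        # down to this node, which is their last common ancestor.
        for i, group_a in enumerate(groups):
            for group_b in groups[i + 1:]:
                for a in group_a:
                    for b in group_b:
                        cov[(a, b)] = cov[(b, a)] = depth
        return [tip for group in groups for tip in group]

    tips = visit(tree, 0.0)
    return tips, [[cov[(a, b)] for b in tips] for a in tips]


# The ultrametric tree ((A:1,B:1):2,C:3), with a zero-length root branch.
example = ([([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)], 0.0)
tips, matrix = vcv_from_tree(example)
for name, row in zip(tips, matrix):
    print(name, row)
```

In this example every diagonal entry is 3 because the tree is ultrametric, A and B share 2 units of history, and C shares none with either of them.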

So what does this have to do with size, scales and Sceloporus? Well, in a recent study we used a PGLS approach to examine patterns of body size and scale evolution in relation to latitude and climate among Sceloporus lizards. Sceloporus (fence and spiny lizards) are a group of more than 90 species of lizards found from Central America up to Washington State in the U.S. Throughout their range they experience a diversity of habitats, from deserts to tropical forests to temperate forests, and have been used in many studies examining physiological ecology, life history evolution and thermal biology. In our study we used Sceloporus to test two hypotheses for the evolution of morphology. 1) Lizards exhibit an inverse Bergmann’s Rule, with larger individuals found at lower latitudes and/or warmer climates. 2) Lizards from hotter environments will exhibit fewer and thus larger scales to aid in heat dissipation, whereas lizards from colder environments will exhibit more/smaller scales to aid in heat retention. There have been conflicting results for these hypotheses in the literature, and latitude has often been used as a proxy for climate. However, one of the unique things about our study is the incorporation of multivariate techniques to describe habitat. We use latitude as a predictor as well as climatic variables (temperatures, precipitation and a composite aridity index Q), and also use principal component analysis to characterize habitat. We therefore can test for specific climate predictors of these traits without assuming that higher latitudes necessarily equate to colder environments.

To test our hypotheses we gathered data on 106 species and populations of Sceloporus from the literature and museum specimens. We obtained latitude from the literature and source maps, and climate data from the International Water Management Institute’s World Water and Climate Atlas (http://www.iwmi.cgiar.org/WAtlas/Default.aspx). Using a recent phylogenetic hypothesis for Sceloporus (Wiens et al. 2010) we examined the relationship between maximum snout-vent length and latitude and 5 climatic predictors under three models of evolution: no phylogenetic relationships (OLS), Brownian motion (PGLS) and a model in which the branch lengths are transformed in an Ornstein-Uhlenbeck process (RegOU). To examine hypothesis 2 we used a multiple regression with dorsal scale rows as the dependent variable, body size as a covariate and latitude or one of the 5 climatic predictors as independent variables. We also compared results with principal components 1-3 as predictors of dorsal scale counts.

So what did we find? First, we found that phylogenetic models (PGLS or RegOU) always fit better than the non-phylogenetic model (OLS) based on likelihood ratio tests and AICc scores. We also found that as latitude increases, mean and minimum temperatures decrease, as do precipitation and aridity, but maximum temperature tends to increase. Thus, lizards from this group found at higher latitudes may be experiencing more desert-like environments.

For hypothesis 1, we found support for the inverse of Bergmann’s Rule when viewed from a climatic perspective; larger lizards were found in areas with higher maximum temperatures, but not at lower latitudes. We also found that larger lizards were found in more arid environments.

Photo copyright Mark Chappell

Our results for hypothesis 2 were a little more complex. We did not find support for the first part of hypothesis 2: lizards with fewer scales were not found in hotter environments. We did find support for the second part of hypothesis 2: lizards with more scales are found in environments with lower minimum temperatures. We also found a positive effect of latitude, and a significant negative effect of aridity (with lizards with more scales inhabiting more arid environments). Results with principal components were also consistent, with PC1 (a latitude/temperature axis) having a significant negative effect on scale count, and PC2 (a maximum temperature/precipitation axis) having a significant positive effect.

Our results suggest several things. First, latitude alone may not be an accurate description of the environment organisms face, particularly at the finer spatial scales over which an individual species may exist. Second, we found support for the inverse of Bergmann’s Rule at the inter-specific level, which has also been found to be a consistent trend intra-specifically in some ectotherms (see Ashton and Feldman 2003). Finally, our analyses suggest that both temperature and precipitation (hence aridity) are important to the evolution of scale counts in this group. These findings also suggest that scale size may be important for other physiological processes, such as evaporative water loss (lizards in more arid environments may have more/smaller scales to reduce rates of evaporation through the skin, as has been suggested by Soulé and Kerfoot 1972). Examining the relationship of morphological traits that may function in physiological processes may provide insight into how these organisms may respond to global climate change.

While high-speed fish feeding videos may be the signature of the lab, dig a bit deeper and you’ll find a wealth of comparative phylogenetic methods sneaking in. It’s a natural union: expert functional morphology is the key to good comparative methods, just as phylogenies hold the key to untangling the evolutionary origins of that morphology. The lab’s own former graduate, Brian O’Meara, made a revolutionary step forward in the land of phylogenetic methods when he unveiled Brownie in 2006, allowing researchers to identify major shifts in trait diversification rates across the tree. This work spurred not only a flood of empirical applications but also methodological innovations, such as Liam’s brownie-lite, and today’s focus: Jon Eastman et al.’s auteur package.

Auteur, short for “Accommodating uncertainty in trait evolution using R,” is the grown-up Bayesian RJMCMC version of that original idea in Brownie. Diversification rates can change along the phylogenetic tree — only this time, you don’t have to specify where those changes could have occurred, or how many there may have been — auteur simply tries them all.

If you want the details, definitely go read the paper; it’s all there, clear and thorough. Meanwhile, what we really want to do is take it out for a test drive.

The package isn’t up on CRAN yet, so you can grab the development version from Jon’s github page, or click here. Put that package in a working directory and fire up R in that directory. Let’s go for a spin.

Great, the package installed and loaded successfully. Looks like Jon’s put all 73 functions into the NAMESPACE, but it’s not hard to guess which one looks like the right one to start with: rjmcmc.bm. Yeah, that looks good. It has a nice help file, with (praise the fish) example code. Looks like we’re gonna run a simulation, where we know the answer, and see how it does:

The data is going in as “phy” and “dat”, just as expected. We won’t worry about the optional parameters that follow for the moment. Note that because we use lapply to run multiple chains, it would be super easy to run this on multiple processors.

Note that Jon’s creating a bunch of directories to store parameters, etc. This can be important for MCMC methods where chains get too cumbersome to handle in memory. Enough technical rambling, let’s merge and load those files in now, and plot what we got:

Thanks, Jon and the rest of the Harmon Lab, for a fantastic package. This is really just the tip of the iceberg, but should help get you started. See the paper for a good example of the posterior analyses requisite after running any kind of MCMC, or stay tuned for a later post.

Pupfish are indeed the only group of fish named after puppy dogs for their playful behavior. They’re best known for their ability to survive in extreme environments, like desert hot springs. However, for my dissertation research, I have focused on understanding their evolution and diversification.

Pupfish show a remarkable pattern of adaptive diversification: in only two small lake systems throughout their entire range, pupfishes are evolving 50 to 130 times faster than all other pupfish species. Truly ‘explosive evolution’ – the fastest morphological diversification rates measured so far in fishes, and one of the fastest rates documented among all organisms. Further, other pupfish groups of similar young age do not show such extreme rates.

Figure 3 in paper. The pupfish heat map. Colors indicate the rate of evolution for 16 traits relative to other pupfishes in a: Lake Chichancanab pupfishes and b: San Salvador Island pupfishes.

What is going on here? The short answer is the evolution of novel ecological niches. Cyprinodon pupfishes occur throughout the Caribbean and along the Atlantic coast from Massachusetts to Venezuela and as far inland as isolated springs in California and Mexico. Throughout their entire range, pupfishes are ecological generalists: they eat mostly algae, decaying vegetation, and whatever insects or crustaceans they can catch. Yumm! Although different species can often be distinguished by differences in male coloration, or subtle differences in body or fin shape, pupfish species on the whole are anatomically very similar, particularly in jaw shape. Further, multiple pupfish species never coexist in the same habitat.

Except in two places. These are the only two places throughout their entire range where multiple pupfish species coexist and specialize on entirely new resources. On the tiny island of San Salvador in the Bahamas (only 11 miles long!), three pupfish species coexist in the inland salty lakes. Incredibly, one of these has evolved to feed almost entirely on the scales of other pupfishes! While scale-eating has evolved at least 14 times in other groups of fishes, within the 1,500 species of atherinimorphs, to which pupfish belong, this undescribed pupfish species is the only known scale-eater! While previous researchers speculated that it may eat scales or other fish, I was stunned to find only scales and no whole fish when I began examining the guts of this species (n = 60). This behavior is easy to watch in the field – the scale-eater stalks any nearby pupfish, quickly orienting perpendicular to its prey, striking and biting off scales, then stealthily moving on to the next target, just like a pup-tiger.

Cyprinodon sp. ‘scale-eater’: Males in full breeding coloration photographed in their natural habitat on San Salvador Island.

There is a second ecologically specialized species in these San Salvador lakes. This species has shortened jaws for crushing its diet of snails and ostracods. Moreover, it has a nose! This is one of the few fish species that tucks its jaw underneath protruding nasal tissue surrounding protruding bones (maxilla and nasal) on the face of the fish.

Cyprinodon sp. ‘nose’: What looks like an upper lip in this photo is actually the fish’s nose protruding outward above the fish’s tucked upper jaw.

The function of this peculiar fish nose is so far unknown (or any fish nose, for that matter). I do have a couple guesses: perhaps it helps stabilize the fish’s jaw while crushing hard shells. Or, it may help with species recognition, as males gently nudge females when trying to entice them to spawn.

The second remarkable place for pupfish diversification is Lake Chichancanab, Mexico, a large, brackish lake in the center of the Yucatan peninsula (Chichancanab is Mayan for “little lake” or “little girl lake”, whichever you prefer). Chichancanab contained at least five coexisting species of pupfishes, including four ecological specialists. One of these, Cyprinodon maya, is the largest pupfish species known and also the only pupfish to eat other fish. A second species, Cyprinodon simus, is the second smallest pupfish species, and was observed feeding on zooplankton in large shoals in open water. Piscivory and zooplanktivory are unique pupfish niches found only in Lake Chichancanab.

Terribly, these descriptions of Chichancanab species are in past tense. In the early 1990s, invasive African tilapia (probably Oreochromis mossambicus) were introduced to Lake Chichancanab. In addition, the native Mexican tetra, Astyanax sp., was introduced. All specialized pupfish species promptly declined in abundance and frequency over the next 10 years. I visited the lake in 2009 and, after surveying thousands and thousands of fish from several different basins of the large lake, I observed zero Cyprinodon maya and only one putative hybrid Cyprinodon simus. These specialized species are now functionally extinct in the lake. Thankfully, they have survived in home aquaria and backyard fish ponds in the US thanks to the efforts of dedicated aquarium hobbyists in the American Killifish Association. I am now maintaining these extinct-in-the-wild species in the lab as well.

Cleared and stained specimen of Cyprinodon simus (bottom), the only zooplanktivore pupfish. Note the dramatic difference in the thickness of their lower and upper jaws. These specimens were collected in the wild before invasive species were introduced and generously loaned for this research by the University of Michigan Museum of Zoology.

Thus, in only two remarkable lake systems throughout their entire range, pupfish are speciating and adapting to novel trophic resources, like scales, snails, other fish, and plankton. These two groups of pupfishes also happen to be showing the fastest rates of evolution among all pupfishes. Probably not a coincidence: invasion of these novel ecological niches is driving incredible rates of morphological change, particularly in jaw shape.

It is particularly remarkable to see this pattern within pupfish, a group of fishes that has repeatedly been isolated in new, extreme environments and also probably has repeatedly adapted to these new environments. Several other groups of pupfishes were also evolving fast in my analysis – around 5 – 10 times faster than average, such as the groups containing the Devil’s Hole pupfish, a tiny species restricted to the smallest habitat of any known organism, a tiny cave shaft in Death Valley, shown here:

Devil’s Hole, Death Valley National Park, Nevada. This vertical shaft of water stays a balmy 94 degrees F year-round and divers have not yet found the bottom (at least 400 feet deep). Cyprinodon diabolis is restricted to eating scarce algae off a tiny rock shelf near the surface and its population size has fluctuated between 37 and around 400 fish.

Cyprinodon pachycephalus also belongs to a quickly evolving group. This is the pupfish species that lives and breeds in the hottest waters of any known vertebrate, 114 degrees Fahrenheit year-round!

These are incredibly extreme environments that would be expected to drive rapid rates of morphological evolution. Indeed, these species are changing quickly, but the Devil’s Hole pupfish and C. pachycephalus are both generalist detritivores, just like their relatives.

However, to really see explosive evolution appears to require that pupfish start dabbling in entirely new ways of life, to go where no pupfish has ever gone before. (This wouldn’t be blogging without Star Trek!)

But, I haven’t yet fully answered the question I originally posed. Why have novel trophic niches evolved in these two places and nowhere else across their entire range? Certainly, the size of these two lakes and lack of competitors (except native mosquitofishes) plays a role. But, there may be many similar lakes with similar fish communities throughout the Caribbean. What is going on here? This remains an outstanding research question, one I am actively pursuing.

Some weeks ago, I discussed a large phylogenetic study that separated sticklebacks from the seahorses and pipefishes – today I’m going to discuss a phylogenetics paper that zooms in on the relationships between different sticklebacks (and their very closest relatives).

Many of the same scientists from the earlier stickleback phylogeny were involved in this paper, though there is one new face, Yale’s Tom Near, a longtime Wainwright Lab collaborator and former CPB Postdoc.

The group sequenced the mitochondrial genomes of all nine sticklebacks and stickleback relatives, and they also sequenced 11 nuclear genes. They used both maximum-likelihood and Bayesian methods to estimate a phylogenetic tree of sticklebacks.

Here’s what they found:

The mitogenome and nuclear gene data dovetail beautifully, as do the maximum-likelihood and Bayesian methods for each dataset, so there’s every reason to feel confident about this arrangement of species.

There are a number of interesting results here: Aulorhynchidae, the family that includes the tubesnout, turns out to be paraphyletic – perhaps the Aulorhynchidae should be folded into the family Gasterosteidae and considered proper sticklebacks?

The thing I find the most interesting is the phylogenetic position of Spinachia spinachia, an elongated stickleback similar in appearance to the tubesnout. The paper suggests that perhaps Spinachia’s elongate form is the result of convergent evolution.

It’s also worth thinking about the geographical distribution of sticklebacks in the context of this phylogeny: Spinachia and Apeltes, two Atlantic Ocean-only species, are grouped together, while the most basal stickleback relatives are all found in the North Pacific.

There are some interesting future directions possible here as well. One of Tom’s specialties is using fossil data to calibrate phylogenies, so it’s likely we’ll see a phylogeny in the near future that gives us an idea of the timescales of major stickleback divergence events.

In past entries, I’ve made reference to the tubesnout (Aulorhynchus flavidus), an odd little creature that’s closely related to the sticklebacks.

Tubesnouts are currently part of the family Aulorhynchidae, sister group to the Gasterosteidae (sticklebacks). Unlike the stickleback-syngnathiform relationship, the stickleback-tubesnout relationship is supported by molecular and morphological data, so it’s unlikely to change any time soon.

At a quick glance, a tubesnout looks a little like a pipefish, but if you look closer, you’ll see that it actually looks like a stickleback that’s been stretched out. Tubesnouts have the “iconic” stickleback features, though they’re not as obvious: instead of a few big dorsal spines, they have many very small spines, and instead of “armor” (which is actually not that common on most sticklebacks) they just have a small row of lateral line scales. Their pelvic girdle is not as robust as a threespine’s and their pelvic spines are small and lack serrations, though they do have red pelvic fin webbing like a threespine stickleback.

The mating system of the tubesnout bears some similarity to that of other sticklebacks, namely, males glue together vegetation to make small nests. Males also exhibit specific color patterns during the breeding season; the male tubesnouts that I’ve observed have a patch of black next to a patch of white on the head.

The most striking feature of the tubesnout is its elongated body and head. Many teleosts exhibit elongation (Anguilliformes being the most notable), but few have elongation in both the body and the head (though they do exist). Perhaps the most interesting thing about elongation and the tubesnouts is that there is reason to believe that elongation is ancestral in sticklebacks. Spinachia spinachia, the sea stickleback, is elongated – if phylogenetic analysis shows that it is the most basal stickleback species, it is possible that the common ancestor of the sticklebacks was elongated, and that some sticklebacks evolved a more classic fishy shape.

Until recently, sticklebacks were thought to be pretty closely related to seahorses and pipefish. At first glance, it seems reasonable: both groups of fish have bony armor plates, male parental care, and species with elongated bodies and snouts. Many of the species also share a mode of swimming called “labriform” that I’ll be talking about more in a later entry.

Things are rarely that simple when you’re dealing with the incredible diversity of teleost fishes, particularly within the Percomorpha, often referred to as the “bush at the top of the tree of life”. Fish are just too diverse for simple morphology-based relationships – you need genetic data to really see what’s going on, and you need a lot of it, because there are so many groups.

In a paper published early last year, Kawahara et al. used 75 sequenced mitogenomes to generate a phylogeny of the Gasterosteiformes and related species, and… bam, there goes the neighborhood!

Gasterosteiformes (bolded in the figure above) was split into three pieces: seahorses, pipefishes and their relatives ended up as sister to the gurnards, the weird indostomids were sister to the weird synbranchiforms, and finally, the closest relatives of the Gasterosteoidei (sticklebacks) were eelpouts and pholids.

Obviously, there’s still a lot of work to be done with these fishes – nuclear genes need to be sequenced to back up the mitochondrial genome data, and given the number of species in the presumed stickleback sister group, it’s conceivable that there could be a paraphyly issue as well.

Here’s a handy little Perl script I wrote that takes a nexus formatted tree file and breaks down every tree that it finds into the clades that it is composed of (i.e., every node in the starting tree is returned as a new tree). It spits all of these out into a text file, which can then be copied and pasted into a tree block.

Each tree in the original file gets its own text file, named woodchipper_treename.tre, where “treename” is the name of the tree given in the trees block in the input file.
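For readers who want to see the core idea without downloading anything, here is a minimal sketch in Python rather than Perl. It is not the script itself: the function name is hypothetical, it works on a single Newick string rather than a full nexus file, and it assumes taxon labels and comments contain no parentheses. Each matched pair of parentheses is treated as one clade, so tips are not returned on their own, and any support value or branch length written after a clade's closing parenthesis is dropped.

```python
def chip_newick(newick):
    """Return every clade of a Newick tree as its own Newick string.

    Clades are listed in the order their closing parenthesis appears,
    so the full tree comes last.
    """
    text = newick.strip().rstrip(";")
    open_positions = []  # stack of indices of unmatched "("
    clades = []
    for i, ch in enumerate(text):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unbalanced parentheses in tree string")
            start = open_positions.pop()
            # Everything between a matched pair of parentheses is one clade.
            clades.append(text[start:i + 1] + ";")
    if open_positions:
        raise ValueError("unbalanced parentheses in tree string")
    return clades


for clade in chip_newick("((A:1,B:1):2,(C:1,D:1):2);"):
    print(clade)
```

Writing each returned string into a trees block, one per line with its own tree name, would give roughly the kind of output file described above.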

It hasn’t been tested exhaustively yet, so please check the results and contact me if you run into any trouble. No warranty is expressed or implied, your mileage may vary, not valid where prohibited, and all that.

NOTE: If you’re having trouble saving the link above, try right click -> save as instead of clicking on it in the normal way.