Molecular ecology, the flowchart

Towards the end of last semester my department’s evolutionary genetics journal club read Rasmus Nielsen’s terrific 2005 review of tests for recent natural selection in genetic data. Nielsen provides figures illustrating the effects of a recent selective sweep and the shape of the site frequency spectrum that you’ve probably seen reproduced in a score of research seminars, and tables that neatly categorize (1) how different evolutionary processes affect diversity within and among species and (2) a selection of widely used test statistics, the patterns they test for, and the kind of data they need. It’s now more than a dozen years old, but it holds up mighty well as an introduction to what we can learn from genetic samples of a population, in no small part because the tests Nielsen describes in human data have only recently become widely accessible for non-human, non-model organisms.

Re-reading the paper and thinking about how I’d approach it as a new grad student, though, it occurred to me that precisely because he was writing in an era when a handful of species were the focus of most real “genomic” research, Nielsen doesn’t really account for the wide variation of genomic research infrastructure we cope with today. Anyone can collect genome-wide SNP data from a tube-rack-full of tissue samples and a RADseq protocol, but what you can do with that data afterwards depends a lot on the species you sampled, the set of individuals in the sample, what other kinds of data you can connect to the samples, and the existence of resources like a reference genome, a linkage map, or “annotations” that might be anything from the results of functional genetic experiments to algorithmic identification of hypothetical gene sequences.

In that journal club discussion we started talking through some of these possibilities in terms of if-then links: if you have SNPs and they’re placed within larger contigs or a whole reference genome then you can look for runs of homozygosity or extended linkage disequilibrium or islands of differentiation … Or if you have SNPs and they’re from multiple species you can estimate a phylogeny and then do ancestral state reconstruction and if you know codon positions you can estimate dN/dS … And so on. I started sketching on the whiteboard, and the result is below:

[Figure: the molecular ecology flowchart.] (jby)

That’s right, I’ve tried to cram everything molecular ecologists do into a single flowchart. If you’re a graduate student or otherwise new to using genetic marker data, you can use this to trace the kinds of analysis you can do with a given dataset, or work backwards from a kind of analysis to figure out what data and annotation you’ll need to do it. Some notes to help navigate:

Although I’ve tried to be comprehensive, this is fundamentally a list of what I know to do with a bunch of genetic marker data. It almost certainly misses possibilities that I haven’t explored myself or that I don’t encounter regularly in my reading. For instance, there really isn’t a pathway through this that fits metagenomic study designs. Have I missed your favorite analysis? By all means, point it out in the comments — maybe there’ll be a Version 2.0 eventually.

A related biasing factor is that I’m assuming you start the flowchart with, basically, a genotype table — individuals on rows, loci in columns, that sort of thing. In the era of high-throughput genotyping, that’s skipping over a bunch of data-juggling and processing. This chart could be reframed as one branch of a larger flowchart that starts with a hard drive full of raw Illumina reads. Or it could start at the point of sequencing protocol selection, or sample collection.
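To make the "genotype table" starting point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the individual names, the genotype values, and the coding scheme (diploid, biallelic loci, each call stored as the number of copies of the alternate allele, with `None` for a missing call). Real datasets usually arrive in formats like VCF and need a fair bit of processing before they look this tidy.

```python
# Hypothetical genotype table: one row per individual, one column per locus.
# Each entry counts copies of the alternate allele (0, 1, or 2);
# None marks a missing genotype call. Assumes diploid, biallelic loci.
genotypes = {
    "ind1": [0, 1, 2, None],
    "ind2": [1, 1, 2, 0],
    "ind3": [0, 2, 2, 0],
    "ind4": [1, 0, None, 0],
}


def alt_allele_frequencies(table):
    """Return the alternate-allele frequency at each locus, skipping missing calls."""
    n_loci = len(next(iter(table.values())))
    freqs = []
    for locus in range(n_loci):
        calls = [row[locus] for row in table.values() if row[locus] is not None]
        # Each diploid individual with a call contributes two allele copies.
        freqs.append(sum(calls) / (2 * len(calls)) if calls else None)
    return freqs


print(alt_allele_frequencies(genotypes))
```

Per-locus allele frequencies like these are about the simplest summary you can pull from such a table, and many of the downstream analyses build on them.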

In general I’ve tried to avoid naming specific analyses or software packages, or often even test statistics. That’s mostly because there’s so flipping many options to cover, even for something as ubiquitous and apparently simple as pairwise population differentiation. Once you’ve got a general pattern or process you’re interested in, plug it into Google Scholar and expect to spend some time figuring out which of the available statistical and software methods will work best with your data.

Multiple points in the flowchart are deliberately vague. What’s “a LOT” of markers? Well … it really depends. Where’s the point at which, really, everything under discussion only applies if you have SNPs as opposed to microsats or AFLPs? That’ll depend on the path you take and the analysis you choose.

And, to add in response to Twitter-nitpicking: In spite of the question at the top of the chart, I do not in fact recommend just mindlessly doing any analysis you can do with a given dataset. The directionality of the flowchart reflects the way in which data and analyses build on each other, not a mandate to complete a prescribed pathway. You should start a study design by choosing analyses that test a hypothesis you’re interested in testing, and if you pinpoint that analysis on the chart you can trace back to figure out what prior work you need to do to get there. You may find that you’ll also be able to do additional things, and if those analyses would help to test the hypothesis you started with, you should add them to the project plan.
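If it helps to think of the chart as a set of if-then rules, the two ways of reading it (forward from the data you have, or backward from the analysis you want) can be sketched in a few lines of Python. This is only a toy: the `RULES` table below is a hypothetical handful of links taken from the journal club examples above, not a transcription of the full flowchart, and the labels and function names are my own.

```python
# Toy subset of the flowchart as if-then rules: each analysis or
# intermediate result maps to the set of things it requires.
# The rule names and groupings are illustrative assumptions.
RULES = {
    "runs of homozygosity": {"SNPs", "genomic positions"},
    "extended linkage disequilibrium": {"SNPs", "genomic positions"},
    "islands of differentiation": {"SNPs", "genomic positions"},
    "phylogeny": {"SNPs", "multiple species"},
    "ancestral state reconstruction": {"phylogeny"},
    "dN/dS": {"phylogeny", "codon positions"},
}


def reachable(resources):
    """Forward reading: everything you can get to from the resources in hand."""
    have = set(resources)
    changed = True
    while changed:  # keep applying rules until nothing new unlocks
        changed = False
        for result, needs in RULES.items():
            if result not in have and needs <= have:
                have.add(result)
                changed = True
    return have - set(resources)


def requirements(analysis):
    """Backward reading: the raw inputs an analysis ultimately depends on."""
    needed = set()
    for item in RULES.get(analysis, ()):
        if item in RULES:  # an intermediate result, so trace it back further
            needed |= requirements(item)
        else:  # no rule produces it, so treat it as a raw input
            needed.add(item)
    return needed


print(sorted(reachable({"SNPs", "multiple species"})))
print(sorted(requirements("dN/dS")))
```

The backward function is the one that matches the study-design advice: pick the analysis that tests your hypothesis, then see what data and annotation it implies.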

With all that said, I hope this is a reasonably helpful roadmap to the range of possibilities available to anyone contemplating a new molecular ecology project. We often plan projects with a single hypothesis test in mind, but even a smallish modern genetic data set has multiple dimensions to consider. To really learn what evolutionary history has shaped the samples in your lab freezer, you need to be ready to put your data through its paces.