On “triangulation” in genome scans

Guest contributor K.E. Lotterhos is a marine biologist at Wake Forest University who studies evolutionary responses to fishing and climate change. You can find her on Twitter under the handle @dr_k_lo.

A major goal of evolutionary biology is to understand the genetic basis for adaptation to heterogeneous environments. Rapid advances in technology are allowing a large amount of sequence data to be collected (mostly in the form of single nucleotide polymorphisms: SNPs), presenting us with an unprecedented opportunity to address this question in non-model species on a genome scale.

A major challenge for genome scans is to determine whether patterns of genetic variation are due to the effects of selection versus neutral processes such as genetic drift and demography.

In this post, I will introduce the concept of triangulation in genome scans: the process of gathering more than one independent source of evidence for the inference of loci under selection. (Disclaimer: I’m thinking about long-lived, non-model organisms here, where recombinant inbred lines, gene knockouts, or complementation tests would not be feasible.) Although recent reviews have highlighted the importance of integrating multiple types of data, analyses, and experiments to uncover the loci responsible for adaptation (Barrett and Hoekstra 2011, Scheinfeldt and Tishkoff 2013), there are still relatively few studies that have achieved this integration.

How can one plan a study such that genome-scan analyses can be considered independent?

First, let’s consider the two most common types of genome scans for single-nucleotide polymorphisms (SNPs) in non-model organisms:

The FST outlier test: FST is a measure of genetic differentiation among populations. Outliers are loci whose allele frequencies differ more among populations than those of the rest of the genome, and thus may explain adaptive differences among populations.

The Genetic-Environment Association (GEA): A measure of the correlation between allele frequencies (in populations or individuals) and an environmental axis, usually modeled with allele frequencies as the response variable and the environmental variable as a predictor variable.
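
To make the two statistics concrete, here is a minimal Python sketch on made-up population allele frequencies. It assumes equally weighted populations, uses the simple (HT - HS)/HT form of FST, and stands in for a GEA with a plain Pearson correlation. Real analyses use dedicated software that accounts for sample size and population structure. The function names, SNP names, and numbers are all hypothetical.

```python
import math

def fst_per_locus(freqs):
    """FST for one biallelic locus from per-population allele frequencies.

    Assumption: populations are weighted equally and FST = (H_T - H_S) / H_T.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)                            # expected heterozygosity, pooled
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)   # mean within-population value
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

def pearson_r(x, y):
    """Pearson correlation, used here as a bare-bones GEA statistic."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx * syy == 0 else sxy / math.sqrt(sxx * syy)

# Hypothetical data: four populations along a temperature gradient.
temperature = [8.0, 12.0, 16.0, 20.0]
loci = {
    "snp_a": [0.48, 0.52, 0.50, 0.51],  # little differentiation
    "snp_b": [0.10, 0.35, 0.60, 0.90],  # frequency tracks temperature
}

for name, freqs in loci.items():
    print(name, round(fst_per_locus(freqs), 3), round(pearson_r(temperature, freqs), 3))
```

In this toy example snp_b scores high on both statistics, but both numbers come from the same four sets of sampled individuals.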

Let’s say a number of individuals were collected from heterogeneous environments on the landscape. Some SNPs were significant both in an FST outlier analysis and a GEA. Would we consider these SNPs to have two independent sources of evidence?

NO, because the two tests were performed on the same set of individuals. Similar reasoning applies if the same SNP is significant in two GEAs (i.e., significant correlations with two different environmental variables): this is not independent evidence because the same set of individuals was used for both tests. If outlier loci are enriched for functional genes (perhaps based on annotation with a closely related species) or show an excess of non-synonymous substitutions, the strength of the evidence is increased, but this still does not constitute independent evidence.

To constitute independent evidence under triangulation, each statistical analysis should comprise an independent set of individuals. Having an independent set of individuals is important because of sampling error: perhaps—by chance—you sampled more homozygotes than heterozygotes, or—by chance—at one location only a single allele was sampled. These “chance” events occur more often at low sample sizes – and when they do occur, they are likely to affect multiple statistical tests. For this reason, a false-positive FST outlier is also likely to be a false positive in a GEA when both analyses are performed on the same dataset. Triangulation can reduce the set of false positives because it is unlikely the same “chance” events would happen in different sets of individuals.
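
The sampling-error argument can be checked with a small simulation. The sketch below is an illustration under assumptions that are not from any particular study: every locus is neutral, all populations share the same true allele frequency, and the only noise is binomial sampling of alleles. The "GEA" here is a simplified statistic (the absolute regression slope of allele frequency on a made-up environmental gradient), and the function names and parameter values are hypothetical. It flags the top 5% of loci for FST in sample A, then counts how many of those are also in the top 5% for the GEA statistic in the same sample versus in an independent sample B.

```python
import random

def sample_freqs(p, n_pops, n_alleles, rng):
    """Sampled allele frequency in each population (binomial sampling only)."""
    return [sum(rng.random() < p for _ in range(n_alleles)) / n_alleles
            for _ in range(n_pops)]

def fst(freqs):
    """Simple FST = (H_T - H_S) / H_T with equally weighted populations."""
    p_bar = sum(freqs) / len(freqs)
    h_t = 2 * p_bar * (1 - p_bar)
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t

def abs_slope(freqs, env):
    """Absolute least-squares slope of allele frequency on environment."""
    e_bar = sum(env) / len(env)
    f_bar = sum(freqs) / len(freqs)
    sxy = sum((e - e_bar) * (f - f_bar) for e, f in zip(env, freqs))
    sxx = sum((e - e_bar) ** 2 for e in env)
    return abs(sxy / sxx)

def top_set(stats, frac):
    """Indices of the loci with the largest values of a statistic."""
    k = int(len(stats) * frac)
    order = sorted(range(len(stats)), key=lambda i: stats[i], reverse=True)
    return set(order[:k])

def simulate(n_loci=2000, n_pops=6, n_alleles=20, frac=0.05, seed=1):
    rng = random.Random(seed)
    env = list(range(n_pops))          # hypothetical environmental gradient
    fst_a, gea_a, gea_b = [], [], []
    for _ in range(n_loci):
        p = rng.uniform(0.2, 0.8)      # true frequency, identical in all populations
        a = sample_freqs(p, n_pops, n_alleles, rng)  # sample A
        b = sample_freqs(p, n_pops, n_alleles, rng)  # independent sample B
        fst_a.append(fst(a))
        gea_a.append(abs_slope(a, env))
        gea_b.append(abs_slope(b, env))
    outliers = top_set(fst_a, frac)
    same = len(outliers & top_set(gea_a, frac))   # both tests on sample A
    indep = len(outliers & top_set(gea_b, frac))  # GEA on sample B
    return same, indep

same, indep = simulate()
print("shared false positives, same sample:", same)
print("shared false positives, independent samples:", indep)
```

Because no locus is under selection, every flagged locus is a false positive. The count shared within one sample should be clearly larger than the count shared across independent samples.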

A Manhattan plot from a GWAS of flowering time in Medicago truncatula.

Here are a few examples of additional experiments that one can do to achieve triangulation in non-model species:

The Genome-Wide Association Study (GWAS): A measure of the correlation between the phenotype and the allelic state. Usually some form of a mixed model, with phenotype as a response variable and genotype as the predictor variable (and random factors of population and/or relatedness). Typically phenotypes and genotypes have been measured in a common garden environment.

The Within-Generation Selection Experiment: The frequency of alleles is measured before and after selection: if an allele frequency change can be shown to be greater than that expected by genetic drift (i.e., by random sampling of individuals from the population), then this is evidence in favor of selection at that locus (e.g., Pespeni et al. 2013, Gompert et al. 2014).

The Common-Garden Validation Experiment: Individuals with the candidate allele (or alleles) have higher fitness in a common garden environment (e.g., Yoder et al. 2014). Alternatively, gene expression at a candidate gene (or genes) is consistently different among populations in the common garden (e.g., Chen et al. 2012).
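
As one illustration of the within-generation selection experiment, the null expectation of "change due to sampling of individuals alone" can be approximated with a permutation test. This is a minimal sketch, not the method used in the cited studies. It assumes genotypes are coded as counts (0, 1, or 2) of the candidate allele, and that under the null the survivors are a random subset of the starting cohort. The function names and the example data are hypothetical.

```python
import random

def allele_freq(genotypes):
    """Allele frequency from diploid genotypes coded as 0, 1, or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))

def selection_p_value(genotypes, survivor_idx, n_perm=2000, seed=0):
    """Permutation p-value for the allele frequency change among survivors.

    Null model (an assumption): survivors are a random draw, without
    replacement, from the starting cohort.
    """
    rng = random.Random(seed)
    n_surv = len(survivor_idx)
    before = allele_freq(genotypes)
    observed = abs(allele_freq([genotypes[i] for i in survivor_idx]) - before)
    hits = 0
    for _ in range(n_perm):
        draw = rng.sample(genotypes, n_surv)   # random "survivors"
        if abs(allele_freq(draw) - before) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical cohort of 100 individuals; the first 30 survive.
cohort = [2] * 20 + [1] * 30 + [0] * 50
survivors = list(range(30))
print(selection_p_value(cohort, survivors))
```

A small p-value says the observed change is larger than random sampling of survivors would usually produce at that locus. A genome-wide analysis would also need to correct for the number of loci tested.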

The limitation of triangulation is that, even when we have multiple independent surveys or experiments, we don’t always expect them to give the same answer. For example, in humans, different loci on each continent (in Tibet, the Andes, and Ethiopia) have been implicated in adaptation to high-altitude conditions (Alkorta-Aranburu et al. 2012, Bigham et al. 2013). All loci, however, are involved in the same biological pathway (reviewed in Scheinfeldt and Tishkoff 2013).

Take home message:

Triangulation makes a stronger case for candidate loci. In planning a project (and in reviewing papers), it is important to consider whether the sampling design utilizes multiple independent types of data and experiments.

Thanks for the post. I’m only just starting to think about this stuff, but it seems to me that no genome scan of the same set of populations, whether for Fst outliers, GWAS or GEA can be considered independent. So why do you include GWAS in the category of analyses that can achieve triangulation?

K E Lotterhos

If one set of individuals was collected from the landscape, phenotyped, and used for an Fst outlier test, a GWAS, and a GEA, then none of those tests would be considered independent.
If one set of individuals was collected from the landscape and used for an Fst outlier test and a GEA, and a second set of individuals was collected and grown in a common garden for a GWAS, then I would consider the results from those tests to be independent.
Note that there is still non-independence within each dataset: non-independence among linked loci, as well as non-independence due to shared evolutionary history among samples. If population structure is not accurately controlled for, it can create many false positives. Assuming that population structure is accurately controlled for by each test, then the main source of non-independence should be from linkage. Since most of the time population structure is probably not accurately controlled for, triangulation can help us pinpoint those loci that are significant in independent datasets.