A Nice opinion on confronting uncertainty and modeling it for GBS data

Just over a week ago, I had the opportunity to work in Chris Nice's lab at Texas State University. I was accompanied by one of our MS students, Ben, and my colleague, Erik Sotka, to prep libraries for a genomic survey of a certain alga I've a penchant to write about. We were also there to prep a library with Torrance Hanley, a postdoc in the Kimbro and Hughes labs at Northeastern.

Chris walked us through each step as we embarked on our first population genomic projects. We got to talking about analyses and issues I've written about before. In addition, we got to talking about times in which Bayesian approaches, such as STRUCTURE, may not be appropriate (i.e., when there are strong departures from HWE) and possible ways to get around this in the future!

I asked Chris to offer his opinion and write a small piece for TME. Et voilà …

Population genomics is certainly progressing as a field, and there seem to be about as many ways to do things as there are labs doing them. Several methods for library construction have been reviewed recently with some good discussions (Andrews & Luikart, 2014; Puritz et al., 2014; Andrews et al., 2014). One area that has not received as much attention is the downstream analytical detail – what to do once you have your sequence reads.

In reading recent papers, it seems clear that there are differing philosophies arising from the fact that next-generation sequence data, especially from reduced representation, GBS protocols, have forced molecular ecologists to confront notions of genotype uncertainty. Stochasticity arising from library preparation and the sequencing process means there is substantial variation in coverage depth per locus and per individual in GBS data sets. This, in turn, means there can be uncertainty about the genotype for an individual at a particular gene region. A popular approach to deal with this uncertainty is to filter data in a way to minimize it. This means throwing away data below a coverage threshold and keeping only those markers for which there are many sequence reads.
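To make the filtering strategy concrete, here is a minimal sketch (my illustration, not from any particular pipeline) of a hard coverage threshold: genotype calls backed by fewer reads than the cutoff are simply set to missing. The function name and the threshold value are assumptions for the example.

```python
MIN_DEPTH = 8  # hypothetical minimum read depth required to trust a call

def filter_by_coverage(genotypes, depths, min_depth=MIN_DEPTH):
    """Keep genotype calls with adequate coverage; mark the rest missing."""
    return [g if d >= min_depth else None  # None = treated as missing data
            for g, d in zip(genotypes, depths)]

calls = ["AA", "AT", "TT", "AT"]
depths = [12, 3, 25, 8]
print(filter_by_coverage(calls, depths))  # → ['AA', None, 'TT', 'AT']
```

The cost is visible even in this toy example: a quarter of the data is discarded outright, which is exactly the trade-off the modeling approach below tries to avoid.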

An alternative is to confront that uncertainty and to model it. The central idea is that we are sampling the underlying genotype with GBS sequence reads, with all the attendant issues concerning sampling. Thought of this way, it makes sense (to me, at least) to treat the problem of genotyping from a modeling perspective like any other inference problem. In this context, some important contributions seem to have received less attention than I think they deserve.

A recent paper by Mandeville et al. (2015) in Molecular Ecology illustrates this approach. Not only is this a very interesting paper exploring geographic variation in reproductive isolation in repeated hybrid zones in fish, but the authors use a clustering algorithm that accounts for genotype uncertainty. The algorithm extends the hierarchical Bayesian ideas mentioned above and is based on the STRUCTURE algorithm (Pritchard et al., 2000; Falush et al., 2003, 2007), but modified to handle GBS data. The model, called ENTROPY, takes genotype likelihoods from variant calling via SAMtools/BCFtools as the starting point and, as in STRUCTURE, provides a clustering solution for varying numbers of populations (k). Output includes the assignment probabilities as well as genotype probabilities for all individuals at all loci and credible intervals for these estimated parameters.
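The genotype likelihoods that serve as ENTROPY's input capture the core idea of modeling rather than filtering. A common textbook simplification (this sketch is mine, not the exact SAMtools/BCFtools or ENTROPY likelihood) treats the reference-allele read count at a biallelic site as a binomial sample whose success probability depends on the true genotype, allowing for a sequencing error rate eps:

```python
from math import comb

def genotype_likelihoods(n_ref, n_total, eps=0.01):
    """P(read data | genotype) for 0, 1, or 2 copies of the reference allele,
    under a simple binomial model with per-read error rate eps."""
    likes = {}
    for g, p_ref in ((0, eps), (1, 0.5), (2, 1 - eps)):
        likes[g] = comb(n_total, n_ref) * p_ref**n_ref * (1 - p_ref)**(n_total - n_ref)
    return likes

# Three reference reads out of four: the heterozygote is favored, but the
# reference homozygote is not ruled out, so the genotype remains uncertain.
print(genotype_likelihoods(3, 4))
```

With only four reads, no single genotype dominates; rather than picking one call (or discarding the locus), a model like ENTROPY carries all three likelihoods forward into the clustering inference.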

Figure 4 from Mandeville et al. (2015) in which they estimated posterior distributions of admixture proportion (q) for each individual using ENTROPY, for k=2 to k=8 genetic clusters.

This provides a powerful approach for population genomics using GBS data (Mandeville et al., 2015). The use of Bayesian inference does require more computational time than other approaches to GBS data analysis. It also requires some familiarity with Bayesian methods and might not be applicable to all situations. On the other hand, another advantage of modeling genotype uncertainty is that you can potentially take advantage of lower coverage data, meaning that these methods accounting for variable coverage allow researchers to use more of their data (Buerkle & Gompert, 2013). This alone might justify paying more attention to these models.

About Stacy Krueger-Hadfield

I am a marine evolutionary ecologist interested in the impacts of seascapes and complex life cycles on marine population dynamics. I use natural history, manipulative field experiments, and population genetic and genomic approaches with algal and invertebrate models in temperate rocky shores, estuaries, and the open ocean.