Abstract

The advent of next generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution “in action” via evolve-and-resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Most existing literature on time-series data analysis often assumes large population size, accurate allele frequency estimates, or wide time spans. These assumptions do not hold in many E&R studies. In this article, we propose a method—composition of likelihoods for evolve-and-resequence experiments (Clear)—to identify signatures of selection in small population E&R experiments. Clear takes whole-genome sequences of pools of individuals as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation of coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of adaptation of Drosophila melanogaster to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.

The task of identifying selection signatures can be addressed at different levels of specificity. At the coarsest level, identification could simply refer to deciding whether some genomic region (or a gene) is under selection or not. In the following, we refer to this task as detection. In contrast, the task of site identification corresponds to the process of finding the favored mutation/allele at the nucleotide level. Finally, estimation of model parameters, such as strength of selection and dominance at the site, can provide a comprehensive description of the selection process.

In the effort to analyze E&R selection experiments, many authors chose to adapt existing tests that were originally used for static data, pairwise comparisons (two time points), and single replicates to perform a null scan. For instance, Zhou et al. (2011) used the ratio of the estimated population size of case and control populations to compute a test statistic for each genomic region. Burke et al. (2010) applied the Fisher exact test to the last observation of data on case and control populations. Orozco-ter Wengel et al. (2012) used the Cochran–Mantel–Haenszel (CMH) test (Agresti and Kateri 2011) to detect SNPs whose read counts change consistently across all replicates of two time-point data. Turner et al. (2011) proposed the diffstat statistic to test whether the change in allele frequencies of two populations deviate from the distribution of change in allele frequencies of two drifting populations. Bergland et al. (2014) calculated to populations throughout time to signify their differentiation from ancestral (two time-point data) as well as geographically different populations. Jha et al. (2015) computed a test statistic of generalized linear-mixed model directly from read counts.

Among the methods specifically designed for time-series data, many make assumptions which may not hold in E&R studies. One common assumption is that the underlying population size is large, so it is reasonable to model dynamics of allele frequencies using continuous-state models (Bollback et al. 2008; Feder et al. 2014; Terhorst et al. 2015). Second, many existing methods were originally designed to process the wider time spans seen in ancient DNA studies, an assumption that does not hold for E&R experiments (Steinrücken et al. 2014; Schraiber et al. 2016). Finally, many E&R analysis tools assume that allele frequencies in the input data are unbiased (e.g., Bollback et al. 2008), which may not be valid for shotgun sequencing experiments.

Here, we consider an HMM, similar to Williamson and Slatkin (1999) and Bollback et al. (2008) but under a “small-population-size” regime. Specifically, we use a discrete state (frequency) model. We show that for small population sizes, discrete models can compute likelihood exactly, which improves statistical performance, especially for short time-span experiments. Additionally, we add another level of sampling noise to the traditional HMM model, allowing for heterogeneous ascertainment bias due to uneven coverage among variants. We show that for a wide range of parameters, Clear provides higher power for detecting selection, estimates model parameters consistently, and localizes the favored allele more accurately compared to the state-of-the-art methods, while being computationally efficient.

Materials and Methods

Consider a panmictic diploid population with fixed size of N individuals. Let be the frequencies of the derived allele at generations for a given variant, where at generations samples of n individuals are chosen for pooled sequencing. The experiment is replicated R times. We denote allele frequencies of the R replicates by the set To identify the genes and variants that are responding to selection pressure, we use the following procedure:

Estimating population size: The procedure starts by estimating the effective population size, under the assumption that much of the genome is evolving neutrally.

Estimating selection parameters: For each polymorphic site, selection and dominance parameters are estimated so as to maximize the likelihood of the time-series data, given

Computing likelihood statistics: For each variant, a log-odds ratio of the likelihood of selection model to the likelihood of neutral evolution/drift model is computed. Likelihood ratios in a genomic region are combined to compute the Clear statistic for the region.

Hypothesis testing: An empirical null distribution of the Clear statistic is calculated using genome-wide drift simulations, and used to compute P-values and thresholds for a specified false discovery rate (FDR). We perform single-locus hypothesis testing within selected regions to identify significant variants and report genes that intersect with the selected variants.

These steps are described in detail below.

Estimating population size

Methods for estimating population sizes from temporal neutral evolution data have been developed (Williamson and Slatkin 1999; Anderson et al. 2000; Bollback et al. 2008; Terhorst et al. 2015; Jónás et al. 2016). Here, we aim to extend these models to explicitly model the sampling noise that arise in pool-seq data. Specifically, we model the variation in sequence coverage over different locations, and the noise due to sequencing only a subset of the individuals in the population. In addition, many existing methods (Bollback et al. 2008; Feder et al. 2014; Terhorst et al. 2015; Topa et al. 2015) are designed for large populations, and model frequency as a continuous quantity. We observed that using Brownian motion to model frequency drift may be inadequate for small populations, low starting frequencies, and sparse sampling (in time); factors that are common in experimental evolution (see Results, Figure 3, A–C, and Figure 2). To this end, we model the Wright–Fisher Markov process for generating pool-seq data (Supplemental Material, Figure S1 in File S3) via a discrete HMM (Figure 1B). We start by computing a likelihood function for the population size given neutral pool-seq data.

E&R selection experiments on D. melanogaster. (A) Typical configuration in which time-series data are collected for D. melanogaster. A small set of founder lines is selected from a large population and used to create a subpopulation of isofemale lines. Multiple replicates of the population are evolved and resequenced to collect time-series genomic data. For sequencing, n individuals are randomly sampled and sequenced with coverage λ. (B) Graphical model showing dependence of the random variables in the single-locus model used to compute Clear statistics. Observed variables c (derived allele read count) and d (total read count) are shaded. The variables denote allele frequency, sampled allele frequency, and mean sequencing coverage, respectively. (C) Mean and 95% confidence intervals of the (i and iii) theoretical and (ii and iv) empirical trajectories of the favored allele for (i and ii) hard- and (iii and iv) soft-sweep scenarios and The first 50 generations are shaded in gray to represent the sampling span of sampling in short-term experiments, illustrating the difficulty in predicting selection at early stages of selective sweep.

Likelihood for the neutral model:

We model the allele frequency counts as being sampled from a binomial distribution. Specifically,where π is the global distribution of allele frequencies in the base population. Note that π depends on the demographic history of the founder lines and can be estimated from the site-frequency spectrum (see Figure S2 in File S3) of the initial population. For notational convenience, henceforth we omit the dependence of likelihoods to the parameter π.

To estimate frequency after τ transitions, it is enough to specify the transition matrix where denotes probability of change in allele frequency from to in τ generations:(1)(2)Furthermore, in an E&R experiment, individuals are randomly selected for sequencing. The sampled allele frequencies, are also binomially distributed:(3)We introduce the sampling matrix Y, where stores the probability that the sample allele frequency is given that the true allele frequency is

We denote the pool-seq data for that variant as where represent the coverage and the read count of the derived allele, respectively. Let be the sequencing coverage at different generations. Then, the observed data are sampled according to(4)The emission probability for an observed tuple is(5)For let denote the probability of emitting and reaching state j at Then, can be computed using the forward procedure (Durbin et al. 1998):(6)where The joint likelihood of the observed data from R independent observations is given by(7)where The graphical model and the generative process for which data are being generated is depicted in Figure 1B and Figure S1 in File S3, respectively.

Finally, the last step is to compute an estimate that maximizes the likelihood of all M variants in the whole genome. Let denote the time-series data of the ith variant in replicate r. Then,

(8)

Estimating selection parameters

Likelihood for the selection model:

Assume that the site is evolving under selection constraints and where s and h denote selection strength and dominance parameters, respectively. By definition, the relative fitness values of genotypes 0|0, 0|1, and 1|1 are given by and Then, the frequency at time (one generation ahead), can be estimated using:(9)The machinery for computing likelihood of the selection parameters is identical to that of population size, except for transition matrices. Hence, here we only describe the definition transition matrix of the selection model. Let denote the probability of transition from to in τ generations, then (see Ewens 2004, p. 24, equations 1.58–1.59):(10)(11)The maximum likelihood estimates are given by(12)Using grid search, we first estimate N (Equation 8), and subsequently, we estimate parameters s and h (Equation 12, Figure S3 in File S3). By broadcasting and vectorizing the grid search operations across all variants, the genome scan on millions of polymorphisms can be done in a significantly smaller time than iterating a numerical optimization routine for each variant (see Results and Figure 6).

Empirical likelihood-ratio statistics

The likelihood-ratio statistic for testing directional selection, to be computed for each variant, is given by(13)where Similarly, we can define a test statistic for testing if selection is dominant by(14)While extending the single-locus Wright–Fisher model to multiple linked loci can improve the power of the model (Terhorst et al. 2015), it is computationally and statistically expensive to compute exact likelihood. In addition, computing linked-loci joint likelihood requires haplotype resolved data, which pool-seq does not provide. Here, similar to Nielsen et al. (2005), we calculate the composite-likelihood-ratio score for a genomic region,(15)where L is a collection of segregating sites and is the likelihood-ratio score based for each variant in L. The optimal value of the hyper-parameter L depends upon a number of factors, including initial frequency of the favored allele, recombination rates, linkage of the favored allele to neighboring variants, population size, coverage, and time since the onset of selection (duration of the experiment). In File S3, we provide a heuristic to compute a reasonable value of L, based on experimental data.

We work with a normalized value of given by(16)where and are the mean and standard deviation of values in a large region We found different chromosomes have different distributions of values, and therefore decided to use single chromosomes as

Hypothesis testing

Single-locus tests:

Under neutrality, log-likelihood ratios can be approximated by distribution (Williams and Williams 2001), and P-values can be computed directly. However, Feder et al. (2014) showed that when the number of independent samples (replicates) is small, is a crude approximation to the true null distribution and results in more false positives. Following their suggestion, we first compute the empirical null distribution using simulations with the estimated population size (see Figure S1 in File S3). The empirical null distribution of statistic H is used to compute P-values as the fraction of null values that exceed the test score. Finally, we use the method of Storey and Tibshirani (2003) to control for FDR in multiple testing.

Composite likelihood tests:

Similar to single-locus tests, we compute the null distribution of the statistic using whole-genome simulations with the estimated population size, and subsequently compute the FDR. The simulations for generating the null distribution of are described next.

Simulations

We use the same simulation procedure for two purposes. First, we use the simulations to test the power of Clear against other methods in small genomic windows. Second, we use them to generate the distribution of null values for the statistic to compute empirical P-values. We mainly chose parameters that are relevant to D. melanogaster experimental evolution (Kofler and Schlötterer 2013). See also Figure 1A for illustration.

Creating initial founder line haplotypes: Using msms (Ewing and Hermisson 2010), we created neutral populations for F founding haplotypes with command $./msms 〈F〉 1 −t 〈2μWNo〉 −r 〈2rWNo〉 〈W〉, where is the number of founder lines, is the effective founder population size, is the recombination rate, and is the mutation rate. The window size W is used to compute and We chose W = 50 kbp for simulating individual windows for performance evaluations, and W = 20 Mbp for simulating D. melanogaster chromosomes for P-value computations.

Creating initial diploid population: An initial set of haplotypes was created from step 1, and duplicated to create F homozygous diploid individuals to simulate generation of inbred lines. N diploid individuals were generated by sampling with replacement from the F individuals.

Forward simulation: We used forward simulations for evolving populations under selection. We also consider selection regimes in which the favored allele is chosen from standing variation (not de novo mutations). Given initial diploid population, position of the site under selection, selection strength s, number of replicates recombination rate and sampling times simuPop (Peng and Kimmel 2005) was used to perform the forward simulation and compute allele frequencies for all of the R replicates. For hard sweep (respectively, soft sweep) simulations we randomly chose a site with an initial frequency of (respectively, to be the favored allele. For generating the null distribution with drift for P-value computations, we used this procedure with

Sequencing simulation: Given allele frequency trajectories we sampled depth of each site in each replicate identically and independently from Poisson (λ), where is the coverage for the experiment. Once depth d is drawn for the site with frequency ν, the number of reads c carrying the derived allele are sampled according to binomial For experiments with finite depth the tuple is the input data for each site.

Results

Modeling allele frequency trajectories in small populations

We first tested the goodness of fit of the discrete vs. Brownian motion (a continuous-state model) in modeling allele frequency trajectories, under general E&R parameters. For this purpose, we conducted 100 K simulations with two time samples where is the parameter controlling the density of sampling in time. In addition, we repeated simulations for different values of starting frequency (i.e., hard and soft sweep) and selection strength (i.e., neutral and selection). Then, given initial frequency we computed the expected distribution of the frequency of the next sample under two models to make a comparison. Figure 2, A–F, shows that Brownian motion (continuous model) is inadequate when is far from 0.5, or when sampling times are sparse If the favored allele arises from standing variation in a neutral population, it is unlikely to have a frequency close to 0.5, and the starting frequencies are usually much smaller (see Figure S2 in File S3). Moreover, in typical D. melanogaster experiments, for example, sampling is sparse. Often, the experiment is designed so that (Zhou et al. 2011; Orozco-ter Wengel et al. 2012; Kofler and Schlötterer 2013; Franssen et al. 2015).

Comparison of empirical distributions of allele frequencies (red) vs. predictions from Brownian motion (green), and Markov chain (blue). Comparison of empirical and theoretical distributions under neutral evolution (A–F) and selection (G–M) with different starting frequencies and sampling times of where and For each panel, the empirical distribution was computed over 100,000 simulations. Brownian motion (Gaussian approximation) provides poor approximations when initial frequency is far from 0.5 (A) or sampling is sparse (B, C, E , and F). In addition, Brownian motion can only provide approximations under neutral evolution. In contrast, Markov chain consistently provides a good approximation in all cases.

In contrast to the Brownian motion approximation, discrete Markov chain predictions (Equation 11) are highly consistent with empirical data for a wide range of simulation parameters (Figure 2, A–M). Moreover, the discrete Markov chain can be modified to model the case when the the allele is under selection.

Detection power

We compared the performance of Clear against other methods for detecting selection. For each method we calculated detection power as the percentage of true positives identified with a false-positive rate For each configuration (specified with values for selection coefficient s, starting allele frequency and coverage λ), the power of each method is evaluated over 2000 distinct simulations, half of which modeled neutral evolution and the rest modeled positive selection.

We compared the power of Clear with GP (Terhorst et al. 2015), FIT (Feder et al. 2014), and CMH (Agresti and Kateri 2011) statistics. FIT and GP convert read counts to allele frequencies prior to computing the test statistic. Clear shows the highest power in all cases and the power stays relatively high even for low coverage (Figure 3 and Table S1 in File S3). In particular, the difference in performance of Clear with other methods is pronounced when starting frequency is low. The advantage of Clear stems from the fact that the favored allele with a low starting frequency might be missed by low coverage sequencing. In this case, incorporating the signal from linked sites becomes increasingly important. We note that methods using only two time points, such as CMH, do relatively well for high selection values and high coverage. However, the use of time-series data can increase detection power in low coverage experiments or when the starting frequency is low. Moreover, time-series data provide means for estimating selection parameters s and h (see below). Finally, as Clear is robust to change of coverage, our results (Figure 3, B and C) suggest that taking many samples with lower coverage is preferable to sparse sampling with higher coverage. For comparison purposes, we also tested Clear using the single-locus statistic For the most part, Clear showed an improvement over other methods even with or showed similar performance. The performance improved with higher L.

Power calculations for detection of selection. Detection power for Clear FIT, GP, and CMH under (A–C) hard- and (D–F) soft-sweep scenarios. λ and s denote the mean coverage and selection coefficient, respectively. Orange hexagons represent the performance of Clear when the maximum of the single-locus statistic is used to make a decision for the genomic region, while the red ● corresponds to the performance of Clear when single-locus statistics are averaged over the region. The y-axis measures power—sensitivity with false-positive rate for simulations with and The horizontal line reflects the power of a random classifier. In all simulations, three replicates are evolved and sampled at generations

Site identification

In general, localizing the favored variant using pool-seq data is a nontrivial task due to extensive linkage disequilibrium (LD) (Tobler et al. 2014). To measure performance, we sorted variants by their H scores and computed rank of the favored allele for each method. For each setting of and s, we conducted 1000 simulations and computed the rank of the favored mutation in each simulation. The cumulative distribution of the rank of the favored allele in 1000 simulations for each setting (Figure 4) shows that Clear outperforms other statistics.

Ranking performance for 100× coverage. Cumulative distribution function (CDF) of the distribution of the rank of the favored allele in 1000 simulations for Clear (H), GP, CMH, and FIT, for different values of selection coefficient s and initial carrier frequency. Note that the individual variant Clear score (H) is used to rank variants. The area under the curve is computed as an overall quantitative measure to compare the performance of methods for each configuration. In all simulations, three replicates with are evolved and sampled at generations

An interesting observation is revisiting the contrast between site identification and detection (Long et al. 2013; Tobler et al. 2014). When selection strength is high, detection is easier (Figure 3, A–F), but site identification is harder, due to the high LD between flanking variants and the favored allele (Figure 4, A–F). Moreover, site identification becomes more difficult whenever the initial frequency of the favored allele is low; i.e., at the onset of selection, LD between a favored allele and its nearby variants is high. For example, when coverage and selection coefficient the detection power is 75% for hard sweep, but 100% for soft sweep (Figure 3, B–E). In contrast, the favored site was ranked as the top in 14% of hard sweep cases, compared to 95% of soft sweep simulations.

Estimating parameters

Clear estimates effective population size and selection parameters, and , as a byproduct of the hypothesis testing. We computed bias of selection fitness and dominance for Clear and GP for 1000 simulations in each setting. The distribution of the error (bias) for 100× coverage is presented in Figure 5 for different configurations. Figures S4 and S5 in File S3 provide the distribution of estimation errors for 30× and 300× coverage, respectively. For hard sweep, Clear provides estimates of s with lower variance of bias (Figure 5A and Figure S6 in File S3). In soft sweep, GP and Clear both provide unbiased estimates of s with low variance (Figure 5B). Figure 5, C and D, shows that Clear provides unbiased estimates of h as well when and We also tested if Clear provides unbiased estimates of N by estimating population size on 1000 simulations when As shown in Figure 7A and Figure S9, A–C, in File S3 maximum likelihood is attained at the true value of the parameter.

Distribution of bias for 100× coverage. The distribution of bias in estimating selection coefficient over 1000 simulations using GP and Clear (H) is shown for a range of choices for the selection coefficient s and starting carrier frequency when coverage (A and B). GP and Clear have similar variance in estimates of s for soft sweep, while Clear provides lower variance in hard sweep. Also see Table S2 in File S3 (C and D). The variance in the estimation of h is shown. In all simulations, three replicates are evolved and sampled at generations

Running time

As Clear does not compute exact likelihood of a region (i.e., does not explicitly model linkage between sites), the complexity of scanning a genome is linear in the number of polymorphisms. Calculating the score of each variant requires an computation for However, most of the operations are can be vectorized for all replicates to make the effective running time for each variant faster. We conducted 1000 simulations and measured running times for computing site statistics H, FIT, CMH, and GP with different numbers of linked loci. Our analysis reveals (Figure 6) that Clear is orders of magnitude faster than GP, and comparable to FIT. While slower than CMH on the time per variant, the actual running times are comparable after vectorization and broadcasting over variants (see below).

Running time. Box plots of running time per variant (CPU seconds) of Clear CMH, FIT, and GP with 1, 3, 5, 7, and 10 loci over 1000 simulations conducted on a workstation with Intel Core i7 processor. The average running time for each method is shown on the x-axis. In all simulations, three replicates are evolved and sampled at generations

These times can have a practical consequence. For instance, to run GP in the single-locus mode on the entire pool-seq data of the D. melanogaster genome from a small sample (≈1.6-M variant sites), it would take 1444 CPU hours (≈1 CPU month). In contrast, after vectorizing and broadcasting operations for all variant operations using the numba package, Clear took 75 min to perform a scan, including precomputation; while the fastest method, CMH, took 17 min.

Analysis of Real Data

Analysis of a D. melanogaster adaptation to alternating temperatures:

We applied Clear to the data from a study of the adaptation of D. melanogaster to alternating temperatures (Orozco-ter Wengel et al. 2012; Franssen et al. 2015), where three replicate samples were chosen from a population of D. melanogaster for 59 generations under alternating 12-hr cycles of hot-stressful (28°) and nonstressful (18°) temperatures, and sequenced. In this data set, sequencing coverage is different across replicates and generations (see figure S2 of Terhorst et al. 2015) which makes variant depths highly heterogeneous (Figure S10 in File S3).

We first filtered out heterochromatic, centromeric, and telomeric regions (Fiston-Lavier et al. 2010), and those variants that have a collective coverage of >1500 in all 13 populations: three replicates at the base population, two replicates at generation 15, one replicate at generation 23, one replicate at generation 27, three replicates at generation 37, and three replicates at generation 59. After filtering, we ended up with 1,605,714 variants.

Next, we estimated genome-wide population size (Figure 7B and Figure S9E in File S3), which is consistent with previous studies (Orozco-ter Wengel et al. 2012; Jónás et al. 2016). The likelihood curves of Clear are sharper around the optimum compared to that of the method of Bollback et al. (2008) (see figure S1 in Orozco-ter Wengel et al. 2012). Also, chromosomes 3L and 3R appear to have a smaller population size, respectively. Others have made similar observations on this data. In particular, Jónás et al. (2016) showed that the chromosome-wise population size varies even more when it is computed for each replicate separately (see table 1 in Jónás et al. 2016). For instance, is 131 for chromosome 3R replicate 1, while it is 328 for chromosome X replicate 2.

Estimating population size. (A) Distribution of bias in estimating N, computed on 1000 neutral simulations for each when and (B) Estimates of population size for data from a study of adaptation of D. melanogaster to alternating temperatures. For each case, the distribution of estimator is computed by 100 bootstrap computations using 1000 variants each. The multiple modes are an artifact of grid search used to speed up computation. (C) Distribution of the population size estimates on the yeast data set. Despite large census population size Burke et al. 2014), this data set exhibits a much smaller effective population size

While it would be ideal to compute the Clear statistic for each replicate and chromosome separately, computing empirical P-values and significant regions become computationally intensive as the empirical null distribution of each replicate and each chromosome needs to be computed. Hence, we use a single genome-wide estimate in all analyses, but we normalize statistic separately for each chromosome.

We used a heuristic calculation (see File S3) to choose the sliding window size L as the distance where the LD between the favored mutation and a site away remains strong. For D. melanogaster parameters, we obtained We computed the normalized test statistic on sliding windows of size of 30 kbp and step size of 5 kbp over the genome (see Figure 8A).

Scan of Clear statistic on data from a study of adaptation of D. melanogaster to alternating temperatures. (A) Manhattan plot of scan for statistic using sliding window of size over the genome. The dashed line represents cutoff for genome-wide and identifies five contiguous intervals, I1–I5, which are shaded in blue. (B) Trajectories of the selected variants within intervals I1–I5.

The empirical null distribution of was estimated by creating 100 whole-genome simulations (400 K statistic values), as described in Material and Methods. Then, the P-value of the test statistic in each region in the experimental data was calculated as the fraction of the null statistic values that are greater than or equal to the test statistic(see Figure S11 in File S3). After correcting for multiple testing, we identified five contiguous intervals (Figure 8) satisfying and covering polymorphic sites. We further performed single-locus hypothesis testing on the sites to identify 174 individual variants with an (Figure 8B).

The final set of 174 variants fall within 32 genes (Table S3 in File S3), including many serine inhibitory proteases (serpins) and other genes involved in endocytosis. Recycling of synaptic vesicles is seen to be blocked at high temperature in temperature-sensitive Drosophila mutants (Kosaka and Ikeda 1983). This is also supported by gene ontology (GO)-enrichment analysis, where a single GO term “inhibition of proteolysis” is found to be enriched (corrected P-value = 0.0041). To test for dominant selection, we computed the D statistic on simulated neutral and experimental data, and computed P-values accordingly. After correcting for multiple testing, 96 variants were discovered with an (Figure S12 in File S3).

Analysis of outcrossing yeast populations

We also applied Clear to 12 replicate samples of outcrossing yeast populations (Burke et al. 2014), where samples were taken at generations We observed a significant variation in the genome-wide, site-frequency spectrum of certain populations over different time points for some replicates (Figure S13 in File S3). The variation does not have an easily identifiable cause. Therefore, we focused analysis on seven replicates with a genome-wide, site-frequency spectrum over the time range (Figure S14 in File S3).

We estimated population size to be haplotypes (Figure 7C and Figure S9F in File S3), and computed the and H statistics accordingly. To compute P-values, we created 1-M single-locus neutral simulations according to the experimental data’s initial frequency and coverage. By setting the FDR cutoff to 0.05, only 18 and 16 variants show significant signal for directional and dominant selection, respectively (Figure 9 S12 in File S3). Selected variants for directional selection are clustered in two regions, which match two of the five regions (regions C and E in figure 2A in Burke et al. 2014) identified by Burke et al. (2014) in their preliminary analysis. UCSC browser tracks for analysis of D. melanogaster and Yeast datasets are available as File S1 and File S2, respectively.

Single-locus analysis of the yeast outcrossing populations. Manhattan plot of scan single-locus Clear statistic for testing (A) directional selection and (C) dominant selection. The dashed line represents cutoff for genome-wide Trajectories of the selected variants are depicted in panels (B) and (D).

Discussion

We developed a computational tool, Clear, that can detect regions and variants under selection E&R experiments. Using extensive simulations, we show that Clear outperforms existing methods in detecting selection, locating the favored allele, and estimating model parameters. Also, while being computationally efficient, Clear provides a means for estimating populations size and hypothesis testing.

Many factors; such as small population size, finite coverage, LD, finite sampling for sequencing, duration of the experiment, and the small number of replicates; can limit the power of tools for analyzing E&R. Here, by discrete modeling, Clear estimates population size and provides unbiased estimates of s and h. It adjusts for heterogeneous coverage of pool-seq data, and exploits the presence of linkage within a region to compute a composite likelihood ratio statistic.

It should be noted that, even though we described Clear for small fixed-size populations, the statistic can be adjusted for other scenarios, including changing population sizes when the demography is known. For large populations, transitions can be computed on sparse data structures, as for large N the transition matrices become increasingly sparse. Alternatively, frequencies can be binned to reduce dimensionality.

The comparison of hard- and soft-sweep scenarios showed that the initial frequency of the favored allele can have a nontrivial effect on the statistical power for identifying selection. Interestingly, while it is easier to detect a region undergoing strong selection, it is harder to locate the favored allele in that region.

There are many directions to improve the analyses presented here. In particular, we plan to focus our attention on other organisms with more complex life cycles, experiments with variable population size, and longer sampling-time spans. As E&R experiments continue to grow, deeper insights into adaptation will go hand in hand with improved computational analysis.

Acknowledgments

The authors thank the anonymous reviewers whose comments substantially improved the quality and clarity of this manuscript. A.I., A.A., and V.B. were supported by grants from the National Institutes of Health (1R01 GM-114362) and National Science Foundation (DBI-1458557 and IIS-1318386). C.S. is supported by the European Research Council grant ArchAdapt. V.B. is a cofounder, has an equity interest, and receives income from Digital Proteomics, LLC (DP). The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. DP was not involved in the research presented here.

The Genetics Society of America (GSA), founded in 1931, is the professional membership organization for scientific researchers and educators in the field of genetics. Our members work to advance knowledge in the basic mechanisms of inheritance, from the molecular to the population level.