Abstract

While genetic diversity can be quantified accurately from high coverage sequencing data, it is often desirable to obtain such estimates from data with low coverage, either to save costs or because of low DNA quality, as is observed for ancient samples. Here, we introduce a method to accurately infer heterozygosity probabilistically from sequences with average coverage of a single individual. The method relaxes the infinite sites assumption of previous methods, does not require a reference sequence, except for the initial alignment of the sequencing data, and takes into account both variable sequencing errors and potential postmortem damage. It is thus also applicable to nonmodel organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of postmortem damage. This method does not require knowledge about the underlying genome sequence, but instead works with haploid data (e.g., from the X-chromosome from mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few megabasepairs of haploid data are sufficient for accurate recalibration, even at average coverages as low as At similar coverages, our method also produces very accurate estimates of heterozygosity down to within windows of about 1 Mbp. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples, and we found that 3000–5000-year-old samples showed diversity patterns comparable to those of modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions were also very different between the two samples, supporting earlier conclusions of a highly diverse and structured population in Europe prior to the arrival of farming.

THE genetic diversity at a particular location in the genome is the result of its evolutionary past. Comparing the genetic diversity between individuals or regions of the genome thus gives insight into differences in their respective evolutionary histories. For a diploid individual, the heterozygosity of a genomic region (the fraction of sites in a region at which the individual carries two different alleles) is the result of mutations that occurred since the two alleles shared a common ancestor. It is thus a function of the local mutation rate and the time to the most recent common ancestor, which, in turn, is reflective of the demographic and selective past at that locus. Variation in local mutation rates and, due to recombination, also in the strength of selection and genetic drift leads to variable diversity across the genome. Comparing heterozygosity between regions can thus identify locations that were affected differently by selection, or those with an increased mutation rate. Comparing heterozygosity between individuals may further highlight differences in the demographic histories of populations or different levels of inbreeding, which may lead to long runs of homozygosity.

While heterozygosity is readily obtained from high quality genotype calls by counting, it is much harder to infer accurately from low coverage genomes (i.e., genomes sequenced at low depth). This is primarily due to a substantial probability of observing only one of the two alleles and to sequencing errors, which occur at rates orders of magnitude higher than the expected heterozygosity in many species, including humans. Additional biases may be introduced by relying on a reference genome or by postmortem DNA damage (PMD) when working with ancient DNA.

A natural way of circumventing these issues is to infer genetic diversity probabilistically by taking many of the mentioned issues into account, and several such methods have been developed over the past decade. Johnson and Slatkin (2006), for instance, developed a method for estimating the population scaled mutation rate where N is the population size, and μ the mutation rate, from large metagenomic data sets in the presence of sequencing errors. Shortly after, multiple moment based estimators were introduced to infer heterozygosity from a single individual (Hellmann et al. 2008; Jiang et al. 2008). Lynch (2008) then introduced a likelihood-based estimator that relaxed the assumption of a known error rate by estimating it jointly with heterozygosity from the data itself. Despite the additional parameter, this likelihood-based estimator is generally more accurate, even if this implementation is ill-behaved at very low coverages (Lynch 2008).

Several methods to infer genetic diversity that relax the assumption of a constant error rate have since been proposed. These methods commonly make use of error rate estimates provided by the sequencing machines (the quality scores) to calculate the genotype likelihoods then used in the inference. A particularly frequently used approach infers the site frequency spectrum (SFS) while integrating over genotype uncertainty at individual sites (Nielsen et al. 2012). If applied to data from a single individual, the SFS is a direct estimation of heterozygosity. However, the current implementation of this approach in the software package ANGSD (Korneliussen 2014) requires a priori knowledge of the two potentially observable alleles at each site. While these may be inferred accurately in the case of multiple samples with decent coverage, such an inference from low coverage data are likely biased.

Here, we present a direct extension of these previous approaches to infer heterozygosity within a region that relaxes the assumption of infinitely many sites, does not require any a priori knowledge of the underlying alleles, and takes additional biases introduced by PMD fully into account. We achieve this by explicitly modeling PMD using an extension of the likelihood framework proposed by Maruki and Lynch (2015) to infer site allele frequencies by integrating over all possible genotypes, and by modeling genotype frequencies using the classic substitution model of Felsenstein (1981), which allows for back mutations.

A major hurdle for all approaches relying on quality scores provided by sequencing machines is that these scores are not reliable and must be recalibrated, particularly when coverage is low. This is commonly achieved by learning error rates from sites assumed a priori to be invariant, for instance, by masking polymorphic sites, repetitive elements, and large structural variants (DePristo et al. 2011). While we have extended this approach to tolerate PMD (Hofmanová et al. 2016), it requires detailed knowledge of the study species, which is often lacking for nonmodel organisms. We propose here to circumvent this problem by using a reference-free recalibration approach that relies on haploid sequences such as those from the X-chromosome in male mammals. Our approach does not require any a priori information on the underlying sequence, as it integrates over all possible but hidden genotypes while taking PMD and covariates such as position in read or read context into account. This renders our approach essentially free of reference biases, since the reference is only required for aligning raw reads by mapping and current mapping techniques tolerate a sequence divergence of up to 10% (e.g., Lunter and Goodson 2011).

Using computer simulations, we show that our method reliably estimates local genetic diversity in single, diploid individuals even with average coverage below for windows of ∼1 Mbp. We further show that a few megabasepairs of data at equally low coverage are sufficient to properly recalibrate distorted quality scores. Finally, we use the methods here developed to infer the genome-wide pattern of diversity for several ancient and modern human samples. We found that these patterns differ between European and African samples, but that samples from a few thousand years ago cluster well with modern samples. In contrast, European hunter-gatherer individuals differ strongly from modern Europeans, but also from each other, illustrating the high diversity that existed in Europe before the Neolithic transition.

Theory

Inferring heterozygosity

Here we develop a method to estimate local heterozygosity in a genomic window from a collection of aligned reads by integrating out the uncertainty of the local genotype, and by taking the potential effects of PMD into account. Specifically, we are interested in inferring the stationary base frequencies , together with the rate of substitutions , along the genealogy connecting the two alleles of an individual within a genomic region. Here, T corresponds to the time to the most recent common ancestor of the two lineages, and μ to the mutation rate per base pair per generation. Notably, it is not possible to infer T and μ independently, and we therefore only attempt to estimate the compound substitution rate θ from the data.

We extend Felsenstein’s model of substitutions (Felsenstein 1981) to infer θ while accounting for the uncertainty in the genotypes in the region of interest. Let us denote the hidden genotype at site i by , where consists of a pair of nucleotides with Under the substitution model, the probability of observing a specific genotype given the base frequencies , and the substitution rate θ, is given by(1)To integrate out the uncertainty in observed genotypes, we adopt a model similar to Lynch (2008), and those commonly used for genotype calling (e.g., Li 2011). The subsequent notation will closely follow the notation we recently introduced (Hofmanová et al. 2016).

The observed data at site i shall correspond to what is typically obtained when individual reads of next generation sequencing approaches are mapped to a reference genome. Here, we will assume that all sequencing reads were accurately mapped and that reads with low mapping qualities have been filtered out. The data obtained at site i thus consists of a list of observed bases

We chose to model the observed data, , at site i as a function of the underlying genotype, , as well as the base-specific rates of sequencing errors for We assume here that these rates are known (e.g., through quality scores provided by the sequencing machine). Assuming further that sequencing errors occur independently, the likelihood of the full data at site, i, is given bywhere

Following Maruki and Lynch (2015), and commonly used approaches (e.g., Li 2011), we will assume that a sequencing read is equally likely to cover any of the two alleles of an individual, and that sequencing errors may result in any of the alternative bases with equal probability, The probability of observing a base, , given the underlying genotype, , is then given byAssuming sites to be independent, the full likelihood of our model is given bywhere the sum runs over all combinations

PMD damage:

We will next extend this model with the possibility of PMD. The most common form of PMD is C deamination, which leads to a transition on the affected strand and a transition on the complimentary strand (e.g., Briggs and Stenzel 2007). These deaminations do not occur randomly along the whole read, but are observed much more frequently at the beginning of a read. This is due to fragment ends being more often single-stranded, and thus subject to a much higher rate of deamination. Consequently, the rates of PMD, while specific to the sample and the sequencing protocol used, generally decay roughly exponentially with distance from the ends of the read Skoglund et al. (2014). Since ancient DNA is highly fragmented, one read can often cover an entire DNA molecule, and hence and transitions may be seen in a single read, but they are accumulated at opposite ends.

Here, we will develop our model for this form of PMD following the formulation of Skoglund et al. (2014) and Hofmanová et al. (2016), but we note that it is readily extendable to incorporate other forms of PMD as well. We feel that the rationale of the approach taken is best explained with a specific example. Consider , given the underlying genotype There are three possible ways to obtain a T: (i) by sequencing an allele T without error, (ii) by sequencing an allele C affected by PMD without error, (iii) or by sequencing an allele C not affected by PMD with error. We thus have(2)where denotes the probability that a PMD occurred at the base of read j covering site i.

The emission probabilities for all combinations of and derived following the same logic are found in the Appendix.

Inference using expectation-maximization:

In this section, we will detail how to find the maximum likelihood estimate (MLE) of the model parameters θ and in a genomic window of I sites using an expectation-maximization (EM) algorithm (Dempster et al. 1977). For this, we will make the assumption that base-specific sequencing error rates, , and rates of PMD, , are given constants. For the cases in which they are not known a priori, we show below how they can be learned accurately from genome-wide data prior to inferring θ and In an effort to unburden the notation, we will thus refer to the emission probabilities simply as in the following.

The relevant property to develop an EM algorithm is the complete data likelihood, which, in the case of our model, is given byand thus the complete data log-likelihood by

E-step:

The expected complete data log-likelihood is calculated aswhere the sum runs over all combinations Only the second part of this sum depends on the parameters We havewhere we use the shorthand notation We have by Bayes’ Theorem(3)Let us write out explicitly:

M-step:

We have to maximize subject to the constraintFor this reason, we form the Lagrangianwhere μ is the Lagrange multiplier. We get the following partial derivatives of the Lagrangian:We have to set these equations to zero and solve for and μ. Since this is not possible analytically, we will revert to the Newton-Raphson algorithm.

To streamline the notation, we will rename our variables , and With these variable the above system can be simplified to a system(4)wherefor andTo apply the Newton-Raphson algorithm, we determine the Jacobian matrix The nonzeros entries of the Jacobian with are:(5)We can now approximate the zero of Equation 4 with the iteration(6)After a few iterations, we get the new estimate for the original parameters by setting for and

Confidence intervals:

We calculate an approximate confidence interval for θ using the Fisher information. To simplify the calculations, we consider the as constant. The observed Fisher information at the ML value isand the corresponding derivatives are(7)Observe that From this we easily get that(8)where we have set(9)An approximate confidence interval is now given by

Estimating rates of PMD

As mentioned earlier, the method above assumes that rates of PMD are known a priori. In cases in which they are not known, they can be readily inferred from genome-wide data as we outline in this section.

Following Jónsson et al. (2013), we present an approach to estimate PMD rates directly from genome-wide counts of and transitions as a function of distance within the read. For this, we first build the three-dimensional table , where each entry corresponds to the number of observed bases, s, read at a site with reference base, r, at position p within a read. While these counts depend on the divergence between the sequenced individual and the reference genome used for mapping, we develop here an approach that takes this divergence into account.

Let us denote by the probability of a true difference between the sequenced individual and the reference, such that the reference has base r and the sequenced individual base s. Since the reference and a sequenced chromosome form a genealogy on which these mutations occurred, it is safe to assume that We will further assume that the observed counts in a cell not affected by PMD are a direct function of

Since the rate of PMD is generally low far away from the read ends, position-specific estimates may become noisy for these positions, particularly if data are limited. We thus introduce a method to estimate parameters of a model of exponential decay with the position in the read. The use of such a model was first introduced by Skoglund et al. (2014), and we implement here a slightly more general version of their function. Specifically, we assume that the probability of observing base when the reference sequence is a C at position p is given bywhere again denotes true differences between the individual and the reference.

Note that some of the parameters of this model are nonidentifiable. In the Appendix, we show how to obtain ML estimates of the parameters of the probability function(10)using the standard Newton-Raphson algorithm.

To then obtain estimates for the original parameters , and c, we assume that and obtain a ML estimateThen, , and We use the analogous logic to infer PMD patterns for damages, but measuring positions from the opposite end of the read.

Estimating base-specific error rates (recalibration)

The challenge of inferring genetic diversity from next-generation sequencing data lies in the fact that the per base error rates are orders of magnitude higher than the expected heterozygosity of many species (Lynch 2008). While this issue can easily be overcome with high coverages, accurate inference from low-coverage data relies on an exact knowledge of base-specific error rates. Crude estimates of these rates are usually directly provided by the sequencing machines themselves. However, these estimates are often inaccurate, and are recommend to be recalibrated for genotype calling (DePristo et al. 2011).

The most commonly used approach for recalibration is BQSR (Base Quality Score Recalibration) implemented in GATK (McKenna et al. 2010; De-Pristo et al. 2011). This approach infers new quality scores by binning the data into groups based on covariates such as the raw quality score, the position in the read, or the sequence context. All bases within such a bin are assumed to share the same error rate, which can be readily inferred if the true underlying sequence is known. As an alternative, Cabanski et al. (2012) proposed to fit a logistic regression to the full data where the response variable is the probability of a sequencing error and the explanatory variables are the raw quality scores, and covariates such as position in the read, or base context.

For our purpose, these methods suffer from two shortcomings: first, they cannot be applied to ancient DNA since they do not take PMD into account. Second, both require a reference sequence as well as knowledge of polymorphic positions, such that they can be excluded from the analysis. While we have shown how to extend the BQSR method to ancient DNA (Hofmanová et al. 2016), we develop here an approach that also integrates over the unknown reference sequence.

To do so, we will assume the existence a genomic region for which the individual does not show any polymorphism. A good example of such a genomic region are nonhomologous sequences from sex chromosomes in heterogametic individuals, and we will describe our approach with this type of data in mind. However, we note that our approach can also be readily applied to diploid regions that are known to be monomorphic, such as positions that are highly conserved among species or positions retained after filtering out those with high minor allele counts (Cabanski et al. 2012).

Model

As above, let us denote the hidden (haploid) genotype at site i by where is one of the nucleotides At each site i, there are reads, and we denote by , the base of read j covering site i. A sequencing error occurs with probability These probabilities shall now be given by a model(11)where is a given external vector of information, and are the parameters of the model that have to be estimated. While our approach is flexible regarding the choice of included covariates, we will here consider the raw quality score, the position within the read, the squares of these to account for a nonlinear relationships, and all two-base contexts consisting of the bases of the read at positions and i.

Following Cabanski et al. (2012), we impose the logit model(12)withIn the case of monomorphic or haploid sites only, the probability of the read vector given the hidden state can be written more generally as

(13)

Here, the dependence on the parameters is given by (12),and and refer to the known probability that a or PMD occurred at the position covering site i in read j.

Again, this model can be estimated using the EM algorithm with the Newton-Raphson algorithm in the M-step. Details are given in the Appendix.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article.

Simulations

Generating simulations

In this section, we illustrate the power and accuracy of our inference approaches with simulations. These were generated using a custom-made R script that implements the following steps:

The first chromosome of length L was simulated using random bases with frequencies

The second, homologous chromosome was simulated according to the Felsenstein (1981) substitution model (Equation 1) with and a chosen θ value.

Sequencing reads of 100 bases were then generated by copying from one of the two chromosomes with equal probability, and by choosing a starting position uniformly between positions 1 and until the desired average coverage was reached. All reads copied from the second chromosome were considered to map to the reverse strand.

PMD was simulated on all reads with probabilities following an exponential decay with increasing position in the read as proposed by Skoglund et al. (2014) to match realistic patterns. Specifically, we simulate PMD at position p within the read with probability

where and for both and but with p counted from the 3′ and 5′ ends, respectively.

For each simulated base, a phred-scaled quality score was simulated and sequencing errors were then added with probabilities given by these scores. If not stated otherwise, quality scores were simulated from a normal distribution with mean and SD truncated at zero. When testing our recalibration approach, however, the quality scores were simulated from a uniform distribution , and then transformed according to Equation 12 with coefficients to obtain the true error rate, with which sequencing errors were simulated.

The simulated data were finally used to generate a reference FASTA file containing the first chromosome and a SAM file containing the reads. The latter was then transformed into a BAM file using SAMtools (Li et al. 2009).

Power to infer θ

To check the power of our approach to infer θ from low coverage data, we first simulated data within a 1 Mbp window with a true for various coverages. The specific value of was chosen to reflect the median heterozygosity in a modern, non-African human individual.

We found the median of our θ estimates across replicates to be very close to the true value, but the variance to be a function of coverage. At low coverage (), θ was often inferred to be zero. This is not surprising, as the information about genetic diversity can come only from sites covered at least twice, which is rare at average coverages The simulations that did result in an estimate above zero were thus enriched for cases with slightly above average number of polymorphic sites among those covered twice. As soon as average coverage exceeded however, our approach estimated θ at very accurately (Figure 1A).

Power to infer θ from low coverage data. Results from sets of 100 simulations with PMD for different average coverage, window size, and true θ values. (A) Estimated in windows of 1 Mbp as a function of average coverage. (B) Estimated as a function of window size and fixed average coverage of (C) Accuracy of estimating quantified as the median relative error () over replicates indicated by contour lines as a function of both coverage and window size. (D) True vs. estimated θ for different average coverages (see color legend). Polygons indicate the quantile of estimated values among all replicates. The diagonal black line indicates the expectation for perfect estimation. In (A, B, and D), replicates resulting in a are not shown, but their fraction across replicates are printed below the horizontal black line.

We next performed simulations with a fixed coverage of but varying the window size (Figure 1). Interestingly, we found that an increase in window size has a positive effect on the estimate accuracy, similarly to an increase in coverage, suggesting that larger windows help to increase accuracy if coverage is very low. To illustrate this effect, we performed simulations at various window sizes and coverages, and recorded the relative estimation error for a series of replicates. As expected, we found the median relative estimation error to be a direct function of the product of window size and coverage (Figure 1C), thus suggesting our method will perform well also at average coverages below if the window size is large enough.

Using the same setting, we also checked the accuracy of the approximate confidence intervals obtained using the Fisher information. For this we inferred θ from 1000 windows of 1 Mbp simulated with θ at for a coverage of 1.0 and 0.2. We found the true value to be included in the 95% confidence interval in 93.6 and 98.6%, respectively, suggesting these confidence intervals to be a very accurate reflection of estimation uncertainty.

Using a third set of simulations, we found that, at equal coverage and window size, higher θ values are estimated more accurately than lower values (Figure 1D). This is expected since, in the case of low θ, only few heterozygous sites are present in a given window, rendering the estimate more dependent on the detection of individual sites. Nonetheless, we found our approach to infer very accurately in a window of 1 Mbp if the average coverage exceeds .

All results above were generated assuming base-specific quality scores to be normally distributed with and SD which is lower than the quality expected with current sequencing approaches (e.g., Utturkar et al. 2015). Sequences generated with higher quality will positively affect estimation accuracy. Indeed we found that simulating data with or resulted in much lower estimation error, effectively rendering the estimation of θ feasible even at very low average coverage (Figure 2). For instance, we found that, at an average coverage of more than 90% of windows with and were estimated within less than half an order of magnitude from the true value. At this accuracy was only reached with an average coverage of

Effect of sequencing quality on power to estimate θ. Results from sets of 100 simulations to assess the power to estimate θ of and for (A) and (B), respectively, for different average base qualities distributed normally with mean 20, 40, or 60, and a SD of 4.5, but truncated at 0. Polygon shapes indicate the confidence interval for estimated over all replicates, excluding those resulting in (the fraction excluded are printed below the horizontal black line). All simulations were conducted with PMD, and the true PMD probability functions were used during the estimation.

Comparison to an existing method

While there is currently no implemented method available to infer heterozygosity in a window or region from a single individual, our method is very similar to those inferring the SFS from multiple individuals (e.g., Nielsen et al. 2012; Korneliussen 2014). Indeed, inferring the SFS from a single individual gives a direct estimate of heterozygosity, H, which is related to θ by(14)The most commonly used approach to infer the SFS while integrating over genotype uncertainty at individual sites (Nielsen et al. 2012) is implemented in the software package ANGSD (Korneliussen 2014). This implementation assumes that the two potential alleles are known a priori for each site, and hence have to be inferred first using the major-minor option implemented in ANGSD. While the method implemented in ANGSD can be extended to address this limitation, we used our simulation framework to assess its effect on the inference of heterozygosity from low coverage data. In order to make estimates comparable, we report those of ANGSD in terms of θ calculated according to Equation 14 from the fraction of sites reported to be heterozygous in the SFS.

As shown in Figure 3A, ANGSD inferred θ very accurately if the reference allele was provided for each site, and only the second allele needed to be inferred. Making explicit use of this additional information, the estimates were more accurate than those obtained with our approach, which does not make any assumption about the alleles present. However, when no information was given about the observable alleles, and both had to be inferred using the major-minor option of ANGSD prior to the inference of heterozygosity, θ was grossly underestimated whenever coverage was below 10×. While we note that ANGSD is designed for applications involving multiple samples, in which case inferring the two observable alleles accurately is much easier; this result illustrates the importance of taking the full genotype uncertainty into account when inferring diversity from low coverage data.

Performance comparison with ANGSD. Results from sets of 100 simulations with (A) or without (B) PMD, comparing the performance of the method presented in this study and ANGSD in inferring θ. ANGSD was run either with the reference sequence provided (ANGSD_wREF) or without (ANGSD_noREF), in which case the major and minor alleles were first inferred in an additional step. For all simulations, we further assumed that base qualities were distributed normally, with and but truncated at 0. Polygon shapes indicate the confidence interval for estimated but excluding replicates resulting in The fraction of replicates excluded are printed below the horizontal black line. When applying our method in simulations conducted with PMD, we provided the true PMD probability patterns during the estimation.

Again using simulations, we next studied the impact of PMD on the inference of heterozygosity (Figure 3B). Unsurprisingly, the diversity estimated using the method implemented in ANGSD that does not take PMD into account was more than one order of magnitude too large. In contrast, our method properly accounts for PMD if its pattern is well characterized. While ANGSD can readily be extended to account for PMD, these results highlight the importance of accounting for PMD for any population genetic inference or comparison involving ancient DNA.

Accuracy of recalibration

The results discussed so far were all obtained under the assumption that quality scores provided by the sequencing machine are accurate. Unfortunately, this is rarely the case, making recalibration of the quality scores necessary for most applications, and, in particular, when trying to infer genetic diversity from low coverage data. Here, we developed an approach to recalibrate quality scores without prior knowledge of the underlying sequencing information. Instead, we simply assume that a part of the sequence is known to be monomorphic, such as the haploid X-chromosome in mammalian males.

To investigate the power of our approach to infer recalibration parameters, we simulated sequencing reads from a haploid region where the quality scores provided in the SAM files were distorted. We did this by first simulating fake quality scores from a uniform distribution , and then transforming them into true quality scores according to Equation 12. We used the following coefficients: all context coefficients = 1.0, the coefficients for the raw quality score, the square of the raw quality score, the position within the read, , and the square of the position within the read, These values were chosen to reflect a distortion observed in real data from the ancient human samples analyzed in this study (see below). They also result in both a relatively strong distortion, as well as meaningful error rates for the evaluation of our approach.

We found all coefficients to be inferred with high accuracy from a 1 Mbp window with an average coverage above (Figure 4). If the amount of data were much lower than that, estimates were generally less accurate. In particular, we found the coefficients for the quality ( and ) to be often slightly overestimated at low coverages, likely because many sequencing errors go undetected since they can only be inferred at sites covered at least twice. However, this bias can be alleviated with larger window sizes if coverage is very low (see below).

Accuracy in inferring recalibration parameters. Results from sets of 100 simulations are shown where sequence data from a haploid 1 Mbp region was simulated assuming a uniform distribution of observed quality scores () that were then transformed to true qualities according to Equation 12, with , and all context coefficients at 1.0. All simulations were conducted with PMD, and the true PMD probability patterns were used during the estimation. The coverage values refer to the diploid regions of the simulated genome.

Accuracy of full pipeline

We finally used simulations to assess the accuracy of the full pipeline, that is, when inferring first the pattern of PMD, then the recalibration coefficients given the inferred PMD pattern, and lastly using the recalibrated quality scores along with the inferred PMD pattern to estimate θ. In these simulations, the distortion of quality scores was, in addition to the four effects included above ( and ), also affected by sequence context, in that simulated sequencing errors were 1.5 times more likely to result in a C or G than in an A or T.

Regardless of the true θ value we used, we detected a strong bias toward high values in our estimates whenever very little data were used (Figure 5, 0.1 Mbp). This is a direct result of the overestimation of the quality scores during the recalibration step as reported above, which leads to an overestimation of diversity. Encouragingly, however, this bias is overcome with only slightly more data. Indeed, we found 1 Mbp of data with an average coverage of well below to be sufficient to accurately infer , and of for Notably, even lower average coverages were sufficient when data were available for 10 Mb. Finally, we found an average coverage of to be sufficient when conducting recalibration and inference in windows as small as 0.1 Mb. These results thus suggest that our approach may be useful not only for hemizygous individuals with large chunks of haploid DNA (the sex chromosomes), but may also work well in other individuals when using mtDNA, for which high coverage can be obtained, or ultraconserved elements for recalibration.

Accuracy in estimating θ using the full pipeline. Results from sets of 50 simulations, each consisting of data from a haploid as well as a diploid region used to conduct recalibration and inference of θ, respectively. The data sets in (A, B, and C) were simulated with different true values of θ, which are indicated with the dashed lines, and were and respectively. Each data set was simulated with PMD as well as distorted base quality scores according to Equation 12, with and In addition, these simulations also included context effects in that sequencing errors were simulated to result 1.5 times more often in a C or G than in an A or T. The average coverage indicated is for the diploid data, while the haploid data were simulated with half the coverage. Line segments and polygons correspond to the median and the quantile of all estimated within the set of simulations, respectively.

Application

We illustrate the benefit of our approach by inferring θ for several ancient human male samples, and comparing these estimates to those obtained for several male individuals from the 1000 Genomes Project. For the ancient genomes, we first inferred PMD patterns from chromosome 1 using the exponential model introduced here (Supplemental Material, Figure S1), then used the first 20 Mbp of the X chromosome to perform recalibration individually for each read group, taking the inferred PMD pattern into account (Figure S2). Finally, we used both the inferred PMD patterns, as well as the recalibrated quality scores, to infer θ in windows of 1 Mbp in the whole genome, excluding windows closer than 5 Mbp to telomeres or centromeres as defined by the track Gap in group Mapping and Sequencing in the UCSC Table Browser (Karolchik et al. 2008).

The samples that we analyzed this way were (1) two European hunter-gatherer individuals (Jones et al. 2015), namely the Mesolithic genome “Kotias” from Kotias Klde cave from Western Georgia (KK1), and the western European Late Upper Paleolithic genome, “Bichon” from Grotte du Bichon, Switzerland (Bich), ∼17,700 years old (2) an individual from the Bronze age burial site at Ludas-Varjú-dülö, Hungary (BR2 Gamba et al. 2014), and (3) a 4500-year-old male from Mota Cave in the southern Ethiopian highlands (Gallego Llorente and Jones 2015). All these samples had relatively high coverage (), and thus allowed us to infer fine scaled patterns of heterozygosity along the genomes, even for regions with low diversity ().

For comparison, we also inferred diversity patterns for nine modern males from three populations that were analyzed as part of the 1000 Genomes Project, phase 3 (alignment files downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/). These were the British males HG00115, HG00116, and HG00117, Tuscan males NA20509, NA20511, and NA20762, and Yoruban males NA18486, NA18519, and NA18522. As shown in Figure 6B, these nine individuals portrayed the expected pattern of higher diversity in African than European individuals, but they also revealed significant variation among individuals of the same population (t-test, in at least two out of three possible comparisons in each population). Larger differences in overall diversity were observed among the ancient samples analyzed. Unsurprisingly, the African sample Mota exhibited the highest diversity of all ancient samples, which was also higher than the diversity observed in modern day Europeans, yet lower than modern day Yorubans. The ancient sample with the second highest diversity was the Bronze Age sample BR2, whose diversity falls well within the range of estimates obtained from modern day Europeans. In contrast, the two European hunter-gatherer samples KK1 and Bichon showed much lower diversity than modern day Europeans with their median estimates being 15–25% lower than the median estimates of modern Europeans. These results suggest that while hunter-gatherer populations had much lower diversity, the diversity found in Europeans about 3000 years ago was very comparable to the diversity observed today. This conclusion is in perfect agreement with a temporal trend in the total length of runs of homozygosity (ROH) inferred from imputed genotypes among ancient samples from Hungary that spanned a period from 5700 to 100 bc and also included the sample BR2 (Gamba et al. 2014).

Local diversity in ancient and modern humans. (A) Heterozygosity (θ) inferred in 1 Mb windows along the first 75 Mbp of chromosome 1 (excluding windows closer than 5 Mbp of the telomere) for two modern Europeans (TSI2 and GBR2), and two ancient European hunter-gatherers (KK1 and Bich). Solid lines indicate the MLE estimate, shading indicates the 95% confidence intervals and dashed lines the genome-wide median for each sample. (B) Distribution of estimates in 1 Mbp windows across the first 22 chromosomes of each sample. (C) Similarity in the pattern of θ along the genome visualized by hierarchical clustering using 1 − Spearman correlation as distance.

The inference of local diversity patterns also allows us to compare the distribution of diversity in the genome between individuals, regardless of the overall level of diversity. This analysis revealed a substantial phylogenetic signal in the distribution of diversity as quantified by Spearman correlations. For instance, the diversity pattern is more strongly correlated among modern Yorubans (pairwise Spearman correlations between 0.55 and 0.60) than between Yorubans and Europeans (0.40–0.50). Similarly, the diversity pattern of the ancient African samples Mota is most strongly correlated with that of modern day Yorubans (0.491–0.537), and much less so with modern day Europeans (0.362–0.433). Interestingly, European samples are more diverse in their patterns than Africans (pairwise Spearman correlations between 0.42 and 0.48), and their correlations do not exceed those obtained when comparing African and European individuals. Nonetheless, hierarchical clustering groups all modern day Europeans together, and also puts the Bronze age sample BR2 at the basis of that clade (Figure 6C).

The lowest pairwise correlations (0.28–0.39) were found for comparisons involving the two European hunter-gatherer samples KK1 and Bich, with the overall lowest being the correlation between these samples (0.28). This is also illustrated visually when plotting our estimates of the first 75 Mbp of chromosome 1, where we found relatively high concordance in local diversity among the two European samples, but vastly different patterns among the hunter-gatherer samples (Figure 6C). These results suggest that, despite very comparable overall levels of diversity, the distribution of diversity along the genome was very diverse among European hunter-gatherer populations, and very different from the one observed among modern day individuals. Multiple observations support such a conclusion: first, the two samples analyzed here represent two vastly different geographic regions, with one being a sample from Switzerland, and the other from Georgia, and were previously reported to belong to two different clades that split 45,000 years ago, as inferred from genotyping data (Jones et al. 2015). Second, the ancestry of modern Europeans traces only partly back to European hunter-gatherers, with early Neolithic people from the Aegean (Hofmanová et al. 2016) and Yamnaya steppe herders (Haak et al. 2015) contributing the majority of the modern-day genetic makeup. Finally, the two European hunter-gatherer samples both exhibit many, but unique, regions of very low diversity ( in 4% of all windows, cf. 0.00–0.03% in all modern Europeans), which do not have particularly low coverage. They are likely the result of small population sizes with some level of consanguinity in the population (Pemberton et al. 2012).

Discussion

Quantifying genetic diversity and comparing it between different individuals and populations is fundamental to understanding the evolutionary processes shaping genetic variation. Unfortunately, the inference of heterozygosity is confounded by both sequencing errors, resulting in false diversity as well as the statistical power to identify heterozygous sites, particularly when coverage is low. Several methods have been developed to learn about heterozygosity probabilistically, that is, without the need to first call genotypes. A rather recent such approach (Bryc et al. 2013) proposes to leverage data from external reference individuals to obtain an unbiased estimate of the probability that a specific sites is heterozygous. The expected heterozygosity is then estimated from these site-specific estimates. Since this approach requires per site estimates to be accurate, only sites with a coverage of or higher can be included in the analysis, which severely limits the scope of the application.

An alternative is to infer heterozygosity probabilistically from a collection of sites. Among the earliest such methods was a likelihood-based estimator (Lynch 2008), which infers heterozygosity of an individual jointly with the rate sequencing errors. We presented a natural extension of this approach that relaxes the infinite sites assumption and integrates PMD, a particular feature of ancient DNA that is not captured by base quality scores provided by sequencing machines. We achieved this by extending Felsenstein’s model of substitutions (Felsenstein 1981) with an explicit model of next-generation sequencing data that incorporates both sequencing errors as well as errors arising from PMD. We have previously used the same PMD error model for variant calling from ancient samples (Hofmanová et al. 2016), and we note that it may be used to extend any other inference method based on genotype likelihood to ancient DNA.

Following other recently developed approaches to infer genetic diversity from next-generation sequencing data (e.g., Nielsen et al. 2012; Korneliussen 2014; Maruki and Lynch 2015), our method also relaxes the assumption of constant error across all reads by benefiting from the base-specific quality information provided by current sequencing technologies. Yet since these provided quality scores are often distorted, we also introduced here a method to recalibrate the quality scores for low coverage genomes. In contrast to commonly used methods for recalibration (e.g., McKenna et al. 2010; DePristo et al. 2011; Cabanski et al. 2012), our approach does not require information about the underlying sequence context. It only assumes sites to be monomorphic while integrating over the uncertainty of the sequence itself. Examples of regions known to be monomorphic are the sex chromosomes in hemizygous individuals. But, since we found that our method recalibrates quality scores with high accuracy and reliably, even based on DNA stretches as short as 1 Mbp, we are confident that it will work even on ultraconserved elements or plasmid DNA. Finally, we note that if multiple individuals are sequenced together, they are likely affected by the same distortion of quality scores, and can hence be recalibrated with parameters inferred from a subset of them (e.g., the male samples).

As an illustration, we applied the methods developed here to modern and ancient human samples of various coverage. While our approach to infer heterozygosity incorporates the possibility of PMD, it assumes that the probability of a PMD event occurring is known. We thus also introduce a method to infer these probability functions from raw data, which is robust to divergence between the sample and the reference genome. By inferring PMD patterns for each sample, then the recalibration parameters, and, finally, local diversity in 1 Mbp windows, we found that both ancient and modern African samples exhibited much larger diversity than European individuals. In addition, the diversity inferred from two ancient European hunter-gatherer samples was much lower than that of modern samples, which is likely explained by smaller population sizes. Besides overall differences in diversity, also the pattern of diversity along the genome revealed a strong geographic clustering among modern and ancient samples. The exceptions were the two European hunter-gatherers that showed patterns very different from both modern samples, as well as from one another, further corroborating the view (Jones et al. 2015) that these samples represent different and ancient clades that contributed only marginally to the genetic make-up of modern day Europeans.

Acknowledgments

We are grateful to two anonymous reviewers for their very constructive comments on an earlier version of this work. This study was supported by Swiss National Foundation grant number 31003A_149920 to D.W.

Appendix

Emission Probabilities in the Presence of PMD

Following Lynch (2008) and commonly used approaches (e.g., Li 2011), we assume here that a sequencing read is equally likely to cover any of the two alleles of an individual, and that sequencing errors may result in any of the alternative bases with equal probability, In the absence of PMD, the probability of observing a base, , given the underlying genotype is then given byIn ancient DNA, differences between the base observed within a read and the underlying alleles may also be the result of PMD. Following Hofmanová et al. (2016), let us denote by and , the known probability that a or PMD occurred at the position covering site i in read j, respectively. In the presence of PMD, the probability of observing a base, , given the underlying genotype is given by

Newton-Raphson Algorithm to Infer PMD Patterns

Under the model proposed in Equation 10, the log likelihood of the data are given bythe gradient vector byand the Jacobian matrix bywhereandThe Newton-Raphson iteration for is given by

(15)

EM Algorithm to Infer Base-Specific Error Rates

We propose an EM algorithm for this estimation that is similar to the one above, but assume here that the base frequencies are known, i.e., can be derived accurately from counting in the region. The complete data log-likelihood of our model is given byFrom this, we get the expected complete data log-likelihoodFor the M-step, we need only to consider the first part of whereby Bayes’ formula. From (13), we get more explicitlywhere we used the abbreviations andIn order to maximize for we calculate the gradient vector with components(16)for From (12), we obtainObserve that and for

We solve with the Newton-Ralphson method with the Jacobian matrix From (16), we getwhere

, 2006Inference of population genetic parameters in metagenomics: a clean look at messy data Inference of population genetic parameters in metagenomics: a clean look at messy data.Genome Res.16: 1320–1327.

The Genetics Society of America (GSA), founded in 1931, is the professional membership organization for scientific researchers and educators in the field of genetics. Our members work to advance knowledge in the basic mechanisms of inheritance, from the molecular to the population level.