Abstract

Genetic maps, which document the way in which recombination rates vary over a genome, are an essential tool for many genetic analyses. We present a high-resolution genetic map of the human genome, based on statistical analyses of genetic variation data, and identify more than 25,000 recombination hotspots, together with motifs and sequence contexts that play a role in hotspot activity. Differences between the behavior of recombination rates over large (megabase) and small (kilobase) scales lead us to suggest a two-stage model for recombination in which hotspots are stochastic features, within a framework in which large-scale rates are constrained.

Several recent studies (1, 2) have shown that fine-scale recombination rates can be successfully estimated from genetic variation data by coalescent-based methods, but to date these have only been applied to small fractions of the human genome. Here we studied recombination rates across the entire genome by applying one such method, LDhat (1), to a previously published genome-wide survey of genetic variation in which ∼1.6 million single-nucleotide polymorphisms (SNPs) were genotyped in three samples: 24 European Americans, 23 African Americans, and 24 Han Chinese from Los Angeles (3). Informally, the method fits a statistical model based on the coalescent to patterns of linkage disequilibrium, the nonrandom association between nearby SNPs, and then estimates recombination rates within this model in a Bayesian framework in which the prior distribution encourages smoothness and reduces overfitting in estimated rates. Recombination rates were estimated separately for each population sample and averaged to give a single estimate (4). As a further validation of the approach used here, scatterplots of our estimated recombination rates against known rates from pedigree studies (5, 6) show extremely good agreement over the megabase scales for which the pedigree rates have good resolution (fig. S1, genome-wide R² = 0.96). At the fine scale, we find strong concordance between genetic variation–based and sperm-typing estimates of recombination rates and the location of hotspots across 3.3 Mb of the human major histocompatibility region (4, 7) (fig. S2).

The fine-scale genetic map for each of the 22 autosomes and the X chromosome is shown in fig. S3 (8); Fig. 1 shows an example for chromosome 12. Compared to existing genetic maps, recombination rates show much greater variation at fine scales (Figs. 1 and 2A), and pedigree-based rate estimators are poor predictors of rates over smaller physical distances (Fig. 2B). For example, across the genome, the rank correlation between the rate over each 5-cM region and that for the 50-kb interval centered within it is 0.35, and the rank correlation between the rate over each 5-cM region and that for the 5-kb interval centered within it is 0.24. In large part this is because the fine-scale recombination landscape is dominated by recombination hotspots: Rate estimates show sharp, narrow peaks, with the bulk of the recombination occurring in a small proportion of the sequence. Typically, 80% of the recombination occurs in 10 to 20% of the sequence (Fig. 2C). An interesting exception to this pattern is chromosome 19, which has a much lower density and intensity of hotspots in addition to having the highest gene density (9) and proportion of open chromatin (10).
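The concentration statistic quoted above (the share of sequence that carries a given share of the recombination) can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the procedure used to produce Fig. 2C: the function name, the representation of the map as one rate and one physical length per SNP interval, and the toy numbers are all hypothetical.

```python
def sequence_fraction_for_recombination(rates, lengths, target=0.8):
    """Smallest fraction of sequence that carries `target` of all recombination.

    rates   -- recombination rate of each interval (e.g. cM/Mb); hypothetical input
    lengths -- physical length of each interval (e.g. bp)
    """
    # Rank intervals from hottest to coldest.
    intervals = sorted(zip(rates, lengths), key=lambda rl: rl[0], reverse=True)
    total_genetic = sum(rate * length for rate, length in intervals)
    total_length = sum(length for _, length in intervals)
    if total_genetic <= 0 or total_length <= 0:
        raise ValueError("need positive total recombination and sequence length")

    # Small relative tolerance guards against floating-point round-off.
    threshold = target * total_genetic * (1 - 1e-12)
    genetic_so_far = 0.0
    length_so_far = 0.0
    for rate, length in intervals:
        genetic_so_far += rate * length
        length_so_far += length
        if genetic_so_far >= threshold:
            break
    return length_so_far / total_length


# Toy map: ten 1-kb intervals, two of them "hot".
toy_rates = [40, 2.5, 2.5, 2.5, 40, 2.5, 2.5, 2.5, 2.5, 2.5]
toy_lengths = [1000] * 10
print(sequence_fraction_for_recombination(toy_rates, toy_lengths))  # 0.2 for this toy input
```

In the toy map, the two hot intervals hold 80% of the genetic length in 20% of the sequence, which is the kind of concentration summarized in Fig. 2C.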

Fig. 1. Recombination rate variation along chromosome 12. Shown are estimated recombination rate (black), locations of statistically significant recombination hotspots [triangles; colors indicate relative amount of recombination from low (blue) to high (red)], and estimated recombination rates from the deCODE (6) genetic map (red curve near bottom). Also shown are the location of ENSEMBL genes on the two strands (blue segments), fluctuations in local GC content (gray lines; averages over 1000-bp windows shown on an arbitrary scale), and an ideogram of chromosome banding.

Fig. 2. Fine-scale recombination rate variation. (A) Histograms of the recombination rate for regions of sizes 5 Mb, 500 kb, 50 kb, and 5 kb across the genome, showing substantially increased variability, and a distribution dominated by a small proportion of large values, for smaller regions. Coefficients of variation for the four scales are 0.44, 0.67, 1.2, and 2.0, respectively. (B) Scatterplots of the recombination rate of larger regions, of sizes 5 Mb (top row), 500 kb (middle row), and 50 kb (bottom row), compared to a smaller region (500 kb, 50 kb, and 5 kb) centered in the large region. (C) Proportion of the total recombination in various percentages of sequence, plotted for each chromosome.

Earlier analyses of pedigree-based maps reported the presence of recombination deserts (6, 11)—large (megabase-sized) regions where there is very little or no recombination. We also identified such regions (the left-hand tail of the 5-Mb histogram in Fig. 2A), but closer inspection revealed that in all such deserts there are recombination hotspots—although they are relatively scarce and have low intensities. An additional nonparametric analysis (12), which allows for plausible levels of genotyping error (13), shows that apart from the centromeres, there are no regions of the genome larger than 200 kb that are completely devoid of recombination.

To date, fewer than 20 human hotspots have been identified by direct analyses, typically through sperm typing in males. Previous coalescent analyses of population data have identified rather more (1, 2). The approach we have applied here and elsewhere (1, 14) (“LDhot”) has recently been shown empirically to have reasonable power and, crucially, a low false positive rate (15). Using it, we identified more than 25,000 recombination hotspots (4), all but a few hundred of which had not previously been characterized. Of the hotspots detected in other studies where full resequencing and very dense genotyping were used, we identified 50 to 60% (4), which suggests an average hotspot density across the genome of approximately one every 50 kb.

The figure of 25,000 to 50,000 recombination hotspots is comparable to estimates of the number of genes in the human genome (16) and therefore suggestive of the alpha-hotspot model in yeast (17), where recombination machinery is recruited to sites bound by transcription factors. To address the relationship between genes and recombination, we have plotted recombination rate as a function of distance from the start codon (Fig. 3). In contrast to the alpha-hotspot model, we find that recombination rates are on average lower within genes and increase with distance from the gene symmetrically in either direction for ∼30 kb before decreasing again. In short, recombination hotspots in humans seem to preferentially occur near (within 50 kb of) genes, but are preferentially located outside the transcribed domain.

Fig. 3. Recombination rate as a function of distance from the nearest gene. The average recombination rate for SNP intervals is shown as a function of the physical distance between the midpoint of the SNP interval and the nearest gene (both 5′ and 3′). The left and right vertical tick marks respectively refer to the average rate in the first and last exons, and the central tick mark is the average over all intervals in internal exons; intervening points represent rates in first, internal, and last introns, respectively. Some caution may be needed in interpreting the results for the largest distances from genes. These relate to data points in gene deserts, which have atypical sequence features, and the decreased rate may be due to these other features rather than directly to the distance from the gene.

To investigate whether other factors—particularly repeat sequences and simple sequence motifs—are associated with hotspots, we matched detected hotspot regions with regions of the same size and SNP density that showed no evidence for being a hotspot (we call these matched regions “coldspots”). We found interesting differences in the frequency of certain sequence features between hotspots and coldspots (4). For example, the long terminal repeats of two retrovirus-like retrotransposons, THE1A [frequency in hotspot/frequency in coldspot (RR) = 2.3] and THE1B (RR = 1.7), are strongly overrepresented in hotspots, as are CT-rich repeats (RR = 1.4) and GA-rich repeats (RR = 1.4). By contrast, (TA)n repeats (RR = 0.7), GC-rich repeats (RR = 0.3), and certain L1 long interspersed nuclear elements (LINEs) (RR = 0.4) are strongly underrepresented in hotspots (in terms of both presence and length).
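The relative frequency (RR) reported for each sequence feature is a simple ratio, sketched below. The function name and the counts are hypothetical and chosen only to reproduce a ratio of the reported size; the sketch assumes RR is computed from the proportion of regions containing the feature, whereas the full comparison (4) also considers element length.

```python
def relative_frequency(hot_with_feature, n_hotspots, cold_with_feature, n_coldspots):
    """RR = (frequency of a feature in hotspots) / (frequency in matched coldspots)."""
    if n_hotspots <= 0 or n_coldspots <= 0:
        raise ValueError("region counts must be positive")
    hot_freq = hot_with_feature / n_hotspots
    cold_freq = cold_with_feature / n_coldspots
    if cold_freq == 0:
        raise ValueError("feature absent from coldspots; RR is undefined")
    return hot_freq / cold_freq


# Hypothetical counts: a feature seen in 230 of 10,000 hotspots
# and in 100 of 10,000 matched coldspots.
rr = relative_frequency(230, 10_000, 100, 10_000)
print(round(rr, 2))
```

Values above 1 indicate overrepresentation in hotspots and values below 1 indicate underrepresentation, matching the direction of the RR values quoted above.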

Comparison of THE1A/B elements in hotspots with those outside of hotspots revealed several marked sequence differences. The strongest signal is for the 7-nucleotide oligomer CCTCCCT, which aligns to positions 261 to 267 in the THE1B consensus (18). This motif is more frequent in hotspot THE1Bs than in THE1Bs elsewhere in the genome by a factor of 5.9 (P < 10⁻³³, Fisher's exact test); similarly, it is more frequent in hotspot THE1As than in THE1As elsewhere in the genome by a factor of 5.1 (P < 10⁻⁵, Fisher's exact test). In each case, the consensus sequence for the repeat has a C in the seventh position. Two separate lines of evidence also point to the motif CCTCCCT, or possibly a larger motif containing it, playing an important role in hotspot determination. After masking repeat elements, we compared hotspot and coldspot regions for differences in the frequency of all motifs of length 5 to 9; the motif CCTCCCT showed the greatest enrichment in hotspots of any of the 8192 7-nucleotide oligomers. Further independent evidence comes from sperm typing studies. These have previously shown directly that the recombination rate at hotspot DNA2 is polymorphic in men, with a specific polymorphism suppressing hotspot activity (19). We examined the DNA sequence immediately surrounding this polymorphism. Strikingly, chromosomes active for the hotspot contain the motif CCTCCCT, with the “suppressor” mutation being a change from T to C in its third position. The THE1A/B context seems to be strongly influential in the function of this motif (although, as in the case of DNA2, not essential). We estimate that within this THE1A/B background, the motif will result in a hotspot 60% of the time, but outside of repeats this reduces to as little as 2 to 3%. Overall, the motif could explain about 11% of the 25,000 hotspots we studied, with occurrences outside THE1A/B elements contributing most of this total because of their far greater frequency in the genome (4).
The CCTCCCT motif does not appear to match any sequences previously linked to recombination activity.
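The motif scan described above can be sketched in a few lines. The sketch rests on one interpretive assumption: the count of 8192 distinct 7-nucleotide oligomers is what results when a motif and its reverse complement are treated as the same oligomer (4⁷ = 16,384 strand-specific sequences, none of which is its own reverse complement because the length is odd). The function names and the toy sequence are hypothetical, and the sketch does not reproduce the repeat masking or the statistical testing of the actual analysis.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Strand-independent label: the alphabetically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(k):
    """Number of distinct k-mers when the two strands are not distinguished."""
    return len({canonical("".join(bases)) for bases in product("ACGT", repeat=k)})


def count_motif(seq, motif):
    """Count occurrences of `motif` on either strand of `seq` (overlaps allowed)."""
    seq = seq.upper()
    targets = {motif, reverse_complement(motif)}
    k = len(motif)
    return sum(seq[i:i + k] in targets for i in range(len(seq) - k + 1))


print(count_canonical_kmers(7))                        # 8192
print(reverse_complement("CCTCCCT"))                   # AGGGAGG
print(count_motif("AACCTCCCTGGAGGGAGGTT", "CCTCCCT"))  # 2: one hit on each strand
```

Counting per-region occurrences in this way for hotspots and for matched coldspots yields the contingency tables on which an enrichment test such as Fisher's exact test can be run.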

Our analysis also reveals how sequence-context effects in determining recombination hotspots can be both complex and extensive. First, we have identified additional motifs that are enriched among the THE1A/B elements within hotspots and that are both independent of and at some distance [up to 132 base pairs (bp)] from the 7-nucleotide oligomer described above (4). Second, we find that L1 elements are strongly underrepresented in hotspots, and the effect is stronger the nearer the elements are to their full length of 6 to 7 kb. In Drosophila, transposons are biased toward regions of low recombination (20, 21), which has been interpreted as evidence for selection against genome instability caused by ectopic recombination (either direct selection against elements that transpose to hotspots and subsequently cause genome rearrangement, or a chromatin-silencing mechanism of the genome that restricts germline transposition and consequently recombination). However, the sequence contexts found in repeat elements, rather than their transposon ancestry, may well be the primary determinant of their relationship with recombination in humans, given our findings that repeat elements can be enriched, suppressed, or unaffected by recombination hotspots; that specific motifs within these hotspots can have additional influences on recombination activity; and that other motifs (such as the 9-nucleotide oligomer CCCCACCCC) can play a role in promoting hotspot recombination outside repeat elements (4).

The extent to which differences in recombination rates over larger scales are due to differences in the numbers of hotspots, and/or to differences in the intensities of the hotspots in different regions, is an open question (22). We find that both are important determinants of large-scale recombination rate variation. For example, it is well known that over large scales, recombination rates tend to be higher in telomeric, as compared to centromeric, chromosomal regions. In telomeric regions, the mean detected hotspot spacing is 90 kb and the mean intensity (total rate across the hotspot) per hotspot is 0.115 cM, whereas for centromeric regions the mean spacing is 123 kb and the mean intensity is 0.070 cM.
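A back-of-envelope calculation shows how the two factors combine. It assumes, as a simplification not made in the text, that the recombination rate attributable to detected hotspots is approximately the mean hotspot intensity divided by the mean hotspot spacing, ignoring background recombination and undetected hotspots. The function name is hypothetical; the four input values are those quoted above.

```python
def hotspot_rate_cm_per_mb(mean_spacing_kb, mean_intensity_cm):
    """Approximate rate contributed by hotspots: intensity per hotspot / spacing in Mb."""
    return mean_intensity_cm / (mean_spacing_kb / 1000.0)


telomeric = hotspot_rate_cm_per_mb(90, 0.115)     # about 1.28 cM/Mb
centromeric = hotspot_rate_cm_per_mb(123, 0.070)  # about 0.57 cM/Mb

density_ratio = 123 / 90         # telomeric hotspots are about 1.37 times as dense
intensity_ratio = 0.115 / 0.070  # and about 1.64 times as intense
overall_ratio = telomeric / centromeric  # about 2.25, the product of the two ratios

print(round(telomeric, 2), round(centromeric, 2))
print(round(density_ratio, 2), round(intensity_ratio, 2), round(overall_ratio, 2))
```

Under this approximation the roughly 2.2-fold difference between telomeric and centromeric regions splits into a 1.4-fold contribution from hotspot density and a 1.6-fold contribution from hotspot intensity, consistent with both being important.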

Population data only allow estimation of sex-averaged historical recombination rates (23). Notwithstanding this, some information can be gleaned about sex-specific differences in the recombination process. We do this in two ways. The first is to compare recombination on the X chromosome (which is necessarily female-specific) with that on the autosomes (where our estimates are sex-averaged). The second is by comparing autosomal regions in which pedigree estimates show large differences between male and female rates. Although the smaller sample size and SNP density on the X chromosome reduce power to detect hotspots, these are definitely present and hence definitively a feature of the human female recombination process. To enable direct X-autosome comparisons, we created a new autosomal data set matched to the X chromosome for sample size, SNP density, and the allele frequency spectrum by randomly deleting individuals and SNPs from the original data. We then reestimated recombination rates and hotspots from this matched autosomal data set. Having done so, the overall recombination landscape, and the extent to which it is dominated by hotspots, looks similar between the X chromosome and the autosomes (4) (fig. S4).
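The matching step can be sketched as random thinning of a genotype matrix. This is a simplified illustration, not the procedure of (4): it matches only sample size and SNP density, and omits the matching of the allele frequency spectrum. The function name, the data layout (one list of genotypes per individual), and the parameter names are hypothetical.

```python
import random


def thin_to_match(genotypes, positions_bp, n_keep_individuals, target_snps_per_kb, seed=0):
    """Randomly delete individuals and SNPs to reach a target sample size and SNP density.

    genotypes    -- list of rows, one per individual, each with one entry per SNP
    positions_bp -- physical position of each SNP, in the same order as the columns
    """
    rng = random.Random(seed)  # seeded so that the thinning is reproducible

    # Keep a random subset of individuals (rows).
    kept_rows = sorted(rng.sample(range(len(genotypes)), n_keep_individuals))

    # Keep enough randomly chosen SNPs (columns) to reach the target density.
    span_kb = (max(positions_bp) - min(positions_bp)) / 1000.0
    n_keep_snps = min(len(positions_bp), int(target_snps_per_kb * span_kb))
    kept_cols = sorted(rng.sample(range(len(positions_bp)), n_keep_snps))

    thinned = [[genotypes[i][j] for j in kept_cols] for i in kept_rows]
    return thinned, [positions_bp[j] for j in kept_cols]


# Toy data: 6 individuals, 11 SNPs spaced 1 kb apart (a 10-kb span).
toy_positions = list(range(0, 11000, 1000))
toy_genotypes = [[(i + j) % 3 for j in range(11)] for i in range(6)]
matched, matched_positions = thin_to_match(toy_genotypes, toy_positions, 4, 0.5)
print(len(matched), len(matched_positions))  # 4 individuals, 5 SNPs
```

Rates and hotspots would then be reestimated on the thinned data so that differences in power between the X chromosome and the autosomes do not confound the comparison.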

We also identified 66 paired regions of 2 Mb each, matched for sex-averaged recombination rate, where one region had a strong skew toward male recombination and the other had a strong skew toward female recombination [as estimated from the deCODE (6) map; see (4)]. Regardless of whether the region had much higher male rates than female rates or the opposite, the local recombination landscapes looked similar with regard to hotspot density, hotspot intensity (except for a slight deficit of very hot hotspots in regions where female rates are much higher than male rates), and the proportion of recombination in different proportions of sequence (fig. S4). Previous studies (1, 15) have argued that there are no female-specific hotspots in two small regions, each about 200 kb. Our analyses show that there are definitely hotspots in female recombination, and suggest that the overall density and (except in the right-hand tail) intensity of hotspots, and their role in determining overall rates, are similar in male and female meioses.

Several studies (14, 24–26) have shown substantial differences between recombination hotspots in humans and chimpanzees, establishing that hotspots have evolved very quickly relative to sequence differences. More recently, two lines of evidence—namely hotspots characterized by sperm typing that are present in some men but not in others (19, 27) and differences between contemporary intensities and historical rates in hotspots—suggest that hotspots may even be evolving over the time scales of human polymorphism (15). This leads to a striking paradox. As noted above (Fig. 2C) and elsewhere (1, 14), it is the pattern of hotspots and hotspot intensities that is largely responsible for recombination rate variation over centimorgan scales. The hotspots themselves are evolving very rapidly. So over centimorgan scales, contemporary sex-averaged rates estimated from pedigrees should differ from historical rates estimated from population data. But this is not the case. In contrast to the picture at individual hotspots (15), contemporary rates and historical rates are effectively identical over 5-Mb regions (fig. S1). Furthermore, over these large scales, recombination rates are reasonably well predicted by a few sequence and genomic features (6, 22). But by combining multiple fine-scale genomic features, we can, at best, explain little more than 4% of the variation in recombination rate at the 5-kb scale (corresponding to the scale of recombination hotspots) (4). Finally, although there are documented differences in the presence (or absence) of hotspots and their intensities in men, no significant differences have been detected in male genome-wide recombination rates (5, 6).

This suggests a two-stage process for recombination. A possible model is that recombination rates are constrained over large scales, plausibly by physical stresses acting on the DNA and/or by access to the recombination machinery, and that these constraints are slowly evolving and can be reasonably well predicted by sequence and genomic features. Within this big picture, the fine-scale landscape of hotspots is much less constrained and is rapidly evolving. Several lines of evidence support meiotic drive as a mechanism by which hotspots will in effect extinguish themselves (15, 19), but the so-called “hotspot paradox” asks where new hotspots come from. If recombination rates over large scales are tightly constrained, then it might be the relative, rather than the absolute, propensity toward formation of double-strand breaks (DSBs) that matters. Over time, when one hotspot dies out, the relative recombination rates of others may increase, and/or new hotspots may arise in the locations that are next in an evolving queue of relative likelihood for DSB formation. This is consistent with observations in yeast of local competition between hotspots: A high frequency of DSBs at one site suppresses DSBs at nearby sites (28–32).

There are several testable predictions of this model. One is that large-scale recombination rates are likely to be correlated between humans and chimpanzees, in contrast to fine-scale rates. Although there is evidence for weak correlation of human and chimpanzee rates over intermediate (50-kb) scales (24), comparisons over larger scales will require either a chimpanzee genetic map or coalescent analyses of much larger chimpanzee polymorphism surveys than are currently available. Another prediction relates to hotspots detected by sperm typing that are polymorphic among men. In a set of men who do not have a particular hotspot, the model would predict increased activity in other hotspots and a similar total amount of recombination over large regions containing the polymorphic hotspot.

Historical rates represent time averages of the recombination rate over the time scales during which the polymorphism has evolved. In the case of humans, this is likely to be on the order of 500,000 to 1 million years.

We thank D. Cox and colleagues at Perlegen, and C. Spencer. Supported by NIH grant U54 HG2750 and the SNP Consortium (G.M.) and by NIH grant U54 HG2750, the Nuffield Trust, the SNP Consortium, the Wellcome Trust, and the Wolfson Foundation (P.D.).