The idea for this paper emerged during a long drive across New York State, from Cold Spring Harbor to Ithaca, after the 2011 Biology of Genomes meeting. Two postdocs in my research group, Ilan Gronau and Leo Arbiza, were riding with me in my old Subaru, trying not to express too much alarm at my distracted driving. Also with us in the car was Ran Blekhman, who was at the time a postdoc with Andy Clark. (Ran is now an assistant professor at the University of Minnesota.)

Our conversation turned to important open questions in computational genomics, and in particular, to ways of making better use of the vast quantities of functional genomic data being pumped out of projects such as ENCODE. At the time, Ilan, Leo, and I were thinking a lot about how to use patterns of within-species polymorphism and between-species divergence to shed light on the influence of natural selection on regulatory sequences. Along these lines, we had spent much of the spring developing our new INSIGHT method [1,2], and we had just presented this work for the first time at Biology of Genomes. Ran, however, was pushing us to think less about abstract evolutionary questions and more about genomic function and disease association. He made a strong case that the biggest obstacle to progress in medically related human genomics was the absence of adequate functional annotations in noncoding regions of the human genome.

For a while the conversation went in circles, as we grasped for ways of measuring “functional potential” across the genome that would make use of genomic data yet be grounded in evolutionary theory. Then it suddenly dawned on us that we already had in hand a key piece of what we needed. The INSIGHT program was designed to estimate, for any collection of nucleotide positions across the genome, the fraction (denoted ρ) of those positions that were directly influenced by natural selection, in the sense that point mutations at those positions tended either to increase or to decrease the fitness of an organism. We realized that an alternative way of interpreting ρ was as a probability that the nucleotide at each position in the analyzed collection influenced fitness, assuming exchangeability of sites (as INSIGHT does).

All that we needed, therefore, was a general way of partitioning nucleotide positions from across the genome into distinct classes that were reasonably homogeneous in their functional roles. We could then estimate ρ for each class using INSIGHT, and assign to each genomic position the estimate of ρ for the partition to which that position belonged. This procedure would produce a “score” across the genome that looked roughly like widely used evolutionary conservation scores, but instead of representing local divergence patterns across the mammalian phylogeny, the score at each position would be estimated from groupwise patterns of polymorphism and divergence and would be directly interpretable as a probability of fitness consequences. Later Ilan would dub these “fitCons” scores, to emphasize this fitness-related interpretation. (“FitCons” also nicely parallels “phastCons,” our first conservation-scoring method.)

Because INSIGHT measures selection on recent time scales, fitCons scores would circumvent a major shortcoming of standard evolutionary conservation scores—that they require functional roles to have remained consistent over very long evolutionary time periods (tens to hundreds of millions of years) in order to be detectable in divergence patterns at individual sites or small loci. In principle, fitCons scores should be able to detect selection (hence potential function) at sites whose functional role had emerged quite recently, perhaps even along the human lineage.

The Problem of Grouping

The piece that was still missing in our plan was a particular scheme for grouping together similar genomic sites from across the genome. We did not get to the point of working this problem out in any detail during our revelatory drive to Ithaca, and, as it happened, it took several more months to settle on a solution. By this time, a Ph.D. student from Computer Science, Brad Gulko, had joined the project and assumed the lead in implementing a prototype of the scores.

At first, Brad, Ilan, Leo, and I spent some time thinking about fancy algorithms for clustering genomic sites that would consider functional and evolutionary information jointly. However, it did not take long to realize that this was a hard problem. Eventually, we decided to move forward with a simple grouping scheme, based on functional genomic data alone. This would allow us to cluster genomic sites in a pre-processing step and avoid the need for an iterative solution. Our hunch was that the scores would not be too sensitive to the grouping scheme as long as it was reasonable. As we discuss in our article, it may be worthwhile to revisit this clustering approach eventually, but it appears to be adequate for our current purposes. (I hope to convince Brad to discuss some of the technical issues with the clustering problem in an upcoming blog post.)

Relevance to the “Share Under Selection” in the Human Genome

By the fall of 2012, we had finally settled on an initial set of fitCons tracks and were beginning to observe decent prediction performance for cis-regulatory elements, when the human genomics community was thrown into a frenzy by a deluge of publications and accompanying press releases from the ENCODE Consortium. This event led to the now-famous controversy over what fraction of the genome is truly “functional” and whether ENCODE’s measures of “reproducible biochemical activity” (which apply to over 80% of the genome) were comparable in any meaningful way to the “share under selection” (SUS) estimated from comparative genomics (which generally came out to 5–10%).

I do not wish to rehash the familiar terms of this debate here, but I do want to focus on one aspect of it that was particularly relevant to our work. Many of the criticisms of ENCODE reminded readers that comparative genomic analyses pointed to a SUS of ~5–10%, suggesting that 80% might be a gross over-estimate of the functional content of the genome. However, others pointed out that these comparative-genomic estimates applied only to the fraction of the genome that had been under long-term selective constraint, because evolutionary turnover of functional elements—if it occurred at appreciable rates—could bias estimates based on long-term genomic divergence substantially downward. (For the latest chapter in this saga, see a recent paper by Gerton Lunter, Chris Ponting, and colleagues [3].)

We realized that the fitCons scores could help address aspects of this controversy, because they were based on patterns of variation over much more recent time scales and should therefore be much less sensitive to turnover than scores based on divergence patterns across the mammalian phylogeny. Moreover, the fitCons scores, by making use of INSIGHT to interpret patterns of polymorphism and divergence, might provide substantially better estimates of the quantities of interest than simple analyses of SNP densities or allele frequencies, a few of which had appeared among the ENCODE publications. Finally, the INSIGHT-based estimates are unique in that they directly predict the SUS, without the need for separate thresholding, mixture deconvolutions, or enrichment analyses.

Somewhat surprisingly, when we estimated the SUS based on fitCons scores, we obtained values (4.2–7.5%) that were quite similar to those based on conservation patterns in mammals. There are a number of tricky technical issues involved in this type of estimation—for example, concerning the corrections for local mutation rates and coalescence times—but violations of our modeling assumptions generally should tend to push our estimated upper bound (7.5%) to conservatively high values, implying that the true value is lower than 7.5%. In addition, the correction we have applied to obtain our lower bound (4.2%) is quite conservative, making it likely that the true value is higher than 4.2%. Therefore we have high confidence that the fraction of the genome under detectable selection from the available polymorphism and divergence data is indeed fairly close to 5%. As we discuss in the paper, it is important to bear in mind that the absolute values of these estimates reflect constraint on the identities of individual nucleotides only, and do not take into consideration higher order constraints, for example, on element lengths or spacing. Nevertheless, the similarity of the estimates based on mammalian divergence and human polymorphism suggest that evolutionary turnover has not produced a major downward bias in conservation-based estimates of the SUS.

Ilan and I soon realized that we could go a step further in this analysis and compare the fitCons-based estimates with parallel estimates based on the same functional categories but a measure of natural selection based on divergence only. The idea here was to perform a direct “apples to apples” comparison of the fraction of the genome under selection as measured on two different time scales: the 1–5 million-year time scale measured by fitCons and the ~30 million-year time scale measured by an analogous method based on divergence patterns in four primate genomes (human, chimpanzee, orangutan, and rhesus macaque), which we called “fitConsD” (the “D” is for “divergence”). I won’t attempt to describe this analysis in detail here, but our general conclusion is that the estimates of selection are highly similar on these two different time scales, suggesting further that evolutionary turnover has not had a dramatic effect on the functional content of the human genome over the past 30 million years or so. It is worth noting that Lunter and colleagues’ recent analysis is not strictly incompatible with ours (they estimate 7.1–9.2% constraint at present and focus on turnover over longer time scales) but their qualitative interpretation suggests large amounts of turnover, while ours suggests modest amounts.

Scooped by CADD… or Perhaps Not

As grant proposals, other manuscripts, and job searches led to delays in writing up our work through 2013, we began to hear rumblings on social media about a method called CADD, developed by Greg Cooper and Jay Shendure’s groups, that sounded alarmingly similar to fitCons. Then, in early 2014, a paper by Kircher, Witten et al. describing CADD appeared in Nature Genetics [4]. When we saw this paper, our initial impression was that we had been scooped by sitting on a good idea for too long. CADD was described as a method that integrated functional and evolutionary data and produced a measure of “relative pathogenicity” across the entire genome, and it was motivated, in part, by its potential usefulness in noncoding as well as coding regions. The paper included several impressive-looking ROC plots in which CADD apparently outperformed conservation based methods such as phastCons, phyloP, and GERP by a significant margin. In addition, CADD made use of a support vector machine (SVM), which was potentially a highly flexible and powerful means for considering large numbers of covariates with arbitrarily complex correlations.

We decided, with a certain amount of dread, that we needed to add CADD to our empirical performance comparisons on putative regulatory elements. At the time, fitCons was showing clear advantages in predictive power for cis regulatory elements compared with conservation-based methods and a functional annotation database called RegulomeDB. FitCons had several potential advantages over CADD—for example, it made direct use of polymorphism data for prediction, it considered covariates in a cell-type-specific manner, and it avoided a need for brute-force simulations through its use of INSIGHT for inference—but we thought that the use of the SVM in CADD made it unlikely that fitCons could compete with it in a pure classification task. Nevertheless, Brad dutifully downloaded the CADD scores, added them to his experiments, and displayed curves for CADD in his ROC plots for three types of regulatory elements (ChIP-seq-supported transcription factor binding sites, eQTLs, and enhancer predictions based on chromatin marks).

To our surprise, fitCons significantly outperformed CADD in all of these tests. This was true for three different types of putative regulatory elements, and true whether or not we considered cell-type-specific test sets. In fact, CADD performed essentially no better than conventional conservation scores in these tests, in apparent contradiction to the results presented in the CADD paper.

A closer reading of the CADD paper revealed a possible explanation for these observations. While the method was motivated, in part, by its applicability to the entire genome, the authors’ validation experiments heavily emphasized coding regions. In fact, it appears that even the ROC plot for “genome-wide” results (Figure 3a in the paper) is actually based almost exclusively (>92% by our interpretation of the paper) on missense variants in coding regions. The experiments that included substantial numbers of noncoding sites, in turn, were much more indirect—for example, by showing correlations with derived allele frequencies (Figure 2), known disease-causing status (Figure 4), and changes in expression in saturation mutagenesis experiments at two enhancers and one promoter. It is possible to have correlations of this kind without having substantial predictive power for regulatory variants.

When Greg Cooper saw our initial preprint on bioRxiv, he raised two major objections to our validation experiments. First, he pointed out that we were measuring the sensitivity of the fitCons scores in terms of bulk coverage of elements, when those elements actually consist of a mixture of sites at which mutations are deleterious and sites at which mutations are neutral or nearly neutral (such as degenerate positions in transcription factor binding sites). This approach to measuring sensitivity may be overly generous to the fitCons scores, which are relatively “blocky” along the genome, varying little from one site to the next, in comparison to higher-resolution prediction methods that properly distinguish between functionally important and neutral sites within elements. Second, Greg pointed out that we were using a naive genome-wide background set for our eQTL, which did not properly account for the ascertainment scheme used for eQTL identification.

We felt that these were fair and reasonable criticisms, and needed to be addressed. Therefore, we revised our validation experiments to consider only high-information-content positions in transcription factor binding sites (a proxy for functionally important nucleotides), and to use a more appropriate control for eQTL. The details of these follow-up experiments are described in our revised preprint (now on bioRxiv), but the bottom line was that they had almost no effect on our ROC plots. In other words, the apparent performance advantages of fitCons over CADD and other divergence-based methods is not an artifact of our experimental design but appears to reflect real advantages of the method. While Greg is correct that the coarse, “blocky” nature of the fitCons scores is a limitation of our current methods, the method still appears to perform significantly better than any competing method in distinguishing putatively functional regulatory nucleotides from background sequence. In other words, while scores that exhibit more variation from one nucleotide to the next—such as CADD, GERP, and phyloP—may appear on the surface to have higher predictive resolution, much of that variation is uninformative about regulatory function, and, on balance, the “blocky” fitCons scores are more useful in prediction.

We have spent some time trying to understand the differences in performance between fitCons and CADD, and believe we have some insights into why fitCons performs significantly better on regulatory elements. (What follows are our conjectures only; the authors of CADD do not agree with our analysis.) While the SVM in CADD is potentially a strength, we believe that it is substantially limited in this case by the use of a linear kernel and by pooling features across cell types, rather than focusing separately on each cell type of interest. In addition, we think there is a fundamental problem with the optimization scheme used by CADD. The SVM in CADD is trained by a global strategy, in the sense that a single set of parameter values is selected (for a given choice of training set and generalization parameter) to obtain an optimal fit, on average, across all examples in the training set. Thus, if it is true that different covariates are relevant in coding and noncoding regions, as expected, then the method will have to make tradeoffs between these types of sites. If the “signal” for training (i.e., the contribution toward the SVM’s objective function) is stronger in one type of region than another, it is likely that the tradeoff will favor that type of region. Because constraint will be strongest, on average, in coding regions, leading to higher rates of difference between simulated and observed variants in these regions, it seems likely that these regions will indeed dominate in the training procedure, and this may explain CADD’s superior performance in coding regions and its weaker performance in noncoding regions. FitCons avoids this problem by applying INSIGHT separately to each class of sites.

These observations raise the interesting possibility of a modified CADD that addresses some of these limitations. There is no reason why CADD couldn’t be trained separately on noncoding and coding regions, perhaps with different sets of covariates for each type of sites. Moreover, regulation-associated covariates could be treated in a cell-type-specific manner. A modified CADD designed along these lines (regulatory CADD, or rCADD?) could provide an interesting alternative to fitCons.

Summary

When we first discussed the idea for the fitCons scores during our drive across New York State three years ago, I envisioned a quick spin-off project that could be completed in perhaps half a year. As so often happens in research, several unanticipated challenges arose in completing this work, but we also found unexpected opportunities to connect our analysis with important open questions in the field. In addition, we were stimulated to think about the problem of combining functional and evolutionary data in new and deeper ways by another paper that addressed a similar problem but in a fundamentally different way. The end result is a paper I am quite proud of—one that provides what I think will be a useful resource to the genomics community and that also offers new insights into longstanding evolutionary questions.