About this Author

Derek Lowe, an Arkansan by birth, got his BA from Hendrix College and his PhD in organic chemistry from Duke before spending time in Germany on a Humboldt Fellowship during his post-doc. He's worked for several major pharmaceutical companies since 1989 on drug discovery projects against schizophrenia, Alzheimer's, diabetes, osteoporosis and other diseases.
To contact Derek email him directly: derekb.lowe@gmail.com
Twitter: Dereklowe

February 25, 2013

ENCODE: The Nastiest Dissent I've Seen in Quite Some Time

Posted by Derek

Last fall we had the landslide of data from the ENCODE project, along with a similar landslide of headlines proclaiming that 80% of the human genome was functional. That link shows that many people (myself included) were skeptical of this conclusion at the time, and since then others have weighed in with their own doubts.

A new paper, from Dan Graur at Houston (and co-authors from Houston and Johns Hopkins) is really stirring things up. And whether you agree with its authors or not, it's well worth reading - you just don't see thunderous dissents like this one in the scientific literature very often. Here, try this out:

Thus, according to the ENCODE Consortium, a biological function can be maintained indefinitely without selection, which implies that (at least 70%) of the genome is perfectly invulnerable to deleterious mutations, either because no mutation can ever occur in these “functional” regions, or because no mutation in these regions can ever be deleterious. This absurd conclusion was reached through various means, chiefly (1) by employing the seldom used “causal role” definition of biological function and then applying it inconsistently to different biochemical properties, (2) by committing a logical fallacy known as “affirming the consequent,” (3) by failing to appreciate the crucial difference between “junk DNA” and “garbage DNA,” (4) by using analytical methods that yield biased errors and inflate estimates of functionality, (5) by favoring statistical sensitivity over specificity, and (6) by emphasizing statistical significance rather than the magnitude of the effect.

Other than that, things are fine. The paper goes on to detailed objections in each of those categories, and the tone does not moderate. One of the biggest objections is around the use of the word "function". The authors are at pains to distinguish selected effect functions from causal role functions, and claim that one of the biggest shortcomings of the ENCODE claims is that they blur this boundary. "Selected effects" are what most of us think of as well-proven functions: a TATAAA sequence in the genome binds a transcription factor, with effects on the gene(s) downstream of it. If there is a mutation in this sequence, there will almost certainly be functional consequences (and these will almost certainly be bad). Imagine, however, a random sequence of nucleotides that's close enough to TATAAA to bind a transcription factor. In this case, there are no functional consequences - genes aren't transcribed differently, and nothing really happens other than the transcription factor parking there once in a while. That's a "causal role" function, and the whopping majority of the ENCODE functions appear to be in this class. "It looks sort of like something that has a function, therefore it has one". And while this can lead to discoveries, you have to be careful:

The causal role concept of function can lead to bizarre outcomes in the biological sciences. For example, while the selected effect function of the heart can be stated unambiguously to be the pumping of blood, the heart may be assigned many additional causal role functions, such as adding 300 grams to body weight, producing sounds, and preventing the pericardium from deflating onto itself. As a result, most biologists use the selected effect concept of function. . .

A mutation in that random TATAAA-like sequence would be expected to be silent, compared to what would happen in a real binding motif. So one would want to know what percent of the genome is under selection pressure - that is, what part of it cannot be mutated without something happening. Those studies are where we get the figures of perhaps 10% of the DNA sequence being functional. Almost all of what ENCODE has declared to be functional, though, can sustain mutations with relative impunity:

From an evolutionary viewpoint, a function can be assigned to a DNA sequence if and only if it is possible to destroy it. All functional entities in the universe can be rendered nonfunctional by the ravages of time, entropy, mutation, and what have you. Unless a genomic functionality is actively protected by selection, it will accumulate deleterious mutations and will cease to be functional. The absurd alternative, which unfortunately was adopted by ENCODE, is to assume that no deleterious mutations can ever occur in the regions they have deemed to be functional. Such an assumption is akin to claiming that a television set left on and unattended will still be in working condition after a million years because no natural events, such as rust, erosion, static electricity, and earthquakes can affect it. The convoluted rationale for the decision to discard evolutionary conservation and constraint as the arbiters of functionality put forward by a lead ENCODE author (Stamatoyannopoulos 2012) is groundless and self-serving.

Basically, if you can't destroy a function by mutation, then there is no function to destroy. Even the most liberal definitions take this principle to apply to about 15% of the genome at most, so the 80%-or-more figure really does stand out. But this paper has more than philosophical objections to the ENCODE work. They point out that the consortium used tumor cell lines for its work, and that these are notoriously permissive in their transcription. One of the principles behind the 80% figure is that "if it gets transcribed, it must have a function", but you can't say that about HeLa cells and the like, which read off all sorts of pseudogenes and such (introns, mobile DNA elements, etc.).

One of the other criteria the ENCODE studies used for assigning function was histone modification. Now, this bears on a lot of hot topics in drug discovery these days, because an awful lot of time and effort is going into such epigenetic mechanisms. But (as this paper notes), this recent study illustrated that all histone modifications are not equal - there may, in fact, be a large number of silent ones. Another ENCODE criterion had to do with open (accessible) regions of chromatin, but there's a potential problem here, too:

They also found that more than 80% of the transcription start sites were contained within open chromatin regions. In yet another breathtaking example of affirming the consequent, ENCODE makes the reverse claim, and adds all open chromatin regions to the “functional” pile, turning the mostly true statement “most transcription start sites are found within open chromatin regions” into the entirely false statement “most open chromatin regions are functional transcription start sites.”

Similar arguments apply to the 8.5% of the genome that ENCODE assigns to transcription factor binding sites. When you actually try to experimentally verify function for such things, the huge majority of them fall out. (It's also noted that there are some oddities in ENCODE's definitions here - for example, they seem to be annotating 500-base stretches as transcription factor binding sites, when most of the verified ones are below 15 bases in length).

Now, it's true that the ENCODE studies did try to address the idea of selection on all these functional sequences. But this new paper has a lot of very caustic things to say about the way this was done, and I'll refer you to it for the full picture. To give you some idea, though:

By choosing primate specific regions only, ENCODE effectively removed everything that is of interest functionally (e.g., protein coding and RNA-specifying genes as well as evolutionarily conserved regulatory regions). What was left consisted among others of dead transposable and retrotransposable elements. . .

. . .Because polymorphic sites were defined by using all three human samples, the removal of two samples had the unfortunate effect of turning some polymorphic sites into monomorphic ones. As a consequence, the ENCODE data includes 2,136 alleles each with a frequency of exactly 0. In a miraculous feat of “next generation” science, the ENCODE authors were able to determine the frequencies of nonexistent derived alleles.

That last part brings up one of the objections that many people may have to this paper - it does take on a rather bitter tone. I actually don't mind it - who am I to object, given some of the things I've said on this blog? But it could be counterproductive, leading to arguments over the insults rather than arguments over the things being insulted (and over whether they're worthy of the scorn). People could end up waving their hands and running around shouting in all the smoke, rather than figuring out how much fire there is and where it's burning. The last paragraph of the paper is a good illustration:

The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.

Well, maybe that was necessary. The amount of media hype was huge, and the only way to counter it might be to try to generate a similar amount of noise. It might be working, or starting to work - normally, a paper like this would get no popular press coverage at all. But will it make CNN? The Science section of the New York Times? ENCODE's results certainly did.

But what the general public thinks about this controversy is secondary. The real fight is going to be here in the sciences, and some of it is going to spill out of academia and into the drug industry. As mentioned above, a lot of companies are looking at epigenetic targets, and a lot of companies would (in general) very much like to hear that there are a lot more potential drug targets than we know about. That was what drove the genomics frenzy back in 1999-2000, an era that was not without its consequences. The coming of the ENCODE data was (for some people) the long-delayed vindication of the idea that gene sequencing was going to lead to a vast landscape of new disease targets. There was already a comment on my entry at the time suggesting that some industrial researchers were jumping on the ENCODE work as a new area to work in, and it wouldn't surprise me to see many others thinking similarly.

But we're going to have to be careful. Transcription factors and epigenetic mechanisms are hard enough to work on, even when they're carefully validated. Chasing after ephemeral ones would truly be a waste of time. . .

I wrote a post about Graur's article on Sci Am when it came out (clicking on my name will take you to the link). I too think people need to look beyond the tone. In fact, I think that this article is a welcome change from the usually boring technical style that we are accustomed to. Sometimes it's worth stirring the pot, especially if you are countering something that's been hyped so much by the media.

Well. I disagree with the "Houston" team's definition of function. If a mutation does not destroy a function, then it is not a functional region?? What if a mutation simply alters it, to a small extent? While I don't fully agree with the ENCODE team's contention that 80% is functional, this definition doesn't really hold up either.

If Protein A functionally binds Substrate B at 60 uM and hydrolysis occurs at rate X, and then a mutation causes Protein A to bind Substrate B at 200 uM with a 2-fold decreased rate of hydrolysis, that mutation is destroying the native function of the enzyme. Sure, it is still functional, but not with the same function it once had.

If 80% of our DNA were functional, then the use of any of it for a "molecular clock" to measure evolutionary distance would be invalid. The success of genetic methods in reproducing phylogenetic trees says that a lot of the DNA we're carrying is indeed free to mutate without selection over time.

Ewan Birney tried to repair the damage on his blog, explaining that the press release and abstract were d̶u̶m̶b̶e̶d̶ ̶d̶o̶w̶n̶ simplified to avoid confusing the press with two numbers. He also belatedly explained the difference between function and biochemical activity. But that important distinction was left out of the original press release.

And so a false "fact" was duly reported by the media, including the New York Times: "The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as 'junk' but that turn out to play critical roles in controlling how cells, organs and other tissues behave."

If Graur's vitriolic tone forces the NYT and other media to reassess this "fact", or at least report that there is a legitimate scientific disagreement, I can't fault him. So far, the polite approach hasn't worked.

(I haven't read ENCODE or Graur, but that won't stop me from asking ...) Suppose you need to separate a gene or motif or several segments of DNA from each other (perhaps to accommodate transcription factors, processing machinery, etc.) by a specific distance. Put down some "junk" DNA of the required length and proper characteristics to separate them and not screw things up. Is that functional or causal? If that segment gets point-mutated, a problem might not result: whether you've gone from an asphalt separator to pot-holed asphalt to gravel to dirt, it still keeps things apart. A major deletion (or addition), though, could alter distances and prevent binding, or functional binding. The gene sequence wouldn't matter, but having a precise amount of DNA would matter.

I read the television analogy above (and will have to read it in more detail to make sure I understand what Graur is saying) but I don't think this is like leaving ONE TV on for millions of years and expecting nothing to happen. A self-replicator (Sears, BestBuy) is replacing the TV at a regular rate and (being Sears and BestBuy) they will make mistakes over time (deliver the European model instead of the US model, hook up the cable to wrong connectors, etc.) and the mutated TV will sometimes work and sometimes not work. If I'm using the TV to hold a potted plant the mutations are less important than if I'm trying to watch a Red Sox game.

(On the organic chemistry side, I used to love it when Fritz Menger would blast some hyped up chemistry publication. Come to think of it, I think Menger was one to bash Rebek's non-gene self-replicators.)

I also agree that the ENCODE paper certainly didn't prove 80% of the genome is functional.

On the other hand, I am pretty sure the function of a significant portion of our DNA is structural - spacers, support of the 3D nuclear scaffold, etc. It's not easy to alter such a function by mutation unless we are talking about large insertions/deletions. And it is even harder to prove it.

Scathing criticism may be entertaining (unless you're the recipient), but aren't some of the attacks verging on the personal? That's something frowned upon by everyone from Nature to the Fresno Shoppers Gazette.

On another note, there are people out there who can one day tell us what happens if you leave a television on for a million years...

@2: you missed the point here - it wasn't a question of whether A mutation might leave some function intact, it was that NO mutation could destroy the function (apologies for the caps - didn't mean to shout).

I think the open chromatin part is the weakest in Graur's otherwise entertaining paper. The quoted passage 'turning the mostly true statement “most transcription start sites are found within open chromatin regions” into the entirely false statement “most open chromatin regions are functional transcription start sites”' - sounds like a misinterpretation as I can't find such a claim anywhere. Has Graur never heard of enhancers (which he doesn't mention in the paper)? Are they always "non-functional"?

"One of the principles behind the 80% figure is that "if it gets transcribed, it must have a function", but you can't say that about HeLa cells and the like, which read off all sorts of pseudogenes and such (introns, mobile DNA elements, etc.)"

However, that's definitely not to say that transcribing these things can't cause disease (meaning that they might be drug targets). I should have had a post up on this point based on a very recent paper, but a happy family event supervened.

"One of the principles behind the 80% figure is that "if it gets transcribed, it must have a function", but you can't say that about HeLa cells and the like, which read off all sorts of pseudogenes and such (introns, mobile DNA elements, etc.)"

And who can blame them? Faced with the prospect of eternal life in a cell culture dish you would also try every trick in your book to spice things up.

People who take money for pretending to be scientists deserve personal criticism as well as the sack. Of course, I am making a general comment rather than discussing the specifics here, which are far enough outside my area of expertise that I must rely on the word of the local specialists.

I am not a chemist, I'm a computer programmer. Can someone please explain the claim "much of the genome MUST be non-functional otherwise the whole thing would be too sensitive to mutations"?

Perhaps I'm naïve, but surely padding the genome with junk accomplishes NOTHING because a genome twice as long will be subject to twice as many mutations?

I am assuming each random mutation is independent of the rest.

Example: a genome with 1,000 base pairs (all functional) is copied with a 1% error rate. We expect 10 base pairs to be mutated. Oh dear!

Pad the genome with 9,000 junk base pairs. Copy it again with the same error rate. We expect 100 base pairs to be mutated. Don't panic! Most of them were in the junk, if we look in the functional part there are... let's see... 10 mutations.

I think that one part of the answer is that all mutations are not created equal. The importance of individual bases in the genome is wildly nonlinear. A good example is APP, amyloid precursor protein. One gene out of thousands, and it has 751 amino acids, but if one of them changes between glutamate and glutamine (not that big a change), you get rampaging Alzheimer's before age 40.

Another part, I think, is that it's not that adding lots of junk DNA is any sort of help to the genome. It's just that it's not enough of a hindrance to be selected against.

I think #18 Gomez is slightly confused. Adding junk DNA does not reduce the number of functional mutations (or does it? Could it work as a sink for free radicals, for example?). What the authors argue, in your example, is that if the genome contains 1,000 functional base pairs (out of 10,000), we will get 10 functional mutations per cell division. If it contains 9,000 functional base pairs, we will get 90 functional mutations, which makes our genome ultra-sensitive given the known mutation frequency (from biochemistry, the genomic tree, etc.).
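The point of the correction can be put in one line of arithmetic. A minimal sketch (the 1% per-base error rate and the base counts are just the toy numbers from the example above, not realistic biological values):

```python
def expected_functional_hits(functional_bp: int, total_bp: int,
                             per_base_rate: float) -> float:
    """Expected mutations landing in the functional region per replication.

    Each base mutates independently at per_base_rate, so the junk padding
    (total_bp - functional_bp) neither shields nor endangers the functional
    bases: total_bp drops out of the expectation entirely.
    """
    assert functional_bp <= total_bp
    return functional_bp * per_base_rate

RATE = 0.01  # the toy 1% per-base copying error rate from the example

# Gomez's scenario: padding a 1,000 bp all-functional genome out to
# 10,000 bp leaves the expected functional damage unchanged at ~10 hits.
print(expected_functional_hits(1_000, 1_000, RATE))
print(expected_functional_hits(1_000, 10_000, RATE))

# The reply's point: if 9,000 of the 10,000 bases are functional (the
# ENCODE-style reading), expected functional damage is nine times higher.
print(expected_functional_hits(9_000, 10_000, RATE))
```

So the padding buys nothing for the functional bases themselves; what matters for mutational load is how many bases are functional, which is exactly why an 80%-functional genome looks implausibly fragile.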

@akhilesh, Anonymous #2-3: I think they are referring to Tajima's D measures on segments of DNA. If selection is active, then the rate of mutations differs from a background level. That would happen, among other reasons, because neutral mutations that are linked (same gene, hitchhiker, whatever) to harmful ones get wiped out by negative selection on the harmful one. Positive selection would also reduce variation, by making whatever neutral mutations are associated with the helpful gene "the new normal" throughout the population over time.

That test is a really restrictive requirement on function, though. Not in theory, of course, in theory it's a perfect measure, and that's what Graur et al are promoting. In practice, all we have is an estimate of the background mutation rate for SNPs, so there's no practical test for significance. It's done by rule of thumb, eyeball, expectation of experience with that species. We're sure, if there's a big difference, selection is operating. But being dogmatic that the fuzzy fringes of the test put absolute bounds on the amount of functional DNA seems a bit foolish to me.
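For readers who haven't met it: Tajima's D compares two estimates of diversity - mean pairwise differences (pi) and Watterson's estimate from the count of segregating sites - and flags selection when they disagree more than drift alone would allow. A minimal sketch of the standard Tajima (1989) formula; the toy alignment is made up purely for illustration:

```python
from itertools import combinations
from math import sqrt

def tajimas_d(seqs):
    """Tajima's D for a list of equal-length aligned sequences.

    D = (pi - S/a1) / sqrt(e1*S + e2*S*(S-1)), with the usual constants
    from Tajima (1989). Near zero is consistent with neutral drift;
    strongly negative or positive values hint at selection (or demography).
    """
    n = len(seqs)
    # S: number of segregating (polymorphic) columns in the alignment
    S = sum(len(set(col)) > 1 for col in zip(*seqs))
    if S == 0:
        raise ValueError("no polymorphism: D is undefined")
    # pi: mean pairwise differences across all sequence pairs
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    # Tajima's normalizing constants
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n**2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))

# Toy alignment of four haplotypes with two segregating sites.
print(round(tajimas_d(["AATAG", "AATAG", "AGTAG", "AGTAC"]), 2))  # → 0.59
```

Note how short the toy alignment is: with realistic sample sizes and unknown background rates, the significance cutoffs are exactly the "rule of thumb, eyeball" territory described above.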

In addition, while Graur et al are happy to talk about how low effective population size and long replication times could translate to keeping around junk DNA, they don't get around to mentioning that low effective population sizes and long replication times also diminish the accuracy of their measure of selective function. I believe those non-idealities tend to make the test weaker, so only stronger selective function is clearly seen, thereby confirming their bias that most DNA is non-functional and the rest is junk and garbage.

I also dislike the way this paper puts words in the mouths of the original authors. I sense a straw man army assembling to try to overwhelm any possible objection. Some of the methodological issues they point out with ENCODE are probably valid, and ought to be aired. Those can and hopefully will be addressed, independent of the top-level argument about how confident we can be about what we do not know.

Graur et al write:
"It has been pointed to us that junk DNA, garbage DNA, and functional DNA may not add up to 100% because some parts of the genome may be functional but not under constraint with respect to nucleotide composition. We tentatively call such genomic segments 'indifferent DNA.' Indifferent DNA refers to DNA sites that are functional, but show no evidence of selection against point mutations. Deletion of these sites, however, is deleterious, and is subject to purifying selection. Examples of indifferent DNA are spacers and flanking elements whose presence is required but whose sequence is not important... The amount of indifferent DNA is not known."

So, let me get this straight: they recognize a chunk of DNA for which their very conservative, simple-minded "function" test does not work (the background mutation rate for deletion of entire sites is unknown), they acknowledge they have no idea how large this chunk may be, and yet they very boldly claim someone else's measure of function is worthless and their own numbers--ignoring this category--are the True Gospel? Pot. Kettle. Black, black, black.

On their concept of junk: consider a fly swatter. Original purpose: to kill flies in the house. Have there been any flies in the house in a long time? No, it's wintertime, and there are screens on the windows and doors. Is the fly swatter junk? If next summer rolls around and a fly shows up and the fly swatter is used, is that a case of junk being coopted for a "new" purpose? What if it's not next summer, it's been two years--is it junk now? I think at the fringes, this is how the debate boils down to angels dancing on the head of pins.

That textbook you saved from college, the bent gym clip I keep handy (ejects CDs/DVDs, hits recessed reset buttons): junk or useful tool? Graur et al apparently are the clean freaks of the DNA world--did we use it this month?--and ENCODE are the professors with books and papers stacked high.

(Trying to rescue this post toward disease and medicines...) Did ENCODE talk about non-ATG-initiated translation mechanisms?

Wouldn't it be a cruel joke to actually test ENCODE "researchers" blindly by supplying them with the marbled lungfish genome sequence instead of Homo sapiens? I bet they would still discover that it contains 80% or so of "functional genome". That is out of 130,000 Mb, according to Wikipedia. In comparison, the human genome is 3,200 Mb, or just about 2.5%! So that fish could lose 97% of its "function" and not just survive but be as complex as a human. Worse still, there is another fish, the pufferfish, whose genome is only 340 Mb.
Now this is getting really silly. A fish can lose _99.7%_ of the whole genome and still be a happy, thriving fish. Now, how much of that 99.7% was "functional", if you trust those 400+ ENCODE employees?
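The back-of-the-envelope ratios in that comment are easy to check (genome sizes in megabases as quoted above, via Wikipedia; a quick arithmetic sketch, not a statement about actual genome annotation):

```python
# Genome sizes in megabases (Mb), as quoted in the comment above.
lungfish = 130_000   # marbled lungfish
human = 3_200
pufferfish = 340

# Human genome as a fraction of the lungfish genome: about 2.5%.
print(f"human / lungfish:      {human / lungfish:.1%}")

# Pufferfish as a fraction of the lungfish genome: about 0.26%,
# i.e. a fish genome can be ~99.7% smaller and still build a fish.
print(f"pufferfish / lungfish: {pufferfish / lungfish:.2%}")
print(f"shrinkage lungfish -> pufferfish: {1 - pufferfish / lungfish:.1%}")
```

This is, of course, just the classic C-value paradox restated in percentages: genome size and organismal complexity simply don't track each other.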