Cataloging the controlled chaos of the human genome

The completion of the human genome, rather than being the huge breakthrough it was presented as, raised almost as many questions as it answered. Less than two percent of it encoded a protein, and only about five percent ended up being conserved relative to many of our fellow mammals. The rest of it seemed like a bit of a mess—damaged viruses, long stretches of repetitive sequence, and huge stretches devoid of any genes.

The ENCODE project was formed to make sense of that mess. A huge consortium of labs, ENCODE started performing a massive assay of pretty much everything to do with DNA: where proteins stuck to it, how it was packaged, how it was modified, etc. Now, the consortium is out with a massive number of papers—six alone in today's Nature, and dozens to follow in the days to come, scattered through many different journals. The results are staggering: over 1,500 different types of data, assayed across the whole genome, from 147 different cell types. It will feed researchers for years to come.

And it suggests that more of the chaos in the genome may be useful, although that suggestion comes with some big caveats.

First, the data. The researchers used a huge variety of techniques to look at just about anything that can happen to the DNA inside a cell. They took 119 proteins that bind to DNA and found every single site in the genome that they stuck to. They looked for signs of a chemical modification of DNA called methylation that can change the expression of genes nearby. The locations of the proteins that structure chromosomes, called histones, were also checked, and areas where the chromosome structure is relatively accessible were pinned down. Regions of the chromosomes that are close to each other inside a cell were mapped. Every single RNA within a cell was sequenced and resequenced.

Once a single cell type, like liver cells, was checked, the authors moved on to look at another. And another. The end result was 1,640 different data sets, covering a total of nearly 150 different cell types. It's a staggering amount of work; calling ENCODE the Large Genome Collider would provide a real sense of the scale of this effort.

Functional or not?

What does it tell us about the genome? Some interesting things, although they're a bit tough to interpret. With data on this scale, we can't actually look at each piece of data on every piece of DNA. So we have to settle on some generic definitions of things like function: an RNA is made there, a protein binds there, etc. ENCODE chose to go extremely broad: "Operationally, we define a functional element as a discrete genome segment that encodes a defined product (for example, protein or non-coding RNA) or displays a reproducible biochemical signature (for example, protein binding, or a specific chromatin structure)." Basically, if anything came out of any of these assays for a given base, it was considered functional.
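That broad, any-signal-counts rule is simple enough to sketch in code. Here is a toy illustration (the intervals, assay names, and genome size are all invented for the example, and this is not ENCODE's actual data model):

```python
# Toy illustration of ENCODE's broad "functional" definition:
# a base counts as functional if ANY assay reports a signal there.
# Intervals are (start, end) pairs on a made-up 100-base "genome".

def functional_bases(genome_length, assay_hits):
    """Flag every base covered by at least one assay interval."""
    flagged = [False] * genome_length
    for intervals in assay_hits.values():
        for start, end in intervals:
            for base in range(start, end):
                flagged[base] = True
    return flagged

# Three hypothetical assays
assays = {
    "protein_binding": [(10, 30)],
    "RNA_transcript": [(25, 60)],
    "open_chromatin": [(70, 90)],
}
flagged = functional_bases(100, assays)
fraction = sum(flagged) / len(flagged)
print(f"{fraction:.0%} of the toy genome is 'functional'")  # 70%
```

The point of the sketch is that the union of many loose signals grows quickly: no single assay covers much of the toy genome, but together they flag most of it.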

And by that definition, most of the genome is doing something. The figure was about 80 percent in these assays, and by adding more cell types, the authors suggest we could get the number even higher.

But there are some problems with that figure. For instance, many of the non-functional viruses and mobile genetic elements will still contain the protein binding sites that used to be critical to their function. So even though these will be traditionally classified as "junk DNA," they'll be considered functional under the ENCODE definition. More generally, proteins that bind DNA are not very picky about their binding sites, so they will create some indication of function by accident; they can also lead to RNA production nearby, looping other bases into an apparent function.

That said, the study does provide some solid indications that more of the genome is doing useful stuff than many previous studies have estimated. More stringent criteria can be used—multiple proteins in close proximity, an accessible chromosome configuration, etc. These reduce the percentage of the genome that is likely to be doing something biologically useful down to about 10 percent—which is lower, but still a lot higher than the five percent that is conserved throughout mammalian evolution.
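A stricter criterion of this sort amounts to requiring that several independent lines of evidence coincide at the same base. A toy sketch of how much difference that makes (all intervals and assay names invented for illustration):

```python
# Toy comparison of a broad criterion (any assay signal) against a
# stringent one (two or more independent assay types overlapping).
# All data here are made up for illustration.

def coverage_counts(genome_length, assay_hits):
    """Count how many distinct assay types cover each base."""
    counts = [0] * genome_length
    for intervals in assay_hits.values():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end))
        for base in covered:
            counts[base] += 1
    return counts

assays = {
    "protein_A": [(10, 40)],
    "protein_B": [(30, 50)],
    "open_chromatin": [(35, 45), (80, 90)],
}
counts = coverage_counts(100, assays)
broad = sum(c >= 1 for c in counts) / 100   # any signal at all
strict = sum(c >= 2 for c in counts) / 100  # >= 2 independent signals
print(broad, strict)  # 0.5 0.15
```

Demanding overlapping evidence shrinks the flagged fraction sharply, which mirrors the drop from 80 percent to roughly 10 percent when the stricter criteria are applied to the real data.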

The authors mention a bit about where that additional five percent might come from. Some of it appears to be primate specific, in that it's common to primates but not found in other mammals. That hints that it might be doing something specific in our own lineage. Further evidence comes from the 1,000 genomes project, which suggests some of these sequences are doing human-specific things.

Although the 80 percent figure doesn't really fit my definition of functional, the authors do present a compelling case that more of the genome is doing something useful than we previously had evidence for.

Definitions with consequences

One of the other things the authors look at is all the human genetic differences that have been associated with diseases. Most of these haven't been located in genes, but the authors find they're enriched in the sequences that have met some definition of "functional." And, if you expand to look for things that are only near something that has been identified by ENCODE (rather than right at a location), the number goes up further still.

There are probably many cases where this involves a real and significant association between a change and human disease. But it's also probable that many of these are false positives, given the very high frequency of sites identified as giving a positive signal in at least some ENCODE assays.
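The scale of that base-rate problem is easy to sketch: if roughly 80 percent of the genome carries some "functional" flag, a variant dropped at random will overlap a flagged element about 80 percent of the time, so raw overlap with ENCODE elements is weak evidence on its own. A quick illustrative simulation (the figures are mine, not from the papers):

```python
# Back-of-the-envelope: when ~80% of the genome is flagged as
# "functional", randomly placed variants overlap flagged elements
# at high rates by chance alone. Figures are illustrative only.
import random

random.seed(1)
flagged_fraction = 0.80  # assumed fraction of genome with any signal
n_variants = 10_000      # hypothetical randomly placed variants

overlaps = sum(random.random() < flagged_fraction
               for _ in range(n_variants))
print(f"{overlaps / n_variants:.1%} of random variants overlap "
      "a flagged element purely by chance")
```

Any claimed enrichment of disease variants in "functional" regions has to be judged against that background rate, not against zero.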

In the long term, that could be a significant problem. ENCODE is mostly providing a resource for other researchers to understand the things they're studying, be it DNA binding proteins or human diseases. With 1,640 assays in 147 cell types, it will be very easy for them to search the ENCODE data and find a single assay that tells them what they'd like to see.

This danger is highlighted by the fact that the percentage of the genome flagged as functional is already at 80 percent, and the authors suggest that the number will go higher when they look at additional cell types and DNA binding proteins. If it reaches 100 percent (something that an ENCODE researcher has already suggested), then it will be very difficult to not come home with a positive result whenever you search the ENCODE genome data.

All of this means that ENCODE will be a tool that has to be used very carefully, and things that come out of it should be checked in detail. Still, it's a potentially powerful tool, since it provides so much information about so many different genes and cell types. Biologists will undoubtedly end up enjoying having it as much as they've enjoyed having the genome sequence in the first place.

39 Reader Comments

The data sounds like it will be a very useful resource. I'm concerned, however, about the possibility of this being terribly abused. I can see it now: "ENCODE Data Shows 80% of Genome Functional. Junk DNA Myth Exposed! Intelligent Design Most Likely Explanation."

Trying to get some kind of grasp on this data is indeed a challenge. If I have it right, one piece of evidence is the wide distribution of the locations of SNPs that have some kind of association with different diseases. Another reality is that the activity of chromatin does seem to be intertwined with much of the regulation of DNA transcription. It is possible that the primary logic of all of this DNA is as a structural skeleton, as proposed by Tom Cavalier-Smith. Once it was there and involved with chromatin just to manage its physical state, it is not hard to imagine how regulatory effects could have evolved.

The massive amount of work the project has done (442 researchers!), and the brilliant ways in which they present and use that work, and the beautiful things they have discovered — it's all there, every paragraph bursting with sense of wonder. Give yourself a treat, and go read it.

First, let me just say, "Duh!". Second, I predicted that "junk dna" would be found to have some utility (most likely system level controls) a couple of years ago and JT acted just as warily as he did to this article. Of course I made this prediction only on the basis of intuition with absolutely no data. So, yeah, thanks to ENCODE for providing that... oh yeah, the in-depth analysis, too.

Yeah, I kind of wish they'd said 80% was active rather than functional. Researchers are going to have to be pretty careful how they define their functions.

Good, please do go and write a blog post about your definition of functional and the strong opinions you appear to have on the matter. Otherwise, present the argument without distracting personal opinions.

While I think your response is a little truculent I do get the feeling that the gist of the article was to "debunk" the "80% functional" meme that's floating around now rather than covering ENCODE.

Thank you for that. This provides quite a bit more information.

Please read the article above, but also check out the ENCODE website.

I'd hold off on saying something is NOT functional. Our bodies go nuts when NOT exposed to the environment they adapted to (kind of odd that your body doesn't work right when you're not exposed to certain bacteria/viruses, right?). So looking at DNA and wondering how it fits in with anything else in there, when it actually fits in with an external influence the body adapted to expect, makes excluding things difficult, yes?

I think it is delightful that so much of our DNA is available for transcription or protein binding, or some other aspect of "function". This is the raw material for evolution. Surprising interactions or useless products that suddenly turn out to be selectable in a new context! If everything had precisely one purpose (as if designed) there would be far less slosh to play with evolutionarily. In other words, it has to be this way if the system is going to show emergent behaviors. And it does.

It also serves as a strong warning against anyone claiming that a genome wide association "proves" some piece of DNA is important in their favorite disease. Correlation vs causation again.

To begin with, I did link to Ed Yong's piece at Discover in the article itself. Ed's also been good about tracking the fairly large number of researchers who also find the 80% claim to be problematic. I'll quote the most relevant paragraph:

"Barely five hours after publication, and a backlash against the 80 percent figure has already begun. Leonid Kruglyak from Princeton University said that the 80% includes definitions of “activity” that are “barely more interesting” than just saying the sequences were copied (which all of them are). Michael Eisen added that “measurable biochemical activity”, which is how the wider range of elements was defined, “is a meaningless measure of functional significance.” And Daniel Macarthur from the Broad Institute said, “It doesn’t change the fact that the majority of the genome is indeed evolutionary junk.” More on this as it happens."

And as for complaints about personal opinions being included in the piece: I explained why, in scientific terms, I had issues with the figure. And providing reasoned arguments as part of our stories has been a part of Ars since before I started here.

What I find interesting is the astrobiology perspective, the new and larger opening to the evolutionary history and the RNA world this gives.

- The specifically much looser requirement for selective pressures on some RNAs, especially on long non-coding RNA (lncRNA) like the signal recognition particle RNA shared by all cells. Apparently long stretches of RNA can accommodate the functional folding despite variation.

This means there can be more remains of the RNA world than earlier believed. In bacteria there are several clusters of RNA, akin to their having coding gene clusters, operons, for regulatory purposes. (Reading off like amounts of related genes in a sweep.)

And there are also thousands of small RNAs in bacteria. ENCODE, meanwhile, finds that the correlation between selective fixation and function isn't all that tight outside the coding transcripts, meaning some of those can have function along lineages.

- The ENCODE project sees the functional unit as the RNA transcript, not the gene. The non-coding RNAs outnumber the proteins 10:1, at least in some eukaryotes.

Conceptually it means our cells are seen as more akin to the original RNA/protein cells, and it is perhaps easier to understand the change the later DNA add on machinery ushered in.

divisionbyzero wrote:

I predicted that "junk dna" would be found to have some utility (most likely system level controls) a couple of years ago and JT acted just as warily as he did to this article.

Well, there are a couple of factors to consider here:

- Researchers have headed off the awaited discoveries by renaming the non-coding genome as "genomic dark matter". This implies potentially undiscovered functionality, more than the original "junk DNA" term, which stood for pseudogenes only, AFAIU. I wonder what gave them the idea...

- Much of this is anticipated epigenetics, with previously known histones, methylation, et cetera.

- The dark matter still contains a lot of genetic parasites (can be tens of percent, IIRC) and the original "junk DNA", the no-longer-used pseudogenes. Parasites have functionality in the ENCODE sense, but not in the host-centric sense.

The latter can be a large factor in extreme cases. When I studied up on the lncRNA for this comment, I saw that pathogenic bacteria in particular, which face a lot of changing selective pressures from their hosts' immune systems, competitors, et cetera, can have as much as ~40% of the genome in the form of pseudogenes. That is all non-functional for the host there and then. (A very few pseudogenes are later reinstated as parts of new genes.)

"The completion of the human genome, rather than being the huge breakthrough it was presented as, raised almost as many questions as it answered"

Had a single human genome been the outcome of the effort, its value would have been something like a trip to the moon. In fact, the realization that all eukaryotes shared a limited number of proteins is one of the great breakthroughs of science. There seem to be many in the scientific community who are questioning the value of the ENCODE work. One wonders where they were when the brains were handed out. Of course, one should never be surprised at the petty attitudes of scientists when money is involved. Nobody should expect any quick answers about a complex system that was designed through a process of trial and error. Nevertheless, it looks like modern technology is up to the task of unlocking all the details of that complexity. Nobody should be surprised if it takes some time and a lot of hard work.

Man, this article was a downer compared to most which emphasize the scientific discoveries and new directions of thought that stem from the work. Even those scientists you quote above, John, come off as sounding defensive of some belief rather than really appreciating the research doors this opens. What this work shows us is that the potential of our genome is incredibly more complex than before imagined, and gives thousands of keys to researchers looking to unlock these doors. As one writer put it - it's like some lights have been turned on, where before we were fumbling around with a pen light.

First, let me just say, "Duh!". Second, I predicted that "junk dna" would be found to have some utility (most likely system level controls) a couple of years ago and JT acted just as warily as he did to this article. Of course I made this prediction only on the basis of intuition with absolutely no data. So, yeah, thanks to ENCODE for providing that... oh yeah, the in-depth analysis, too.

I was thinking something similar to you. I remember reading ages ago (10 years-ish) about software that used evolutionary methods to design circuits. The circuits ended up having all sorts of useless and functionless parts on them, but if the apparently useless parts were removed, the circuits no longer functioned properly.

English is a clumsy language, and "junk" influences people's perceptions. I'll bet that although 80%+ of junk DNA can function, that doesn't necessarily make it functional.

Quote:

Yeah, I kind of wish they'd said 80% was active rather than functional. Researchers are going to have to be pretty careful how they define their functions.

I really hate the term junk DNA. I am no biologist, but it seems to me that if it was not functional, by natural selection it would get eliminated as needed over time (deep time). I am curious if some of the DNA is there to provide for things that are not so common anymore. Perhaps we needed certain sections to allow us to interact with certain bacteria, symbiotically or for fighting infections, that may no longer exist. I would hope that natural selection allows DNA to remain for a long while after it is no longer needed, in case the need arises again. After all, didn't each part of our DNA take hundreds to thousands of years to become integrated into our genome? It would suck if each generation had to rebuild its own resistance to diseases, or its allowance of symbiotic relationships in our guts.

I remember reading that originally Homo sapiens were lactose intolerant (even though we did have mammary glands), and only after many generations of milk-drinking societies did we finally evolve tolerance. I guess we still are not 100% lactose tolerant, but in today's age we no longer really require lactose for survival.

Not a biologist, so I could be way off in my understanding, but I have a hard time thinking the body spends time replicating stuff that would/should be considered junk, because each part requires energy to produce/reproduce. If 90% of our DNA is junk, shouldn't evolution get rid of it?

Thanks for the nice article, Dr. Jay. And I also agree that Ed Yong's post on Not Exactly Rocket Science is outstanding as well. There is just one more link that should be included if people want some more excellent information about ENCODE:

ENCODE: My Own Thoughts, by Ewan Birney, who was the lead analysis coordinator of the project. (Ed Yong has a link to this in his post as well)

I'm not going to say that this oversimplification is the reason why there is "junk DNA", because it is never this simple, and there is still plenty that we don't know, but...

If the cost of removing the "junk" from the genome exceeds the cost of just leaving it there and replicating it, then evolution would not get rid of it.

I'm not a big fan of the term "junk DNA" either, but debating whether or not it actually is junk all comes back to the debate about what is meant by functional and from what point of view something should be considered junk. For example, a virus that inserts itself into a bacterium's genome may be "junk" to the bacterium (unless it brings along genes like antibiotic resistance or something), but obviously it's not junk to the virus. Similarly, "selfish genes" in the specific sense of genomic parasites would also probably be considered junk to their host genomes, but they manage to stay in the host genomes because they have adapted to being able to remain in the genome. Are they actually junk? Not to themselves. I am, of course, speaking metaphorically about how the situation seems to behave over generational and evolutionary time. Obviously the molecules don't think of themselves at all.

Eh? I didn't realize I was casting any light. Your post started with your thoughts about how evolution should get rid of the junk and ended with the question of whether or not that's the case. I was merely attempting a simplified conceptual answer to your question.

Is there some other way your question could be interpreted that would spark the trolls?

Man, this article was a downer compared to most which emphasize the scientific discoveries and new directions of thought that stem from the work.

I don't get that impression. Dr. Jay can't stop himself from telling us what a huge, monumental undertaking this was and how thorough the cataloging is. But like most bio guys I've read, he takes issue with the severe dilution of the term "functional," which is somewhat vindicated by the press coverage claiming that this work overturns the concept of "Junk DNA" and so is some kind of reversal of the establishment. It could have been avoided if the ENCODE team had simply not hijacked an existing term used by geneticists every working day, and come up with a term like "potentially active" or something.

If one thinks of proteins as functional units, one only needs so many for an entire organism. However, an organism made up of hundreds of different types of cells (and, if one includes spatial relationships, in effect millions of types of cells) needs to make each cell type express the particular mix of proteins, and quite possibly the tens or hundreds of thousands of RNAs, which allow that cell to be at that position and to function properly. One can be confident that this control uses up a lot of the dark-matter DNA, and it may use up most of it. Who knows if some dead viral transposon provides just the right degree of bend around a histone to allow one cell type to be the skin surface I am typing with, and another to be the cornea that I am using to correct all of the typos? As evolution is stochastic, there is no saying that the fixes it adopts are in any way related to the original purpose of that DNA, or elegant, or parsimonious. Duct tape can easily cover a hole in a door, but rather a lot of it would be required to fix a broken door frame; still, a few rolls and a couple of splints (histones?) and it would function.

While working on any coding project, I create many functions that get deprecated. They are still 'functional', technically, as in they do stuff, but they get replaced by better functions. So, let them use the term 'functional', while knowing that 80% of it is 'deprecated functions', which may one day give us as much insight into our biological evolution as old languages do into the process of language evolution.

Here is how I teach this stuff. Organisms are a product of a process of development, and development is a process of "what-where-when." If all I tell you is which parts of the genome code for proteins, then all I am telling you is "what." Imagine constructing a building, and being presented with a list - "concrete, wood, nails... etc.." That would be useful, but would give you no idea how to build a building, or what kind of building you are looking at. (And given how many proteins signal, rather than construct, and given the arbitrary nature of the relationship between signal and what is signaled, it is even worse.)

The definition of functional genes as coding genes is obviously too restrictive. The definition of functional genes as "ones we can tell are reacting in some way or other to one of our tests" is obviously too inclusive. But if you are trying to figure out the where and when, I would rather have the inclusive list as a starting point.

As you can see, these elitist men of science cannot agree about whether so-called "DNA" actually does anything or not. Isn't it time to throw off oppressive political correctness and allow real innovation in biology? Peer reviewed articles by Drs Morgan and Poe clearly show the utter failure of their theory. Students should receive equal education on Lamarckism.

While working on any coding project, I create many functions that get deprecated. They are still 'functional', technically, as in they do stuff, but they get replaced by better functions. So, let them use the term 'functional', while knowing that 80% of it is 'deprecated functions', which may one day give us as much insight into our biological evolution as old languages do into the process of language evolution.

Some of the things they identify might not have ever had a function; they may have just been the random product of a copying error that never made proteins in the history of its existence. It could have been genetic nonsense from day 1 through year 50 million. So they wouldn't even be "deprecated." Sometimes the genetic material they labeled as functional was merely stuff that can be transcribed to RNA, not stuff that actually is under normal conditions. I still think that a better label would have been "potentially _____" instead of "________"

Man, this article was a downer compared to most which emphasize the scientific discoveries and new directions of thought that stem from the work. Even those scientists you quote above, John, come off as sounding defensive of some belief rather than really appreciating the research doors this opens.

I think it is in the nature of the beast. ENCODE is an (unfinished) database.

I really hate the term junk DNA. I am no biologist, but it seems to me that if it was not functional, by natural selection it would get eliminated as needed over time (deep time).

crs117 wrote:

Not a biologist, so I could be way off in my understanding, but I have a hard time thinking the body spends time replicating stuff that would/should be considered junk, because each part requires energy to produce/reproduce. If 90% of our DNA is junk, shouldn't evolution get rid of it?

That isn't what is observed, and it doesn't seem to depend on cost. T. Ryan Gregory, of the Genomicron link Interactive Civilian gave, is the go-to man here; he is "an evolutionary biologist specializing in genome size evolution".

He proposes the "Onion Test" that every genome-size theory should pass, founded on the following observation: plants, among them onions, can easily double their chromosomes many times over (tetraploidy et cetera), yet you can't really tell them apart.

Presumably half of their genome is therefore "junk"; conversely, it doesn't go away over time. These clades show this is a stable feature among them.

There are many reasons to believe the junk is useful (it can be material for new genes) or has function (it sets the size of the nucleus and therefore indirectly the cell). This may not pass the Onion Test, because if so genomes would bootstrap up to the "right" size.

There are other reasons to believe it costs little. Lane's energy theory on eukaryotes predicts that we have ~10^5 times as high an energy density due to mitochondrial endosymbionts, energy that can be used for protein turnover and so a larger genome by many orders of magnitude.

There are organisms like pufferfish that have compact genomes for some reason or other. It will be interesting when and if the ENCODE results meet these genomes. How do they do it, and why?

Parasites can often simplify their body plans, if not always their life cycles, so simplification as such is not unheard of. But why does it happen in the genome? My bet is now that pufferfish don't have simpler genomes that simply get rid of dark matter; their compactness is likely even more complex than the usual eukaryote genome.

* Though, similarly to Lane's energy theory on eukaryotes, we have Valentine's energy theory on archaea.

The ecological differences between these three domains would be predicted by eukaryotes being high-energy heterotrophs or oxygenic-photosynthesis consumers, bacteria being medium-energy opportunists with a set of metabolic pathway options, and archaea being low-energy specialists in a few metabolic pathways.

This predicts archaea's low-permeability membranes (little ion leakage, so high energy efficiency) _and_ their small genomes: few pathways and a high cost penalty on non-compactness.