Getting to Know the Genome

A massive project involving hundreds of scientists suggests that very little—if any—of the human genome is truly non-functional.

By Ed Yong | September 5, 2012

Nature

In 2001, the Human Genome Project produced a near-complete readout of the human species’ DNA. But researchers had little idea about how those As, Gs, Cs, and Ts were used, controlled, or organized, much less how they code for a living, breathing human.

That knowledge gap has just got a little smaller. A massive international project called ENCODE, the Encyclopedia of DNA Elements, has cataloged every nucleotide within the genome that does something, which, it turns out, is significantly more than the 1.5 percent of the genome that contains actual instructions for making proteins. The research, a 10-year effort by an international team of 442 scientists, shows that the rest of the genome (the non-coding majority) is still rife with “functional elements.”

“The genome is no longer an empty vastness,” said Shyam Prabhakar from the Genome Institute of Singapore, who was not involved in the study. “It is densely packed with peaks and wiggles of biochemical activity.”

“Almost every nucleotide is associated with a function of some sort or another, and we now know where they are, what binds to them, what their associations are, and more,” added Tom Gingeras, one of the studies’ many senior scientists. The results are published today (September 5), in more than 30 papers across many different journals.

Researchers have long recognized that some non-coding DNA probably has a function, and many solid examples have recently come to light. At the same time, people did believe that many of these sequences were, indeed, junk. The ENCODE project suggests otherwise.

The researchers found that many non-coding parts of the human genome contain docking sites where proteins can bind, affecting the expression of both nearby and distant genes. Other non-coding regions are transcribed into RNA molecules that are never translated into proteins. Still others affect how the DNA is folded and packaged. In sum, these regions are not just junk; according to ENCODE’s analysis, 80 percent of the genome has some biochemical function.

The remaining 20 percent may not be junk either, according to Ewan Birney, the project’s Lead Analysis Coordinator. He explains that while ENCODE looked at 147 different types of cells, there are a couple of thousand in total. If other cell types are examined, functions may emerge for the phantom proportion. “It’s likely that 80 percent will go to 100 percent,” Birney said. “We don’t really have any large chunks of redundant DNA. This metaphor of junk isn’t that useful.”

The implications are vast, from redefining what a “gene” is to providing new clues in the quest to understand diseases and how the genome works in three dimensions. “There are nuggets for everyone here,” Prabhakar said. “No matter which piece of the genome we happen to be studying in any particular project, we will benefit from looking up the corresponding ENCODE tracks.”

Of course, there’s still a long way to go, Birney noted. “I think it’s going to take this century to fill in all the details,” he said. “That full reconciliation is going to be this century’s science.”

By the numbers

Researchers already knew that 1.5 percent of the genome codes for proteins. ENCODE found that an additional 8.5 percent codes for regions where proteins stick to DNA, presumably regulating gene transcription. And, because ENCODE hasn’t looked at every possible type of cell or every possible protein that sticks to DNA, this figure is likely conservative. Birney estimates that the total proportion of the genome that either creates a protein or sticks to one is around 20 percent.

The rest of the functional elements in the ENCODE analysis cover other classes of sequence that were thought to be essentially functionless, including introns. “The idea that introns are definitely deadweight isn’t true,” said Birney. Even some repetitive sequences—small chunks of DNA that have the ability to copy themselves and are typically viewed as parasites—are likely to be functional, often containing sequences where proteins can bind to influence the activity of nearby genes. Perhaps their spread across the genome represents not the invasion of a parasite, but a way of spreading control. “These parasites can be subverted sometimes,” Birney said.

Birney expects that many skeptics will argue about the exact proportion—the 80 percent of the genome that ENCODE estimates to be doing something—and about the definition of “functional.” But, he said, “no matter how you cut it, we’ve got to get used to the fact that there’s a lot more going on with the genome than we knew.”

What’s in a gene?

The simplistic view of a gene is that it’s a stretch of DNA that is transcribed to make a protein. But with ENCODE’s data, this definition no longer makes sense. There are a lot of transcripts, probably more than anyone had realized, some of which connect two previously unconnected genes. This means that the boundaries for those genes have to widen, and the gaps between them shrink or disappear.

Gingeras says that this “intergenic” space has shrunk by a factor of four. “A region that was once called Gene X is now melded to Gene Y,” he says. With such blurring boundaries, Gingeras thinks that it no longer makes sense to think of a gene as a specific point in the genome, or as its basic unit. Instead, that honor falls to the RNA transcript. “The atom of the genome is the transcript,” says Gingeras. “They are the basic unit that’s affected by mutation and selection.”

New disease leads

For the last decade, geneticists have run a seemingly endless stream of genome-wide association studies (GWAS), and have thrown up a long list of single nucleotide polymorphisms (SNPs) that correlate with the risk of different conditions. The ENCODE team has mapped all of these GWAS-identified SNPs to their data.

The researchers found that just 12 percent of known SNPs lie within protein-coding areas. They also showed that compared to random SNPs, the disease-associated ones are 60 percent more likely to lie within the non-coding but functional regions that ENCODE identified, especially in promoters and enhancers. This suggests that many of these variants are controlling the activity of different genes, and provides many fresh leads for understanding how they affect our risk of disease. “It was one of those too good to be true moments,” said Birney. “Literally, I was in the room [when they got the result] and I went: Yes!”

The ENCODE researchers also found new links between disease-associated SNPs and specific DNA elements. For example, they found five SNPs that increase the risk of Crohn’s disease, and that are recognized by a group of transcription factors called GATA2. “That wasn’t something that the Crohn’s disease biologists had on their radar,” Birney said. “Suddenly we’ve made an unbiased association between a disease and a piece of basic biology.”

“We’re now working with lots of different disease biologists looking at their data sets,” he added. “In some sense, ENCODE is working from the genome out, while GWAS studies are working from disease in.” So far, the team has identified 400 such hotspots that are worth looking into.

The 3-D genome

Writing the genome out as a string of letters invites a common fallacy: that it’s a two-dimensional, linear entity. In reality, DNA is wrapped around proteins called histones like beads on a string. These are then twisted, folded and looped in an intricate three-dimensional way. In this way, distant parts of the genome can actually be physical neighbors, and can affect each other’s activity.

Job Dekker, a bioinformaticist at the University of Massachusetts Medical School, used ENCODE data to map these long-range interactions across just 1 percent of the genome in three different types of cell, and discovered more than 1,000 of them. “I like to say that nothing in the genome makes sense, except in 3D,” said Dekker. The availability of the new ENCODE data is “really a teaser for the future of genome science,” he added.

Sharing the data

The new ENCODE results are vast, reported in 30 central papers in Nature, Genome Biology, and Genome Research, as well as a slew of secondary articles in Science, Cell, and others. And all of the data are freely available to the public.

The pages of printed journals are a poor repository for such a vast trove of data, so the ENCODE team have devised a new publishing model. On the ENCODE portal site, readers can pick one of 13 topics of interest, such as enhancer sequences, and follow them in special “threads” that pull out all the relevant paragraphs from the 30 main papers. “Rather than people having to skim read all 30 papers, and working out which ones they want to read, we pull out that thread for you,” Birney said.

The team has also built what they call a Virtual Machine, a downloadable program that includes all the code that the ENCODE scientists used to analyze their data. Any researcher can download almost-raw data and reproduce any of the analyses in the papers by themselves. It’s the ultimate in transparency.

“With these really intensive science projects, there has to be a huge amount of trust that data analysts have done things correctly,” said Birney. With the virtual machine, “you can absolutely replay, step by step, what we did to get to the figure. I think it should be the standard for the future.”

Comments

What's old is new again. Nice to see how well the hits on the DNase hypersensitivity and ChIP binding tracks line up. Could always see it happening in small regions in isolated studies, but here you can see it over large stretches.

The more energy we spend on accessing, documenting, measuring, comparing, analyzing... new data, the less time we will have for vacuous debates over things we have no data to support. The progress in genetics, epigenetics, proteomics... has been exhilarating in this century so far. Am currently reading "The Beak of the Finch," by Jonathan Weiner, 1995, and discovering there is, after all, some hard data to support some aspects of the current state of evolutionary theory. BRAVO! Too bad so many who are on both sides are unable to cite sources, such as this one, and make arguments based more on emotion than data or reason. There are numerous interpretations of existing hard data in this book that are compelling, and some that strike me as a bit forced and bordering on dogmatic. But let me recommend that each and every person who wishes to argue in opposition to bioevolutionary theory AND every person who wishes to argue for it read this book very thoroughly. That way, neither side will have to argue points on theoretical or common sense grounds only. There IS hard data. And this book is so beautifully and sensibly written that it is both hard to put down and enormously enlightening. With far more hard data becoming available in genetics, epigenetics, proteomics, and microbiology, there is little time for emotional or pseudo-common sense arguments. Hard data and objective reason are where answers in science come from -- not from clashes of uninformed opinion. The way things ARE is the way they ARE. And the more energy spent on discovering that, the less energy is spent in arguing over how many evolutionists it takes to screw in a light bulb. (Two, but it takes a rather large light bulb.)

The concept that is extended involves the epigenetic "tweaking" of immense gene networks in ‘superorganisms’ that solve problems through the exchange and the selective cancellation and modification of signals. In every other species, we know that nutrient chemicals epigenetically affect intracellular signaling and stochastic gene expression and that pheromones do this also (even in avian species).

Nutrient chemicals are required for individual survival and their metabolism to pheromones controls reproduction. If their epigenetic effects on stochastic gene expression were not responsible for de novo gene expression (e.g., for new odor receptors) and for remaindered pseudogenes, we would have nothing but a theory of random mutations to explain species diversity, which is obviously dependent on nutrition and species-specific pheromones for ecological, social, neurogenic, and socio-cognitive niche construction. Indeed, until now we have had only a theory to compare to the biological facts of evolved gene, cell, tissue, organ, organ system reciprocity, which is obviously due to the epigenetic "tweaking" of immense gene networks by nutrient chemicals and pheromones. The beak of the finch was genetically predisposed by the nutrient chemicals available to its ancestors, and the metabolism of those chemicals to pheromones determined species survival via controlled reproduction. If that means we must now (e.g., finally) give up on the notion that avian species are primarily visual or auditory, perhaps even more rapid progress can be made when it is fully realized that species from microbes to man exist because the epigenetic effects of nutrient chemicals and pheromones allowed for the genetic diversity we now see, which was not due merely to random mutations. And, if the diversity is not due to nutrient chemicals and their metabolism to the pheromones that control reproduction, is there a model for that diversity?

Thank you very much for this contribution, in particular including the direct link to the ENCODE portal. This is surprisingly intuitive to use, contains direct access to an enormous amount of information and is incredibly fast. Looking at the data I wonder a little at the conclusion that there is direct evidence for functionality of 80% of the genome. However, looking at evolutionary conservation between species and which regions have been allowed to undergo high random mutation/deletion/insertions and which regions have not, the conclusion would be expected.

Darwin could not have argued for transgenerational epigenetic inheritance of behaviors associated with the obvious requirements of nutrient chemical-dependent reproduction controlled by pheromones in species from microbes to man, including his pigeons. Similarly, a 1995 book on finches could not have included anything about the importance of olfactory/pheromonal input to avian behavior. Thus, the two most necessary requirements for adaptive evolution via ecological, social, neurogenic, and socio-cognitive niche construction have not been included in discussions of genetics or evolved behaviors. A 1995 book on human pheromones would provide an excellent source of citations that bring us current to information available on genomic adaptation. See, for example, The Scent of Eros: Mysteries of Odor in Human Sexuality (1995/2002) as a guide to what has recently been detailed about adaptive evolution in the context of epigenetic effects on genetic predispositions that include genetically predisposed behaviors.

What a wonderful leap forwards and what a jaw-dropping vista the new data affords. I can with even more confidence now tell my children that a career in genetics will sustain not only them, but also their children and grandchildren, for it seems likely that it will take at least three generations of scientists to unravel the full complexity. The engineer in me says that we end up with a massive master circuit diagram, full of nested loops and strange counter-intuitive circuit topologies. Stability analyses will have been performed against both internal and externally-induced perturbations.

Once this level of understanding is reached (the poles and zeroes if you will), we can then re-engineer the whole thing. It is almost certain that we can do a better job post hoc than the accretive methods used by Nature.

There has been a lot of sensational crap written about pheromones in my lifetime. As a consequence of my having read so much garbage, I welcome any source of information that provides hard evidence and resists the temptation of the one reporting that evidence to color it with his/her personal psychological quirks and biases.

In science, curiosity together with an open mind is fertile ground for some progress. Unfortunately many scientists are human (:>).

I have nothing of hard substance to offer on the subject of pheromones in relation to selection.

You've awakened an interest in me in this regard.

We know that many mammal offspring become imprinted on their mother's unique odor, such that no other mother, even of the same species or even of the same family, is that one mother.

Evidently there are similarities in odor of family members as opposed to other members of one's own species. Interestingly, studies of the consequences of selection indicate that incest is not conducive to higher fitness; so, in a single breath, we may be tempted to explain bonding to mother and siblings and yet preferences for non-family sexual partners.

You speak of "the" socioaffective nature of evolved behaviors. Wouldn't it be better to speak of "varieties of" instead of "the"?

Of all the nonsense written about evolution, one of the most common fallacies is "extension of a phenomenon" or "generalization" from one species to another.

One of the most interesting characteristics of nurturing scenarios and sexual selection scenarios is how varied are these mechanisms from species to species.

Perhaps you are as amused as am I at some of the offerings of things chimpanzees do, for example, in explaining things humans do.

I certainly want to read the study you refer to, if it demonstrates what members of specific groups sharing certain socioaffective behaviors in specific individuals or groups of certain species, in certain locations, or under specific circumstances. I am skeptical as to whether it can be extended or found general among other species.

So that's what has driven human evolution from a common ancestor with the honey bee?

Eureka!

If one person thinks he is the Joker, and shoots up the audience in a movie theater, and another donates a kidney to someone he is not even kin to, that is because the ancestors of each got a whiff of a different school of pheromonical impetus.

Gosh!

And all along I've been suffering under the illusion that the explanation of the varieties of human behavior was more complex than that.