ENCODE finds the human genome to be an active place

The ENCODE consortium performs an exhaustive analysis of one percent of the …

A paper that will appear in today's edition of Nature starts off with a bang, its first sentence being "The human genome is an elegant but cryptic store of information." The paper's goal is nothing less than decrypting as much as we can about one percent of that genome, in the expectation that it will serve as an accurate model of the remaining 99 percent. It's an audacious and very satisfying piece of work; my biggest qualm about it comes from the accompanying press releases, which suggest that we're going to see some bafflingly incoherent media coverage of the findings.

ENCODE stands for Encyclopedia of DNA Elements, and it is a multi-institutional consortium dedicated to finding out what our DNA is up to. As a first step, 30 megabases of DNA from 44 different locations in the genome were subjected to roughly 200 forms of biochemical and computational analysis. These methods explored RNA production, DNA packaging, and other aspects through several independent assays, providing a fair degree of confidence in the results.

A genome full of pervasive transcription

The big surprise in this work is that the genome is pervasively made into RNA. Although the view that RNA's primary function is to code for proteins went by the wayside with the discovery of various forms of regulatory RNA, the regulatory RNAs still fit into the paradigm of consistent and discrete RNA production. The new study finds that essentially every base in the genome shows up in RNA at one point or another. This is despite the fact that most of these bases aren't doing anything: 95 percent of the genome isn't under selective pressure, and most of that 95 percent doesn't appear functional in an evolutionary sense.

The data indicate that the process of copying DNA into RNA, called transcription, is fundamentally noisy. The transcription factors that tell the cell where to start making RNAs are rather promiscuous, often having affinities for DNA sequences that will appear at random every 1,000 bases. They also act against a background where the packaging of DNA in the cell, which determines their access to such sequences, is dynamically changing. Apparently, wherever these changing conditions allow, transcription will start.
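The "every 1,000 bases" figure can be checked with some back-of-the-envelope arithmetic. The sketch below assumes, purely for illustration, that a factor recognizes a five-base sequence and that all four bases occur independently and with equal frequency; neither assumption comes from the paper, and the motif "GATTC" and the function names are made up for the example. Under those assumptions, a given five-base sequence is expected about once every 4^5 = 1,024 bases, and a quick simulation on random sequence lands in the same neighborhood.

```python
import random


def expected_spacing(motif_length: int) -> int:
    """Average number of bases between chance occurrences of one specific
    motif, assuming four equally likely, independent bases (an idealization)."""
    return 4 ** motif_length


def random_sequence(length: int, seed: int = 0) -> str:
    """Build a random DNA string with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def count_occurrences(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


if __name__ == "__main__":
    motif = "GATTC"  # hypothetical binding site, chosen only for illustration
    seq = random_sequence(1_000_000)
    hits = count_occurrences(seq, motif)
    print("expected spacing:", expected_spacing(len(motif)))
    print("observed spacing:", len(seq) / max(hits, 1))
```

Real genomes are not uniform random sequence, so this only shows why a short recognition site cannot, by itself, pick out a handful of locations in a genome billions of bases long.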

This suggests that regulatory elements around genes act less to specifically start transcription there and more to make transcription at the gene more probable than the general background of RNA noise. That said, the study also found clear signals of defined transcription start sites at a rate of nearly ten times the number of genes in the area, suggesting some aspects of the excess transcription are nonrandom. Other examples of extra transcribed bases result from a transcription stopping mechanism that also appears to be noisy. In several cases, RNAs were found that started in one gene and plowed straight through to the next one down the chromosome.

The study did find one factor that might explain the distribution of all the extra sites of transcription when it looked into the process of duplicating the DNA prior to cell division. Areas where this process starts appear to have less compact DNA, which favors transcript initiation as well.

If there's a weakness to this study, it's that it used cells that grow rapidly in culture instead of samples from normal tissue. These cultured cells generally have origins in cancer and so may have aberrant control of a range of cellular processes. Hopefully, the study can be repeated using cells that are a better approximation of normal.

Where does this leave us?

There seem to be three possible interpretations for all these extra transcripts. One is that, even though we haven't detected a biological function, and evolution doesn't conserve them, they are actually specifically functional. This would be the "there is no junk DNA" take on matters. The opposite extreme would be an "it's all junk" view of it. From this perspective, the starting and stopping of transcription is just an inherently noisy process and doesn't do humans enough harm to create a selective pressure to improve it.

Somewhere between the two would be the view that few of these extra transcripts are useful in themselves, but it's useful having them present on the collective level. Reasons could include anything ranging from excess RNA performing some sort of structural function through to the random transcripts being a rich source of new genes.

Personally, I fall into the "it's all junk" end of the spectrum. If almost all of these sequences are not conserved by evolution, and we haven't found a function for any of them yet, it's hard to see how the "none of it's junk" view can be maintained. There's also an absence of support for the intervening view, again because of a lack of evidence for actual utility. The genomes of closely related species have revealed very few genes added from non-coding DNA, and all of the structural RNA we've found has very specific sequence requirements. The all-junk view, in contrast, is consistent with current data. We've wondered for decades how transcription factors can act specifically and at long distances despite their relatively weak specificity for DNA. These data answer that question simply: they don't.

Feeding the data to the press

All of this brings me back to the press release, which has set my blood boiling nearly every time I read it. It basically takes the hardcore "none of it is junk" view but then undercuts its own arguments. It states that "the new data indicate that the genome contains very little unused sequences; genes are just one of many types of DNA sequences that have a functional impact." What is that functional impact? They have no idea: "many species' genomes contain a pool of functional elements that provide no specific benefits in terms of survival or reproduction."

It looks like they're choosing to define functional as "made into RNA," even though they recognize that much of the DNA that is made into RNA clearly has no influence on survival or fitness. They're then using that skewed definition to claim the data show that most of the genome is functional. Since most of the popular press produces accounts based on the press release, the public is going to be receiving a very distorted view of this work.