THE EVOLUTION LIST

THE EVOLUTION LIST is a forum for commentary, discussion, essays, news, and reviews that illuminate the theory of evolution and its implications in original and insightful ways. Unless otherwise noted, all materials may be quoted or re-published in full, with attribution to the author and THE EVOLUTION LIST. The views expressed herein do not necessarily reflect those of Cornell University, its administration, faculty, students, or staff.

According to Mats Ljungman, a researcher at the University of Michigan Medical School, as many as 20,000 lesions occur daily in a cell’s DNA. To repair all this continual damage, how does the cell first detect it? Ljungman’s research identified the logical candidate – RNA polymerase (the machine that reads the DNA and makes an RNA copy). Apparently, whenever the RNA polymerase encounters a lesion, it signals to p53, a master protein that activates all sorts of DNA repair processes.

According to the press release:

“These two proteins are saying, ‘Transcription has stopped,’” says Ljungman. These early triggers act like the citizen who smells smoke and sounds a fire alarm, alerting the fire department. Then p53, like a team of fire fighters, arrives and evaluates what to do. To reduce the chance of harmful mutations that may result from DNA damage, p53 may kill cells or stop them temporarily from dividing, so that there is time for DNA repair.

Recently, the ENCODE consortium determined that the majority of DNA in the human genome is transcribed:

This broad pattern of transcription challenges the long-standing view that the human genome consists of a relatively small set of discrete genes, along with a vast amount of so-called junk DNA that is not biologically active.

Of course, one could also argue that all this transcription simply speaks to the sloppy and wasteful nature of the cell. Yet here’s a thought. It would seem to me that Ljungman’s research now raises a third possibility: all that transcription is just another layer of error surveillance.

That is a VERY interesting hypothesis. It could work like this: by incorporating large amounts of transcribed (but not translated) DNA into the human genome, the cell is essentially presenting a much larger "target" for mutation detection by the p53 surveillance system. In essence, a cell that has been especially challenged by mutation-producing processes would be much more likely to send out the "fire alarm," since it would be much more likely to have transcription terminated, thereby triggering the p53 "stopped transcription" alarm. To extend the "fire alarm" analogy, imagine a house that is unusually likely to have a fire; perhaps it's very hot, or dry, or has smoldering fires in several locations. As the old saying goes, "where there's smoke, there's fire," and a fail-safe cancer/mutation detection system would be much more likely to detect potential "hot" cells if there were a large amount of transcription going on.

Indeed, this would be most important in cells in which relatively little transcription of functional (i.e. protein-encoding) genes normally takes place, but which are still subject to mutation and potential cancer induction. By running the "non-coding transcription" program constantly in the background, such cells could still alert the cancer/mutation surveillance system, even when they themselves aren't actively coding for protein.

Now, since transcription is itself a costly process, doing a lot of it for non-coding genes would also be costly. Cells would therefore be selected via a cost-benefit process for the amount of non-coding "surveillance transcription" they could do. That is, the more likely a cell/organism is to have a cancer/mutation event, the more valuable its non-coding/surveillance transcription system would be, and therefore the more non-coding DNA it should have. This immediately suggests a possible test of the hypothesis: those cells (or organisms) that are more likely to suffer from cancer/mutation events should have more non-coding "surveillance transcription" DNA sequences.
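The cost-benefit argument above can be sketched as a toy probability model (a minimal sketch; all numbers are hypothetical and chosen only for illustration). If lesions land uniformly in the genome, a lesion raises the "fire alarm" only when it falls in transcribed sequence, so with n independent lesions and transcribed fraction f, the chance of sounding the alarm at least once is 1 - (1 - f)^n. Subtracting a linear transcription cost shows why surveillance transcription would pay off only in lesion-prone cells:

```python
# Toy cost-benefit model for "surveillance transcription".
# All fitness/cost numbers are hypothetical, for illustration only.

def detection_prob(f, n):
    """P(at least one of n lesions stalls a polymerase), given that a
    fraction f of the genome is transcribed and lesions land uniformly."""
    return 1.0 - (1.0 - f) ** n

def net_benefit(f, n, benefit=1.0, cost_per_unit=1.2):
    """Fitness benefit of sounding the alarm, minus a transcription cost
    proportional to the transcribed fraction f."""
    return benefit * detection_prob(f, n) - cost_per_unit * f

# A cell under heavier mutational assault gains more from the alarm:
for n in (1, 5, 20):
    print(f"lesions = {n:2d}: detect = {detection_prob(0.5, n):.3f}, "
          f"net benefit = {net_benefit(0.5, n):+.3f}")
```

With these (hypothetical) parameters, transcribing half the genome is a net loss for a cell expecting one lesion but a clear net gain at five or twenty, which is the qualitative prediction made above: lesion-prone cells and organisms should carry more surveillance transcription.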

For example, since animals are much more likely to be harmed by uncontrolled cell division (i.e. cancer, induced by mutation), then one would predict that animals would have more non-coding/surveillance transcription sequences than, say, plants. Also, animals that live longer (and would therefore have a larger "window" for suffering mutations), should also have relatively large amounts of non-coding/surveillance transcription sequences.

The old C-value paradox may have some relevance here. Does the amount of non-coding/surveillance transcribed sequence correlate with the total amount of non-coding sequence? For example, do puffer fish have fewer non-coding transcribed sequences than zebrafish, or do they have the same amount of transcribed DNA, with the difference in genome size being due to non-coding, non-transcribed sequence?

ENCODE's data would seem to argue for a close correlation between total genome size and the amount of transcribed non-coding sequence. If that observation is generally applicable to other organisms, then C-value might be one way to test MikeGene's and Allen's hypotheses. The idea that transcription of non-coding DNA is another layer of mutation detection/error correction would imply that organisms with larger genomes have more mutation-detection capability. Do animals with smaller genomes require less error detection because they live in less mutagenic environments? The dramatic differences in genome size among related organisms that live in similar environments would seem to argue against that hypothesis. Compare the genome sizes of freshwater pufferfish and zebrafish, both of which live in freshwater streams, or look at the variation in genome size among salamanders of the genus Plethodon.

You can also test Allen's lifespan hypothesis. For example, zebrafish and small tetras with lifespans of 2 or 3 years have approximately the same genome size as common carp with lifespans of 20+ years.

One of the ID supporters on the list then challenged me to explain how such a complex error-surveillance system could have evolved via non-directed natural selection. This was my reply (Nota bene: the following is, of course, a HYPOTHESIS only):

Consider two virtually identical phylogenetic lines, A and B. At time zero, individuals in both lines start out with virtually no transcribable but non-coding DNA (abbreviated TNCDNA). If we assume a constant mutation rate for both lines, individuals in both lines would have essentially the same probability of dying from cancer.

Assume further that, over time, sequences of non-TNCDNA accumulate in the genomes of each line. This can happen by any one (or more) of several known mechanisms, such as gene duplication (without active promoter sequences), random multiplication of tandem repeats, retroviral or transposon insertions of non-TNCDNA, etc.

Then, at time one, an individual (or more than one) in line B has an active promoter inserted in front of one or more of its non-TNCDNA sequences in one or more of its cells, by the same mechanisms listed above. Now, such individuals have a lower probability of dying from cancer, since their p53-regulated surveillance systems would be more likely to eliminate the affected cells. Again, this would be a side-effect of the larger "mutation sponge" their cells would present to potentially mutagenic processes. Such individuals would therefore have more descendants, and over time the average size of all of the "mutation sponges" in the subsequent populations would increase. Natural selection in action, folks.
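The two-lineage scenario above is just standard selection arithmetic, and it can be sketched in a few lines (a minimal deterministic sketch; the fitness values are hypothetical, standing in for "line B individuals die of cancer somewhat less often"):

```python
# Deterministic haploid selection sketch for the "mutation sponge" scenario.
# Fitness values are hypothetical illustrations, not measured quantities.
# Line B carries transcribed-but-untranslated DNA, so stalled polymerases
# trigger the p53 alarm more often and fewer line-B individuals die of cancer.

W_A = 0.95   # relative fitness of line A (no surveillance transcription)
W_B = 1.00   # relative fitness of line B (larger "mutation sponge")

def next_freq(p):
    """One generation of selection on the frequency p of line B."""
    mean_w = p * W_B + (1 - p) * W_A
    return p * W_B / mean_w

p = 0.01     # line B starts as a rare variant
history = [p]
for _ in range(200):
    p = next_freq(p)
    history.append(p)

print(f"frequency of line B after 200 generations: {p:.4f}")
```

Even a modest 5% survival advantage carries the rare variant from 1% to near fixation in a couple of hundred generations, which is all the argument above requires.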

Now, as to the question of where the p53 surveillance system came from in the first place, proteins like p53 are common intermediates in intracellular signalling systems. Assume that the ancestor of p53 was a protein with some other signalling function. At some point, an individual that had p53 doing that other function has a mutation that changes the shape of p53 in such a way that it becomes part of a regulatory pathway that triggers apoptosis, thereby eliminating the cell. If the altered p53 no longer participates in the original pathway, and if that alteration is damaging, such individuals would be eliminated, and the original function of p53 would be preserved.

However, if the altered p53 (now participating in the regulation of apoptosis) were also activated by the cells' normal "transcription termination signalling system" as described in Mike's original post, then individuals with the altered p53 would be less likely to die from cancer, and their descendants (who now produce the altered form of p53) would become more common over time.

Mike's original post notes that the research report cited the relatively recent observation that many cells actually suffer multiple mutations much of the time. This is precisely the situation that Darwin originally stated was a prerequisite for natural selection: not genetic mutations (Darwin didn't know about them), but increased heritable variation (which Darwin couldn't explain, but could point to as an observable phenomenon in living organisms). In other words, as both EBers and IDers point out, phenotypic variations are very, very common, and so are the genetic changes with which they are correlated. Most of these variants are either selectively neutral (cf. Kimura), nearly neutral (cf. Ohta), or deleterious to some degree. Such changes either accumulate (if they are neutral or nearly so) or are eliminated (if they are deleterious).

But, on those relatively rare occasions when they result in increased relative survival and reproduction, they increase in frequency in the populations in which they exist. This process of "natural preservation" (Darwin's preferred name for the process he and Alfred Russel Wallace proposed as the primary mechanism for descent with modification) results in the accumulation of both neutral and beneficial characters and the elimination of deleterious ones.

And by the way, the foregoing is why Darwin (and not Edward Blyth) is credited with the concept of "natural selection/preservation": Blyth only described the elimination of deleterious characters, and never realized that the preservation of beneficial characters could result in the origin of adaptations. Blyth, in other words, only recognized what EBers call "stabilizing selection," but missed the much more interesting and important "directional selection," which Darwin cited as the causal basis for the evolution of adaptations.

The Gene Is Dead: Long Live The Gene!

In a previous post (New Definitions Of A Gene), I discussed new ideas of what genes might be according to recent discoveries in genetics and genomics. Now comes the absolutely stunning news that between 74% and 93% of the typical mammalian genome is transcribed into RNA, but not translated. This DNA accounts for almost all of what was recently referred to as "junk DNA." This discovery has shaken some of the fundamental principles of genetics, and promises to do even more to the underlying assumptions of neo-Darwinian evolutionary theory.

In particular, the "neutral theory" of Motoo Kimura and the "nearly neutral theory" of Tomoko Ohta may need to be extensively revised, if not entirely replaced. These theories are based on the assumption that the vast majority of the DNA of most organisms, especially eukaryotes, is selectively neutral (i.e. is not acted upon by natural selection). Furthermore, central to these theories is the idea that the neutrality of most of the genome is the result of its not being transcribed or translated into protein (and therefore ultimately into some component of organisms' phenotypes). However, if most of this DNA is transcribed, but not translated, then these theories (which form part of the foundation of current neo-Darwinian evolutionary theory) will probably have to be revised, or even jettisoned.

Here is the text of the entire article. Pay particular attention to the various hypotheses presented for what all that transcribed but not translated RNA is doing in cells. This discovery opens up a huge new area of research, and seriously undermines the estimate of the number of "genes" mapped by the Human Genome Project:

When scientists unveiled a draft of the human genome in early 2001, many cautioned that sequencing the genome was only the beginning. The long list of the four chemical components that make up all the strands of human DNA would not be a finished book of life, but a road map of an undiscovered country that would take decades to explore.

Only 6 years later, the landscape of the genome is already proving to be dramatically different than most scientists had expected.

The established view of the genome began to take shape in 1958, just 5 years after Francis Crick and James D. Watson worked out the structure of DNA. In that year, Crick expounded what he called the "central dogma" of molecular biology: DNA's genetic information flows strictly one way, from a gene through a series of steps that ends in the creation of a protein. That principle developed into a modern orthodoxy, according to which a genome is a collection of discrete genes located at specific spots along a strand of DNA. This old view got the basics right: that genes encode proteins and that proteins do the myriad work necessary to keep an organism alive.

Researchers slowly realized, however, that genes occupy only about 1.5 percent of the genome. The other 98.5 percent, dubbed "junk DNA," was regarded as useless scraps left over from billions of years of random genetic mutations. As geneticists' knowledge progressed, this basic picture remained largely unquestioned. "At one time, people said, 'Why even bother to sequence the whole genome? Why not just sequence the [protein-coding part]?'" says Anindya Dutta, a geneticist at the University of Virginia in Charlottesville.

Closer examination of the full human genome is now causing scientists to return to some questions they thought they had settled. For one, they're revisiting the very notion of what a gene is. Rather than being distinct segments of code amid otherwise empty stretches of DNA—like houses along a barren country road—single genes are proving to be fragmented, intertwined with other genes, and scattered across the whole genome.

Even more surprisingly, the junk DNA may not be junk after all. Most of this supposedly useless DNA now appears to produce transcriptions of its genetic code, boosting the raw information output of the genome to about 62 times what genes alone would produce. If these active nongene regions don't carry code for making proteins, just what does their activity accomplish?

"What we thought was important before was really just the tip of the iceberg," says Hui Ge of the Whitehead Institute for Biomedical Research in Cambridge, Mass.

With the genome sequence in hand, exploration has moved at a brisk pace during the past 6 years. A milestone was reached in June, when a project called the Encyclopedia of DNA Elements (ENCODE) thoroughly mapped the functional regions in 1 percent of the human genome. The effort involved was staggering: Thirty-five teams of scientists from around the world worked for 4 years and compiled more than 600 million data points, the consortium reported in the June 14 Nature.

From the accumulating mountains of data, scientists are building a new picture of how the genome works as a whole. They have found mutations in nongene regions of DNA that are linked to common diseases such as diabetes and forms of cancer. And some researchers propose that DNA once labeled junk could have spawned the complex bodies of higher organisms—even the complexities of the human brain.

Second Fiddle To A Superstar

In the emerging picture of the genome's functioning, many of the key elements identified so far are molecules of RNA, a chemical cousin of DNA.

In the old central dogma, RNA had a strictly subservient role in the all-important task of making proteins. An RNA molecule is made from units of genetic code strung together, much like DNA. But while DNA has two strands twisted together into a double helix, RNA usually has only a single strand.

Protein synthesis begins when the two strands of a section of DNA unzip. Units of RNA then pair up with their counterparts on one of the DNA strands, forming a complementary messenger RNA (mRNA) molecule. The mRNA detaches and floats off to other parts of the cell, where it hooks up with machinery that translates its coded message into a protein.

If RNA's only job were making proteins, then nearly all the RNAs produced in cells should be transcripts of protein-coding genes. (A small fraction of RNAs serve in the protein-synthesis machinery.) But in 2005, Jill Cheng and her colleagues at Affymetrix, a genomics company in Santa Clara, Calif., showed that less than half of the RNA produced by 10 of the chromosomes in human cells represented transcripts of traditional genes. In the team's experiments, 57 percent of the RNA was transcribed from noncoding, "junk" regions.

The results from ENCODE were even more striking. In the slice of DNA studied in that project, between 74 percent and 93 percent of the genome produced RNA transcripts. What becomes of this tremendous output is uncertain. John M. Greally of the Albert Einstein College of Medicine in New York says it's likely that some portion of it is made accidentally and simply discarded. But the discovery that so much of the genome is being transcribed into RNA underscores how out-of-date the central dogma has become.

Indeed, the closer researchers look, the more functions they find that RNA transcripts perform. An alphabet soup of new acronyms describes the newfound roles of RNAs. First there were small nuclear RNAs (snRNAs) and small nucleolar RNAs (snoRNAs), both of which reside inside the nucleus and help control production of other RNAs. These were joined by microRNAs (miRNAs) and short interfering RNAs (siRNAs), which can modulate the activity of protein-coding genes. In mice, about 34,000 of the RNA transcripts produced by the genome are nonprotein-coding, outnumbering the roughly 32,000 transcripts that code for proteins, according to a 2005 study by an international group of scientists called the Functional Annotation of Mouse Consortium.

These new families of RNAs add a layer of regulation that fine-tunes the production of proteins. While scientists already knew that some proteins influence the activity of other genes, "there are many more RNAs than proteins that play a regulatory role," Ge says.

Gene regulation may not sound sexy, but it's a powerful way for a cell to evolve complex behaviors using the tools—proteins—that it already has. Consider the difference between a one-bedroom bungalow and an ornate, three-story McMansion. Both are made from roughly the same materials—lumber, drywall, wiring, plumbing—and are put together with the same tools—hammers, saws, nails, and screws. What makes the mansion more complex is the way that its construction is orchestrated by rules that specify when and where each tool and material must be used.

In cells, regulation controls when and where proteins spring into action. If the traditional genome is a set of blueprints for an organism, RNA regulatory networks are the assembly instructions. In fact, some scientists think that these additional layers of complexity in genome regulation could be the answer to a long-standing puzzle.

Genome As Network

The biggest surprise in the first sequence of the human genome was how few protein-coding genes it contained.

"We humans do not have that many more genes than simpler organisms like flies or mice," Ge says. Earlier guesses of the number of genes in humans ran as high as 100,000, but the published sequence in fact contained only about 23,000. That's not much more than the roughly 21,000 genes possessed by the roundworm, a microscopic creature without a brain. If protein-coding genes are the only functional elements in an organism's DNA, where does the extra information come from that's needed to assemble and operate the complex bodies and brains of people, as compared with the simplicity of roundworms? "If we just look at the number of genes, it doesn't make sense," Ge says.

While the number of genes isn't much different in roundworms and people, the human genome is 30 times the size of the roundworms'. People have a much larger quantity of DNA beyond what codes for proteins. Since much of this "junk" DNA is being transcribed into RNA, perhaps it's responsible for much of the complexity of human bodies and brains. In fact, organisms simpler than roundworms, such as single-celled bacteria, carry little noncoding DNA and may have no regulatory RNA at all.

"Scientists have been suspecting that it is the regulatory networks that lead to this amazing complexity" in higher organisms, Ge says.

John S. Mattick of the University of Queensland in Brisbane, Australia, points to a known example of the importance of regulatory RNAs: their crucial role in fetal development. For example, most multicellular animals possess a gene called Notch that helps guide neural development. While the gene itself has much the same form in both simple and complex animals, its activity is regulated by miRNAs that are highly variable from one animal to another. Such miRNAs also influence a gene called Hox, which acts in many animals to define a fetus' body axis and the placement of its limbs.

What's more, the changes that distinguish human brains from those of chimpanzees and other apes could be due in part to evolutionary changes in RNAs that don't encode proteins. A group led by Katherine S. Pollard of the University of California, Davis, identified DNA sequences shared by people and chimpanzees, but with large differences, meaning that they have evolved rapidly since the two species shared a common ancestor.

The researchers found that one of these sequences is a noncoding region of DNA that's related to brain function, they reported in the Sept. 14, 2006 Nature. Pollard and her colleagues speculate that this region produces a regulatory RNA and that changes in this RNA contributed to the evolution of the human brain.

With regulatory RNAs appearing to play such an instrumental role in animal development, it's no surprise that scientists are finding disease-associated mutations in regions of the genome formerly regarded as junk.

David Altshuler of the Broad Institute in Cambridge, Mass., and his colleagues looked for DNA mutations in 1,464 patients with type 2 diabetes. Three of the mutations that correlated with the disease were in DNA segments that don't code for proteins, the team reported in the June 1 Science. Other scientists have found mutations in noncoding DNA that link to diseases such as autism, breast cancer, lung cancer, prostate cancer, and schizophrenia.

To be sure, the specific functions of most of the noncoding DNA remain unknown. Projects such as ENCODE have focused on identifying the broad functional categories for active regions of the genome without working out the specific cellular function of each transcript, a task that will take biologists years, if not decades.

In fact, scientists debate whether some fraction of the genome's copious RNA output might do nothing at all. It may simply be that once the cellular machinery that transcribes DNA into RNA gets started, it sometimes doesn't know when to stop. On the other hand, making lots of RNA that does nothing would be a waste of a cell's energy. That's something that natural systems tend to avoid, so the fact of its production argues for at least some of this RNA being biologically active.

The Gene Is Dead

In the old view, each gene sat in splendid isolation on its segment of the genome. Other genes might be nearby, but scientists assumed that they didn't overlap each other.

Now it's clear that a single length of DNA can be transcribed in multiple ways to produce many different RNAs, some coding for proteins and others constituting regulatory RNAs. By starting and stopping in different places, the transcription machinery can generate a regulatory RNA from a length of DNA that overlaps a protein-coding gene. Moreover, the code for another regulatory RNA might run in the opposite direction on the facing strand of DNA. According to the ENCODE project results, up to 72 percent of known genes have transcripts on the facing DNA strand as well as the main strand.

"The same sequences are being used for multiple functions," says Thomas R. Gingeras of Affymetrix. That introduces complications into the evolution of the genome, which had until recently been assumed to act through single DNA mutations affecting single genes. Now, "a mutation in one of those sequences has to be interpreted not only in terms of [one gene], but [of] all the other transcripts going through the region," Gingeras explains.

The implications of this single mutation–multiple consequence model are still a matter of debate. In some cases, the RNA transcripts from DNA that overlaps a protein-coding gene regulate that same gene, so a mutation could affect both the structure and the regulation of a protein. But often, those transcripts regulate genes that are far away, or even on different chromosomes. This complex interweaving of genes, transcripts, and regulation makes the net effect of a single mutation on an organism much more difficult to predict, Gingeras says.

More fundamentally, it muddies scientists' conception of just what constitutes a gene. In the established definition, a gene is a discrete region of DNA that produces a single, identifiable protein in a cell. But the functioning of a protein often depends on a host of RNAs that control its activity. If a stretch of DNA known to be a protein-coding gene also produces regulatory RNAs essential for several other genes, is it somehow a part of all those other genes as well?

To make things even messier, the genetic code for a protein can be scattered far and wide around the genome. The ENCODE project revealed that about 90 percent of protein-coding genes possessed previously unknown coding fragments that were located far from the main gene, sometimes on other chromosomes. Many scientists now argue that this overlapping and dispersal of genes, along with the swelling ranks of functional RNAs, renders the standard gene concept of the central dogma obsolete.

Long Live The Gene

Offering a radical new conception of the genome, Gingeras proposes shifting the focus away from protein-coding genes. Instead, he suggests that the fundamental units of the genome could be defined as functional RNA transcripts.

Since some of these transcripts ferry code for proteins as dutiful mRNAs, this new perspective would encompass traditional genes. But it would also accommodate new classes of functional RNAs as they're discovered, while avoiding the confusion caused by several overlapping genes laying claim to a single stretch of DNA. The emerging picture of the genome "definitely shifts the emphasis from genes to transcripts," agrees Mark B. Gerstein, a bioinformaticist at Yale University.

Scientists' definition of a gene has evolved several times since Gregor Mendel first deduced the idea in the 1860s from his work with pea plants. Now, about 50 years after its last major revision, the gene concept is once again being called into question.

Prasanth, K.V., and D.L. Spector. 2007. Eukaryotic regulatory RNAs: An answer to the 'genome complexity' conundrum. Genes and Development 21(Jan. 1):11-42. Available at http://www.genesdev.org/cgi/content/full/21/1/11.

The ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(June 14):799-816. Available at http://dx.doi.org/10.1038/nature05874.