Wednesday, March 09, 2016

A 2004 kerfuffle over pervasive transcription in the mouse genome

The first drafts of the human genome sequence were published in 2001. There was still work to do on "finishing" the sequence but a lot of the International Human Genome Project (IHGP) team shifted to work on the mouse genome. The FANTOM Consortium and the RIKEN Genome Exploration Groups (I and II) published an analysis of mouse transcripts in December 2002.

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 ‘transcriptional units’, contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense–antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.

I haven't shown the complete list of authors. Some of the others are members of the Mouse Genome Sequencing Consortium. There's an overlap between the authors of this 2002 paper and the ENCODE papers that were published in 2007 and 2012 (e.g. Ewan Birney). The significance of this overlap will become clear.

Okazaki et al. begin their paper by noting that the total number of genes in the human genome is still unknown. They point to the fact that parts of the human genome (notably chromosome 21) are pervasively transcribed. This suggests, according to them, that there may be many more genes yet to be discovered.

One significant class of ‘genes’ missing from the existing genome annotation are those that give rise to non-protein-coding RNAs. Non-coding RNAs, although not highly transcribed, constitute a major functional output of the genome. In addition to their role in protein synthesis (ribosomal and transfer RNAs), non-coding RNAs have been implicated in control processes such as genomic imprinting and perhaps more globally in control of genetic networks.

The authors set out to define the complete "transcriptome" of the mouse in order to discover new genes. They constructed and analyzed a set of 60,770 expressed sequence tags (ESTs). ESTs are cloned complementary DNA (cDNA) fragments copied from purified mouse RNA molecules. These were clustered into 33,409 transcription units (TUs). Many, but not all, of the TUs are based on multiple overlapping ESTs. The average length of a TU is 1,970 bp (1.97 kb). These are potential genes.

There's a bit of a problem with definitions since Okazaki et al. define a TU as "a unit of genetic information transcribed into mRNA." Presumably, this reflects their belief that all ESTs represent messenger RNAs because the cDNAs were derived from poly A+ RNA. However, it's clear that some of these potential genes may specify functional RNAs that are not translated (= noncoding RNAs).

The results are described in the abstract. After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."
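The split is easy to check from the two counts quoted above. Here's a minimal sketch; the variable names are made up for illustration and nothing here comes from the paper beyond the two numbers.

```python
# Counts quoted in the post from Okazaki et al. (2002).
total_tus = 33_409    # all transcriptional units
coding_tus = 17_594   # TUs with significant coding potential

# Everything else is a candidate noncoding TU.
noncoding_tus = total_tus - coding_tus
coding_pct = 100 * coding_tus / total_tus
noncoding_pct = 100 * noncoding_tus / total_tus

print(f"candidate noncoding TUs: {noncoding_tus:,}")
print(f"coding: {coding_pct:.1f}%  noncoding: {noncoding_pct:.1f}%")
```

The remainder comes out to the 15,815 TUs mentioned above, and the percentages land within a point of the rough 52/48 split.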

It would be interesting to know whether all 17,594 potential protein-coding genes have been confirmed by subsequent analysis over the past 14 years. I doubt it very much since the total number of mouse protein-coding genes is now estimated to be about 20,000 and most of them are not expressed in the tissues analyzed by the FANTOM Consortium.

Wang et al. (2004) decided to look at the 15,815 TUs that could potentially be genes for noncoding RNAs. Remember that there is considerable debate over the number of genes for functional noncoding RNAs. The ENCODE Consortium claims that pervasive transcription of the human genome indicates that it is full of such genes and they play a key role in regulating the expression of protein-coding genes.

Back in 2004, there were just as many skeptics as there are today. Wang et al. examined the potential genes to see if they were any more conserved than random intergenic DNA sequences in the mouse genome. If they represent randomly transcribed regions of the genome that produce junk RNA by accidental transcription, then you expect them to be evolving at the neutral rate, whereas if they really are genes for functional RNAs, then you expect them to show evidence of sequence conservation.1

Wang et al. compared all 33,409 TU sequences to the rat and human genomes. They divided the dataset into the same four categories used by Okazaki et al. and plotted the results as percentage of cDNAs (Y-axis) vs percentage sequence identity (X-axis) for rat (left) and human (right).

The purple lines represent typical conserved coding regions of homologous genes in the two species. This is a positive control. The other positive control is the small set of all known mouse genes for functional RNAs (brown line). The yellow line represents random intergenic sequences that are presumably junk DNA. They should be evolving at the neutral rate.

The other lines represent sequences of cDNAs from the FANTOM2 database as follows ...

Red: coding cDNAs - probably protein-coding genes (14,317)

Blue: marginal coding cDNAs - possible protein-coding genes (3,277)

Black: possibly genes for functional noncoding RNA (11,526)

Green: probably genes for functional noncoding RNA (3,450)

As you can see, the potential genes for protein-coding regions tend to be conserved but the data suggest that a good many of those potential genes are probably not real protein-coding genes. In the case of potential genes for functional noncoding RNA (black, green), the sequence similarities are no different than random neutrally-evolving DNA and very different from known genes for functional noncoding RNAs (brown).
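The logic of the test can be sketched in a few lines. This is only a toy illustration and not the Wang et al. pipeline: the function names, the 5-point margin, and all the sequences and identity values are made up for the example.

```python
from statistics import median

def percent_identity(a, b):
    """Percent of aligned, ungapped columns where the two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Skip any column where either sequence has a gap character.
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)

def looks_conserved(candidates, neutral, margin=5.0):
    """True if the candidates' median identity beats the neutral median
    by more than `margin` percentage points (an arbitrary threshold)."""
    return median(candidates) - median(neutral) > margin

# Made-up percent identities, NOT data from either paper.
neutral_background = [62, 64, 66, 65, 67]   # random intergenic sequences
known_rna_genes = [88, 92, 85, 90, 95]      # positive control
candidate_ncrnas = [63, 66, 64, 65, 68]     # candidate noncoding TUs

print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
print(looks_conserved(known_rna_genes, neutral_background))
print(looks_conserved(candidate_ncrnas, neutral_background))
```

With these toy values the positive control stands out from the neutral background and the candidates don't, which is the pattern described above for the brown line versus the black and green lines.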

Wang et al. conclude that most of the 15,815 potential RNA-type genes are not genes at all and the transcripts are junk RNA.

The simplest explanation is that non-functional transcripts can be produced at low copy numbers, escape the cell's messenger RNA surveillance system, and yet inflict no damage on the cell.

They go on to issue a caution that has largely been ignored.

Given that all of the best techniques for detecting RNA genes depend on sequence conservation, the absence of this cannot be summarily dismissed, even if isolated examples of RNA genes being weakly conserved can be found. Extraordinary claims require extraordinary proof — this is particularly true when much of the data support an alternative interpretation that they are simply non-functional cDNAs.

This is a direct criticism of the Okazaki et al. paper so the authors of that paper were invited to respond. There were 133 authors. We'll never know how many of them might have agreed with the criticism since the response came from Yoshihide Hayashizaki (Hayashizaki, 2004), the RIKEN Group leader at the Yokohama Institute in Yokohama, Japan. He acknowledges that he had help from several people who were not on the original list of authors; one of them was John Mattick.

Hayashizaki didn't like the Wang et al. paper. He had a couple of technical objections; namely, that the functional RNA positive control is only based on 19 RNAs and that the validity of the negative control (random junk DNA) is also questionable. The first objection is valid but if you look at the big picture it's not going to make much difference if the known functional RNAs are unrepresentative. The second objection assumes that intergenic sequences are actually conserved to some extent—in other words, they are not junk. Hayashizaki claims that "more of the genome is under evolutionary selection (both positive and negative) than has been appreciated."

He doesn't explain his view but I'm guessing that he looks at the average sequence similarity between rat and mouse (~83%) and between mouse and human (~65%) and assumes that the differences should be even greater if they are evolving neutrally.

But those technical objections are not his main arguments. I bet Sandwalk readers can already anticipate what Hayashizaki is going to say. Think about it before I give you the answer.

waiting ....

waiting ....

That's right! He pushes the same old replies that we've heard before from creationists. The RNAs must be functional because they are transcribed in a tissue-specific manner or in response to external stimuli. This is a silly argument since if the RNAs are just noise produced by accidental transcription then that kind of spurious transcription will still depend on the binding of transcription factors to random parts of the genome. Since the transcription factors are tissue-specific or activated by external stimuli, it follows that the spurious transcripts will show the same features as the real genes that are being turned on by those transcription factors.2

The second argument is that regulatory noncoding RNAs are "in the main, much less conserved than protein-coding sequences." Thus, the genes for these RNAs could look like they aren't conserved but still be genes for functional RNAs. They could also be mouse-specific genes that arose only in the mouse lineage even though there are related, nonfunctional, sequences in other species. This is an advanced form of question-begging.

Hayashizaki doesn't seem to have absorbed the main take-home message—the onus is on those who make extraordinary claims to come up with evidence for function. It's not good enough to just speculate with just-so stories since there's a valid alternative hypothesis based on the default assumption (neutral evolution).

The criticism and reply were published in October 2004. The following year, the same group published two more papers in the September 2 issue of Science. This was an issue devoted to functional RNAs.

There were numerous articles on the importance of noncoding RNAs in mammalian genomes and all of them were in "honor" of the two papers from the FANTOM Consortium and RIKEN (Carninci et al., 2005; Katayama et al., 2005).

Did anyone learn anything from the Wang et al. paper? Judge for yourselves by reading the press release from FANTOM/RIKEN [PDF].

The FANTOM Consortium for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute and Genome Science Laboratory, Discovery and Research Institute, RIKEN Wako Institute (Genome Network Core Group), announce the publishing of two milestone papers this week in the prestigious journal Science, which will transform our understanding of how the genes in mammals are controlled.

The past five years have seen the completion of several mammalian genome sequences, but these are of limited value unless we can decode the way that they are translated into functions required to create a mature animal. Only around 2% of the genome is translated into proteins (coding transcripts), the building blocks of the cells that make up our bodies. But which 2%, and how is it controlled?

The key intermediate is the transcriptome, which now has been subject to the most comprehensive characterization ever. The groundbreaking study has used new technology that accurately tags the beginning and end of each of over 20 million RNA messages (transcripts) created by genes, resulting in a powerful profile of the regulating control of genes. In addition, it has also shown that overlapping sense/antisense transcript pairs (both strands) are almost universal in the genome, and that S/AS pairs are especially abundant in imprinted loci, keeping with the putative role of non-coding RNA in the mechanism of gene silencing.

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs that until recently was not suspected to exist or be relevant to our biology. The findings suggest that the difference between mouse and man may well lie in the control systems of these genes, and not in the structures of the proteins some of them code for.

"We have provided the biomedical research community with the tools to understand the controls that are needed to make a mammal. We have deciphered the genome sequence not only for the code for making the parts (proteins) of a mammal, but also the code for making the right forms, in the right amounts, in the right place, at the right time." states project leader Yoshihide Hayashizaki.

The FANTOM/RIKEN groups identified and named 56,722 protein-coding genes based on their cDNAs in the 2005 papers. The majority of these (65%) were supposed to produce multiple proteins by alternative splicing. (Nobody believes any of this in 2016.)

They claim that the mouse genome has up to 34,030 genes for noncoding RNAs. Here's what they say about the previous year's discussion.

The function of ncRNAs is a matter of debate (Wang et al., 2004). Some ncRNAs are highly conserved even in distant species: 1117 out of 2886 overlap chicken sequences, of which 780 do not overlap known CDS and 438 do not overlap known mRNAs on either strand, whereas 68 out of 2886 have BLAST-like alignment tool (BLAT) alignments to the Fugu genome, of which 40 do not overlap known CDS on either strand. These ncRNAs are at least as conserved as a reference set of known ncRNAs (Fig. 3A), contrary to a previous study (Wang et al., 2004).

I don't understand these numbers. What does it mean to say that 1117 ncRNAs overlap chicken sequences?

At least they acknowledged the earlier criticism, in contrast with the ENCODE paper seven years later. They tried to provide evidence that their noncoding RNAs were functional by showing that they were, on average, more highly conserved than random genomic DNA. Here's a copy of Figure 3A titled "Human-mouse conservation of coding and noncoding RNAs compared with random genomic sequence."

At first glance it certainly looks like their ncRNAs are more conserved than random sequences but I'm deeply suspicious of the results. I don't like the way they present the data since all parts of the curve are skewed downward by the small number of real, highly conserved, sequences at the top end. I don't count this as extraordinary evidence.
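A toy calculation shows how a small conserved subset can shift a whole-set summary. This is only a sketch under assumed numbers: the 65% neutral value borrows the rough mouse-human figure mentioned earlier, while the 95% value and the 10% conserved fraction are invented for illustration.

```python
from statistics import mean, median

NEUTRAL_IDENTITY = 65.0    # assumed neutral mouse-human identity (%)
CONSERVED_IDENTITY = 95.0  # assumed identity of a truly conserved RNA gene (%)

def mixed_set(n_total, conserved_fraction):
    """Identities for a set where only a fraction is really conserved."""
    n_conserved = round(n_total * conserved_fraction)
    n_neutral = n_total - n_conserved
    return [CONSERVED_IDENTITY] * n_conserved + [NEUTRAL_IDENTITY] * n_neutral

background = mixed_set(1000, 0.0)    # purely neutral sequences
candidates = mixed_set(1000, 0.10)   # hypothetical: 10% real genes, 90% junk

print("background mean/median:", mean(background), median(background))
print("candidates mean/median:", mean(candidates), median(candidates))
```

In this made-up case the candidate set's mean rises above the background even though 90% of its members are indistinguishable from neutral sequence, and the median doesn't move at all. An "on average more conserved" result is therefore compatible with most of the transcripts being junk.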

Lots of people disagree with me (surprise!). Here's a sample of comments in that very same issue of Science [Mapping RNA Form & Function].

Guy Riddihough in "In the Forests of RNA Dark Matter."

The phrase “dark matter” could well be ascribed to noncoding RNA in general. The discovery that much of the mammalian genome is transcribed, in some places without gaps (so-called transcriptional “forests”), shines a bright light on this embarrassing plentitude: an order of magnitude more transcripts than genes (pp. 1559, 1564, and 1529). Many of these noncoding RNAs (p. 1527) are conserved across species, yet their functions (if any) are largely unknown ...

John Mattick in "The Functional Genomics of Noncoding RNA."

That complex organisms have complex genetic programming should come as no surprise. That much of this programming may be transacted by noncoding RNAs may be. However, given the sheer extent of noncoding RNA transcription, it seems more and more likely that a large portion of the human genome may be functional by means of RNA. This also means that we may have seriously misunderstood the nature of genetic programming in the higher organisms (21) by assuming that most genetic information is expressed as and transacted by proteins, as it largely is in prokaryotes (22). If so, there is a long road ahead in functional genomics.

Jean-Michel Claverie in "Fewer Genes, More Noncoding RNA."

Recent data from the FANTOM 3 project (13, 14) confirm and amplify these findings. Through a technical tour de force, the members of this consortium have established that a staggering 62% of the mouse genome is transcribed. They have identified more than 181,000 independent transcripts, of which half consist of noncoding RNA. Moreover, they found that more than 70% of the mapped transcription units overlap to some extent with a transcript from the opposite strand (13, 14).

These results provide a solution to the discrepancy between the number of (protein-coding) genes and the number of transcripts—noncoding polyadenylated mRNA contributes to a large fraction of the 3′-EST sequences (and SAGE tags) subsequently clustered or remaining as singletons. Indeed, the noncoding Xist mRNA is abundantly represented in all EST projects. It is thus likely that sequences of noncoding transcripts have been accumulating in EST databases and have for the most part (including singleton and antisense ESTs) been erroneously interpreted as coming from the 3′-untranslated regions of protein-coding transcripts. Noncoding transcripts originating from intergenic regions, introns, or antisense strands have probably been right before our eyes for 8 years without having been discovered!

Some of us look back on those claims with amusement because we are convinced that mammalian genomes are full of junk and most of those transcripts are just noise or junk RNA.

However, the scary part is that there are still many biologists who believe that there are thousands and thousands of genes for regulatory RNAs, far more, in fact, than the number of protein-coding genes.

It's going to be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies.

1. There will be exceptions. It's possible to imagine a given RNA having a function that's not sequence dependent. But it's 2016, and after decades of work there are very few proven examples of such RNAs. It's safe to assume that sequence conservation is a good proxy for function.

2. The same argument is still being used today and it still makes no sense.

23 comments:

It's going to be hard to dissuade them because they want to believe that mammalian genomes contain a lot of missing information that makes us more complex than nematodes or fruit flies.

Given that human behavior clearly is more complex than that of either nematodes or fruit flies despite our not having significantly more protein-coding genes, what would be alternative explanations for this? Is it all environment, and we haven't managed to provide the right education for flies that would allow them to reach their full potential? I'm no fan of Mattick and his "dog's ass plots", but it seems the best argument against him is to show an alternative explanation that doesn't require thousands of RNA genes.

The short response is differential gene regulation through classic well established mechanisms led to a more 'complex' brain that allows us to spend inordinate amounts of time arguing that we're better than everything else. I'll call this the Trump fallacy (patent pending).

Human behavior is more complex. Let's assume this is because of bigger, more complex brains. How many new developmental genes do we need in order to explain this, or can it be explained by (mostly) the same old genes regulated/regulating differently, e.g., the genes responsible for brain development wait longer before "turning off" in humans? (Cf. various examples of neoteny.)

But the problem is that the mechanisms in classic, well-established gene regulation involve protein-coding genes to do the regulating. So it isn't clear how significantly more complex behavior could arise by this method without a significantly larger number of protein-coding genes to serve as additional regulators.

Alternatively, there's the argument that our behavior isn't really all that more complex and we are just vain like the pre-Copernicans who believed we lived at the center of the universe. There's something to be said for this in comparison to other mammals, but it is a bit hard to argue in regard to flies, I think.

No, you don't need thousands of new non-coding genes. You just need, if anything, a somewhat more complex regulatory network. Some of that is probably new families of regulatory RNAs and some of it is new transcription factor binding sites (for the same old transcription factors) in promoters. But only a modest increase in numbers of functional sequences seems to me necessary to produce considerably greater differentiation. We're still left with regulatory elements, RNA and DNA alike, being a few percent of the genome, or at least with no reason to think otherwise and much reason to think so.

1) One can generate an enormous variety of distinct "cell types" using just a few genes. If we count each and every B and T cell expressing a different Ig/TCR as a distinct cell type, then the number of cell types produced by just one locus is truly astronomical. Something similar seems to be happening with brain cells (protocadherins, DSCAM, etc.)

2) I don't see why each and every individual neuron has to be genetically and precisely specified. They are in some nematodes, rotifers, etc. but those are weird cases. In a complex vertebrate brain I see absolutely no need for that. All that one needs is to specify a limited number of cell types and to control where and how many of them will be produced during development. Which does not take that much extra in terms of regulation, especially given how the "how many" part, which is the easiest to tweak, seems to have been most important in our own evolution.

There really is a lot less to explain than many people would want there to be. And one can get into some interesting speculations on why that is -- if you ask me, I would guess that there is a subconscious desire for the brain to be exquisitely and precisely specified in all its complexity, because if it was, then that might make it more understandable. The alternative (things are specified down to a certain level, below that there is a lot of self-organization with quite a few degrees of freedom going on) is not so attractive. Which, if we are to get to the bottom of it, might well be the same as one of the reasons people believe in deities and love conspiracy theories (better to have the feeling that someone is in control than to fully realize how messy and fragile everything is).

Yes, you can explain all these differences via classical gene regulation models. How else do you explain the differences between a chicken and a horse? And if you think these are due to differences in complexity, which is more complex? I also disagree with you that we need 'somewhat more'; our understanding of gene regulation is quite clear that numerous and profound differences can arise via minor changes (look at insect morphology, for example). But continue going with the Trump fallacy: we have all the complexity, we have the best complexity.

Larry says: The average length of a TU is 1,970 bp (1.97 kb). These are potential genes....After examining the sequences, they conclude that only 17,594 of the 33,409 potential genes have significant coding potential. That's only 52% of the total. They don't talk much about the other 48% of TUs. Most of them are "unclassifiable." These 15,815 TUs "may represent [genes for] functional non-coding RNAs."

Since mammals only have slightly more conventional genes (around 22,000) than a simple worm, the results of the FANTOM Consortium study clearly indicate that while proteins comprise the essential components of our cells, the development of multicellular organisms like mammals is controlled by vast amounts of regulatory non-coding RNAs....

If pervasive transcription really is largely the result of spurious transcription, then one will expect the same result when worm cells are analyzed in this fashion.

Then what will be the argument? It will have to be that, yes, the same amount of non-coding RNA synthesis occurs in worm cells, but in the case of humans more of that non-coding RNA is functional than in the worm... I suppose.

It will come down to the same problem associated with never (or at least not yet) actually demonstrating functionality of the elements they claim are functionally important.

It is better to know which transcripts will be translated into protein. A unit encoding a protein may contain both a coding sequence and regulatory sequences, which direct and regulate the synthesis of that protein.

Laurence A. Moran

Larry Moran is a Professor in the Department of Biochemistry at the University of Toronto. You can contact him by looking up his email address on the University of Toronto website.
