Friday, September 14, 2007

Genome Size, Complexity, and the C-Value Paradox

Forty years ago it was thought that the amount of DNA in a genome correlated with the complexity of an organism. Back then, you often saw graphs like the one on the left. The idea was that the more complex the species the more genes it needed. Preliminary data seemed to confirm this idea.

In the late 1960s scientists started looking at the complexity of the genome itself. They soon discovered that large genomes were often composed of huge amounts of repetitive sequences. The amount of "unique sequence" DNA was only a few percent of the total DNA in these large genomes.[1] This gave rise to the concept of junk DNA and the recognition that genome size was not a reliable indicator of the number of genes. That, plus the growing collection of genome size data, soon called into question simplistic diagrams like the one shown here from an article by John Mattick in Scientific American (Mattick, 2004). (There are many things wrong with the diagram. Can you identify all of them? See "What's wrong with this figure?" at Genomicron.)

Today we know that there isn't a direct correlation between genome size and complexity. Recent data, such as those from Ryan Gregory's website (right), reveal that the range of DNA sizes in many groups can vary over several orders of magnitude [Animal Genome Size Database]. Mammals don't have any more DNA in their genome than most flowering plants (angiosperms). Or even gymnosperms, for that matter.

Many of us have been teaching this basic fact for twenty years. The bottom line is ....

Anyone who states or implies that there is a significant correlation between total haploid genome size and species complexity is either ignorant or lying.

It is notoriously difficult to define complexity. That's only one of the reasons why such claims are wrong. Ryan Gregory wants everyone to know that the figure showing genome sizes in different phylogenetic groups is not meant to imply a hierarchy of complexity from algae to mammals.

A recent paper by Taft et al. (2007) says complexity can be "broadly defined as the number and different types of cells, and the degree of cellular organization." We can quibble about the definition but there's nothing better that I know of. The real question is whether organism complexity is a useful scientific concept.

Here's the problem. Have some scientists already made up their minds that mammals in general, and humans in particular, are the most complex organisms? Do they construct a definition of complexity that's guaranteed to confer the title of "most complex" on humans? Or is complexity a real scientific phenomenon that hasn't yet been defined satisfactorily?

I, for one, don't know whether humans are more complex than an owl, or an octopus, or an orchid. For all I know, humans may be less complex by many scientific measures of complexity. Plants can grow and thrive on nothing but water, some minerals, and sunlight. We humans can't even make all of our own amino acids. Does that make us less complex than plants? Certainly it does at the molecular level.

Back in the olden days, when everyone was sure that humans were at the top of the complexity tree, the lack of correlation between genome size and complexity was called the C-value paradox, where "C" stands for the haploid genome size. The term was popularized by Benjamin Lewin in his molecular biology textbooks. In Genes II (1983) he wrote:

The C value paradox takes its name from our inability to account for the content of the genome in terms of known function. One puzzling feature is the existence of huge variations in C values between species whose apparent complexity does not vary correspondingly. An extraordinary range of C values is found in amphibians where the smallest genomes are just below 10^9 bp while the largest are almost 10^11. It is hard to believe that this could reflect a 100-fold variation in the number of genes needed to specify different amphibians.
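As a quick check of the arithmetic in Lewin's passage, here is a minimal sketch that uses only the two amphibian figures he quotes (roughly 10^9 bp and 10^11 bp). The function name `fold_range` is hypothetical and is used purely for illustration.

```python
import math


def fold_range(smallest_bp: float, largest_bp: float) -> tuple[float, float]:
    """Return (fold difference, orders of magnitude) between two genome sizes."""
    fold = largest_bp / smallest_bp
    return fold, math.log10(fold)


# Approximate amphibian C values quoted from Genes II (1983).
fold, orders = fold_range(1e9, 1e11)
print(f"{fold:.0f}-fold, about {orders:.0f} orders of magnitude")
```

The two endpoints differ by a factor of 100, which is the "100-fold variation" that Lewin found hard to attribute to gene number alone.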

So, the paradox arises even if we don't know how to rank flowering plants and mammals on a complexity scale. It arises because there are so many examples of very similar species that have huge differences in the size of their genome. Onions are another example; they are the reason why Ryan Gregory made up the Onion Test.

The onion test is a simple reality check for anyone who thinks they have come up with a universal function for non-coding DNA. Whatever your proposed function, ask yourself this question: Can I explain why an onion needs about five times more non-coding DNA for this function than a human?

Imagine the following scenario. You are absolutely convinced that humans are the most complex species but total genome size doesn't reflect your conviction. The C-value paradox is a real paradox for you. Knowing that much of our genome is possibly junk DNA still leaves room for plenty of genes. You take comfort in the fact that under all that junky genome, humans still have way more genes than simple nematodes and flowering plants. You were one of those people who wanted there to be 100,000 genes in the human genome [Facts and Myths Concerning the Historical Estimates of the Number of Genes in the Human Genome].

But when the genomes of these species are published, it turns out that even this faint hope evaporates. Humans, Arabidopsis (wall cress, right), and nematodes all have about the same number of genes.

Oops. Now we have a G-value paradox, where "G" is the number of genes (Hahn and Wray, 2002). The only way out of this box—without abandoning your assumption about humans being the most complex animals—is to make up some stories about the function of so-called junk DNA. If it turns out that there are lots of hidden genes in that junk then maybe it will rescue your assumption. This is where we get some combination of the excuses listed in The Deflated Ego Problem.

On the other hand, maybe humans really aren't all that much more complex, in terms of number of genes, than wall cress. Maybe they should have the same number of genes. Maybe the other differences in genome size really are due to variable amounts of non-functional junk DNA.

1. Thirty years ago we had to teach undergraduates about DNA reassociation kinetics and Cot curves—the most difficult thing I've ever had to teach. I'm sure glad we don't have to do that today.

22 comments:

Thirty years ago we had to teach undergraduates about DNA reassociation kinetics and Cot curves—the most difficult thing I've ever had to teach. I'm sure glad we don't have to do that today.

I learned about Cot curves in a molecular evolution course I took less than ten years ago. It was taught by an ol' school geneticist, so it's not representative of molecular evolution courses in general. But people do still learn about that stuff.

I was also teaching it about twelve years ago when I last taught an upper-level molecular genetics course. It always amused me to see seniors who had supposedly aced calculus three years previously get the deer in the headlights look when confronted with simple algebra. Always made me wonder just what the hell the math department was actually teaching.

Absolutely fascinating article Larry! Lots of challenges. You seem to be raising a question over that final human ‘conceit’ - that is, in the face of the ever-advancing principle of cosmic mediocrity that started with Copernicus, human pride might at least find some comfort in the thought that we are manifestly the most complex things this side of the nearest star! (and perhaps quite a few other stars as well!).

A necessary condition of complexity is variety. Although ‘variety’ isn’t a sufficient condition to define the working systems of life, it clearly subsets them. Well, using this necessary condition to define complexity, perhaps there is some mileage to be gained from the following line of thought:

It has always been clear in the field of computation that complexity of output arises not just because of complex starting information (like complex DNA info), but also because of the length of time a computation takes place; that is, the complexity of the final output derives from two computational resources, namely the complexity of the initial information and the generation time.

Hence, applying these ideas it would seem to me that assembly time also has a bearing when considering the comparative complexity of organisms. With humans the ‘assembly time’ needs to be taken into account; and that assembly time must also include the assembly of proteins that we don’t manufacture ourselves but take on board as food. In short, being at the top of the food chain, our true DNA sequence implicitly includes parts of the DNA sequence of organisms from which we derive proteins; our effective DNA sequence is a concatenation of DNA sequences from other organisms. Moreover, a viable human also requires a lot of social training and perhaps that should be taken into account too.

So, for all you human complexity chauvinists out there, there is hope for you yet!

Timothy V Reeves wrote: "In short being at the top of the food chain our true DNA sequence implicitly includes parts of the DNA sequence of organisms from which we derive proteins...."

I am sure the notion that we are "at the top of the food chain" will be a great reassurance to you and any companions the next time you hike where there are grizzlies or swim where there are great white sharks.

I think it is clear that the "complexity is beautiful" guys are not only burning their candle from both ends, but have a blow torch on the middle.

In the one end complexity has a funky relationship with information and descriptions. Kolmogorov complexity (algorithmic information) is a measure of the resources needed to specify an object. A random string takes the most information to specify.

In the other end the difficulty with defining complexity is because there is no single measure that can capture all structural characteristics.

Taft's organism measure seems like a good measure as a first approximation to capture general complexity, but it leaves out many traits, behavioral complexity, et cetera.
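The earlier remark that a random string takes the most information to specify can be illustrated with a rough sketch. Kolmogorov complexity itself is not computable, so the example below swaps in the compressed length from Python's standard `zlib` module as a crude upper-bound proxy. That substitution, the helper name `compressed_size`, and the sample sequences are all assumptions made for illustration.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Bytes needed for the zlib-compressed form of data.

    This is only a crude proxy for algorithmic information,
    not Kolmogorov complexity itself.
    """
    return len(zlib.compress(data, 9))


n = 10_000
repetitive = b"ACGT" * (n // 4)  # highly regular, repeat-like sequence
random_bytes = os.urandom(n)     # effectively incompressible

print("repetitive:", compressed_size(repetitive))
print("random:    ", compressed_size(random_bytes))
```

The repetitive sequence shrinks to a tiny fraction of its length while the random one barely shrinks at all, which is why a measure of this kind ranks randomness, not organized structure, as maximally "complex".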

Your description of computational complexity is interesting. Do you have any references?

Btw, my impression was that CS distinguished between computational cost in space (memory constraint) and time (time constraint). I thought computational complexity (and its classes) described the former and that it can be traded off for the latter?!

In any case, I'm not sure your description of human 'assembly time' is entirely accurate.

First, the ovum and its surroundings contribute a lot of starting information. They bring the cell machinery and maternal hormones that imprint directions on the fetus early on. Second, the difference in food needs (mainly vitamins I believe) and protein expression between us and much smaller animals isn't all that great. And comparing a child and an adult it seems the main difference from growth is size. :-o

On behavioral complexity, I believe you may be on to something simple yet powerful. Also, you can proficiently run the argument in reverse; evolutionary 'assembly time' is pretty much the same for all organisms so they should contain pretty much the same "starting information".

What I said above is exploratory – I’m not pushing it as a fact, just a subject for exploration. I admit that current ambiguities in defining complexity (as suggested by both Larry and Torbjorn) may scupper all attempts to detect and attribute any special complexity status to humans; the differences in the complexities of the higher organisms may be just too subtle to be picked up by our crude notions of complexity. (As Torbjorn suggested)

In an attempt to circumvent these issues I have used a necessary condition for complexity rather than a sufficient condition: a necessary condition of life is that it must display a high variety of structure/configuration/behaviour. However, this condition has the disadvantage of widening the definitional net so much that it gives randomness the highest complexity ‘status’. The ‘mutual information’ notion alluded to by Torbjorn tries to eliminate this by peaking complexity somewhere between regularity and randomness. But as I am comparing organisms with organisms rather than organisms with, say, ‘gases’, the mutual information factor is inherent in the action of identifying the organism in the first place – that identification can only take place because of the mutual information that constitutes a group of cooperating molecules and cells.

So do humans have the greatest variety in terms of some combination of structure, configuration and especially behaviour? Well, assuming they do, then computationally speaking that implies humans will require the greatest number of steps in their construction, and that construction must include the steps in the construction of off-the-peg molecules (bits of protein according to Dave above) taken from other organisms, not to mention the long process of socialization. Hence humans can boast that they are biggest and best on at least one count – the sheer construction work!

In spite of all that it may be that what really sets human biological configurations apart is not their structural variety but rather in some other way that, like a simple yet elegant algorithm, is best expressed as being “just damn clever”. Given the complexity of complexity space can we ever hope to fully mathematicize the notion of being ‘just damn clever’? It could be that my notion of complexity above should dispense completely with the idea of variety as a measure of complexity and perhaps I should fall back on quantity of construction steps only – that is, some biological configurations might be comparatively simple on the variety front but are extraordinarily difficult to find in the many pathways of complexity space because they are the equivalent of some distant far flung backwater that requires a lot of computational fuel to find.

Technical note to Torbjorn: The relationship of variety and computation steps is something I’m still thinking about. Hence no references yet. Unfortunately Chaitin, the man I go to for all things algorithmic, seems much less interested in computation time than he does program string length. Program strings map to an output, which may be just a simple yes or no or it may be Omega. In short Chaitin is interested in functions, or ‘halting programs’. I’m interested in developing systems that don’t stop such as evolution or even a simple non-halting counting algorithm that tracks through all configurations. Moreover, what goes on in memory I consider as ‘output’.
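To make the "simple non-halting counting algorithm" concrete, here is one possible sketch. It assumes that "configurations" can be read as finite binary strings enumerated in order of length; the comment does not say this, and the generator name `all_configurations` is hypothetical.

```python
from itertools import count, islice
from typing import Iterator


def all_configurations() -> Iterator[str]:
    """Yield every finite binary string in order of length, never halting."""
    yield ""  # the empty configuration
    for length in count(1):
        for value in range(2 ** length):
            # Zero-padded binary representation of value, `length` digits wide.
            yield format(value, f"0{length}b")


# The generator itself never stops; islice only inspects a finite prefix.
print(list(islice(all_configurations(), 7)))
# ['', '0', '1', '00', '01', '10', '11']
```

The program text stays a few lines long no matter how far it runs, while the strings it visits grow without bound. This is the sense in which running time, and not just program length, contributes to what ends up in memory.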

The correspondence between algorithmic constructions and physics is interesting. Computer scientist Scott Aaronson has a lot to say here. The impression I get is that he claims that physical processes are algorithmic at their base. (Even if that is quantum algorithmic.)

So you put the finger on a problem here. Algorithmic theory is interested in the resources (time and space) taken to deliver a result, while physics necessarily describes the ongoing process. It will be interesting to see if guys like Aaronson can figure out a more direct correspondence.

About humans, to be honest I don't see any large difference between humans and other species, and certainly not a qualitative difference.

Thanks for that Torbjorn. I'll look up Scott Aaronson and see what he says.

Yes, you may be completely right about the differences (or lack of them) between humans and comparably complex organisms. I was just engaging in some rather seat-of-the-pants speculations - that's how I like my science!

In any case the exercise may be as pointless as trying to compare the complexity of a sports car with a heavy truck - what really distinguishes the two is not so much a difference in complexity measure but a difference in function!

I just heard a professor tell 150 undergraduates that as you move up the ladder of evolution and organisms get more complex, the size of the gene increases (due to more introns). The textbook doesn't say this but kind of hints at it by using carefully chosen organisms arranged from bacterium to yeast to Drosophila to human, and hey it sure looks true! I came right back to my computer and looked up this blog post in order to remind myself of the reality and set my portion of the class straight in discussion. It should be required reading. Thank you for writing it. (Also: "ladder of evolution"?? what century is this?? ARGH)

Do you think that alternative splicing will have a major impact on the figures people are producing?

(I have no strong opinions on this subject and certainly don't have a particular axe to grind. I found your post interesting, but I was expecting to see maybe a little on the impact of cladistic variability in the amount of alternative splicing, if there is any. Do you (or others) believe that alternative splicing is evolutionarily relevant in terms of complexity, for example?)

The word "complex" tends to mean "many parts" but tells us nothing about the interrelation of those parts. A functional mechanical watch is complex, but so is a mangled mechanical watch; a living animal is complex, but so is a largely decomposed animal. We are using the same word to mean opposite states! I would propose we reserve the word "complex" for the definition "composed of many independent parts" and introduce the word "sophisticated" for the definition "composed of many interdependent parts". Complex systems may be predicted by probabilities, but sophisticated systems can only be predicted by a close tally of each part. Genes are more sophisticated than complex; the environment is more complex than sophisticated. The issue here is not an argument about which is more complex but rather about which is more sophisticated.

Laurence A. Moran

Larry Moran is a Professor in the Department of Biochemistry at the University of Toronto. You can contact him by looking up his email address on the University of Toronto website.

Sandwalk

The Sandwalk is the path behind the home of Charles Darwin where he used to walk every day, thinking about science. You can see the path in the woods in the upper left-hand corner of this image.

Disclaimer

Some readers of this blog may be under the impression that my personal opinions represent the official position of Canada, the Province of Ontario, the City of Toronto, the University of Toronto, the Faculty of Medicine, or the Department of Biochemistry. All of these institutions, plus every single one of my colleagues, students, friends, and relatives, want you to know that I do not speak for them. You should also know that they don't speak for me.
