Abstract

The use of DNA sequences to estimate the timing of evolutionary events is increasingly popular, although it is fraught with practical difficulties. But the exponential growth of relevant information and improved methods of analysis are providing increasingly reliable sequence-derived dates, and it may become possible to reconcile fossil-derived and molecular estimates of divergence times within the next few years.

The history of life stretches back more than 3.6 billion years, to a time soon after liquid water had begun to accumulate from volcanic gases onto the newly solid surface of the Earth. Within just a few hundred million years, or perhaps less, photosynthetic bacteria teemed in the infant oceans. The fossil record has traditionally provided the only way to date this and all subsequent events in the history of life. Although enormously informative, the fossil record is far from perfect. It is both biased and incomplete: different organisms differ enormously in how well they can be fossilized, and many intervals of Earth's history are poorly represented.

The first protein sequences, obtained over 40 years ago, provided a second means of dating evolutionary events [1]. This involves calibrating the rate at which protein or DNA sequences evolve and then estimating when two evolutionary lineages diverged, using the sequence differences among their living representatives (Figure 1). Like the fossil record, this genomic record is far from perfect: rates of sequence substitution vary over time and among lineages. Like the fossil record, however, the genomic record can provide a valuable source of information about the timing of evolutionary events when correctly interpreted.
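The calibrate-then-date procedure just described can be illustrated with a minimal sketch. It assumes a single, constant substitution rate (the very assumption the rest of this article questions), and all distances and dates are hypothetical values chosen only for illustration; the function names are likewise invented here.

```python
def substitution_rate(calib_distance, calib_age_ma):
    """Substitutions per site per million years along a single lineage.

    calib_distance: genetic distance between two taxa whose split is
                    dated from fossils (substitutions per site).
    calib_age_ma:   fossil-derived age of that split, in millions of years.
    """
    # After a split, both lineages accumulate changes independently,
    # so the observed distance covers twice the elapsed time.
    return calib_distance / (2.0 * calib_age_ma)


def divergence_time(distance, rate):
    """Estimated time since divergence (Ma) under a constant rate."""
    return distance / (2.0 * rate)


# Hypothetical calibration: two taxa differing by 0.30 substitutions per
# site whose divergence is dated at 300 Ma from fossils.
rate = substitution_rate(calib_distance=0.30, calib_age_ma=300.0)

# Hypothetical pair of interest differing by 0.09 substitutions per site.
t = divergence_time(distance=0.09, rate=rate)
print(f"rate = {rate:.5f} substitutions/site/Myr, divergence = {t:.0f} Ma")
```

Any error in the calibration date or any departure from a constant rate passes directly into the estimate, which is why both issues receive so much attention below.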

Figure 1. Two approaches to dating evolutionary divergence times. Lineages x, y, z, i and j are shown going back (down) from the present day. Thick bars represent periods for which there is a fossil record for the lineage; dotted lines represent 'ghost' lineages,...

Rate variation is a problem

The idea of dating evolutionary divergences using calibrated sequence differences (Figure 1a) was first proposed in 1965 by Zuckerkandl and Pauling [1]. Soon afterwards, Ohta and Kimura [2,3] published the neutral model of protein evolution. In this, they proposed that most nucleotide substitutions within coding sequences are not functionally constrained and therefore accumulate at a constant rate; the neutral model therefore added a potent theoretical underpinning to the enterprise of dating divergence times using sequence data, in a method that soon became known as the 'molecular clock'.

As sequences from multiple species began to accumulate during the 1970s, it became apparent that a clock is not a particularly good metaphor for the process of molecular evolution [4]. Variation in rates of sequence substitution, both along a lineage and between different lineages, is now known to be pervasive [5,6,7]. The reasons for this variation remain poorly understood, despite some interesting correlations [8,9]. Although estimating divergence times from sequence data does not depend on constant substitution rates [10,11,12], variation in these rates greatly reduces the precision of such estimates and remains the primary challenge in using sequence data to date evolutionary events [11,12,13,14,15].

Early studies that used sequence data to estimate key evolutionary divergence times typically examined just one protein from a few species - this was before DNA sequencing was even possible - and used rather simple methods of analysis. Some of these early analyses produced estimates of divergence times that were far earlier than those derived from the fossil record [16,17]. In the past few years, however, a large increase has been seen in the number of studies using sequences to estimate evolutionary divergences (Figure 2). Datasets have become much larger and methods of analysis considerably more sophisticated, but neither the discrepancy between fossil and molecular dates nor the attendant controversy has disappeared.

Figure 2. Revised chronology of the 'Tree of Life'. The present is represented by the horizontal line at the top and geological periods are shown on the left with their approximate dates. The Phanerozoic eon encompasses the Paleozoic, Mesozoic and Cenozoic (Cen)...

Dating key branch points

Divergences between the kingdoms

Among the most intriguing and obscure events in the history of life are the origins of the major kingdoms. Because these events all involved single-celled organisms with relatively poor fossilization potential, the divergence times between kingdoms have been difficult to establish. On the basis of fossil evidence, the great divide between prokaryotes and eukaryotes occurred about 1.4 billion years ago (Ga) [18]; estimates from sequence data suggest earlier divergence times of 2.1 Ga for the split between archaebacteria and eukaryotes [19] and over 3 Ga for the split between eubacteria and eukaryotes [12,19]. Divergence times of the plant, animal, and fungal kingdoms derived from molecular evidence range from 1.2 Ga to 1.4 Ga [10,12,20], again considerably deeper (longer ago) than is suggested by the fossil record.

Diversification of metazoan body plans

The diversification of animals (metazoa) is one of the most famous evolutionary radiations (see Figure 2b) [21,22]. The fossil record suggests an abrupt appearance of many different animal phyla about 530 million years ago (Ma), during a Cambrian 'explosion' of new body plans. Over a dozen studies have estimated metazoan divergence times from sequence data, using a variety of datasets, measures of genetic distance, and methods of analysis (see, for example, [12,16,20,23,24]). Although dates differ considerably among these and the other studies published to date, every one falls well before the date of the first unequivocal animal fossils (Figure 2). Furthermore, where analyses have dated the divergence times of multiple groups of animals, the results indicate an extended rather than an explosive interval of radiation. Even in the absence of precise dates, the rejection of the hypothesis of explosive Cambrian-era divergences in itself provides insights into the causes of the metazoan radiation. For instance, the idea that the origin of the Hox cluster of homeobox-containing developmental control genes directly triggered the diversification of bilaterian animals is not supported, as the Hox cluster predates the appearance of most metazoan body plans by a substantial interval [25].

The colonization of land

An early, important ecological event was the establishment of terrestrial ecosystems. The fossil record suggests that green plants colonized land about 480 Ma [26], but a recent estimate from sequence comparisons reached the conclusion that this event happened about 600 Ma [27]. Divergence times among lineages of ascomycete and basidiomycete fungi, which are wholly terrestrial, have been estimated at over 800 Ma [27,28]. As fungi are not autotrophic, they may have colonized land as lichens, in association with green algae [27]. If confirmed, these very early dates for the origin of terrestrial ecosystems would raise questions as to why it took so long for the first animals to colonize land. Fossils suggest that the first terrestrial animals were chelicerate arthropods, related to spiders [26]; vertebrates did not follow until nearly 100 million years later. The true first animals on land may well have been tardigrades (minute creatures that are distantly related to arthropods) and nematodes, however, as both groups are abundant on land today but have left extremely poor fossil records.

The origin of flowering plants

One of the key events in the history of land plants is the origin of angiosperms, or flowering plants, a group that has dominated terrestrial ecosystems since the late Cretaceous. The fossil record of angiosperms extends back to the early Cretaceous, approximately 130 Ma [29]. Early molecular estimates (such as [17]), calibrated using dates of divergence of vertebrate groups from the fossil record, pointed to divergences in the Palaeozoic era (which ended at the Permian-Triassic boundary, about 250 Ma), but more recent analyses calibrated using dates from the plant fossil record [29,30,31] have produced estimates of around 150-200 Ma. Although these later estimates have substantially reduced the discrepancy between sequence-derived and fossil-derived estimates, they have not eliminated it. The timing of angiosperm origins is of considerable interest: it may help explain how flowering plants came to dominate terrestrial ecosystems and how they developed such intimate associations with insect pollinators.

Radiation of birds and mammals

Within the vertebrates, the radiations of the modern mammal and bird orders have received considerable attention (see Figure 2c). Birds and mammals were present during the Mesozoic era, when dinosaurs and pterosaurs dominated terrestrial ecosystems. It was not until just after the mass extinction at the end of the Cretaceous period (65 Ma), however, that unequivocal representatives of present-day orders of mammals and birds appeared in the fossil record [32]. Yet many independent sequence-based estimates of divergence times of different orders of eutherian (placental) mammals are all firmly in the Cretaceous, between 75 and 100 Ma (for example, see [12,33,34,35,36]). Similarly, multiple estimates of divergence times for modern (neognathine) bird orders are also within the Cretaceous, between 70 and 120 Ma [33,36,37,38,39]. As with the metazoan radiation, dates differ among studies, but there is near unanimity that divergence times significantly precede the first appearances of the relevant groups in the fossil record. If confirmed, these molecular estimates of divergence times have some very interesting implications for understanding factors that influence the turnover of faunas. The present ecological dominance of birds and mammals is something we take for granted; yet this circumstance may, for example, have required the chance impact of an asteroid to remove well-entrenched dinosaur and pterosaur competitors.

The origin of the genus Homo

Human origins, for obvious reasons, have also attracted considerable attention. Numerous studies have estimated the timing of the divergence of humans from our closest relatives, the chimpanzees; the most reliable studies place this date at about 4.5-6.5 Ma (see, for example, [9,40,41]). These dates are not very much deeper than the first appearances of humans in the rather sparse primate fossil record. The human-chimp comparison is also interesting because of the abundance of information available: it is likely that, within a few years, a direct comparison between the complete genomes of the two species will be possible. This particular divergence will probably be one of the first for which we can evaluate whether large increases in sequence information can improve estimates of divergence times.

Reconciling rocks and clocks

Divergence-time estimates derived from fossils and sequences are often at odds (Figure 2). For some of the most interesting events in the history of life that we would like to be able to date, the discrepancy is simply too large to ignore. A common reaction among paleontologists is that because sequence-based estimates are inconsistent, they are likely to be in error [32,42,43]; some molecular biologists, in turn, have pointed to the imperfection of the fossil record as the source of the discrepancy [20]. What are the prospects for reconciling these seemingly discordant sources of temporal information?

For a start, it is important to realize that both fossils and sequence data provide biased and imperfect perspectives into the timing of evolutionary events. The quality of the fossil record is notoriously heterogeneous, because of the large variations in preservation potential, changes in sea level and sea chemistry, current exposure of rocks to erosion, and other factors [44]. The result is extraordinarily complete coverage in the fossil record of narrow intervals and locations in Earth's history and much poorer or non-existent coverage elsewhere. A fundamental property of the fossil record is that it always underestimates divergence times because it is incomplete [45]; and even in the few cases for which the record is nearly complete, specimens that are in fact members of distinct lineages may not be recognized as such because they look so similar [29,44].

The quality of information that can be extracted from sequence data is equally notorious, but for rather different reasons. Variation in rates of sequence substitution is unpredictable and often rather large; furthermore, different lineages may have different patterns of rate variation [4,5,6,8,9]. Methods for estimating divergence times from sequence data do not rely on constant rates of substitution, but they do perform better when rate variation is small [10,11,12]. Unlike the fossil record, molecular evidence can both under- and over-estimate divergence times.

We are left with just a few basic possibilities to explain the discrepancies between divergence-time estimates based on fossils and sequences. One is that there is a fundamental bias towards overestimation of the time since divergence in sequences and that this bias is absent from the fossil record. There is no reason, however, to suspect that this is the case; indeed, estimates from fossils and sequences are often not very different (for example for the human-chimp and angiosperm divergences). Suggestions that rates of sequence evolution might be higher during radiations [46] are not supported by empirical evidence [23,39].

Another possibility is that the fossil record often underestimates divergence times. This is certainly the case for many taxa. For instance, there is essentially no fossil record for several animal phyla - such as flatworms, nematodes, and rotifers - yet we know on phylogenetic grounds that they must have been present for at least 500 million years [21,43]. The simple fact that the fossil record is a subsample of past diversity can also lead to substantial underestimates of divergence times. For example, a simple model of primate diversification using the times of appearance in the fossil record together with measures of fossilization potential suggests that 'modern' primates arose about 80 Ma, much closer to sequence-based estimates of divergence times than to the actual first appearance in the fossil record [47].
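The effect of subsampling can be made concrete with a toy simulation. The sketch below is not the primate model of [47]; it assumes, purely for illustration, that recoverable fossils of a lineage arise as a uniform Poisson process over its entire history, and the true origin and preservation rate are hypothetical numbers.

```python
import random


def oldest_fossil_age(true_origin_ma, finds_per_myr, rng):
    """Age (Ma) of the oldest fossil find, or None if the lineage left none.

    Assumes finds occur as a uniform Poisson process over the lineage's
    whole history, which real preservation does not satisfy.
    """
    # Counting forward from the true origin, the wait until the first
    # (that is, the oldest) find is exponentially distributed.
    gap_myr = rng.expovariate(finds_per_myr)
    age = true_origin_ma - gap_myr
    return age if age > 0 else None


rng = random.Random(1)    # fixed seed so the run is repeatable
TRUE_ORIGIN_MA = 100.0    # hypothetical true divergence time
FINDS_PER_MYR = 0.05      # hypothetical: one recoverable fossil per 20 Myr

ages = [oldest_fossil_age(TRUE_ORIGIN_MA, FINDS_PER_MYR, rng)
        for _ in range(10_000)]
found = [a for a in ages if a is not None]
mean_oldest = sum(found) / len(found)

print(f"lineages with any fossil: {len(found)} of {len(ages)}")
print(f"mean age of oldest fossil: {mean_oldest:.1f} Ma "
      f"(true origin {TRUE_ORIGIN_MA:.0f} Ma)")
```

Under these assumptions the oldest fossil can never predate the true origin, so the error is one-sided, and the sparser the record, the larger the expected gap.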

A third important cause of the discrepancy between fossil-based and sequence-based timing estimates is that they actually measure different events [23,43,44]. Sequence differences reflect the time since two taxa last shared a common ancestor (their divergence time), whereas fossils reflect the appearance of anatomical structures that define a specific group (its origin). The two events may be widely separated in time: early members of a group can be quite different in anatomy, habitat, and size from later, more familiar members [29,44]. This could lead to an apparent absence of a particular lineage from the fossil record, even though it existed at the time [45,48].

Discrepancies between fossil- and sequence-based estimates of divergence times could, in principle, be resolved through new fossil discoveries that close the gap. In cases for which the fossil record is generally rather good, this seems relatively unlikely. It has been argued, for instance, that the relatively high quality of the mammal fossil record makes it highly unlikely that representatives of modern mammal orders were present before the end of the Cretaceous but escaped fossilization [32,46].

But even in well-studied groups, surprises still occur. Several recent discoveries of Cretaceous bird and mammal fossils may be representatives of extant orders [48,49,50] and, if confirmed, would narrow the gap between fossil-based and sequence-based estimates of divergence times. Recent discoveries from Chengjiang, China, extend the fossil record of vertebrates, traditionally considered relatively complete, back in time by more than 10% of the previously estimated time since their origin [51]. The discovery of possible metazoan embryos from Doushantuo, China, would similarly extend the fossil record of metazoans back by about 12% if confirmed [52]. These expansions of the stratigraphic range of groups of organisms are not enough to erase discrepancies between fossil and sequence dates, but they serve as clear reminders that the final word on divergence times is not yet in from the fossil record.

Improving sequence-based estimations

Early attempts to use sequence data to reconstruct phylogenetic relationships were not uniformly successful: they often produced results that conflicted with each other or with common sense. These difficulties did not escape notice, prompting more than a few calls for abandoning such a manifestly misleading source of information about evolutionary history. The situation today is dramatically different. Molecular data are now routinely used in phylogenetic analyses and generally yield consistent and well-supported results. Although increases in the size of datasets have helped, the biggest gains have come from vastly improved analytical methods. In retrospect, using sequence data to infer phylogenetic relationships was not an inherently flawed approach, but the early analytical methods used were inadequate.

The parallels of divergence-time estimation with estimation of phylogenetic relationships are clear. The analytical methods in widespread use today are based on the original approach of Zuckerkandl and Pauling [1] (Figure 1). This approach suffers from two basic weaknesses: it relies on averaging multiple measures of the same divergence time to overcome the problem of rate variation, and it explicitly assumes that calibration points taken from the fossil record are accurate. Efforts to improve analytical methods have largely focused on the problem of rate variation, although inaccurate calibrations are probably an equally important source of error in divergence-time estimates.

One approach to rate variation has been to fine-tune the traditional approach. Genetic distances in general use today take into account several properties of sequence evolution, correcting for multiple substitutions at the same site in the sequence, for rate variation among sites, and for differences in the probability of different types of mutation [12]. Some authors have argued for removing taxa or genes from an analysis if they exceed an arbitrary degree of rate variation from the mean [38,53], but others have questioned the legitimacy of this approach and noted that, in any case, it does not reduce the magnitude of error associated with divergence time estimates [11,12,24,38]. The importance of dense phylogenetic sampling (using data from many species) has been stressed by some authors, both as a means of obtaining better calibrations and of better delineating rate variation among lineages [23,34,39].
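As an illustration of the first of these corrections, the sketch below applies the Jukes-Cantor formula, the simplest standard correction for multiple substitutions at the same site. It is chosen here only as an example and is not necessarily the distance used in the studies cited; it ignores rate variation among sites and differences between types of mutation, which the distances mentioned above also address. The sequences are invented.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def jukes_cantor_distance(p):
    """Expected substitutions per site given an observed proportion p.

    Assumes all four nucleotides are equally frequent and every type of
    substitution is equally likely (the Jukes-Cantor model).
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to exist")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Invented 10-site alignment with two observed differences.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor_distance(p)
print(f"observed p = {p:.3f}, corrected d = {d:.3f}")
```

The corrected distance always exceeds the observed proportion of differences, and the gap widens as sequences become more divergent, because more of the substitutions have been overwritten by later ones.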

A second approach is to assign different rates of sequence evolution to different lineages. This 'local clock' method involves calculating branch lengths for a phylogenetic tree encompassing the taxa of interest and then directly assigning different rates to different clades (groups of related organisms) [13,38,41]. More general models, using maximum-likelihood or non-parametric methods, derive continuous distributions of rate variation from a specific model of sequence evolution [11,14,54]. The latter methods are less arbitrary and provide more meaningful error bars on divergence-time estimates.

A third approach is to use Bayesian statistics to infer divergence times. This method builds on information provided by the investigator about phylogenetic relationships and divergence times (called the 'prior') to calculate a refined estimate of the variables to be assessed (the 'posterior'), given both the sequence data available and an explicit model of evolution [15,31]. These methods not only allow for rate variation but also incorporate uncertainties about dates used for calibration (for example, one calibration point may be given as 65 ± 3 Ma and another as 83 ± 15 Ma). With dense taxonomic sampling and a realistic model of evolution, Bayesian methods can substantially increase the accuracy of divergence-time estimates [34,55].
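The sketch below is deliberately much simpler than a Bayesian analysis: it has no prior on the tree, no likelihood for the sequence data and no model of rate variation. It only shows, by Monte Carlo sampling, how uncertainty in a calibration date carries through to a divergence-time estimate. Treating the '±' values as one standard deviation of a normal distribution is an assumption made here, and the genetic distances are hypothetical.

```python
import random
import statistics


def sample_divergence_times(distance, calib_distance, calib_mean_ma,
                            calib_sd_ma, n=20_000, seed=0):
    """Divergence-time estimates (Ma) under sampled calibration dates."""
    rng = random.Random(seed)
    times = []
    for _ in range(n):
        # Draw one plausible calibration age (assumed normally distributed).
        calib_age = rng.gauss(calib_mean_ma, calib_sd_ma)
        if calib_age <= 0:
            continue  # discard impossible (non-positive) ages
        # Constant-rate clock: calibrate the rate, then date the split.
        rate = calib_distance / (2.0 * calib_age)
        times.append(distance / (2.0 * rate))
    return times


# Hypothetical distances; the calibration dates follow the two examples
# given in the text (65 +/- 3 Ma and 83 +/- 15 Ma).
tight = sample_divergence_times(0.12, 0.10, calib_mean_ma=65.0, calib_sd_ma=3.0)
loose = sample_divergence_times(0.12, 0.10, calib_mean_ma=83.0, calib_sd_ma=15.0)

for label, times in (("65 +/- 3 Ma", tight), ("83 +/- 15 Ma", loose)):
    print(f"calibration {label}: mean {statistics.mean(times):.1f} Ma, "
          f"sd {statistics.stdev(times):.1f} Myr")
```

The poorly constrained calibration produces a much wider spread of estimates, which is the kind of uncertainty a full Bayesian treatment carries into its posterior rather than ignoring.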

In conclusion, assigning dates to branches on the 'Tree of Life' remains problematic, because both of the available sources of information are far from perfect. Of one point, however, we can be quite confident: the molecular datasets pertinent to this issue will become vastly larger in the very near future, whereas new information from fossils will continue to accumulate only sporadically. With more sequence data and better analytical methods, estimates of divergence times will probably converge on consistent dates with smaller error bars. Although some of the discrepancies between fossil-based and sequence-based dates (Figure 2) may disappear as a consequence, others may not. Already, studies using independent molecular datasets and different methods of analysis often concur that particular divergence times are substantially deeper than indicated by the fossil record. In such cases, and for groups for which no fossils are available, sequence data may be our best indication of the true divergence times. It would indeed be shortsighted to ignore the enormous, and still largely untapped, store of information that genomes hold regarding the timing of important evolutionary events.

Chen F-C, Li W-H. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet. 2001;68:444–456.