ADVENTURES OF A MATHEMATICIAN IN EVOLUTIONARY BIOLOGY

My adventures in biology began only in the second summer of my Ph.D. studies. In high school my favorite subjects were mathematics, chemistry, and physics, and I intended to become a physical scientist. For these reasons, I ended up taking only one biology course during high school. I studied engineering for my bachelor’s degree and geophysics for my master’s degree (both in Taiwan). I then went to Brown University to pursue a Ph.D. in applied mathematics. In the second summer I thought it would be good to have biology as a field of application because it was a subject that had received little attention from mathematicians. I talked to professors in biology and was fortunate to meet Dr. Masatoshi Nei, who told me that mathematics was much needed in population genetics. My Ph.D. advisor, Dr. Wendell Fleming, enthusiastically supported my intention, and I started to work on population genetics with Dr. Nei. I immediately found genetics very interesting because it explains many intriguing phenomena in the living world.

1. Theory of Population Genetics

Population genetics deals with the dynamics of genes in populations, such as how a new mutation may spread through a population or species. It also deals with the maintenance of genetic variability in a population. Naturally, such topics require sophisticated mathematics and were very attractive to me. In the 1970s, a longstanding issue was, “How much genetic variability can be maintained in a population under the joint effects of mutation, natural selection, and random drift?” The cases involving only two of the three evolutionary factors, such as mutation and random drift, had already been solved by then, but adding a third factor complicated the mathematics tremendously. I was delighted when I solved this problem in 1977. In addition, Dr. Nei and I developed formulas for computing the variance of genetic variation (heterozygosity) and methods for estimating DNA sequence divergence or nucleotide diversity from restriction enzyme data. These methods were widely used in the 1980s and 1990s. In 1987 I developed a stochastic model for studying the population dynamics of nearly neutral mutations, which was very helpful for understanding the nonrandom usage of synonymous codons in protein-coding genes. In addition, the method my postdoc Yun-Xin Fu and I developed in 1993 for detecting natural selection in DNA sequences from single nucleotide polymorphism (SNP) data has been one of the most widely used methods for analyzing SNP data.

2. Molecular Evolution

Comparative methods of DNA sequence analysis

In 1980 I noted that a substantial number of DNA sequences had already been determined. Appreciating the great potential of such data for evolutionary studies, I decided to devote most of my time to the study of DNA sequence evolution. In 1981 I developed a method for estimating the rate of nucleotide substitution in pseudogenes. Besides the methodological contribution, the significance of this study was that it showed that pseudogenes had evolved at the highest rate among all then-known mammalian DNA sequences, supporting the view that mutation rather than natural selection is the most important factor in determining the rate of evolution in a sequence. In 1981, I also developed a method of phylogenetic reconstruction that takes into account unequal rates of evolution among evolutionary lineages. This method was frequently used in the 1980s and greatly stimulated further methodological development.

I also developed many other commonly used methods for comparative analysis and phylogenetic reconstruction. My 1985 method and the 1993 modified version have been a standard method for estimating the synonymous and non-synonymous divergences between two protein-coding genes. I also did extensive studies of the bootstrap method, a computational technique for evaluating the statistical confidence of a tree. The bootstrap technique had been used in almost every statistical evaluation of molecular phylogenies, but there was no theoretical examination of its reliability until my 1992 work with Andrey Zharkikh. One of our important findings was that bootstrapping tends to give underestimates of actual support. We later developed a method for correcting the estimation bias, which represents one of the most sophisticated theoretical studies on molecular phylogenetics.

The molecular clock

Since the early 1980s I have been investigating the molecular clock, a hypothesis that postulates a constant rate of protein or DNA sequence evolution over time and among evolutionary lineages. Although this concept was extremely controversial in the 1960s and early 1970s, it became widely accepted among molecular evolutionists in the late 1970s because protein sequence data seemed to support it and, as a result, many authorities (e.g., Kimura and Wilson) strongly advocated it.

In 1985 my postdoc Chung-I Wu and I provided the first strong evidence from DNA sequence data against this hypothesis. This study was the first to use a significant amount of DNA sequence data to show that the rate of evolution in the rodent lineage was at least twice that in the primate lineage. In this study we also developed a statistical method for testing the molecular clock hypothesis, which has become a standard method in the field. Later, my postdoc Tomoko Tanimura and I showed that the molecular clock runs more slowly in man than in monkeys. These studies provided strong evidence for the generation-time effect, that is, the rate of molecular evolution is faster in short-lived organisms than in long-lived ones. My colleagues and I subsequently published a series of papers with extensive amounts of DNA sequence data to support this view. For example, we estimated that the substitution rate in mice and rats is more than 5 times higher than that in higher primates. Nevertheless, even in 2002 there were still claims for a global molecular clock in mammals. So my postdoc Soojin Yi and I recently published a paper showing that the rates of nucleotide substitution in Old World monkeys, apes, and humans are much lower than those in rodents. This study took advantage of the large amounts of genomic sequence data in hominoids and recently discovered fossils of early human ancestors, so the conclusion seems definitive. Note that the molecular clock hypothesis has been a central issue in molecular evolution since it was proposed in 1965. Because rate constancy has frequently been assumed in the estimation of divergence times between species and because the assumption of rate constancy has often been taken as strong support for the neutral mutation hypothesis, it is obviously important to show that this assumption is often violated. I am delighted to see that several methods have been developed for estimating species divergence times without assuming a molecular clock.
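
As an illustration only, the kind of comparison behind a test of rate constancy can be sketched with a generic relative-rate calculation. The sketch assumes two lineages A and B (for example, a rodent and a primate) and an outgroup C, with pairwise distances measured in substitutions per site. The function names, the three-point branch-length formula, and the numbers in the example are assumptions chosen for illustration; they are not data from the studies described above, and the sketch is not claimed to reproduce the 1985 method.

```python
def branch_lengths(k_ab, k_ac, k_bc):
    """Estimate the branch lengths leading to A and B from their common
    ancestor, using an outgroup C (three-point formula).

    k_ab, k_ac, k_bc are pairwise distances in substitutions per site.
    """
    d_a = (k_ab + k_ac - k_bc) / 2.0
    d_b = (k_ab + k_bc - k_ac) / 2.0
    return d_a, d_b


def relative_rate_z(k_ac, k_bc, std_error):
    """Return a z-like statistic for the difference K_AC - K_BC.

    Under a strict molecular clock the expected difference is zero.
    std_error is the standard error of the difference and must be
    supplied by the caller (how to estimate it is outside this sketch).
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    return (k_ac - k_bc) / std_error


if __name__ == "__main__":
    # Hypothetical distances, NOT taken from any study.
    d_a, d_b = branch_lengths(k_ab=0.30, k_ac=0.50, k_bc=0.40)
    print("branch to A:", d_a, "branch to B:", d_b)
    print("rate ratio A/B:", d_a / d_b)
    print("z statistic:", relative_rate_z(0.50, 0.40, std_error=0.02))
```

Because A and B have had the same amount of time to diverge from their common ancestor, unequal branch lengths imply unequal rates, and no fossil date is needed to make the comparison.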

Male-driven evolution

In the late 1980s I decided to establish a molecular biology laboratory, so that I could obtain data to address intriguing issues in molecular evolution. One such issue is the male-driven evolution hypothesis, which postulates that in humans and many other vertebrates mutation occurs mainly in males. My laboratory was the first to obtain suitable DNA sequence data to give a good estimate of the male-to-female ratio of mutation rates in higher primates, including humans and apes. My lab went on to show that male-driven evolution occurs in every mammalian DNA sequence studied and that it is weaker in mice and rats than in primates, providing another line of evidence for the generation-time effect on the molecular clock. However, two recent Nature papers claimed that the ratio in humans is only ~2 rather than 5-6. To counter these claims, my lab obtained new sequence data and showed that our original estimate was accurate. Furthermore, we pointed out that the low estimates in the two recent studies were due to conceptual and computational errors. Thus, my lab has provided ample support for Haldane’s (1935) concept of male-driven evolution. Besides providing evidence for strong male-driven evolution in humans and apes and for the generation-time effect, our studies also support the view that errors during DNA replication in the germ-line are the major source of mutation.
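
The essay does not say which estimator was used, so the following is only a sketch of one standard way to relate a male-to-female mutation rate ratio (here called alpha, an assumed symbol) to sequence data. It assumes that a Y-linked sequence is always carried by males, whereas an X-linked sequence spends one third of its time in males and two thirds in females, so the expected ratio of Y-linked to X-linked substitution rates is 3 * alpha / (2 + alpha). The function names are hypothetical.

```python
def yx_ratio(alpha):
    """Expected ratio of Y-linked to X-linked substitution rates
    for a given male-to-female mutation rate ratio alpha."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return 3.0 * alpha / (2.0 + alpha)


def alpha_from_yx(ratio):
    """Invert yx_ratio: recover alpha from an observed Y/X rate ratio.

    The ratio must lie in [0, 3); it approaches 3 as alpha grows
    without bound, so values near 3 give very unstable estimates.
    """
    if not 0 <= ratio < 3:
        raise ValueError("Y/X ratio must be in the interval [0, 3)")
    return 2.0 * ratio / (3.0 - ratio)


if __name__ == "__main__":
    # Illustrative values only, not measurements from any study.
    for alpha in (2.0, 6.0):
        print("alpha =", alpha, "-> expected Y/X =", yx_ratio(alpha))
    print("Y/X = 2.25 -> alpha =", alpha_from_yx(2.25))
```

Under these assumptions, alpha values of about 2 and about 6 correspond to Y/X ratios of 1.5 and 2.25, so a modest difference in the observed rate ratio translates into a large difference in the estimated alpha.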

3. Evolutionary Genomics

As early as 1991 I obtained an estimate of 0.001 for the nucleotide diversity in the human genome, which is defined as the number of nucleotide differences per site between two DNA sequences at the same locus randomly chosen from the human population. This study was conducted before the genomic era, indeed 10 years before the completion of the human genome. This estimate has since been supported by extensive data in recent years, some of which were from my lab. More recently, we found that the nucleotide diversity within Africans is actually larger than that between Africans and non-Africans, lending support to the Out of Africa model for the origin of modern humans. Furthermore, my student Feng-Chi Chen and I were the first to show that the nucleotide divergence between the human and chimpanzee genomes is only ~1%. We have also made significant contributions to the study of human genes. First, we found fragments of transposable elements in a large number of coding regions in human genes. Second, we developed a powerful method for predicting human genes using a comparative genomic approach, which was used to detect numerous potential coding regions that had not been detected by any other current gene prediction method. Our studies suggested that the number of human genes is probably considerably larger than the estimate of ~30,000 by the Human Genome Project.
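
The definition of nucleotide diversity given above can be made concrete with a minimal sketch: average the number of differing sites over all pairs of sampled sequences and divide by the sequence length. The function name and the toy sequences are assumptions for illustration; the sketch assumes equal-length, already aligned sequences with no gaps and treats every sampled sequence as equally frequent.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average number of nucleotide differences per site between
    all pairs of aligned, equal-length sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    total_diffs = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        # Count sites at which the two sequences differ.
        total_diffs += sum(1 for x, y in zip(a, b) if x != y)
        n_pairs += 1
    return total_diffs / (n_pairs * length)


if __name__ == "__main__":
    # Toy sequences, not real data.
    sample = ["ACGTACGTAC", "ACGTACGTAT", "ACGTACGAAC"]
    print(nucleotide_diversity(sample))
```

On this scale, a value of 0.001 means that two randomly chosen sequences differ at about one site in every thousand.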

In the last few years we have focused on the evolution of duplicate genes at the genomic level, a central issue in molecular evolution because gene duplication is the major source of genetic novelties. My student Z. Gu and I developed a rigorous method for detecting duplicate genes in a eukaryotic genome and gave the first estimates of the extents of gene duplication in the genomes of Drosophila, C. elegans, and yeast. A central question in the evolution of duplicate genes is how fast and how often duplicate genes diverge in gene expression. Using extensive microarray gene expression data from yeast we showed that a large number of duplicate genes have diverged quickly in expression and the vast majority of duplicate genes will sooner or later diverge in expression. This was the first study on this subject using data from an entire genome. We later found that this conclusion also holds for duplicate genes in the human genome. Recently, we showed that duplicate genes play a very significant role in genetic robustness against loss-of-function mutations, refuting the prevailing view that the role of duplicate genes in genetic robustness is negligible. This study took advantage of a nearly complete set of single-gene-deletion mutant strains of the entire yeast genome.

4. Concluding Remarks

In summary, since the 1970s I have been making efforts to deal with current issues in evolutionary biology and to integrate evolution and genetics, using both theoretical and empirical approaches. My adventures have been highly exciting and fruitful. I made a lucky choice in the subject of my research because genetics and molecular biology have been expanding rapidly ever since I started my scientific career in the 1970s. Even more rapid expansion will occur because of developments in genomics, including functional genomics and proteomics. I look forward to more exciting adventures in the years to come.

One research area that I intend to go into is the evolution of gene regulation and regulatory modules. The significance of regulatory evolution has been recognized since the 1960s. For example, the conspicuous morphological differences between humans and chimpanzees were suggested to be mainly due to regulatory differences between the two species. However, regulatory evolution has not been much studied in the past, mainly because of technical difficulties in obtaining suitable data. Fortunately, past advances in molecular biology, especially in gene regulation, and recent advances in genomics have made it possible to study not only the regulation of single genes but also the co-regulation of many genes. Thus, it is now possible to carry out detailed investigations of regulatory evolution in eukaryotes, including the evolution of regulatory modules, which are sets of co-regulated genes, as well as the evolution of the entire regulatory network in closely related eukaryotes. This represents a new direction of research in my laboratory.