Precise estimates of mutation rate and spectrum in yeast. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Zhu, Y. O., Siegal, M. L., Hall, D. W., Petrov, D. A. 2014; 111 (22): E2310-E2318

Abstract

Mutation is the ultimate source of genetic variation. The most direct and unbiased method of studying spontaneous mutations is via mutation accumulation (MA) lines. Until recently, MA experiments were limited by the cost of sequencing and thus provided us with small numbers of mutational events and therefore imprecise estimates of rates and patterns of mutation. We used whole-genome sequencing to identify nearly 1,000 spontaneous mutation events accumulated over ∼311,000 generations in 145 diploid MA lines of the budding yeast Saccharomyces cerevisiae. MA experiments are usually assumed to have negligible levels of selection, but even mild selection will remove strongly deleterious events. We take advantage of such patterns of selection and show that mutation classes such as indels and aneuploidies (especially monosomies) are proportionately much more likely to contribute mutations of large effect. We also provide conservative estimates of indel, aneuploidy, environment-dependent dominant lethal, and recessive lethal mutation rates. To our knowledge, for the first time in yeast MA data, we identified a sufficiently large number of single-nucleotide mutations to measure context-dependent mutation rates and were able to (i) confirm strong AT bias of mutation in yeast driven by high rate of mutations from C/G to T/A and (ii) detect a higher rate of mutation at C/G nucleotides in two specific contexts consistent with cytosine methylation in S. cerevisiae.

Abstract

The role of positive selection in human evolution remains controversial. On the one hand, scans for positive selection have identified hundreds of candidate loci, and the genome-wide patterns of polymorphism show signatures consistent with frequent positive selection. On the other hand, recent studies have argued that many of the candidate loci are false positives and that most genome-wide signatures of adaptation are in fact due to reduction of neutral diversity by linked deleterious mutations, known as background selection. Here we analyze human polymorphism data from the 1000 Genomes Project and detect signatures of positive selection once we correct for the effects of background selection. We show that levels of neutral polymorphism are lower near amino acid substitutions, with the strongest reduction observed specifically near functionally consequential amino acid substitutions. Furthermore, amino acid substitutions are associated with signatures of recent adaptation that should not be generated by background selection, such as unusually long and frequent haplotypes and specific distortions in the site frequency spectrum. We use forward simulations to argue that the observed signatures require a high rate of strongly adaptive substitutions near amino acid changes. We further demonstrate that the observed signatures of positive selection correlate better with the presence of regulatory sequences, as predicted by the ENCODE Project Consortium, than with the positions of amino acid substitutions. Our results suggest that adaptation was frequent in human evolution and provide support for the hypothesis of King and Wilson that adaptive divergence is primarily driven by regulatory changes.

Abstract

Organisms can often adapt surprisingly quickly to evolutionary challenges, such as the application of pesticides or antibiotics, suggesting an abundant supply of adaptive genetic variation. In these situations, adaptation should commonly produce 'soft' selective sweeps, where multiple adaptive alleles sweep through the population at the same time, either because the alleles were already present as standing genetic variation or arose independently by recurrent de novo mutations. Most well-known examples of rapid molecular adaptation indeed show signatures of such soft selective sweeps. Here, we review the current understanding of the mechanisms that produce soft sweeps and the approaches used for their identification in population genomic data. We argue that soft sweeps might be the dominant mode of adaptation in many species.

Frequent adaptation and the McDonald-Kreitman test. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Messer, P. W., Petrov, D. A. 2013; 110 (21): 8615-8620

Abstract

Population genomic studies have shown that genetic draft and background selection can profoundly affect the genome-wide patterns of molecular variation. We performed forward simulations under realistic gene-structure and selection scenarios to investigate whether such linkage effects impinge on the ability of the McDonald-Kreitman (MK) test to infer the rate of positive selection (α) from polymorphism and divergence data. We find that in the presence of slightly deleterious mutations, MK estimates of α severely underestimate the true rate of adaptation even if all polymorphisms with population frequencies under 50% are excluded. Furthermore, already under intermediate rates of adaptation, genetic draft substantially distorts the site frequency spectra at neutral and functional sites from the expectations under mutation-selection-drift balance. MK-type approaches that first infer demography from synonymous sites and then use the inferred demography to correct the estimation of α obtain almost the correct α in our simulations. However, these approaches typically infer a severe past population expansion although there was no such expansion in the simulations, casting doubt on the accuracy of methods that infer demography from synonymous polymorphism data. We propose a simple asymptotic extension of the MK test that yields accurate estimates of α in our simulations and should provide a fruitful direction for future studies.
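As a rough illustration of the quantities discussed in this abstract, the sketch below computes the standard MK estimate, α = 1 − (Ds·Pn)/(Dn·Ps), and then the same estimate separately for each derived-allele-frequency bin. It is not the authors' implementation: the function names (`mk_alpha`, `alpha_by_frequency`) and all counts are hypothetical, and the abstract does not specify how the asymptotic extension extrapolates α(x), so no curve fitting is attempted here.

```python
def mk_alpha(dn, ds, pn, ps):
    """Standard MK estimate of alpha from counts of nonsynonymous (dn) and
    synonymous (ds) divergent sites and nonsynonymous (pn) and synonymous (ps)
    polymorphic sites: alpha = 1 - (ds * pn) / (dn * ps)."""
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero")
    return 1.0 - (ds * pn) / (dn * ps)


def alpha_by_frequency(dn, ds, pn_bins, ps_bins):
    """alpha(x): the same estimate using only the polymorphisms that fall in
    each derived-allele-frequency bin (bins ordered from low to high x)."""
    return [mk_alpha(dn, ds, pn, ps) for pn, ps in zip(pn_bins, ps_bins)]


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
alpha = mk_alpha(dn=60, ds=100, pn=30, ps=100)

# Hypothetical per-bin polymorphism counts (low, intermediate, high frequency).
alpha_x = alpha_by_frequency(dn=60, ds=100,
                             pn_bins=[50, 30, 20],
                             ps_bins=[100, 100, 100])
print(alpha, alpha_x)
```

In this made-up example α(x) rises with allele frequency, which is the pattern expected when slightly deleterious nonsynonymous polymorphisms are concentrated at low frequencies; an asymptotic approach would extrapolate that trend toward x = 1.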

Abstract

Synonymous sites are generally assumed to be subject to weak selective constraint. For this reason, they are often neglected as a possible source of important functional variation. We use site frequency spectra from deep population sequencing data to show that, contrary to this expectation, 22% of four-fold synonymous (4D) sites in Drosophila melanogaster evolve under very strong selective constraint while few, if any, appear to be under weak constraint. Linking polymorphism with divergence data, we further find that the fraction of synonymous sites exposed to strong purifying selection is higher for those positions that show slower evolution on the Drosophila phylogeny. The function underlying the inferred strong constraint appears to be separate from splicing enhancers, nucleosome positioning, and the translational optimization generating canonical codon bias. The fraction of synonymous sites under strong constraint within a gene correlates well with gene expression, particularly in the mid-late embryo, pupae, and adult developmental stages. Genes enriched in strongly constrained synonymous sites tend to be particularly functionally important and are often involved in key developmental pathways. Given that the observed widespread constraint acting on synonymous sites is likely not limited to Drosophila, the role of synonymous sites in genetic disease and adaptation should be reevaluated.

Abstract

The fruit fly Drosophila is a classic model organism to study adaptation as well as the relationship between genetic variation and phenotypes. Although associated bacterial communities might be important for many aspects of Drosophila biology, knowledge about their diversity, composition, and factors shaping them is limited. We used 454-based sequencing of a variable region of the bacterial 16S ribosomal RNA gene to characterize the bacterial communities associated with wild and laboratory Drosophila isolates. In order to specifically investigate effects of food source and host species on bacterial communities, we analyzed samples from wild Drosophila melanogaster and D. simulans collected from a variety of natural substrates, as well as from adults and larvae of nine laboratory-reared Drosophila species. We find no evidence for host species effects in lab-reared flies; instead, lab of origin and stochastic effects, which could influence studies of Drosophila phenotypes, are pronounced. In contrast, the natural Drosophila-associated microbiota appears to be predominantly shaped by food substrate with an additional but smaller effect of host species identity. We identify a core member of this natural microbiota that belongs to the genus Gluconobacter and is common to all wild-caught flies in this study, but absent from the laboratory. This makes it a strong candidate for being part of what could be a natural D. melanogaster and D. simulans core microbiome. Furthermore, we were able to identify candidate pathogens in natural fly isolates.

Heterozygote advantage as a natural consequence of adaptation in diploids. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Sellis, D., Callahan, B. J., Petrov, D. A., Messer, P. W. 2011; 108 (51): 20666-20671

Abstract

Molecular adaptation is typically assumed to proceed by sequential fixation of beneficial mutations. In diploids, this picture presupposes that for most adaptive mutations, the homozygotes have a higher fitness than the heterozygotes. Here, we show that contrary to this expectation, a substantial proportion of adaptive mutations should display heterozygote advantage. This feature of adaptation in diploids emerges naturally from the primary importance of the fitness of heterozygotes for the invasion of new adaptive mutations. We formalize this result in the framework of Fisher's influential geometric model of adaptation. We find that in diploids, adaptation should often proceed through a succession of short-lived balanced states that maintain substantially higher levels of phenotypic and fitness variation in the population compared with classic adaptive walks. In fast-changing environments, this variation produces a diversity advantage that allows diploids to remain better adapted compared with haploids despite the disadvantage associated with the presence of unfit homozygotes. The short-lived balanced states arising during adaptive walks should be mostly invisible to current scans for long-term balancing selection. Instead, they should leave signatures of incomplete selective sweeps, which do appear to be common in many species. Our results also raise the possibility that balancing selection, as a natural consequence of frequent adaptation, might play a more prominent role among the forces maintaining genetic variation than is commonly recognized.

High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. GENOME RESEARCH. Markova-Raina, P., Petrov, D. 2011; 21 (6): 863-874

Abstract

We investigate the effect of aligner choice on inferences of positive selection using site-specific models of molecular evolution. We find that independently of the choice of aligner, the rate of false positives is unacceptably high. Our study is a whole-genome analysis of all protein-coding genes in 12 Drosophila genomes annotated in either all 12 species (~6690 genes) or in the six melanogaster group species. We compare six popular aligners: PRANK, T-Coffee, ClustalW, ProbCons, AMAP, and MUSCLE, and find that the aligner choice strongly influences the estimates of positive selection. Differences persist when we use (1) different stringency cutoffs, (2) different selection inference models, (3) alignments with or without gaps, and/or additional masking, (4) per-site versus per-gene statistics, (5) closely related melanogaster group species versus more distant 12 Drosophila genomes. Furthermore, we find that these differences are consequential for downstream analyses such as determination of over/under-represented GO terms associated with positive selection. Visual analysis indicates that most sites inferred as positively selected are, in fact, misaligned at the codon level, resulting in false-positive rates of 48%-82%. PRANK, which has been reported to outperform other aligners in simulations, performed best in our empirical study as well. Unfortunately, PRANK still had a false-positive rate of 50%-55%, which is high and unacceptable for most applications. We identify misannotations and indels, many of which appear to be located in disordered protein regions, as primary culprits for the high misalignment-related error levels and discuss possible workaround approaches to this apparently pervasive problem in genome-wide evolutionary analyses.

Abstract

Transposable elements (TEs) are the primary contributors to the genome bulk in many organisms and are major players in genome evolution. A clear and thorough understanding of the population dynamics of TEs is therefore essential for full comprehension of the eukaryotic genome evolution and function. Although TEs in Drosophila melanogaster have received much attention, population dynamics of most TE families in this species remains entirely unexplored. It is not clear whether the same population processes can account for the population behaviors of all TEs in Drosophila or whether, as has been suggested previously, different orders behave according to very different rules. In this work, we analyzed population frequencies for a large number of individual TEs (755 TEs) in five North American and one sub-Saharan African D. melanogaster populations (75 strains in total). These TEs have been annotated in the reference D. melanogaster euchromatic genome and have been sampled from all three major orders (non-LTR, LTR, and TIR) and from all families with more than 20 TE copies (55 families in total). We find strong evidence that TEs in Drosophila across all orders and families are subject to purifying selection at the level of ectopic recombination. We showed that strength of this selection varies predictably with recombination rate, length of individual TEs, and copy number and length of other TEs in the same family. Importantly, these rules do not appear to vary across orders. Finally, we built a statistical model that considered only individual TE-level (such as the TE length) and family-level properties (such as the copy number) and were able to explain more than 40% of the variation in TE frequencies in D. melanogaster.

Abstract

Transposable elements (TEs) are repetitive DNA sequences that are ubiquitous, extremely abundant and dynamic components of practically all genomes. Much effort has gone into annotation of TE copies in reference genomes. The sequencing cost reduction and the newly available next-generation sequencing (NGS) data from multiple strains within a species offer an unprecedented opportunity to study population genomics of TEs in a range of organisms. Here, we present a computational pipeline (T-lex) that uses NGS data to detect the presence/absence of annotated TE copies. T-lex can use data from a large number of strains and returns estimates of population frequencies of individual TE insertions in a reasonable time. We experimentally validated the accuracy of T-lex detecting presence or absence of 768 previously identified TE copies in two resequenced Drosophila melanogaster strains. Approximately 95% of the TE insertions were detected with 100% sensitivity and 97% specificity. We show that even at low levels of coverage T-lex produces accurate results for TE copies that it can identify reliably but that the rate of 'no data' calls increases as the coverage falls below 15×. T-lex is a broadly applicable and flexible tool that can be used in any genome provided the availability of the reference genome, individual TE copy annotation and NGS data.

Abstract

Mutation is the engine that drives evolution and adaptation forward in that it generates the variation on which natural selection acts. Mutation is a random process that nevertheless occurs according to certain biases. Elucidating mutational biases and the way they vary across species and within genomes is crucial to understanding evolution and adaptation. Here we demonstrate that clonal pathogens that evolve under severely relaxed selection are uniquely suitable for studying mutational biases in bacteria. We estimate mutational patterns using sequence datasets from five such clonal pathogens belonging to four diverse bacterial clades that span most of the range of genomic nucleotide content. We demonstrate that across different types of sites and in all four clades mutation is consistently biased towards AT. This is true even in clades that have high genomic GC content. In all studied cases the mutational bias towards AT is primarily due to the high rate of C/G to T/A transitions. These results suggest that bacterial mutational biases are far less variable than previously thought. They further demonstrate that variation in nucleotide content cannot stem entirely from variation in mutational biases and that natural selection and/or a natural selection-like process such as biased gene conversion strongly affect nucleotide content.

Evidence that Adaptation in Drosophila Is Not Limited by Mutation at Single Sites. PLOS GENETICS. Karasov, T., Messer, P. W., Petrov, D. A. 2010; 6 (6)

Abstract

Adaptation in eukaryotes is generally assumed to be mutation-limited because of small effective population sizes. This view is difficult to reconcile, however, with the observation that adaptation to anthropogenic changes, such as the introduction of pesticides, can occur very rapidly. Here we investigate adaptation at a key insecticide resistance locus (Ace) in Drosophila melanogaster and show that multiple simple and complex resistance alleles evolved quickly and repeatedly within individual populations. Our results imply that the current effective population size of modern D. melanogaster populations is likely to be substantially larger (≥100-fold) than commonly believed. This discrepancy arises because estimates of the effective population size are generally derived from levels of standing variation and thus reveal long-term population dynamics dominated by sharp, even if infrequent, bottlenecks. The short-term effective population sizes relevant for strong adaptation, on the other hand, might be much closer to census population sizes. Adaptation in Drosophila may therefore not be limited by waiting for mutations at single sites, and complex adaptive alleles can be generated quickly without fixation of intermediate states. Adaptive events should also commonly involve the simultaneous rise in frequency of independently generated adaptive mutations. These so-called soft sweeps have very distinct effects on the linked neutral polymorphisms compared to the standard hard sweeps in mutation-limited scenarios. Methods for the mapping of adaptive mutations or association mapping of evolutionarily relevant mutations may thus need to be reconsidered.

Abstract

Investigating spatial patterns of loci under selection can give insight into how populations evolved in response to selective pressures and can provide monitoring tools for detecting the impact of environmental changes on populations. Drosophila is a particularly good model to study adaptation to environmental heterogeneity since it is a tropical species that originated in sub-Saharan Africa and has only recently colonized the rest of the world. There is strong evidence for the adaptive role of Transposable Elements (TEs) in the evolution of Drosophila, and TEs might play an important role specifically in adaptation to temperate climates. In this work, we analyzed the frequency of a set of putatively adaptive and putatively neutral TEs in populations with contrasting climates that were collected near the endpoints of two known latitudinal clines in Australia and North America. The contrasting results obtained for putatively adaptive and putatively neutral TEs and the consistency of the patterns between continents strongly suggest that putatively adaptive TEs are involved in adaptation to temperate climates. We integrated information on population behavior, possible environmental selective agents, and both molecular and functional information of the TEs and their nearby genes to infer the plausible phenotypic consequences of these insertions. We conclude that adaptation to temperate environments is widespread in Drosophila and that TEs play a significant role in this adaptation. It is remarkable that such a diverse set of TEs located next to a diverse set of genes is consistently adaptive to temperate climate-related factors. We argue that reverse population genomic analyses, such as the one described in this work, are necessary to arrive at a comprehensive picture of adaptation.

Abstract

The molecular mechanisms underlying major phenotypic changes that have evolved repeatedly in nature are generally unknown. Pelvic loss in different natural populations of threespine stickleback fish has occurred through regulatory mutations deleting a tissue-specific enhancer of the Pituitary homeobox transcription factor 1 (Pitx1) gene. The high prevalence of deletion mutations at Pitx1 may be influenced by inherent structural features of the locus. Although Pitx1 null mutations are lethal in laboratory animals, Pitx1 regulatory mutations show molecular signatures of positive selection in pelvic-reduced populations. These studies illustrate how major expression and morphological changes can arise from single mutational leaps in natural populations, producing new adaptive alleles via recurrent regulatory alterations in a key developmental control gene.

Abstract

Different synonymous codons are favored by natural selection for translation efficiency and accuracy in different organisms. The rules governing the identities of favored codons in different organisms remain obscure. In fact, it is not known whether such rules exist or whether favored codons are chosen randomly in evolution in a process akin to a series of frozen accidents. Here, we study this question by identifying for the first time the favored codons in 675 bacteria, 52 archaea, and 10 fungi. We use a number of tests to show that the identified codons are indeed likely to be favored and find that across all studied organisms the identity of favored codons tracks the GC content of the genomes. Once the effect of the genomic GC content on selectively favored codon choice is taken into account, additional universal amino acid specific rules governing the identity of favored codons become apparent. Our results provide for the first time a clear set of rules governing the evolution of selectively favored codon usage. Based on these results, we describe a putative scenario for how evolutionary shifts in the identity of selectively favored codons can occur without even temporary weakening of natural selection for codon bias.

Abstract

Over the past four decades, the predominant view of molecular evolution saw little connection between natural selection and genome evolution, assuming that the functionally constrained fraction of the genome is relatively small and that adaptation is sufficiently infrequent to play little role in shaping patterns of variation within and even between species. Recent evidence from Drosophila, reviewed here, suggests that this view may be invalid. Analyses of genetic variation within and between species reveal that much of the Drosophila genome is under purifying selection, and thus of functional importance, and that a large fraction of coding and noncoding differences between species are adaptive. The findings further indicate that, in Drosophila, adaptations may be both common and strong enough that the fate of neutral mutations depends on their chance linkage to adaptive mutations as much as on the vagaries of genetic drift. The emerging evidence has implications for a wide variety of fields, from conservation genetics to bioinformatics, and presents challenges to modelers and experimentalists alike.

Abstract

Much effort and interest have focused on assessing the importance of natural selection, particularly positive natural selection, in shaping the human genome. Although scans for positive selection have identified candidate loci that may be associated with positive selection in humans, such scans do not indicate whether adaptation is frequent in general in humans. Studies based on the reasoning of the McDonald-Kreitman test, which, in principle, can be used to evaluate the extent of positive selection, suggested that adaptation is detectable in the human genome but that it is less common than in Drosophila or Escherichia coli. Both positive and purifying natural selection at functional sites should affect levels and patterns of polymorphism at linked nonfunctional sites. Here, we search for these effects by analyzing patterns of neutral polymorphism in humans in relation to the rates of recombination, functional density, and functional divergence with chimpanzees. We find that the levels of neutral polymorphism are lower in the regions of lower recombination and in the regions of higher functional density or divergence. These correlations persist after controlling for the variation in GC content, density of simple repeats, selective constraint, mutation rate, and depth of sequencing coverage. We argue that these results are most plausibly explained by the effects of natural selection at functional sites (either recurrent selective sweeps or background selection) on the levels of linked neutral polymorphism. Natural selection at both coding and regulatory sites appears to affect linked neutral polymorphism, reducing neutral polymorphism by 6% genome-wide and by 11% in the gene-rich half of the human genome. These findings suggest that the effects of natural selection at linked sites cannot be ignored in the study of neutral human polymorphism.

Abstract

Mycobacterium tuberculosis infects one third of the human world population and kills someone every 15 seconds. For more than a century, scientists and clinicians have been distinguishing between the human- and animal-adapted members of the M. tuberculosis complex (MTBC). However, all human-adapted strains of MTBC have traditionally been considered to be essentially identical. We surveyed sequence diversity within a global collection of strains belonging to MTBC using seven megabase pairs of DNA sequence data. We show that the members of MTBC affecting humans are more genetically diverse than generally assumed, and that this diversity can be linked to human demographic and migratory events. We further demonstrate that these organisms are under extremely reduced purifying selection and that, as a result of increased genetic drift, much of this genetic diversity is likely to have functional consequences. Our findings suggest that the current increases in human population, urbanization, and global travel, combined with the population genetic characteristics of M. tuberculosis described here, could contribute to the emergence and spread of drug-resistant tuberculosis.

Abstract

Although transposable elements (TEs) are known to be potent sources of mutation, their contribution to the generation of recent adaptive changes has never been systematically assessed. In this work, we conduct a genome-wide screen for adaptive TE insertions in Drosophila melanogaster that have taken place during or after the spread of this species out of Africa. We determine population frequencies of 902 of the 1,572 TEs in Release 3 of the D. melanogaster genome and identify a set of 13 putatively adaptive TEs. These 13 TEs increased in population frequency sharply after the spread out of Africa. We argue that many of these TEs are in fact adaptive by demonstrating that the regions flanking five of these TEs display signatures of partial selective sweeps. Furthermore, we show that eight out of the 13 putatively adaptive elements show population frequency heterogeneity consistent with these elements playing a role in adaptation to temperate climates. We conclude that TEs have contributed considerably to recent adaptive evolution (one TE-induced adaptation every 200-1,250 y). The majority of these adaptive insertions are likely to be involved in regulatory changes. Our results also suggest that TE-induced adaptations arise more often from standing variants than from new mutations. Such a high rate of TE-induced adaptation is inconsistent with the number of fixed TEs in the D. melanogaster genome, and we discuss possible explanations for this discrepancy.

Abstract

The loss of functional redundancy is the key process in the evolution of duplicated genes. Here we systematically assess the extent of functional redundancy among a large set of duplicated genes in Saccharomyces cerevisiae. We quantify growth rate in rich medium for a large number of S. cerevisiae strains that carry single and double deletions of duplicated and singleton genes. We demonstrate that duplicated genes can maintain substantial redundancy for extensive periods of time following duplication (approximately 100 million years). We find high levels of redundancy among genes duplicated both via the whole genome duplication and via smaller scale duplications. Further, we see no evidence that two duplicated genes together contribute to fitness in rich medium substantially beyond that of their ancestral progenitor gene. We argue that duplicate genes do not often evolve to behave like singleton genes even after very long periods of time.

Abstract

A beneficial mutation that has nearly but not yet fixed in a population produces a characteristic haplotype configuration, called a partial selective sweep. Whether nonadaptive processes might generate similar haplotype configurations has not been extensively explored. Here, we consider 5 population genetic data sets taken from regions flanking high-frequency transposable elements in North American strains of Drosophila melanogaster, each of which appears to be consistent with the expectations of a partial selective sweep. We use coalescent simulations to explore whether incorporation of the species' demographic history, purifying selection against the element, or suppression of recombination caused by the element could generate putatively adaptive haplotype configurations. Whereas most of the data sets would be rejected as nonneutral under the standard neutral null model, only the data set for which there is strong external evidence in support of an adaptive transposition appears to be nonneutral under the more complex null model and in particular when demography is taken into account. High-frequency, derived mutations from a recently bottlenecked population, such as we study here, are of great interest to evolutionary genetics in the context of scans for adaptive events; we discuss the broader implications of our findings in this context.

Abstract

The effect of recurrent selective sweeps is a spatially heterogeneous reduction in neutral polymorphism throughout the genome. The pattern of reduction depends on the selective advantage and recurrence rate of the sweeps. Because many adaptive substitutions responsible for these sweeps also contribute to nonsynonymous divergence, the spatial distribution of nonsynonymous divergence also reflects the distribution of adaptive substitutions. Thus, the spatial correspondence between neutral polymorphism and nonsynonymous divergence may be especially informative about the process of adaptation. Here we study this correspondence using genomewide polymorphism data from Drosophila simulans and the divergence between D. simulans and D. melanogaster. Focusing on highly recombining portions of the autosomes, at a spatial scale appropriate to the study of selective sweeps, we find that neutral polymorphism is both lower and, as measured by a new statistic Q(S), less homogeneous where nonsynonymous divergence is higher and that the spatial structure of this correlation is best explained by the action of strong recurrent selective sweeps. We introduce a method to infer, from the spatial correspondence between polymorphism and divergence, the rate and selective strength of adaptation. Our results independently confirm a high rate of adaptive substitution (approximately 1/3000 generations) and newly suggest that many adaptations are of surprisingly great selective effect (approximately 1%), reducing the effective population size by approximately 15% even in highly recombining regions of the genome.

Abstract

Levels of molecular diversity in Drosophila have repeatedly been shown to be higher in ancestral, African populations than in derived, non-African populations. This pattern holds for both coding and noncoding regions for a variety of molecular markers including single nucleotide polymorphisms and microsatellites. Comparisons of X-linked and autosomal diversity have yielded results largely dependent on population of origin. In an attempt to further elucidate patterns of sequence diversity in Drosophila melanogaster, we studied nucleotide variation at putatively nonfunctional X-linked and autosomal loci in sub-Saharan African and North American strains of D. melanogaster. We combine our experimental results with data from previous studies of molecular polymorphism in this species. We confirm that levels of diversity are consistently higher in African versus North American strains. The relative reduction of diversity for X-linked and autosomal loci in the derived, North American strains depends heavily on the studied loci. While the compiled dataset, composed primarily of regions within or in close proximity to genes, shows a much more severe reduction of diversity on the X chromosome compared to autosomes in derived strains, the dataset consisting of intergenic loci located far from genes shows very similar reductions in diversity for X-linked and autosomal loci in derived strains. In addition, levels of diversity at X-linked and autosomal loci in the presumably ancestral African population are more similar than expected under an assumption of neutrality and equal numbers of breeding males and females. We show that simple demographic scenarios under assumptions of neutral theory cannot explain all of the observed patterns of molecular diversity. We suggest that the simplest model is a population bottleneck that retains an ancestral female-biased sex ratio, coupled with higher rates of positive selection at X-linked loci in close proximity to genes specifically in derived, non-African populations.

Abstract

Obligate pathogenic bacteria lose more genes relative to facultative pathogens, which, in turn, lose more genes than free-living bacteria. It was suggested that the increased gene loss in obligate pathogens may be due to a reduction in the effectiveness of purifying selection. Less attention has been given to the causes of increased gene loss in facultative pathogens. We examined in detail the rate of gene loss in two groups of facultative pathogenic bacteria: pathogenic Escherichia coli, and Shigella. We show that Shigella strains are losing genes at an accelerated rate relative to pathogenic E. coli. We demonstrate that a genome-wide reduction in the effectiveness of selection contributes to the observed increase in the rate of gene loss in Shigella. When compared with their closely related pathogenic E. coli relatives, the more niche-limited Shigella strains appear to be losing genes at a significantly accelerated rate. A genome-wide reduction in the effectiveness of purifying selection plays a role in creating this observed difference. Our results demonstrate that differences in the effectiveness of selection contribute to differences in rate of gene loss in facultative pathogenic bacteria. We discuss how the lifestyle and pathogenicity of Shigella may alter the effectiveness of selection, thus influencing the rate of gene loss.

Abstract

Comparing patterns of molecular evolution between autosomes and sex chromosomes (such as X and W chromosomes) can provide insight into the forces underlying genome evolution. Here we investigate patterns of codon bias evolution on the X chromosome and autosomes in Drosophila and Caenorhabditis. We demonstrate that X-linked genes have significantly higher codon bias compared to autosomal genes in both Drosophila and Caenorhabditis. Furthermore, genes that become X-linked evolve higher codon bias gradually, over tens of millions of years. We provide several lines of evidence that this elevation in codon bias is due exclusively to their chromosomal location and not to any other property of X-linked genes. We present two possible explanations for these observations. One possibility is that natural selection is more efficient on the X chromosome due to effective haploidy of the X chromosomes in males and persistently low effective numbers of reproducing males compared to that of females. Alternatively, X-linked genes might experience stronger natural selection for higher codon bias as a result of maladaptive reduction of their dosage engendered by the loss of the Y-linked homologs.

Abstract

To study adaptation, it is essential to identify multiple adaptive mutations and to characterize their molecular, phenotypic, selective, and ecological consequences. Here we describe a genomic screen for adaptive insertions of transposable elements in Drosophila. Using a pilot application of this screen, we have identified an adaptive transposable element insertion, which truncates a gene and apparently generates a functional protein in the process. The insertion of this transposable element confers increased resistance to an organophosphate pesticide and has spread in D. melanogaster recently.

Abstract

This study presents the first global, 1-Mbp-level analysis of patterns of nucleotide substitutions along the human lineage. The study is based on the analysis of a large number of repetitive elements deposited into the human genome since the mammalian radiation, yielding a number of results that would have been difficult to obtain using the more conventional comparative method of analysis. This analysis revealed substantial and consistent variability of rates of substitution, with the variability ranging up to twofold among different regions. The rates of substitutions of C or G nucleotides with A or T nucleotides vary much more sharply than the reverse rates, suggesting that much of that variation is due to differences in mutation rates rather than in the probabilities of fixation of C/G vs. A/T nucleotides across the genome. For all types of substitution we observe substantially more hotspots than coldspots, with hotspots showing substantial clustering over tens of Mbp. Our analysis revealed that GC-content of surrounding sequences is the best predictor of the rates of substitution. The pattern of substitution appears very different near telomeres compared to the rest of the genome and cannot be explained by the genome-wide correlations of the substitution rates with GC content or exon density. The telomere pattern of substitution is consistent with natural selection or biased gene conversion acting to increase the GC-content of the sequences that are within 10-15 Mbp away from the telomere.

Abstract

A central goal in genome biology is to understand the origin and maintenance of genic diversity. Over evolutionary time, each gene's contribution to the genic content of an organism depends not only on its probability of long-term survival, but also on its propensity to generate duplicates that are themselves capable of long-term survival. In this study we investigate which types of genes are likely to generate functional and persistent duplicates. We demonstrate that genes that have generated duplicates in the C. elegans and S. cerevisiae genomes were 25%-50% more constrained prior to duplication than the genes that failed to leave duplicates. We further show that conserved genes have been consistently prolific in generating duplicates for hundreds of millions of years in these two species. These findings reveal one way in which gene duplication shapes the content of eukaryotic genomes. Our finding that the set of duplicate genes is biased has important implications for genome-scale studies.

Abstract

Differences in the regional substitution patterns in the human genome created patterns of large-scale variation of base composition known as genomic isochores. To gain insight into the origin of the genomic isochores, we develop a maximum-likelihood approach to determine the history of substitution patterns in the human genome. This approach utilizes the vast amount of repetitive sequence deposited in the human genome over the past approximately 250 Myr. Using this approach, we estimate the frequencies of seven types of substitutions: the four transversions, two transitions, and the methyl-assisted transition of cytosine in CpG. Comparing substitutional patterns in repetitive elements of various ages, we reconstruct the history of the base-substitutional process in the different isochores for the past 250 Myr. At around 90 MYA (around the time of the mammalian radiation), we find an abrupt fourfold to eightfold increase of the cytosine transition rate in CpG pairs compared with that of the reptilian ancestor. Further analysis of nucleotide substitutions in regions with different GC content reveals concurrent changes in the substitutional patterns. Although the substitutional pattern was dependent on the regional GC content in such ways that it preserved the regional GC content before the mammalian radiation, it lost this dependence afterward. The substitutional pattern changed from an isochore-preserving to an isochore-degrading one. We conclude that isochores have been established before the radiation of the eutherian mammals and have been subject to the process of homogenization since then.

Abstract

The Drosophila melanogaster genome contains approximately 100 distinct families of transposable elements (TEs). In the euchromatic part of the genome, each family is present in a small number of copies (5-150 copies), with individual copies of TEs often present at very low frequencies in populations. This pattern is likely to reflect a balance between the inflow of TEs by transposition and the removal of TEs by natural selection. The nature of natural selection acting against TEs remains controversial. We provide evidence that selection against chromosome abnormalities caused by ectopic recombination limits the spread of some TEs. We also demonstrate for the first time that some TE families in the Drosophila euchromatin appear to be only marginally affected by purifying selection and contain many copies at high population frequencies. We argue that TEs in these families attain high population frequencies and even reach fixation as a result of low family-wide transposition rates leading to low TE copy numbers and consequently reduced strength of selection acting on individual TE copies. Fixation of TEs in these families should provide an upward pressure on the size of intergenic sequences counterbalancing rapid DNA loss through small deletions. Copy-number-dependent selection on TE families caused by ectopic recombination may also promote diversity among TEs in the Drosophila genome.

Abstract

The paper describes a mutational equilibrium model of genome size evolution. This model is different from both adaptive and junk DNA models of genome size evolution in that it does not assume that genome size is maintained either by positive or stabilizing selection for the optimum genome size (as in adaptive theories) or by purifying selection against too much junk DNA (as in junk DNA theories). Instead the genome size is suggested to evolve until the loss of DNA through more frequent small deletions is equal to the rate of DNA gain through more frequent long insertions. The empirical basis for this theory is the finding of a strong correlation and of a clear power-function relationship between the rate of mutational DNA loss (per bp) through small deletions and genome size in animals. Genome size scales as a negative 1.3 power function of the deletion rate per nucleotide. Such a relationship is not predicted by either adaptive or junk DNA theories. However, if genome size is maintained at equilibrium by the balance of mutational forces, this empirical relationship can be readily accommodated. Within this framework, this finding would imply that the rate of DNA gain through large insertions scales as a quarter-power function of genome size. On this view, as genome size grows, the rate of growth through large insertions increases as a quarter-power function of genome size and the rate of DNA loss through small deletions increases linearly, until eventually, at the stable equilibrium genome size value, rates of growth and loss equal each other. The current data also suggest that the long-term variation in genome size in animals is brought about to a significant extent by changes in the intrinsic rates of DNA loss through small deletions. Both the origin of mutational biases and the adaptive consequences of such a mode of evolution of genome size are discussed.
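The balance described in this abstract can be checked with a short numerical sketch. It assumes, purely for illustration, that DNA loss is `d * G` (linear in genome size `G`) and DNA gain is `c * G**0.25` (quarter-power in `G`); the constants `c` and `d` and the function names are hypothetical and do not come from the paper. Setting loss equal to gain gives `G* = (c / d) ** (4 / 3)`, so equilibrium genome size scales as roughly the negative 1.3 power of the deletion rate.

```python
import math


def equilibrium_genome_size(deletion_rate: float, insertion_const: float = 1.0) -> float:
    """Solve d * G = c * G**0.25 for G, giving G* = (c / d) ** (4 / 3).

    deletion_rate (d) and insertion_const (c) are illustrative parameters,
    not values estimated in the paper.
    """
    return (insertion_const / deletion_rate) ** (4.0 / 3.0)


def scaling_exponent(d1: float, d2: float, insertion_const: float = 1.0) -> float:
    """Log-log slope of equilibrium genome size against deletion rate."""
    g1 = equilibrium_genome_size(d1, insertion_const)
    g2 = equilibrium_genome_size(d2, insertion_const)
    return math.log(g2 / g1) / math.log(d2 / d1)


# A tenfold change in deletion rate recovers the -4/3 exponent.
print(round(scaling_exponent(1e-3, 1e-2), 2))  # -1.33
```

Under these assumed functional forms the exponent is exactly -4/3, which is consistent with the empirical value of about -1.3 quoted in the abstract.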

Evolution of genome size: new approaches to an old problem. TRENDS IN GENETICS. Petrov, D. A. 2001; 17 (1): 23-28

Abstract

Eukaryotic genomes come in a wide variety of sizes. Haploid DNA contents (C values) range > 80,000-fold without an apparent correlation with either the complexity of the organism or the number of genes. This puzzling observation, the C-value paradox, has remained a mystery for almost half a century, despite much progress in the elucidation of the structure and function of genomes. Here I argue that new approaches focussing on the genetic mechanisms that generate genome-size differences could shed much light on the evolution of genome size.

Abstract

Eukaryotic genome sizes range over five orders of magnitude. This variation cannot be explained by differences in organismic complexity (the C value paradox). To test the hypothesis that some variation in genome size can be attributed to differences in the patterns of insertion and deletion (indel) mutations among organisms, this study examines the indel spectrum in Laupala crickets, which have a genome size 11 times larger than that of Drosophila. Consistent with the hypothesis, DNA loss is more than 40 times slower in Laupala than in Drosophila.

Patterns of nucleotide substitution in Drosophila and mammalian genomes. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Petrov, D. A., Hartl, D. L. 1999; 96 (4): 1475-1479

Abstract

To estimate patterns of molecular evolution of unconstrained DNA sequences, we used maximum parsimony to separate phylogenetic trees of a non-long terminal repeat retrotransposable element into either internal branches, representing mainly the constrained evolution of active lineages, or into terminal branches, representing mainly nonfunctional "dead-on-arrival" copies that are unconstrained by selection and evolve as pseudogenes. The pattern of nucleotide substitutions in unconstrained sequences is expected to be congruent with the pattern of point mutation. We examined the retrotransposon Helena in the Drosophila virilis species group (subgenus Drosophila) and the Drosophila melanogaster species subgroup (subgenus Sophophora). The patterns of point mutation are indistinguishable, suggesting considerable stability over evolutionary time (40-60 million years). The relative frequencies of different point mutations are unequal, but the "transition bias" results largely from an approximately 2-fold excess of G·C to A·T substitutions. Spontaneous mutation is biased toward A·T base pairs, with an expected mutational equilibrium of approximately 65% A + T (quite similar to that of long introns). These data also enable the first detailed comparison of patterns of point mutations in Drosophila and mammals. Although the patterns are different, all of the statistical significance comes from a much greater rate of G·C to A·T substitution in mammals, probably because of methylated cytosine "hotspots." When the G·C to A·T substitutions are discounted, the remaining differences are considerably reduced and not statistically significant.
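The link between a mutational bias and an expected equilibrium base composition can be illustrated with a simple two-state model. This is a sketch under stated assumptions, not the authors' calculation: it assumes one per-site rate for G·C to A·T changes and one for A·T to G·C changes, and both function names are hypothetical.

```python
def equilibrium_at_fraction(gc_to_at: float, at_to_gc: float) -> float:
    """Equilibrium A+T fraction in a two-state model.

    At equilibrium the flux out of G/C sites equals the flux out of A/T sites:
    gc_to_at * (1 - f) = at_to_gc * f, so f = gc_to_at / (gc_to_at + at_to_gc).
    """
    return gc_to_at / (gc_to_at + at_to_gc)


def rate_ratio_for_at_fraction(at_fraction: float) -> float:
    """Ratio gc_to_at / at_to_gc implied by a given equilibrium A+T fraction."""
    return at_fraction / (1.0 - at_fraction)


# An equilibrium of about 65% A+T corresponds to a roughly 1.9-fold bias.
print(round(rate_ratio_for_at_fraction(0.65), 2))  # 1.86
```

In this simplified model, the 65% A + T equilibrium reported in the abstract corresponds to G·C to A·T changes occurring a little under twice as often per site as the reverse.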

Abstract

Pseudogenes are common in mammals but virtually absent in Drosophila. All putative Drosophila pseudogenes show patterns of molecular evolution that are inconsistent with the lack of functional constraints. The absence of bona fide pseudogenes is not only puzzling, it also hampers attempts to estimate rates and patterns of neutral DNA change. The estimation problem is especially acute in the case of deletions and insertions, which are likely to have large effects when they occur in functional genes and are therefore subject to strong purifying selection. We propose a solution to this problem by taking advantage of the propensity of retrotransposable elements without long terminal repeats (non-LTR) to create non-functional, 'dead-on-arrival' copies of themselves as a common by-product of their transpositional cycle. Phylogenetic analysis of a non-LTR element, Helena, demonstrates that copies lose DNA at an unusually high rate, suggesting that lack of pseudogenes in Drosophila is the product of rampant deletion of DNA in unconstrained regions. This finding has important implications for the study of genome evolution in general and the 'C-value paradox' in particular.

Diverse transposable elements are mobilized in hybrid dysgenesis in Drosophila virilis. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Petrov, D. A., Schutzman, J. L., Hartl, D. L., Lozovskaya, E. R. 1995; 92 (17): 8050-8054

Abstract

We describe a system of hybrid dysgenesis in Drosophila virilis in which at least four unrelated transposable elements are all mobilized following a dysgenic cross. The data are largely consistent with the superposition of at least three different systems of hybrid dysgenesis, each repressing a different transposable element, which break down following the hybrid cross, possibly because they share a common pathway in the host. The data are also consistent with a mechanism in which mobilization of a single element triggers that of others, perhaps through chromosome breakage. The mobilization of multiple, unrelated elements in hybrid dysgenesis is reminiscent of McClintock's evidence [McClintock, B. (1955) Brookhaven Symp. Biol. 8, 58-74] for simultaneous mobilization of different transposable elements in maize.

Comparative population genomics: power and principles for the inference of functionality. TRENDS IN GENETICS. Lawrie, D. S., Petrov, D. A. 2014; 30 (4): 133-139

Abstract

The availability of sequenced genomes from multiple related organisms allows the detection and localization of functional genomic elements based on the idea that such elements evolve more slowly than neutral sequences. Although such comparative genomics methods have proven useful in discovering functional elements and ascertaining levels of functional constraint in the genome as a whole, here we outline limitations intrinsic to this approach that cannot be overcome by sequencing more species. We argue that it is essential to supplement comparative genomics with ultra-deep sampling of populations from closely related species to enable substantially more powerful genomic scans for functional elements. The convergence of sequencing technology and population genetics theory has made such projects feasible and has exciting implications for functional genomics.

Abstract

Studies of the population dynamics of transposable elements (TEs) in Drosophila melanogaster indicate that consistent forces are affecting TEs independently of their modes of transposition and regulation. New sequencing technologies enable biologists to sample genomes at an unprecedented scale in order to quantify genome-wide polymorphism for annotated and novel TE insertions. In this review, we first present new insights gleaned from high-throughput data for population genomics studies of D. melanogaster. We then consider the latest population genomics models for TE evolution and present examples of functional evidence revealed by genome-wide studies of TE population dynamics in D. melanogaster. Although most of the TE insertions are deleterious or neutral, some TE insertions increase the fitness of the individual that carries them and play a role in genome adaptation.

Abstract

High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5-18.5 Kbp with an extremely low error rate (approximately 0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long-reads, offer a powerful approach to improve de novo assemblies of whole genomes.
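For readers unfamiliar with the N50 statistic quoted in this abstract, the following minimal sketch shows the standard definition: the contig length at which contigs of that length or longer cover at least half of the assembly. The function name and the toy contig lengths are illustrative and are not data from the study.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Stop at the first contig where the cumulative length reaches half the total.
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Toy example: total length 54, and contigs 10 + 9 + 8 = 27 reach half of it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8

# The recovered-TE percentage in the abstract follows from its own counts.
print(round(100 * 4229 / 5434, 1))  # 77.8
```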

Abstract

The analysis of molecular data from natural populations has allowed researchers to answer diverse ecological questions that were previously intractable. In particular, ecologists are often interested in the demographic history of populations, information that is rarely available from historical records. Methods have been developed to infer demographic parameters from genomic data, but it is not well understood how inferred parameters compare to true population history or depend on aspects of experimental design. Here, we present and evaluate a method of SNP discovery using RNA sequencing and demographic inference using the program δaδi, which uses a diffusion approximation to the allele frequency spectrum to fit demographic models. We test these methods in a population of the checkerspot butterfly Euphydryas gillettii. This population was intentionally introduced to Gothic, Colorado in 1977 and has experienced extreme fluctuations including bottlenecks of fewer than 25 adults, as documented by nearly annual field surveys. Using RNA sequencing of eight individuals from Colorado and eight individuals from a native population in Wyoming, we generate the first genomic resources for this system. While demographic inference is commonly used to examine ancient demography, our study demonstrates that our inexpensive, all-in-one approach to marker discovery and genotyping provides sufficient data to accurately infer the timing of a recent bottleneck. This demographic scenario is relevant for many species of conservation concern, few of which have sequenced genomes. Our results are remarkably insensitive to sample size or number of genomic markers, which has important implications for applying this method to other nonmodel systems.

On the Limitations of Using Ribosomal Genes as References for the Study of Codon Usage: A Rebuttal. PLOS ONE. Hershberg, R., Petrov, D. A. 2012; 7 (12)

Abstract

In a recent paper published in PLOS ONE, Wang et al. challenge our finding that the identity of optimal codons in different genomes follows a set of clear rules. Here we provide a rebuttal of their paper and demonstrate that the results of our original PLOS Genetics paper stand. This provides us with an opportunity to bring up an aspect of how codon usage has been studied that should be of general interest. The Wang et al. study, as well as many other studies, used ribosomal genes as a reference set for the study of patterns of codon usage. We discuss here the assumptions that are made in order to justify using ribosomal genes to study codon bias, suggest that this practice can at times be problematic, and discuss its limitations.

Abstract

High-throughput pooled resequencing offers significant potential for whole genome population sequencing. However, its main drawback is the loss of haplotype information. In order to regain some of this information, we present LDx, a computational tool for estimating linkage disequilibrium (LD) from pooled resequencing data. LDx uses an approximate maximum likelihood approach to estimate LD (r²) between pairs of SNPs that can be observed within and among single reads. LDx also reports r² estimates derived solely from observed genotype counts. We demonstrate that the LDx estimates are highly correlated with r² estimated from individually resequenced strains. We discuss the performance of LDx using more stringent quality conditions and infer via simulation the degree to which performance can improve based on read depth. Finally, we demonstrate two possible uses of LDx with real and simulated pooled resequencing data. First, we use LDx to infer genomewide patterns of decay of LD with physical distance in D. melanogaster population resequencing data. Second, we demonstrate that r² estimates from LDx are capable of distinguishing alternative demographic models representing plausible demographic histories of D. melanogaster.
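As a point of reference for the count-based estimate mentioned in this abstract, the sketch below computes the textbook r² statistic from the four two-SNP haplotype counts that could be tallied from reads spanning both sites. It is not the LDx implementation and does not reproduce its approximate maximum likelihood step; the function name and argument names are hypothetical.

```python
def r_squared(n_ab_major: int, n_a_only: int, n_b_only: int, n_neither: int) -> float:
    """Textbook r^2 from two-SNP haplotype counts.

    n_ab_major: reads carrying allele A at SNP 1 and allele B at SNP 2
    n_a_only:   reads carrying A at SNP 1 and the other allele at SNP 2
    n_b_only:   reads carrying the other allele at SNP 1 and B at SNP 2
    n_neither:  reads carrying neither A nor B
    """
    total = n_ab_major + n_a_only + n_b_only + n_neither
    if total == 0:
        raise ValueError("no reads span both SNPs")
    p_a = (n_ab_major + n_a_only) / total
    p_b = (n_ab_major + n_b_only) / total
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("r^2 is undefined when either site is monomorphic")
    d = n_ab_major / total - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


print(r_squared(50, 0, 0, 50))    # complete association: 1.0
print(r_squared(25, 25, 25, 25))  # independence: 0.0
```

Only SNP pairs close enough to be covered by the same read or read pair contribute counts, which is why this kind of estimate is restricted to short physical distances.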

Abstract

General parameters of selection, such as the frequency and strength of positive selection in natural populations or the role of introgression, are still insufficiently understood. The house mouse (Mus musculus) is a particularly well-suited model system to approach such questions, since it has a defined history of splits into subspecies and populations and since extensive genome information is available. We have used high-density single-nucleotide polymorphism (SNP) typing arrays to assess genomic patterns of positive selection and introgression of alleles in two natural populations of each of the subspecies M. m. domesticus and M. m. musculus. Applying different statistical procedures, we find a large number of regions subject to apparent selective sweeps, indicating frequent positive selection on rare alleles or novel mutations. Genes in the regions include well-studied imprinted loci (e.g. Plagl1/Zac1), homologues of human genes involved in adaptations (e.g. alpha-amylase genes) or in genetic diseases (e.g. Huntingtin and Parkin). Haplotype matching between the two subspecies reveals a large number of haplotypes that show patterns of introgression from specific populations of the respective other subspecies, with at least 10% of the genome being affected by partial or full introgression. Using neutral simulations for comparison, we find that the size and the fraction of introgressed haplotypes are not compatible with a pure migration or incomplete lineage sorting model. Hence, it appears that introgressed haplotypes can rise in frequency due to positive selection and thus can contribute to the adaptive genomic landscape of natural populations. Our data support the notion that natural genomes are subject to complex adaptive processes, including the introgression of haplotypes from other differentiated populations or species at a larger scale than previously assumed for animals. This implies that some of the admixture found in inbred strains of mice may also have a natural origin.

Abstract

The sequencing of pooled non-barcoded individuals is an inexpensive and efficient means of assessing genome-wide population allele frequencies, yet its accuracy has not been thoroughly tested. We assessed the accuracy of this approach on whole, complex eukaryotic genomes by resequencing pools of largely isogenic, individually sequenced Drosophila melanogaster strains. We called SNPs in the pooled data and estimated false positive and false negative rates using the SNPs called in individual strains as a reference. We also estimated allele frequency of the SNPs using "pooled" data and compared them with "true" frequencies taken from the estimates in the individual strains. We demonstrate that pooled sequencing provides a faithful estimate of population allele frequency with the error well approximated by binomial sampling, and is a reliable means of novel SNP discovery with low false positive rates. However, a sufficient number of strains should be used in the pooling because variation in the amount of DNA derived from individual strains is a substantial source of noise when the number of pooled strains is low. Our results and analysis confirm that pooled sequencing is a very powerful and cost-effective technique for assessing patterns of sequence variation in populations on genome-wide scales, and is applicable to any dataset where sequencing individuals or individual cells is impossible, difficult, time consuming, or expensive.
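The statement that the error is well approximated by binomial sampling can be made concrete with a small simulation. This sketch assumes each read independently carries the alternate allele with probability equal to the true frequency, and it deliberately ignores the unequal DNA contribution from individual strains that the abstract identifies as an extra noise source. All names, the chosen frequency, and the read depth are illustrative.

```python
import math
import random
from typing import List


def binomial_se(freq: float, depth: int) -> float:
    """Standard error of an allele-frequency estimate from `depth` reads."""
    return math.sqrt(freq * (1.0 - freq) / depth)


def simulate_pooled_estimates(freq: float, depth: int, replicates: int, seed: int = 0) -> List[float]:
    """Simulate allele-frequency estimates under pure binomial read sampling."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replicates):
        alt_reads = sum(1 for _ in range(depth) if rng.random() < freq)
        estimates.append(alt_reads / depth)
    return estimates


def sample_sd(values: List[float]) -> float:
    """Sample standard deviation (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


estimates = simulate_pooled_estimates(freq=0.3, depth=50, replicates=2000)
print(round(binomial_se(0.3, 50), 4))  # 0.0648
print(sample_sd(estimates))            # close to the binomial value above
```

Comparing the observed spread of pooled estimates with `binomial_se` at the same depth is one simple way to see whether additional noise, such as uneven pooling, is present.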

Abstract

Objective: To characterize chromosomal error types and parental origin of aneuploidy in cleavage-stage embryos using an informatics-based technique that enables the elucidation of aneuploidy-causing mechanisms. Design: Analysis of blastomeres biopsied from cleavage-stage embryos for preimplantation genetic screening during IVF. Setting: Laboratory. Patient(s): Couples undergoing IVF treatment. Intervention(s): Two hundred seventy-four blastomeres were subjected to array-based genotyping and informatics-based techniques to characterize chromosomal error types and parental origin of aneuploidy across all 24 chromosomes. Main Outcome Measure(s): Chromosomal error types (monosomy vs. trisomy; mitotic vs. meiotic) and parental origin (maternal vs. paternal). Result(s): The rate of maternal meiotic trisomy rose significantly with age, whereas other types of trisomy showed no correlation with age. Trisomies were mostly maternal in origin, whereas paternal and maternal monosomies were roughly equal in frequency. No examples of paternal meiotic trisomy were observed. Segmental error rates were found to be independent of maternal age. Conclusion(s): All types of aneuploidy that rose with increasing maternal age can be attributed to disjunction errors during meiosis of the oocyte. Chromosome gains were predominantly maternal in origin and occurred during meiosis, whereas chromosome losses were not biased in terms of parental origin of the chromosome. The ability to determine the parental origin for each chromosome, as well as being able to detect whether multiple homologs from a single parent were present, allowed greater insights into the origin of aneuploidy.

Abstract

Recent research is starting to shed light on the factors that influence the population and evolutionary dynamics of transposable elements (TEs) and TE life cycles. Genomes differ sharply in the number of TE copies, in the level of TE activity, in the diversity of TE families and types, and in the proportion of old and young TEs. In this chapter, we focus on two well-studied genomes with strikingly different architectures, humans and Drosophila, which represent two extremes in terms of TE diversity and population dynamics. We argue that some of the answers might lie in (1) the larger population size and consequently more effective selection against new TE insertions due to ectopic recombination in flies compared to humans; and (2) in the faster rate of DNA loss in flies compared to humans leading to much faster removal of fixed TE copies from the fly genome.

Abstract

Comparative genomics has become widely accepted as the major framework for the ascertainment of functionally important regions in genomes. The underlying paradigm of this approach is that most of the functional regions are assumed to be under selective constraint, which in turn reduces the rate of evolution relative to neutrality. This assumption allows detection of functional regions through sequence conservation. However, constraint does not always lead to sequence conservation. When purifying selection is weak and mutation is biased, constrained regions can even evolve faster than neutral sequences and thus can appear to be under positive selection. Moreover, conservation estimates depend also on the orientation of selection relative to mutational biases and can vary over time. In light of recent data on the ubiquity of mutational biases and weak selective forces, these effects should reduce the power of conservation analyses to define functional regions using comparative genomics data. We argue that the estimation of true mutational biases and the use of explicit evolutionary models are essential to improve methods inferring the action of natural selection and functionality in genome sequences.

Abstract

Recombination rate is a key evolutionary parameter that determines the degree to which sites are linked. Estimating recombination rates is thus of crucial importance for population genetic and molecular evolutionary studies. We present here a user-friendly web-based tool that can be used to retrieve recombination rate estimates for single and/or multiple loci in the Drosophila melanogaster genome given a user-defined choice of the genome release. We used the Marey map approach that is based on comparing the genetic and physical maps to infer recombination rates along the major chromosomes of the D. melanogaster genome. Our implementation of this approach is based on building third-order polynomials which are used to interpolate recombination rates at all points on the chromosome except for telomeric and centromeric regions in which such polynomials are known to provide particularly poor estimation.
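
The Marey map idea can be sketched in a few lines: fit a third-order polynomial to genetic position (cM) as a function of physical position (Mb), then take its derivative as the local recombination rate (cM/Mb). This is only an illustration under stated assumptions, not the tool's actual implementation; the function name `marey_recombination_rate`, the marker coordinates, and the clipping of negative slopes to zero are all hypothetical choices made for this example.

```python
import numpy as np

def marey_recombination_rate(physical_mb, genetic_cm, query_mb):
    """Estimate recombination rate (cM/Mb) at the query positions.

    Fits a third-order polynomial to the Marey map (genetic position
    versus physical position) and evaluates its first derivative.
    """
    coeffs = np.polyfit(physical_mb, genetic_cm, deg=3)  # highest power first
    derivative = np.polyder(coeffs)                      # d(cM)/d(Mb)
    rates = np.polyval(derivative, query_mb)
    # Assumption: a negative slope is treated as zero recombination.
    return np.clip(rates, 0.0, None)

# Hypothetical markers on one chromosome arm (not real map data).
physical = np.array([1.0, 3.0, 5.0, 8.0, 12.0, 16.0, 20.0])
genetic = 0.5 * physical + 0.1 * physical**2 - 0.003 * physical**3

print(marey_recombination_rate(physical, genetic, np.array([5.0, 10.0, 15.0])))
```

A polynomial of this kind is smooth by construction, which is why, as the abstract notes, estimates near telomeres and centromeres should not be trusted.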

Abstract

Genes in the same organism vary in the time since their evolutionary origin. Without horizontal gene transfer, young genes are necessarily restricted to a few closely related species, whereas old genes can be broadly distributed across the phylogeny. It has been shown that young genes evolve faster than old genes; however, the evolutionary forces responsible for this pattern remain obscure. Here, we classify human-chimp protein-coding genes into different age classes, according to the breadth of their phylogenetic distribution. We estimate the strength of purifying selection and the rate of adaptive selection for genes in different age classes. We find that older genes carry fewer and less frequent nonsynonymous single-nucleotide polymorphisms than younger genes suggesting that older genes experience a stronger purifying selection at the protein-coding level. We infer the distribution of fitness effects of new deleterious mutations and find that older genes have proportionally more slightly deleterious mutations and fewer nearly neutral mutations than younger genes. To investigate the role of adaptive selection of genes in different age classes, we determine the selection coefficient (gamma = 2N(e)s) of genes using the MKPRF approach and estimate the ratio of the rate of adaptive nonsynonymous substitution to synonymous substitution (omega(A)) using the DoFE method. Although the proportion of positively selected genes (gamma > 0) is significantly higher in younger genes, we find no correlation between omega(A) and gene age. Collectively, these results provide strong evidence that younger genes are subject to weaker purifying selection and more tenuous evidence that they also undergo adaptive evolution more frequently.

Abstract

Genes that underlie human disease are important subjects of systems biology research. In the present study, we demonstrate that Mendelian and complex disease genes have distinct and consistent protein-protein interaction (PPI) properties. We show that five different network properties can be reduced to two independent metrics when applied to the human PPI network. These two metrics largely coincide with the degree (number of connections) and the clustering coefficient (the number of connections among the neighbors of a particular protein). We demonstrate that disease genes have simultaneously unusually high degree and unusually low clustering coefficient. Such genes can be described as brokers in that they connect many proteins that would not be connected otherwise. We show that these results are robust to the effect of gene age and inspection bias variation. Notably, genes identified in genome-wide association study (GWAS) have network patterns that are almost indistinguishable from the network patterns of nondisease genes and significantly different from the network patterns of complex disease genes identified through non-GWAS means. This suggests either that GWAS focused on a distinct set of diseases associated with an unusual set of genes or that mapping of GWAS-identified single nucleotide polymorphisms onto the causally affected neighboring genes is error prone.
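
The two metrics named above can be illustrated on a toy network. The sketch below computes each node's degree and the standard normalized local clustering coefficient (the fraction of possible links among a node's neighbors that actually exist); that normalization is an assumption, since the abstract describes the coefficient only informally. The function name and the edge list are hypothetical and are not taken from the study.

```python
from itertools import combinations

def degree_and_clustering(edges):
    """Return {node: (degree, clustering_coefficient)} for an undirected graph."""
    adjacency = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    metrics = {}
    for node, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            coefficient = 0.0  # undefined for fewer than two neighbors; use 0
        else:
            # Count links that exist between pairs of this node's neighbors.
            links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
            coefficient = 2.0 * links / (k * (k - 1))
        metrics[node] = (k, coefficient)
    return metrics

# Hypothetical network: "hub" links four otherwise unconnected proteins,
# while x, y, z form a fully connected triangle.
toy_edges = [("hub", "p1"), ("hub", "p2"), ("hub", "p3"), ("hub", "p4"),
             ("x", "y"), ("y", "z"), ("x", "z")]
for protein, (k, cc) in sorted(degree_and_clustering(toy_edges).items()):
    print(protein, k, cc)
```

In this toy case the "hub" node has high degree and zero clustering, which is the broker-like profile the abstract attributes to disease genes.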

Abstract

Transposable elements (TEs) are short DNA sequences with the capacity to move between different sites in the genome. This ability provides them with the capacity to mutate the genome in many different ways, from subtle regulatory mutations to gross genomic rearrangements. The potential adaptive significance of TEs was recognized by those involved in their initial discovery although it was hotly debated afterwards. For more than two decades, TEs were considered to be intragenomic parasites leading to almost exclusively detrimental effects to the host genome. The sequencing of the Drosophila melanogaster genome provided an unprecedented opportunity to study TEs and led to the identification of the first TE-induced adaptations in this species. These studies were followed by a systematic genome-wide search for adaptive insertions that made it possible, for the first time, to infer that TEs contribute substantially to adaptive evolution. This study also revealed that there are at least twice as many TE-induced adaptations that remain to be identified. To gain a better understanding of the adaptive role of TEs in the genome we clearly need to (i) identify as many adaptive TEs as possible in a range of Drosophila species as well as (ii) carry out in-depth investigations of the effects of adaptive TEs on as many phenotypes as possible.

Abstract

A recent genomewide screen identified 13 transposable elements that are likely to have been adaptive during or after the spread of Drosophila melanogaster out of Africa. One of these insertions, Bari-Juvenile hormone epoxy hydrolase (Bari-Jheh), was associated with the selective sweep of its flanking neutral variation and with reduction of expression of one of its neighboring genes: Jheh3. Here, we provide further evidence that Bari-Jheh insertion is adaptive. We delimit the extent of the selective sweep and show that Bari-Jheh is the only mutation linked to the sweep. Bari-Jheh also lowers the expression of its other flanking gene, Jheh2. Subtle consequences of Bari-Jheh insertion on life-history traits are consistent with the effects of reduced expression of the Jheh genes. Finally, we analyze molecular evolution of Jheh genes in both the long- and the short-term and conclude that Bari-Jheh appears to be a very rare adaptive event in the history of these genes. We discuss the implications of these findings for the detection and understanding of adaptation.

Abstract

The basal transcription machinery is responsible for initiating transcription at core promoters. During metazoan evolution, its components have expanded in number and diversified to increase the complexity of transcriptional regulation in tissues and developmental stages. To explore the evolutionary events and forces underlying this diversification, we analyzed the evolution of the Drosophila testis TAFs (TBP-associated factors), paralogs of TAFs from the basal transcription factor TFIID that are essential for normal transcription during spermatogenesis of a large set of specific genes involved in terminal differentiation of male gametes. There are five testis-specific TAFs in Drosophila, each expressed only in primary spermatocytes and each a paralog of a different generally expressed TFIID subunit. An examination of the presence of paralogs across taxa as well as molecular clock dating indicates that all five testis TAFs likely arose within a span of approximately 38 My 63-250 Ma by independent duplication events from their generally expressed paralogs. Furthermore, the evolution of the testis TAFs has been rapid, with apparent further accelerations in multiple Drosophila lineages. Analysis of between-species divergence and intraspecies polymorphism indicates that the major forces of evolution on these genes have been reduced purifying selection, pervasive positive selection, and coevolution. Other genes that exhibit similar patterns of evolution in the Drosophila lineages are also characterized by enriched expression in the testis, suggesting that the pervasive positive selection acting on the tTAFs is likely to be related to their expression in the testis.

Abstract

Transposable elements (TEs) constitute a substantial fraction of the genomes of many species, and it is thus important to understand their population dynamics. The strength of natural selection against TEs is a key parameter in understanding these dynamics. In principle, the strength of selection can be inferred from the frequencies of a sample of TEs. However, complicated demographic histories, such as found in Drosophila melanogaster, could lead to a substantial distortion of the TE frequency distribution compared with that expected for a panmictic, constant-sized population. The current methodology for the estimation of selection intensity acting against TEs does not take into account demographic history and might generate erroneous estimates especially for TE families under weak selection. Here, we develop a flexible maximum likelihood methodology that explicitly accounts both for demographic history and for the ascertainment biases of identifying TEs. We apply this method to the newly generated frequency data of the BS family of non-long terminal repeat retrotransposons in D. melanogaster in concert with two recent models of the demographic history of the species to infer the intensity of selection against this family. We find the estimate to differ substantially compared with a prior estimate that was made assuming a model of constant population size. Further, we find there to be relatively little information about selection intensity present in the derived non-African frequency data and that the ancestral African subpopulation is much more informative in this respect. These findings highlight the importance of accounting for demographic history and bear on study design for the inference of selection coefficients generally.

Abstract

A number of studies have shown that recently created genes differ from the genes created in deep evolutionary past in many aspects. Here, we determined the age of emergence and propensity for gene loss (PGL) of all human protein-coding genes and compared disease genes with non-disease genes in terms of their evolutionary rate, strength of purifying selection, mRNA expression, and genetic redundancy. Older and less loss-prone non-disease genes have been evolving 1.5- to 3-fold slower between humans and chimps than young non-disease genes, whereas Mendelian disease genes have been evolving very slowly regardless of their ages and PGL. Complex disease genes showed an intermediate pattern. Disease genes also have higher mRNA expression heterogeneity across multiple tissues than non-disease genes regardless of age and PGL. Young and middle-aged disease genes have fewer similar paralogs than non-disease genes of the same age. We reasoned that genes were more likely to be involved in human disease if they were under a strong functional constraint, expressed heterogeneously across tissues, and lacked genetic redundancy. Young human genes that have been evolving under strong constraint between humans and chimps might also be enriched for genes that encode important primate or even human-specific functions.

Abstract

In a wide variety of organisms, synonymous codons are used with different frequencies, a phenomenon known as codon bias. Population genetic studies have shown that synonymous sites are under weak selection and that codon bias is maintained by a balance between selection, mutation, and genetic drift. It appears that the major cause for selection on codon bias is that certain preferred codons are translated more accurately and/or efficiently. However, additional and sometimes maybe even contradictory selective forces appear to affect codon usage as well. In this review, we discuss the current understanding of the ways in which natural selection participates in the creation and maintenance of codon bias. We also raise several open questions: (i) Is natural selection weak independently of the level of codon bias? It is possible that selection for preferred codons is weak only when codon bias approaches equilibrium and may be quite strong on genes with codon bias levels that are much lower and/or above equilibrium. (ii) What determines the identity of the major codons? (iii) How do shifts in codon bias occur? (iv) What is the exact nature of selection on codon bias? We discuss these questions in depth and offer some ideas on how they can be addressed using a combination of computational and experimental analyses.

Abstract

Sex chromosomes have arisen from autosomes many times over the course of evolution. This process generates chromosomal heteromorphy between the sexes, which has important implications for the evolution of coding and noncoding sequences on the sex chromosomes versus the autosomes. The formation of sex chromosomes from autosomes involves a reduction in gene dosage, which can modify properties of selection pressure on sex-linked genes. This transition also generates differences in the effective population size and dominance characteristics of novel mutations on the sex chromosome versus the autosomes. All of these changes may affect both patterns of in situ gene evolution and the rates of interchromosomal gene duplication and movement. Here we present a synopsis of the current understanding of the origin of sex chromosomes, theoretical context for differences in rates and patterns of molecular evolution on the X chromosome versus the autosomes, as well as a summary of empirical molecular evolutionary data from Drosophila and mammalian genomes.

Abstract

Several lines of evidence suggest that codon usage in the Drosophila saltans and D. willistoni lineages has shifted towards a less frequent use of GC-ending codons. Introns in these lineages show a parallel shift toward a lower GC content. These patterns have been alternatively ascribed to either a shift in mutational patterns or changes in the definition of preferred and unpreferred codons in these lineages. To gain additional insight into this question, we quantified background substitutional patterns in the saltans/willistoni group using inactive copies of a novel, Q-like retrotransposable element. We demonstrate that the pattern of background substitutions in the saltans/willistoni lineage has shifted to a significant degree, primarily due to changes in mutational biases. These differences predict a lower equilibrium GC content in the genomes of the saltans/willistoni species compared with that in the D. melanogaster species group. The magnitude of the difference can readily account for changes in intronic GC content, but it appears insufficient to explain changes in codon usage within the saltans/willistoni lineage. We suggest that the observed changes in codon usage in the saltans/willistoni clade reflect either lineage-specific changes in the definitions of preferred and unpreferred codons, or a weaker selective pressure on codon bias in this lineage.

Fitness cost of LINE-1 (L1) activity in humans. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Boissinot, S., Davis, J., Entezam, A., Petrov, D., Furano, A. V. 2006; 103 (25): 9590-9594

Abstract

The self-replicating LINE-1 (L1) retrotransposon family is the dominant retrotransposon family in mammals and has generated 30-40% of their genomes. Active L1 families are present in modern mammals but the important question of whether these currently active families affect the genetic fitness of their hosts has not been addressed. This issue is of particular relevance to humans as Homo sapiens contains the active L1 Ta1 subfamily of the human specific Ta (L1Pa1) L1 family. Although DNA insertions generated by the Ta1 subfamily can cause genetic defects in current humans, these are relatively rare, and it is not known whether Ta1-generated inserts or any other property of Ta1 elements have been sufficiently deleterious to reduce the fitness of humans. Here we show that full-length (FL) Ta1 elements, but not the truncated Ta1 elements or SINE (Alu) insertions generated by Ta1 activity, were subject to negative selection. Thus, one or more properties unique to FL L1 elements constitute a genetic burden for modern humans. We also found that the FL Ta1 elements became more deleterious as the expansion of Ta1 has proceeded. Because this expansion is ongoing, the Ta1 subfamily almost certainly continues to decrease the fitness of modern humans.

Abstract

Analysis of the genome-wide patterns of single-nucleotide substitution reveals that the human GC content structure is out of equilibrium. The substitutions are decreasing the overall GC content (GC), at the same time making its range narrower. Investigation of single-nucleotide polymorphisms (SNPs) revealed that presently the decrease in GC content is due to a uniform mutational preference for A:T pairs, while its projected range is due to a variability in the fixation preference for G:C pairs. However, it is important to determine whether lessons learned about evolutionary processes operating at the present time (that is reflected in the SNP data) can be extended back into the evolutionary past. We describe here a new approach to this problem that utilizes the juxtaposition of forward and reverse substitution rates to determine the relative importance of variability in mutation rates and fixation probabilities in shaping long-term substitutional patterns. We use this approach to demonstrate that the forces shaping GC content structure over the recent past (since the appearance of the SNPs) extend all the way back to the mammalian radiation approximately 90 million years ago. In addition, we find a small but significant effect that has not been detected in the SNP data: relatively high rates of C:G→A:T germline mutation in low-GC regions of the genome.
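
To make "out of equilibrium" concrete, the sketch below uses a deliberately simplified two-state model in which every site is either A:T or G:C. If u is the A:T→G:C substitution rate per A:T site and v is the G:C→A:T rate per G:C site, GC content stops changing when u(1 − GC) = v·GC, that is, at GC* = u / (u + v). This is a textbook simplification offered for illustration, not the approach developed in the paper; the function names and the rates used are hypothetical.

```python
def equilibrium_gc(rate_at_to_gc, rate_gc_to_at):
    """Equilibrium GC fraction under a two-state (A:T <-> G:C) model."""
    total = rate_at_to_gc + rate_gc_to_at
    if total <= 0:
        raise ValueError("at least one substitution rate must be positive")
    return rate_at_to_gc / total

def is_gc_declining(current_gc, rate_at_to_gc, rate_gc_to_at):
    """True if the current GC fraction lies above its projected equilibrium."""
    return current_gc > equilibrium_gc(rate_at_to_gc, rate_gc_to_at)

# Hypothetical relative rates, chosen only to illustrate the calculation.
u, v = 1.0, 2.0
print(equilibrium_gc(u, v))
print(is_gc_declining(0.41, u, v))
```

Comparing a region's current GC content with its projected equilibrium in this way indicates the direction in which substitutions are pushing it.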

Abstract

Recent analysis of the human and mouse genomes has shown that a substantial proportion of protein coding genes and cis-regulatory elements contain transposable element (TE) sequences, implicating TE domestication as a mechanism for the origin of genetic novelty. To understand the general role of TE domestication in eukaryotic genome evolution, it is important to assess the acquisition of functional TE sequences by host genomes in a variety of different species, and to understand in greater depth the population dynamics of these mutational events. Using an in silico screen for host genes that contain TE sequences, we identified a set of 63 mature "chimeric" transcripts supported by expressed sequence tag (EST) evidence in the Drosophila melanogaster genome. We found a paucity of chimeric TEs relative to expectations derived from non-chimeric TEs, indicating that the majority (approximately 80%) of TEs that generate chimeric transcripts are deleterious and are not observed in the genome sequence. Using a pooled-PCR strategy to assay the presence of gene-TE chimeras in wild strains, we found that over half of the observed chimeric TE insertions are restricted to the sequenced strain, and approximately 15% are found at high frequencies in North American D. melanogaster populations. Estimated population frequencies of chimeric TEs did not differ significantly from non-chimeric TEs, suggesting that the distribution of fitness effects for the observed subset of chimeric TEs is indistinguishable from the general set of TEs in the genome sequence. In contrast to mammalian genomes, we found that fewer than 1% of Drosophila genes produce mRNAs that include bona fide TE sequences. This observation can be explained by the results of our population genomic analysis, which indicates that most potential chimeric TEs in D. melanogaster are deleterious but that a small proportion may contribute to the evolution of novel gene sequences such as nested or intercalated gene structures.
Our results highlight the need to establish the fixity of putative cases of TE domestication identified using genome sequences in order to demonstrate their functional importance, and reveal that the contribution of TE domestication to genome evolution may vary drastically among animal taxa.

Abstract

Gene duplication is the fundamental source of new genes. Biases in duplication have profound implications for the dynamics of gene content during evolution. In this article, we compare genes arising from whole genome duplication (WGD), smaller scale duplication (SSD) and singletons in Saccharomyces cerevisiae. Our results demonstrate that genes duplicated by WGD and SSD are similarly biased with respect to codon bias and evolutionary rate, although differing significantly in their functional constituency.

Abstract

The patterns and processes of molecular evolution may differ between the X chromosome and the autosomes in Drosophila melanogaster. This may in part be due to differences in the effective population size between the two chromosome sets and in part to the hemizygosity of the X chromosome in Drosophila males. These and other factors may lead to differences both in the gene complements of the X and the autosomes and in the properties of the genes residing on those chromosomes. Here we show that codon bias and recombination rate are correlated strongly and negatively on the X chromosome, and that this correlation cannot be explained by indirect relationships with other known determinants of codon bias. This is in dramatic contrast to the weak positive correlation found on the autosomes. We explored possible explanations for these patterns, which required a comprehensive analysis of the relationships among multiple genetic properties such as protein length and expression level. This analysis highlights conserved features of coding sequence evolution on the X and the autosomes and illuminates interesting differences between these two chromosome sets.

Abstract

The tempo at which a protein evolves depends not only on the rate at which mutations arise but also on the selective effects that those mutations have at the organismal level. It is intuitive that proteins functioning during different stages of development may be predisposed to having mutations of different selective effects. For example, it has been hypothesized that changes to proteins expressed during early development should have larger phenotypic consequences because later stages depend on them. Conversely, changes to proteins expressed much later in development should have smaller consequences at the organismal level. Here we assess whether proteins expressed at different times during Drosophila development vary systematically in their rates of evolution. We find that proteins expressed early in development and particularly during mid-late embryonic development evolve unusually slowly. In addition, proteins expressed in adult males show an elevated evolutionary rate. These two trends are independent of each other and cannot be explained by peculiar rates of mutation or levels of codon bias. Moreover, the observed patterns appear to hold across several functional classes of genes, although the exact developmental time of the slowest protein evolution differs among each class. We discuss our results in connection with data on the evolution of development.

Abstract

Mutation is the underlying force that provides the variation upon which evolutionary forces can act. It is important to understand how mutation rates vary within genomes and how the probabilities of fixation of new mutations vary as well. If substitutional processes across the genome are heterogeneous, then examining patterns of coding sequence evolution without taking these underlying variations into account may be misleading. Here we present the first rigorous test of substitution rate heterogeneity in the Drosophila melanogaster genome using almost 1500 nonfunctional fragments of the transposable element DNAREP1_DM. Not only do our analyses suggest that substitutional patterns in heterochromatic and euchromatic sequences are different, but also they provide support in favor of a recombination-associated substitutional bias toward G and C in this species. The magnitude of this bias is entirely sufficient to explain recombination-associated patterns of codon usage on the autosomes of the D. melanogaster genome. We also document a bias toward lower GC content in the pattern of small insertions and deletions (indels). In addition, the GC content of noncoding DNA in Drosophila is higher than would be predicted on the basis of the pattern of nucleotide substitutions and small indels. However, we argue that the fast turnover of noncoding sequences in Drosophila makes it difficult to assess the importance of the GC biases in nucleotide substitutions and small indels in shaping the base composition of noncoding sequences.

Elevated evolutionary rates in the laboratory strain of Saccharomyces cerevisiae. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA. Gu, Z. L., David, L., Petrov, D., Jones, T., Davis, R. W., Steinmetz, L. M. 2005; 102 (4): 1092-1097

Abstract

By using the maximum likelihood method, we made a genome-wide comparison of the evolutionary rates in the lineages leading to the laboratory strain (S288c) and a wild strain (YJM789) of Saccharomyces cerevisiae and found that genes in the laboratory strain tend to evolve faster than in the wild strain. The pattern of elevated evolution suggests that relaxation of selection intensity is the dominant underlying reason, which is consistent with recurrent bottlenecks in the S. cerevisiae laboratory strain population. Supporting this conclusion are the following observations: (i) the increases in nonsynonymous evolutionary rate occur for genes in all functional categories; (ii) most of the synonymous evolutionary rate increases in S288c occur in genes with strong codon usage bias; (iii) genes under stronger negative selection have a larger increase in nonsynonymous evolutionary rate; and (iv) more genes with adaptive evolution were detected in the laboratory strain, but they do not account for the majority of the increased evolution. The present discoveries suggest that experimental and possible industrial manipulations of the laboratory strain of yeast could have had a strong effect on the genetic makeup of this model organism. Furthermore, they imply an evolution of laboratory model organisms away from their wild counterparts, questioning the relevancy of the models especially when extensive laboratory cultivation has occurred. In addition, these results shed light on the evolution of livestock and crop species that have been under human domestication for years.

Abstract

If large genomes are truly saturated with unnecessary 'junk' DNA, it would seem natural that there would be costs associated with accumulation and replication of this excess DNA. Here we examine the available evidence to support this hypothesis, which we term the 'large genome constraint'. We examine the large genome constraint at three scales: evolution, ecology, and the plant phenotype. In evolution, we tested the hypothesis that plant lineages with large genomes are diversifying more slowly. We found that genera with large genomes are less likely to be highly speciose -- suggesting a large genome constraint on speciation. In ecology, we found that species with large genomes are under-represented in extreme environments -- again suggesting a large genome constraint for the distribution and abundance of species. Ultimately, if these ecological and evolutionary constraints are real, the genome size effect must be expressed in the phenotype and confer selective disadvantages. Therefore, in phenotype, we review data on the physiological correlates of genome size, and present new analyses involving maximum photosynthetic rate and specific leaf area. Most notably, we found that species with large genomes have reduced maximum photosynthetic rates -- again suggesting a large genome constraint on plant performance. Finally, we discuss whether these phenotypic correlations may help explain why species with large genomes are trimmed from the evolutionary tree and have restricted ecological distributions. Our review tentatively supports the large genome constraint hypothesis.

Abstract

Eukaryotic enhancers act over very long distances, yet still show remarkable specificity for their own promoter. To better understand mechanisms underlying this enhancer-promoter specificity, we used transvection to analyze enhancer choice between two promoters, one located in cis to the enhancer and the other in trans to the enhancer, at the yellow gene of Drosophila melanogaster. Previously, we demonstrated that enhancers at yellow prefer to act on the cis-linked promoter, but that mutation of core promoter elements in the cis-linked promoter releases enhancers to act in trans. Here, we address the mechanism by which these elements affect enhancer choice. We consider and explicitly test three models that are based on promoter competency, promoter pairing, and promoter identity. Through targeted gene replacement of the endogenous yellow gene, we show that competency of the cis-linked promoter is a key parameter in the cis-trans choice of an enhancer. In fact, complete replacement of the yellow promoter with both TATA-containing and TATA-less heterologous promoters maintains enhancer action in cis.

Abstract

Closely related species of Drosophila tend to have similar genome sizes. The strong imbalance in favor of small deletions relative to insertions implies that the unconstrained DNA in Drosophila is unlikely to be passively inherited from even closely related ancestors, and yet most DNA in Drosophila genomes is intergenic and potentially unconstrained. In an attempt to investigate the maintenance of this intergenic DNA, we studied the evolution of an intergenic locus on the fourth chromosome of the Drosophila melanogaster genome. This 1.2-kb locus is marked by two distinct, large insertion events: a nuclear transposition of a mitochondrial sequence and a transposition of a nonautonomous DNA transposon DNAREP1_DM. Because we could trace the evolutionary histories of these sequences, we were able to reconstruct the length evolution of this region in some detail. We sequenced this locus in all four species of the D. melanogaster species complex: D. melanogaster, D. simulans, D. sechellia, and D. mauritiana. Although this locus is similar in size in these four species, less than 10% of the sequence from the most recent common ancestor remains in D. melanogaster and all of its sister species. This region appears to have increased in size through several distinct insertions in the ancestor of the D. melanogaster species complex and has been shrinking since the split of these lineages. In addition, we found no evidence suggesting that the size of this locus has been maintained over evolutionary time; these results are consistent with the model of a dynamic equilibrium between persistent DNA loss through small deletions and more sporadic DNA gain through less frequent but longer insertions. The apparent stability of genome size in Drosophila may belie very rapid sequence turnover at intergenic loci.

Abstract

The hundreds of mitochondrial pseudogenes in the human nuclear genome sequence (numts) constitute an excellent system for studying and dating DNA duplications and insertions. These pseudogenes are associated with many complete mitochondrial genome sequences and through those with a good fossil record. By comparing individual numts with primate and other mammalian mitochondrial genome sequences, we estimate that these numts arose continuously over the last 58 million years. Our pairwise comparisons between numts suggest that most human numts arose from different mitochondrial insertion events and not by DNA duplication within the nuclear genome. The nuclear genome appears to accumulate mtDNA insertions at a rate high enough to predict within-population polymorphism for the presence/absence of many recent mtDNA insertions. Pairwise analysis of numts and their flanking DNA produces an estimate for the DNA duplication rate in humans of 2.2 × 10^-9 per numt per year. Thus, a nucleotide site is about as likely to be involved in a duplication event as it is to change by point substitution. This estimate of the rate of DNA duplication of noncoding DNA is based on sequences that are not in duplication hotspots, and is close to the rate reported for functional genes in other species.

Abstract

Studies of "dead-on-arrival" transposable elements in Drosophila melanogaster found that deletions outnumber insertions approximately 8:1 with a median size for deletions of approximately 10 bp. These results are consistent with the deletion and insertion profiles found in most other Drosophila pseudogenes. In contrast, a recent study of D. melanogaster introns found a deletion/insertion ratio of 1.35:1, with 84% of deletions being shorter than 10 bp. This discrepancy could be explained if deletions, especially long deletions, are more frequently strongly deleterious than insertions and are eliminated disproportionately from intron sequences. To test this possibility, we use analysis and simulations to examine how deletions and insertions of different lengths affect different components of splicing and determine the distribution of deletions and insertions that preserve the original exons. We find that, consistent with our predictions, longer deletions affect splicing at a much higher rate compared to insertions and short deletions. We also explore other potential constraints in introns and show that most of these also disproportionately affect large deletions. Altogether we demonstrate that constraints in introns may explain much of the difference in the pattern of deletions and insertions observed in Drosophila introns and pseudogenes.
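The geometric intuition behind this result can be sketched with a toy calculation. This is only an illustration and not the authors' analysis or simulation: it assumes an intron of fixed length with a protected block at each end (the sizes `five_prime` and `three_prime` are hypothetical stand-ins for splice signals), deletions that start uniformly at random and lie entirely inside the intron, and insertions that land at a uniformly random position.

```python
def deletion_disrupts_fraction(intron_len, del_len, five_prime=6, three_prime=30):
    """Fraction of deletions of length del_len that touch a protected end block.

    All parameters are hypothetical; the protected block sizes are assumptions.
    """
    total_starts = intron_len - del_len + 1
    if total_starts <= 0:
        raise ValueError("deletion does not fit inside the intron")
    # Safe deletions start after the 5' block and end before the 3' block.
    safe_starts = max(0, intron_len - five_prime - three_prime - del_len + 1)
    return 1.0 - safe_starts / total_starts


def insertion_disrupts_fraction(intron_len, five_prime=6, three_prime=30):
    """Fraction of insertion points that fall strictly inside a protected block.

    In this toy model the result does not depend on the insertion length.
    """
    total_points = intron_len + 1
    safe_points = max(0, intron_len - five_prime - three_prime + 1)
    return 1.0 - safe_points / total_points


short_del = deletion_disrupts_fraction(100, 1)    # 1-bp deletion, 100-bp intron
long_del = deletion_disrupts_fraction(100, 40)    # 40-bp deletion, same intron
any_ins = insertion_disrupts_fraction(100)        # insertion of any length
print(short_del, long_del, any_ins)
```

Under these assumptions the disrupted fraction rises with deletion length, while insertions behave roughly like the shortest deletions, which is the qualitative pattern the abstract describes.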

Abstract

Eukaryotes have both 'intron containing' and 'intron less' genes. Several databases are available for 'intron containing' genes in eukaryotes. In this note, we describe a database for 'intron less' genes from eukaryotes. 'Intron less' eukaryotic genes having prokaryotic architecture will help to understand gene evolution in a much simpler way unlike 'intron containing' genes. SEGE is available at http://intron.bic.nus.edu.sg/seg/ (contact: mmeena@ntu.edu.sg)

Abstract

Mutation is often said to be random. Although it must be true that mutation is ignorant about the adaptive needs of the organism and thus is random relative to them as a rule, mutation is not truly random in other respects. Nucleotide substitutions, deletions, insertions, inversions, duplications and other types of mutation occur at different rates and are effected by different mechanisms. Moreover, the rates of different mutations vary from organism to organism. Differences in mutational biases, along with natural selection, could impact gene and genome evolution in important ways. For instance, several recent studies have suggested that differences in insertion/deletion biases lead to profound differences in the rate of DNA loss in animals and that this difference per se can lead to significant changes in genome size. In particular, Drosophila melanogaster appears to have a very high rate of deletions, a correspondingly high rate of DNA loss, and a very compact genome. To assess the validity of these studies we must first assess the validity of the measurements of indel biases themselves. Here I demonstrate the robustness of indel bias measurements in Drosophila, by comparing indel patterns in different types of nonfunctional sequences. The indel pattern and the high rate of DNA loss appear to be shared by all known nonfunctional sequences, both euchromatic and heterochromatic, transposable and non-transposable, repetitive and unique. Unfortunately, all available nonfunctional sequences are untranscribed and thus effects of transcription on indel bias cannot be assessed. I also discuss in detail why it is unlikely that natural selection for or against DNA loss significantly affects current estimates of indel biases.

Abstract

Several studies have shown DNA loss to be inversely correlated with genome size in animals. These studies include a comparison between Drosophila and the cricket, Laupala, but there has been no assessment of DNA loss in insects with very large genomes. Podisma pedestris, the brown mountain grasshopper, has a genome over 100 times as large as that of Drosophila and 10 times as large as that of Laupala. We used 58 paralogous nuclear pseudogenes of mitochondrial origin to study the characteristics of insertion, deletion, and point substitution in P. pedestris and Italopodisma. In animals, these pseudogenes are "dead on arrival"; they are abundant in many different eukaryotes, and their mitochondrial origin simplifies the identification of point substitutions accumulated in nuclear pseudogene lineages. There appears to be a mononucleotide repeat within the 643-bp pseudogene sequence studied that acts as a strong hot spot for insertions or deletions (indels). Because the data for other insect species did not contain such an unusual region, hot spots were excluded from species comparisons. The rate of DNA loss relative to point substitution appears to be considerably and significantly lower in the grasshoppers studied than in Drosophila or Laupala. This suggests that the inverse correlation between genome size and the rate of DNA loss can be extended to comparisons between insects with large or gigantic genomes (i.e., Laupala and Podisma). The low rate of DNA loss implies that in grasshoppers, the accumulation of point mutations is a more potent force for obscuring ancient pseudogenes than their loss through indel accumulation, whereas the reverse is true for Drosophila. The main factor contributing to the difference in the rates of DNA loss estimated for grasshoppers, crickets, and Drosophila appears to be deletion size. Large deletions are relatively rare in Podisma and Italopodisma.

Abstract

We recently proposed that patterns of evolution of non-LTR retrotransposable elements can be used to study patterns of spontaneous mutation. Transposition of non-LTR retrotransposable elements commonly results in creation of 5' truncated, "dead-on-arrival" copies. These inactive copies are effectively pseudogenes and, according to the neutral theory, their molecular evolution ought to reflect rates and patterns of spontaneous mutation. Maximum parsimony can be used to separate the evolution of active lineages of a non-LTR element from the fate of the "dead-on-arrival" insertions and to directly assess the relative frequencies of different types of spontaneous mutations. We applied this approach using a non-LTR element, Helena, in the Drosophila virilis group and have demonstrated a surprisingly high incidence of large deletions and the virtual absence of insertions. Based on these results, we suggested that Drosophila in general may exhibit a high rate of spontaneous large deletions and have hypothesized that such a high rate of DNA loss may help to explain the puzzling dearth of bona fide pseudogenes in Drosophila. We also speculated that variation in the rate of spontaneous deletion may contribute to the divergence of genome size in different taxa by affecting the amount of superfluous "junk" DNA such as, for example, pseudogenes or long introns. In this paper, we extend our analysis to the D. melanogaster subgroup, which last shared a common ancestor with the D. virilis group approximately 40 MYA. In a different region of the same transposable element, Helena, we demonstrate that inactive copies accumulate deletions in species of the D. melanogaster subgroup at a rate very similar to that of the D. virilis group. These results strongly suggest that the high rate of DNA loss is a general feature of Drosophila and not a peculiar property of a particular stretch of DNA in a particular species group.

Abstract

We have recently described a novel method of estimating neutral rates and patterns of spontaneous mutation (Petrov et al., 1996). This method takes advantage of the propensity of non-LTR retrotransposable elements to create non-functional, 'dead-on-arrival' copies as a product of transposition. Maximum parsimony analysis is used to separate the evolution of actively transposing lineages of a non-LTR element from the fate of individual inactive insertions, and thereby allows one to assess directly the relative rates of different types of mutation, including point substitutions, deletions and insertions. Because non-LTR elements enjoy wide phylogenetic distribution, this method can be used in taxa that do not harbor a significant number of bona fide pseudogenes, as is the case in Drosophila (Jeffs and Ashburner, 1991; Weiner et al., 1986). We used this method with Helena, a non-LTR retrotransposable element present in the Drosophila virilis species group. A striking finding was the virtual absence of insertions and remarkably high incidence of large deletions, which combine to produce a high overall rate of DNA loss. On average, the rate of DNA loss in D. virilis is approximately 75 times faster than that estimated for mammalian pseudogenes (Petrov et al., 1996). The high rate of DNA loss should lead to rapid elimination of non-essential DNA and thus may explain the seemingly paradoxical dearth of pseudogenes in Drosophila. Varying rates of DNA loss may also contribute to differences in genome size (Graur et al., 1989; Petrov et al., 1996), thus explaining the celebrated 'C-value' paradox (John and Miklos, 1988). In this paper we outline the theoretical basis of our method, examine the data from this perspective, and discuss potential problems that may bias our estimates.

Abstract

Transposable elements are a major source of genetic change, including the creation of novel genes, the alteration of gene expression in development, and the genesis of major genomic rearrangements. They are ubiquitous among contemporary organisms and probably as old as life itself. The long coexistence of transposable elements in the genome would be expected to be accompanied by host-element coevolution. Indeed, the important role of host factors in the regulation of transposable elements has been illuminated by recent studies of several systems in Drosophila. These include host factors that regulate the P element, a host mutation that renders the genome permissive for gypsy mobilization and infection, and newly induced mutations that affect the expression of transposon insertion mutations. The finding of a type of hybrid dysgenesis in D. virilis, in which multiple unrelated transposable elements are mobilized simultaneously, may also be relevant to host-factor regulation of transposition.

Abstract

Methods of genome analysis, including the cloning and manipulation of large fragments of DNA, have opened new strategies for uniting molecular evolutionary genetics with chromosome evolution. We have begun the development of a physical map of the genome of Drosophila virilis based on large DNA fragments cloned in bacteriophage P1. A library of 10,080 P1 clones with average insert sizes of 65.8 kb, containing approximately 3.7 copies of the haploid genome of D. virilis, has been constructed and characterized. Approximately 75% of the clones have inserts exceeding 50 kb, and approximately 25% have inserts exceeding 80 kb. A sample of 186 randomly selected clones was mapped by in situ hybridization with the salivary gland chromosomes. A method for identifying D. virilis clones containing homologs of D. melanogaster genes has also been developed using hybridization with specific probes obtained from D. melanogaster by means of the polymerase chain reaction. This method proved successful for nine of ten genes and resulted in the recovery of 14 clones. The hybridization patterns of a sample of P1 clones containing repetitive DNA were also determined. A significant fraction of these clones hybridizes to multiple euchromatic sites but not to the chromocenter, which is a pattern of hybridization that is very rare among clones derived from D. melanogaster. The materials and methods described will make it possible to carry out a direct study of molecular evolution at the level of chromosome structure and organization as well as at the level of individual genes.
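The library figures quoted above can be checked with a short worked calculation. The sketch below uses only the numbers given in the abstract (10,080 clones, 65.8-kb mean insert, about 3.7 genome copies). The haploid genome size is back-calculated from those figures rather than taken from the paper, and the representation probability uses the standard Clarke-Carbon formula P = 1 - (1 - f)^N, which is introduced here as an assumption for illustration and is not stated in the abstract.

```python
def implied_genome_kb(n_clones, mean_insert_kb, coverage):
    """Haploid genome size implied by clone count, insert size and coverage."""
    return n_clones * mean_insert_kb / coverage


def prob_sequence_present(n_clones, mean_insert_kb, genome_kb):
    """Clarke-Carbon probability that a given unique sequence is in the library.

    Assumes clones sample the genome uniformly and independently.
    """
    f = mean_insert_kb / genome_kb  # fraction of the genome in one clone
    return 1.0 - (1.0 - f) ** n_clones


total_cloned_kb = 10080 * 65.8                          # about 663,000 kb of cloned DNA
genome_kb = implied_genome_kb(10080, 65.8, 3.7)         # about 179,000 kb
p_present = prob_sequence_present(10080, 65.8, genome_kb)
print(round(total_cloned_kb), round(genome_kb), round(p_present, 3))
```

Under the uniform-sampling assumption, a library of this depth would be expected to contain any given single-copy sequence with a probability of roughly 0.975.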

Abstract

He-T sequences are a complex repetitive family of DNA sequences in Drosophila that are associated with telomeric regions, pericentromeric heterochromatin, and the Y chromosome. A component of the He-T family containing open reading frames (ORFs) is described. These ORF-containing elements within the He-T family are designated T-elements, since hybridization in situ with the polytene salivary gland chromosomes results in detectable signal exclusively at the chromosome tips. One T-element that has been sequenced includes ORFs of 1,428 and 1,614 bp. The ORFs are overlapping but one nucleotide out of frame with respect to each other. The longer ORF contains cysteine-histidine motifs strongly resembling nucleic acid binding domains of gag-like proteins, and the overall organization of the T-element ORFs is reminiscent of LINE elements. The T-elements are transcribed and appear to be conserved in Drosophila species related to D. melanogaster. The results suggest that T-elements may play a role in the structure and/or function of telomeres.

Abstract

Highly polymorphic segments of the human genome containing variable numbers of tandem repeats (VNTRs) have been widely used to establish DNA profiles of individuals for use in forensics. Methods of estimating the probability of occurrence of matching DNA profiles between two randomly selected individuals have been subject to extensive debate regarding the possibility of significant substructure occurring within the major races. We have sampled two Caucasian subpopulations, Finns and Italians, at four commonly used VNTR loci to determine the extent to which the subgroups differ from each other and from a mixed Caucasian database. The data were also analyzed for the occurrence of linkage disequilibrium among the loci. The allele frequency distributions of some loci were found to differ significantly among the subpopulations in a manner consistent with population substructure. Major differences were also found in the probability of occurrence of matching DNA profiles between two individuals chosen at random from the same subpopulation. With respect to the Finnish and Italian subpopulations, the conventional product rule for estimating the probability of a multilocus VNTR match using a mixed Caucasian database consistently yields estimates that are artificially small. Systematic errors of this type were not found using the interim ceiling principle recently advocated in the National Research Council's report [National Research Council (1992) DNA Technology in Forensic Science (Natl. Acad. Sci., Washington)]. The interim ceiling principle is based on currently available racial or ethnic databases and sets an arbitrary lower limit on each VNTR allele frequency. In the future the ceiling frequencies are expected to be established from more adequate data acquired for relevant VNTR loci from multiple subpopulations.
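The contrast between the two estimators can be sketched as follows. This is a minimal illustration, not the authors' calculation: every allele frequency is hypothetical, each locus is treated as a heterozygote with Hardy-Weinberg frequency 2pq, and the 0.10 floor is an assumed default for the "arbitrary lower limit" mentioned above. The ceiling step here simply takes, for each allele, the largest frequency seen in any database or the floor, whichever is greater.

```python
def product_rule(loci):
    """Multilocus match probability from one database.

    loci: list of (p, q) allele-frequency pairs, one heterozygous locus each.
    """
    prob = 1.0
    for p, q in loci:
        prob *= 2.0 * p * q
    return prob


def ceiling_rule(loci_across_dbs, floor=0.10):
    """Match probability using per-allele ceiling frequencies.

    loci_across_dbs: list of (p_values, q_values), where each element lists
    the allele's frequency in every available database. floor is an assumed
    illustrative lower limit.
    """
    prob = 1.0
    for p_values, q_values in loci_across_dbs:
        p = max(max(p_values), floor)
        q = max(max(q_values), floor)
        prob *= 2.0 * p * q
    return prob


# Hypothetical frequencies: a mixed database, then the same alleles in two
# subpopulation databases.
mixed = [(0.05, 0.08), (0.02, 0.10)]
across = [([0.05, 0.12], [0.08, 0.03]), ([0.02, 0.04], [0.10, 0.15])]

naive = product_rule(mixed)
conservative = ceiling_rule(across)
print(naive, conservative)
```

With these made-up inputs the ceiling estimate is larger than the product-rule estimate, which is the direction of conservatism the abstract attributes to the interim ceiling principle.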

Conference Proceedings

Abstract

Pseudogenes are nonfunctional copies of protein-coding genes that are presumed to evolve without selective constraints on their coding function. They are of considerable utility in evolutionary genetics because, in the absence of selection, different types of mutations in pseudogenes should have equal probabilities of fixation. This theoretical inference justifies the estimation of patterns of spontaneous mutation from the analysis of patterns of substitutions in pseudogenes. Although it is possible to test whether pseudogene sequences evolve without constraints for their protein-coding function, it is much more difficult to ascertain whether pseudogenes may affect fitness in ways unrelated to their nucleotide sequence. Consider the possibility that a pseudogene affects fitness merely by increasing genome size. If a larger genome is deleterious--for example, because of increased energetic costs associated with genome replication and maintenance--then deletions, which decrease the length of a pseudogene, should be selectively advantageous relative to insertions or nucleotide substitutions. In this article we examine the implications of selection for genome size relative to small (1-400 bp) deletions, in light of empirical evidence pertaining to the size distribution of deletions observed in Drosophila and mammalian pseudogenes. There is a large difference in the deletion spectra between these organisms. We argue that this difference cannot easily be attributed to selection for overall genome size, since the magnitude of selection is unlikely to be strong enough to significantly affect the probability of fixation of small deletions in Drosophila.

Abstract

A novel method for estimating neutral rates and patterns of DNA evolution in Drosophila takes advantage of the propensity of non-LTR retrotransposable elements to create nonfunctional, transpositionally inactive copies as a product of transposition. For many LINE elements, most copies present in a genome at any one time are nonfunctional "dead-on-arrival" (DOA) copies. Because these are off-shoots of active, transpositionally competent "master" lineages, in a gene tree of a LINE element from multiple samples from related species, the DOA lineages are expected to map to the terminal branches and the active lineages to the internal branches, the primary exceptions being when the sample includes DOA copies that are allelic or orthologous. Analysis of nucleotide substitutions and other changes along the terminal branches therefore allows estimation of the fixation process in the DOA copies, which are unconstrained with respect to protein coding; and under selective neutrality, the fixation process estimates the underlying mutational pattern. We have studied the retroelement Helena in Drosophila. An unexpectedly high rate of DNA loss was observed, yielding a half-life of unconstrained DNA sequences approximately 60-fold faster in Drosophila than in mammals. The high rate of DNA loss suggests a straightforward explanation of the seeming paradox that Drosophila has many fewer pseudogenes than found in mammalian species. Differential rates of deletion in different taxa might also contribute to the celebrated C-value paradox of why some closely related organisms can have very different DNA contents. New data presented here rule out the possibility that the transposition process itself is highly mutagenic, hence the observed linear relation between number of deletions and number of nucleotide substitutions is most easily explained by the hypothesis that both types of changes accumulate in unconstrained sequences over time.
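The notion of a half-life for unconstrained DNA can be made concrete with a simple decay model. The sketch below assumes, purely for illustration, that DNA is lost at a constant proportional rate, so the surviving fraction decays exponentially. The rates are hypothetical placeholders in arbitrary time units, and only the 60-fold ratio comes from the abstract.

```python
import math


def half_life(loss_rate):
    """Time for half of an unconstrained sequence to be lost (exponential model)."""
    return math.log(2) / loss_rate


def fraction_remaining(loss_rate, t):
    """Fraction of the original sequence still present after time t."""
    return math.exp(-loss_rate * t)


mammal_rate = 1.0                      # hypothetical rate, arbitrary units
drosophila_rate = 60.0 * mammal_rate   # 60-fold higher rate of loss

ratio = half_life(mammal_rate) / half_life(drosophila_rate)
check = fraction_remaining(drosophila_rate, half_life(drosophila_rate))
print(ratio, check)
```

In this model a 60-fold higher rate of loss translates directly into a 60-fold shorter half-life, independent of the absolute rates chosen.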