Abstract

Genomes can encode a variety of proteins with unrelated architectures and activities. It is known that protein-coding genes of de novo origin have significantly contributed to this diversity. However, the molecular mechanisms and evolutionary processes behind these originations are still poorly understood. Here we show that the last 102 codons of a novel gene, Noble, assembled directly from non-coding DNA following an intronic deletion that induced alternative intron retention at the Drosophila melanogaster Rieske Iron Sulphur Protein (RFeSP) locus. A systematic analysis of the evolutionary processes behind the origin of Noble showed that its emergence was strongly biased by natural selection on and around the RFeSP locus. Noble mRNA is shown to encode a bona fide protein that lacks an iron sulphur domain and localizes to mitochondria. Together, these results demonstrate the generation of a novel protein at a naturally selected site.

Natural selection and neutral drift have been postulated to shape de novo coding sequences following their assembly from non-coding DNA1,2,3. However, the processes, or constraints, that lead to the origin of novel coding regions have seldom been studied systematically. This might be because, despite recent advances in genome sequencing, it remains a challenge to reconstruct with confidence the evolutionary pathway of the origination of any novel coding region1,2,3,4,5. Random genetic drift, population bottlenecks, genetic sweeps and the extinction of species are a few of the natural processes that affect the frequency of transitional alleles and commonly contribute to a discontinuous mutational lineage through time. Fortunately, decades of theoretical work on the neutral theories of evolution as a null hypothesis for molecular evolution6,7,8,9 have provided a solid theoretical framework for understanding gene origination. This work also allows us to test whether any de novo gene origination would arise as a consequence of non-adaptive mechanisms by the stochastic accumulation of neutral or quasi-neutral mutations.

Rieske iron sulphur proteins (RFeSPs) are essential, highly conserved functional constituents of energy-transducing respiratory complexes10. Drosophila melanogaster is predicted to have a complex RFeSP locus encoding at least two different proteins by an alternative intron-retention mechanism, according to published reference sequences11,12,13,14 (Fig. 1a). Briefly, the conserved RFeSP isoform (annotated as RFeSP-PB) is encoded by the RFeSP-RB transcript, which arises following splicing of the second intron of the locus (hereafter referred to as intron2). An alternative transcript, RFeSP-RA, forms following intron2 retention, which shifts the reading frame of the 3′-end of the gene. The resulting RFeSP-PA protein is predicted to contain 102 amino acids (aa) of novel sequence at its carboxy (C)-terminus instead of the last 72 aa of the C-terminal iron-sulphur cluster-binding domain found in RFeSP-PB (Fig. 1a).

Here, the evolutionary history of RFeSP-PA was systematically investigated, and both the neutrality and stochasticity of its origin were tested. We found that the last 102 codons of RFeSP-RA assembled de novo from non-coding DNA in a single step after a nearly neutral intronic deletion caused the alternative retention of the second intron of the RFeSP-RB gene. Analyses of the evolutionary processes affecting the RFeSP locus before the emergence of RFeSP-RA then allowed us to determine and dissect the role played by natural selection as a significant source of bias affecting the origination of RFeSP-RA.

Results

RFeSP-RA is associated with a polymorphic intronic deletion

To confirm the annotated prediction that the D. melanogaster RFeSP locus encodes two isoforms, reverse transcriptase (RT)–PCR was performed to amplify across intron2 using cDNAs from two different standard fly stocks (Fig. 1b). Both the novel RFeSP-RA and the conserved RFeSP-RB isoforms are produced in the Berkeley Drosophila Reference Sequencing Strain11,12 (reference genome strain y1; cn1bw1sp1). However, even though total RFeSP transcript levels were similar, no RFeSP-RA was detectable in another standard strain w1118 (Fig. 1b).

To test whether the alternative splicing of RFeSP was associated with any underlying genetic alteration, PCR was performed using genomic DNA isolated from both w1118 and the reference genome strains. The reference genome strain carried an ~50-bp shorter intron2 than the w1118 strain (Fig. 1b,c). These experiments showed that RFeSP-RA expression was associated with a variation in intron2 length (Fig. 1c).

Single-step assembly of 102 de novo codons of RFeSP-RA

To discover the origin and frequency of the intron2 variants that produce the novel RFeSP-RA transcript, ~300 bp of DNA sequence spanning intron2 were obtained from 57 lines of D. melanogaster of geographically diverse origin, as well as from a series of lines from closely related Drosophila species (Supplementary Table S1). The sequences were aligned by hand and clustered into haplotypes (Supplementary Fig. S1). Results suggested that the RFeSP-RA-productive intron2 variant of the reference genome strain was identical to, and most likely originated from, the Canton-S wild-type stock. The number of strains with this short Canton-S-like intron2 haplotype was low compared with the number of strains with the longer intron2 variants, which were most similar in length to the w1118 intron2 allele (Supplementary Fig. S1). These longer intron2 sequences clustered into two major allelic groups, hereafter named intron2a and intron2b, which are 115 and 117 bp in size, respectively (Fig. 1d and Supplementary Fig. S1). Using phylogenetically informative single-nucleotide polymorphisms within intron2, we determined that an intron2b allele directly gave rise to the RFeSP-RA-productive intron2 allele found in Canton-S by a 62-bp deletion (Fig. 1d); hence, the latter was named intron2bΔ62. This finding raised the possibility that the deletion intron2bΔ62 directly caused the emergence of RFeSP-RA mRNA and the generation of the last 102 codons of RFeSP-RA in a single step. Supporting this interpretation, we found that no RFeSP-RA-like mRNA was detectable by RT–PCR in a strain carrying the intron2b genotype directly ancestral to intron2bΔ62 (Fig. 1e). Furthermore, no RFeSP-RA cDNA could be detected by RT–PCR in a nonsense-mediated decay (NMD)15,16 defective background carrying the ancestral intron2b allele (Fig. 1e). This indicated that in the ancestral intron2b allele, an RFeSP-RA-like mRNA is not being generated and then degraded by NMD. These results strongly suggest that the intron2bΔ62 deletion itself was the cause of the de novo RFeSP-RA mRNA emergence.

A plausible mechanism to explain the facultative intron2 retention is that the putative branch point is positioned only 31 bp downstream of the 5′ splice donor in intron2bΔ62 (Fig. 1d; Supplementary Fig. S1). This distance is shorter than the ~38-bp limit found between the 5′ splice donor and the branch point in previous D. melanogaster intron-sequence analyses17. In the alleles that are efficiently spliced, the predicted branch points lie further than the 38-bp limit from the 5′ splice donor. Together, these data suggest that the intron2bΔ62 deletion directly caused the emergence of the RFeSP-RA mRNA by creating a suboptimal distance between the 5′ splice donor and the branch point in this allele (that is, intron recognition is poor, but still possible), giving rise to inefficient splicing of this intron. Given that RFeSP is an essential gene18, Canton-S flies might have survived and/or fixed intron2bΔ62 because it still allowed production of the canonical RFeSP protein, albeit less efficiently.
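The arithmetic behind this mechanism can be checked with a short sketch. It assumes that intron2bΔ62 is exactly the 117-bp intron2b allele minus 62 bp, and it treats the ~38-bp donor-to-branch-point limit as a hard threshold, which is a simplification. The constant and function names are illustrative and do not come from the study.

```python
# Illustrative sketch only; names and the hard threshold are assumptions.
INTRON2B_LENGTH = 117       # bp, intron2b allele (value from the text)
DELETION_LENGTH = 62        # bp, the intron2bΔ62 deletion (value from the text)
MIN_BRANCH_DISTANCE = 38    # bp, approximate lower limit cited in the text


def retained_intron_frame_shift(intron_length: int) -> int:
    """Bases by which a retained intron shifts the downstream reading frame."""
    return intron_length % 3


def branch_point_is_suboptimal(distance_bp: int,
                               limit_bp: int = MIN_BRANCH_DISTANCE) -> bool:
    """True when the donor-to-branch-point distance is below the limit."""
    return distance_bp < limit_bp


# Length of the deleted allele under the stated assumption.
deleted_allele_length = INTRON2B_LENGTH - DELETION_LENGTH

print("intron2bΔ62 length (bp):", deleted_allele_length)
print("frame shift on retention:", retained_intron_frame_shift(deleted_allele_length))
print("branch point at 31 bp suboptimal:", branch_point_is_suboptimal(31))
```

Under these assumptions, a non-zero remainder means that retaining the intron moves the downstream exon into a different reading frame, which is consistent with the frameshifted C-terminus described for RFeSP-PA.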

Nonneutral evolution of RFeSP intron2 alleles

To determine the mutational events, as well as the selective pressures that allowed the intron2bΔ62 deletion, the recent evolutionary history of its immediately ancestral allele, intron2b, was investigated. Molecular phylogenetic analyses indicate that virtually no intronic sequence gain has taken place and/or has become fixed in the melanogaster subgroup for 6–12 million years (MYs)19,20 (see Fig. 1d). Instead, several deletions occurred in intron2 during melanogaster subgroup speciation. Phylogenetic analyses of the deletions showed that they could be treated as irreversible shared derived cladistic characters21. Cladistic parsimony implies that the D. melanogaster intron2a and intron2b groups could not have originated from each other and that they must have originated independently from a 'complete' intron2a+b (Fig. 1d). Although sequencing efforts failed to find such intron2a+b segregating in D. melanogaster, even in sub-Saharan populations where this species originated22,23, many examples of intron2a+b-like introns were found in other melanogaster subgroup species, allowing us to devise the likely overall structure of the melanogaster subgroup intron2 ancestor (Fig. 1d). From this molecular phylogeny, it was concluded that the intron2a and intron2b groups are ancient and their existence as allelic groups either precedes, or coincides, with D. melanogaster speciation.

As the ancient nature of intron2 allele groups could have important implications for the understanding of the evolutionary processes that acted on RFeSP before the emergence of RFeSP-RA, the RFeSP locus was investigated further using a population genetics perspective. We found that RFeSP intron2b alleles had strikingly less nucleotide diversity than intron2a, and, although neutrality tests were generally nonsignificant when all intron2 alleles were considered together, when analysed separately the neutral hypothesis was rejected in three out of four neutrality tests for the intron2b group alleles, while none were rejected for intron2a (Fig. 2a; Supplementary Table S2). Furthermore, a difference between intron2 groups was also evident when the average ratio between nonsynonymous and synonymous substitution rates (dN/dS) on the coding regions of each group was calculated, revealing a complete absence of nucleotide substitution in the coding regions of intron2b group alleles21,24 (Supplementary Fig. S2). Two conclusions were drawn from these results about the evolution of the intron2b group: first, it deviated from that expected from neutrally drifting alleles, and second, it deviated from what one would expect if it were as ancient as the intron2a group. As intron2b is the ancestral allele of intron2bΔ62, these findings demonstrate that RFeSP-RA emerged from skewed nucleotide sequences.
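As a rough illustration of what a comparison of nucleotide diversity between allele groups involves, the sketch below computes the mean pairwise difference per site (π) and Watterson's estimator (θW) for two toy alignments. The sequences are invented, the alignments are assumed to be ungapped and of equal length (deletions are not handled), and the function names are hypothetical; the neutrality tests reported here were run with dedicated software, not with this code.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(alignment: Sequence[str]) -> float:
    """Mean number of pairwise differences per site (pi)."""
    pairs = list(combinations(alignment, 2))
    total = sum(sum(1 for a, b in zip(s1, s2) if a != b) for s1, s2 in pairs)
    return total / len(pairs) / len(alignment[0])


def watterson_theta(alignment: Sequence[str]) -> float:
    """Watterson's estimator per site: S / a1 / L."""
    n = len(alignment)
    segregating = sum(1 for column in zip(*alignment) if len(set(column)) > 1)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating / a1 / len(alignment[0])


# Invented toy alignments standing in for two allele groups.
group_diverse = ["ACGTAC", "ACGTTC", "ATGTAC", "ACGAAC"]
group_uniform = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTTC"]

for name, group in (("diverse", group_diverse), ("uniform", group_uniform)):
    print(name, nucleotide_diversity(group), watterson_theta(group))
```

A group dominated by near-identical haplotypes yields a lower π than a group with several distinct haplotypes, which is the kind of contrast described between intron2b and intron2a.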

To distinguish the mechanism for the reduced polymorphisms found in intron2b, we carried out linkage disequilibrium analyses between RFeSP intron2 groups and two possible proximal sites previously described to have been associated with positive selection25,26 (Supplementary Figs S3 and S4). Results showed that gene-copy polymorphisms in the tightly linked (~0.2 cM) Odorant receptor 22 (Or22) locus could significantly explain a large fraction of intermediate (a subset of intron2a) and high-frequency (all alleles from intron2b) RFeSP haplotypes (Supplementary Fig. S4). These analyses suggested that, apart from population history, positive selection could account for both the dip in nucleotide diversity in all high-frequency alleles, as well as for the linkage disequilibrium between them and variation at the Or22 locus (Supplementary Fig. S4). These findings warrant further study by using Chr2 isochromosomal lines and sequencing of multiple adjacent loci to probe further into this association.
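For readers unfamiliar with linkage disequilibrium measures, the following minimal sketch computes the coefficient D and the squared correlation r² between two biallelic sites from phased haplotypes. The choice of r² as the statistic, the 0/1 allele coding and the toy haplotypes are assumptions made for illustration; they are not taken from the analyses reported here.

```python
from typing import List, Tuple


def linkage_disequilibrium(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic loci coded as 0/1."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denominator


# Hypothetical coding: first value = intron2 group, second = variant at a linked site.
fully_linked = [(1, 1)] * 5 + [(0, 0)] * 5
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(linkage_disequilibrium(fully_linked))
print(linkage_disequilibrium(unlinked))
```

An r² near 1 indicates that the allele at one site almost fully predicts the allele at the other, whereas a value near 0 indicates independence.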

RFeSP-RA codons were biased by negative selection on RFeSP

To further study the possible effect of selection on the nucleotide sequence that eventually became part of RFeSP-RA, the earliest time since when this exact RFeSP locus has been under selective pressure was determined. dN/dS ratios were generally not measurable between melanogaster subgroup species, because there were virtually no nonsynonymous changes in the surveyed sequences (Fig. 2b). Although synonymous changes occurred, they were underrepresented. For instance, only 29.6% (8/27) and 22.2% (4/18) of the segregating polymorphisms found for D. melanogaster and D. simulans, respectively, were synonymous changes (Supplementary Fig. S1). Although these values are not statistically significantly different (Fisher's exact test, P>0.1) from the expected 40–47% of the possible neutral sites on the coding region relative to the intron (see Methods), they approach or reach significance when compared with the ~60% of changes on the coding region expected from randomly distributed mutations (P=0.054 and 0.018 for D. melanogaster and D. simulans, respectively; Fisher's exact test). A similar scenario is found in species of the yakuba/erecta clade, in which only 16.7% (8/48) of the DNA sequence variation found in the surveyed RFeSP loci of these species clusters outside intron2 (Supplementary Fig. S1), which departs significantly from the ~60% expected from randomly distributed mutations and the 40–46% expected from neutral site mutations (Fisher's exact test, P<0.001 and P=0.001, respectively). These results strongly suggest that RFeSP has been continuously under purifying selection since D. melanogaster and yakuba/erecta clade species last shared a common ancestor.
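The kind of comparison described above can be sketched with a two-sided Fisher's exact test written in plain Python. The 2×2 table used in the example (observed 8 coding versus 19 intronic polymorphisms, against a hypothetical expectation of 16 versus 11, that is, about 60% of 27) is an assumption about how such a table might be laid out; it is not confirmed as the table used in this study and need not reproduce the reported P values.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_observed = prob(a)
    low = max(0, row1 - (n - col1))
    high = min(row1, col1)
    # Sum all tables at least as extreme (no more probable) than the observed one.
    return sum(p for p in (prob(x) for x in range(low, high + 1))
               if p <= p_observed * (1 + 1e-9))


# Hypothetical layout: rows = observed / expected, columns = coding / intronic.
p_value = fisher_exact_two_sided(8, 19, 16, 11)
print("two-sided P:", p_value)
```

The small relative tolerance in the comparison guards against floating-point ties between equally probable tables.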

dN/dS analyses of RFeSP-coding region sequences obtained from a variety of key Drosophila taxa further suggested that negative selection was active on the RFeSP locus since all Old World Sophophora flies last shared a common ancestor 25–55 MY ago (MYA)19,20 or earlier (Fig. 2b). Importantly, synteny at this chromosomal region has been maintained since D. melanogaster and D. grimshawi last shared a common ancestor about 40–60 MYA19,20, strongly suggesting that we have followed the evolution of RFeSP sequences originating from the same chromosomal context (Supplementary Fig. S5).

The repeated elimination of deleterious alleles from RFeSP loci in D. melanogaster ancestors by negative selection was important for the emergence of RFeSP-RA. This imposed a strong bias on the mutations that could accumulate through time on RFeSP, significantly influencing the alternative reading frames, one of which would harbour the future coding sequence of RFeSP-RA (Fig. 3a). For instance, the product of the RFeSP-RA transcript could not have been created by the intron2bΔ62 deletion if there were premature translation termination codons (PTCs) in the alternative reading frame downstream of the ancestral intron2b allele. Indeed, two independent conservative (synonymous) changes were found in the RFeSP-RB mRNA isoform that eliminated two cryptic PTCs (Fig. 3b) roughly between 15–20 and 30–60 MYA, respectively19,20. Hence, the removal of the cryptic PTCs became fixed before the intron2bΔ62 deletion or even before the intron2 divergence into intron2a and intron2b alleles (Fig. 3b). The only cryptic in-frame PTCs remaining after these fixations were those within the intron2b intron, which were removed in one step by the intron2bΔ62 deletion. Considering that these changes happened in the context of low dN/dS levels, these conservative changes are strong evidence that the future sequence of RFeSP-RA was a by-product of purifying selection on RFeSP ancestors.

Next, we ruled out that chance alone could account for the fixation of the PTC-losses during the evolution of the RFeSP-RA reading frame. The reduced amount of nucleotide diversity in coding regions of RFeSP compared with its adjacent intron2 had already provided hints of mutational bias on the coding region (Fig. 2a; Supplementary Fig. S1). A detailed survey of 222 bp of the third exon of RFeSP-RB (from which >70% of the novel coding region of RFeSP-RA originated; Supplementary Table S3) showed that the loss of the PTCs during the evolution of RFeSP-RA could have followed trends in codon usage bias during the evolution of the melanogaster group (for example, one PTC was removed while the Tyr codon preference switched from TAT to TAC in Old World sophophorans (Supplementary Fig. S6)). This shows that at least one PTC loss was not random, because purifying selection could have been eliminating the mutants with suboptimal codons from populations.

Negative selection on RFeSP favours RFeSP-RA persistence

Results from the RFeSP codon survey (Supplementary Table S3) also revealed that once RFeSP-RA arose inside the RFeSP locus, it became unlikely that it would be lost by mutation alone. That is, the likelihood that an additional neutral mutation hits any of these 222 nucleotides of RFeSP-RB and at the same time removes RFeSP-RA (by introducing a PTC) is low (P=0.0015, 0.0165 or 0.0225, if one considers only neutral sites and codon bias, neutral sites and no codon bias, or all possible changes in RFeSP-RB that result in PTCs in RFeSP-RA (even those resulting in aa changes in RFeSP), respectively; Supplementary Table S3). These calculations assume that RFeSP-RA is a neutral or only slightly deleterious feature. If RFeSP-RA has already been (or occasionally becomes) recruited into a functional pathway, it can be predicted that it will itself be subject to natural selection, reducing even further the possibilities of its loss by mutation.
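The logic of such a survey can be illustrated with a minimal sketch that enumerates every single-nucleotide substitution in a coding sequence, keeps those that are synonymous in the primary reading frame, and counts how many of them introduce a stop codon in an alternative frame. The toy sequence, the choice of frame offset and the function names are hypothetical, and the sketch ignores codon usage bias, so it does not reproduce the probabilities given above.

```python
# Standard genetic code built from the compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(seq: str, offset: int = 0) -> str:
    """Translate complete codons starting at the given offset."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(offset, len(seq) - 2, 3))


def synonymous_ptc_mutations(seq: str, alt_offset: int):
    """Return (hits, synonymous_total) for single-base substitutions.

    hits lists (position, original, new) for substitutions that are
    synonymous in frame 0 and add a stop codon in the alternative frame.
    """
    reference_protein = translate(seq)
    reference_stops = translate(seq, alt_offset).count("*")
    hits, synonymous_total = [], 0
    for pos, original in enumerate(seq):
        for new in BASES:
            if new == original:
                continue
            mutant = seq[:pos] + new + seq[pos + 1:]
            if translate(mutant) != reference_protein:
                continue  # not synonymous in the primary frame
            synonymous_total += 1
            if translate(mutant, alt_offset).count("*") > reference_stops:
                hits.append((pos, original, new))
    return hits, synonymous_total


# Invented toy sequence (Met-Tyr-Lys-Gly); offset 2 is an arbitrary alternative frame.
hits, synonymous_total = synonymous_ptc_mutations("ATGTACAAAGGG", alt_offset=2)
print(hits, synonymous_total)
```

In this toy case the only synonymous change that creates a stop codon in the alternative frame is a TAC to TAT switch in a Tyr codon, the reverse of the TAT to TAC change noted earlier as having removed a PTC.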

The RFeSP intron2 evolved early during Diptera divergence

The position of intron2 in the D. melanogaster RFeSP locus (that is, inducing splicing at the aspartic acid, Asp158 codon of RFeSP) was essential for the origination of the novel RFeSP-RA transcript by alternative intron retention, so its evolution was investigated further. Molecular phylogenetic analyses of published genomes suggested that an equivalent to the D. melanogaster intron2 had been gained either in an ancestor of the Antliophora (monophyletic group comprising mecopteran lineages, Mecoptera, Siphonaptera and Diptera, which are commonly known as scorpionflies, fleas and true flies, respectively)27, or later in a dipteran ancestor, which would conservatively place the intron gain in the Permian (300 MYA) or Jurassic (200 MYA) era, respectively27,28 (Fig. 4a). To resolve between these possibilities, sampling was increased across Holometabola (insects with complete metamorphosis), focusing on Antliophora. Results confirmed that apart from the 12-genome reference Drosophila species29, the 16 additional Drosophila taxa sequenced in the present study also had the intron2 at Asp158. Furthermore, data from non-Drosophila species confirmed that the positioning of intron2 at Asp158, or an equivalently positioned aa (referred to as Asp158 hereafter for simplicity) in other species, was found exclusively in Diptera. Two lower dipteran taxa did not have any intron: the mosquito Culex pipiens and the crane fly Tipula sp. (Fig. 4a). Whereas the absence in C. pipiens is attributable to a secondary loss due to the presence of the intron in both Anopheles and Aedes mosquitos, the same is not certain for Tipula sp. (Supplementary Discussion). In addition, the sampled dipterans share the secondary loss of a nearby ancient intron localized at arginine Arg135, which is 70 nucleotides upstream of the Diptera intron2 at Asp158 (Fig. 4a). The simplest explanation for this finding is that the RFeSP locus underwent the loss of the intron at Arg135, 70 nucleotides upstream, and an independent intron gain at Asp158 at the time when an ancestor of most or all of the present day Diptera diverged from other Mecopterida (see Fig. 4b for possible scenarios). Therefore, the Asp158 intron has been stably positioned for at least 200 MY in the lineage that led to D. melanogaster27. Intron losses and gains, as well as their persistence, are generally considered to be evolutionarily conservative silent mutations, as they do not necessarily alter the aa-coding sequence30. We therefore interpret these results as evidence that stabilizing selection via purifying selection was functioning at the ancestral locus of the D. melanogaster RFeSP locus as the Asp158 intron was gained.

Not out of the blue encodes a mitochondrial protein

Five key events have been described herein that were essential for the origination of RFeSP-RA (for a scheme with events, see Fig. 5a). Namely, they were: the positioning of the RFeSP intron2 in an early dipteran ancestor at Asp158; the alternative open-reading frame evolution; the deletions within intron2; the dip in intron2b allele diversity; and the reiterated deletion intron2bΔ62. A simple interpretation of these successive mutations is that none of them are expected to have been strongly deleterious, or on the other hand to have been a direct cause of positive selection. That RFeSP-RA was generated by the accumulation of neutral or quasi-neutral mutations gives strong support to neutral theories of evolution.

A second prediction of the neutral theories of evolution would be that these mutations accumulated stochastically, because of demographical constraints. By following the evolutionary history of the RFeSP locus with high confidence for several MY, we determined that when a productive RFeSP-RA mRNA came about concomitantly with the intron2bΔ62 deletion, the codons that introduced the novel 102-aa C-terminal part of the RFeSP-PA protein were already set and sculpted by MY of reiterated selected nucleotide sequences that did not affect the RFeSP(-RB) product (Fig. 5b). This leads to the conclusion that the emergence of RFeSP-RA by the accumulation of neutral mutations cannot be explained by chance alone; natural selection is required to explain this origination. Hence, the novel RFeSP-RA gene was renamed as Not out of the blue (Noble). Noble alludes to the fact that its emergence was influenced by a nonrandom component. Also, it conveys a message about the putative function of its protein product. That is, by lacking a Rieske iron sulphur cluster domain, the Noble protein is likely to be chemically inert or inactive towards oxygen, just like 'Noble' metals (see Fig. 1a). The respiratory proficient RFeSP-RB gene is hereafter referred to as RFeSP.

Next, transgenic and targeted mutagenesis experiments were used to confirm that Noble was indeed translated into a protein in vivo (Fig. 6). In these experiments, the endogenous genome reference strain RFeSP locus (containing intron2bΔ62) was cloned, tagged C-terminally with TagRFP-T and expressed in Drosophila Schneider2 (S2) cells under the control of a Gal4-responsive promoter (Fig. 6a). Introducing a mutation within the intron of this construct that does not affect the coding sequence of RFeSP but changes Trp164 to a STOP codon within Noble (resulting in NobleW164STOP) completely impeded Noble-TagRFP-T production (Fig. 6a). The Noble-TagRFP-T gene fusion localized subcellularly to cytoplasmic dots (Fig. 5b), which were entirely eliminated in the Trp164-STOP mutant, confirming that TagRFP-T fluorescence originates from the full-length Noble protein product fusion (Fig. 6b). The complete elimination of splicing from the intron2bΔ62 locus by targeted mutagenesis (resulting in NobleOPT, for optimal) exclusively produced Noble-TagRFP-T mRNA (Supplementary Fig. S7) and full-length NobleOPT-TagRFP-T protein (Fig. 6a).

The subcellular localization of the fusion proteins was determined in vivo with higher resolution in third instar larvae salivary gland cells using well-characterized fluorescently tagged markers. Noble-TagRFP-T tightly associated with mitochondrial markers, but not with other organelles (Fig. 7). In the mitochondria, there was marked heterogeneity in the proportions of Mito-GFP and Noble-TagRFP-T (Fig. 7a–e), suggesting mitochondrial dynamics. The same was found in salivary glands of flies expressing Noble-TagRFP-T together with a Mito-YFP reporter (Fig. 7f–o), confirming the close association between Noble and mitochondria. Additional experiments with an amino (N)-terminally tagged RFeSP (intron2bΔ62) locus construct suggested that Noble, like RFeSP, is N-terminally processed and requires an intact N-terminus to reach the mitochondria (Supplementary Fig. S8).

Discussion

Here, a systematic dissection of the evolutionary processes behind the origination of a novel protein-coding sequence has been conducted. Noble's emergence is partially analogous to non-deleterious frameshift-derived gene origins31, which have long been hypothesized as an important window for the generation of genetic novelty32,33. Indeed, similar gene arrangements to the RFeSP/Noble locus have been reported in the literature31,33,34,35. In some cases, such as with the relatively new p19ARF tumour suppressor, which is encoded on the alternative reading frame of the more conserved INK4a tumour suppressor36, the newest protein component of the locus has clearly integrated into molecular pathways and assumed important functions. In the case of the RFeSP/Noble pair, one can assume that although Noble carries the information and appears to be stable enough to accumulate in the mitochondria, it could not participate positively in mitochondrial respiration because it lacks the smaller iron–sulphur domain, which is found only in RFeSP (Fig. 1a). This property hints at a possible regulatory function of Noble on mitochondrial respiration, whereby Noble could directly antagonize RFeSP function. Considering this hypothetical scenario, the finding that Noble emerged by alternative splicing opens up the possibility that the evolution of this protein diversifying process is tightly linked to the abrupt origination of fine-tuned regulatory protein networks.

We showed that the 102 codons encoding the C-terminus of Noble emerged de novo in a single step from non-coding DNA by a deletion that induced alternative retention of the second intron of the RFeSP locus. Thus, apart from arising through gradual descent from previously duplicated expressed genetic units, the emergence of Noble demonstrates that new domain-sized protein stretches may form in the absence of expressed and/or functional transitional forms, in what appears to the eyes of the observer as a molecular 'leap'; as if it were out of the blue.

Our analyses showed that the non-coding sequences that were used for the generation of Noble had been shaped by the accumulation of nearly neutral mutations at a strongly negatively selected locus, RFeSP, through hundreds of millions of years, probably since RFeSP gained intron2 at Asp158 very early during Diptera evolution (Supplementary Discussion). As neutral or nearly neutral mutations are only a minor subset of the mutations expected to have occurred at this locus, it can be concluded that the origination of Noble was biased by selection, and was therefore not random. This can be contrasted with an eventual origination at a more neutrally evolving locus such as at a pseudogene or duplicated gene, in which most mutations (at least initially for the latter)37 should have an equal probability of fixation. The mechanisms behind the generation of Noble can explain how a locus can paradoxically diversify and increase the protein repertoire while maintaining ancestral states under strong negative selection without gene duplication, such as during the evolution of alternative splicing. This might provide a rationale to explore the different constraints imposed on the evolution of genes by gene duplication and alternative splicing38. It is also tempting to suggest that these findings could shed light onto instances in which de novo protein stretches probably had to originate under highly constrained situations of negative selection, such as during the ab initio protein diversification in early living organisms39.

Methods

Drosophila strains and other insect samples

Drosophila flies were raised and crossed at 25 °C. Isofemale lines were established by R.C.W. from wild Drosophila melanogaster lines caught in Ohio, USA. Other insect samples were collected and classified by M.F.W. or A.M.G. and stored in absolute ethanol at −80 °C. A list of the Drosophila lines and the non-Drosophilinae insects used in our study can be found in Supplementary Tables S1 and S4, respectively.

PCR and reverse transcriptase–PCR

Drosophila samples were stored in RNAlater TissueProtect Tubes (Qiagen, catalogue #76154). Genomic DNA was routinely extracted from one male and one female adult per Drosophila line or from parts, or whole individuals, for the other insects using the DNeasy Blood and Tissue kit (Qiagen, catalogue #69506). RNA was isolated with Trizol Reagent (Invitrogen, catalogue #15596-026; larvae and adult flies and insect samples stored in EtOH) or with RNeasy Mini Kit (Qiagen, catalogue #74106; larvae), and subjected to double DNase digestion: RNase-free DNase set (Qiagen, catalogue #79254) and Turbo DNA-free (Ambion, catalogue #AM1907). cDNA was made with SuperScript First-Strand Synthesis System for RT–PCR (Invitrogen, catalogue #18080-051). A list with the primers used in this study is provided in Supplementary Table S5.

Sequence analyses and phylogeny

PCR products were cleaned with QIAquick PCR purification kit (Qiagen, catalogue #28106) or if necessary by gel extraction using a QIAquick Gel extraction kit (Qiagen, catalogue #28704). Products from degenerate PCRs were cloned by ligating 1 μl of PCR product with 50 ng of AccepTor Vector, pSTblue-1 vector (Novagen) using a Quick ligation kit (Biolabs, catalogue #M2200S). Subsequently, 1 μl of the ligation reaction was added directly to Novablue Singles Competent Cells (Novagen), which were transformed for 5 min on ice, 30 s at 42 °C and again 2 min on ice. Minipreps were performed with QIAprep Miniprep Kit (Qiagen, catalogue #27106), and cloned sequences were amplified with standard primers: T7 and SP6. Sequences were read and edited with MacVector. SNAP software was used to calculate dN/dS ratios21,40. Neutrality tests were performed using all intron2 sequences, or with each intron2 group (intron2a and intron2b) alone using Intrapop (by Guillaume Achaz)41, where 100,000 coalescence simulations were used to estimate the statistical significance of each test. Each deletion was counted as a single unique mutational event. For Figure 1d, the RFeSP loci were overlaid onto the well-accepted phylogeny of the melanogaster subgroup19,20,29,42,43. For Figure 4a, the presence or absence of the dipteran RFeSP intron2 at Asp158 was overlaid onto a consensus phylogeny extracted from several sources27,28,44,45. Tipula sp. was placed basal to both Culicomorpha and Psychodomorpha based on the apparent consensus between morphological and molecular data27,28,45,46,47. To establish P values for dN/dS estimates we used the pN and pS values (the proportion of nonsynonymous sites and synonymous sites, respectively), and applied Fisher's exact test, considering α=0.05 and pN=pS as the null hypothesis. To estimate the relative amount of possible mutable nucleotides in the RFeSP locus under negative selection, we made two assumptions. First, we assumed a conservative value that 80% of intronic sites are not selected for at the nucleotide level. Second, we assumed an equal probability of mutational hits happening between coding and non-coding neutral sites. We then calculated the average and standard deviation of the amount of possible synonymous sites on the surveyed region of RFeSP for 36 Diptera RFeSP homologues. Alternatively, we added to this amount the average proportion of nonsynonymous site substitutions that we found in five basal Diptera relative to D. melanogaster (taxa used, followed by calculated pN: Phlebotomus papatasi, 0.072; Lutzomyia longipalpis, 0.089; Anopheles gambiae, 0.059; Aedes aegypti, 0.060; Armigeres subalbatus, 0.0558; and Culex pipiens quinquefasciatus, 0.060). We then obtained the average and standard deviation of the latter sums (pN+average pS for each of the latter five taxa). The difference between these two estimates is to consider as neutral only the synonymous sites or to consider all observed nucleotide substitutions that have happened during the divergence of Diptera neutral, respectively. The latter includes the nonsynonymous changes, which should represent mostly aa changes that do not affect protein function. A list with the database sources of all sequences used in this study can be found in Supplementary Table S6. Other genome sequences were obtained from the UCSC Genome Browser (http://genome.ucsc.edu/). Recombination rates between RFeSP and Or22 were calculated with the Drosophila melanogaster recombination rate calculator48.

Nonsense-mediated decay assay in vivo

cDNA was produced from mRNA isolated from male larvae carrying different RFeSP intron2 genotypes as depicted in Figure 1c. Upf125G is X-linked and eliminates NMD in hemizygous males49, as confirmed by the retention of the larger transcript in RpS9, which served as a positive control50.

Transgenes and cloning

Transgenes are synthetic and fully sequenced (http://www.geneart.com). Complete sequences and full descriptions of the transgenes have been deposited in GenBank under references HQ161726-HQ161730. Transgenes consist of the endogenous reference genome strain RFeSP intron2bΔ62 locus (for these constructs we removed intron1 completely) under the control of a Gal4-responsive promoter. This was either tagged N-terminally or C-terminally with the bright jellyfish green fluorescent protein (GFP) derivative VisGreen51 or the bright TagRFP-T (a monomeric derivative of eqFP578 from the sea anemone Entacmaea quadricolor)52, respectively. VisGreen should label both proteins encoded by the intron2bΔ62 locus, RFeSP and Noble, because they share their N-termini. By contrast, TagRFP-T would exclusively label Noble because of its unique C-terminus. All transgenes were cloned into pUAST after digestion with EcoRI/NotI.

S2 cell transfection

S2 cells (Invitrogen, catalogue #10831-014) were maintained in Express Five SFM (Invitrogen, catalogue #10486-025), supplemented with L-glutamine (from 100× stock, LabClinics, catalogue #M11-004) and antibiotics (from 100× penicillin/streptomycin stock, Sigma, catalogue #P4333-100ML). The cells were grown in an air incubator at 25 °C without CO2. For transient transfections, 2 ml of Express Five SFM medium supplemented with 1× L-glutamine and containing 8×10^5 Drosophila S2 cells were plated into individual wells of 6-well plates. The DNA for transfection was maxi-prepped (NucleoBond Xtra Maxi kit, Macherey-Nagel, catalogue #740414.50). DNA concentrations were determined using the NanoDrop 1000 spectrophotometer (Thermo Scientific). For individual transfections, we used 2 μg of total DNA including pMT-Gal4 and one of the following plasmids: pUAST empty vector, pUAST-VisGreen-RFeSP/Noble, pUAST-Noble-TagRFP-T, pUAST-NobleW137STOP-TagRFP-T and pUAST-NobleOPT-TagRFP-T. The amount of each plasmid was adjusted to obtain equimolar concentrations. The cells were transfected using Cellfectin II Reagent (Invitrogen, catalogue #10362-100) according to the manufacturer's protocol using 100 μl Express Five SFM medium supplemented with L-glutamine and 8 μl Cellfectin II Reagent. The metallothionein promoter was induced 24 h after transfection by adding CuSO4 at 1.4 mM to the cells. Cells were lysed 24 h later (48 h after the start of transfection).
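The equimolar adjustment mentioned above can be illustrated with a short calculation. This is a sketch under stated assumptions, not the authors' worksheet: it assumes that the plasmids in one transfection mix are made equimolar to each other while the total mass stays fixed at 2 μg, and that the molar amount of double-stranded DNA scales with mass divided by length. The plasmid lengths and the function name are hypothetical placeholders, because the actual sizes are not given in the text.

```python
def equimolar_masses(total_ug, lengths_bp):
    """Split a fixed total DNA mass so that every plasmid is present in equal moles.

    For double-stranded DNA, moles are proportional to mass / length, so equal
    moles means each plasmid's mass is proportional to its length in bp.
    """
    total_bp = sum(lengths_bp.values())
    return {name: total_ug * bp / total_bp for name, bp in lengths_bp.items()}


# Hypothetical plasmid sizes in base pairs (assumed values for illustration).
plasmids = {"pMT-Gal4": 6000, "pUAST-construct": 10000}
masses = equimolar_masses(2.0, plasmids)

for name, mass in masses.items():
    print(f"{name}: {mass:.2f} ug")
```

With these assumed sizes the longer plasmid receives proportionally more mass, so both are present at the same number of molecules per well.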

Fly transformation

Transgenes were injected in w1118/yw embryos together with the helper plasmid Δ2–3 using standard P-element-mediated transformation procedures (BestGene). w+ transformant flies were backcrossed again to w1118/yw flies and balanced.

Immunofluorescence analysis

Transfections were performed exactly as described above (see 'S2 cell transfection'), except that 1 μg of total DNA was used. Cells on cover slips were fixed with 4% formaldehyde 24 h after the addition of CuSO4 (48 h after the start of transfection). Cells were then incubated in darkness with 4′,6-diamidino-2-phenylindole for 10 min. Slides were mounted in Vectashield (Vector Labs, catalogue #H-1000) and analysis was performed with an inverted confocal microscope (Laser Scanning Confocal Microscope TCS SP2 ADBS, Leica Microsystems, Heidelberg GmbH). For the in vivo experiments, transgenic flies carrying either pUAST-VisGreen-RFeSP/Noble or pUAST-Noble-TagRFP-T were crossed to ey-Gal4 (Gal4 under the eyeless enhancer, driving UAS-dependent transcription) in the presence of fluorescent reporters of subcellular organelles53,54 (Supplementary Table S1). Tissue-specific overexpression of the pUAST-VisGreen-RFeSP/Noble or pUAST-Noble-TagRFP-T constructs, either alone or together, had no detectable effect on developing eye imaginal discs or salivary glands. Salivary glands from wandering third instar larvae were dissected in PBS and processed as described above.

Author contributions

A.M.G. coordinated the study, conceived the ideas, designed the experiments, collected and classified wild dipteran species, analysed the data and wrote the paper. V.M. performed in vitro experiments. R.C.W. collected and established D. melanogaster isofemale lines. M.F.W. collected and classified wild dipteran and non-dipteran biological samples. M.D. provided laboratory space, essential support, funding and resources in all steps of the study, and helped to write the paper. V.M., M.F.W., R.C.W. and M.D. discussed the results and implications and commented on and edited the manuscript.

Acknowledgments

We thank I. Gutierrez, F. Heredia, E. Ballesta and E. Caparros for experimental assistance, B. Lakowski, P. Capy, S. Ralph and F. Heredia for comments on the manuscript, the Bloomington and the UCSD Drosophila Stock Centers, B. Wiegmann, M. Raupach, C. Schlotterer, E. Loreto, M. Krasnow, M. Metzstein, S. Brenner, L. Lareau, A. Parks and K. Matthews for fly stocks, reagents and information, and members of the M. Dominguez lab for discussions. A.M.G. received support from the European Commission under a Marie Curie Intra-European Fellowship for Career Development, V.M. from the Generalitat Valenciana, M.F.W. from NSF EF-0531665 and R.C.W. from a BGSU Faculty Research Committee Research Incentive Grant. The work in the laboratory of M.D. is funded by the Ministerio de Ciencia e Innovación (BFU2009-09074 and MEC-CONSOLIDER CSD2007-00023), the Fundación Marcelino Botin (FMB), Generalitat Valenciana (PROMETEO 2008/134) and a European Union Research Grant UE-HEALH-F2-2008-201666. The contents of this article reflect only the authors' views and not the views of the European Commission.

References

Long M., Betrán E., Thornton K. & Wang W. The origin of new genes: glimpses from the young and old. Nat. Rev. Genet. 4, 865–875 (2003).