I am working with a student on the D. eugracilis dot chromosome, contig 50. We are annotating the RhoGAP102A gene, and she has successfully annotated the C and E isoforms. The D and F isoforms have additional exons that are giving us trouble. There are two issues:
1) In D. melanogaster, exons 13 (in isoform D) and 14 (in isoform F) overlap, with 47 common amino acids. Exon 13 has additional amino acids at the 5' end of the exon. BLAST searches with both exons return the coordinates 40255-40338 (frame +1) which covers 27 of the 47 shared amino acids. There is a splice acceptor at 40254 that is in phase with the splice donor in Exon 12. Upstream of this splice site, there are stop codons in all three frames, which suggest the conserved amino acids that are upstream in D. melanogaster do not exist in D. eugracilis. However, the RNA-Seq results show expression of the region of contig 50 that is 5' to the 40254 coordinate. Just doing a BLAST search with the 5' half of the sequence of Exon 13 (the longer of the two overlapping exons) does not generate any significant alignment, suggesting that part of this exon has been lost in D. eugracilis, which would make Exon 13 and Exon 14 identical, and thus make the CDS of isoforms D and F identical. What evidence should we include to make the conclusion that part of these exons is not present in D. eugracilis? Or is there evidence we should consider to show that there may be a sequencing error that has made it difficult to find the remainder of this exon?

Contig50Exons_12_13_14.jpg (76.64 KiB) Viewed 1208 times

2) Exon 15 (also present in isoforms D and F) does not seem to be present in D. eugracilis. At first, we thought it may be present on the next contig, but a BLAST search in FlyBase does not find any significant matches to Exon 15 in eugracilis (even with a low-stringency search). Also, in melanogaster, exons 12 through 15 are all present in a regions of 606 nucleotides in the genomic sequence. A similar distance in eugracilis is still well within contig 50. I tried to perform the BLAT protocol described in the "Not all of the gene in the contig?" post from March 2015, but the BLAT server responded with an error message. The lack of Exon 15 means there is no stop codon present in these two isoforms, though the Exon 13/14 described above could be read through to a stop codon at 40368-71. This would be the second significant deviation from the D. melanogaster gene structure, and thus, concerns me. What other evidence should we examine to determine if Exon 15 is or is not present in eugracilis?

> What evidence should we include to make the conclusion that part of these exons is not present in D. eugracilis?

The strongest evidence which indicates that the 3' end of RhoGAP102A has changed in D. eugracilis is the D. eugracilis RNA-Seq data from the adult males, adult females, and mixed embryos samples. In addition, tblastn searches of CDS 13_12205_1 and CDS 15_12205_2 against contig50 show that these CDS are only conserved in the species within the melanogaster subgroup. Hence the tblastn search results indicate that the orthologous CDS might not exist in D. eugracilis.

First, the "RNA-Seq Alignment Summary" track shows evidence of transcription that extends from CDS 12_12205_1 to beyond the stop codon at 40136-40138 in frame +2 (see the purple arrow and purple box in the screenshot below). Hence the RNA-Seq data suggests a novel isoform of RhoGAP102A in D. eugracilis where the final CDS of the isoform spans from 40027-40138 (yellow box).

Deug1_RhoGAP102A_3prime_end.png (404.25 KiB) Viewed 1198 times

The TopHat splice junctions from the three RNA-Seq samples placed the splice donor site for CDS 12_12205_1 at 40105-40106. This splice donor site is in phase 2 relative to frame +2. This means that the splice acceptor site for the next CDS must be in phase 1.

The TopHat junctions indicate that there are two potential splice acceptor sites for the next CDS (at 40152-40153 and 40160-40161, respectively). If the CDS were to begin at 40154 (consistent with the splice junction JUNC00004036 in the mixed embryos sample; red arrow), then the CDS must be in frame +3. This means that translation of this CDS would end at 40175. If the CDS were to begin at 40162 (consistent with the TopHat junction JUNC00004037 in the mixed embryos sample; brown arrow), then the CDS must be in frame +2 and the translation would end at the stop codon at 40246.

In order to account for the blastx matches to the ends of CDS 13_12205_1 and 14_12205_1 in frame +1, we would further propose a terminal CDS that spans from 40263-40371 (blue box).

To verify that the change in the reading frame from frame +2 to frame +1 at ~40245 is not due to a consensus error, we can compare the Illumina RNA-Seq reads against the consensus sequence. Examination of the RNA-Seq read alignments in the "D. eugracilis RNA-Seq" track did not show any consistent discrepancies between the Illumina reads and the consensus sequence. Hence the change in the reading frame is unlikely to be caused by an error in the consensus.

To gather more evidence in support of the proposed change in the gene structure, we can perform a tblastn search of CDS 13_12205_1 against the "Genome Assembly" databases of 22 Drosophila species. The tblastn search result (with significance threshold of 1E-2), show that this CDS is only conserved in the Drosophila species within the melanogaster subgroup (i.e., in D. melanogaster, D. sechellia, D. simulans, D. erecta, and D. yakuba).

Similarly, a tblastn search of CDS 15_12205_2 against the "Genome Assembly" databases at FlyBase shows that this CDS is only conserved in the species in the melanogaster subgroup, as well as in D. elegans, and D. ficusphila.

> At first, we thought it may be present on the next contig, but a BLAST search in FlyBase does not find any significant matches to [CDS] 15

The position of the D. eugracilis annotation projects are available through the "Annotation Projects" track on the D. eugracilis "April 2013 (BCM-HGSC/Deug_2.0)" genome assembly. You can search for the gene (RhoGAP102A) or the project (Deug1_contig50) in order to navigate to the region of interests. In this case, the region with high RNA-Seq coverage downstream of contig50 is likely to be the 3' UTR of the NfI gene.