A method for identifying alternative or cryptic donor splice sites within gene and mRNA sequences. Comparisons among sequences from vertebrates, echinoderms and other groups.

Buckley KM, Florea LD, Smith LC - BMC Genomics (2009)

Bottom Line:
The Purpuratus model was shown to predict splice signals better than a similar model created from vertebrate sequences. Although the Purpuratus model was able to correctly predict the true splice sites within the 185/333 genes, no evidence for alternative or trans-gene splicing was observed. Furthermore, alternative or trans-gene splicing does not appear to be acting as a diversification mechanism in the 185/333 gene family.

Background: As the amount of genome sequencing data grows, so does the problem of computational gene identification, and in particular, of identifying the splicing signals that flank exon borders. Traditional methods for identifying splicing signals have been created and optimized using sequences from model organisms, mostly vertebrate and yeast species. However, as genome sequencing extends across the animal kingdom and includes various invertebrate species, the need for methods that recognize splice signals in these organisms increases as well. With that aim in mind, we generated a model for identifying donor and acceptor splice sites that was optimized using sequences from the purple sea urchin, Strongylocentrotus purpuratus. This model was then used to assess the possibility of alternative or cryptic splicing within the highly variable immune response gene family known as 185/333.

Results: A donor splice site model was generated from S. purpuratus sequences that incorporates non-adjacent dependences among positions within the 9 nt splice signal and uses position weight matrices to determine the probability that the site is used for splicing. The Purpuratus model was shown to predict splice signals better than a similar model created from vertebrate sequences. Although the Purpuratus model was able to correctly predict the true splice sites within the 185/333 genes, no evidence for alternative or trans-gene splicing was observed.
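The position weight matrix step described above can be illustrated with a minimal sketch. This is not the published Purpuratus model: it scores each of the 9 positions independently and omits the non-adjacent dependences the authors incorporate. The training sites, the pseudocount, the uniform background frequency, and the function names `build_pwm` and `log_odds_score` are all hypothetical choices made for illustration.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep a nonzero probability.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm

def log_odds_score(pwm, site, background=0.25):
    """Sum of log2(P(base at position) / background) over all positions of the site."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))

# Hypothetical 9 nt donor signals (3 exonic nt followed by 6 intronic nt); not data from the paper.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
toy_pwm = build_pwm(toy_sites)

consensus_score = log_odds_score(toy_pwm, "CAGGTAAGT")
unrelated_score = log_odds_score(toy_pwm, "TTTCCCTTT")
```

In this toy setup a consensus-like 9-mer receives a positive log-odds score and an unrelated 9-mer a negative one. A model that captures dependences between positions would condition some of these per-position probabilities on the bases observed elsewhere in the signal.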

Conclusion: The data presented herein describe the first published analyses of echinoderm splice sites and suggest that the previous methods of identifying splice signals that are based largely on vertebrate sequences may be insufficient. Furthermore, alternative or trans-gene splicing does not appear to be acting as a diversification mechanism in the 185/333 gene family.

Figure 3: Histograms to evaluate the models. Genes isolated from S. purpuratus (circles), vertebrates (diamonds), and protostomes (triangles) were collected and analyzed using the Purpuratus (A) and Vertebrate (B) models. Histograms of the known positive (solid lines) and negative (dashed lines) donor splice sites were generated (bin size = 2). The average of the means (Table 3) is shown by a vertical dotted line. Values corresponding to N0.95 and P0.05 (Table 3) flank the left and right sides of the gray region, respectively, and are shown as dashed/dotted lines. The tables within the graphs indicate the percentages of known positive (Pos.) and negative (Neg.) S. purpuratus (Purp.), vertebrate (Vert.), and protostome (Prot.) sequences that were classified as positive or negative using the average of the means as the threshold.

Mentions:
The Purpuratus and Vertebrate donor splice models were used to evaluate an independent set of annotated S. purpuratus genes (Table 2). The frequencies with which the known positive and negative donor sites were predicted correctly were calculated using a 2 × 2 contingency table. The resulting numbers of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) sites were used to calculate six measures of model accuracy. The Purpuratus model outperformed the Vertebrate model in five of the six assessments (Table 4; Figure 3). Most notably, the specificity (Sp) of the Purpuratus model was higher than that of the Vertebrate model, indicating that a larger share of its predicted positive sites were likely to be functional donor sites. The sensitivities (Sn) of the two models were approximately the same (Table 4). For each of the four assessments that combined Sp and Sn values [correlation coefficient (CC), simple matching coefficient (SMC), average conditional probability (ACP), and approximate correlation (AC)], the Purpuratus model scored higher than the Vertebrate model (Table 4). Thus, when evaluating sea urchin sequences, the Purpuratus donor splice site model more accurately predicted whether or not a given sequence could be used as a splice signal.
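The six accuracy measures can be sketched from the four cells of the contingency table. The formulas below are an assumption: they follow the definitions commonly used when evaluating gene-prediction programs, in which Sp is the fraction of predicted positives that are true (consistent with how Sp is interpreted above), and the excerpt itself does not spell them out. The function name `accuracy_measures` and the example counts are hypothetical, not values from Table 4, and the sketch assumes all four marginal totals are nonzero.

```python
import math

def accuracy_measures(tp, fp, tn, fn):
    """Six accuracy measures from a 2 x 2 contingency table (assumed definitions)."""
    sn = tp / (tp + fn)                       # sensitivity: true sites that were predicted
    sp = tp / (tp + fp)                       # specificity: predicted sites that are true
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )                                         # correlation coefficient
    smc = (tp + tn) / (tp + fp + tn + fn)     # simple matching coefficient
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )                                         # average conditional probability
    ac = (acp - 0.5) * 2                      # approximate correlation
    return {"Sn": sn, "Sp": sp, "CC": cc, "SMC": smc, "ACP": acp, "AC": ac}

# Hypothetical counts for illustration only.
measures = accuracy_measures(tp=40, fp=10, tn=45, fn=5)
```

With two models evaluated on the same set of sites, calling `accuracy_measures` once per model and comparing the resulting dictionaries entry by entry gives the kind of side-by-side comparison summarized in Table 4.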
