Bottom Line:
Comparison/alignment of RNA molecules provides an effective means to predict their functions and understand their evolutionary relationships. Based on ESA, a rigorous mathematical framework can be built for RNA structure comparison. Means and covariances of full structures can be defined and computed, and probability distributions on spaces of such structures can be constructed for a group of RNAs.

Abstract:
The functions of RNAs, like those of proteins, are determined by their structures, which, in turn, are determined by their sequences. Comparison/alignment of RNA molecules provides an effective means to predict their functions and understand their evolutionary relationships. For RNA sequence alignment, most methods developed for protein and DNA sequence alignment can be directly applied. RNA three-dimensional (3D) structure alignment, on the other hand, tends to be more difficult than protein structure alignment because RNAs lack the regular secondary structures observed in proteins. Most existing RNA 3D structure alignment methods use only the backbone geometry and ignore the sequence information. Using both the sequence and backbone geometry information in RNA alignment may not only produce more accurate classification, but also deepen our understanding of the sequence-structure-function relationship of RNA molecules. In this study, we developed a new RNA alignment method based on elastic shape analysis (ESA). ESA treats RNA structures as three-dimensional curves with sequence information encoded on additional dimensions, so that the alignment can be performed in the joint sequence-structure space. The similarity between two RNA molecules is quantified by a formal distance, the geodesic distance. Based on ESA, a rigorous mathematical framework can be built for RNA structure comparison. Means and covariances of full structures can be defined and computed, and probability distributions on spaces of such structures can be constructed for a group of RNAs. Our method was further applied to predict functions of RNA molecules and showed superior performance compared with previous methods when tested on benchmark datasets. The programs are available at http://stat.fsu.edu/~jinfeng/ESA.html.
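
The idea of encoding sequence information on additional dimensions can be illustrated with a small sketch. The abstract does not say how the authors encode nucleotides, so the one-hot encoding, the alphabet ordering, and the function name `joint_curve` below are assumptions made purely for illustration. The sketch only builds the augmented curve; it does not perform the elastic alignment or compute the geodesic distance.

```python
# Illustrative sketch only: the encoding below is an assumption,
# not the representation used by the authors.

NUCLEOTIDES = "ACGU"  # assumed alphabet and ordering


def joint_curve(coords, sequence, lam):
    """Append a lam-weighted one-hot nucleotide code to each 3D point.

    coords   : list of (x, y, z) backbone coordinates, one per residue
    sequence : string over NUCLEOTIDES, same length as coords
    lam      : weight controlling the influence of sequence information
    Returns a list of 7-dimensional points (3 spatial + 4 sequence).
    """
    if len(coords) != len(sequence):
        raise ValueError("need one coordinate triple per residue")
    curve = []
    for (x, y, z), base in zip(coords, sequence):
        onehot = [0.0] * len(NUCLEOTIDES)
        # str.index raises ValueError for bases outside the assumed alphabet
        onehot[NUCLEOTIDES.index(base)] = 1.0
        point = [float(x), float(y), float(z)] + [lam * v for v in onehot]
        curve.append(point)
    return curve


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    toy = joint_curve([(0, 0, 0), (1, 0, 0), (2, 1, 0)], "AUG", lam=5.0)
    for point in toy:
        print(point)
```

Under this assumed encoding, setting `lam = 0` makes the extra coordinates vanish, so only the backbone geometry contributes, while larger values give sequence differences more weight relative to spatial differences.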

Figure 6 (gkt187-F6): AUC for selected values of λ using the full FSCOR dataset. The selected λ values are 0, 1, 2, …, 10, 20, 30, …, 70 (0 to 10 with a step size of 1 and 10 to 70 with a step size of 10), plus sequence-only alignment.

Mentions:
When we combine sequence and structure information, we introduce a tunable weight parameter, λ, that controls the relative influence of sequence information. To search for the range of plausible values of λ, we randomly sample 80% of the RNA structures with <200 residues in the FSCOR dataset and use leave-one-out cross validation (LOOCV) to find the optimal λ among selected values between 0 and 70. Specifically, 369 of the 419 structures have <200 residues. This subset was selected mainly to save computational time, since some of the remaining structures are very large. In LOOCV, we compute the area under the ROC curve (AUC) and use this criterion for performance evaluation throughout this article. The randomization was performed 10 times to assess the robustness of the estimated optimal λ values across different sets of RNA structures. The AUC values obtained with different λ in this experiment are plotted in Figure 5. In addition to this set of λ values, we also performed RNA comparison using sequence information alone; in that case, equation A.1 is replaced by equation (2). The AUCs for different λ on the full FSCOR dataset are shown in Figure 6. Both results show that values of λ around 5 tend to give the best AUCs, and this is the value we use for subsequent performance evaluation on the whole FSCOR dataset. It is worth noting that both structure-only and sequence-only alignment give worse classification performance, showing that alignment in the joint sequence–structure space indeed performs better than using either sequence or structure information alone.
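
The tuning procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The text does not say how AUC is computed within LOOCV, so the sketch assumes each left-out structure is compared with all others, and each pair is scored by its negative distance and labelled positive when the two structures share a functional class. The function names and the callable `dist_for_lambda`, which is expected to return a pairwise distance matrix for a given λ, are hypothetical.

```python
import random

# Grid from the text: 0 to 10 in steps of 1, then 20 to 70 in steps of 10.
LAMBDA_GRID = list(range(0, 11)) + list(range(20, 71, 10))


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def loocv_auc(dist, classes):
    """Leave each structure out in turn and score it against all others.

    Assumption: a pair is positive when both structures share a class,
    and a smaller distance means a higher score.
    """
    scores, labels = [], []
    for i in range(len(classes)):
        for j in range(len(classes)):
            if i != j:
                scores.append(-dist[i][j])
                labels.append(classes[i] == classes[j])
    return auc(scores, labels)


def select_lambda(dist_for_lambda, classes, grid=LAMBDA_GRID):
    """Return the grid value with the highest LOOCV AUC, plus all AUCs."""
    results = {lam: loocv_auc(dist_for_lambda(lam), classes) for lam in grid}
    best = max(results, key=results.get)
    return best, results


def repeated_selection(dist_for_lambda, classes, repeats=10, fraction=0.8, seed=0):
    """Repeat the selection on random subsets to check robustness."""
    rng = random.Random(seed)
    n = len(classes)
    picks = []
    for _ in range(repeats):
        idx = sorted(rng.sample(range(n), int(fraction * n)))
        sub_classes = [classes[i] for i in idx]

        def sub_dist(lam, idx=idx):
            full = dist_for_lambda(lam)
            return [[full[i][j] for j in idx] for i in idx]

        best, _ = select_lambda(sub_dist, sub_classes)
        picks.append(best)
    return picks
```

In practice, `dist_for_lambda` would wrap whatever routine produces the pairwise geodesic distances for a given λ. Caching those matrices is advisable, since `repeated_selection` requests the same λ values on every repeat.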
