Abstract

Numerous experiments and analyses of RNA structures have revealed that distinct local structures correlate closely with biological function. In this study, we present a data mining approach to discover such unusual folding regions (UFRs) in genome sequences. Our approach is a three-step procedure. In the first step, the degree to which a local structure in a genomic sequence differs from a random folding is evaluated by two z-scores of the local segment, the significance score (SIGSCR) and the stability score (STBSCR). The two scores are computed by sliding a fixed-size window along the sequence, one base at a time, from the start to the end position. Next, based on non-central Student's t distribution theory, we derive a linearly transformed non-central Student's t distribution (LTNSTD) to describe the distribution of SIGSCR and STBSCR computed in the sequence. In the third step, we extract from the sequence the significant UFRs whose SIGSCR and/or STBSCR are greater or less than a given threshold calculated from the derived LTNSTD. We successfully apply the approach to the complete genome of Mycoplasma genitalium (M. gen) and discover these statistical extremes in the genome. Comparison with the two scores computed from randomly shuffled sequences of the entire M. gen genome demonstrates that the UFRs in the M. gen sequence do not occur by chance. These UFRs may point to an important structural role encoded in their sequence information.

Introduction

Complete genomic sequence data are being accumulated at an unprecedented pace. A wide variety of computational methods for analyzing genomic sequences have been developed [1] and [2]. Most of the problems these methods address are essentially statistical. Computational analysis of distinct sequence patterns can help us understand the structure and function of genomic sequences. The discovery of biological knowledge from sequence data consisting of the bases A, C, G and T/U in biological databases, such as GenBank, is especially important in the post-genomic era.
RNA is a single-stranded, conformationally polymorphic macromolecule whose nucleotide sequence is identical to that of one of the DNA strands, except that T is replaced by U. An RNA sequence often folds back on itself between complementary segments to form various local structures guided by Watson–Crick rules. In addition to the Watson–Crick A–U and G–C base pairs, wobble G–U base pairs also contribute to the thermodynamic stability of an RNA structure. It has been demonstrated that some structures folded by local RNA segments are functional elements that control gene regulation at different levels [3] and [4]. These functional elements are often closely associated with unusual folding regions (UFRs), in which the folding free energy is significantly lower than that expected by chance [5], [6], [7], [8], [9], [10], [11] and [12]. The development of an efficient data mining approach to extract these potentially functional structured elements from sequence databases is highly desirable.
Discovering functional structured elements in a genomic sequence is an important step toward our goal of turning genome data into biological knowledge. The thermodynamic stability of an RNA/DNA fragment in the genome is often measured by the free energy of formation of the folded RNA/DNA segment. Based on accumulated data [3], [4] and [13], UFRs in an RNA sequence are assessed by two z-scores, the significance score (SIGSCR) and the stability score (STBSCR) [13] and [14]. SIGSCR measures the difference in thermodynamic stability between a local, natural RNA fragment and the average of its randomly shuffled sequences. Similarly, STBSCR measures the difference in stability between a specific fragment at a given position and the average of all other fragments of the same size in the sequence. As an example of our data mining, we analyze the complete genome sequence of Mycoplasma genitalium (M. gen).
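The two z-scores described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the window size and the number of shuffles are all hypothetical, and `toy_energy` is only a stand-in for a real folding free energy (it returns minus the maximum number of nested A-U, G-C and G-U pairs found by a Nussinov-style recursion, not a thermodynamic energy).

```python
import random
import statistics

# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_energy(seg, min_loop=3):
    """ASSUMPTION: crude stand-in for folding free energy.

    Returns minus the maximum number of nested base pairs (Nussinov-style
    dynamic programming), so lower values mean "more stable".
    """
    n = len(seg)
    if n == 0:
        return 0.0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: base j stays unpaired
            for k in range(i, j - min_loop):  # case: base k pairs with base j
                if (seg[k], seg[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return -float(best[0][n - 1])


def zscore(value, sample):
    """Standard score of value against sample; 0.0 if it is undefined."""
    if len(sample) < 2:
        return 0.0
    sd = statistics.pstdev(sample)
    return 0.0 if sd == 0 else (value - statistics.mean(sample)) / sd


def sigscr(seg, n_shuffles=50, seed=0):
    """Native segment versus random shuffles of the same bases."""
    rng = random.Random(seed)
    bases = list(seg)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        shuffled.append(toy_energy("".join(bases)))
    return zscore(toy_energy(seg), shuffled)


def window_energies(seq, window):
    """Energy of every fixed-size window, moving one base at a time."""
    return [toy_energy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


def stbscr_all(energies):
    """Each window versus all other windows of the same size."""
    return [zscore(e, energies[:i] + energies[i + 1:])
            for i, e in enumerate(energies)]


if __name__ == "__main__":
    demo = "GGGAAAACCCUUAGCUAGGCUAACGGAUUCCGAUAGC"  # made-up sequence
    energies = window_energies(demo, window=12)
    for start, z in enumerate(stbscr_all(energies)):
        print(start, round(z, 2), round(sigscr(demo[start:start + 12]), 2))
```

With a real energy model in place of `toy_energy`, strongly negative scores would flag windows that are more stable than their shuffled versions (SIGSCR) or than the rest of the sequence (STBSCR).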
Our data mining approach consists of three steps. In the first step, we compute SIGSCR and STBSCR by sliding a fixed-size window along the sequence, one base at a time, from the start to the end position. Our statistical analysis shows that the distributions of the two z-scores in the sequence do not follow a simple normal distribution. To obtain useful information from the extraordinarily large number of sample observations in the analysis, we need a reliable statistical model to describe the distributions of the two z-scores. In the second step, we use non-central Student's t distribution theory [15] to develop a linearly transformed non-central Student's t statistical model that describes the distributions of SIGSCR and STBSCR in the entire genomic sequence. Statistical tests show that the linearly transformed non-central Student's t distribution (LTNSTD) is a good statistical model for the distributions of the two scores computed in the genome. In the last step, the significant UFRs that are either much more stable or much less stable than expected by chance are discovered based on the derived, well-fitted LTNSTD.
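For orientation, the generic form of a linearly transformed non-central t variable can be written as follows. This is a sketch in assumed notation, not the parameterization derived in this study: the location a, the scale b, the degrees of freedom ν, the non-centrality parameter δ and the tail probability α are symbols introduced here for illustration.

```latex
\begin{aligned}
T &\sim t_{\nu}(\delta), \qquad X = a + bT, \quad b > 0, \\
f_X(x) &= \frac{1}{b}\, f_T\!\left(\frac{x - a}{b}\right), \\
x_{\mathrm{low}} &= a + b\, t_{\alpha/2}(\nu, \delta), \qquad
x_{\mathrm{high}} = a + b\, t_{1-\alpha/2}(\nu, \delta).
\end{aligned}
```

Here f_T is the density of the non-central t distribution and t_p(ν, δ) is its p-quantile. Under these assumptions, a window whose score falls below x_low or above x_high would be reported as a statistical extreme.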
For comparison, we also compute the distributions of SIGSCR and STBSCR in randomly shuffled sequences of the complete M. gen genome. Our results further demonstrate that the statistical extremes of UFRs in M. gen do not occur by chance. The UFRs in the genome may point to biological functions of the primary sequence and provide useful information for further searches for functional structured elements involved in the control of regulatory genes [5], [6], [7], [8], [9], [10], [11] and [12].

Conclusion

In this study, we present a data mining approach to discover UFRs in the M. gen genome sequence. In the first step of the approach, we calculate the two z-scores, SIGSCR and STBSCR, along the sequence. Next, we derive an LTNSTD statistical model to describe the distributions of the two scores in the M. gen sequence. Finally, based on the derived LTNSTDs, we discover the UFRs in M. gen whose SIGSCR and STBSCR values deviate significantly from their sample means. The approach is generally applicable to other genomes. For instance, we also computed the two scores in other microbial genomes, such as Helicobacter pylori strains 26695 and J99, and Mycoplasma pneumoniae. The distributions of the two scores in these sequences are well represented by an LTNSTD. Statistical extremes of UFRs can be confidently assessed based on the derived, theoretical LTNSTD. The precise locations of these UFRs can be further inferred by an extended search (SEGFOLD), in which the window size is systematically varied within the corresponding extended regions. The UFRs detected in M. gen and the other genomes are candidate sites for further experimental study in the search for gene regulatory elements and potential target sequences of long-chain antisense RNAs. Our data mining approach to genomic sequences is particularly useful for antisense RNA therapeutics and for targeting RNA-binding drugs against pathogenic bacteria.