Rényi entropic profiles of DNA sequences and statistical significance of motifs

Susana Vinga (a,b), Jonas S Almeida (a,c)
svinga@itqb.unl.pt
http://bioinformatics.musc.edu/renyi

(a) Biomathematics Group, ITQB/UNL, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, Oeiras, Portugal
(b) INESC-ID, Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento, Lisboa, Portugal
(c) Dept. Biostatistics, Bioinformatics and Epidemiology, Medical Univ. South Carolina, Charleston SC 29425, USA

1. Abstract

In a recent report [1] the authors presented a new measure of Rényi continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition explored there was based on the Rényi entropy of the probability density function (pdf) estimated with the Parzen window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). This work extends those concepts of continuous entropy by defining DNA sequence entropic profiles from the pdf estimates obtained. These profiles are applied to a dataset of artificial and real DNA sequences, and a new fractal kernel function, better suited to the estimation, is explored in place of the Gaussian functions previously used. This work shows that the entropic profiles are directly related to the statistical significance of motifs, allowing the study of under- and over-representation of substrings. Furthermore, by spanning the parameters of the fractal kernel function, it is possible to extract important information about the scale of each DNA region, which may have future applications in the recognition of biologically significant segments of the genome.

Keywords: Rényi entropy, DNA, Information Theory, kernel functions, CGR/USM.

2. Methods and Algorithms

2.1 CGR/USM representation of DNA

Chaos Game Representation/Universal Sequence Map (CGR/USM) maps discrete sequences onto continuous maps. The CGR/USM mapping of an N-length DNA sequence is built iteratively:

Each point x_i corresponds to one symbol in its context.
Each iteration goes half the distance towards the corner representing the next symbol.
Suffix property: strings ending in a specific suffix fall in the sub-square labeled with that suffix.

[Figure: 2D CGR/USM representation of DNA, a square with corners labeled A, T, C and G]

2.2 Rényi continuous entropy of DNA sequences

The DNA entropy is defined from the CGR/USM coordinates and the Parzen method, with one parameter: the variance of the Gaussian function used. The resulting expression depends on all pairwise squared Euclidean distances between the CGR/USM coordinates x_i.

Simplification: the integral becomes a sum, because the convolution of two Gaussians is Gaussian.

[Figure: CGR/USM estimation]

3. Results

[Table: DNA test set]

Rényi continuous quadratic entropy for the DNA sequence dataset

[Figure: Representation of entropies for the dataset described in the Table above as a function of the logarithm of the Gaussian kernel variance used in the Parzen method.] The lower the value of entropy H_2, the less random, or more structured, the sequence is. The graph has theoretically demonstrated asymptotes for [...] given by line [...] and for [...], line [...].

Rényi entropic profiles

[Figure: entropic profile, motif -ATC- detected]
[Figure: Example, Gaussian kernel vs. fractal kernel, motif ATC, x axis from 0 to 1]

4. Conclusions and Future work

Rényi entropic profiles provide local information about motifs and their statistical significance.
Continuous quadratic entropy H_2 is a good measure of DNA sequence randomness.
The method provides new tools for the study of motifs and repeatability in biological sequences.

Future work: explore theoretical properties of the entropic profiles; optimize the algorithm to accommodate longer sequences.

Acknowledgments

S. Vinga and J. S. Almeida thankfully acknowledge the financial support by grants SFRH/BPD/24254/2005 and POCTI/BIO/48333/2002 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência, Tecnologia e Ensino Superior.

References

[1] Vinga, S. and Almeida, J. S. (2004) Rényi continuous entropy of DNA sequences. J Theor Biol, 231(3):377-388.
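The CGR/USM iteration described in the methods (each step goes half the distance towards the corner of the next symbol) can be illustrated with a minimal Python sketch. The corner placement in `CORNERS`, the starting point at the center of the square, and the function names `cgr_coordinates` and `in_suffix_square` are assumptions made for illustration; the poster's figure may assign the corners differently.

```python
# Corner placement is an assumption for illustration; conventions vary.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return one (x, y) point per symbol of `sequence`.

    Each step moves the current point half the distance towards the
    corner of the next symbol, so point i encodes symbol i in its context.
    """
    x, y = start
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]
        x += 0.5 * (cx - x)
        y += 0.5 * (cy - y)
        points.append((x, y))
    return points


def in_suffix_square(point, suffix):
    """Check the suffix property: is `point` inside the sub-square of `suffix`?"""
    # The map is increasing in each coordinate, so sending the two extreme
    # corners of the unit square through the suffix gives the sub-square.
    lo = cgr_coordinates(suffix, start=(0.0, 0.0))[-1]
    hi = cgr_coordinates(suffix, start=(1.0, 1.0))[-1]
    return lo[0] <= point[0] <= hi[0] and lo[1] <= point[1] <= hi[1]


print(cgr_coordinates("ATC"))
# [(0.25, 0.25), (0.625, 0.125), (0.3125, 0.5625)]
print(in_suffix_square(cgr_coordinates("GGATC")[-1], "TC"))
# True
```

With this corner choice, every sequence ending in TC lands in the same sub-square regardless of what precedes it, which is the suffix property in action.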
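The simplification noted in the methods (the integral reduces to a sum over pairwise squared distances because the convolution of two Gaussians is Gaussian) can be sketched as follows. This assumes the standard Parzen-window estimator of quadratic Rényi entropy with an isotropic 2D Gaussian kernel, H_2 = -ln[(1/N^2) sum_i sum_j G(x_i - x_j; 2*variance)]; the poster's exact expression did not survive in this transcript, so the formula, the corner placement and the names `cgr_points` and `renyi_quadratic_entropy` are assumptions.

```python
import math

# Corner placement is an assumption for illustration; conventions vary.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """CGR coordinates: move halfway to the corner of each successive symbol."""
    x, y = 0.5, 0.5
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]
        x, y = x + 0.5 * (cx - x), y + 0.5 * (cy - y)
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, variance):
    """Parzen estimate of H_2 from all pairwise squared distances.

    Convolving two Gaussians of variance `variance` gives a Gaussian of
    variance 2 * variance, so the integral of the squared density becomes
    a double sum of Gaussians evaluated at the pairwise differences.
    """
    n = len(points)
    conv_var = 2.0 * variance
    norm = 1.0 / (2.0 * math.pi * conv_var)  # 2D isotropic Gaussian constant
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * conv_var))
    return -math.log(total / (n * n))


h_repeat = renyi_quadratic_entropy(cgr_points("A" * 40), variance=0.01)
h_mixed = renyi_quadratic_entropy(cgr_points("ACGT" * 10), variance=0.01)
print(h_repeat < h_mixed)  # the single-symbol repeat is the more structured one
```

The double loop costs O(N^2) in the sequence length, which is consistent with the poster's stated aim of optimizing the algorithm for longer sequences.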