Evaluation of DNA structure complexity: The return interval approach

The primary structure of DNA is known to exhibit multifractal scaling properties, especially within the non-coding regions [1]. It has recently been shown that in multifractal data characterized by pronounced nonlinear long-range dependencies the distribution density of intervals between characteristic events follows a power law, in contrast to the well-known exponential distribution for independent data [2]. This approach has very recently been applied to protein data analysis, where the intervals between single amino acids in the protein primary structure have been considered [3].
Here we employ a similar approach to analyze the primary structure of DNA from different organisms in order to reveal how evolutionary effects are reflected in the return interval statistics of nucleotide sequences. We analyze the intervals between consecutive occurrences of each nucleotide, i.e. A-A, C-C, G-G and T-T intervals. The data set comprises complete genome sequences obtained from the NCBI database, classified as follows. All analyzed organisms belong to four kingdoms: Bacteria, Fungi, Plants and Animals. Mycoplasmas, pathogenic and phytopathogenic bacteria, have a minimal genome of 0.5-2.0 Mbp and carry a minimal amount of non-coding DNA. For these data, as well as for eubacteria such as B. subtilis, E. coli, H. pylori, Lactobacillus sp., etc., we found that the distribution of return intervals is only slightly broader than a simple exponential. In contrast, we found that in DNA from eukaryotic organisms the distributions are significantly broader and can, to a certain extent, be approximated by a power law. This effect is in substantial agreement with the results of [1], since it is well known that most eukaryotic DNA is non-coding. On the other hand, across a range of animals from the simplest Porifera to Homo sapiens we observed only slight discrepancies, despite the differences in their chromosome lengths.
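As an illustration only, the extraction of A-A, C-C, G-G and T-T return intervals described above could be sketched in Python as follows. The function names and the toy sequence are hypothetical and are not taken from the study; the geometric formula is the standard discrete counterpart of the exponential distribution expected for independent symbols, included here as a reference baseline.

```python
from collections import Counter


def return_intervals(sequence, symbol):
    """Distances between consecutive occurrences of `symbol` in `sequence`."""
    positions = [i for i, ch in enumerate(sequence.upper()) if ch == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


def interval_histogram(sequence, alphabet="ACGT"):
    """Map each nucleotide to a Counter of its return-interval lengths."""
    return {s: Counter(return_intervals(sequence, s)) for s in alphabet}


def geometric_pmf(r, p):
    """Probability of a return interval of length r (r >= 1) when symbols
    are independent and the symbol of interest occurs with probability p."""
    return p * (1.0 - p) ** (r - 1)


# Hypothetical toy sequence for demonstration, not real genome data.
toy = "ACGTAACCGGTTAGCA"
print(return_intervals(toy, "A"))
print(interval_histogram(toy)["T"])
```

For a real genome, the empirical histogram of each nucleotide would be compared with `geometric_pmf` evaluated at that nucleotide's observed frequency; a markedly heavier tail than the geometric baseline is what the text above refers to as a broader distribution.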
The work was supported by a Grant of the President of the Russian Federation (МК-556.2011.8).
1. Peng C.K. et al. Long-range correlations in nucleotide sequences. Nature 356 (1992) 168; Arneodo A. et al. Wavelet based fractal analysis of DNA sequences. In: The Science of Disasters, ed. by A. Bunde et al. Berlin: Springer, 2002. p. 46-59.
2. Bogachev M.I., Eichner J.F., Bunde A. Effect of Nonlinear Correlations on the Statistics of Return Intervals in Multifractal Data Sets. Phys. Rev. Lett. 99 (2007) 240601.
3. Kayumov A.R., Bogachev M.I., Mikhailova E.O. Quantification of long-range memory effects in proteins by return intervals statistics. BGRS-2010: Proc. 7th Int. Conf. on Bioinformatics of Genome Regulation and Structure, Novosibirsk, Russia, June 20-27, 2010. p. 65.