Abstract

A broad class of problems in genetics and epigenetics can be reduced to the study of features distributed over the genome (genome tracks). Rapid and efficient processing of the huge volume of data stored in genome-scale databases requires software packages built on analytical criteria. However, the strong inhomogeneity of genome tracks hampers the development of relevant statistics. We developed criteria for assessing genome track inhomogeneity and correlations between two genome tracks, together with a software package, Genome Track Analyzer, based on this theory. The theory and software were tested on simulated data and applied to the study of correlations between CpG islands and transcription start sites in the Homo sapiens genome, between profiles of protein-binding sites in chromosomes of Drosophila melanogaster, and between DNA double-strand breaks and histone marks in the H. sapiens genome. Significant correlations between transcription start sites on the forward and reverse strands were observed in the genomes of D. melanogaster, Caenorhabditis elegans, Mus musculus, H. sapiens, and Danio rerio. The observed correlations may be related to the regulation of gene expression in eukaryotes. Genome Track Analyzer is freely available at http://ancorr.eimb.ru/.

(a) The distributions of nearest neighbouring transcription start sites (NN TSS) on the forward and reverse strands across particular chromosomes of the Homo sapiens genome. The cytobands across the corresponding chromosomes and the relevant length scales are shown below the TSS distributions. The length scale is in megabases. The blue vertical lines correspond to pairs of NN TSS. The 15 closest pairs on each chromosome are marked by red lines, and the names of the corresponding NN TSS are indicated. Names shown above the red lines correspond to TSS on the forward strand, whereas names shown below the red lines correspond to TSS on the reverse strand (names are given according to EPD notation). (b) Particular examples of NN TSS pairs in the H. sapiens genome. The transcription factors (TF) participating in the regulation of expression of a particular gene are listed after the name of the gene. The TF that match genes on both strands are marked in red. The data on binding sites for TF associated with genes were taken from http://www.genecards.org.
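As an illustration of how pairs such as those in panel (a) might be identified, the following minimal sketch finds, for each TSS on the forward strand, the closest TSS on the reverse strand and then selects the closest pairs. The function names, the tuple layout, and the use of plain integer coordinates are assumptions made for this example; the exact pairing rule used by Genome Track Analyzer is not specified in this caption and may differ.

```python
from bisect import bisect_left


def nearest_neighbour_pairs(forward, reverse):
    """For each forward-strand position, find the closest reverse-strand position.

    Returns a list of (forward_pos, reverse_pos, distance) tuples.
    Hypothetical helper: the pairing rule is an assumption of this sketch.
    """
    reverse = sorted(reverse)
    if not reverse:
        return []
    pairs = []
    for f in sorted(forward):
        i = bisect_left(reverse, f)
        # Only the neighbours immediately left and right of f can be closest.
        candidates = reverse[max(i - 1, 0): i + 1]
        r = min(candidates, key=lambda x: abs(x - f))
        pairs.append((f, r, abs(r - f)))
    return pairs


def closest_pairs(forward, reverse, k=15):
    """Return the k pairs with the smallest distance (k = 15 as in the caption)."""
    pairs = nearest_neighbour_pairs(forward, reverse)
    return sorted(pairs, key=lambda t: t[2])[:k]
```

For example, `closest_pairs([100, 1000, 5000], [120, 4000, 5200], k=2)` keeps the two pairs with the smallest separations.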

(a) The binding profiles for proteins E(Z), Pc-S2, and Psc, and for H3me3K27 histone marks over chromosome 3R of Drosophila melanogaster. For the study of correlations, these profiles were first filtered with a cut-off threshold of mean + 2 SD and then clustered with a distance of 50 nt [Preprocessing of input genetic data and Equation (12)]. The input data after preprocessing are shown below the initial profiles. (b) z-ratios [Equation (23)] characterizing pairwise positional correlations between profiles for proteins E(Z), Pc-S2, and Psc, and for the H3me3K27 mark in different chromosomes of D. melanogaster. The input data were preprocessed as described above. The numbers below the chromosome names give the numbers of nearest neighbours. The horizontal broken lines for z-ratios correspond to the 5% (|z| = 1.96) and 1% (|z| = 2.58) significance thresholds for random correlations. (c) z-ratios characterizing positional correlations between profiles for proteins E(Z), Pc-S2, and Psc, and for H3me3K27 histone marks in chromosome 2R of D. melanogaster at different clustering lengths. The profiles were first filtered with a cut-off threshold of mean + 2 SD. Positive values of zcorr reflect a trend towards shorter distances between profiles relative to the reference model (correlations), whereas negative values of zcorr reflect a trend towards longer distances between profiles (anticorrelations).
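The preprocessing and significance thresholds mentioned in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure defined by Equation (12): it assumes that the SD is the population standard deviation, that probes strictly above the threshold are kept, and that kept positions separated by at most 50 nt are merged into one cluster. The names `filter_and_cluster`, `is_significant`, and `Z_THRESHOLDS` are hypothetical.

```python
from statistics import mean, pstdev

# Two-sided thresholds quoted in the caption for random correlations.
Z_THRESHOLDS = {"5%": 1.96, "1%": 2.58}


def filter_and_cluster(positions, signal, n_sd=2.0, max_gap=50):
    """Keep probes with signal > mean + n_sd * SD, then merge nearby positions.

    Returns a list of (start, end) clusters. The use of the population SD and
    the merging rule (gap <= max_gap nt) are assumptions of this sketch.
    """
    threshold = mean(signal) + n_sd * pstdev(signal)
    kept = sorted(p for p, s in zip(positions, signal) if s > threshold)
    clusters = []
    for p in kept:
        if clusters and p - clusters[-1][1] <= max_gap:
            clusters[-1][1] = p          # extend the current cluster
        else:
            clusters.append([p, p])      # start a new cluster
    return [(start, end) for start, end in clusters]


def is_significant(z, level="5%"):
    """True if |z| reaches the chosen significance threshold."""
    return abs(z) >= Z_THRESHOLDS[level]
```

With this convention, a z-ratio of 2.0 passes the 5% threshold but not the 1% threshold, regardless of its sign.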

(a) The distributions of DNA double-strand breaks (DSBs) and H3K4me3 histone marks over human chromosome 7. The distributions of DSBs and histone marks were coarse-grained over bins of 100 kb, i.e. the heights in these distributions correspond to the number of points in each 100 kb bin. Both sets were preprocessed as described in the main text. The distribution of cytobands across chromosome 7 is shown above the length scale. (b) z-ratios [Equation (23)] characterizing pairwise positional correlations between distributions of DSBs and H3K4me3 in the human chromosomes. The correlations for the Y chromosome are not shown because of poor statistics. The numbers below the chromosome names give the numbers of nearest neighbours. The horizontal broken lines for z-ratios correspond to the 5% (|z| = 1.96) and 1% (|z| = 2.58) significance thresholds for random correlations.
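The coarse-graining described in panel (a) amounts to counting points per fixed-width bin. A minimal sketch, assuming 0-based integer coordinates and bins that start at position 0 (the function name `coarse_grain` is hypothetical):

```python
from collections import Counter


def coarse_grain(positions, bin_size=100_000):
    """Count points per bin of bin_size bases (100 kb by default).

    Returns a list whose i-th entry is the number of positions falling in
    [i * bin_size, (i + 1) * bin_size). Empty input gives an empty list.
    """
    counts = Counter(p // bin_size for p in positions)
    n_bins = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_bins)]
```

The resulting list can be plotted directly as bar heights along the chromosome, with empty bins reported as zero.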