Bottom Line:
We also show that the CG clusters co-localize in the human genome with hypomethylated loci and annotated transcription start sites to a greater extent than annotations produced by prior CpG island definitions. Moreover, this new approach allows CG clusters to be identified in a species-specific manner, revealing a degree of orthologous conservation that is not revealed by current base compositional approaches. Finally, our approach is able to identify methylating genomes (such as Takifugu rubripes) that lack CG clustering entirely, in which it is inappropriate to annotate CpG islands or CG clusters.

ABSTRACT: Cytosines at cytosine-guanine (CG) dinucleotides are the near-exclusive target of DNA methyltransferases in mammalian genomes. Spontaneous deamination of methylcytosine to thymine makes methylated cytosines unusually susceptible to mutation and consequent depletion. The loci where CG dinucleotides remain relatively enriched, presumably due to their unmethylated status during the germ cell cycle, have been referred to as CpG islands. Currently, CpG islands are solely defined by base compositional criteria, allowing annotation of any sequenced genome. Using a novel bioinformatic approach, we show that CG clusters can be identified as an inherent property of genomic sequence without imposing an a priori base compositional assumption. We also show that the CG clusters co-localize in the human genome with hypomethylated loci and annotated transcription start sites to a greater extent than annotations produced by prior CpG island definitions. Moreover, this new approach allows CG clusters to be identified in a species-specific manner, revealing a degree of orthologous conservation that is not revealed by current base compositional approaches. Finally, our approach is able to identify methylating genomes (such as Takifugu rubripes) that lack CG clustering entirely, in which it is inappropriate to annotate CpG islands or CG clusters.

Figure 3: The base compositional characteristics of CG clusters (black) are shown in terms of observed to expected CG dinucleotide densities (O/E CG ratio) on the x-axis with (G+C) content on the y-axis. The dashed lines illustrate the relatively non-stringent thresholds of the original CpG island definition (3). Points to the left of the vertical threshold or below the horizontal threshold represent CG-dense loci that would fail to be identified using base compositional criteria alone. The arrowhead indicates extremely (A+T)-rich, CG-dense alpha satellite DNA sequences.
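The two axes of Figure 3 can be computed for any sequence with a few lines of code. The following is a minimal sketch, not the authors' implementation. It uses the conventional formula O/E = (number of CG × N) / (number of C × number of G) for a sequence of length N. The function names and the default thresholds (G+C of at least 0.5 and O/E of at least 0.6, the values commonly quoted for the original CpG island definition) are assumptions, since the caption does not state them.

```python
def base_composition(seq):
    """Return (G+C fraction, observed/expected CG ratio) for a DNA sequence."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    c = s.count("C")
    g = s.count("G")
    cg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n
    # Expected CG count if C and G occurred independently is (C * G) / N.
    oe_ratio = (cg * n) / (c * g) if c and g else 0.0
    return gc_fraction, oe_ratio


def passes_thresholds(seq, min_gc=0.5, min_oe=0.6):
    """Assumed thresholds; adjust to the definition being compared."""
    gc_fraction, oe_ratio = base_composition(seq)
    return gc_fraction >= min_gc and oe_ratio >= min_oe


# A toy (A+T)-rich but CG-dense sequence: high O/E ratio, low (G+C) content.
at_rich = "ATCGATATCGATATCGATAT"
print(base_composition(at_rich), passes_thresholds(at_rich))
```

The toy sequence has a high O/E ratio but fails the (G+C) threshold, which is the situation the caption describes for the (A+T)-rich alpha satellite sequences.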

Mentions:
We were able to optimize the criteria when we recognized that at any individual CG-dense locus, a given number of CGs generates multiple overlapping fragments. More CG-dense clusters require a greater number of fragments to span all of the CGs they contain. Accordingly, the more overlapping fragments that represent a given locus, the more likely it is to be significantly CG-dense. For each number of CGs, we calculated the number of overlapping fragments per cluster. We obtained a representation of information content for each CG number by summing this total across all loci in the genome and dividing by maximum fragment length. We then determined the optimal number of CGs per fragment using the maximum value obtained (Figure 2c). For the human genome, this optimum corresponds to 27 or more CG dinucleotides in a sequence of no more than 531 bp in length. This new means of identifying CG clusters is neither constrained by (G+C) content nor by the associated observed/expected CG dinucleotide ratio. In Figure 3, we show that the thresholds imposed by even the least stringent original base compositional criteria (2) cause many CG-dense loci in the genome to be missed. However, even though we are annotating the entire sequenced genome, including repetitive DNA, we identify only a small fraction of the ∼350 000 CpG islands predicted by these old criteria (2) (Table 1).
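The fragment-and-cluster idea described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: a fragment is taken to be the span from the first to the last of `min_cgs` consecutive CG dinucleotides, it qualifies if it is no longer than `max_len` bp, and overlapping qualifying fragments are merged into one cluster while counting how many fragments support it. The function names and the half-open coordinate convention are assumptions. The defaults are the human-genome values quoted in the text (27 CGs, 531 bp), and the step that selects those optima is not reproduced here.

```python
import re


def cg_positions(seq):
    """0-based start positions of every CG dinucleotide."""
    return [m.start() for m in re.finditer("CG", seq.upper())]


def find_cg_clusters(seq, min_cgs=27, max_len=531):
    """Return clusters as (start, end, n_fragments), end exclusive."""
    pos = cg_positions(seq)
    clusters = []
    for i in range(len(pos) - min_cgs + 1):
        start = pos[i]
        end = pos[i + min_cgs - 1] + 2  # include the last CG dinucleotide
        if end - start > max_len:
            continue  # these CGs are too spread out to form a fragment
        if clusters and start < clusters[-1][1]:
            # Overlaps the previous cluster: extend it and count the fragment.
            clusters[-1][1] = max(clusters[-1][1], end)
            clusters[-1][2] += 1
        else:
            clusters.append([start, end, 1])
    return [tuple(c) for c in clusters]


# Toy example with small parameters so the behaviour is easy to follow.
toy = "AT" * 10 + "CG" * 5 + "AT" * 10 + "CG" * 3 + "AT" * 5
print(find_cg_clusters(toy, min_cgs=3, max_len=10))
# → [(20, 30, 3), (50, 56, 1)]
```

In the toy run, the denser first locus is covered by three overlapping fragments and the second by only one, which mirrors the observation that a larger number of overlapping fragments indicates a more CG-dense locus.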
