Abstract

CpG dinucleotide clusters, also referred to as CpG islands (CGIs), are usually located
in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play
a crucial role in gene expression and cell differentiation; as such, they are commonly
used as gene markers. Earlier CGI identification methods used the high CpG dinucleotide
content of CGIs as a characteristic measure to locate them. Some recent methods
exploit the fact that nucleotide G is more likely to follow nucleotide C in a CGI
than in a non-CGI. These methods
use the difference in transition probabilities between consecutive nucleotides to distinguish
a CGI from a non-CGI. These transition probabilities vary with the data being
analyzed, and several sets of them have been reported in the literature, sometimes leading
to contradictory results. In this article, we propose a new and efficient scheme for
identification of CGIs using statistically optimal null filters. We formulate a new,
unambiguous CGI identification characteristic to reliably and efficiently identify CGIs
in a given DNA sequence. Our proposed scheme combines maximum
signal-to-noise ratio and least squares optimization criteria to estimate the CGI
identification characteristic in the DNA sequence. The proposed scheme is tested on
a number of DNA sequences taken from human chromosomes 21 and 22 and is shown to be
highly reliable and efficient in identifying CGIs.
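
To make the transition-probability idea mentioned above concrete, the following minimal Python sketch estimates the probability that G follows C from the adjacent nucleotide pairs of a sequence. It illustrates only the background concept used by earlier methods, not the null-filter scheme proposed here; the function name `transition_probability` and the toy sequences are assumptions made for illustration.

```python
from collections import Counter


def transition_probability(seq: str, first: str = "C", second: str = "G") -> float:
    """Estimate P(second | first) from adjacent nucleotide pairs in seq."""
    seq = seq.upper()
    # Count every overlapping pair of neighbouring nucleotides.
    pairs = Counter(zip(seq, seq[1:]))
    # Number of pairs whose first nucleotide is `first`.
    from_first = sum(n for (a, _), n in pairs.items() if a == first)
    if from_first == 0:
        return 0.0  # `first` never has a successor in this sequence
    return pairs[(first, second)] / from_first


# Toy sequences (made up for illustration, not real genomic data).
cgi_like = "CGCGCGCG"
background = "CACTCATCGA"

print(transition_probability(cgi_like))    # 1.0
print(transition_probability(background))  # 0.25
```

On real data such estimates depend on the training sequences used, which is one reason the reported transition probabilities differ across studies.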