Searching for statistically significant regulatory modules

Timothy L. Bailey and William Stafford Noble

Bioinformatics (Proceedings of the European Conference on
Computational Biology). 19(Suppl. 2):ii16-ii25, 2003.

Abstract

The regulatory machinery controlling gene expression is complex,
frequently requiring multiple, simultaneous DNA-protein interactions.
The rate at which a gene is transcribed may depend upon the presence
or absence of a collection of transcription factors bound to the DNA
near the gene. Locating transcription factor binding sites in genomic
DNA is difficult because the individual sites are small and tend to
occur frequently by chance. True binding sites may be identified by
their tendency to occur in clusters, sometimes known as regulatory
modules.
We describe an algorithm for detecting occurrences of regulatory
modules in genomic DNA. The algorithm, called MCAST, takes as input a
DNA database and a collection of binding site motifs that are known to
operate in concert. MCAST uses a motif-based hidden Markov model with
several novel features. The model incorporates motif-specific
p-values, thereby allowing scores from motifs of different widths and
specificities to be compared directly. The p-value scoring also
allows MCAST to accept only motif occurrences whose p-values fall below
a user-specified threshold, while still assigning better scores to
motif occurrences with lower p-values. MCAST can search long DNA
sequences, modeling length distributions between motifs within a
regulatory module, but ignoring length distributions between modules.
The algorithm produces a list of predicted regulatory modules, ranked
by E-value. We validate the algorithm using simulated data as well as
real data sets from fruit fly and human.
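
The thresholded p-value scoring described above can be illustrated with a
minimal sketch. This is not the MCAST hidden Markov model. It is a
simplified greedy clustering that assumes a log-ratio score of the form
-log2(p / threshold) and a linear per-base gap cost. The names Hit,
hit_score, cluster_hits, max_gap and gap_cost are hypothetical and chosen
only for illustration. Clusters are ranked by raw score here; computing
E-values is outside the scope of the sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    """A single motif occurrence (hypothetical record for this sketch)."""
    start: int
    end: int
    pvalue: float


def hit_score(pvalue, p_thresh):
    """Score a motif occurrence relative to the p-value threshold.

    Occurrences at or above the threshold are rejected (None). Accepted
    occurrences get a positive score that grows as the p-value shrinks.
    The log-ratio form is an assumption made for illustration.
    """
    if pvalue >= p_thresh:
        return None
    return -math.log2(pvalue / p_thresh)


def cluster_hits(hits, p_thresh=1e-3, max_gap=50, gap_cost=0.05):
    """Greedily group accepted hits into candidate modules.

    Hits separated by at most max_gap bases join the same cluster, and
    each gap base subtracts gap_cost from the cluster score. Distances
    between clusters are ignored. Returns clusters sorted by score.
    """
    kept = sorted((h for h in hits if h.pvalue < p_thresh),
                  key=lambda h: h.start)
    clusters = []
    current = None
    for h in kept:
        score = hit_score(h.pvalue, p_thresh)
        if current is not None and h.start - current["end"] <= max_gap:
            gap = max(0, h.start - current["end"])
            current["score"] += score - gap_cost * gap
            current["end"] = max(current["end"], h.end)
            current["n_hits"] += 1
        else:
            current = {"start": h.start, "end": h.end,
                       "score": score, "n_hits": 1}
            clusters.append(current)
    return sorted(clusters, key=lambda c: c["score"], reverse=True)


if __name__ == "__main__":
    example = [
        Hit(100, 110, 1e-5),
        Hit(120, 130, 1e-4),
        Hit(200, 210, 0.5),     # rejected: p-value above the threshold
        Hit(1000, 1010, 1e-4),  # too far away, forms its own cluster
    ]
    for c in cluster_hits(example):
        print(c)
```

In this sketch, subtracting log2 of the threshold places motifs of
different widths on a common scale, so that any accepted occurrence
contributes a positive amount and weak occurrences are dropped before
clustering.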