BACKGROUND:Regulatory modules are segments of the DNA that control particular aspects of gene expression. Their identification is therefore of great importance to the field of molecular genetics. Each module is composed of a distinct set of binding sites for specific transcription factors. Since experimental identification of regulatory modules is an arduous process, accurate computational techniques that supplement this process can be very beneficial. Functional modules are under selective pressure to be evolutionarily conserved. Most current approaches therefore attempt to detect conserved regulatory modules through similarity comparisons at the DNA sequence level. However, some regulatory modules, despite the conservation of their responsible binding sites, are embedded in sequences that have little overall similarity.RESULTS:In this study, we present a novel approach that detects conserved regulatory modules via comparisons at the binding site level. The technique compares the binding site profiles of orthologs and identifies those segments that have similar (not necessarily identical) profiles. The similarity measure is based on the inner product of transformed profiles, which takes into consideration the p values of binding sites as well as the potential shift of binding site positions. We tested this approach on simulated sequence pairs as well as real world examples. In both cases our technique was able to identify regulatory modules which could not to be identified using sequence-similarity based approaches such as rVista 2.0 and Blast.CONCLUSION:The results of our experiments demonstrate that, for sequences with little overall similarity at the DNA sequence level, it is still possible to identify conserved regulatory modules based solely on binding site profiles.

This article is available from: http://www.biomedcentral.com/1471-2105/7/287
2006 Meng et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: Regulatory modules are segments of the DNA that control particular aspects of
gene expression. Their identification is therefore of great importance to the field of molecular
genetics. Each module is composed of a distinct set of binding sites for specific transcription factors.
Since experimental identification of regulatory modules is an arduous process, accurate
computational techniques that supplement this process can be very beneficial. Functional modules
are under selective pressure to be evolutionarily conserved. Most current approaches therefore
attempt to detect conserved regulatory modules through similarity comparisons at the DNA
sequence level. However, some regulatory modules, despite the conservation of their responsible
binding sites, are embedded in sequences that have little overall similarity.
Results: In this study, we present a novel approach that detects conserved regulatory modules via
comparisons at the binding site level. The technique compares the binding site profiles of orthologs
and identifies those segments that have similar (not necessarily identical) profiles. The similarity
measure is based on the inner product of transformed profiles, which takes into consideration the
p values of binding sites as well as the potential shift of binding site positions. We tested this
approach on simulated sequence pairs as well as real world examples. In both cases our technique
was able to identify regulatory modules which could not to be identified using sequence-similarity
based approaches such as rVista 2.0 and Blast.
Conclusion: The results of our experiments demonstrate that, for sequences with little overall
similarity at the DNA sequence level, it is still possible to identify conserved regulatory modules
based solely on binding site profiles.

Background
Transcription is the fundamental biological process in
which a selected region of DNA is transcribed into RNA by
a molecular machinery, a core component of which is
RNA polymerase. For most protein-coding genes, tran-
scription is the intermediate step via which the informa-

tion coded in their DNA is "expressed" into functioning
proteins. For others, such as RNA genes, the product of the
transcription itself may have biological function. Even
though each cell has the complete set of genes in its chro-
mosomal DNA, only a portion of the genes are transcribed
(expressed) in any particular cell depending on tissue/cell

Page 1 of 13
(page number not for citation purposes)

central

^^3

r)

http://www. biomedcentral.com/1471-2105/7/287

type, developmental stage, etc [ 1]. The transcriptome, that
is all of the genes that are selectively transcribed in a cell,
determines the function and morphology of the cell. In
addition, the level (i.e., rate) of transcription is often reg-
ulated in response to intra-cellular and extra-cellular envi-
ronmental factors to achieve cellular homeostasis.
Normal transcriptional regulation, i.e., the right genes
being transcribed at the right times, in the right cell, and
at the appropriate rates, is central to almost all physiolog-
ical processes. Abnormal regulation of transcription often
results in disruption of development and/or pathological
states. For example, ectopic (i.e., abnormally high) expres-
sion of oncogenes leads to hyperplasia and cancer.

A basic element of transcriptional regulation is the inter-
action of transcription factors (trans factors) and their cor-
responding transcription factor binding sites (TFBSs, also
referred to as cis factors) on the DNA. Transcriptional reg-
ulation of a gene (e.g. restricted transcription in a particu-
lar cell type, or elevated transcription, in response to UV
light) is often mediated through the functional/physical
interactions among multiple transcription factors, each
recruited to the proximity of the DNA in part by their
selective affinity to their corresponding binding sites. For
example, the even-skipped(eve) gene is transiently
expressed in 7 alternative stripes on the longitudinal axis
in the developing Drosophila melanogaster embryo at the
blastoderm stage. Each of the seven stripes is regulated by
a distinct set of transcription factors binding to their cor-
responding binding sites located in a DNA segment flank-
ing the even-skipped gene. The most well investigated of
these segments is the stripe 2 regulatory region, which has
identified binding sites for 5 different transcription factors
in a 700 bp (base pair)-l kb (kilo base pair) DNA region
in front of the transcription initiation site of the eve gene.
Evolutionary comparison of this transcription regulatory
module in different Drosophila species has revealed that
most of the binding sites are highly conserved and func-
tional, even though the underlying DNA sequence has
undergone considerable change [2].

A useful analogy to understanding the composition of
DNA regulatory modules is to consider DNA as a
sequence of "Letters" and individual binding sites as
"Words". Then, a functional module of closely associated
binding sites can be conceived of as the "Sentence" com-
mand embedded in the DNA sequence that guides tran-
scription. The problems associated with identifying the
"Sentence" commands in a DNA sequence are two fold.
First, the binding sites are degenerate in nature, that is, the
same "Word" may have different letters in certain posi-
tions. Second, the "Words" are themselves interspersed by
varying lengths of meaningless "Letters".

One approach to identifying DNA regulatory modules is
through cross-genome comparison. The assumption
underlying this approach is that DNA sequences encoding
functional TFBSs are under selective pressure to be con-
served during evolution, whereas non-functional DNA
mutate/change more rapidly. Thus, if DNA sequences
flanking orthologs in two related species were to be com-
pared for sequence-level similarity, DNA regulatory mod-
ules would appear as conserved "islands" in a sea of
otherwise not-conserved DNA sequences. Approaches in
this category include rVista2.0, ConSite, PhyME, TOU-
CAN, CREME, TraFAC, etc [3-10]. For instances, based on
the sequence level conservation between human and
mouse, Cora et al. predicted functional TFBSs that are sta-
tistically over-represented and share the same specific
Gene Ontology (GO) terms [9]. This kind of cross-
genome comparison approaches has successfully led to
the discovery of regulatory modules that were subse-
quently verified by functional characterization [ 11].

The disadvantage of the sequence-based approach is that
it is dependent on the overall conservation of the DNA
region harbouring the regulatory module. As we men-
tioned earlier, TFBS sequences are degenerate in nature
and are interspersed by non-functional sequences which
are not under conservation pressure. Depending on the
ratio of functional versus non-functional base-pairs in the
DNA region, it is entirely possible that the overall
sequence level conservation of the region be indistin-
guishable from the background level, while the actual
TFBSs making up the functional module still be con-
served. In other words, it is possible that despite the con-
servation of the "Sentence" command at the binding site
level, the overall conservation of the DNA backbone at the
sequence level be minimal or non-detectable. This situa-
tion is aggravated if the participating binding sites are
highly degenerate (i.e., tolerate many variations at the
sequence level) and the spaces between the binding sites
are long. In fact, it has been observed by researchers in
many instances that while the regulatory region has no
detectable overall similarity, the transcription regulatory
function is preserved [12]. Sequence-based approaches, or
approaches requiring filtration of sequences based on
DNA level similarity, are not helpful for identifying the
responsible TFBSs in this scenario.

Since the conservation pressure is at the binding site level,
i.e., the sequence must be able to maintain binding affin-
ity to the transcription factorss, it makes biological sense
to perform comparisons at the binding site level rather
than at the sequence level. This, however, is currently hin-
dered by two factors. The first limitation is the effective-
ness of identifying transcriptional binding sites in a given
DNA sequence. The set of TFBSs for a given TF can be
quantitatively represented using a frequency matrix that

Page 2 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

describes the binding specificity of the TF at each of its
positions. The quality of the matrix used to identify poten-
tial TFBSs is determined by the number and quality of
known binding site sequences used to construct the
matrix. As a result of the development of new technolo-
gies such as Chip [13] and ChIP-chip [14], itis anticipated
that binding site instances will be identified at a unprece-
dented rate which will undoubtedly greatly enhance both
the quality as well as the coverage of binding site matrices
in the near future [15].

The second limitation is that we currently do not under-
stand the grammar governing how binding sites (Words)
make up the regulatory modules (Sentences). Based on
our understanding of transcriptional regulation, such a
grammar should have at least three components: (1) the
composition of the binding sites, (2) the sequence of the
binding sites, and (3) the spaces between/among the
binding sites. Currently, the number of regulatory mod-
ules that have been thoroughly characterized is far fewer
than what is required to decode this grammar.

A major obstacle for biologists working on transcriptional
regulation is to locate and identify potential TFBSs
responsible for a particular regulatory module, especially
in sequences that do not have significant conserved
islands. In this paper, we describe a novel methodology
for binding site level identification of conserved regula-
tory modules in such sequences.

Results
Simulating sequence pairs harbouring a conserved module
of binding sites
Since the number of well-studied regulatory modules is
currently rather sparse, we chose to simulate sequence sets
(pairs) representing the domain of our interest, i.e., con-
served binding site patterns in a pair of sequences which
nonetheless have little or no similarity at the sequence
level. In many cases, experimental investigation in a
model organism has narrowed down the location of the
regulatory module(s) for a particular gene to a relatively
short region (e.g. within 1 kb), whereas for the ortholog
in a less-studied organism (reference organism), informa-
tion about the localization of the module is absent (except
that it is generally in the proximity of the gene). In view of
this, in the first (current) stage of the development of
BLISS, we considered the identification of a conserved
module present in both a short sequence of about 100-
500 bps, representing the model organism, and a longer
sequence (5-6 kb), representing the reference organism.
Although this simplification limits the applicability of the
current methodology, it does highlight the promise of our
approach.

For each sequence pair, the backbones for both the short
sequence and the long sequence were generated randomly
and thus had no sequence similarity. A hypothetical mod-
ule involving binding sites for 4-8 distinct transcription
factors was first introduced into the short sequence. The
binding site sequences were randomly selected from the
instances recorded in the TRANSFAC 9.1 database [15].
The rules governing the formation of the hypothetical
module were as follows:

2.) For each transcription factor, there may be more than
one binding site in the module.

3.) The distance between consecutive binding sites is "di",
where in 65% of the cases di lies within 5-20 bps, in 22%
of the cases di lies within 21-50 bps and in the remaining
13% of the cases di lies within 51-60 bps (Figure 1).

The range of values for di was based on background
knowledge as well as a statistical analysis of the distances
between pairs of TFBSs in TransCompel database by Qiu

Simulation of conserved regulatory
modules in pairs of DNA sequences

Hypothetical module

Seq. #1 A -
d r
Conserved module

Seq. #2 0 M A

A I Transcription Factor
binding sites

Figure I
Simulating conserved regulatory modules in a pair of
random sequences. Each pair has a short (100-500 bp)
and a long (5-6 kb) sequence. For both sequences the back-
bones were generated randomly. The hypothetical module
was formulated according to rules described in the text and
inserted into the short sequence. A "conserved" module with
binding sites corresponding to the same transcription factor,
but with different sequences was inserted into the long
sequence. The distances between consecutive binding sites
were also different between the hypothetical module and the
conserved module.

Page 3 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

et al. which revealed the above distribution of distances
between pairs of TFBSs [16].

The hypothetical module was first simulated according to
the above rules in the short sequence. Subsequently, a
"conserved" module was formulated and inserted into the
longer sequence at a random location. The rules govern-
ing the formation of the "conserved" module were as fol-
lows:

1.) It is comprised of TFBSs that correspond to the same
transcription factors as present in the hypothetical mod-
ule.

2.) The sequence for each TFBS is randomly chosen from
the recorded instances in TRANSFAC 9.1, with the caveat
that it cannot be the same instances) that was (were)
used to construct the hypothetical module in the short
sequence.

3.) The respective order of TFBSs is the same as in the
hypothetical module in the short sequence.

4.) The distance between consecutive binding sites in the
conserved module is dj; dj is a function of d, in that dj lies
in the range (di Ad, di + Ad) (Figure 1).

Ad is the "perturbation factor" the variation of distance
between corresponding binding sites in the hypothetical
module and the conserved module. In this study, we used
Ad = 4 (See Discussion). A total of 10,000 pairs of
sequences were generated according to above rules, and
were used to test and evaluate various algorithms.

Identifying a conserved module by comparison at the
binding site level
As stated above, the objective of our methodology is to
identify conserved regulatory modules within highly
divergent sequences. The sequence pairs in our simulated
data-set had little overall sequence similarity. Of the
10,000 pairs, 73.32% have no similarity detectable by
BLAST analysis (E = 0.01, Blast2seq). This indicated that
the conservation of binding sites in the hypothetical mod-
ule and the conserved module was not sufficient to allow
detection at the sequence level. Of those that did have a
significant match, the output alignments were shorter
than 30 bps, which was far shorter than the length of the
inserted module.

M_score
To identify the conserved module at the binding site level,
we first generated the potential TFBS profiles for each of
the simulated sequence pairs. A matrix scoring method
similar to that used in Match [17] was implemented (see
Methods), which allowed us to score each sequence

against the frequency TFBS matrices recorded in TRANS-
FAC 9.1 (M_score, Figure 2a). When a cut-off value of 0.75
(on M_score) was applied, on average there were about
3000 identified potential TFBSs in every thousand base
pairs of simulated sequences. This is similar to what we
observed when using genomic DNA sequences randomly
extracted from model organism databases (data not
shown).

To identify the hypothetical module embedded in the
sequences, we tried several different algorithms that com-
pared the binding site profiles of the short and the long
sequences. Of those tested, a scoring method (using inner-
products) based on a statistical evaluation of binding site
matches after a Gaussian smoothing of the binding site
profiles gave reliable and promising results.

P_score
The matrix score (M_score), by virtue of its definition (f. 1),
ranges from 0 to 1 for all TFBS matrices. Thus it does not
differentiate short and relatively simple matrices that
match DNA sequences with a high frequency from those
long and stringent matrices that match DNA sequences
only rarely. For example, the binding site for En
(I$ED_06) has 7 positions, and on average there are 320
matches (with M_score > 0.75) on any 10,000 bp random
sequence. In contrast, the binding site for Bel-1
(V$BEL1_B) has 13 positions, and the average number of
matches with M_score > 0.75 in a 10,000 bp random
sequence is 3. It is clear that a match involving the binding
site V$BEL1_B is far more significant than a match with
I$ED_06. To differentiate this, we introduced the p value
of the M_score, which was estimated by calculating the
fraction of randomly generated sequences that have scores
equal to or higher than that M_score. We then calculated
the P_score (see Methods) as the product of -log (p value
ofM_score > cutoff) and the M_score (Figure 2b).

G_score
To account for the variation in the distances between/
among binding sites, we performed a Gaussian smooth-
ing of the P_score (see Methods). Through empirical test-
ing (data not shown), we found that a variance of (2 = 9
gave the best performance. We denote the Gaussian
smoothed score profile as the G_score profile of the
sequence (Figure 2d).

BLISS score
G_score profiles were generated for both the short and the
long sequences. To identify a maximum match at the
binding site profile level, the short G_score profile was slid
along the long G_score profile. At each position, the match
between the short profile and its corresponding region of
equal length (length of the window) in the long profile

Page 4 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

(a) M-scores before cutting off

(b) M-scores after cutting off

TFBSs 0 0

S 200
100
Locations

TFBSs 0 0

(c) P-scores

10 -

(d) G-scores

2

11

200

TFBSs 0 0

Locations

TFBSs 0 0

-` 200
100
Locations

Figure 2
Gaussian smoothing of the TFBS profile, a.) Profile of matrix matching scores (M_score) for three TFBSs along a short
DNA sequence. b.) M_score profile after applying a cutoff of 0.75. c.) Profile of P_score after incorporating the p value of bind-
ing site matches. d.) Profile of the G_score after Gaussian smoothing. The colors represent three different TFBSs: Red, En;
Green, Croc; Blue, Lun-1.

was evaluated using an inner-product as the BLISS_score
(see Methods).

Note that since the "length of the window" appears in the
denominator, the BLISS_score is independent of the
length of the short profile (or the length of the window).
Figure 3a shows the distribution of BLISS_scores as the
shorter G_score profile was slid along the longer G_score
profile. The peak of the BLISS_score indicates the maxi-
mum match. In this case, the abrupt surge in the
BLISS_score is due to the match between the embedded
hypothetical module in the short sequence and the con-
served module in the long sequence. When this method-
ology was tested on all of the 10,000 simulated sequence
pairs, about 80% of the highest peaks for each pair con-

trained the correct match between the embedded hypo-
thetical module and the conserved module.

Distribution and statistical evaluation of the BLISS_score
To be able to evaluate the match at the binding site profile
level, we analyzed the distribution of BLISS_scores using
the simulated sequence pairs. For each pair of sequences,
BLISS_scores were calculated at each position as the short
profile slid along the longer profile. The peak matches
(corresponding to the peaks in the score profile) between
each pair of sequences were evaluated to see whether it
aligned the embedded modules. If the match did align the
modules, it was designated a "true" match. All other
BLISS_scores were considered as background.

Figure 3
Identifying a conserved module at the binding site
level. a.) The G_score profile is slid along the long sequence
to calculate the BLISS_score. The peak in the BLISS_Score pro-
file represents the maximum match at the binding site level,
and in this case, contains the match of the conserved hypo-
thetical regulatory module simulated in the sequences. b.)
and c.) show the distribution of BLISS_score in true matches
vs. background with a cutoff value of 0.75 and 0.80, respec-
tively.

Figure 3b shows the distribution of the background and
the "true" match BLISS_scores for the 10,000 simulated
pairs of sequences. This distribution varies slightly
depending upon the cutoff threshold set for M_score
(Figure 3b&c). This is not surprising, since a lower cutoff

threshold will lead to more identified potential binding
sites and thus a slightly higher background score.

The distributions allow us to evaluate any particular
BLISS_score. It is informative in helping set a threshold for
reporting significant matches at the binding site level.
Given a BLISS_score x, the distributions allow us to decide
whether that BLISS_score corresponds to a true alignment
of modules or whether it corresponds to the module
aligning with a random DNA segment. Let C1 denote the
event where the modules embedded in the short and the
long sequences are aligned, and C2 denote the event
where either module is aligned with a random DNA seg-
ment. Based on Bayes formula, the posterior probabilities
can be calculated as follows:

Where p(C1 Ix) is the conditional probability of C1 given
a BLISS_score x and p(C2 x) is the conditional probability
of C2 given a BLISS_score x; p(Cl) is the prior probability
of C1 and p(C2) is the prior probability of C2; p(x\C1) is
the conditional probability of observing x given Cl and
p(x| C2) is the conditional probability of observing x given
C2. p(x|C1) and p(x C2) can be read off directly from the
distributions generated.

It is difficult to find the means to calculate the prior prob-
abilities p(Cl) and p(C2). In this study, we assumed p(C1)
= p(C2), although we suspect that p(Cl) might be smaller
than p(C2). This assumption allowed us to calculate the
posterior probabilities to evaluate a BLISS_score x. In prac-
tice, we decided that x was a significant matching score if
p(C2|x) was less than a threshold of, e.g. 0.01 or 0.001.

Identifying a conserved regulatory module in distantly
related species
To test the efficacy of the BLISS methodology in real
sequences, we undertook the task of identifying the Even-
skipped (eve) stripe 2 enhancer (S2E) in distantly related
Drosophila species. Even-skipped, an important develop-
ment regulatory gene in Drosophila melanogaster (D. mela),
is specifically expressed in seven transverse stripes in the
embryo during the blastoderm stage. The stripe 2
enhancer is the best studied and includes TFBSs for five
TFs, bicoid (Bcd), hunchback (Hb), giant (gt), Kruppel
(Kr), and sloppy-paired (slp) [18-20]. Unfortunately,
TRANSFAC 9.1 has matrices for only three of the five TFs,
i.e., Hb, Kr, and Bcd. Our search was therefore limited in
the sense that some of the participating TFBSs could not
be predicted and used for the match comparison.

Page 6 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

M

http://www. biomedcentral.com/1471-2105/7/287

Previous experimental investigations have shown that S2E
is evolutionarily conserved among D. yakuba (D.yaku), D.
erecta, and D. pseudoobscura (D.pseu) [2,12,21]. All of these
species are in the same subgenus (Sophophora) as D.mela,
with D.pseu having the most distant relationship with
D.mela (separated at about 40 million years ago). BLISS
did identify the eve S2E modules among these four spe-
cies. In particular, a significant peak was reported by BLISS
when we searched the S2E module extracted from D. pseu
against the 14 kb D. mela genomic sequence flanking the
eve coding region.

In contrast, no detailed information has been published
about potential S2E in more distantly related species, such
as D. mojavensis (D.moja) or D. virilis (D.viri), both from a
separate subgenus (Drosophila) separated from D.mela at
about 60 million years ago [22]. To identify S2E in these
two distantly related species, we extracted the 14 kb
genomic sequence flanking the eve coding region from
D. moja and D.viri genomic sequences. Blast analysis using
the D.mela or D.pseu sequence harbouring the eve S2E
module did not identify any significant match longer than
41 bp (using bl2seq with default gap penalty values).
Using the BLISS methodology however, a significant peak
in the BLISS_score was observed (Figure 4a, p(C2|x) <
0.05). Verification of this match indicated that it con-
tained the TFBSs composing the S2E module. Similar
results were obtained when corresponding sequence pairs
involving 1.) D.mela and D.moja, and 2.) D.mela and
D.viri, were analyzed. In contrast, no significant match
was identified in the reverse-complemented sequence
(Figure 4b) or in other 14 kb sequences unrelated to eve,
indicating the specificity of the search.

A detailed inspection of the make up of the S2E modules
in distantly related species showed that S2E can be viewed
as a complex module made of element modules. To make
an analogy, S2E is a complex sentence that has several
"clauses" (Figure 5). The evolution of the whole module
indicates that the distances between some TFBSs have
changed dramatically. However, the distances among the
TFBSs within corresponding "clauses" have remained rel-
atively stable. For example, in Clause 1 the distances
among participating TFBSs have remained constant over
the long evolutionary period. Specifically, the distance
between the first bcd (overlapping with the first kr) and
the second bcd is invariably 20 in all of the four species.
In addition, the distances among TFBSs in Clause 3 have
also remained relatively stable, i.e., within the variation
we have factored into our simulation.

Since our methodology is really based on the assumption
of limited distance variations between TFBSs, it should be
much more sensitive at identifying individual "Clauses"
or simple modules. When the corresponding TFBS profile

covering Clause 1 or Clause 3 were used to search against
the genome sequence from D. moja, very significant peaks
in BLISS_score were observed (Figure 6a&b, p(C2|x) <
0.001 for both). The peaks corresponded to the match of
Clause 1 and Clause 3 on the D.moja sequence, respec-
tively. BLAST analysis using the sequences covering Clause
1 or Clause 3 searched against the D.moja genomic
sequence failed to identify significant matches that
spanned the whole module. rVista 2.0 did predict Clause
1 because it succeeded in detecting the DNA similarity
between the sequence covering Clause 1 and the D.moja
sequence. However, rVista 2.0 failed to identify Clause 3
since no similarity was detected between the sequence
covering Clause 3 and the D. moja genomic sequence.

Page 7 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

4

mel -

159
A

327 403
I A

- a H a a I

3
139139
C ausel

A
311
Clause2

49; r521572

580 615664
ClauseS

494 573
A

703 73674 793
A

-I

802 846
802 846

A Bed Kr Hb

Figure 5
Inter-TFBSs distances are very well conserved within each clause of the S2E module. Comparison of the evolution
of S2E modules across distantly related species revealed that while the sequence length of the module has changed significantly,
the distance among TFBSs in Clauses I and 3 have remained stable. The numbers near the TFBS indicate the positions relative
to the first Kr site in this module.

Implementation of BLISS as a web-based service
The BLISS methodology has been implemented as a web-
based tool for the research community. The web applica-
tion embodies the Gaussian Smoothing Method for iden-
tifying cis-regulatory modules at the binding site level, and
outputs all potential TFBSs in the predicted module. The
module finding process consists of several steps:

To begin, the user inputs two DNA sequences. For exam-
ple, a short sequence from a model organism that har-
bours a regulatory module, and a longer sequence
surrounding the ortholog of a different species. An
M_score threshold of 0.75 or 0.8 is then chosen by the user
for the generation of the TFBS profiles for both sequences.
Next, a plot of BLISS_scores comparing successive align-

ments of the short profile against the long profile is
returned to the user. On the very same page, the distribu-
tions described earlier (Figure 3b&c) are displayed so that
the user may choose a BLISS_score threshold. Once the
BLISS_score threshold is chosen, BLISS outputs all of the
matches with a BLISS_score higher than that threshold
(limited up to 5). For each match, a table of contributing
TFBSs are listed based on the product of the p-values of
the matching TFBSs on both sequences (Figure 7). Alter-
natively, it can be listed based on their numeric contribu-
tion to the BLISS_score, or by the location of the TFBSs.

Currently, the limits for the short and long input
sequences are set at 1200 bps and 15 k bps, respectively.

Page 8 of 13
(page number not for citation purposes)

221
A

201 201
201 201

pse

moja

4

viri I

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

(a) Search S2E Clause 1 in 14kb D moja

I 3o00 4001 s001 Boot
to 14kb D. moja

(b) Search S2E Clause 3 in 14kb D. moja

00

14kb D. moja

Figure 6
Searching for conserved individual clauses/element
modules using BLISS. Profiles covering clauses I (a) or 3
(b) of S2E were used to search against a 14 kb D.moja
genomic region. The BLISSscore peaks are significant
(P(C21x) = 0.0 for a, = 0.0003 for b).

Discussion
In this study, we have presented a first step towards iden-
tifying regulatory modules via comparisons at the binding
site level. The advantage of such an approach is that it
allows the detection of conserved regulatory modules in
highly divergent sequences, as we have demonstrated
both with simulated sequences as well as with real world
examples. This method is thus complementary to many
existing methods that are based on sequence similarity
comparisons [231 or use sequence similarity for pre-anal-
ysis selection [4,5,7,241. It should also be complementary
to applications such as MEME and CompareProspector,
which are widely used for the identification of conserved
sequence motifs (binding sites) in the regulatory region of
co-expressed genes [25,261.

There are limitations to our approach. Some of the major
limitations, such as the coverage and quality of the TFBS

matrices, are expected to improve rapidly in the near
future as new high throughput techniques are applied to
identify binding sites in genome scale. Our current algo-
rithm is developed based on the assumption that the
inter-TFBS distance variation is within a +/-4 base pair
range. This allows the identification of modules/clauses
with relatively small inter-TFBS distance variation, such as
the individual clauses in the S2E module. It will likely
miss modules/clauses that have much larger distance var-
iations between TFBSs. In the case of S2E, the identifica-
tion of the module was based on the fact that the third
clause had low inter-TFBS distance variations, which was
sufficient to generate a significant BLISS_score (figure 4a).
As indicated by Ludwig et al, if S2E as a whole were to be
considered, many inter-TFBS distances have changed dra-
matically during evolution [12]. However, a closer look at
the distribution of TFBSs in S2E in the two distantly
related species also indicated that the S2E module may be
sub-divided into clauses (Figure 5). While the inter-clause
distances have varied dramatically, the inter-TFBS dis-
tances within each clause have remained largely stable
(Figure 5). This is very possibly a reflection of the spacing
restriction on important transcription factor interactions.

In addition to the S2E module, we also tested our meth-
odology on other regulatory modules such as the DME
(Distal Muscle Enhancer) module in front of the paramy-
osin gene [27]. Using a 200 bp sequence harbouring the
DME in D. mela, we were able to detect the corresponding
module in D.viri (data not shown). Currently, the number
of well characterized, evolutionarily conserved modules is
limited. The goal of BLISS is to facilitate the discovery of
multiple TFBSs modules by identifying the conserved pat-
tern of the TFBSs. We also applied BLISS to a regulatory
region that is responsible for mediating UV induced
expression of hac-1 [28]. There is no existing information
on the composition of the UV-responsive module in this
region, which has very low sequence level conservation
between the corresponding segments in D.mela and
D.pseu. Yet genetic experiments have indicated that the
responsiveness is highly conserved. The potential module
identified by BLISS is currently being tested experimen-
tally.

BLISS, with some adaptation, can potentially be used to
identify the conserved regulatory modules in co-expressed
genes. Another advantage of BLISS is that the methodol-
ogy can also be applied to identify patterns that involve
not only TFBSs, but also other sequence features such as
complex response elements [29], insulator sequences,
CpG islands, etc. Functionally, these sequence features
(their related modifications and binding partners) inter-
act with transcription factors. However, these features,
such as CpG islands, cannot be detected by simple
sequence similarity based searches.

Figure 7
BLISS output of the contributing TFBSs. Our S2E search was used as the example. The list of TFBSs can be output either
based on sequence position, product of p value (Figure 7), or contribution to BLISS score. The TFBSs that belongs to S2E were
highlighted in green.

Conclusion
In this study, we addressed the feasibility of identifying
conserved regulatory modules at the binding site level.
Our results indicate that it is feasible to identify conserved
regulatory modules in simulated random sequences har-
bouring a regulatory module made of 4-8 distinct bind-
ing sites. Using real sequences, we demonstrated that our
approach outperforms regular sequence level compari-
sons when the orthologous DNA sequences are highly
diverged. In addition, the BLISS program outputs directly
the candidate binding sites that are shared between the
two regulatory sequences, which can greatly facilitate the
evaluation of the candidate module as well as the design
of the experimental verification strategy by biomedical
scientists. Future development of the project will include
identifying better algorithms for complex modules and
modules with higher inter-TFBS distance variations.

Methods
Generating simulated sequences
10000 simulated sequence pairs were generated for devel-
oping the methodology. Each set included a short DNA
sequence (100-500 bps) harboring a hypothetical TFBS
module and a long DNA sequence (5-6 kb) harboring a
conserved TFBS module. First, the hypothetical TFBS
module was generated in the following manner: 4-8
binding sites were randomly chosen from TRANSFAC 9.1
[15] database and then random DNA subsequences were
inserted between them. Qiu et al. [16] analyzed all the
entries of composite elements in the TransCompel data-
base (version 3.0) and they found that about 87% of the
composite elements are within 50 bp distance and about
65% are within 20 bp distance of one another. We there-
fore chose lengths of DNA subsequences inserted between
binding sites based on this result. Next, we created the
conserved TFBS module, which included binding sites for
the same sets of transcription factors in the same order as
in the shorter sequence. However, the binding site

Page 10 of 13
(page number not for citation purposes)

0 9447829 5.706195E-4 646
0.9414161 5.706195E-4 646

0.8660778 2.304445E-4 619

BMC Bioinformatics 2006, 7:287

ICdx-Zi-) 10

http://www. biomedcentral.com/1471-2105/7/287

sequences had to be different and they were randomly base frequencies ought to be proportional to the binding
chosen from TRANSFAC 9.1. Furthermore, compared to energy and the information vector reflects this average
the hypothetical module, distances between binding sites binding energy between the transcription factor and the
were set to vary slightly and we allowed each binding site binding site.
to shift up to 4 bps either to the right or to the left. Finally, binding site
we inserted the conserved module into a 5000 bp ran- A score cutoff at 0.75 was applied to the M_score profiles
domly generated DNA sequence to generate the longer of both the short and the long sequence as follows:
sequence. of both the short and the long sequence as follows:
sequence.

Binding site identification M score [i, j] = M score [i, j] if M score [i, j] > 0.75
Potential TFBSs both in the short DNA sequence (includ- Mscore [i, j] = 0 ifM score [i, j] < 0.75
ing the hypothetical module) and the long DNA sequence
(including the conserved module) were searched based
on frequency matrices collected by TRANSFAC 9.1. p value forM score
To calculate the p value, a background model is required.
Because TFBSs may be detectable on either the forward or we chose the background model to be a random
Here, we chose the background model to be a random
the backward strand, we searched both strands of
DNA sequence where each position is drawn independ-
sequences. The M score profile for each sequence is a M* L gently. The ratio among A, T, C, and G is 30%: 30%: 20 %
matrix, where M is twice of the number of matrices
matrix, where M is twice of the number of matrices : 20%. For each frequency matrix, 300 million subse-
applied and L is the length of the sequence. The top half quences were sampled from the background model, and
of the M score matrix is the score profile for the forward the Mscore of each subsequence was calculated to build
the Mscore of each subsequence was calculated to build
strand and the bottom half is that for the complementary the M score distribution. The p value of a M score for each
strand. The M score of the ith TFBS at position j of the
strand. The wMscore of the ith TFBS at position n of the TFBS was estimated by calculating the fraction of samples
sequence was calculated by first aligning the frequency that had scores equal to or higher than that M_score. Then,
matrix for the ith TFBS with the sequence at position j and the P score profiles were calculated as follows:
the c g the Pscore profiles were calculated as follows:
then computing:

K is the length of the TFBS. nj+k e {A,T,C,G} is the nucle-
otide occurring in the sequence at position j+k. fk,,n is

the frequency of nucleotide nj+k at position k in the fre-

quency matrix (of the ith TF). fminn is the lowest fre-

quency and f7max is highest frequency across all
nucleotides at position k in the frequency matrix (of the
ith TF). I(k) is the information vector for the frequency
matrix, which reflects the degree of conservation at posi-
tion k of the matrix. Finally, M_score [i,j] is the normalized
Score [i,j]. Stormo [30-32] observed that logarithms of the

P_score[i, j] = Ci M_score[i, j]

C, is the negative natural logarithm of the p value of
M_score > = 0.75 for the ith TFBS.

Gaussian smoothing
To account for the change in the distances between/
among binding sites, a Gaussian smoothing was applied
to the P_score profiles with a variance of 9. Formally, the
G_score profiles were calculated as follows:

where 7 = 3 and k ranges from -7 to 7. In effect, a P_score
can spread 7 positions to both the right and the left due to
the Gaussian smoothing. Smoothed P_scores beyond 7
positions were ignored due to their small values.

Searching the conserved module in the long sequence
To identify a maximum match at the binding site profile
level, the short G_score profile was slid along the long
G_score profile. BLISS_score at position n is the matching

Page 11 of 13
(page number not for citation purposes)

BMC Bioinformatics 2006, 7:287

http://www. biomedcentral.com/1471-2105/7/287

score between the short profile and its corresponding
region of equal length (length of the short sequence) in
the long profile at position n:

f.4 BLISS_score:

BLISS score[n] =
M-1 L-1
( G score[i, j]* G_ score2[i, j + n]) /LengthOfShortSequence
i=0 j=0
where G_scorel is the G_score profile for the short
sequence and G_score2 is the G_score profile for the long
sequence; L is the length of the short sequence; n is the
current location where the short sequence is aligned to the
long sequence.

Large scale search of the simulated sequences, statistical
analysis
We used 10000 simulated sequence pairs generated by the
above method to calculate two BLISS score distributions.
The first is the BLISS_score distribution when the hypo-
thetical module in the short sequence is aligned with the
conserved module in the long sequence. The second is the
BLISS_score distribution when the module is aligned with
a non-module segment of the longer sequence. For each
pair of sequences, BLISS_scores were calculated at each
position as the short profile slid along the longer profile.
The peak matches (corresponding to the peaks in the score
profile) between each pair of sequences were evaluated to
see whether it aligned the embedded modules. If the
match did include the alignment of the modules, it was
designated a "true" match, and this BLISS _score was used
to calculate the distribution for the modules matching. All
of the other BLISS scores were used to calculate the distri-
bution for the module matching with the background
sequence.

Searching for the eve2 module in D. virilis and D.
mojavanis sequences
The GenBank [331 accession numbers for the S2E
sequences are AF042712(D. pseudoobscura) and
AF042709(D. melanogaster). We used BLISS to search
these two enhancers in 13 kb D. virilis and 14 kb D.
mojavanis sequences, in which S2E is hypothesized to be
located, but the specific location is unknown.

Ludwig et al. indicated that distances between TFBSs in
two clauses in S2E (region 134-275 and region 484-684
for D. melanogaster, region 196-376 and region 692-866
for D. pseudoobscura) were substantially conserved. We
removed those regions and searched for the modules in
13 kb D. virilis and 14 kb D. mojavanis sequences using
BLISS.

Website construction
BLISS was implemented using HTML/JSP/JavaBean and is
supported by an Apache Tomcat 5.5 server. It is publicly
available at: http://genel .ufscc.ufl.edu:8080/blissWeb!
index.html. The M_score profiles of TFBSs were calculated
based on the frequency matrix library collected by TRANS-
FAC 9.1. BLISS used DISLIN [34], a plotting library for dis-
playing data, to draw the match score plot in run time.

Authors' contributions
H. Meng is the principal undertaker of this project. A. Ban-
erjee provided guidance to HM in development of the
algorithms and worked on the manuscript. L. Zhou initi-
ated the project and provided general guidance to HM.
HM and LZ drafted the manuscript.

Acknowledgements
The authors are grateful for comments on the manuscript from Dr. Alberto
Riva. This work was supported in part by NIH CA095542 to LZ.

Publish with BioMed Central and every
scientist can read your work free of charge
"BioMed Central will be the most significant development for
disseminating the results of biomedical research in our lifetime."
Sir Paul Nurse, Cancer Research UK
Your research papers will be:
available free of charge to the entire biomedical community
peer reviewed and published immediately upon acceptance
cited in PubMed and archived on PubMed Central
yours you keep the copyright