We developed CompareProspector to take advantage of
comparative genomics information to aid sequence motif finding.
CompareProspector is built upon BioProspector (Liu X et al, 2000), which is an extension of the
original Gibbs sampler (Liu JS et al, 1995) with improved flexibility and performance.

CompareProspector takes as input a list of sequences from one species that are
predicted to share common regulatory element(s). Such sequences can be obtained
from high-throughput genomics techniques such as gene expression profile
clustering or chromatin immunoprecipitation followed by microarray (ChIP-chip).
It also takes as input a list of percent identity values representing the
cross-species conservation of each nucleotide.

During the Gibbs sampling iterations,
CompareProspector biases the motif finding towards sequences conserved across
species. First, the user can specify two window percent identity (WPID)
thresholds, Tch (high conservation threshold) and Tcl (low conservation threshold).
In BioProspector, a site score Ax is calculated for every site x
in the input sequence as the ratio of the probability of generating x
from the motif model over the probability of generating x from the
background distribution. A new site is sampled with probability proportional to
Ax. In CompareProspector, during initial iterations of Gibbs sampling,
only positions whose WPID values are above Tch are sampled.
Subsequently, the WPID cutoff is gradually decreased from Tch to Tcl
to allow sampling of less conserved positions. The new site score A'x is
weighted by sequence conservation (A'x = Ax × WPIDx, WPIDx being the
WPID of site x) to favor sampling of more conserved sequences. Sequences
without orthologs are assigned Tcl as the WPIDx for all x,
so they participate in sampling only
in later iterations. Finally,
in the original BioProspector, sites with a high enough score Ax are
automatically added to the motif without sampling. CompareProspector restricts
automatic additions to sites whose WPIDs are above Tch.
This step further down-weights the influence of divergent sites and sequences
without orthologs. The output of CompareProspector includes a list of
highest-scoring motifs as position-specific probability matrices, the individual
sites used to construct each motif, and the locations of the sites on the input
sequences.
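
The conservation-weighted sampling step described above can be illustrated
with a minimal Python sketch. It is not the CompareProspector implementation.
The function names, the linear cutoff schedule, the `hold` fraction, the
`score_threshold` parameter, the use of one WPID value per site start position,
and the treatment of "above" a threshold as "greater than or equal to" are all
assumptions made for illustration. The source states only that the cutoff
decreases gradually from Tch to Tcl, that A'x = Ax × WPIDx, and that automatic
additions are limited to sites with WPID above Tch.

```python
import random

def site_score(site, motif, background):
    """A_x: probability of the site under the motif model divided by
    its probability under the background distribution."""
    score = 1.0
    for pos, base in enumerate(site):
        score *= motif[pos][base] / background[base]
    return score

def wpid_cutoff(iteration, n_iterations, t_high, t_low, hold=0.2):
    """Assumed schedule: keep the cutoff at t_high for the first `hold`
    fraction of iterations, then lower it linearly to t_low."""
    hold_iters = int(hold * n_iterations)
    if iteration < hold_iters:
        return t_high
    remaining = max(n_iterations - 1 - hold_iters, 1)
    frac = min((iteration - hold_iters) / remaining, 1.0)
    return t_high - frac * (t_high - t_low)

def weighted_scores(sequence, wpid, motif, background, cutoff):
    """A'_x = A_x * WPID_x for every candidate start position.
    Positions whose WPID is below the current cutoff get weight 0."""
    width = len(motif)
    scores = []
    for start in range(len(sequence) - width + 1):
        if wpid[start] < cutoff:
            scores.append(0.0)
        else:
            a_x = site_score(sequence[start:start + width], motif, background)
            scores.append(a_x * wpid[start])
    return scores

def sample_site(scores, rng):
    """Draw a start position with probability proportional to its score.
    Returns None when no position passes the cutoff."""
    if sum(scores) == 0:
        return None
    return rng.choices(range(len(scores)), weights=scores, k=1)[0]

def auto_add(a_x, wpid_x, score_threshold, t_high):
    """Automatic addition without sampling: the site needs a high enough
    A_x (hypothetical score_threshold) and a WPID of at least t_high."""
    return a_x >= score_threshold and wpid_x >= t_high
```

Under these assumptions, a sequence without orthologs would be given a WPID
list filled with Tcl, so all of its positions receive weight 0 until the
cutoff returned by `wpid_cutoff` has fallen to Tcl, and `auto_add` never
accepts any of its sites.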