PhyloGibbs

NEW: A significant enhancement, PhyloGibbs-MP, is now available. It is in late beta but nearly final, and PhyloGibbs-1.0 (the version below, released with the original paper) is now deprecated. Please consider using PhyloGibbs-MP instead.

The original version of PhyloGibbs is being maintained at Erik van Nimwegen's group in Basel and is available here. It, too, differs in some ways from PhyloGibbs-1.0. A web interface is available here.

PhyloGibbs is a motif finder that identifies binding sites for
transcription factors in cis-regulatory DNA sequences. It is based on
the Gibbs sampling algorithm, with the following enhancements:

If sequences from closely related species are used, it
systematically accounts for non-functional conservation due to
phylogeny and modifies the scoring accordingly. In tests on
synthetic and real genomic data, we find that this approach
significantly increases specificity to known binding sites. Input
sequences need to be preprocessed by a multiple alignment program
and presented in "aligned fasta" or "multi-fasta" format. We have
developed an alignment program, Sigma, designed for non-coding DNA,
and we also recommend Dialign, but other programs may be used.
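
As a rough illustration of what aligned input looks like, the Python sketch below reads FASTA text and lists the alignment columns that are identical across all species. It assumes that an aligned file pads every sequence to the same length with "-" gap characters; the exact conventions PhyloGibbs expects are described in the README, so treat this layout as an assumption. The function names (read_fasta, is_gap_padded_alignment, conserved_columns) and the species labels are hypothetical and are not part of PhyloGibbs.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_gap_padded_alignment(records):
    """True if all sequences have equal length (assumed alignment convention)."""
    lengths = {len(seq) for _, seq in records}
    return len(records) > 0 and len(lengths) == 1


def conserved_columns(records, gap="-"):
    """Indices of columns with the same base in every sequence and no gaps."""
    seqs = [seq.upper() for _, seq in records]
    return [
        i for i, col in enumerate(zip(*seqs))
        if gap not in col and len(set(col)) == 1
    ]


# Toy alignment of three hypothetical species (not real genomic data).
example = """>species_a
ACGT-ACG
>species_b
ACGTTACG
>species_c
ACCT-ACG
"""

records = read_fasta(example)
aligned = is_gap_padded_alignment(records)
conserved = conserved_columns(records)
print(aligned, conserved)
```

Identical columns like these are exactly what can arise from shared ancestry alone, which is why raw conservation in closely related species is not, by itself, evidence of a functional site.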

It bypasses the problems of estimating the number of motifs in the
sequence, and of assessing significance of found motifs, by using a
two-stage motif-finding strategy: the first "simulated annealing"
phase finds a few high-quality groups of binding sites representing a
few different motifs, and the second "tracking" phase keeps statistics
on how much these groups hang together and what other sites get
co-clustered with them. The output is a list of putative binding
sites (not limited by the initial guess) and the fraction of time they
were co-clustered in that group (the most direct measure of their
significance).
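
The co-clustering statistic reported by the tracking phase can be sketched as follows. This is a simplified Python illustration, not the PhyloGibbs implementation: it assumes each sampled configuration is given as a list of sets of site labels, and it matches a sample's cluster to the reference group by largest overlap, which is an assumed matching rule. The function name cocluster_fractions and the site labels are hypothetical.

```python
from collections import Counter


def cocluster_fractions(reference, samples):
    """For each site, the fraction of samples in which it sat in the
    cluster that best overlaps the reference group."""
    counts = Counter()
    for clusters in samples:
        # Pick the sampled cluster sharing the most sites with the reference.
        best = max(clusters, key=lambda c: len(c & reference), default=set())
        if best & reference:
            counts.update(best)
    n = len(samples)
    return {site: counts[site] / n for site in counts}


# Reference group, as might come out of the annealing phase (toy labels).
reference = {"s1", "s2", "s3"}

# Four toy configurations visited during tracking.
samples = [
    [{"s1", "s2", "s3"}, {"s7"}],
    [{"s1", "s2", "s4"}, {"s3", "s7"}],
    [{"s1", "s2", "s3", "s4"}],
    [{"s5"}, {"s1", "s3"}],
]

fractions = cocluster_fractions(reference, samples)
for site in sorted(fractions):
    print(site, fractions[site])
```

In this toy run, site s4 was not in the reference group but still receives a score of 0.5, which mirrors the point that the reported sites are not limited by the initial guess.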

The code:
Version 1.0, the final release of PhyloGibbs, is now available.
A webserver where one can submit PhyloGibbs jobs is also planned.
The last feature-complete snapshot on this page (November 15, 2005)
had a small string-handling bug when parsing the "-L" option, which
apparently showed up only on very new Linux systems. This is the only
fix in 1.0, so if "-L" works for you in the 20051115 version, you do
not need to upgrade. Further development continues and will be made
available in later versions.

Source code tarball (185 KB), including instructions on compiling and
usage (start with README) and example output.
Requires the GSL and glib libraries and headers to be installed
(standard on most Linux systems). Should compile on most Unix-like
systems, and on Microsoft Windows in the Cygwin environment.