If any of these programs are used, please cite "Park,
J. and Teichmann, S.A. DIVCLUS: an automatic method in the GEANFAMMER package
that finds homologous domains in single- or multi-domain proteins. (Bioinformatics,
14, 144-150)".

News on GEANFAMMER development (July 1998)

Geanfammer can now run PSI-BLAST and parse its output for clustering
(since version 1.5).

"GEANFAMMER" refers to a perl5 program*, a suite of perl5 programs, a
perl5 module or a perl5 subroutine library. These are all available by
anonymous ftp at cyrah.ebi.ac.uk. It has been developed for the analysis
of most of the complete bacterial genomes announced since 1995. It covers
the whole procedure of preparing statistically and biologically more
relevant protein (sequence) duplication modules before any further
biological analysis, such as structure and function assignment. With it,
anybody can easily analyse the level of duplication and the types of
sequence families in any genome or database.

* A very preliminary GUI version (Perl/Tk) is also available (older
version).

Breaking sequences down into domains is critically important: many protein
sequences are multi-domain, and automatic analysis of large numbers of
sequences can go seriously wrong if the sequences are not first broken
down into sequence domains.

Geanfammer uses FASTA or SSEARCH, which allow gaps in sequence comparison,
unlike the older BLASTP algorithm, which does not. It also uses the E-value
instead of the Z-score to increase sensitivity.

The program takes the protein sequences of one or two databases and creates
protein families. Each protein sequence database can be a whole genome,
part of a genome or any other protein sequence database in FASTA format.
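
As a quick check that an input file really is in FASTA format before a long
run, the records can be counted with a few lines of Python. This is only an
illustrative sketch and not part of the GEANFAMMER distribution: the function
name read_fasta and the two example entries are made up for the example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-entry database, for illustration only.
example = """>protA hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protB another made-up entry
MSDNELLK
""".splitlines()

records = list(read_fasta(example))
print(len(records), "sequences read")
```

For a real file, pass an open file handle instead of the example lines, for
instance read_fasta(open("MG.fa")).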

The protein sequence databases are compared to each other (or one database
is compared to itself) using one of the two sequence comparison programs of
the FASTA package. Using the output of the sequence comparison, the proteins
are clustered by single linkage. GEANFAMMER then divides the single linkage
clusters that contain unrelated sequences (due to multi-domain proteins)
using the DIVCLUS algorithm.
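
The single linkage step, and the reason it can merge unrelated families, can
be sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the GEANFAMMER code: the function single_linkage, the
sequence names and the E-values are all hypothetical, and a union-find
structure is used here simply because it is a common way to implement single
linkage.

```python
def single_linkage(hits, threshold):
    """Cluster names: any pair with E-value <= threshold is linked."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue in hits:
        ra, rb = find(a), find(b)  # registers both names, linked or not
        if evalue <= threshold and ra != rb:
            parent[ra] = rb

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    # Largest clusters first, members in alphabetical order.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Made-up pairwise hits: (sequence 1, sequence 2, E-value).
# "fusion" stands for a two-domain protein matching both families.
hits = [
    ("kinase1", "kinase2", 1e-30),
    ("kinase2", "fusion", 1e-12),
    ("fusion", "transporter1", 1e-9),
    ("transporter1", "kinase1", 5.0),
    ("orphan", "kinase1", 3.0),
]

for cluster in single_linkage(hits, 0.001):
    print(cluster)
```

In this toy input the kinases and the transporter share no significant match,
yet they end up in one cluster because both match the two-domain "fusion"
protein. This is the kind of cluster that then has to be divided.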

Finally, a sorted cluster
file containing the duplication
module families is created together with a summary file, which summarizes
the distribution of duplication module families.

An example run could be:

prompt> geanfammer.pl YOUR_GENOME.fa

The distribution includes a test FASTA format database
(geanfammer_test_FASTA_DB.fa), so you can see for yourself what the program
does before trying a bigger, real database. Just type:

prompt> geanfammer.pl geanfammer_test_FASTA_DB.fa

The final result will be "geanfammer_test_FASTA_DB.gclu".

Real Genome TEST!!

We have included the complete Mycoplasma genitalium genome (MG.fa), the
smallest one, in the distribution to play with. Depending on your choice of
E-value threshold, geanfammer should produce a domain-level clustering.

Try:

geanfammer.pl MG.fa E=0.2 e=0.2
or
geanfammer.pl MG.fa E=0.01 e=0.01

and see what it produces. E=0.2 will produce larger protein families because
the threshold is more generous toward weak matches. E=0.01 can be quite
reasonable; we used 0.001 for our genome analysis work to be very strict
(avoiding wrong clusters at the cost of losing distant but true members).
The search part of the program takes the most time. It creates a
subdirectory called MG in which the search results are stored. The final
results are written to the current directory, so it is a good idea to make a
new directory for the test and run geanfammer inside it.

The suite of perl5 programs essentially consists of the constituent parts
of the single GEANFAMMER program. A flow chart of the constituent programs
is also available.

Documentation of the individual programs follows, although details on usage
can be found in the header of each program:

The search scripts are do_fasta_sequence_search.pl,
do_ssearch_sequence_search.pl and do_sequence_search.pl. They take as
arguments a query database, a target database and the path and name of the
search program (FASTA or SSEARCH respectively). do_sequence_search.pl does a
self-against-self search by default.

Next, all the ssearch output format files (.msso files) generated by the
search script have to be converted to msp (matching sequence pair) format
files. This is done using sso_to_msp.pl, which has to be run in the
directory containing the sso files as follows: "sso_to_msp.pl *.sso e"
(where *.sso matches all sso files). The e option means that one msp file is
created for each sso file, rather than a single msp file for all sso files.
See the header of the program for further details of the options.

Next, the msp files can be read by msp_single_link.pl to create a file with
single linkage clusters of the data at a given expectation value (E-value)
threshold. msp_single_link.pl has a single argument, the E-value threshold.
By default, a single file "single_linkage.sclu" is created, unless another
name is given.

The msp files for each single linkage cluster must then be created by
sso_to_msp.pl. Here, sso_to_msp.pl must be run as follows:
"sso_to_msp.pl single_linkage.sclu".

The resulting msp files can then be put into DIVCLUS, which has its own
page. DIVCLUS makes .clu files containing the duplication module families.
There are two important options: e, the E-value threshold, and f, the
minimum percentage overlap region. Usually, e should be the same as the
single linkage threshold. f should be 3 (=67%) or higher (for clusters
larger than 100 sequences). DIVCLUS is run in the form:
"divclus.pl e=0.001 f=3 *_cluster.msp".
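
To give a feel for what a percentage overlap between two matched regions
means, here is a small Python sketch. It is an assumption-laden illustration
only: the function percent_overlap, the choice of measuring overlap relative
to the shorter region and the example coordinates are all hypothetical, and
the sketch does not reproduce how DIVCLUS itself computes overlap or how the
f value maps to a percentage.

```python
def percent_overlap(region_a, region_b):
    """Overlap of two inclusive (start, end) regions, as a percentage
    of the shorter region. Returns 0.0 if they do not overlap."""
    (a_start, a_end), (b_start, b_end) = region_a, region_b
    shared = min(a_end, b_end) - max(a_start, b_start) + 1
    if shared <= 0:
        return 0.0
    shorter = min(a_end - a_start + 1, b_end - b_start + 1)
    return 100.0 * shared / shorter


# Made-up match regions on the same sequence (residue numbers).
print(percent_overlap((1, 150), (51, 200)))   # partial overlap
print(percent_overlap((1, 100), (201, 300)))  # no overlap
print(percent_overlap((10, 100), (20, 90)))   # one region inside the other
```

Under this reading, two matches on a protein that overlap by less than the
chosen minimum would be treated as covering different regions, which is the
intuition behind splitting a multi-domain protein into separate modules.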

create_sorted_cluster.pl creates a single file, called
"sorted_cluster_file.gclu", containing all the families. It is simply run in
the directory with all the clu files produced by DIVCLUS.

make_clustering_summary.pl writes a summary of the families in
"sorted_cluster_file.gclu" to "sorted_cluster_file.summary". It also inserts
the summary into "sorted_cluster_file.gclu".

You can also download geanfammer from the CPAN site. However, it might not
be as up to date as the version on the ftp site above.

We are the programmers who made this, so we will do our best to tackle any
problems you have while using the program(s). However, there is no legal
guarantee against malfunction of any part of the package.


Free for non-profit academic research and educational purposes. The
copyright rule for Perl itself applies to the program(s). For commercial use
and collaboration please contact the authors.