Using generalized profiles for functional
annotation of genome sequences

Philipp Bucher, Kay Hofmann

Generalized profiles are computer-searchable descriptions of
sequence families, domains, motifs, and other elementary
components of genetic information; they can also be interpreted
as hidden Markov models (HMMs) of a particular architecture.
Generalized profiles or HMMs are among the most effective tools
for identifying and characterizing highly divergent protein
homology domains, as judged by the following criteria: (i)
discrimination of true members from chance matches, (ii) accurate
definition of domain boundaries, and (iii) correctness of
profile-generated multiple sequence alignments. Their excellent
performance with regard to these criteria makes comprehensive
collections of profiles or HMMs extremely useful tools for
automatic sequence annotation. This will be illustrated by a
whole-genome application using the PROSITE profile and PFAM HMM
libraries. The talk will also address a number of important
technical issues related to the application of generalized
profiles. Several protocols to derive profiles from initial data
will be described along with a comparative evaluation of their
performance with respect to the above criteria. In addition, new
solutions to the problem of estimating the statistical
significance of profile matches will be presented that take into
account various non-random properties of biological sequence
sets (e.g. compositional bias, periodicities, and subfamily
over-representation), which are known to invalidate statistical
tests based on simple random models.
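
As a rough illustration of why simple random models can mislead,
the following Python sketch estimates an empirical p-value for a
match by shuffling the query sequence, which preserves its residue
composition. It is not the method presented in the talk. The
ungapped position-specific score table stands in for a full
generalized profile (which also carries position-specific gap
penalties), and the names best_window_score and shuffle_p_value,
the default mismatch score of -1.0, and the toy motif are all
assumptions made for this example.

```python
import random


def best_window_score(profile, seq, mismatch=-1.0):
    """Best ungapped score of `profile` over all windows of `seq`.

    `profile` is a list of dicts, one per position, mapping a residue
    to its score; residues not listed get `mismatch` (an assumed default).
    """
    width = len(profile)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        sum(col.get(res, mismatch) for col, res in zip(profile, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


def shuffle_p_value(profile, seq, n_shuffles=1000, seed=0):
    """Empirical p-value of the best match, using composition-preserving shuffles.

    Returns (observed_score, p_value). The p-value is the fraction of
    shuffled sequences scoring at least as well as the original, with
    the usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = best_window_score(profile, seq)
    residues = list(seq)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same residues, new order
        if best_window_score(profile, "".join(residues)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical three-column motif, for illustration only.
toy_profile = [{"C": 2.0}, {"H": 2.0}, {"C": 2.0}]
score, p = shuffle_p_value(toy_profile, "AAACHCAAA", n_shuffles=200)
print(score, p)
```

A compositionally biased sequence shows the effect directly: for
a two-column profile that rewards C at each position, the query
"CCCCCC" scores as well after every shuffle as before, so the
shuffle-based p-value is 1, whereas a model that assumes uniform
residue frequencies would rate the same match as unlikely.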