Pattern Recognition Tools for Protein
Sequences

Michael Gribskov

Genes are commonly identified in genomic sequence based on the
similarity of their sequences to known genes. While this is often
(relatively) straightforward, in many cases the similarity is so
weak that even the assignment of a new gene to a protein family
may be less than compelling. One
approach to these difficult situations is to add information from
an entire protein family to the analysis. Combinations of
supervised and unsupervised learning, and of fixed-length and
gapped comparisons, provide powerful tools when working with
distantly related sequences and sequence families. These tools can be
applied to gene identification, family classification, and
candidate gene analysis.