Hidden Markov models and their application to genome analysis in the context of protein structure

Abstract

The bulk of the thesis is concerned with the application of hidden Markov models (HMMs) to remote protein homology detection. The thesis addresses how best to utilise HMMs and then uses them to analyse all completely sequenced genomes. There is a structural perspective to the work, and a section on three-dimensional protein structure analysis is included.

The Structural Classification of Proteins (SCOP) database forms the basis of the structural perspective. SCOP is a hierarchical database of protein domains classified by their structure, sequence and function. The main aim of the work is to use HMMs to classify all protein sequences produced by the genome projects (including human) into their constituent structural domains. To do this, HMMs were built to represent all proteins of known structure; the thesis examines how best to build them and describes the construction of a library of models.

Structural domain assignments to the genome sequences were generated by scoring the model library against the genomes and then selecting the most probable domain architecture for each sequence. The genome assignments, made within the framework of the evolution-based SCOP classification, make it possible to study evolution at the molecular level as well as at the level of the whole organism.
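
The selection of a domain architecture from scored model hits can be illustrated with a minimal sketch. It is not the procedure used in the thesis, which the abstract does not specify. It assumes, purely for illustration, that each hit is a residue range with a score where higher is better, that domains in an architecture may not overlap, and that the best architecture is the non-overlapping set of hits with the highest total score. The names `Hit` and `best_architecture` are hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One model-to-sequence match (hypothetical record layout)."""
    start: int    # first matched residue, 1-based, inclusive
    end: int      # last matched residue, inclusive
    model: str    # identifier of the model that matched
    score: float  # assumed: higher means a more probable match


def best_architecture(hits: List[Hit]) -> List[Hit]:
    """Return the non-overlapping hits with the highest total score.

    Assumption: overlap of even one residue is disallowed. This is
    weighted interval scheduling solved by dynamic programming.
    """
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = (total score, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, hit in enumerate(ordered):
        # j = number of earlier hits that end before this hit starts
        j = bisect_right(ends, hit.start - 1, 0, i)
        take_score = best[j][0] + hit.score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [hit]))
        else:
            best.append(best[i])
    return best[-1][1]


if __name__ == "__main__":
    example = [
        Hit(1, 100, "model_a", 50.0),
        Hit(80, 200, "model_b", 60.0),
        Hit(101, 220, "model_c", 30.0),
    ]
    for hit in best_architecture(example):
        print(hit.model, hit.start, hit.end)
```

In this toy example the two compatible hits are preferred over the single highest-scoring hit because their combined score is larger. A real assignment procedure would also need significance thresholds and a policy for small overlaps, both of which are omitted here.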

The model library and genome assignments have been made public via the SUPERFAMILY database. The data and services provided have been used for a growing number of projects, several of which have already led to publication.