Abstract

The process of protein structure prediction is a crucial part of understanding the function of the building blocks of life. It is based on the approximation of a protein free energy that is used to guide the search through the space of protein structures towards the thermodynamic equilibrium of the native state. A function that gives a good approximation of the protein free energy should be able to estimate the structural distance of the evaluated candidate structure to the protein native state. This correlation between the energy and the similarity to the native is the key to high quality predictions.

State-of-the-art protein structure prediction methods use very simple techniques to design such energy functions. The individual components of the energy functions are created by human experts using statistical analysis of common structural patterns that occur in the known native structures. The energy function itself is then defined as a simple weighted sum of these components. The exact values of the weights are set by maximising the correlation between the energy and the similarity to the native, measured as the root mean square deviation between the coordinates of the protein backbone.
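As a rough illustration of this conventional scheme, the following minimal sketch computes a weighted-sum energy for a few candidate structures and picks the weights that maximise its Pearson correlation with the root mean square deviation. The data values, the two unnamed energy terms, the coarse weight grid, and all function names are assumptions made purely for illustration. The RMSD helper assumes the two coordinate sets are already superimposed, and real methods use far more elaborate optimisers than a grid search.

```python
import itertools
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) points.

    Assumes the structures are already optimally superimposed.
    """
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))


def weighted_energy(terms, weights):
    """Energy as a simple weighted sum of its components."""
    return sum(w * t for w, t in zip(weights, terms))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical data: two energy terms per candidate structure, plus the
# RMSD of each candidate to the native structure.
decoy_terms = [(1.0, 4.0), (2.0, 3.0), (3.0, 1.0), (4.0, 2.0)]
decoy_rmsd = [1.0, 2.0, 3.0, 4.0]

# Coarse grid search over the weights (illustrative only); the all-zero
# vector is skipped because a constant energy has no defined correlation.
candidates = [w for w in itertools.product((-1, 0, 1), repeat=2) if any(w)]


def score(weights):
    energies = [weighted_energy(t, weights) for t in decoy_terms]
    return pearson(energies, decoy_rmsd)


best_weights = max(candidates, key=score)
best_corr = score(best_weights)
```

In this toy data set the first term alone tracks the RMSD perfectly, so the search assigns the second term a weight of zero.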

In this dissertation I argue that this process is oversimplified and could be improved on at least two levels. Firstly, a more complex functional combination of the energy components might be able to reflect the similarity more accurately and thus improve the prediction quality. Secondly, a more robust similarity measure that combines different notions of the protein structural similarity might provide a much more realistic baseline for the energy function optimisation.

To test these two hypotheses I have proposed a novel approach to the design of energy functions for protein structure prediction, using a genetic programming algorithm to evolve the energy functions and a structural similarity consensus to provide a reference similarity measure. The best evolved energy functions were found to reflect the similarity to the native better than the optimised weighted sum of terms, thereby opening a new and interesting area of research for machine learning techniques.
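To make the contrast with a weighted sum concrete, the sketch below shows one common way genetic programming represents a candidate function: as an expression tree over the energy terms. The term names (`vdw`, `hbond`, `solvation`), the operator set, and the nested-tuple encoding are hypothetical choices made for this illustration and are not taken from the dissertation.

```python
import operator

# Hypothetical function set available to the evolved expressions.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(tree, terms):
    """Evaluate an expression tree for one candidate structure.

    A tree is a term name (str), a numeric constant, or a tuple
    (operator_name, left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return terms[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, terms), evaluate(right, terms))


# A weighted sum is just one particular tree shape ...
weighted_sum = ("add", ("mul", 0.5, "vdw"), ("mul", 2.0, "hbond"))

# ... whereas an evolved tree may combine the terms non-linearly.
evolved = ("mul", "vdw", ("sub", "hbond", "solvation"))

# Hypothetical term values for a single candidate structure.
terms = {"vdw": 2.0, "hbond": 3.0, "solvation": 1.0}

weighted_sum_energy = evaluate(weighted_sum, terms)
evolved_energy = evaluate(evolved, terms)
```

Under this encoding the weighted sum is a special case of the search space, so an evolutionary search over trees can in principle recover it or move beyond it.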