Identification of disordered proteins

The conventional view of protein structure, built up over thirty years of X-ray structure determination, treats the atoms of a protein as fixed in space relative to one another, in a complex shape containing regions of local structure such as α-helices and β-sheets. This view is undoubtedly correct for many proteins; however, it is clear that large numbers of proteins do not exhibit a fixed structure [1]. Initially, such proteins were identified because regions of the structure were not visible in X-ray crystallographic diffraction experiments, a consequence of either static or dynamic disorder [2,3]. The development of NMR methods for investigating protein structures in solution has made the experimental detection of structural disorder relatively simple, even for large proteins [4]. Factors including the complexity of the amino acid sequence and the content of charged and non-polar residues are major determinants of disorder [1-3,5]. The application of such considerations to entire genome sequences with bioinformatic tools has led to the suggestion that the majority of proteins in some eukaryotes, including humans, have disordered domains of up to fifty amino acids in length [2,3,6]. Intrinsically disordered, or "natively unfolded", proteins may be common because disorder leads to a high conformational entropy, which is suggested to be advantageous for a protein searching for its interaction partner. Indeed, many natively unfolded proteins fold into an ordered structure on binding to a partner [7,8], and yeast proteins with disordered regions of seventy or more contiguous residues have more protein-protein interaction partners than other proteins [9].
The aim of this work is to investigate the use of modern machine learning methods, such as the support vector machine (SVM), artificial neural network (ANN) and relevance vector machine (RVM), for identifying regions of proteins that lack a fixed structure, based on amino acid sequence data. Kernel learning methods, such as the SVM, appear especially promising for work of this nature because they can operate directly on structured data, such as graphs, trees or, in this case, sequences.
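To illustrate how a kernel can operate directly on sequence data, the following minimal sketch computes a k-mer "spectrum" kernel between two amino acid sequences. It is an illustration only: the spectrum kernel is one common choice of string kernel and is not necessarily the kernel used in this work, and the function names `kmer_counts` and `spectrum_kernel`, as well as the default `k=3`, are assumptions introduced here.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=3):
    """Inner product of the k-mer count vectors of two sequences.

    Hypothetical helper for illustration: the similarity is the sum, over
    all k-mers shared by both sequences, of the product of their counts.
    """
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # Counter returns 0 for k-mers absent from seq_b, so they contribute nothing.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def gram_matrix(sequences, k=3):
    """Pairwise kernel values for a list of sequences (symmetric matrix)."""
    return [[spectrum_kernel(a, b, k) for b in sequences] for a in sequences]
```

A Gram matrix built in this way could, in principle, be passed to any SVM implementation that accepts a precomputed kernel, so no fixed-length numerical feature vector needs to be constructed explicitly for each sequence.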

Acknowledgements

This work was supported by a discipline-hopping grant made jointly by the U.K. Medical Research Council (MRC), Engineering and Physical Sciences Research Council (EPSRC) and Biotechnology and Biological Sciences Research Council (BBSRC), administered by the MRC (Grant number 67192 - "Predicting protein disorder with advanced machine learning tools").