UCL Department of Computer Science: Bioinformatics Group

The DISOPRED2 Disorder Prediction Server

Native disorder explained

The majority of water-soluble proteins have structures that are globular and relatively static. However, some proteins have regions that are natively disordered. Disordered regions are flexible, dynamic and can be partially or completely extended in solution. Native disorder also exists in global structures such as extended random coil proteins with negligible secondary structure or molten globules, which have regular secondary structure elements but have not condensed into a stable globular fold.

The primary function of disorder appears to be molecular recognition of proteins and nucleic acids. It has been speculated that the multiple metastable conformations adopted by disordered binding sites allow recognition of several targets with high specificity and low affinity. Order-to-disorder transitions also provide a mechanism for controlling protein concentration via proteolytic degradation.

Disordered regions are often characterised by low sequence complexity, compositional bias away from aromatic residues and toward hydrophilic residues, and high flexibility. The absence of a static structure means that disordered residues do not appear in the electron density maps obtained by X-ray crystallography, but they can be investigated using other techniques such as circular dichroism and NMR spectroscopy.

Why is prediction of native disorder important?

Conservative estimates of the disorder frequencies in complete genomes suggest that disordered residues account for around one-fifth of a typical eukaryote proteome, and that one-third of eukaryotic proteins contain a contiguous disordered region with length greater than thirty amino acids. These proteins are involved in important regulatory processes such as transcription and cytokinesis.

It has also been shown experimentally that disorder is involved in cell cycle regulation and endocytosis. The molecular recognition of DNA by disordered peptides is implicated in control of gene expression by transcription, epigenetic modifications and gene silencing (see Figure 1). Disorder is also associated with signalling processes that involve the protein kinases and the small GTPases. References and further discussion of the functionality of native disorder can be found in the article cited at the bottom of this page.

Figure 1

The image on the left shows a transcription factor bound to DNA (1gt0); the regions of the protein that are predicted to be disordered are coloured yellow. The image on the right shows histone proteins bound to DNA (1kx5); predicted disordered regions are represented by the space-filled structures coloured grey. The disordered regions that project outside the complex were found to have zero occupancy in the file from the Protein Data Bank.

How does DISOPRED2 work?

DISOPRED2 was trained on a set of around 750 non-redundant sequences with high-resolution X-ray structures. Residues were classed as disordered if they appear in the sequence records but have no coordinates in the structure, that is, they are missing from the electron density map. This is an imperfect means of identifying disordered residues, as missing coordinates can also arise as an artifact of the crystallization process. False assignment of order can also occur as a result of stabilizing interactions with ligands or other macromolecules in the complex. However, this is the simplest means of defining disorder in the absence of further experimental investigation of the protein.
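The labelling rule described above can be illustrated with a minimal Python sketch. The function name, the input format (a sequence string plus the 0-based indices of residues that have coordinates) and the toy data are assumptions made for illustration; they are not taken from the DISOPRED2 code.

```python
def label_disorder(sequence, observed_positions):
    """Return one label per residue: 'D' (disordered) or 'O' (ordered).

    sequence: full sequence as given in the sequence records.
    observed_positions: 0-based indices of residues that have
        coordinates in the structure (hypothetical input format).
    """
    observed = set(observed_positions)
    return "".join("O" if i in observed else "D" for i in range(len(sequence)))


# Toy example: assume residues 0-2 and 9 have no coordinates.
seq = "MSTNPKPQRK"
labels = label_disorder(seq, range(3, 9))
print(labels)  # DDDOOOOOOD
```

Note that a rule of this kind inherits the caveats above: it cannot distinguish true disorder from coordinates lost for other reasons.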

A sequence profile was generated for each protein using a PSI-BLAST search against a filtered sequence database. The input vector for each residue was constructed from the profiles of a symmetric window of fifteen positions centred on that residue. The data were used to train linear support vector machines (SVMs). The SVM controls overfitting by ensuring that the decision surface separates the two classes with a large margin. An example linear decision surface that separates two classes in 2D (solid line) is shown in Figure 2. The circled points denote the support vectors, which lie on the margin (dashed lines).

Figure 2
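The construction of a windowed input vector from a sequence profile can be sketched as follows. This is an illustration only: the 20-column toy profile, the function name and the use of zero padding beyond the chain termini are assumptions, and the actual DISOPRED2 encoding may differ.

```python
def window_features(profile, center, window=15):
    """Concatenate the profile rows in a symmetric window around `center`.

    profile: list of per-residue rows, each a list of numeric scores.
    center: 0-based index of the residue being encoded.
    window: odd window length (fifteen positions, as in the text).
    """
    half = window // 2
    n_cols = len(profile[0])
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            # Assumption: positions beyond the termini are zero-padded.
            features.extend([0.0] * n_cols)
    return features


# Toy profile: 30 residues x 20 columns, where row i is filled with the value i.
# A real profile would come from a PSI-BLAST search.
profile = [[float(i)] * 20 for i in range(30)]
vec = window_features(profile, center=0)
print(len(vec))  # 300
```

Each residue therefore yields one fixed-length vector (15 positions x 20 columns in this toy case), which is the form a linear SVM expects.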

The unbalanced class frequencies in the training set (a ratio of approximately 19:1) can result in classifiers that output the majority class exclusively, since this optimizes overall accuracy. This is avoided in the training of DISOPRED2 by placing a greater cost on points in the minority (disordered) class that violate the margin than on points in the majority (ordered) class. Adjustment of the classifier decision threshold is discussed in the help pages.
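The effect of class-specific costs can be seen in a small Python sketch of the margin-violation (hinge) penalty. Only the penalty term is computed here, not the full SVM objective, which also includes the margin term. The function name, the toy one-dimensional data and the cost values are assumptions for illustration; the costs actually used to train DISOPRED2 are not stated here.

```python
def weighted_hinge_cost(w, b, points, labels, cost_disordered, cost_ordered):
    """Sum of margin-violation penalties with a separate cost per class.

    labels: +1 for disordered (minority), -1 for ordered (majority).
    """
    total = 0.0
    for x, y in zip(points, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        violation = max(0.0, 1.0 - y * score)  # zero if outside the margin
        total += (cost_disordered if y == 1 else cost_ordered) * violation
    return total


# Toy data mirroring a 19:1 imbalance: 19 ordered points, 1 disordered point.
points = [[-2.0]] * 19 + [[2.0]]
labels = [-1] * 19 + [1]

# A classifier that always outputs "ordered" (w = 0, b = -1).
equal = weighted_hinge_cost([0.0], -1.0, points, labels, 1.0, 1.0)
weighted = weighted_hinge_cost([0.0], -1.0, points, labels, 19.0, 1.0)
# A classifier that separates both classes with a margin (w = 1, b = 0).
separating = weighted_hinge_cost([1.0], 0.0, points, labels, 19.0, 1.0)

print(equal, weighted, separating)  # 2.0 38.0 0.0
```

With equal costs, ignoring the single disordered point is cheap; raising the cost on the minority class makes that trivial solution far more expensive relative to a surface that separates both classes.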