Predictors:

PROSO

A sequence-based PROtein SOlubility evaluator

PROSO tries to answer the following question:

"Which of my cloned proteins have the best/worst chances to
be soluble upon heterologous expression?"

The prediction is based on a classifier exploiting subtle differences
between soluble proteins from TargetDB and PDB and notoriously
insoluble proteins from TargetDB and literature mining. For more
details, please read the reference or the brief description below.

We propose a machine-learning approach to sequence-based prediction of protein solubility in which we exploit subtle differences between soluble proteins from TargetDB and PDB and notoriously insoluble proteins from TargetDB and literature mining. The length distributions of the soluble and insoluble datasets were adjusted to avoid predictions biased by protein size. As the feature space for classification, we used frequencies of mono-, di-, and tripeptides represented by the original 20-letter amino acid alphabet as well as by several reduced alphabets in which amino acids were grouped by their physicochemical and structural properties. The classification algorithm was constructed as a two-layered structure in which the outputs of primary support vector machine classifiers operating on peptide frequencies were combined by a second-level Naive Bayes classifier. An overall prediction accuracy of 72% (75% on the positive (soluble) class and 69% on the negative (insoluble) class) was achieved in a 10-fold cross-validation experiment over data clustered at 50% sequence identity, indicating that the proposed algorithm may be a valuable tool for more efficient target selection in structural genomics. Furthermore, predicted solubility was shown to correlate very well with experimental results on protein solubility upon expression in E. coli. The classifier will correctly evaluate only proteins without transmembrane segments, as predicted by TMHMM 2.0.
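
The peptide-frequency features described above can be illustrated with a minimal Python sketch. It is not the PROSO implementation: the function names and the six-group reduced alphabet are hypothetical, chosen only to show the idea of counting overlapping mono-, di-, and tripeptides over the full and a reduced alphabet. The groupings actually used by PROSO are given in the reference.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the original 20-letter alphabet

# Hypothetical reduced alphabet, for illustration only
# (NOT the groupings used by PROSO).
REDUCED_GROUPS = {
    "h": "AVLIMC",  # aliphatic / hydrophobic
    "a": "FWYH",    # aromatic
    "p": "STNQ",    # polar, uncharged
    "+": "KR",      # positively charged
    "-": "DE",      # negatively charged
    "s": "GP",      # small / conformationally special
}
AA_TO_GROUP = {aa: g for g, members in REDUCED_GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group symbol.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    return "".join(AA_TO_GROUP[aa] for aa in seq.upper())


def peptide_frequencies(seq, k, alphabet):
    """Relative frequency of every overlapping k-mer over `alphabet`.

    Returns a dict with len(alphabet) ** k entries; k-mers that do not
    occur in `seq` get frequency 0.0.
    """
    total = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(max(total, 0)))
    freqs = {}
    for letters in product(alphabet, repeat=k):
        kmer = "".join(letters)
        freqs[kmer] = counts[kmer] / total if total > 0 else 0.0
    return freqs


if __name__ == "__main__":
    seq = "MKKLLVAG"  # toy sequence, not real data
    mono = peptide_frequencies(seq, 1, AMINO_ACIDS)
    reduced = reduce_sequence(seq)
    di_reduced = peptide_frequencies(reduced, 2, "".join(REDUCED_GROUPS))
    print(len(mono), len(di_reduced))
```

Each combination of alphabet and peptide length yields one feature vector per protein. In the two-layered scheme described above, each such vector would feed its own primary classifier, whose outputs are then combined at the second level.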

The input protein sequences are categorized into two classes: YES - soluble; NO - insoluble.
Additionally, the probability of the predicted class (from 0.5 to 1.0) is provided.
The probability threshold is set to 0.5 by default. Increasing it should give
higher classification precision (selectivity) at the cost of lower recall (sensitivity).
If the result cannot be calculated, a comment is written instead.
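
The effect of the threshold can be shown with a small Python sketch. It assumes that a protein is called soluble (YES) only when its predicted probability of being soluble reaches the threshold; this page does not spell out how PROSO applies the threshold internally, so treat that rule, the function name, and the toy scores below as hypothetical.

```python
def precision_recall(scored, threshold=0.5):
    """Precision and recall for the soluble (YES) class.

    `scored` is a list of (p_soluble, truly_soluble) pairs.
    Assumption: a protein is called YES when p_soluble >= threshold.
    """
    tp = fp = fn = 0
    for p_soluble, truly_soluble in scored:
        called_soluble = p_soluble >= threshold
        if called_soluble and truly_soluble:
            tp += 1
        elif called_soluble and not truly_soluble:
            fp += 1
        elif truly_soluble:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up scores for illustration; not PROSO output.
TOY = [
    (0.95, True), (0.85, True), (0.75, False), (0.65, True),
    (0.55, False), (0.45, True), (0.30, False), (0.10, False),
]

if __name__ == "__main__":
    for t in (0.5, 0.8):
        p, r = precision_recall(TOY, t)
        print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.8 moves precision from 0.60 to 1.00 while recall drops from 0.75 to 0.50. This is the trade-off described above.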