3
Transmembrane Protein Topology
The topology of a transmembrane protein describes which regions are membrane-spanning and which are 'inside' or 'outside' (e.g. cytoplasmic/extracellular or cytoplasmic/lumenal). It is defined by the number and position of the TM helices and the position of the N-terminus.

4
Early Hydrophobicity-based Approaches
To generate data for a hydrophobicity plot, the protein sequence is scanned with a moving window of 19-21 residues. At each position, the mean hydrophobic index of the amino acids within the window is calculated, and that value is plotted at the midpoint of the window.
Example sequence (Aquaporin): KGVWTQAFWKA V TAEFLAMLIFVLLSVGSTINWGGSEN
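The moving-window calculation described above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular tool. The slide does not name a hydrophobicity scale, so the Kyte-Doolittle scale is assumed here. The function name `hydropathy_profile` and the restriction to odd window sizes (so that the window has an exact midpoint) are also assumptions made for the example.

```python
# Kyte-Doolittle hydropathy index (assumed scale; the slide does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return (midpoint_index, mean_hydropathy) pairs for each window.

    midpoint_index is the 0-based position of the central residue.
    Only odd window sizes are accepted so the midpoint is a single residue.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd to have a single midpoint")
    if window > len(sequence):
        raise ValueError("window is longer than the sequence")

    half = window // 2
    profile = []
    for start in range(len(sequence) - window + 1):
        residues = sequence[start:start + window]
        mean = sum(KYTE_DOOLITTLE[aa] for aa in residues) / window
        profile.append((start + half, mean))
    return profile
```

To run it on the example fragment above, strip the spaces first, e.g. `hydropathy_profile("KGVWTQAFWKA V TAEFLAMLIFVLLSVGSTINWGGSEN".replace(" ", ""))`. Stretches where the mean stays high are the candidate membrane-spanning regions.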

7
Using Support Vector Machines for Topology Prediction
Recently, more advanced methods using machine learning algorithms such as hidden Markov models (e.g. TMHMM, PHOBIUS) and neural networks (MEMSAT3) have been developed. They have achieved significant improvements in prediction accuracy (~80%). However, none of the top-scoring methods uses SVMs.
While hidden Markov models and neural networks may have multiple outputs, SVMs are binary classifiers. To handle TM topology prediction, multiple SVMs have to be combined, e.g.
TM helix / Loop
Inside Loop / Outside Loop
Signal Peptide / ¬Signal Peptide
Re-entrant Loop / ¬Re-entrant Loop

8
Assembling a Novel Data Set of Transmembrane Proteins
To study and predict features of transmembrane (TM) proteins, a high-quality data set of sequences with experimentally confirmed TM regions is essential for both training and validation.
The set is based on the Möller set and the MPTOPO database.
Novel TM sequences were parsed from SWISS-PROT and searched with BLAST against the PDB.
Fragments, chain breaks, colicins, venoms etc. were removed.
The set was homology-reduced at 40% sequence identity.
Topologies were determined by OPM or PDB_TM.
Since PDB structures of TM proteins contain no lipid, theoretical approaches are used to predict the position of the membrane relative to the structure, and thus the TM helix boundaries.
OPM uses water-lipid transfer energy minimisation.
PDB_TM uses hydrophobicity/structural feature analysis.
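The homology-reduction step can be illustrated with a simple greedy filter. This is only a sketch under stated assumptions: the slide does not say which algorithm or tool was used. The names `homology_reduce` and `toy_identity` are hypothetical, as is the choice to drop a sequence when its identity to a kept sequence exceeds (rather than equals) the threshold. A real pipeline would derive identities from pairwise alignments rather than the toy position-by-position comparison used here.

```python
def toy_identity(a, b):
    """Toy identity: matching positions divided by the longer length.

    A stand-in for an alignment-based identity, for illustration only.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def homology_reduce(sequences, identity=toy_identity, threshold=0.40):
    """Greedily keep sequences whose identity to every kept one is <= threshold."""
    kept = []
    for seq in sequences:
        if all(identity(seq, other) <= threshold for other in kept):
            kept.append(seq)
    return kept
```

Because the filter is greedy, the result depends on input order: sequences earlier in the list take priority over later ones.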

9
Assembling a Novel Data Set of Transmembrane Proteins
Theoretical membrane placement onto the crystal structure of the mechanosensitive channel protein MscS (PDB code 2oau) by OPM (left) and PDB_TM (right). The membrane region is between the red and blue bars.

15
Dynamic Programming
A simplified version of the original MEMSAT algorithm was used, treating TM helices as discrete units rather than separating them into inside, outside and middle components. Re-entrant helix and signal peptide states were added. Residues were therefore predicted to lie in one of five topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide.
To evaluate signal peptide preference, positive signal peptide scores for residues up to position 30 in a target sequence were added to the outside loop score and subtracted from the inside loop score, in order to direct prediction towards a non-cytoplasmic amino terminus. The value was also scaled by a factor of 10 and subtracted from the TM helix SVM score to prevent TM helix prediction. For the same reason, positive re-entrant helix scores were scaled by a factor of 10 and subtracted from the TM helix SVM score.
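The score adjustments described above might look something like the sketch below, applied to per-residue SVM scores before the dynamic programming step. It covers only the adjustment, not the dynamic programming itself. The function name `adjust_scores`, the list-based inputs, and the reading of "up to position 30" as the first 30 residues (0-based indices 0 to 29) are assumptions made for illustration.

```python
def adjust_scores(inside, outside, tm, signal, reentrant,
                  sp_limit=30, scale=10.0):
    """Return adjusted copies of the inside, outside and TM helix scores.

    All arguments are per-residue score lists of equal length.
    sp_limit: number of N-terminal residues whose signal peptide scores
              are considered (assumed to mean positions 1..30).
    scale:    factor applied before subtracting from the TM helix score.
    """
    inside, outside, tm = list(inside), list(outside), list(tm)
    for i in range(len(tm)):
        # Positive signal peptide scores near the N-terminus favour an
        # outside (non-cytoplasmic) start and suppress TM helix prediction.
        if i < sp_limit and signal[i] > 0:
            outside[i] += signal[i]
            inside[i] -= signal[i]
            tm[i] -= scale * signal[i]
        # Positive re-entrant helix scores also suppress TM helix prediction.
        if reentrant[i] > 0:
            tm[i] -= scale * reentrant[i]
    return inside, outside, tm
```

Negative signal peptide and re-entrant scores are deliberately ignored, so they never raise the TM helix score.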

16
Overall Prediction Accuracy
Benchmark results for the SVM-based method ('TMSVM') against a selection of leading topology predictors. 'Correct signal peptide' and 'correct re-entrant helix' refer to correct topology prediction for proteins containing these features.
TMSVM was able to detect signal peptides with 92% accuracy, and re-entrant helices with 39% accuracy. No false positives of either class were predicted.
OCTOPUS results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between its test and training sets.
Tested against the Möller (low-resolution) data set, TMSVM scores 77%, the same as MEMSAT3.

22
Discriminating between TM and Globular Proteins
For SVM training, we used 416 randomly chosen proteins from the MEMSAT3 [11] set, which consists of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. The remaining 2269 sequences were used as test cases. PSI-BLAST profiles were generated for all sequences, and 10-fold cross-validation was used to assess performance, again removing sequences from the training fold with greater than 25% sequence identity to any sequence in the test fold.
Window size = 33, Kernel = RBF, MCC = 0.78
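MCC here is the Matthews correlation coefficient, which summarises a binary confusion matrix in a single value between -1 and 1. The small sketch below shows how it is computed from true/false positive and negative counts. The function name `mcc` and the convention of returning 0.0 when the denominator is zero are choices made for this example. No counts from the benchmark are reproduced here.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        # Undefined when any marginal total is zero; 0.0 is a common convention.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike raw accuracy, MCC stays informative when the classes are unbalanced, as they are when a small TM set is compared against a much larger globular set.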