Prediction of protein glycosylation sites by using support vector machines

Date of Defense

2008-07-03

Page Count

55

Keyword

glycosylation

post-translational modification

prediction

support vector machines

Abstract

Protein glycosylation is an important post-translational modification (PTM) that affects various molecular properties such as structure, biological activity and protein-protein interaction. Because experimental identification is difficult and the number of sites to be identified is large, several computational approaches have been proposed in recent years to identify protein glycosylation sites. The features of these identification models were mainly the amino acids surrounding the glycosylation sites, and each previous prediction tool targets only its own type of glycosylation. We therefore develop prediction methods that identify O-linked, N-linked and C-linked glycosylation sites using support vector machines (SVMs) based on dipeptides combined with accessible surface area (ASA), region combined with amino acid, and dipeptides. The accuracies for O-linked glycosylation on serine and threonine, N-linked glycosylation on asparagine and C-linked glycosylation on tryptophan are 95%, 91%, 96% and 95%, respectively. We implemented the methods in GSI, a web server for identifying O-linked, N-linked and C-linked glycosylation sites.
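The window truncation and dipeptide features mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the padding character `X`, the window half-width, and the choice of normalized dipeptide frequencies are all assumptions made for illustration. The resulting 400-dimensional vectors are the kind of input an SVM could be trained on.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def extract_windows(sequence, residues, half_width, pad="X"):
    """Return (1-based position, window) for every candidate residue.

    Each window is symmetrical, with the candidate site in the middle.
    The pad character (an assumption) fills positions past the termini.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


def dipeptide_composition(window):
    """Encode a window as 400 normalized dipeptide frequencies.

    Pairs containing a non-standard character (such as padding) are skipped.
    """
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for a, b in zip(window, window[1:]):
        pair = a + b
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


if __name__ == "__main__":
    # Hypothetical toy sequence; S and T are O-linked candidate residues.
    for position, window in extract_windows("MKSTNW", "ST", half_width=2):
        features = dipeptide_composition(window)
        print(position, window, len(features))
```

Each candidate site yields one fixed-length feature vector regardless of window size, which is why a composition-style encoding is convenient when comparing different symmetrical window sizes.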

List of Figures

Figure 1. The structure of O-linked glycosylation. The oligosaccharides are attached to the hydroxyl group of the amino acids serine and threonine.
Figure 2. The structure of N-linked glycosylation. The oligosaccharides are attached to asparagine.
Figure 3. The structure of C-linked glycosylation. The α-mannopyranosyl residue is attached to the indole C2 of tryptophan via a C-C link.
Figure 4. The structure of GPI anchors. The hydrophobic phosphatidylinositol group is linked to a residue at or near the C-terminus of a protein through a carbohydrate-containing linker.
Figure 5. The system flow of constructing prediction models.
Figure 6. The process of truncating the protein sequence into region windows with a glycosylation or non-glycosylation site in the middle.
Figure 7. The process of dipeptide encoding.
Figure 8. The process of tripeptide encoding.
Figure 9. The process of secondary structure encoding.
Figure 10. The calculation of ASA scores combined with dipeptide.
Figure 11. Comparison of the difference between ASA scores of positive and negative datasets on serine residues in O-linked glycosylation.
Figure 12. Comparison of the difference between ASA scores of positive and negative datasets on threonine residues in O-linked glycosylation.
Figure 13. Comparison of the difference between ASA scores of positive and negative datasets in N-linked glycosylation.
Figure 14. Comparison of the difference between ASA scores of positive and negative datasets in C-linked glycosylation.
Figure 15. The performance of serine residue in O-linked glycosylation prediction models.
Figure 16. The performance of threonine residue in O-linked glycosylation prediction models.
Figure 17. The performance of N-linked glycosylation prediction models.
Figure 18. The performance of C-linked glycosylation prediction models.
Figure 19. The interface of the GSI web server, which is available at http://bioinfo.gene.idv.tw/.
Figure 20. The web interface of GSI with an example of inputs.
Figure 21. The results for each type of potentially glycosylated amino acid site and the distribution of ASA scores surrounding them.
Figure 22. The list of prediction results for protein sequences and the ASA scores of each site.

List of Tables

Table 1. Comparison of current prediction tools.
Table 2. Number of positive and negative datasets considered in our study for O-linked, N-linked and C-linked glycosylation.
Table 3. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive and negative datasets.
Table 4. The number of positive and negative datasets for serine in O-linked glycosylation for different symmetrical window sizes and ratios of positive and negative datasets.
Table 5. The number of positive and negative datasets for C-linked glycosylation for different symmetrical window sizes.
Table 6. The number of positive and negative datasets for N-linked glycosylation for different symmetrical window sizes.
Table 7. The various ratios of positive and negative datasets on serine residues in O-linked glycosylation based on 0/1 system encoding.
Table 8. The various ratios of positive and negative datasets on threonine residues in O-linked glycosylation based on 0/1 system encoding.
Table 9. The results of serine residue in O-linked glycosylation using different features.
Table 10. The results of threonine residue in O-linked glycosylation using different features.
Table 11. The results in N-linked glycosylation from different models.
Table 12. The performance of C-linked glycosylation from different models.
Table 13. Best models of four types of glycosylation.
Table 14. Comparison of using our training datasets on serine in O-linked glycosylation to test previous prediction tools.
Table 15. Comparison of using our training datasets on threonine residues in O-linked glycosylation to test the other prediction tools.
Table 16. Comparison of the training datasets for serine within other prediction tools to test our and their own prediction models.
Table 17. Comparison of the training datasets for threonine within other prediction tools to test our and their own prediction models.
Table 18. Comparison of proposed accuracy with other prediction tools on N-linked glycosylation.
Table 19. Comparison of the training datasets for asparagine residue within other prediction tools to test our and their prediction tools.
Table 20. Comparison of proposed accuracy with other prediction tools on C-linked glycosylation.
Table 21. Comparison of the training datasets for tryptophan residues within other prediction tools to test our and their own prediction models.
Table 22. Comparison of using independent test sets with current prediction tools and ours on serine residues in O-linked glycosylation.
Table 23. Comparison of using independent test sets with current prediction tools and ours on threonine residues in O-linked glycosylation.
Table 24. Comparison of using independent test sets with current prediction tools and ours on asparagine residues in N-linked glycosylation.
Table 25. Comparison of using independent test sets with current prediction tools and ours on tryptophan residues in C-linked glycosylation.
Table 26. Using datasets of serine in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools.
Table 27. Using datasets of threonine residues in O-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools.
Table 28. Using datasets of asparagine residues in N-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools.
Table 29. Using datasets of tryptophan residues in C-linked glycosylation of CKSAAP_OGlySite, EnsembleGly and NetOGlyc with our prediction method to cross test previous prediction tools.
Table 30. The comparison of different glycosylation datasets between previous prediction tools and ours.