Protein phosphorylation (or simply phosphorylation for short) is a ubiquitous
post-translational modification in both prokaryotic and eukaryotic organisms,
which is catalyzed by a type of enzyme called a kinase. Phosphorylation plays
a significant role in a wide range of cellular processes. With the fast-growing
number of novel protein sequences published, there is increasing need to identify
phosphorylation sites in these sequences and also to specify the type of kinase(s)
involved. Because experimental methods that identify phosphorylation sites
in vitro are usually labor-intensive and time-consuming, in silico prediction of
phosphorylation sites is highly desirable and popular for its convenience and
speed.
One of the most challenging issues in phosphorylation site prediction is the
complex substrate specificity of the large kinase family, which makes the problem
well suited to pattern recognition approaches such as artificial neural networks
(ANN). In this thesis, we introduce a novel classifier ensemble approach called
Bagging-Adaboost Ensemble (BAE) and apply this ensemble framework for the first
time to the eukaryotic protein phosphorylation prediction problem. BAE combines
the bagging and AdaBoost techniques to improve the accuracy, stability and
robustness of the result. This improvement is accomplished by (i) enhancing
the diversity of the training data set during the bagging process and (ii)
adaptively weighting individual samples of the training data set during the
AdaBoost process.
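The interplay of the two techniques can be illustrated with a minimal sketch. It is not the BAE implementation: it assumes that bagging forms the outer loop and AdaBoost the inner loop, that ensemble members are combined by an unweighted majority vote, and it swaps in one-feature decision stumps for the neural networks so that the example stays short. All function names are hypothetical.

```python
import math
import random

def train_stump(X, y, w):
    """Return the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol

def adaboost(X, y, rounds=10):
    """AdaBoost with labels in {-1, +1}: reweight samples after every round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Misclassified samples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

def bagged_adaboost(X, y, n_bags=5, rounds=10, seed=0):
    """Bagging: train one boosted model per bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(X)
    members = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        members.append(adaboost([X[i] for i in idx], [y[i] for i in idx], rounds))
    return members

def ensemble_predict(members, row):
    """Assumed combination rule: unweighted majority vote over the bags."""
    votes = sum(adaboost_predict(m, row) for m in members)
    return 1 if votes >= 0 else -1
```

In this sketch the bootstrap resampling supplies the diversity described in (i), while the weight update inside `adaboost` corresponds to the adaptive sample weights described in (ii).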
Although a number of ANN-based approaches for predicting phosphorylation sites
have been developed in the last decade, little effort has been put into the
generation and selection of features of phosphorylation sites, which is a crucial
step toward good performance. Hence, we analyze a broad spectrum of features,
including discrete alphabets, evolutionary features, physicochemical features and
structural features, based on class separability measuring criteria. We further
propose a heterogeneous feature representation for describing phosphorylation sites,
which integrates features with high discriminatory power but low correlation with
one another, thus reducing the dimensionality of the code vector while
retaining as much of the class discriminatory information as possible.
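The selection principle (high discriminatory power, low mutual correlation) can be sketched as a greedy filter. The sketch below rests on assumptions that are not taken from the thesis: it uses the two-class Fisher ratio as the class separability criterion and the Pearson coefficient as the correlation measure, and the thresholds `min_score` and `max_corr`, like all function names, are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Two-class Fisher ratio: squared mean gap over summed within-class variance."""
    pos = [v for v, t in zip(values, labels) if t == 1]
    neg = [v for v, t in zip(values, labels) if t != 1]
    spread = pvariance(pos) + pvariance(neg)
    return (mean(pos) - mean(neg)) ** 2 / (spread + 1e-12)

def pearson(a, b):
    """Pearson correlation coefficient; 0.0 if either feature is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return cov / norm if norm else 0.0

def select_features(columns, labels, min_score=0.1, max_corr=0.9):
    """Rank features by Fisher score, then keep each one only if it is
    discriminative enough and not too correlated with a feature already kept."""
    scores = {name: fisher_score(vals, labels) for name, vals in columns.items()}
    ranked = sorted(columns, key=lambda name: scores[name], reverse=True)
    kept = []
    for name in ranked:
        if scores[name] < min_score:
            continue  # too little class separability
        if all(abs(pearson(columns[name], columns[k])) <= max_corr for k in kept):
            kept.append(name)  # adds information not already covered
    return kept
```

Applied to a table of candidate features, a filter of this kind drops near-duplicates of an already selected feature as well as features that barely separate the two classes, which is how the dimensionality of the code vector shrinks without discarding much class information.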
We evaluate BAE on a large database of experimentally verified phosphorylation
sites, and compare the results with existing prediction systems that adopt
neural networks (NN) and support vector machines (SVM). The experimental results
show that BAE outperforms many existing methods.