Abstract

Identification of robust set of predictive features is one of the most important steps in the construction of clustering, classification and regression models from many thousands of features. Although there have been various attempts to select predictive feature sets from high-dimensional data sets in classification and clustering, there is a limited attempt to study it in regression problems. As semi-supervised and supervised feature selection methods tend to identify noisy features in addition to discriminative variables, unsupervised feature selection methods (USFSMs) are generally regarded as more unbiased approach. Therefore, in this study, along with the entire feature set, four different USFSMs are considered for the quantitative prediction of peptide binding affinities being one of the most challenging post-genome regression problems of very high-dimension comparted to extremely small size of samples. As USFSMs are independent of any predictive method, support vector regression was then utilised to assess the quality of prediction. Given three different peptide binding affinity data sets, the results suggest that the regression performance of USFMs depends generally on the datasets. There is no particular method that yields the best performance compared to their performances in the classification problems. However, a closer investigation of the results appears to suggest that the spectral regression-based approach yields slightly better performance. To the best of our knowledge, this is the first study that presents comprehensive comparison of USFSMs in such high-dimensional regression problems, particularly in biological domain with an application in the prediction of peptide binding affinity, and provides a number of practical suggestions for future practitioners.