Predicting co-complexed protein pairs from heterogeneous data

Jian Qiu and William Stafford Noble

PLoS Computational Biology. 4(4):e1000054, 2008.

Abstract

Proteins do not carry out their functions alone. Instead, they often
act by participating in macromolecular complexes and play different
functional roles depending on the other members of the complex. It is
therefore interesting to identify co-complex relationships. Although
protein complexes can be identified in a high-throughput manner by
experimental technologies such as affinity purification coupled with
mass spectrometry (APMS), these large-scale datasets often suffer from
high false positive and false negative rates. Here, we present a
computational method that predicts co-complexed protein pair (CCPP)
relationships using kernel methods from heterogeneous data sources. We
show that a diffusion kernel based on random walks on the full network
topology yields good performance in predicting CCPPs from protein
interaction networks. In the setting of direct ranking, a diffusion
kernel performs much better than the mutual clustering coefficient. In
the setting of SVM classifiers, a diffusion kernel performs much
better than a linear kernel. We also show that combination of
complementary information improves the performance of our CCPP
recognizer. A summation of three diffusion kernels based on
two-hybrid, APMS, and genetic interaction networks and three sequence
kernels achieves better performance than the sequence kernels or
diffusion kernels alone. Inclusion of additional features achieves a
still better ROC(50) of 0.937. Assuming a negative-to-positive ratio
of 600:1, the final classifier achieves 89.3% coverage at an
estimated false discovery rate of 10%. Finally, we applied our
prediction method to two recently described APMS datasets. We find
that our predicted positives are highly enriched with CCPPs that are
identified by both datasets, suggesting that our method successfully
identifies true CCPPs. An SVM classifier trained from heterogeneous
data sources provides accurate predictions of CCPPs in yeast. This
computational approach thereby provides an inexpensive method for
identifying protein complexes that extends and complements
high-throughput experimental data.
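
As an illustration of the diffusion kernel idea described above, the
following is a minimal sketch, not the authors' implementation. The
abstract does not state the kernel's formula; the sketch assumes the
commonly used form K = exp(-beta * L), where L = D - A is the graph
Laplacian of an unweighted, undirected interaction network. The
function name, the value of beta, and both toy networks are
hypothetical and chosen only for demonstration.

```python
import numpy as np

def diffusion_kernel(adjacency, beta=1.0):
    """Return exp(-beta * L), with L = D - A the graph Laplacian.

    Assumes a symmetric adjacency matrix (undirected network).
    """
    a = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(a.sum(axis=1)) - a
    # L is symmetric, so the matrix exponential can be computed
    # from its eigendecomposition: V * exp(-beta * lambda) * V^T.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

# Hypothetical toy network A: proteins 0-1-2 form a path,
# protein 3 has no interactions.
toy_a = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

# Hypothetical toy network B over the same four proteins:
# a single interaction between proteins 2 and 3.
toy_b = np.array([
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
])

K_a = diffusion_kernel(toy_a, beta=0.5)
K_b = diffusion_kernel(toy_b, beta=0.5)

# An unweighted sum of kernels is again a valid kernel, which is
# one simple way to combine several data sources.
K_sum = K_a + K_b

print(np.round(K_a, 3))
print(np.round(K_sum, 3))
```

In this toy example, proteins 0 and 2 are not directly linked in
network A, yet they receive a positive similarity because they share
neighbor 1. A measure based only on direct edges would assign this
pair nothing, which is the intuition behind using the full network
topology.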