Abstract

In recent years, gene interaction networks have increasingly been used for the assessment and interpretation of biological measurements. Knowledge of the interaction partners of an unknown protein allows scientists to understand the complex relationships between gene products, helps to reveal unknown biological functions and pathways, and provides a more detailed picture of an organism's complexity. Measuring all protein interactions under all relevant conditions is virtually impossible. Hence, computational methods that integrate different datasets for predicting gene interactions are needed. However, when integrating different sources, one has to account for the fact that some of the information may be redundant, which may lead to an overestimation of the true likelihood of an interaction. Our method integrates information derived from three different databases (Bioverse, HiMAP and STRING) for predicting human gene interactions. A Bayesian approach was implemented in order to integrate the different data sources on a common quantitative scale. An important assumption of the Bayesian integration is independence of the input data (features). Our study shows that the conditional dependency cannot be ignored when combining gene interaction databases that rely on partially overlapping input data. In addition, we show how the correlation structure between the databases can be detected, and we propose a linear model to correct for this bias. Benchmarking the results against two independent reference data sets shows that the integrated model outperforms the individual datasets. Our method provides an intuitive strategy for weighting the different features while accounting for their conditional dependencies.

Initially, log-likelihood scores were calculated for each database independently. A naive Bayes classifier was applied to the individual data sets to map the interaction confidence onto a common scale. Subsequently, a linear correction was applied to the scores of interactions obtained from more than one database.
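The first two steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the pseudocount smoothing, and all counts are assumptions chosen for the example. It computes a log-likelihood score as the log ratio of how often a confidence bin occurs among reference positives versus reference negatives, and then sums the per-database scores as a naive Bayes classifier would under the independence assumption.

```python
import math


def log_likelihood_score(pos_in_bin, neg_in_bin, pos_total, neg_total,
                         pseudocount=1.0):
    """Return ln(P(bin | true interaction) / P(bin | non-interaction)).

    The pseudocount is an assumption of this sketch; it only keeps the
    ratio finite when a bin contains no positives or no negatives.
    """
    p_pos = (pos_in_bin + pseudocount) / (pos_total + pseudocount)
    p_neg = (neg_in_bin + pseudocount) / (neg_total + pseudocount)
    return math.log(p_pos / p_neg)


def naive_bayes_sum(scores):
    """Combine per-database scores by addition.

    Adding log-likelihood scores is only valid if the databases are
    conditionally independent, which is the assumption under test.
    """
    return sum(scores)


# Hypothetical counts for one gene pair reported by two databases.
example_scores = {
    "Bioverse": log_likelihood_score(30, 10, 100, 1000),
    "STRING": log_likelihood_score(40, 20, 100, 1000),
}
combined = naive_bayes_sum(example_scores.values())
print(example_scores, combined)
```

If the two databases draw on overlapping evidence, the sum counts that evidence twice, which is why a correction is needed for interactions found in more than one database.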

Three-fold cross validation was applied to each of the three databases independently. The predictor was trained on two-thirds of the reference data. The x-axis represents the log-likelihood score predicted from the training parameters, while the y-axis represents the actual enrichment with true positives in the test set. The data were grouped into five bins, and the dots show the respective values for each bin. The color indicates the source database. For all three datasets, predicted and observed results are very close to the ideal case (solid line). The correlation coefficients between predicted and true scores are reported in the figure legend.

The plot of observed versus estimated log-likelihood scores for all four subgroups of integrated interactions illustrates the dependencies between the three different datasets as well as the linear correlation between predicted and observed log-likelihood scores. The approach is the same as for the individual databases, with the difference that the test set is limited to common interactions. Linear correlation coefficients are reported in the legend.

Linear regression plots of trained (predicted) versus tested log-likelihood scores, based on a second, independent reference data set, for the different combinations of redundant subsets. Red line: interactions reported in only one database. Green and blue lines: corrected and uncorrected scores for interactions reported in at least two databases. Orange line: using the maximum of the individual scores instead of the sum. Ideally, all predictions should lie along the diagonal. The bias-corrected scores are clearly better predictors of the true interaction likelihood.
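One way to picture the bias correction is as an ordinary least-squares line fitted between the summed (uncorrected) scores and the scores observed in the test set, which is then used to rescale new sums. The sketch below rests on that assumption; the exact form of the linear model is not given in this text, and the function names and numbers are illustrative only.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def corrected_score(naive_sum, slope, intercept):
    """Map an uncorrected summed score onto the observed scale."""
    return intercept + slope * naive_sum


# Hypothetical binned values: summed scores overestimate the observed ones.
predicted_sum = [2.0, 4.0, 6.0, 8.0]
observed = [1.5, 2.5, 3.5, 4.5]

slope, intercept = fit_line(predicted_sum, observed)
print(slope, intercept)
print(corrected_score(10.0, slope, intercept))
```

A slope below one in such a fit indicates that the databases share evidence, so the plain sum overstates the likelihood of an interaction.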

(A) Cross validation based on the HPRD in vivo reference set. (B) Training on HPRD in vivo, testing on an independent reference set (see main text for details). We divided the test dataset into 20 bins by descending log-likelihood score and assessed the cumulative precision and recall for each successive bin, both for the corrected score and for the scores derived from training the individual databases. The integrated network shows equal or better overall performance. The maximum F-score of each network is reported in the legend. The F-score, which combines precision and recall, is an integrated measure of the predictive power.
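The evaluation described in this caption can be sketched as follows. The F-score formula did not survive in this text, so the sketch assumes the common definition as the harmonic mean of precision and recall; the function name and the toy data are also assumptions.

```python
def cumulative_precision_recall(scored, n_bins=20):
    """Rank by descending score and evaluate cumulatively, bin by bin.

    `scored` is a list of (score, is_true_interaction) pairs.
    Returns one (precision, recall, f_score) tuple per non-empty cutoff.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_true = sum(1 for _, is_true in ranked if is_true)
    results = []
    for b in range(1, n_bins + 1):
        cutoff = (b * len(ranked)) // n_bins  # items in the first b bins
        top = ranked[:cutoff]
        if not top:
            continue
        true_pos = sum(1 for _, is_true in top if is_true)
        precision = true_pos / len(top)
        recall = true_pos / total_true if total_true else 0.0
        denom = precision + recall
        # Assumed definition: harmonic mean of precision and recall.
        f_score = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f_score))
    return results


# Toy data: four scored gene pairs, three of them true interactions.
toy = [(4.0, True), (3.0, True), (2.0, False), (1.0, True)]
curve = cumulative_precision_recall(toy, n_bins=4)
max_f = max(f for _, _, f in curve)
print(curve, max_f)
```

Reporting the maximum F-score over all cutoffs gives a single number per network, which makes the integrated and individual networks directly comparable.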