A statistical framework for genomic data fusion

Abstract

During the past decade, the new focus on genomics has highlighted a
particular challenge: to integrate the different views of the genome
that are provided by various types of experimental data. This paper
describes a computational framework for integrating and drawing
inferences from a collection of genome-wide measurements. Each data
set is represented via a kernel function, which defines generalized
similarity relationships between pairs of entities, such as genes or
proteins. The kernel representation is both flexible and efficient,
and can be applied to many different types of data. Furthermore,
kernel functions derived from different types of data can be combined
in a straightforward fashion. Recent advances in the theory of kernel
methods have provided efficient algorithms to perform such
combinations in a way that minimizes a statistical loss
function. These methods exploit semidefinite programming techniques to
reduce the problem of finding optimal kernel combinations to a
convex optimization problem. Computational experiments performed
using yeast genome-wide data sets, including amino acid sequences,
hydropathy profiles, gene expression data and known protein-protein
interactions, demonstrate the utility of this approach. A statistical
learning algorithm trained on all of these data to recognize
particular classes of proteins -- membrane proteins and ribosomal
proteins -- performs significantly better than the same algorithm
trained on any single type of data.
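
As a rough illustration of what combining kernels from different data
types can look like, the sketch below builds two kernel matrices over
the same set of proteins and forms a non-negative weighted sum, which
is again a valid (positive semidefinite) kernel. Everything here is an
assumption made for illustration: the random stand-in data, the linear
kernels, the helper names, and the fixed weights 0.7 and 0.3. The
framework described above learns the weights by semidefinite
programming; this sketch does not attempt that step.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of inner products between the rows of X."""
    return X @ X.T

def normalize_kernel(K):
    """Rescale K so that every diagonal entry equals 1."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices; weights must be non-negative."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    return sum(w * K for w, K in zip(weights, kernels))

# Hypothetical stand-in data: 6 proteins described by two feature sets.
rng = np.random.default_rng(0)
n_proteins = 6
expression = rng.normal(size=(n_proteins, 10))  # e.g. expression profiles
hydropathy = rng.normal(size=(n_proteins, 4))   # e.g. hydropathy features

K_expr = normalize_kernel(linear_kernel(expression))
K_hydro = normalize_kernel(linear_kernel(hydropathy))

# Fixed illustrative weights (assumed, not learned).
K_combined = combine_kernels([K_expr, K_hydro], [0.7, 0.3])

# A valid kernel matrix is symmetric with no negative eigenvalues
# (up to floating-point error).
min_eig = np.linalg.eigvalsh(K_combined).min()
```

The combined matrix can be passed to any kernel-based learner that
accepts a precomputed kernel. Normalizing each kernel first keeps one
data type from dominating the sum purely because of its scale.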