Abstract

An important challenge for the field of machine learning is to leverage the diversity of information available in large-scale learning problems, in which different sources of information often capture different aspects of the data. Beyond classical vectorial data formats, information in the form of graphs, trees, and strings has become widely available (e.g., the linked structure of webpages, amino acid sequences describing proteins). In this talk I introduce a principled computational and statistical framework to integrate data from heterogeneous information sources in a flexible and unified way. The approach is formulated within the unifying learning framework of kernel methods and applied to the specific case of classification. The resulting formulation takes the form of a semidefinite programming (SDP) problem. Although this implies a polynomial-time algorithm, the scale of many real-life problems is often beyond the reach of general-purpose SDP algorithms. Using tools from conic duality and convex analysis, I derive a dedicated algorithm that is significantly more efficient than generic SDP methods in this setting. Finally, I present applications to computational biology, showing that classification performance can be enhanced by integrating diverse genome-wide information sources.
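The core idea underlying the kernel-based integration described above can be illustrated with a small sketch: each information source yields its own positive semidefinite kernel matrix over the same set of samples, and a convex combination of such kernels is again a valid kernel usable by a downstream classifier. The data, the weights, and the normalization below are all illustrative assumptions; the talk's method learns the combination weights by solving an SDP, which is not reproduced here.

```python
import numpy as np

# Hypothetical toy data: two "information sources" describing the same
# six samples (e.g., expression vectors and sequence-derived features).
rng = np.random.default_rng(0)
X1 = rng.normal(size=(6, 4))
X2 = rng.normal(size=(6, 3))

def linear_kernel(X):
    """Linear Gram matrix, trace-normalized so each source contributes
    on a comparable scale."""
    K = X @ X.T
    return K / np.trace(K)

K1, K2 = linear_kernel(X1), linear_kernel(X2)

# Illustrative fixed weights; the SDP formulation in the talk would
# optimize these jointly with the classifier.
mu = np.array([0.7, 0.3])
K = mu[0] * K1 + mu[1] * K2

# A convex combination of PSD matrices is PSD, so K is a valid kernel.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True
```

The trace normalization is one common way to keep heterogeneous sources on a comparable scale before combining them; the combined matrix `K` could then be passed to any kernel classifier, such as an SVM with a precomputed kernel.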