We propose a general framework for computing minimal set covers under a certain class of logical constraints.
The underlying idea is to transform the problem into a mathematical program with linear constraints.
In this sense it can be seen as a natural extension of the vector quantization algorithm proposed by Tipping and Schölkopf.
We show which class of logical constraints can be cast and relaxed into linear constraints, and we give an algorithm for
the transformation.
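
As a concrete illustration of the relaxation step (a minimal sketch, not the paper's algorithm), the classical set-cover integer program and its linear relaxation can be written down with scipy; the universe and candidate sets below are made up:

    # Hypothetical example: classical set cover and its LP relaxation.
    import numpy as np
    from scipy.optimize import linprog

    sets = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]        # candidate sets over {0,..,3}
    n_elems, n_sets = 4, len(sets)
    A = np.zeros((n_elems, n_sets))                # A[e, s] = 1 iff e is in set s
    for s, members in enumerate(sets):
        for e in members:
            A[e, s] = 1.0

    # LP relaxation of  min sum_s x_s  s.t.  sum_{s : e in s} x_s >= 1,
    # 0 <= x_s <= 1; linprog expects A_ub @ x <= b_ub, hence the negation.
    res = linprog(c=np.ones(n_sets), A_ub=-A, b_ub=-np.ones(n_elems),
                  bounds=[(0.0, 1.0)] * n_sets)
    print(res.x)   # fractional cover; rounding yields a feasible integral cover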

In recent years, spectral clustering has become one of the most
popular modern clustering algorithms. It is simple to implement, can
be solved efficiently by standard linear algebra software, and very
often outperforms traditional clustering algorithms such as the
k-means algorithm. Nevertheless, at first glance spectral
clustering looks somewhat mysterious, and it is not obvious why it
works at all or what it really does. This article is a tutorial
introduction to spectral clustering. We describe different graph
Laplacians and their basic properties, present the most common
spectral clustering algorithms, and derive those algorithms from
scratch by several different approaches. Advantages and disadvantages
of the different spectral clustering algorithms are discussed.
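
For readers who want something executable, here is a minimal sketch of one standard variant (unnormalized spectral clustering on a fully connected Gaussian similarity graph); gamma and the graph construction are illustrative choices:

    # Minimal sketch of unnormalized spectral clustering; numpy/scipy only.
    import numpy as np
    from scipy.cluster.vq import kmeans2

    def spectral_clustering(X, k, gamma=1.0):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-gamma * sq)              # Gaussian similarities
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(1)) - W            # unnormalized Laplacian L = D - W
        _, U = np.linalg.eigh(L)             # eigenvectors, ascending eigenvalues
        _, labels = kmeans2(U[:, :k], k, minit='++', seed=0)
        return labels                        # k-means on the spectral embedding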

We propose novel methods for machine learning of structured output
spaces. Specifically, we consider outputs which are graphs whose
vertices have a natural order.
We consider the usual adjacency matrix representation of
graphs, as well as two other representations for such a graph: (a)
decomposing the graph into a set of paths, (b) converting the graph
into a single sequence of nodes with labeled edges.
For each of the three representations, we propose an encoding and
decoding scheme. We also propose an evaluation measure for comparing
two graphs.

Protein subcellular localization is a crucial ingredient to many
important inferences about cellular processes, including prediction of
protein function and protein interactions. While many predictive
computational tools have been proposed, they tend to have complicated
architectures and require many design decisions from the developer.
We propose an elegant and fully automated approach to building a
prediction system for protein subcellular localization. We propose a
new class of protein sequence kernels which considers all motifs
including motifs with gaps. This class of kernels allows
pairwise amino acid distances to be included in the
computation. We further propose a multiclass support vector machine method
which directly solves protein subcellular localization without
resorting to the common approach of splitting the problem into several
binary classification problems. To automatically search over families
of possible amino acid motifs, we generalize our method to optimize over
multiple kernels at the same time. We compare our automated approach
to four other predictors on three different datasets.

Most literature on Support Vector Machines (SVMs) concentrates on
the dual optimization problem. In this paper, we would like to point out
that the primal problem can also be solved efficiently, both for linear
and non-linear SVMs, and that there is no reason to ignore it. Moreover, from
the primal point of view, new families of algorithms for large scale SVM
training can be investigated.
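
A minimal sketch of the primal idea for the linear case, using the differentiable squared hinge loss and plain gradient descent (the paper itself also treats non-linear SVMs and Newton-type steps; the learning rate and regularization values here are illustrative):

    # Primal linear SVM with squared hinge loss, plain gradient descent.
    import numpy as np

    def primal_svm(X, y, lam=0.1, lr=0.1, n_iter=500):
        # Minimize  lam/2 * ||w||^2 + mean_i max(0, 1 - y_i * w.x_i)^2.
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iter):
            margins = 1.0 - y * (X @ w)
            active = margins > 0             # only violated margins contribute
            grad = lam * w - (2.0 / n) * (X[active].T @ (y[active] * margins[active]))
            w -= lr * grad
        return w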

We address the problem of learning hyperparameters in kernel methods for
which the Hessian of the objective is structured. We propose an approximation
to the cross-validation log likelihood whose gradient can be computed
analytically, solving the hyperparameter learning problem efficiently
through nonlinear optimization. Crucially, our learning method is based
entirely on matrix-vector multiplication primitives with the kernel
matrices and their derivatives, allowing straightforward specialization to
new kernels or to large datasets. When applied to the problem of multi-way
classification, our method scales linearly in the number of classes and
gives rise to state-of-the-art results on a remote imaging task.
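
To illustrate the matrix-vector-only primitive (not the paper's exact objective), the sketch below solves a kernel system matrix-free by conjugate gradients through a scipy LinearOperator; the RBF kernel and jitter value are illustrative assumptions:

    # Matrix-free kernel solve using only products K @ v.
    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def kernel_solve(kernel_matvec, y):
        n = y.shape[0]
        K_op = LinearOperator((n, n), matvec=kernel_matvec)
        a, info = cg(K_op, y)                # uses only matrix-vector products
        assert info == 0, "CG did not converge"
        return a

    # Hypothetical usage with an explicit RBF kernel plus jitter; at large
    # scale the matvec would be computed without ever storing K.
    X = np.random.randn(200, 5)
    K = np.exp(-0.5 * ((X[:, None] - X[None, :]) ** 2).sum(-1)) + 1e-6 * np.eye(200)
    a = kernel_solve(lambda v: K @ v, np.random.randn(200))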

We propose fast algorithms for reducing the number of kernel evaluations in the testing
phase for methods such as Support Vector Machines (SVM) and Ridge Regression (RR). For
non-sparse methods such as RR this results in significantly improved prediction time.
For binary SVMs, which are already sparse in their expansion, the payoff is mainly in
the case of noisy or large-scale problems. However, we then further develop our method
for multi-class problems where, after choosing the expansion to find vectors which
describe all the hyperplanes jointly, we again achieve significant gains.
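
One standard way to cut test-time kernel evaluations, sketched under the assumption that the reduced expansion is supported on a pre-chosen subset Z of training points (the joint multi-class choice of vectors in the paper is more elaborate):

    # Approximate f(x) = sum_i alpha_i k(x_i, x) by a shorter expansion on Z.
    import numpy as np

    def reduce_expansion(K_zz, K_zx, alpha, jitter=1e-8):
        # argmin_beta || sum_j beta_j k(z_j,.) - sum_i alpha_i k(x_i,.) ||_H^2
        # has the closed form  beta = K_zz^{-1} K_zx alpha.
        m = K_zz.shape[0]
        return np.linalg.solve(K_zz + jitter * np.eye(m), K_zx @ alpha)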

Considerable progress has recently been achieved in semi-supervised
learning, which differs from traditional supervised learning by
additionally exploiting the information in the unlabelled examples.
However, a disadvantage of many existing methods is that they do
not generalize to unseen inputs. This paper investigates learning
methods that effectively make use of both labelled and unlabelled
data to build predictive functions, which are defined not just on
the seen inputs but on the whole space. As a nice property, the proposed
method allows efficient training and can easily handle new
test points. We validate the method on both toy data and
real-world data sets.

In this paper, we propose to combine an efficient image representation based on local descriptors with a Support Vector Machine classifier in order to perform object categorization. For this purpose, we apply kernels defined on sets of vectors. After testing different combinations of kernels and local descriptors, we identify one that performs particularly well.
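
For concreteness, one simple kernel on sets of local descriptors of the kind compared here is the averaged pairwise (summation) kernel, which is positive definite whenever the base kernel is; a minimal sketch, with gamma as an illustrative parameter:

    # Averaged pairwise RBF kernel between two sets of descriptors.
    import numpy as np

    def set_kernel(A, B, gamma=0.5):
        # A: (m, d) descriptors of one image, B: (n, d) of another.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq).mean()    # average over all descriptor pairs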

We investigate the problem of defining Hilbertian metrics and,
correspondingly, positive definite kernels on probability measures, continuing previous work. This type of kernel has shown very good
results in text classification and has a wide range of possible
applications. In this paper we extend the two-parameter family of
Hilbertian metrics of Topsøe so that it now includes all
commonly used Hilbertian metrics on probability measures. This
allows us to do model selection among these metrics in an elegant
and unified way. Second, we further investigate our approach to
incorporate similarity information of the probability space into
the kernel. The analysis provides a better understanding of these
kernels and gives in some cases a more efficient way to compute
them. Finally we compare all proposed kernels in two text and one
image classification problem.
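
For concreteness, two commonly used Hilbertian metrics on probability measures covered by such families are the Hellinger distance and the square root of the Jensen-Shannon divergence (shown up to normalization constants):

\[
d_{H}(p,q) \;=\; \Big(\int \big(\sqrt{p(x)}-\sqrt{q(x)}\big)^{2}\,dx\Big)^{1/2},
\qquad
d_{JS}(p,q) \;=\; \sqrt{\mathrm{JS}(p\,\|\,q)}.
\]

Both embed isometrically into a Hilbert space, which is exactly what makes the induced kernels positive definite.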

This paper gives a survey of results in the mathematical
literature on positive definite kernels and their associated
structures. We concentrate on properties which seem potentially
relevant for Machine Learning and try to clarify some results that
have been misused in the literature. Moreover, we consider
different lines of generalization of positive definite kernels.
Namely, we deal with operator-valued kernels and present the
general framework of Hilbertian subspaces of Schwartz, which we use
to introduce kernels which are distributions. Finally, indefinite
kernels and their associated reproducing kernel spaces are
considered.

We introduce a new framework for regression between multi-dimensional spaces. Standard
methods for solving this problem typically reduce the problem to one-dimensional
regression by choosing features in the input and/or output spaces. These methods, which
include PLS (partial least squares), KDE (kernel dependency estimation), and PCR
(principal component regression), select features based on different a priori judgments as
to their relevance. Moreover, loss function and constraints are chosen not primarily on
statistical grounds, but to simplify the resulting optimisation. By contrast, in our
approach the feature construction and the regression estimation are performed jointly,
directly minimizing a loss function that we specify, subject to a rank constraint. A
major advantage of this approach is that the loss is no longer chosen according to the
algorithmic requirements, but can be tailored to the characteristics of the task at hand;
the features will then be optimal with respect to this objective. Our approach also
allows for the possibility of using a regularizer in the optimization. Finally, by processing the observations sequentially, our algorithm is able to work on large scale problems.
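
As a point of reference, the classical rank-constrained least-squares baseline (reduced-rank regression) has a closed form via an SVD of the fitted responses; a minimal sketch that omits the regularizer and sequential processing discussed above, and assumes X has full column rank:

    # Classical reduced-rank regression: OLS fit, then rank-r projection.
    import numpy as np

    def reduced_rank_regression(X, Y, r):
        W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
        _, _, Vt = np.linalg.svd(X @ W_ols, full_matrices=False)
        P = Vt[:r].T @ Vt[:r]       # projector onto top-r response directions
        return W_ols @ P            # best rank-r coefficients for squared loss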

We consider the general problem of learning from labeled and
unlabeled data. Given a set of points, some of them are labeled,
and the remaining points are unlabeled. The goal is to predict the
labels of the unlabeled points. Any supervised learning algorithm
can be applied to this problem, for instance, Support Vector
Machines (SVMs). The question of interest is whether we can
implement a classifier which exploits the unlabeled data
in some way and achieves higher accuracy than classifiers which use
the labeled data only. Recently we proposed a simple algorithm,
which can substantially benefit from large amounts of unlabeled
data and demonstrates clear superiority to supervised learning
methods. In this paper we further investigate the algorithm using
random walks and spectral graph theory, which shed light on the
key steps in this algorithm.

We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth, which can make dependence hard to detect empirically. All current kernel-based independence tests share this behaviour. Finally, we demonstrate exponential convergence between the population and empirical COCO, which implies that COCO does not suffer from slow learning rates when used as a dependence test.
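
A hedged sketch of an empirical COCO-style statistic, computed from centered Gram matrices; normalization conventions vary across papers, so treat the 1/n factor as illustrative:

    # Empirical COCO-style dependence statistic from two Gram matrices.
    import numpy as np

    def coco_statistic(Kx, Ky):
        n = Kx.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
        Kxc, Kyc = H @ Kx @ H, H @ Ky @ H
        return np.sqrt(np.linalg.norm(Kxc @ Kyc, 2)) / n   # spectral norm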

We present a simple, geometric method to
construct Fieller's exact confidence sets for
ratios of jointly normally distributed random
variables. Contrary to previous geometric
approaches in the literature, our method is
valid in the general case where both sample mean
and covariance are unknown. Moreover, not only
the construction but also its proof are purely
geometric and elementary, thus giving intuition
into the nature of the confidence sets.
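
For reference, Fieller's confidence set for the ratio \(\theta = \mu_1/\mu_2\) is standardly written as the solution set of a quadratic inequality,

\[
C \;=\; \Big\{\theta \in \mathbb{R} \;:\; (\hat\mu_{1}-\theta\hat\mu_{2})^{2} \;\le\; c\,\big(\hat v_{11}-2\theta\hat v_{12}+\theta^{2}\hat v_{22}\big)\Big\},
\]

where \((\hat\mu_1,\hat\mu_2)\) are the sample means, \(\hat v_{ij}\) the entries of their estimated covariance, and \(c\) the appropriate squared quantile. Depending on the data, the set can be an interval, the complement of an interval, or the whole real line, which is the behaviour the geometric construction makes intuitive.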

We propose a general regularization framework for transductive
inference. The given data are thought of as a graph, where the
edges encode the pairwise relationships among data. We develop
discrete analysis and geometry on graphs, and then naturally adapt
the classical regularization in the continuous case to the graph
situation. A new and effective algorithm is derived from this
general framework, as is an approach we developed previously.
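
One concrete instance of such a discrete regularizer is the normalized form, where \(\mu\) trades smoothness against fit to the given labels \(y\) and \(d_i = \sum_j w_{ij}\) are the node degrees:

\[
f^{*} \;=\; \operatorname*{arg\,min}_{f}\;\; \frac{1}{2}\sum_{i,j} w_{ij}\Big(\frac{f_{i}}{\sqrt{d_{i}}}-\frac{f_{j}}{\sqrt{d_{j}}}\Big)^{2} \;+\; \mu \sum_{i}\big(f_{i}-y_{i}\big)^{2}.
\]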

When designing a Brain-Computer Interface (BCI) system, one can choose from a variety of features that
may be useful for classifying brain activity during a mental task. For the special case of classifying EEG signals, we propose the use of the state-of-the-art feature selection algorithms Recursive Feature Elimination [3] and Zero-Norm Optimization [13], which are based on the training of Support Vector Machines (SVMs) [11]. These algorithms can provide more accurate solutions than standard filter methods for feature selection [14].
We adapt the methods to the task of selecting EEG channels. For a motor imagery paradigm, we
show that the number of channels used can be reduced significantly without increasing the classification error. The resulting best channels agree well with the expected underlying cortical activity patterns during the mental tasks.
Furthermore, we show how time-dependent, task-specific information can be visualized.
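
A hedged sketch of how SVM-based RFE might be adapted to channel selection, with features grouped per channel and whole channels removed by their aggregate weight; the grouping and scoring details here are illustrative, not the exact procedure of the paper:

    # Channel-wise recursive feature elimination with a linear SVM.
    import numpy as np
    from sklearn.svm import LinearSVC

    def rfe_channels(X, y, groups, n_keep):
        # X: (trials, features); groups[c] = column indices of channel c.
        channels = list(groups)
        while len(channels) > n_keep:
            cols = np.concatenate([groups[c] for c in channels])
            w = LinearSVC(dual=False).fit(X[:, cols], y).coef_.ravel()
            scores, start = {}, 0
            for c in channels:                  # aggregate weight per channel
                k = len(groups[c])
                scores[c] = float((w[start:start + k] ** 2).sum())
                start += k
            channels.remove(min(channels, key=scores.get))
        return channels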

The Google search engine has had huge success with its PageRank
web page ranking algorithm, which exploits the global, rather than
local, hyperlink structure of the World Wide Web using random
walks. This algorithm can only be used for graph data, however.
Here we propose a simple universal ranking algorithm for vectorial
data, based on exploring the intrinsic global geometric
structure revealed by a large amount of data. Experimental results
ranging from image and text to bioinformatics data illustrate the validity of
our algorithm.
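
A minimal sketch of a ranking scheme in this spirit: put unit mass on the query, normalize the affinity matrix symmetrically, and rank by the closed-form diffusion scores. Alpha is an illustrative parameter, and a connected graph with positive degrees is assumed:

    # Manifold-style ranking by diffusion on a similarity graph.
    import numpy as np

    def manifold_rank(W, query_idx, alpha=0.9):
        d = W.sum(1)
        S = W / np.sqrt(np.outer(d, d))            # D^{-1/2} W D^{-1/2}
        y = np.zeros(W.shape[0])
        y[query_idx] = 1.0
        f = np.linalg.solve(np.eye(len(y)) - alpha * S, y)
        return np.argsort(-f)                      # indices, best match first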

A new method for performing a kernel principal component analysis is
proposed. By kernelizing the generalized Hebbian algorithm, one can
iteratively estimate the principal components in a reproducing
kernel Hilbert space with only linear order memory complexity. The
derivation of the method, a convergence proof, and preliminary
applications in image hyperresolution are presented. In addition,
we discuss the extension of the method to the online learning of
kernel principal components.
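
A heavily hedged sketch of a kernelized Hebbian-type update: the components are stored as coefficient rows of A (so the principal directions live in the span of the mapped data), and each step touches only one column of the kernel matrix. Kernel centering and learning-rate schedules are omitted for brevity:

    # One pass of a kernel Hebbian-style update over the training set.
    import numpy as np

    def kha_epoch(K, A, eta=0.05):
        # K: (n, n) kernel matrix; A: (r, n) expansion coefficients.
        for i in np.random.permutation(K.shape[0]):
            y = A @ K[:, i]                        # component outputs, sample i
            delta = -np.tril(np.outer(y, y)) @ A   # GHA orthogonalization term
            delta[:, i] += y                       # Hebbian term
            A += eta * delta
        return A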

We consider the learning problem in the transductive setting. Given
a set of points of which only some are labeled, the goal is to
predict the labels of the unlabeled points. A principled way to
approach such a learning problem is the consistency assumption that a
classifying function should be sufficiently smooth with respect to
the structure revealed by the known labeled and unlabeled points. We present a simple
algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a
number of classification problems and demonstrates effective use of
unlabeled data.
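
A minimal sketch of the kind of propagation iteration that yields such a smooth solution, assuming the well-known F <- alpha*S*F + (1-alpha)*Y scheme on the normalized graph; alpha and the iteration count are illustrative:

    # Smooth label propagation on a similarity graph.
    import numpy as np

    def propagate_labels(W, Y, alpha=0.99, n_iter=200):
        # W: (n, n) symmetric affinities, zero diagonal; Y: (n, k) one-hot
        # labels with all-zero rows for the unlabeled points.
        d = W.sum(1)
        S = W / np.sqrt(np.outer(d, d))            # D^{-1/2} W D^{-1/2}
        F = Y.astype(float).copy()
        for _ in range(n_iter):
            F = alpha * (S @ F) + (1.0 - alpha) * Y
        return F.argmax(1)                         # predicted class per point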

The Wiener series is one of the standard methods to systematically
characterize the nonlinearity of a neural system. The classical
estimation method of the expansion coefficients via cross-correlation
suffers from severe problems that prevent its application to
high-dimensional and strongly nonlinear systems. We propose a new
estimation method based on regression in a reproducing kernel Hilbert
space that overcomes these problems. Numerical experiments show
performance advantages in terms of convergence, interpretability, and
the size of systems that can be handled.
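
To make the connection concrete under a simplifying assumption: an inhomogeneous polynomial kernel of degree p spans all monomials up to order p, i.e. a truncated Volterra/Wiener-type expansion, so the system can be fit by kernel ridge regression without ever forming the expansion coefficients explicitly:

    # Kernel ridge regression with an inhomogeneous polynomial kernel.
    import numpy as np

    def fit_poly_krr(X, y, degree=3, lam=1e-3):
        K = (1.0 + X @ X.T) ** degree              # implicit monomial features
        return np.linalg.solve(K + lam * np.eye(len(y)), y)

    def predict_poly_krr(alpha, X_train, X_new, degree=3):
        return ((1.0 + X_new @ X_train.T) ** degree) @ alpha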

A key tool in protein function discovery is the ability to rank databases of proteins given a query amino acid sequence. The most successful method so far is a web-based tool called PSI-BLAST which uses heuristic alignment of a profile built using the large unlabeled database. It has been shown that such use of global information via an unlabeled data improves over a local measure derived from a basic pairwise alignment such as performed by PSI-BLAST's predecessor, BLAST. In this article we
look at ways of leveraging techniques from the field of machine learning for the problem of ranking. We show how clustering and semi-supervised learning techniques, which aim to capture global structure in data, can significantly improve over PSI-BLAST.

Canonical correlation analysis (CCA) is a classical multivariate method concerned with describing linear dependencies between sets of variables. After a short exposition of the linear sample CCA problem and its analytical solution, the article proceeds with a detailed characterization of its geometry. Projection operators are used to illustrate the relations between canonical vectors and variates. The article then addresses the problem of CCA between spaces spanned by objects mapped into kernel feature spaces. An exact solution for this kernel canonical correlation (KCCA) problem is derived from a geometric point of view. It shows that the expansion coefficients of the canonical vectors in their respective feature space can be found by linear CCA in the basis induced by kernel principal component analysis. The effect of mappings into higher dimensional feature spaces is considered critically since it simplifies the CCA problem in general. Then two regularized variants of KCCA are discussed. Relations to other methods are illustrated, e.g., multicategory kernel Fisher discriminant analysis, kernel principal component regression and possible applications thereof in blind source separation.
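
A hedged sketch of one regularized KCCA formulation, written as a generalized eigenproblem on centered Gram matrices; the ridge kappa and the normalization follow one common convention among several found in the literature:

    # Regularized kernel CCA as a generalized eigenproblem.
    import numpy as np
    from scipy.linalg import eigh

    def kcca(Kx, Ky, kappa=1e-2):
        n = Kx.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        Kx, Ky = H @ Kx @ H, H @ Ky @ H            # center both Gram matrices
        Z = np.zeros((n, n))
        A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
        Rx, Ry = Kx + kappa * np.eye(n), Ky + kappa * np.eye(n)
        B = np.block([[Rx @ Rx, Z], [Z, Ry @ Ry]])
        rho, V = eigh(A, B)                        # generalized eigenproblem
        return rho[-1], V[:, -1]                   # top canonical correlation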

We introduce two new functions, the kernel covariance (KC) and the kernel
mutual information (KMI), to measure the degree of independence of several
continuous random variables.
The former is guaranteed to be zero if and only if the random variables
are pairwise independent; the latter shares this property, and is in addition
an approximate upper bound on the mutual information, as measured near
independence, and is based on a kernel density estimate.
We show that Bach and Jordan's kernel generalised variance (KGV) is also
an upper bound on the same kernel density estimate, but is looser.
Finally, we suggest that the addition of a regularising term in the KGV
causes it to approach the KMI, which motivates the introduction of this
regularisation.
The performance of the KC and KMI is verified in the context of instantaneous
independent component analysis (ICA), by recovering both artificial and
real (musical) signals following linear mixing.

In this short note, building on ideas of M. Herbster [2], we propose a method for automatically tuning the
parameter of the FIXED-SHARE algorithm proposed by Herbster and
Warmuth [3] in the context of on-line learning with
shifting experts. We show that this can be done with a memory
requirement of $O(nT)$ and that the additional loss incurred by
the tuning is the same as the loss incurred for estimating the
parameter of a Bernoulli random variable.
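
For reference, a minimal sketch of the FIXED-SHARE update being tuned: an exponential-weights step followed by a sharing step that redistributes a fraction alpha of each expert's weight uniformly to the others; eta and alpha are exactly the kind of parameters at issue:

    # One round of the Fixed-Share update over n experts.
    import numpy as np

    def fixed_share_step(w, losses, eta=0.5, alpha=0.05):
        v = w * np.exp(-eta * losses)              # multiplicative loss update
        v /= v.sum()
        n = len(v)
        return (1.0 - alpha) * v + alpha * (1.0 - v) / (n - 1)   # sharing step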

Interactive Images are a natural extension of three recent developments: digital photography, interactive web pages, and browsable video. An interactive image is a multi-dimensional image, displayed two dimensions at a time (like a standard digital image), but with which a user can interact to browse through the other dimensions. One might consider a standard video sequence viewed with a video player as a simple interactive image with time as the third dimension. Interactive images are a generalization of this idea, in which the third (and greater) dimensions may be focus, exposure, white balance, saturation, and other parameters. Interaction is handled via a variety of modes including those we call ordinal, pixel-indexed, cumulative, and comprehensive. Through exploration of three novel forms of interactive images based on color, exposure, and focus, we will demonstrate the compelling nature of interactive images.

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.