2009

We shed light on the discrimination between patterns belonging to two different classes by casting this decoding problem into a generalized prototype framework. The discrimination process is then separated into two stages: a projection stage that reduces the dimensionality of the data by projecting it onto a line, and a threshold stage where the distributions of the projected patterns of both classes are separated. For this, we extend the popular mean-of-class prototype classification using algorithms from machine learning that satisfy a set of invariance properties. We present a simple yet general approach for expressing different types of linear classification algorithms in a common, easy-to-visualize framework based on generalized prototypes, where the prototypes encode the normal vector and offset of the separating hyperplane. We investigate nonmargin classifiers such as the classical prototype classifier, the Fisher classifier, and the relevance vector machine. We then study hard and soft margin classifiers such as the support vector machine and a boosted version of the prototype classifier. Subsequently, we relate mean-of-class prototype classification to other classification algorithms by showing that the prototype classifier is a limit of any soft margin classifier and that boosting a prototype classifier yields the support vector machine. While giving novel insights into classification per se by presenting a common and unified formalism, our generalized prototype framework also provides an efficient visualization and a principled comparison of machine learning classification.
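
For concreteness, here is a minimal sketch of the classical mean-of-class prototype classifier that this framework generalizes (NumPy, with illustrative names; labels are assumed to be in {-1, +1}): the two class means act as prototypes, their difference gives the normal vector of the hyperplane, and the threshold is placed at the midpoint between them.

```python
import numpy as np

def prototype_classifier(X_train, y_train, X_test):
    """Mean-of-class prototype classifier: project onto the line joining the
    two class means and threshold at their midpoint."""
    mu_pos = X_train[y_train == 1].mean(axis=0)    # prototype of class +1
    mu_neg = X_train[y_train == -1].mean(axis=0)   # prototype of class -1
    w = mu_pos - mu_neg                            # normal vector of the hyperplane
    b = -0.5 * (mu_pos + mu_neg) @ w               # offset: threshold at the midpoint
    return np.sign(X_test @ w + b)
```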

2008

Consistency is a key property of statistical algorithms when the data
is drawn from some underlying probability distribution. Surprisingly,
despite decades of work, little is known about consistency of most
clustering algorithms. In this paper we investigate consistency of
the popular family of spectral clustering algorithms, which clusters
the data with the help of eigenvectors of graph Laplacian matrices. We
develop new methods to establish that for increasing sample size,
those eigenvectors converge to the eigenvectors of certain limit
operators. As a result we can prove that one of the two major classes
of spectral clustering (normalized clustering) converges under very
general conditions, while the other (unnormalized clustering) is only
consistent under strong additional assumptions, which are not always
satisfied in real data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral clustering.
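
As a point of reference for the two classes of algorithms discussed here, the following is a minimal sketch of one normalized spectral clustering variant (the Gaussian similarity graph, the bandwidth sigma, and the function names are our illustrative assumptions, and the row renormalization used by some variants is omitted).

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering(X, k, sigma=1.0):
    """Cluster X (shape (n, d)) into k groups using the eigenvectors of the
    symmetrically normalized graph Laplacian of a Gaussian similarity graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.exp(-d2 / (2 * sigma ** 2))                    # similarity (adjacency) matrix
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, V = eigh(L_sym)                                    # eigenvectors, ascending eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(V[:, :k])
```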

This Chapter presents the PASCAL Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good “probabilistic predictions”, rather than just the usual “point predictions” with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.
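
To illustrate the difference between evaluating point predictions and probabilistic predictions, here is a hedged sketch of one standard loss of the kind described (not necessarily one of the exact losses used in the Challenge): the negative log predictive density for a Gaussian predictive distribution penalizes both an inaccurate mean and a miscalibrated variance.

```python
import numpy as np

def nlpd_gaussian(y_true, mu, sigma2):
    """Negative log predictive density for Gaussian predictive distributions:
    small only if the predicted mean mu is accurate AND the predicted
    variance sigma2 is well calibrated."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma2)
                   + 0.5 * (y_true - mu) ** 2 / sigma2)
```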

We study the properties of the eigenvalues of Gram matrices in a non-asymptotic setting. Using local Rademacher averages, we
provide data-dependent and tight bounds for their convergence towards
eigenvalues of the corresponding kernel operator. We perform these computations in a functional analytic framework which allows us to deal implicitly with reproducing kernel Hilbert spaces of infinite dimension. This can
have applications to various kernel algorithms, such as Support Vector
Machines (SVM). We focus on Kernel Principal Component Analysis
(KPCA) and, using such techniques, we obtain sharp excess risk bounds
for the reconstruction error. In these bounds, the dependence on the
decay of the spectrum and on the closeness of successive eigenvalues is
made explicit.
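
A minimal numerical illustration of the quantities involved (a sketch under our own naming, assuming a Gaussian kernel): the eigenvalues of the centred Gram matrix, scaled by 1/n, estimate the eigenvalues of the kernel operator, and the empirical KPCA reconstruction error with d components is the sum of the remaining eigenvalues.

```python
import numpy as np

def empirical_spectrum_and_kpca_error(X, sigma=1.0, d=5):
    """Eigenvalues of the centred Gram matrix (scaled by 1/n) and the
    empirical KPCA reconstruction error when keeping d principal components."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2))                # Gaussian Gram matrix
    H = np.eye(n) - np.ones((n, n)) / n               # centring matrix
    eigvals = np.sort(np.linalg.eigvalsh(H @ K @ H))[::-1] / n
    return eigvals, eigvals[d:].sum()                 # spectrum, tail sum = reconstruction error
```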

2005

We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent.
We also show that, near independence, the kernel mutual information is an upper bound on the Parzen window estimate of the mutual information.
Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal
kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis.
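
In the notation commonly used in this line of work (the symbols below are ours), the constrained covariance of a joint distribution $P_{xy}$ with respect to RKHSs $\mathcal{F}$ and $\mathcal{G}$ is the largest covariance attainable by witness functions of unit norm:

$$\mathrm{COCO}(P_{xy};\mathcal{F},\mathcal{G}) \;=\; \sup_{\|f\|_{\mathcal{F}}\le 1,\;\|g\|_{\mathcal{G}}\le 1} \operatorname{Cov}\bigl(f(x),\,g(y)\bigr),$$

and, for universal kernels, this quantity vanishes if and only if $x$ and $y$ are independent.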

We propose an independence criterion based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates.
Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.
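
A minimal sketch of the standard biased empirical estimate described above (assuming Gaussian kernels with bandwidths sigma_x and sigma_y, and 2-D sample arrays of shape (n, d)):

```python
import numpy as np

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)**2, where K and L are
    Gram matrices of X and Y and H centres them. Values near zero indicate
    (approximate) independence; no user-defined regularisation is needed."""
    n = len(X)
    dx = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    dy = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-dx / (2 * sigma_x ** 2))
    L = np.exp(-dy / (2 * sigma_y ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```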

Journal of Computer and System Sciences, 71(3):333-359, October 2005 (article)


In order to apply the maximum margin method in arbitrary metric spaces, we suggest embedding the metric space into a Banach or Hilbert space and performing linear classification in this space.
We propose several embeddings and recall that an isometric embedding
in a Banach space is always possible while an isometric embedding in
a Hilbert space is only possible for certain metric spaces. As a
result, we obtain a general maximum margin classification algorithm for arbitrary metric spaces (whose solution is approximated by an algorithm of Graepel et al.).
Interestingly enough, the embedding approach, when applied to a metric
which can be embedded into a Hilbert space, yields the SVM
algorithm, which emphasizes the fact that its solution depends on
the metric and not on the kernel. Furthermore, we give upper bounds on the capacity of the function classes corresponding to both embeddings in terms of Rademacher averages. Finally, we compare the
capacities of these function classes directly.

We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular.
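
For reference (our notation), the empirical Rademacher average of a function class $\mathcal{F}$ on a sample $x_1,\dots,x_n$ is

$$\hat{R}_n(\mathcal{F}) \;=\; \mathbb{E}_{\sigma}\Bigl[\,\sup_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(x_i)\Bigr],$$

where the $\sigma_i$ are independent uniform $\pm 1$ variables; the local version used here restricts the supremum to the subset of functions with small empirical error.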

An important aspect of clustering algorithms is whether the partitions constructed on finite samples converge to a useful clustering of the whole data space as the sample size increases. This paper investigates this question for normalized and unnormalized versions of the popular spectral
clustering algorithm. Surprisingly, the convergence of unnormalized spectral clustering is more difficult to handle than the normalized case. Even though recently some first results on the convergence of normalized spectral clustering have been obtained, for the unnormalized case
we have to develop a completely new approach combining tools from numerical integration, spectral and perturbation theory, and probability. It turns out that while in the normalized case, spectral clustering usually converges to a nice partition of the data space, in the unnormalized case
the same only holds under strong additional assumptions which are not always satisfied. We conclude that our analysis gives strong evidence for the superiority of normalized spectral clustering. It also provides a basis
for future exploration of other Laplacian-based methods.

We propose an independence criterion based on the eigenspectrum of covariance operators in reproducing kernel Hilbert spaces (RKHSs), consisting of an empirical estimate of the Hilbert-Schmidt norm of the cross-covariance operator (we term this a Hilbert-Schmidt Independence Criterion, or HSIC). This approach has several advantages, compared with previous kernel-based independence criteria. First, the empirical estimate is simpler than any other kernel dependence test, and requires no user-defined regularisation. Second, there is a clearly defined population quantity which the empirical estimate approaches in the large sample limit, with exponential convergence guaranteed between the two: this ensures that independence tests based on HSIC do not suffer from slow learning rates.
Finally, we show in the context of independent component analysis (ICA) that the performance of HSIC is competitive with that of previously published kernel-based criteria, and of other recently published ICA methods.

In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pages: 112-119, (Editors: R. Cowell and Z. Ghahramani), AISTATS, January 2005 (inproceedings)


We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.

We investigate the problem of defining Hilbertian metrics and, respectively, positive definite kernels on probability measures, continuing previous work. This type of kernel has shown very good results in text classification and has a wide range of possible applications. In this paper we extend the two-parameter family of Hilbertian metrics of Topsøe so that it now includes all commonly used Hilbertian metrics on probability measures. This allows us to do model selection among these metrics in an elegant and unified way. Second, we further investigate our approach of incorporating similarity information of the probability space into
the kernel. The analysis provides a better understanding of these
kernels and gives in some cases a more efficient way to compute
them. Finally we compare all proposed kernels in two text and two
image classification problems.
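
As one concrete member of this family (a sketch with illustrative names; the scale parameter lam is our assumption), the square root of the Jensen-Shannon divergence is a Hilbertian metric on probability measures, so exponentiating the divergence yields a positive definite kernel of Gaussian type on normalized histograms:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms (entries sum to 1);
    its square root is one of the Hilbertian metrics discussed above."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

def js_kernel_matrix(P, lam=1.0):
    """Gaussian-type kernel induced by the JS metric: k(p, q) = exp(-lam * JS(p, q))."""
    n = len(P)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.exp(-lam * js_divergence(P[i], P[j]))
    return K
```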

We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence,
with emphasis on constrained covariance (COCO), a novel criterion to
test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs
are universal.
That said, no independence
test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth. All current kernel-based independence tests share this behaviour. We demonstrate exponential convergence between the population and empirical COCO. Finally, we use COCO as a measure of joint neural activity between voxels in MRI recordings of the macaque monkey, and compare the results to the mutual information and the correlation. We also show the effect of removing breathing artefacts from the MRI recording.

We develop a methodology for solving high dimensional dependency estimation problems between pairs of data types, which is viable in the case where the output of interest has very high dimension, e.g., thousands of dimensions. This is achieved by mapping the objects into continuous or discrete spaces, using joint kernels. Known correlations between input and output can be defined by such kernels, some of which can maintain linearity in the outputs to provide simple (closed form) pre-images. We provide examples of such kernels and empirical results.

The last few years have witnessed important new developments in the theory and practice
of pattern classification. We intend to survey some of the main new ideas that have led to these important recent developments.

A general method for obtaining moment inequalities for functions
of independent random variables is presented. It is a
generalization of the entropy method, which has been used to derive concentration inequalities for such functions [BoLuMa01], and is based on a generalized tensorization inequality due to Latała and Oleszkiewicz [LaOl00].
The new inequalities prove to be a versatile tool in a
wide range of applications.
We illustrate the power of the method by showing how
it can be used to effortlessly re-derive classical
inequalities including
Rosenthal and Kahane-Khinchine-type inequalities for sums
of independent random variables, moment inequalities for suprema
of empirical processes, and moment inequalities for Rademacher chaos
and $U$-statistics. Some of these corollaries are apparently new.
In particular, we generalize Talagrand's exponential inequality
for Rademacher chaos of order two to any order.
We also discuss applications for other complex functions
of independent random variables, such as suprema of boolean polynomials
which include, as special cases, subgraph counting problems in
random graphs.
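
For orientation, one classical instance of the Rosenthal-type bounds mentioned above, stated in our notation: for independent centred random variables $X_1,\dots,X_n$ and any $q \ge 2$,

$$\mathbb{E}\Bigl|\sum_{i=1}^{n} X_i\Bigr|^{q} \;\le\; C_q\,\Bigl(\bigl(\textstyle\sum_{i=1}^{n}\mathbb{E}X_i^2\bigr)^{q/2} + \sum_{i=1}^{n}\mathbb{E}|X_i|^{q}\Bigr),$$

where $C_q$ depends only on $q$; the moment method sketched in the abstract re-derives bounds of this form.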

Machine Learning has become a key enabling technology for many engineering applications, investigating scientific questions and theoretical problems alike. To stimulate discussions and to disseminate new results, a summer school series was started in February 2002, the documentation of which is published as LNAI 2600.
This book presents revised lectures of two subsequent summer schools held in 2003 in Canberra, Australia, and in Tübingen, Germany. The tutorial lectures included are devoted to statistical learning theory, unsupervised learning, Bayesian inference, and applications in pattern recognition; they provide in-depth overviews of exciting new developments and contain a large number of references.
Graduate students, lecturers, researchers and professionals alike will find this book a useful resource in learning and teaching machine learning.

We investigate the problem of defining Hilbertian metrics and, respectively, positive definite kernels on probability measures, continuing previous work. This type of kernel has shown very good results in text classification and has a wide range of possible applications. In this paper we extend the two-parameter family of Hilbertian metrics of Topsøe so that it now includes all commonly used Hilbertian metrics on probability measures. This allows us to do model selection among these metrics in an elegant and unified way. Second, we further investigate our approach of incorporating similarity information of the probability space into
the kernel. The analysis provides a better understanding of these
kernels and gives in some cases a more efficient way to compute
them. Finally, we compare all proposed kernels on two text classification problems and one image classification problem.

This paper gives a survey of results in the mathematical
literature on positive definite kernels and their associated
structures. We concentrate on properties which seem potentially
relevant for Machine Learning and try to clarify some results that
have been misused in the literature. Moreover we consider
different lines of generalizations of positive definite kernels.
Namely we deal with operator-valued kernels and present the
general framework of Hilbertian subspaces of Schwartz which we use
to introduce kernels which are distributions. Finally indefinite
kernels and their associated reproducing kernel spaces are
considered.
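
For reference, the basic notion the survey builds on (in our notation): a symmetric function $k : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is positive definite if, for every $n$, all points $x_1,\dots,x_n\in\mathcal{X}$, and all coefficients $c\in\mathbb{R}^n$,

$$\sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \;\ge\; 0,$$

which is exactly the property that guarantees an associated reproducing kernel Hilbert space.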

There exist many different generalization error bounds for classification. Each of these bounds contains an improvement over the others for certain situations. Our goal is to combine these different improvements into a single bound. In particular, we combine the PAC-Bayes approach introduced by McAllester, which is interesting for averaging classifiers, with the optimal union bound provided by the generic chaining technique developed by Fernique and Talagrand. This combination is quite natural since the generic chaining is based on the notion of majorizing measures, which can be considered as priors on the set of classifiers, and such priors also arise in the PAC-Bayesian setting.

The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits the global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in a Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a large amount of data. Encouraging experimental results from synthetic, image, and text data illustrate the validity of our method.
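
A minimal sketch of one way such a manifold-respecting ranking can be realized (the Gaussian affinity, the parameter names, and the closed-form solution are our illustrative choices): relevance is spread from the query over a symmetrically normalized similarity graph, and items are ranked by the resulting scores.

```python
import numpy as np

def manifold_ranking(X, query_idx, alpha=0.99, sigma=1.0):
    """Rank all points in X (shape (n, d)) by relevance to the query point(s),
    respecting the intrinsic manifold structure of the data."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                          # no self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                   # normalized similarity graph
    y = np.zeros(n)
    y[query_idx] = 1.0                                # indicator of the query
    f = np.linalg.solve(np.eye(n) - alpha * S, y)     # diffuse relevance over the graph
    return np.argsort(-f)                             # indices, most relevant first
```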

The goal of this article is to develop a framework for large margin classification in metric spaces. We want to find a generalization of linear decision functions for metric spaces and define a corresponding notion of margin such that the decision function separates the training points with a large margin. It turns out that, when Lipschitz functions are used as decision functions, the inverse of the Lipschitz constant can be interpreted as the size of a margin. In order to construct a clean mathematical setup we isometrically embed the given metric space into a Banach space and the space of Lipschitz functions into its dual space. To analyze the resulting algorithm, we prove several representer theorems. They state that there always exist solutions of the Lipschitz classifier which can be expressed in terms of distance functions to training points. We provide generalization bounds for Lipschitz classifiers in terms of the Rademacher complexities of some Lipschitz function classes. The generality of our approach can be seen from the fact that several well-known algorithms are special cases of the Lipschitz classifier, among them the support vector machine, the linear programming machine, and the 1-nearest neighbor classifier.
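
In the notation of this setting (symbols ours), the Lipschitz constant of a decision function $f$ on a metric space $(\mathcal{X}, d)$ is

$$L(f) \;=\; \sup_{x \neq x'} \frac{|f(x) - f(x')|}{d(x, x')},$$

and if $f$ separates the training points with $y_i f(x_i) \ge 1$, then $1/L(f)$ plays the role of the margin; minimizing $L(f)$ under these constraints is the large margin principle referred to above.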

We consider the general problem of learning from labeled and
unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data.
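
A minimal sketch of one way to realize the smooth-solution idea just described (the Gaussian affinity, the parameter names, and the iteration count are our illustrative assumptions): known labels are propagated over a normalized similarity graph so that the resulting scores vary slowly within densely connected regions.

```python
import numpy as np

def propagate_labels(X, y, alpha=0.99, sigma=1.0, n_iter=200):
    """Semi-supervised classification by label propagation.
    y holds +1 / -1 for labeled points and 0 for unlabeled points;
    the returned array contains predicted labels for all points."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                   # normalized similarity graph
    F = np.zeros(n)
    for _ in range(n_iter):                           # F <- alpha*S*F + (1-alpha)*y
        F = alpha * S @ F + (1 - alpha) * y
    return np.sign(F)
```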

We address in this paper the question of how knowledge of the marginal distribution $P(x)$ can be incorporated into a learning algorithm. We suggest three theoretical methods for taking this distribution into account for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.

In this paper we investigate connections between statistical learning
theory and data compression on the basis of support vector machine (SVM)
model selection. Inspired by several generalization bounds we construct
"compression coefficients" for SVMs which measure the amount by which the
training labels can be compressed by a code built from the separating
hyperplane. The main idea is to relate the coding precision to geometrical
concepts such as the width of the margin or the shape of the data in the
feature space. The compression coefficients derived in this way combine well-known quantities such as the radius-margin term $R^2/\rho^2$, the eigenvalues of the
kernel matrix, and the number of support vectors. To test whether they are
useful in practice we ran model selection experiments on benchmark data
sets. As a result we found that compression coefficients can fairly
accurately predict the parameters for which the test error is minimized.
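
For reference (notation ours): $R$ denotes the radius of the smallest sphere enclosing the mapped training points in feature space and $\rho$ the margin of the separating hyperplane, and classical leave-one-out arguments bound the expected error of a hard-margin SVM by a quantity proportional to

$$\frac{R^2}{n\,\rho^2},$$

which is why the radius-margin ratio enters the compression coefficients above.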

The goal of this article is to
investigate the field of Hilbertian metrics on probability
measures. Since they are very versatile and can be applied to various problems, they are of great interest in kernel methods. Quite recently, Topsøe and Fuglede introduced a family of Hilbertian metrics on probability measures. We give basic properties of the Hilbertian metrics of this family and of other metrics used in the literature. Then we propose an extension of the
considered metrics which incorporates structural information of
the probability space into the Hilbertian metric. Finally we
compare all proposed metrics in an image and text classification
problem using histogram data.

We discuss reproducing kernel Hilbert space (RKHS)-based measures of statistical dependence, with emphasis on constrained covariance (COCO), a novel criterion to test dependence of random variables. We show that COCO is a test for independence if and only if the associated RKHSs are universal. That said, no independence test exists that can distinguish dependent and independent random variables in all circumstances. Dependent random variables can result in a COCO which is arbitrarily close to zero when the source densities are highly non-smooth, which can make dependence hard to detect empirically. All current kernel-based independence tests share this behaviour. Finally, we demonstrate exponential convergence between the population and empirical COCO, which implies that COCO does not suffer from slow learning rates when used as a dependence test.

In this article we construct a maximal margin classification algorithm for arbitrary metric spaces. We first show that the Support Vector Machine (SVM) is a maximal margin algorithm for the class of metric spaces where the negative squared distance is conditionally positive definite (CPD). This means that the metric space can be isometrically embedded into a Hilbert space, where one performs linear maximal margin separation. We will show that the solution only depends on the metric, but not on the kernel. Following the framework we develop for the SVM, we construct an algorithm for maximal margin classification in arbitrary metric spaces. The main difference compared with the SVM is that we no longer embed isometrically into a Hilbert space, but into a Banach space. We further give an estimate of the capacity of the function class involved in this algorithm via Rademacher averages. We recover an algorithm of Graepel et al. [6].
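
For reference (notation ours), a symmetric function $k$ is conditionally positive definite if

$$\sum_{i=1}^{n}\sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \;\ge\; 0 \qquad \text{for all } x_1,\dots,x_n \text{ and all } c\in\mathbb{R}^n \text{ with } \textstyle\sum_i c_i = 0;$$

by a classical result of Schoenberg, a metric space $(\mathcal{X}, d)$ embeds isometrically into a Hilbert space exactly when $-d^2$ has this property, which is the condition under which the SVM construction above applies.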

We obtain exponential concentration inequalities for sub-additive
functions of independent random variables under weak conditions on the increments of those functions, such as the existence of exponential moments for these increments.
As a consequence of these general inequalities, we obtain refinements
of Talagrand's inequality for empirical processes and new
bounds for randomized empirical processes.
These results are obtained by further developing the entropy method
introduced by Ledoux.

We investigate data-based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow the spectrum to vary while keeping the eigenvectors fixed. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them
to an alignment-based approach.

The Google search engine has had a huge success with its PageRank
web page ranking algorithm, which exploits global, rather than
local, hyperlink structure of the World Wide Web using random walks. This algorithm can only be used for graph data, however.
Here we propose a simple universal ranking algorithm for vectorial
data, based on the exploration of the intrinsic global geometric
structure revealed by a huge amount of data. Experimental results on data ranging from images and text to bioinformatics illustrate the validity of our algorithm.

We consider the learning problem in the transductive setting. Given
a set of points of which only some are labeled, the goal is to
predict the labels of the unlabeled points. A principled way to solve such a learning problem is to rely on the consistency assumption that a classifying function should be sufficiently smooth with respect to the structure revealed by the known labeled and unlabeled points. We present a simple
algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a
number of classification problems and demonstrates effective use of
unlabeled data.

Motivation: In drug discovery a key task is to identify characteristics that separate active (binding) compounds from inactive (non-binding) ones. An automated prediction system can help reduce resources necessary to carry out this task.
Results: Two methods for prediction of molecular bioactivity for drug design are introduced and shown to perform well on a data set previously studied as part of the KDD (Knowledge Discovery and Data Mining) Cup 2001. The data is characterized by very few positive examples, a very large number of features (describing three-dimensional properties of the molecules) and rather different distributions between training and test data. Two techniques are introduced specifically to tackle these problems: a feature selection method for unbalanced data and a classifier which adapts to the distribution of the unlabeled test data (a so-called transductive method). We show that both techniques improve identification performance and that, in conjunction, they provide an improvement over using either technique alone. Our results suggest the importance of taking into account the characteristics of this data, which may also be relevant in other problems of a similar type.

In this short note, building on ideas of M. Herbster [2], we propose a method for automatically tuning the
parameter of the FIXED-SHARE algorithm proposed by Herbster and
Warmuth [3] in the context of on-line learning with
shifting experts. We show that this can be done with a memory
requirement of $O(nT)$ and that the additional loss incurred by
the tuning is the same as the loss incurred for estimating the
parameter of a Bernoulli random variable.

Recently introduced in Machine Learning, the notion of kernels has
drawn a lot of interest as it makes it possible to obtain non-linear algorithms from linear ones in a simple and elegant manner. This, in conjunction with the introduction of new linear classification methods such as the Support Vector Machine, has produced significant progress. The success of such algorithms is now spreading as they are applied to more and more domains. Many Signal Processing problems, by their non-linear and high-dimensional nature, may benefit from such
techniques. We give an overview of kernel methods and their recent
applications.
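
As a small, self-contained illustration of the kernelization idea described above (a sketch with illustrative parameter names, not tied to any particular application in the overview): kernel ridge regression turns linear ridge regression into a non-linear method simply by replacing inner products with kernel evaluations.

```python
import numpy as np

def kernel_ridge_fit_predict(X_train, y_train, X_test, sigma=1.0, lam=1e-2):
    """Kernel ridge regression: a linear algorithm made non-linear via the
    kernel trick. The coefficients solve (K + lam*I) alpha = y; predictions
    are weighted sums of kernel evaluations against the training points."""
    def gauss(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    K = gauss(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return gauss(X_test, X_train) @ alpha
```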

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.