In the multiple instance learning setting, each observation is a bag of feature vectors, one or more of which may indicate membership in a class. The primary task is to identify whether any vectors in the bag indicate class membership while ignoring vectors that do not. We describe here a kernel-based technique that defines a parametric family of kernels via conformal transformations and jointly learns a discriminant function over bags together with the optimal parameter settings of the kernel. Learning a conformal transformation effectively amounts to weighting regions in the feature space according to their contribution to classification accuracy; regions that are discriminative will be weighted higher than regions that are not. This allows the classifier to focus on regions contributing to classification accuracy while ignoring regions that correspond to vectors found both in positive and in negative bags. We show how parameters of this transformation can be learned for support vector machines by posing the
problem as a multiple kernel learning problem. The resulting multiple instance classifier gives competitive accuracy for several multi-instance benchmark datasets from different domains.
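As a rough illustration of the underlying construction (a sketch of a conformally transformed kernel in general, not the paper's specific parameterization or its MKL-based training), a conformal transformation reweights a base kernel $k$ by a positive, location-dependent factor $c(x)$, giving $\tilde{k}(x, x') = c(x)\,c(x')\,k(x, x')$; the region weights `theta` below stand in for the parameters that a multiple kernel learning step would tune.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Base RBF kernel matrix between rows of X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conformal_factor(X, centers, theta, gamma=1.0):
    """Location-dependent weight c(x) = sum_m theta_m * exp(-gamma ||x - mu_m||^2).

    The centers mu_m mark regions of feature space; theta_m >= 0 controls how
    strongly each region contributes to the transformed kernel.
    """
    return rbf_kernel(X, centers, gamma) @ theta

def conformal_kernel(X, Y, centers, theta, gamma=1.0):
    """Conformally transformed kernel  k~(x, y) = c(x) c(y) k(x, y)."""
    cX = conformal_factor(X, centers, theta, gamma)
    cY = conformal_factor(Y, centers, theta, gamma)
    return cX[:, None] * rbf_kernel(X, Y, gamma) * cY[None, :]

# toy usage: two bags of instances and some (hypothetical) region weights
rng = np.random.default_rng(0)
bag_a, bag_b = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
centers = rng.normal(size=(3, 2))          # candidate "regions"
theta = np.array([1.0, 0.1, 0.5])          # weights an MKL step would tune
K = conformal_kernel(bag_a, bag_b, centers, theta)
print(K.shape)                             # (5, 7)
```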

A major challenge in applying machine learning methods to Brain-Computer
Interfaces (BCIs) is to overcome possible nonstationarity between the data block
the method is trained on and the data block it is applied to. Assuming the joint
distributions of the whitened signal and the class label to be identical in two blocks, where
the whitening is done in each block independently, we propose a simple adaptation formula
that is applicable to a broad class of spatial filtering methods including ICA, CSP, and
logistic regression classifiers. We characterize the class of linear transformations for which
the above assumption holds. Experimental results on 60 BCI datasets show improved
classification accuracy compared to (a) the fixed spatial filter approach (no adaptation) and
(b) the fixed spatial pattern approach (proposed by Hill et al., 2006 [1]).
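A minimal sketch of the block-wise whitening this rests on (the symmetric inverse-square-root whitening, the variable names, and the toy data are illustrative assumptions; the paper's adaptation formula and the class of admissible transformations are what the abstract refers to):

```python
import numpy as np

def whitener(X):
    """Symmetric whitening matrix P = C^{-1/2} for data X (samples x channels)."""
    C = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(C)
    return evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

# X_train, X_test: EEG blocks, shape (n_samples, n_channels)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))
X_test = 1.5 * rng.normal(size=(1000, 8))   # different scaling => non-stationarity

P_train, P_test = whitener(X_train), whitener(X_test)

# w: spatial filter learned in the whitened coordinates of the training block
# (placeholder values here)
w = rng.normal(size=8)

# Fixed-filter approach: apply w in the training block's whitened coordinates.
s_fixed = X_test @ P_train.T @ w
# Adapted approach: re-whiten the test block with its own covariance first,
# relying on the assumption that the whitened distributions coincide.
s_adapted = X_test @ P_test.T @ w
```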

Small molecules in chemistry can be represented as graphs.
In a quantitative structure-activity relationship (QSAR) analysis, the
central task is to find a regression function that predicts
the activity of a molecule with high accuracy.
Taking QSAR as the primary target, we propose a new linear
programming approach to the graph-based regression problem.
Our method extends the graph classification algorithm by Kudo et al.
(NIPS 2004), which is a combination of boosting and graph mining.
Instead of sequential multiplicative updates, we employ the linear
programming boosting (LP) for regression. The LP approach allows us to
include inequality constraints for the parameter vector, which turns out to
be particularly useful in QSAR tasks where activity values are
sometimes unavailable.
Furthermore, the efficiency is improved significantly by employing
multiple pricing.
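The graph-mining component is beyond a short example, but the LP ingredient can be illustrated on a fixed feature matrix. The sketch below is a generic L1-regularized least-absolute-deviation regression in which additional linear inequality constraints are added for molecules whose activity is only known as a lower bound; this is an assumption-laden illustration of how inequality information can enter an LP, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linprog

def lp_regression(X, y, X_lb, lb, lam=0.1):
    """L1-regularized LAD regression with extra lower-bound constraints.

    Exact examples (X, y) enter via absolute-deviation slacks; lower-bound
    examples (X_lb, lb) only require the prediction to be at least lb.
    Decision vector: [beta_plus (d), beta_minus (d), xi (n)], all >= 0.
    """
    n, d = X.shape
    c = np.concatenate([lam * np.ones(2 * d), np.ones(n)])
    I = np.eye(n)
    A_ub = np.vstack([
        np.hstack([-X,  X, -I]),                            #  y - X b <= xi
        np.hstack([ X, -X, -I]),                            #  X b - y <= xi
        np.hstack([-X_lb, X_lb, np.zeros((len(lb), n))]),   #  X_lb b >= lb
    ])
    b_ub = np.concatenate([-y, y, -lb])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:d] - res.x[d:2 * d]

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5)); true_b = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y = X @ true_b + 0.1 * rng.normal(size=30)
X_lb = rng.normal(size=(10, 5)); lb = X_lb @ true_b - 0.5   # only lower bounds known
print(lp_regression(X, y, X_lb, lb).round(2))
```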

In this paper we introduce a novel approach for incrementally building aspect models, and use it to dynamically discover underlying themes from document streams. Using the new approach we present an application which we call query-line tracking, i.e., we automatically discover and summarize different themes or stories that appear over time and that relate to a particular query. We present an evaluation on news corpora to demonstrate the strength of our method for query-line tracking, online indexing, and clustering.

Despite many years of research on how to properly align sequences in
the presence of sequencing errors, alternative splicing and
micro-exons, the correct alignment of mRNA sequences to genomic DNA is
still a challenging task. We present a novel approach based on large
margin learning that combines kernel based splice site predictions
with common sequence alignment techniques. By solving a convex
optimization problem, our algorithm -- called PALMA -- tunes the
parameters of the model such that the true alignment scores higher
than all other alignments. In an experimental study on the alignments
of mRNAs containing artificially generated micro-exons, we show that
our algorithm drastically outperforms all other methods: It perfectly
aligns all 4358 sequences on a hold-out set, while the best other
method misaligns at least 90 of them. Moreover, our algorithm is very
robust against noise in the query sequence: when deleting, inserting,
or mutating up to 50% of the query sequence, it still aligns 95% of
all sequences correctly, while other methods achieve less than 36%
accuracy. For datasets, additional results and a stand-alone
alignment tool see
http://www.fml.mpg.de/raetsch/projects/palma.

In many graph-based semi-supervised learning algorithms, edge weights are assumed to be fixed and determined by the data points' (often symmetric) relationships in input space, without considering directionality.
However, relationships may be more informative in one direction (e.g. from labelled to unlabelled) than in the reverse direction, and some
relationships (e.g. strong weights between oppositely labelled points) are unhelpful in either direction. Undesirable edges may reduce the amount of influence an informative point can propagate to its neighbours -- the point and its outgoing edges have been ``blunted.'' We present an approach to ``sharpening'' in which weights are adjusted to meet an optimization criterion
wherever they are directed towards labelled points. This principle can be applied to a wide variety of algorithms. In the current paper, we present one ad hoc solution satisfying the principle, in order to show that it can improve performance on a number of publicly available benchmark data sets.

In this paper, an approach to the finite-horizon optimal state-feedback control problem of nonlinear, stochastic, discrete-time systems is presented. Starting from the dynamic programming equation, the value function will be approximated by means of Taylor series expansion up to second-order derivatives. Moreover, the problem will be reformulated, such that a minimum principle can be applied to the stochastic problem. Employing this minimum principle, the optimal control problem can be rewritten as a two-point boundary-value problem to be solved at each time step of a shrinking horizon. To avoid numerical problems, the two-point boundary-value problem will be solved by means of a continuation method. Thus, the curse of dimensionality of dynamic programming is avoided, and good candidates for the optimal state-feedback controls are obtained. The proposed approach will be evaluated by means of a scalar example system.

The regularization functional induced by the graph Laplacian of a random
neighborhood graph based on the data is adaptive in two ways. First it adapts to an underlying
manifold structure and second to the density of the data-generating probability measure.
We identify in this paper the limit of the regularizer and show
uniform convergence over the space of Hoelder functions. As an intermediate
step we derive upper bounds on the covering numbers of Hoelder functions on
compact Riemannian manifolds, which are of independent interest
for the theoretical analysis of manifold-based learning methods.
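For concreteness, the regularization functional in question is the quadratic form induced by the graph Laplacian of a neighborhood graph built on the sample. A minimal sketch (with an illustrative Gaussian-weighted kNN graph; the paper's construction of the random neighborhood graph may differ in details):

```python
import numpy as np

def graph_laplacian_regularizer(X, f, k=10, sigma=0.5):
    """S(f) = f^T L f = 0.5 * sum_ij w_ij (f_i - f_j)^2 on a kNN graph.

    X: data points (n x d), f: function values at the points (n,).
    The neighborhood graph and Gaussian weights are illustrative choices.
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]          # k nearest neighbours
    for i in range(n):
        W[i, nn[i]] = np.exp(-d2[i, nn[i]] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize
    L = np.diag(W.sum(1)) - W                         # unnormalized Laplacian
    return f @ L @ f

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print(graph_laplacian_regularizer(X, np.sin(X[:, 0]), k=10))
```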

The Common Spatial Pattern (CSP) algorithm is a highly successful method for efficiently calculating spatial filters for brain signal classification. Spatial filtering can improve classification performance considerably, but demands that a large number of electrodes be mounted, which is inconvenient in day-to-day BCI usage. The CSP algorithm is also known for its tendency to overfit, i.e. to learn the noise in the training set rather than the signal. Both problems motivate an approach in which spatial filters are sparsified. We briefly sketch a reformulation of the problem which allows us to do this, using 1-norm regularisation. Focusing on the electrode selection issue, we present preliminary results on EEG data sets that suggest that effective spatial filters may be computed with as few as 10--20 electrodes, hence offering the potential to simplify the practical realisation of BCI systems significantly.
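For reference, the non-sparse CSP computation that this builds on reduces to a generalized eigenvalue problem between the two class covariance matrices; the reformulation mentioned above replaces it with a 1-norm-regularised problem, which is not shown here. A standard CSP sketch:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(X1, X2, n_filters=6):
    """Standard CSP: X1, X2 are (trials, samples, channels) for the two classes.

    Returns the n_filters spatial filters with the most extreme eigenvalues
    of the generalized problem  C1 w = lambda (C1 + C2) w.
    """
    C1 = np.mean([np.cov(t, rowvar=False) for t in X1], axis=0)
    C2 = np.mean([np.cov(t, rowvar=False) for t in X2], axis=0)
    evals, evecs = eigh(C1, C1 + C2)                 # eigenvalues in [0, 1]
    order = np.argsort(np.abs(evals - 0.5))[::-1]    # most discriminative first
    return evecs[:, order[:n_filters]]

rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 500, 10))
X2 = 1.3 * rng.normal(size=(20, 500, 10))
W = csp_filters(X1, X2)
print(W.shape)   # (channels, n_filters)
```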

Given a spatial filtering algorithm that has allowed us to identify task-relevant EEG sources, we present a simple approach
for monitoring the activity of these sources while remaining relatively robust to changes in other (task-irrelevant) brain activity. The idea is to keep spatial *patterns* fixed rather than spatial filters, when transferring from
training to test sessions or from one time window to another. We show that a fixed spatial pattern (FSP)
approach, using a moving-window estimate of signal covariances, can be more robust to non-stationarity than a fixed spatial filter (FSF) approach.
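One simple way to realise this idea (a sketch based on the usual filter-pattern relation $a = C w$; the exact procedure in the paper may differ) is to keep the spatial pattern fixed and recompute the filter from the covariance of each new window:

```python
import numpy as np

def fsp_filter(pattern, cov_window, reg=1e-6):
    """Recompute a spatial filter from a fixed spatial pattern.

    Uses the relation a = C w (pattern = covariance times filter): given the
    fixed pattern and the covariance of the current time window, the filter
    is w = C^{-1} a, so it tracks changes in the background activity.
    """
    C = cov_window + reg * np.eye(len(cov_window))
    return np.linalg.solve(C, pattern)

# toy usage: moving-window covariances from a test session
rng = np.random.default_rng(0)
X_test = rng.normal(size=(5000, 8))
pattern = rng.normal(size=8)                 # fixed pattern from training
filters = []
for start in range(0, 5000 - 500, 500):
    C_t = np.cov(X_test[start:start + 500], rowvar=False)
    filters.append(fsp_filter(pattern, C_t))
```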

Stability is a common tool to verify the validity of sample-based
algorithms. In clustering it is widely used to tune the parameters of
the algorithm, such as the number k of clusters. In spite of the popularity
of stability in practical applications, there has been very little theoretical
analysis of this notion. In this paper we provide a formal definition
of stability and analyze some of its basic properties. Quite surprisingly,
the conclusion of our analysis is that for large sample size, stability is
fully determined by the behavior of the objective function which the
clustering algorithm is aiming to minimize. If the objective function has
a unique global minimizer, the algorithm is stable, otherwise it is unstable.
In particular we conclude that stability is not a well-suited tool
to determine the number of clusters - it is determined by the symmetries
of the data which may be unrelated to clustering parameters. We
prove our results for center-based clusterings and for spectral clustering,
and support our conclusions by many examples in which the behavior of
stability is counter-intuitive.
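For context, the kind of stability protocol analysed here can be sketched as follows: cluster independent subsamples, extend each clustering to a common evaluation set, and measure agreement up to label permutation; the number of clusters maximising average agreement is then selected. A minimal sketch using k-means and the adjusted Rand index as the (illustrative) agreement measure:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_pairs=10, seed=0):
    """Average agreement of k-means clusterings trained on disjoint halves."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_pairs):
        idx = rng.permutation(len(X))
        half = len(X) // 2
        km1 = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx[:half]])
        km2 = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X[idx[half:]])
        # extend both clusterings to the full sample and compare
        scores.append(adjusted_rand_score(km1.predict(X), km2.predict(X)))
    return np.mean(scores)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in ([0, 0], [5, 0], [0, 5])])
for k in (2, 3, 4, 5):
    print(k, round(stability(X, k), 3))
```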

Real-world data often involves objects that exhibit multiple relationships; for example, papers and authors exhibit both paper-author interactions and paper-paper citation relationships. A typical learning problem requires one to make inferences about a subclass of objects (e.g. papers), while using the remaining objects and relations to provide relevant information. We present a simple, unified mechanism for incorporating information from multiple object types and relations when learning on a targeted subset. In this scheme, all sources of relevant information are marginalized onto the target subclass via random walks. We show that marginalized random walks can be used as a general technique for combining multiple sources of information in relational data. With this approach, we formulate new algorithms for transduction and ranking in relational data, and quantify the performance of the new schemes on real-world data, achieving good results in many problems.
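As a toy illustration of marginalizing one object type onto another via a random walk (a generic sketch, not the paper's exact construction): with a paper-author incidence matrix, a two-step walk paper -> author -> paper yields a paper-paper transition matrix that can then drive transduction or ranking over papers alone.

```python
import numpy as np

def marginalized_walk(A):
    """Two-step random walk over a bipartite relation, marginalizing the
    second object type (columns) onto the first (rows).

    A: binary incidence matrix, e.g. papers x authors.
    Returns a row-stochastic paper-to-paper transition matrix.
    """
    def row_normalize(M):
        s = M.sum(axis=1, keepdims=True)
        return np.divide(M, s, out=np.zeros_like(M, dtype=float), where=s > 0)

    P_pa = row_normalize(A.astype(float))     # paper -> author step
    P_ap = row_normalize(A.T.astype(float))   # author -> paper step
    return P_pa @ P_ap                        # paper -> paper

A = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]])                     # 3 papers x 3 authors
print(marginalized_walk(A))
```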

Designs of micro electro-mechanical devices need to be robust against fluctuations in mass production. Computer experiments with tens of parameters are used to explore the behavior of the system, and to compute sensitivity measures as expectations over the input distribution. Monte Carlo methods are a simple approach to estimate these integrals, but they are infeasible when the models are computationally expensive. Using a Gaussian process prior, expensive simulation runs can be saved. This Bayesian quadrature allows for an active selection of inputs where the simulation promises to be most valuable, and the number of simulation runs can be reduced further.
We present an active learning scheme for sensitivity analysis which is rigorously derived from the corresponding Bayesian expected loss. On three fully featured, high dimensional physical models of electro-mechanical sensors, we show that the learning rate in the active learning scheme is significantly better than for passive learning.
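A stripped-down version of such an active-sampling loop, using a simple maximum-predictive-variance acquisition in place of the Bayesian expected-loss criterion derived in the paper, and a cheap stand-in for the expensive simulator:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulator(x):
    """Stand-in for an expensive computer experiment."""
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] ** 2

rng = np.random.default_rng(0)
candidates = rng.uniform(-1, 1, size=(500, 2))     # input distribution samples
X = candidates[:5].copy()                          # small initial design
y = simulator(X)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-6)
for _ in range(20):                                # active-learning loop
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    x_new = candidates[[np.argmax(std)]]           # most informative input
    X = np.vstack([X, x_new])
    y = np.append(y, simulator(x_new))

# Monte Carlo estimate of the mean output under the input distribution,
# using the cheap GP surrogate instead of further simulator calls.
print(gp.predict(candidates).mean())
```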

Principal component analysis (PCA) has been extensively applied in
data mining, pattern recognition and information retrieval for
unsupervised dimensionality reduction. When labels of data are
available, e.g., in a classification or regression task, PCA is however not able to use this information. The problem is more interesting if only part of the input data are labeled, i.e., in a
semi-supervised setting. In this paper we propose a supervised PCA
model called SPPCA and a semi-supervised PCA model called S$^2$PPCA, both of which are extensions of a probabilistic PCA model. The proposed models are able to incorporate the label information into
the projection phase, and can naturally handle multiple outputs
(i.e., in multi-task learning problems). We derive an efficient EM
learning algorithm for both models, and also provide theoretical
justifications of the model behaviors. SPPCA and S$^2$PPCA are
compared with other supervised projection methods on various
learning tasks, and show not only promising performance but also
good scalability.
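For orientation, the EM updates of the underlying probabilistic PCA model (Tipping and Bishop) that SPPCA and S$^2$PPCA extend are sketched below; the supervised variants additionally couple the latent variables to the outputs, which is not shown here.

```python
import numpy as np

def ppca_em(X, q=2, n_iter=100):
    """EM for probabilistic PCA: x = W z + mu + eps, eps ~ N(0, sigma2 I)."""
    n, d = X.shape
    mu = X.mean(0)
    Xc = X - mu
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(q)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                       # (n, q), rows are E[z_n]
        Ezz = n * sigma2 * Minv + Ez.T @ Ez      # sum_n E[z_n z_n^T]
        # M-step
        W = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (n * d)
    return W, mu, sigma2

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
W, mu, sigma2 = ppca_em(X)
print(W.shape, round(sigma2, 4))
```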

Semi-Supervised Support Vector Machines
(S3VMs) are an appealing method for using
unlabeled data in classification: their objective
function favors decision boundaries
which do not cut clusters. However, their
main problem is that the optimization problem is non-convex and has many local minima, which often results in suboptimal performance.
In this paper we propose to use a
global optimization technique known as continuation
to alleviate this problem. Compared
to other algorithms minimizing the
same objective function, our continuation
method often leads to lower test errors.
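For reference, the objective being minimized is of the following form (the exact shape of the unlabeled-data loss varies between S3VM implementations):

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^{2} \;+\; C\sum_{i=1}^{\ell} \max\bigl(0,\,1-y_i f(x_i)\bigr) \;+\; C^{*}\sum_{j=\ell+1}^{\ell+u} \max\bigl(0,\,1-|f(x_j)|\bigr),$$

where the last sum runs over the unlabeled points and makes the problem non-convex. The continuation method minimizes a sequence of smoothed versions of this objective, starting from a heavily smoothed, easy problem and tracking the minimizer as the smoothing is gradually removed.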

Convex learning algorithms, such as Support Vector Machines (SVMs), are
often seen as highly desirable because they offer strong practical
properties and are amenable to theoretical analysis. However, in this work
we show how non-convexity can provide scalability advantages over
convexity. We show how concave-convex programming can be applied to produce
(i) faster SVMs where training errors are no longer support vectors, and
(ii) much faster Transductive SVMs.
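The key construction behind (i) is the ramp loss, which caps the hinge loss and can be written as a difference of two convex (hinge-type) functions, making it directly amenable to concave-convex programming:

$$R_s(z) \;=\; \min\bigl(1-s,\ \max(0,\,1-z)\bigr) \;=\; H_1(z) - H_s(z), \qquad H_a(z) = \max(0,\ a-z),\ \ s<1.$$

Points with margin $z = y f(x) < s$ incur a constant loss and therefore contribute no gradient, which is why training errors cease to be support vectors.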

We present a new approach to personalized handwriting recognition.
The problem, also known as writer adaptation, consists of converting
a generic (user-independent) recognizer into a personalized
(user-dependent) one, which has an improved recognition rate for a
particular user. The adaptation step usually involves user-specific
samples, which leads to the fundamental question of how to fuse this
new information with that captured by the generic recognizer. We
propose adapting the recognizer by minimizing a regularized risk
functional (a modified SVM) where the prior knowledge from the
generic recognizer enters through a modified regularization term.
The result is a simple personalization framework with very good
practical properties. Experiments on a 100 class real-world data set
show that the number of errors can be reduced by over 40% with as
few as five user samples per character.
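One standard way such prior knowledge enters a regularized risk functional, consistent with the description above (the exact form used here is an assumption), is a biased regularizer that penalizes deviation from the generic recognizer's weight vector $w_0$:

$$\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f_w(x_i),\,y_i\bigr) \;+\; \lambda\,\|w - w_0\|^{2},$$

where the sum runs over the user-specific samples. For large $\lambda$ the personalized recognizer stays close to the generic one, while small $\lambda$ lets the few user samples dominate.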

An intuitive approach to utilizing unlabeled data in kernel-based
classification algorithms is to simply treat the unknown labels as
additional optimization variables. For margin-based loss functions,
one can view this approach as attempting to learn low-density
separators. However, this is a hard optimization problem to solve in
typical semi-supervised settings where unlabeled data is abundant.
The popular Transductive SVM algorithm is a
label-switching-retraining procedure that is known to be susceptible
to local minima. In this paper, we present a global optimization
framework for semi-supervised Kernel machines where an easier
problem is parametrically deformed to the original hard problem and
minimizers are smoothly tracked. Our approach is motivated from
deterministic annealing techniques and involves a sequence of convex
optimization problems that are exactly and efficiently solved. We
present empirical results on several synthetic and real world
datasets that demonstrate the effectiveness of our approach.

Graph data is getting increasingly popular in, e.g.,
bioinformatics and text processing.
A main difficulty of graph data processing
lies in the intrinsic high dimensionality of graphs, namely,
when a graph is represented as a binary feature vector
of indicators of all possible subgraphs,
the dimensionality gets too large for usual statistical methods.
We propose an efficient method for learning
a binomial mixture model in this feature space.
Combining the $\ell_1$ regularizer and the data structure
called the DFS code tree, the MAP estimate of the non-zero parameters
is computed efficiently by means of the EM algorithm.
Our method is applied to the clustering of RNA graphs,
and compares favorably with graph kernels and
the spectral graph distance.

Elimination by aspects (EBA) is a probabilistic
choice model describing how humans decide between several options.
The options from which the choice is made are characterized by
binary features and associated weights. For instance, when choosing
which mobile phone to buy the features to consider may be: long
lasting battery, color screen, etc. Existing methods for inferring
the parameters of the model assume pre-specified features. However,
the features that lead to the observed choices are not always known.
Here, we present a non-parametric Bayesian model to infer the
features of the options and the corresponding weights from choice
data. We use the Indian buffet process (IBP) as a prior over the
features. Inference using Markov chain Monte Carlo (MCMC) in
conjugate IBP models has been previously described. The main
contribution of this paper is an MCMC algorithm for the EBA model
that can also be used in inference for other non-conjugate IBP
models---this may broaden the use of IBP priors considerably.
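For orientation, in the simplest two-alternative case the EBA choice probability depends only on the aspects that distinguish the options: writing $w_k$ for the weight of binary feature $k$,

$$P(x \succ y) \;=\; \frac{\sum_{k \in x \setminus y} w_k}{\sum_{k \in x \setminus y} w_k \;+\; \sum_{k \in y \setminus x} w_k},$$

where $x \setminus y$ denotes the features present in $x$ but not in $y$; shared aspects cancel. The inference problem addressed above is to recover both the (latent) feature assignments and the weights from observed choices.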

In this paper, we use large neighborhood Markov random fields to learn rich prior
models of color images. Our approach extends the monochromatic Fields of Experts
model (Roth and Black, 2005) to color images. In the Fields of Experts model, the curse
of dimensionality due to very large clique sizes is circumvented by parameterizing the
potential functions according to a product of experts. We introduce several
simplifications of the original approach by Roth and Black which allow us to cope with
the increased clique size (typically 3x3x3 or 5x5x3 pixels) of color images.
Experimental results are presented for image denoising which demonstrate improvements over
state-of-the-art monochromatic image priors.
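Recall the form of the Fields of Experts prior being extended: with linear filters $J_i$ and Student-t experts with parameters $\alpha_i$, the (unnormalized) density over an image $\mathbf{x}$ is a product over all cliques $\mathbf{x}_{(k)}$,

$$p(\mathbf{x}) \;\propto\; \prod_{k}\prod_{i=1}^{N} \phi\bigl(J_i^{\top}\mathbf{x}_{(k)};\,\alpha_i\bigr), \qquad \phi(z;\alpha) = \Bigl(1+\tfrac{1}{2}z^{2}\Bigr)^{-\alpha}.$$

For color images the cliques and filters gain a third (channel) dimension, which is what produces the 3x3x3 or 5x5x3 clique sizes mentioned above.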

In this paper we study a new framework introduced by Vapnik (1998) and Vapnik (2006) that is an alternative capacity concept to the large margin approach. In the particular case of binary classification, we are given a set of labeled examples, and a collection of "non-examples" that do not belong to either class of interest. This collection, called the Universum, allows one to encode prior knowledge by representing meaningful concepts in the same domain as the problem at hand. We describe an algorithm to leverage the Universum by maximizing the number of observed contradictions, and show experimentally that this approach delivers accuracy improvements over using labeled data alone.
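One common way to encode the maximization of contradictions (a sketch of a typical Universum-SVM objective; the algorithm described here may differ in details) is to add an $\varepsilon$-insensitive penalty that pulls Universum points towards the decision boundary, where they can be classified either way:

$$\min_{w,b}\ \tfrac{1}{2}\|w\|^{2} \;+\; C\sum_{i} \max\bigl(0,\,1-y_i(w^{\top}x_i+b)\bigr) \;+\; C_U\sum_{u} \max\bigl(0,\,|w^{\top}x_u+b|-\varepsilon\bigr).$$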

While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence.

Images represent an important and abundant source of data. Understanding their statistical structure has important applications such as image compression and restoration. In this paper we propose a particular kind of probabilistic model, dubbed the “products of edge-perts model” to describe
the structure of wavelet-transformed images. We develop a practical denoising algorithm based on a single edge-pert and show state-of-the-art denoising performance on benchmark images.

Gaussian processes are attractive models for probabilistic classification but unfortunately exact inference is analytically intractable. We compare Laplace's method and Expectation Propagation (EP) focusing on marginal likelihood estimates and predictive performance. We explain theoretically and corroborate empirically that EP is superior to Laplace. We also compare to a sophisticated MCMC scheme and show that EP is surprisingly accurate.

We present an approach for designing interest operators that are
based on human eye movement statistics. In contrast to existing
methods which use hand-crafted saliency measures, we use machine
learning methods to infer an interest operator directly from eye
movement data. That way, the operator provides a measure of
biologically plausible interestingness. We describe the data
collection, training, and evaluation process, and show that our
learned saliency measure significantly accounts for human eye
movements. Furthermore, we illustrate connections to existing
interest operators, and present a multi-scale interest point
detector based on the learned function.

This Chapter presents the PASCAL Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good “probabilistic predictions”, rather than just the usual “point predictions” with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.

In many regression tasks, in addition to an accurate estimate
of the conditional mean of the target distribution, an indication of the
predictive uncertainty is also required. There are two principal sources
of this uncertainty: the noise process contaminating the data and the
uncertainty in estimating the model parameters based on a limited sample
of training data. Both of them can be summarised in the predictive
variance which can then be used to give confidence intervals. In this paper,
we present various schemes for providing predictive variances for
kernel ridge regression, especially in the case of a heteroscedastic regression,
where the variance of the noise process contaminating the data is
a smooth function of the explanatory variables. The use of leave-one-out
cross-validation is shown to eliminate the bias inherent in estimates of
the predictive variance. Results obtained on all three regression tasks
comprising the predictive uncertainty challenge demonstrate the value
of this approach.
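A minimal sketch of one such scheme (heteroscedastic noise modelled from the leave-one-out residuals of kernel ridge regression; the specific variants evaluated in the paper may differ in details such as the loss used for the variance model): the closed-form LOO residuals of ridge-type estimators avoid the downward bias of in-sample residuals and are used here to fit a second smooth model of the input-dependent noise variance.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_with_loo_variance(X, y, lam=1e-2, gamma=1.0):
    """Kernel ridge regression plus a heteroscedastic noise model.

    Returns the dual coefficients of the mean model, the coefficients of a
    second KRR fitted to the log squared leave-one-out residuals, and gamma.
    """
    n = len(X)
    K = rbf(X, X, gamma)
    A = np.linalg.inv(K + lam * np.eye(n))
    alpha = A @ y                            # mean model: f(x) = k(x, X) alpha
    H = K @ A                                # hat matrix
    resid = y - H @ y
    loo_resid = resid / (1.0 - np.diag(H))   # closed-form LOO residuals
    # second KRR models log sigma^2(x) as a smooth function of the inputs
    z = np.log(loo_resid ** 2 + 1e-12)
    beta = np.linalg.solve(K + lam * np.eye(n), z)
    return alpha, beta, gamma

def predict(X_train, alpha, beta, gamma, X_new):
    k = rbf(X_new, X_train, gamma)
    return k @ alpha, np.exp(k @ beta)       # predictive mean, noise variance

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
sigma = 0.1 + 0.4 * (X[:, 0] > 0)            # input-dependent noise level
y = np.sin(2 * X[:, 0]) + sigma * rng.normal(size=200)
alpha, beta, gamma = krr_with_loo_variance(X, y)
mean, var = predict(X, alpha, beta, gamma, np.array([[-1.0], [1.0]]))
print(mean.round(2), np.sqrt(var).round(2))  # noise std should be larger at x=1
```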

We consider the problem of fitting a linear operator induced equation to point sampled data. In order to do so we systematically exploit the duality between minimizing a regularization functional derived from an operator and
kernel regression methods. Standard machine learning model selection algorithms can then be interpreted as a search of the equation best fitting given data points. For many kernels this operator induced equation is a linear differential equation. Thus, we link a continuous-time system identification task with common machine learning methods. The presented link opens up a wide variety of methods to be applied to this system identification problem. In a series of experiments we demonstrate an example algorithm working on non-uniformly spaced data, giving special focus to the problem of identifying one system from multiple data recordings.

The computation of classical higher-order statistics such as
higher-order moments or spectra is difficult for images due to the
huge number of terms to be estimated and interpreted. We propose an
alternative approach in which multiplicative pixel interactions are
described by a series of Wiener functionals. Since the functionals
are estimated implicitly via polynomial kernels, the combinatorial
explosion associated with the classical higher-order statistics is
avoided. In addition, the kernel framework allows for estimating
infinite series expansions and for the regularized estimation of the
Wiener series. First results show that image structures such as
lines or corners can be predicted correctly, and that pixel
interactions up to the order of five play an important role in
natural images.

We propose a new inference rule for estimating the causal structure that underlies the observed statistical dependencies among n random variables. Our method is based on comparing the conditional distributions of variables given their direct causes (the so-called "Markov kernels") for all hypothetical causal directions and choosing the most plausible one. We consider those Markov kernels most plausible which maximize the (conditional) entropies constrained by their observed first moments (expectations) and second moments (variances and covariances with the direct causes) on their given domains. In this paper, we discuss our inference rule for causal relationships between two variables in detail, apply it to a real-world temperature data set with known causality, and show that our method provides a correct result for this example.

While operational space control is of essential importance for robotics and well-understood from an analytical point of view, it can be prohibitively hard to achieve accurate control in the face of modeling errors, which are inevitable in complex robots, e.g., humanoid robots. In such cases, learning control methods can offer an interesting alternative to analytical control algorithms. However, the resulting learning problem is ill-defined as it requires learning an inverse mapping of a usually redundant system, which is well known to suffer from non-convexity of the solution space, i.e., the learning system could generate motor commands that try to steer the robot into physically impossible configurations. A first important insight for this paper is that, nevertheless, a physically correct solution to the inverse problem does exist when learning of the inverse map is performed in a suitable piecewise linear way. The second crucial component for our work is based on a recent insight that many operational space controllers can be understood in terms of a constrained optimal control problem. The cost function associated with this optimal control problem allows us to formulate a learning algorithm that automatically synthesizes a globally consistent desired resolution of redundancy while learning the operational space controller. From the view of machine learning, the learning problem corresponds to a reinforcement learning problem that maximizes an immediate reward and that employs an expectation-maximization policy search algorithm. Evaluations on a three degrees-of-freedom robot arm illustrate the feasibility of our suggested approach.

One of the major challenges in both action generation for robotics and in the understanding of human motor control is to learn the "building blocks of movement generation", called motor primitives. Motor primitives, as used in this paper, are parameterized control policies such as splines or nonlinear differential equations with desired attractor properties. While a lot of progress has been made in teaching parameterized motor primitives using supervised or imitation learning, self-improvement by interaction of the system with the environment remains a challenging problem. In this paper, we evaluate different reinforcement learning approaches for improving the performance of parameterized motor primitives. In pursuing this goal, we highlight the difficulties with current reinforcement learning methods, and outline both established and novel algorithms for the gradient-based improvement of parameterized policies. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic, outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.

Self-organization and the phenomenon of emergence play an essential role in living systems and form a challenge to artificial life systems. This is not only because systems become more lifelike, but also since self-organization may help in reducing the design efforts in creating complex behavior systems. The present paper studies self-exploration based on a general approach to the self-organization of behavior, which has been developed and tested in various examples in recent years. This is a step towards autonomous early robot development. We consider agents under the close sensorimotor coupling paradigm with a certain cognitive ability realized by an internal forward model. Starting from tabula rasa initial conditions we overcome the bootstrapping problem and show emerging self-exploration. Apart from that, we analyze the effect of limited actions, which lead to deprivation of the world model. We show that our paradigm explicitly avoids this by producing purposive actions in a natural way. Examples are given using a simulated simple wheeled robot and a spherical robot driven by shifting internal masses.

The online generation of trajectories in humanoid robots remains a difficult problem. In this contribution, we present a system that allows the superposition, and the switch between, discrete and rhythmic movements. Our approach uses nonlinear dynamical systems for generating trajectories online and in real time. Our goal is to make use of attractor properties of dynamical systems in order to provide robustness against small perturbations and to enable online modulation of the trajectories. The system is demonstrated on a humanoid robot performing a drumming task.
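A toy illustration of the superposition idea (not the specific nonlinear dynamical systems running on the robot): a point-attractor component generates the discrete part of a trajectory, an oscillatory component generates the rhythmic part, and the output is their sum, so the goal and the oscillation amplitude can be modulated online at every integration step.

```python
import numpy as np

def generate(goal, amp, omega=6.0, alpha=4.0, dt=0.01, T=5.0):
    """Superpose a discrete (point-attractor) and a rhythmic (oscillatory) part.

    x follows  dx/dt = alpha * (goal - x)   (discrete movement towards 'goal'),
    r is a sinusoidal term of amplitude 'amp'; the output trajectory is
    y = x + r. Both goal and amp could be changed at any integration step.
    """
    steps = int(T / dt)
    x, y = 0.0, np.zeros(steps)
    for t in range(steps):
        x += dt * alpha * (goal - x)        # discrete component
        r = amp * np.sin(omega * t * dt)    # rhythmic component
        y[t] = x + r
    return y

traj = generate(goal=1.0, amp=0.2)
print(traj[0], traj[-1])   # starts near 0, ends oscillating around the goal
```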

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.