Compressed Sensing uses a small number of random, linear
measurements to acquire a sparse signal. Nonlinear algorithms,
such as l1 minimization, are used to reconstruct the
signal from the measured data. This paper proposes row-action
methods as a computational approach to solving the
l1 optimization problem, presents a specific row-action
method, and provides extensive empirical evidence that
it is an effective technique for signal reconstruction. This approach
offers several advantages over interior-point methods,
including minimal storage and computational requirements,
scalability, and robustness.
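
The paper's specific row-action method is not reproduced here. As a hedged illustration of the general idea, the sketch below applies a randomized Kaczmarz step (one measurement row at a time) followed by soft-thresholding to promote sparsity, one known row-action scheme for l1-type problems; all names and parameter choices are ours, not the paper's.

    import numpy as np

    def soft_threshold(z, lam):
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def sparse_kaczmarz(A, y, lam=0.1, sweeps=50, seed=0):
        # Illustrative row-action iteration: each update touches a single
        # row (measurement) of A, so storage requirements stay minimal.
        rng = np.random.default_rng(seed)
        m, n = A.shape
        z = np.zeros(n)  # auxiliary iterate
        x = np.zeros(n)  # sparse estimate
        for _ in range(sweeps):
            for i in rng.permutation(m):
                a = A[i]
                z = z + (y[i] - a @ x) / (a @ a) * a  # Kaczmarz row step
                x = soft_threshold(z, lam)            # promote sparsity
        return x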

We present an approach for designing interest operators that are
based on human eye movement statistics. In contrast to existing
methods which use hand-crafted saliency measures, we use machine
learning methods to infer an interest operator directly from eye
movement data. In this way, the operator provides a measure of
biologically plausible interestingness. We describe the data
collection, training, and evaluation process, and show that our
learned saliency measure significantly accounts for human eye
movements. Furthermore, we illustrate connections to existing
interest operators, and present a multi-scale interest point
detector based on the learned function.
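
As a hedged, minimal sketch of the learning setup (not the paper's actual model or features), a discriminative saliency measure can be trained by classifying patches at fixated locations against randomly sampled patches; all names below are ours.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def learn_interest_operator(patches, fixated):
        # patches: (N, d) array of flattened local image patches
        # fixated: (N,) booleans, True if a human fixation landed there
        clf = LogisticRegression(max_iter=1000)
        clf.fit(patches, fixated)
        return clf  # clf.predict_proba(P)[:, 1] scores "interestingness"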

This Chapter presents the PASCAL Evaluating Predictive Uncertainty Challenge, introduces the contributed Chapters by the participants who obtained outstanding results, and provides a discussion with some lessons to be learnt. The Challenge was set up to evaluate the ability of Machine Learning algorithms to provide good “probabilistic predictions”, rather than just the usual “point predictions” with no measure of uncertainty, in regression and classification problems. Participants had to compete on a number of regression and classification tasks, and were evaluated by both traditional losses that only take into account point predictions and losses we proposed that evaluate the quality of the probabilistic predictions.
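
As a minimal illustration of the distinction (the challenge's own loss definitions are in the chapter), the sketch below contrasts squared error, which sees only the point prediction, with a negative log predictive density for Gaussian predictions, which also scores the stated uncertainty; function names are ours.

    import numpy as np

    def mse(y, mean):
        # Point-prediction loss: ignores any uncertainty estimate.
        return np.mean((y - mean) ** 2)

    def gaussian_nlpd(y, mean, var):
        # Probabilistic loss: penalises both bad means and badly
        # calibrated predictive variances.
        return 0.5 * np.mean(np.log(2 * np.pi * var) + (y - mean) ** 2 / var)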

In many regression tasks, in addition to an accurate estimate
of the conditional mean of the target distribution, an indication of the
predictive uncertainty is also required. There are two principal sources
of this uncertainty: the noise process contaminating the data and the
uncertainty in estimating the model parameters based on a limited sample
of training data. Both of them can be summarised in the predictive
variance which can then be used to give confidence intervals. In this paper,
we present various schemes for providing predictive variances for
kernel ridge regression, especially in the case of a heteroscedastic regression,
where the variance of the noise process contaminating the data is
a smooth function of the explanatory variables. The use of leave-one-out
cross-validation is shown to eliminate the bias inherent in estimates of
the predictive variance. Results obtained on all three regression tasks
comprising the predictive uncertainty challenge demonstrate the value
of this approach.
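
For context, here is a minimal GP-style sketch of a predictive variance for kernel ridge regression with an RBF kernel and homoscedastic noise; the paper's heteroscedastic and leave-one-out schemes go beyond this, and all names below are ours.

    import numpy as np

    def rbf(X, Z, gamma=1.0):
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def krr_with_variance(Xtr, ytr, Xte, lam=1e-2, gamma=1.0):
        # The GP view of kernel ridge regression yields a predictive
        # variance alongside the usual predictive mean.
        K = rbf(Xtr, Xtr, gamma) + lam * np.eye(len(Xtr))
        alpha = np.linalg.solve(K, ytr)
        Ks = rbf(Xte, Xtr, gamma)
        mean = Ks @ alpha
        quad = np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
        var = 1.0 + lam - quad  # k(x, x) = 1 for the RBF kernel
        return mean, var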

We consider the problem of fitting a linear-operator-induced equation to point-sampled data. In order to do so, we systematically exploit the duality between minimizing a regularization functional derived from an operator and
kernel regression methods. Standard machine learning model selection algorithms can then be interpreted as a search for the equation that best fits the given data points. For many kernels this operator-induced equation is a linear differential equation. Thus, we link a continuous-time system identification task with common machine learning methods. The presented link opens up a wide variety of methods to be applied to this system identification problem. In a series of experiments we demonstrate an example algorithm working on non-uniformly spaced data, giving special focus to the problem of identifying one system from multiple data recordings.
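
As a hedged toy example of this link (not the paper's algorithm), the sketch below identifies the coefficient of a first-order linear ODE from non-uniformly spaced samples by smoothing the data with a Gaussian-kernel regressor and regressing the analytic derivative of the fit on the fit itself; all names are ours.

    import numpy as np

    def fit_first_order_ode(t, y, gamma=5.0, lam=1e-6):
        # Identify a in y' = a * y from irregular samples (t, y).
        K = np.exp(-gamma * (t[:, None] - t[None, :]) ** 2)
        alpha = np.linalg.solve(K + lam * np.eye(len(t)), y)
        # Analytic derivative of the kernel expansion at the samples.
        dK = -2 * gamma * (t[:, None] - t[None, :]) * K
        y_hat, dy_hat = K @ alpha, dK @ alpha
        return (dy_hat @ y_hat) / (y_hat @ y_hat)  # least-squares a

    # e.g. t = np.sort(np.random.rand(50)); y = np.exp(-2 * t) gives a near -2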

In Advances in Data Analysis: Proceedings of the 30th Annual Conference of The Gesellschaft für Klassifikation, 30, pages: 1, March 2006 (inproceedings)

The computation of classical higher-order statistics such as
higher-order moments or spectra is difficult for images due to the
huge number of terms to be estimated and interpreted. We propose an
alternative approach in which multiplicative pixel interactions are
described by a series of Wiener functionals. Since the functionals
are estimated implicitly via polynomial kernels, the combinatorial
explosion associated with the classical higher-order statistics is
avoided. In addition, the kernel framework allows for estimating
infinite series expansions and for the regularized estimation of the
Wiener series. First results show that image structures such as
lines or corners can be predicted correctly, and that pixel
interactions up to the order of five play an important role in
natural images.
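
As an illustration of the implicit estimation step (our own simplified rendering, not the paper's exact Wiener-series estimator), kernel ridge regression with a polynomial kernel predicts a pixel from its neighbourhood while implicitly spanning all multiplicative interactions up to the kernel degree:

    import numpy as np

    def poly_krr_pixel_predictor(patches, centers, degree=5, lam=1e-3):
        # patches: (N, p) pixel neighbourhoods; centers: (N,) target pixels.
        # The polynomial kernel represents all interaction terms up to
        # `degree` without enumerating them explicitly.
        K = (patches @ patches.T + 1.0) ** degree
        alpha = np.linalg.solve(K + lam * np.eye(len(patches)), centers)
        return lambda q: ((q @ patches.T + 1.0) ** degree) @ alpha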

We present a kernel-based approach to the classification of time series of gene expression profiles. Our method takes into account the dynamic evolution over time as well as the temporal characteristics of the data. More specifically, we
model the evolution of the gene expression profiles as a Linear Time Invariant (LTI) dynamical system and estimate its model parameters. A kernel on dynamical systems is then used to classify these time series. We successfully test our approach on a published dataset to predict response to drug therapy in Multiple Sclerosis patients. For pharmacogenomics, our method offers great potential for advanced computational tools in disease diagnosis, and in the prognosis of disease and drug therapy outcomes.
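
A minimal sketch of the first step, under our own naming: estimate the transition matrix of a discrete-time LTI model by least squares. A kernel between two series can then be built from the fitted parameters (the paper's kernel on dynamical systems is not reproduced here).

    import numpy as np

    def fit_lti(Y):
        # Y: (T, d) time series with rows x_t; fit x_{t+1} = A x_t.
        X0, X1 = Y[:-1], Y[1:]
        B, *_ = np.linalg.lstsq(X0, X1, rcond=None)  # solves X0 B = X1
        return B.T  # A, since B = A^T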

In Proceedings of the 9th International Symposium on Artificial Intelligence and Mathematics, pages: 1-11, ISAIM, January 2006 (inproceedings)

We propose a new inference rule for estimating the causal structure that underlies the observed statistical dependencies among n random variables. Our method is based on comparing the conditional distributions of variables given their direct causes (the so-called "Markov kernels") for all hypothetical causal directions and choosing the most plausible one. We consider those Markov kernels most plausible which maximize the (conditional) entropies constrained by their observed first moment (expectation) and second moments (variance and covariance with their direct causes) on their given domain. In this paper, we discuss our inference rule for causal relationships between two variables in detail, apply it to a real-world temperature data set with known causality, and show that our method provides a correct result for the example.

While operational space control is of essential importance for robotics and well understood from an analytical point of view, it can be prohibitively hard to achieve accurate control in the face of modeling errors, which are inevitable in complex robots, e.g., humanoid robots. In such cases, learning control methods can offer an interesting alternative to analytical control algorithms. However, the resulting learning problem is ill-defined as it requires learning an inverse mapping of a usually redundant system, which is well known to suffer from non-convexity of the solution space, i.e., the learning system could generate motor commands that try to steer the robot into physically impossible configurations. A first important insight for this paper is that a physically correct solution to the inverse problem nevertheless does exist when learning of the inverse map is performed in a suitable piecewise linear way. The second crucial component of our work is based on the recent insight that many operational space controllers can be understood in terms of a constrained optimal control problem. The cost function associated with this optimal control problem allows us to formulate a learning algorithm that automatically synthesizes a globally consistent desired resolution of redundancy while learning the operational space controller. From the view of machine learning, the learning problem corresponds to a reinforcement learning problem that maximizes an immediate reward and that employs an expectation-maximization policy search algorithm. Evaluations on a three-degrees-of-freedom robot arm illustrate the feasibility of our suggested approach.
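
As a hedged sketch of immediate-reward, EM-style policy search (in the spirit of reward-weighted regression, not necessarily the paper's exact update), the following refits a linear policy by a regression in which each sampled command is weighted by its reward; names and the positivity assumption on rewards are ours.

    import numpy as np

    def reward_weighted_regression(S, A, r, lam=1e-6):
        # S: (N, ds) states, A: (N, da) sampled commands, r: (N,) rewards
        # (assumed transformed to be positive, e.g. exponentiated).
        # One EM-style update: high-reward commands dominate the refit.
        w = r / r.sum()
        Sw = S * w[:, None]
        W = np.linalg.solve(Sw.T @ S + lam * np.eye(S.shape[1]), Sw.T @ A)
        return W.T  # new policy: a = W @ s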

One of the major challenges in both action generation for robotics and in the understanding of human motor control is to learn the "building blocks of movement generation", called motor primitives. Motor primitives, as used in this paper, are parameterized control policies such as splines or nonlinear differential equations with desired attractor properties. While a lot of progress has been made in teaching parameterized motor primitives using supervised or imitation learning, self-improvement through interaction of the system with the environment remains a challenging problem. In this paper, we evaluate different reinforcement learning approaches for improving the performance of parameterized motor primitives. Toward this goal, we highlight the difficulties with current reinforcement learning methods, and outline both established and novel algorithms for the gradient-based improvement of parameterized policies. We compare these algorithms in the context of motor primitive learning, and show that our most modern algorithm, the Episodic Natural Actor-Critic, outperforms previous algorithms by at least an order of magnitude. We demonstrate the efficiency of this reinforcement learning method in the application of learning to hit a baseball with an anthropomorphic robot arm.

The acquisition and improvement of motor skills and
control policies for robotics from trial and error is of essential
importance if robots are ever to leave precisely pre-structured
environments. However, to date only a few existing reinforcement
learning methods have been scaled into the domains of high-dimensional
robots such as manipulator, legged or humanoid
robots. Policy gradient methods remain one of the few exceptions
and have found a variety of applications. Nevertheless, the
application of such methods is not without peril if done in an uninformed
manner. In this paper, we give an overview of learning
with policy gradient methods for robotics with a strong focus on
recent advances in the field. We outline previous applications to
robotics and show how the most recently developed methods can
significantly improve learning performance. Finally, we evaluate
our most promising algorithm in the application of hitting a
baseball with an anthropomorphic arm.
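
For orientation, here is a minimal sketch of a vanilla episodic policy gradient estimator of the REINFORCE type, under our own naming; the methods surveyed in the paper refine this basic estimator.

    import numpy as np

    def reinforce_gradient(episodes, grad_logp, baseline=0.0):
        # episodes: list of (states, actions, rewards) tuples;
        # grad_logp(s, a) returns d log pi(a|s) / d theta as an array.
        # Estimate: average of (sum of score vectors) * (return - baseline).
        g = 0.0
        for states, actions, rewards in episodes:
            score = sum(grad_logp(s, a) for s, a in zip(states, actions))
            g += score * (sum(rewards) - baseline)
        return g / len(episodes)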

Various machine learning methods have made a rapid transition to response modeling in search of improved
performance, and the support vector machine (SVM) has also been attracting much attention lately. This paper presents an
SVM response model. We focus specifically on the practical how-tos, such as how to
cope with the class imbalance problem, how to produce scores from an SVM classifier for lift chart analysis, and how
to evaluate the models in terms of accuracy and profit. To cope with the intractability of SVM training
on large marketing datasets, a previously proposed pattern selection algorithm is introduced: SVM training has
time complexity cubic in the training set size, and the pattern selection algorithm picks out important training patterns
before SVM response modeling. We compared SVM training results between the pattern selection algorithm and random sampling. Three aspects of the SVM response models were evaluated: accuracy, lift chart analysis, and computational efficiency. The SVM trained with selected patterns showed high accuracy, a high uplift in profit and
in response rate, and high computational efficiency.
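
As a minimal sketch of lift chart analysis from classifier scores (our own naming, not the paper's code): rank customers by SVM decision value and compare each bin's response rate to the base rate.

    import numpy as np

    def lift_curve(scores, y, n_bins=10):
        # scores: SVM decision values; y: 0/1 response indicators.
        order = np.argsort(-scores)
        y_sorted = np.asarray(y)[order]
        bins = np.array_split(y_sorted, n_bins)
        base = y_sorted.mean()
        return [b.mean() / base for b in bins]  # lift per score decile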

We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperforms Annealed Importance Sampling, although for very high dimensional problems or
problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.
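
A minimal sketch of the BMC estimate under a GP prior on the integrand, with our own naming; the kernel-measure integrals z_i are assumed precomputed (they are closed-form for, e.g., a Gaussian kernel with a Gaussian measure).

    import numpy as np

    def bmc_integral(fX, K, z):
        # Posterior mean of Z = integral of f(x) p(x) dx given the
        # evaluations fX at the sample points:
        #   E[Z | fX] = z^T K^{-1} fX,
        # where K is the kernel matrix at the samples and
        # z_i = integral of k(x, x_i) p(x) dx.
        return z @ np.linalg.solve(K + 1e-10 * np.eye(len(K)), fX)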

We investigate data-based procedures for selecting the kernel when learning with Support Vector Machines. We provide generalization error bounds by estimating the Rademacher complexities of the corresponding function classes. In particular we obtain a complexity bound for function classes induced by kernels with given eigenvectors, i.e., we allow the spectrum to vary while keeping the eigenvectors fixed. This bound is only a logarithmic factor bigger than the complexity of the function class induced by a single kernel. However, optimizing the margin over such classes leads to overfitting. We thus propose a suitable way of constraining the class. We use an efficient algorithm to solve the resulting optimization problem, present preliminary experimental results, and compare them
to an alignment-based approach.
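
For reference, kernel-target alignment in the sense of Cristianini et al. can be computed as below for labels in {-1, +1}; we assume this is the flavour of alignment compared against, so treat the sketch as illustrative.

    import numpy as np

    def kernel_alignment(K, y):
        # Alignment <K, y y^T>_F / (||K||_F * ||y y^T||_F); for labels
        # in {-1, +1}, ||y y^T||_F equals len(y).
        yyT = np.outer(y, y)
        return (K * yyT).sum() / (np.linalg.norm(K) * len(y))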

We consider the problem of multi-step ahead prediction in time series analysis using the non-parametric Gaussian process model. k-step ahead forecasting of a discrete-time non-linear dynamic system can be performed by doing repeated one-step ahead predictions. For a state-space model of the form y_t = f(y_{t-1},...,y_{t-L}), the
prediction of y at time t + k is based on the point estimates of the previous outputs. In this paper, we show how, using an analytical Gaussian approximation, we can formally incorporate the uncertainty about intermediate regressor values, thus updating the uncertainty on the current prediction.
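
For contrast, here is a sketch of the naive baseline the paper improves on: k-step forecasting by feeding point predictions back as regressors, which ignores the uncertainty of intermediate outputs (the paper's Gaussian approximation propagates it instead). gp_predict is a hypothetical one-step GP predictor.

    import numpy as np

    def iterate_predictions(gp_predict, y_hist, k):
        # gp_predict maps a lag vector of length L to (mean, var).
        L = len(y_hist)
        hist = list(y_hist)
        out = []
        for _ in range(k):
            m, v = gp_predict(np.array(hist[-L:]))
            out.append((m, v))
            hist.append(m)  # plug in the mean only, discarding v
        return out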

We propose a framework to incorporate unlabeled data in kernel
classifiers, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach.
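
A minimal sketch of the spectral modification idea, with an illustrative transfer function of our own choosing (the paper studies specific choices): rebuild the kernel matrix from its eigenvectors with reshaped eigenvalues.

    import numpy as np

    def cluster_kernel(K, transfer=lambda lam: lam ** 2):
        # K: kernel matrix over labelled plus unlabelled points.
        # `transfer` reshapes each eigenvalue so that directions
        # reflecting cluster structure are emphasised.
        lam, U = np.linalg.eigh(K)
        return (U * transfer(lam)) @ U.T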

We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. These kernels measure sequence similarity
based on shared occurrences of k-length subsequences, counted with up to m mismatches, and do not rely on any generative model for the positive training sequences. We compute the kernels efficiently using a mismatch tree data structure and report experiments on a benchmark SCOP dataset, where we show that the mismatch kernel used with an SVM classifier performs as well as the Fisher kernel, the most successful method for remote homology detection, while achieving considerable computational savings.
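
A naive, enumeration-based version of a mismatch kernel for small k and m, purely for illustration; the paper's mismatch tree computes the same quantity without enumerating the k-mer space. Names are ours.

    from itertools import product

    def mismatch_kernel(s, t, k=3, m=1, alphabet="ACDEFGHIKLMNPQRSTVWY"):
        # Feature for each k-mer: number of k-length substrings of the
        # sequence within Hamming distance m of that k-mer.
        def feats(seq):
            grams = [seq[i:i + k] for i in range(len(seq) - k + 1)]
            phi = {}
            for kmer in product(alphabet, repeat=k):
                c = sum(1 for g in grams
                        if sum(a != b for a, b in zip(g, kmer)) <= m)
                if c:
                    phi[kmer] = c
            return phi
        ps, pt = feats(s), feats(t)
        return sum(v * pt.get(key, 0) for key, v in ps.items())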

In this paper, we consider Tipping's relevance vector machine (RVM) and formalize an incremental training
strategy as a variant of the expectation-maximization (EM) algorithm that we call subspace EM. Working with a subset of active basis functions, the sparsity of the RVM solution ensures that the number of basis functions, and thereby the computational complexity, is kept low. We also introduce a mean field approach to the intractable classification
model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10^3-10^4) examples. The results indicate that Bayesian learning of large data sets, e.g.
the MNIST database, is realistic.

Gaussian processes provide an approach to nonparametric modelling which allows a straightforward combination of function and derivative observations in an empirical model. This is of particular importance in the identification of nonlinear dynamic systems from experimental data. 1) It
allows us to combine derivative information, and the associated uncertainty, with normal function observations in the learning and inference process. This derivative information can be in the form of priors specified by an expert or identified from perturbation data close to equilibrium. 2)
It allows a seamless fusion of multiple local linear models in a consistent manner, inferring consistent models and ensuring that integrability constraints are met. 3) It dramatically improves the computational efficiency
of Gaussian process models for dynamic system identification, by summarising large quantities of near-equilibrium data with a handful of linearisations, reducing the training set size, which is traditionally a problem for
Gaussian process models.
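
A minimal sketch of the mechanism for a 1-D RBF kernel, under our own naming: differentiating the kernel yields the cross-covariances needed to condition a GP jointly on function and derivative observations.

    import numpy as np

    def rbf_blocks(x, xd, ell=1.0):
        # x: function-observation inputs; xd: derivative-observation inputs.
        k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)
        Kff = k(x, x)                                        # cov(f, f)
        Kfd = k(x, xd) * (x[:, None] - xd[None, :]) / ell ** 2   # cov(f, f')
        Kdd = k(xd, xd) * (1 - (xd[:, None] - xd[None, :]) ** 2
                           / ell ** 2) / ell ** 2            # cov(f', f')
        return Kff, Kfd, Kdd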

The tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during self-motion. In this study, we examine whether a simplified linear model of these neurons can be used to estimate self-motion from the optic flow. We present a theory for
the construction of an estimator consisting of a linear combination of optic flow vectors that incorporates prior knowledge both about the distance distribution of the environment, and about the noise and self-motion statistics of the sensor. The estimator is tested on a gantry carrying an omnidirectional vision sensor. The experiments show
that the proposed approach leads to accurate and robust estimates of rotation rates, whereas translation estimates turn out to be less reliable.
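
A sketch of one standard way such a linear estimator can incorporate the stated priors (a MAP/Wiener solution under Gaussian assumptions); we present it as an illustration, not as the paper's exact construction. F, C_noise and C_motion are our own names.

    import numpy as np

    def linear_selfmotion_estimator(F, C_noise, C_motion):
        # Model: flow = F @ motion + noise, with prior covariances
        # C_motion (self-motion statistics) and C_noise (sensor noise).
        # Returns W with estimate = W @ flow,
        # W = C_m F^T (F C_m F^T + C_n)^{-1}.
        S = F @ C_motion @ F.T + C_noise
        return np.linalg.solve(S, F @ C_motion).T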

Recently, the Fisher score (or the Fisher kernel) has increasingly been used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of the log-likelihood of a probabilistic model. This
paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score, showing that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. We therefore develop a novel but simple clustering
algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

Training an SVM requires large memory and long CPU time when the pattern set is large. To alleviate the computational burden in SVM training, we propose a fast preprocessing algorithm which selects only the patterns near the decision boundary. The time complexity of the proposed algorithm is much smaller than that of the naive M^2 algorithm.

Gaussian Process (GP) inference is a probabilistic kernel method where the GP is treated as a latent function. The inference is carried out using Bayesian online learning and its extension to the more general iterative approach which we call TAP/EP learning.
Sparsity is introduced in this context to make the TAP/EP method applicable to large datasets. We address the prohibitive scaling of the number of parameters by defining a subset of the training data that is used as the support of the GP; thus the number of required parameters is independent of the training set size, similar to the case of "Support" or "Relevance" Vectors.
An advantage of the full probabilistic treatment is that it allows the computation of the marginal data likelihood or evidence, leading to hyper-parameter estimation within the GP inference.
An EM algorithm to choose the hyper-parameters is proposed: the TAP/EP learning is the E-step, and the M-step then updates the hyper-parameters. Due to the sparse E-step, the resulting algorithm does not involve manipulation of large matrices. The presented algorithm is applicable to a wide variety of likelihood functions. We present results of applying the algorithm on classification and nonstandard regression problems for artificial and real datasets.

In Proceedings of the 13th IFAC Symposium on System Identification, pages: 1195-1200, (Editors: Van den Hof, P., B. Wahlberg and S. Weiland), August 2003 (inproceedings)

Nonparametric Gaussian Process models, a Bayesian statistics approach, are used to implement a nonlinear adaptive control law. Predictions, including propagation of the state uncertainty, are made over a k-step horizon. The expected value of a quadratic cost function is minimised, over this prediction horizon, without ignoring the variance of the model predictions. The general method and its main features are illustrated on a simulation example.
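
A hedged sketch of the control step, with our own naming: for a Gaussian prediction N(m, v) around a zero target, the expected quadratic cost is m^2 + v, so minimising it does not ignore the predictive variance. gp_rollout is a hypothetical function returning means and variances over the horizon.

    import numpy as np
    from scipy.optimize import minimize

    def mpc_step(gp_rollout, u0, k, lam=1.0):
        # Choose controls u minimising expected cost over a k-step horizon.
        def cost(u):
            means, vars_ = gp_rollout(u, k)  # arrays of length k
            return np.sum(means ** 2 + lam * vars_)
        return minimize(cost, u0).x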

Training support vector classifiers (SVC) requires
large memory and long CPU time when the pattern set is large. To alleviate the computational burden in SVC training, we previously proposed a preprocessing algorithm which selects only
the patterns in the overlap region around the decision boundary, based on neighborhood properties [8], [9], [10]. The k-nearest-neighbors class label entropy for each pattern was used to estimate the pattern's proximity to the decision boundary. The value of the parameter k is critical, yet it has so far been determined in a rather ad hoc fashion. In this paper we propose a systematic procedure to determine k and show its effectiveness through experiments.
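
A minimal sketch of the neighborhood-entropy criterion described above, under our own naming: high entropy means the k nearest neighbours mix classes, so the pattern likely lies in the overlap region near the decision boundary.

    import numpy as np
    from collections import Counter
    from sklearn.neighbors import NearestNeighbors

    def knn_label_entropy(X, y, k=5):
        y = np.asarray(y)
        _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        ent = []
        for neigh in idx[:, 1:]:  # drop the query point itself
            counts = np.array(list(Counter(y[neigh]).values()), float)
            p = counts / counts.sum()
            ent.append(float(-(p * np.log(p)).sum()))
        return np.array(ent)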

In this paper we present a learning-based approach for the modelling of complex movement sequences. Based on the method of Spatio-Temporal Morphable Models (STMMs), we derive a hierarchical algorithm that, in a first step, automatically identifies movement elements in movement sequences based on a coarse spatio-temporal description, and, in a second step, models these movement primitives by approximation through linear combinations of learned example movement trajectories. We describe the different steps of the algorithm and show how it can be applied to the modelling and synthesis of complex sequences of human movements that contain movement elements with variable style. The proposed method is demonstrated on different applications of movement representation relevant for imitation learning of movement styles in humanoid robotics.

Discriminative models have been of interest in the NLP community in recent years. Previous research has shown that they are advantageous over generative models. In this paper, we investigate how different objective functions and optimization methods affect the performance of the classifiers in the discriminative learning framework. We focus on the sequence labelling problem, particularly POS tagging and NER tasks. Our experiments show that changing the objective function is not as effective as changing the features included in the model.

Training an SVM requires large memory and long CPU time when the pattern set is large. To alleviate the computational burden in SVM training, we propose a fast preprocessing algorithm which selects only the patterns near the decision boundary. Preliminary simulation results were promising: training time reductions of up to two orders of magnitude were achieved, including the preprocessing, without any loss in classification accuracy.

Reinforcement learning offers a general framework to explain reward
related learning in artificial and biological motor control. However, current
reinforcement learning methods rarely scale to high dimensional movement
systems and mainly operate in discrete, low dimensional domains
like game-playing, artificial toy problems, etc. This drawback makes them
unsuitable for application to human or bio-mimetic motor control. In
this poster, we look at promising approaches that can potentially scale
and suggest a novel formulation of the actor-critic algorithm which takes
steps towards alleviating the current shortcomings. We argue that methods
based on greedy policies are not likely to scale into high-dimensional
domains, as they are problematic when used with function approximation,
which is a must when dealing with continuous domains. We adopt the path
of direct policy-gradient-based policy improvements, since they avoid the
problem of destabilizing dynamics encountered in traditional value-iteration-based
updates. While regular policy gradient methods have demonstrated
promising results in the domain of humanoid motor control, we
demonstrate that these methods can be significantly improved by using the
natural policy gradient instead of the regular policy gradient. Based on
this, it is proved that Kakade's average natural policy gradient is indeed
the true natural gradient. A general algorithm for estimating the
natural gradient, the Natural Actor-Critic algorithm, is introduced. This
algorithm converges with probability one to the nearest local minimum of
the cost function in Riemannian space. The algorithm outperforms non-natural
policy gradients by far in a cart-pole balancing evaluation, and
offers a promising route for the development of reinforcement learning for
truly high-dimensional continuous state-action systems.
Keywords: Reinforcement learning, neurodynamic programming, actor-critic
methods, policy gradient methods, natural policy gradient
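
A minimal sketch of the core computation, with our own naming: the natural gradient preconditions the vanilla policy gradient with the inverse Fisher information matrix, here estimated crudely from per-sample score vectors (the Natural Actor-Critic obtains it more efficiently).

    import numpy as np

    def natural_gradient(score_vectors, vanilla_grad, eps=1e-8):
        # score_vectors: list of d log pi / d theta arrays, one per sample.
        G = np.stack(score_vectors)           # (N, d)
        F = G.T @ G / len(G)                  # empirical Fisher matrix
        return np.linalg.solve(F + eps * np.eye(F.shape[0]), vanilla_grad)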

In Proceedings of the International Conference on Intelligent Control Systems and Signal Processing (ICONS 2003), 1, pages: 137-142, (Editors: Ruano, E.A.), April 2003 (inproceedings)

In this paper an alternative approach to black-box identification of non-linear dynamic systems is compared with the more established approach of using artificial neural networks. The Gaussian process prior approach is a representative of non-parametric modelling approaches. It was compared on a pH process modelling case study. The purpose of modelling was to use the model for control design. The comparison revealed that even though Gaussian process models can be effectively used for modelling dynamic systems, caution has to be exercised when signals are selected.

In this paper, we describe an efficient algorithm to sequentially update a density support estimate obtained using one-class support vector machines. The solution provided is an exact solution, which proves to be far more computationally attractive than a batch approach. This deterministic technique is applied to the problem of audio signal segmentation, with simulations demonstrating the computational performance gain on toy data sets, and the accuracy of the segmentation on audio signals.

At the previous workshop (ICA2001) we proposed the ACE-TD method, which reduces the post-nonlinear blind source separation problem (PNL BSS) to a linear BSS problem. The method utilizes the Alternating Conditional Expectation (ACE) algorithm to approximately invert the (post-)nonlinear functions. In this contribution, we propose an alternative procedure called the Gaussianizing transformation, which is motivated by the fact that linearly mixed signals before the nonlinear transformation are approximately Gaussian distributed. This heuristic but simple and efficient procedure yields results similar to those of the ACE method and can thus be used as a fast and effective equalization method. After equalizing the nonlinearities, temporal decorrelation separation (TDSEP) allows us to recover the source signals. Numerical simulations on realistic examples are performed to compare "Gauss-TD" with "ACE-TD".
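
A minimal sketch of a marginal Gaussianizing transformation, under our own naming: map each observed value through the empirical CDF and then the inverse standard normal CDF, approximately undoing the unknown post-nonlinearity.

    from scipy.stats import norm, rankdata

    def gaussianize(x):
        # Empirical CDF in (0, 1), then inverse Gaussian CDF: the output
        # is approximately N(0, 1) regardless of the input's marginal.
        u = rankdata(x) / (len(x) + 1.0)
        return norm.ppf(u)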

We introduce a new contrast function, the kernel mutual information
(KMI), to measure the degree of independence of continuous random
variables. This contrast function provides an approximate upper bound
on the mutual information, as measured near independence, and is based
on a kernel density estimate of the mutual information between discretised
approximations of the continuous random variables. We show that Bach
and Jordan's kernel generalised variance (KGV) is also an upper bound
on the same kernel density estimate, but is looser. Finally, we suggest
that the addition of a regularising term in the KGV causes it to approach
the KMI, which motivates the introduction of this regularisation.

Usually, noise is considered to be destructive. We present
a new method that constructively injects noise to assess the
reliability and the group structure of empirical ICA components.
Simulations show that the true root-mean-squared
angle distances between the real sources and some source
estimates can be approximated by our method. In a toy experiment,
we see that we are also able to reveal the underlying
group structure of extracted ICA components. Furthermore,
an experiment with fetal ECG data demonstrates
that our approach is useful for exploratory data analysis of
real-world data.

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.