Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To estimate the object's location, one can take a sliding window approach, but this strongly increases the computational cost because the classifier or similarity function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch and bound scheme that allows efficient maximization of a large class of quality functions over all possible subimages. It converges to a globally optimal solution typically in linear or even sublinear time, in contrast to the quadratic scaling of exhaustive or sliding window search. We show how our method is applicable to different object detection and image retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest-neighbor classifiers based on the χ² distance. We demonstrate state-of-the-art localization performance of the resulting systems on the UIUC Cars data set, the PASCAL VOC 2006 data set, and in the PASCAL VOC 2007 competition.
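
A minimal sketch of the branch-and-bound loop described above, assuming a hypothetical user-supplied upper_bound(state) that over-estimates the quality of every single rectangle in a candidate set (the paper's concrete bound constructions for specific quality functions are not reproduced here):

```python
import heapq

def ess(width, height, upper_bound):
    """Best-first branch and bound over sets of rectangles.

    A candidate set is four integer intervals (top, bottom, left, right);
    for simplicity this sketch does not enforce top <= bottom.
    """
    state = ((0, height - 1), (0, height - 1), (0, width - 1), (0, width - 1))
    heap = [(-upper_bound(state), state)]
    while heap:
        neg_bound, state = heapq.heappop(heap)
        widths = [hi - lo for lo, hi in state]
        i = max(range(4), key=lambda k: widths[k])
        if widths[i] == 0:
            # Every interval is a single point: this rectangle is optimal.
            return -neg_bound, state
        lo, hi = state[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid + 1, hi)):  # split the widest interval
            child = state[:i] + (part,) + state[i + 1:]
            heapq.heappush(heap, (-upper_bound(child), child))
    return None
```

Because the queue always expands the candidate set with the highest bound, the first fully refined state popped is a global maximizer; this best-first refinement is what yields the typically sub-quadratic behavior.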

Table tennis is a difficult motor skill that requires all basic components
of a general motor skill learning system. To get a step closer to such
a generic approach to the automatic acquisition and refinement of table tennis skills, we
study table tennis from a human motor control point of view. We make use of the
basic models of discrete human movement phases, virtual hitting points, and the
operational timing hypothesis. Using these components, we create a computational
model aimed at reproducing human-like behavior. We verify the
functionality of this model in a physically realistic simulation of a Barrett WAM.

An algorithm is developed to generate random rotations in three-dimensional space that follow a probability distribution arising in fitting and matching problems. The rotation matrices are orthogonally transformed into an optimal basis and then parameterized using Euler angles. The conditional distributions of the three Euler angles have a very simple form: the two azimuthal angles can be decoupled by sampling their sum and difference from a von Mises distribution; the cosine of the polar angle is exponentially distributed and thus straightforward to generate. Simulation results are shown and demonstrate the effectiveness of the method. The algorithm is compared to other methods for generating random rotations, such as a random walk Metropolis scheme and a Gibbs sampling algorithm recently introduced by Green and Mardia. Finally, the algorithm is applied to a probabilistic version of the Procrustes problem of fitting two point sets, and applied in the context of protein structure superposition.
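
A sketch of the sampler the abstract describes, in a ZYZ Euler-angle convention; the convention, the zero means of the von Mises draws, and the parameters kappa_sum, kappa_diff and lam are illustrative assumptions (in the algorithm itself they come from the target distribution after the basis transformation):

```python
import numpy as np

def sample_rotation(kappa_sum, kappa_diff, lam, rng=None):
    """Draw one random 3x3 rotation matrix, R = Rz(alpha) Ry(beta) Rz(gamma)."""
    rng = np.random.default_rng() if rng is None else rng
    # Sum and difference of the two azimuthal angles: von Mises draws.
    s = rng.vonmises(0.0, kappa_sum)
    d = rng.vonmises(0.0, kappa_diff)
    alpha, gamma = (s + d) / 2.0, (s - d) / 2.0
    # cos(beta) with density proportional to exp(lam * c), truncated to
    # [-1, 1], sampled by inverting the CDF (lam assumed nonzero).
    u = rng.uniform()
    c = np.log(np.exp(-lam) + u * (np.exp(lam) - np.exp(-lam))) / lam
    beta = np.arccos(c)
    rz = lambda t: np.array([[np.cos(t), -np.sin(t), 0.0],
                             [np.sin(t),  np.cos(t), 0.0],
                             [0.0, 0.0, 1.0]])
    ry = lambda t: np.array([[ np.cos(t), 0.0, np.sin(t)],
                             [0.0, 1.0, 0.0],
                             [-np.sin(t), 0.0, np.cos(t)]])
    return rz(alpha) @ ry(beta) @ rz(gamma)
```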

Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques to compensate for the bias of value function estimators caused by the difference between the data-sampling policy and the target policy. However, existing off-policy methods often do not take the variance of the value function estimators explicitly into account, and therefore their performance tends to be unstable. To cope with this problem, we propose using an adaptive importance sampling technique which allows us to actively control the trade-off between bias and variance. We further provide a method for optimally determining the trade-off parameter based on a variant of cross-validation. We demonstrate the usefulness of the proposed approach through simulations.
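
A minimal sketch of the flattening idea as we read it: per-episode importance weights raised to a power nu in [0, 1] that trades bias against variance (the function and variable names are ours; the paper additionally selects nu by a cross-validation variant, which is not shown):

```python
import numpy as np

def flattened_is_estimate(returns, logp_target, logp_behavior, nu):
    """Self-normalized off-policy value estimate with flattened weights w**nu.

    nu = 1 recovers ordinary importance sampling (unbiased, high variance);
    nu = 0 ignores the policy mismatch entirely (biased, low variance).
    """
    w = np.exp(logp_target - logp_behavior) ** nu
    return np.sum(w * returns) / np.sum(w)
```

Intermediate values of nu interpolate between the two extremes; the adaptive element is tuning nu from data rather than fixing it a priori.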

Clustering is a widely used tool for exploratory data analysis. However, the theoretical understanding of clustering is very limited. We still do not have a well-founded answer to the seemingly simple question of “how many clusters are present in the data?”, and a formal comparison of clusterings based on different optimization objectives is far beyond our abilities. The lack of good theoretical support gives rise to multiple heuristics that confuse practitioners and stall the development of the field. We suggest that the ill-posed nature of clustering problems is caused by the fact that clustering is often taken out of its subsequent application context. We argue that one does not cluster the data just for the sake of clustering it, but rather to facilitate the solution of some higher-level task. By evaluating a clustering's contribution to the solution of that higher-level task, it is possible to compare different clusterings, even those obtained by different optimization objectives. In preceding work it was shown that such an approach can be applied to the evaluation and design of co-clustering solutions. Here we suggest that this approach can be extended to other settings where clustering is applied.

Obtaining novel skills is one of the most important problems
in robotics. Machine learning techniques may be a promising approach
for automatic and autonomous acquisition of movement policies. However,
this requires both an appropriate policy representation and suitable
learning algorithms. Employing the most recent form of the dynamical
systems motor primitives originally introduced by Ijspeert et al. [1],
we show how both discrete and rhythmic tasks can be learned using
a concerted approach of both imitation and reinforcement learning, and
present our current best performing learning algorithms. Finally, we show
that it is possible to include a start-up phase in rhythmic primitives. We
apply our approach to two elementary movements, i.e., Ball-in-a-Cup
and Ball-Paddling, which can be learned on a real Barrett WAM robot
arm at a pace similar to human learning.

Generalizing the cost in the standard min-cut problem to a submodular cost function immediately makes the problem harder. Not only do we prove NP-hardness even for nonnegative submodular costs, but we also show a lower bound of Ω(|V|^{1/3}) on the approximation factor for the (s, t)-cut version of the problem. On the positive side, we propose and compare three approximation algorithms with an overall approximation factor of O(min{|V|, √(|E| log |V|)}) that appear to do well in practice.

The number of advanced robot systems has been increasing in recent years yielding a
large variety of versatile designs with many degrees of freedom. These robots have the potential of
being applicable in uncertain tasks outside well-structured industrial settings. However, the complexity
of both systems and tasks is often beyond the reach of classical robot programming methods. As a
result, a more autonomous solution for robot task acquisition is needed where robots adaptively adjust
their behaviour to the encountered situations and required tasks.
Learning approaches offer one of the most appealing ways to achieve this goal. However, while
learning approaches are of high importance for robotics, we cannot simply use off-the-shelf methods
from the machine learning community, as these usually do not scale into the domains of robotics
due to excessive computational cost and poor scalability. Instead, domain-appropriate approaches
are needed. We focus here on several core domains of robot learning. For accurate task execution,
we need motor learning capabilities. For fast learning of motor tasks, imitation learning offers the
most promising approach. Self-improvement requires reinforcement learning approaches that scale into
the domain of complex robots. Finally, for efficient interaction of humans with robot systems, we will
need a form of interaction learning. This contribution provides a general introduction to these issues
and briefly presents the contributions of the related book chapters to the corresponding research topics.

This paper focuses on ethical aspects of BCI, as a research and a clinical tool, that are challenging for practitioners currently working in the field. Specifically, the difficulties involved in acquiring informed consent from locked-in patients are investigated, in combination with an analysis of the shared moral responsibility in BCI teams, and the complications encountered in establishing effective communication with the media.

Precise models of robot inverse dynamics allow the design of significantly more accurate, energy-efficient and compliant robot control. However, in some cases the accuracy of rigid-body models does not suffice for sound control performance due to unmodeled nonlinearities arising from hydraulic cable dynamics, complex friction or actuator dynamics. In such cases, estimating the inverse dynamics model from measured data poses an interesting alternative. Nonparametric regression methods, such as Gaussian process regression (GPR) or locally weighted projection regression (LWPR), are not as restrictive as parametric models and, thus, offer a more flexible framework for approximating unknown nonlinearities. In this paper, we propose a local approximation to the standard GPR, called local GPR (LGP), for real-time online model learning by combining the strengths of both regression methods, i.e., the high accuracy of GPR and the fast speed of LWPR. The approach is shown to have competitive learning performance for high-dimensional data while being sufficiently fast for real-time learning. The effectiveness of LGP is exhibited by a comparison with state-of-the-art regression techniques, such as GPR, LWPR and ν-support vector regression. The applicability of the proposed LGP method is demonstrated by real-time online learning of the inverse dynamics model for robot model-based control on a Barrett WAM robot arm.

We study the task of detecting the occurrence of objects
in large image collections or in videos, a problem that combines
aspects of content-based image retrieval and object
localization. While most previous approaches are either
limited to special kinds of queries or do not scale to large
image sets, we propose a new method, efficient subimage
retrieval (ESR), which is at the same time very flexible and
very efficient. Relying on a two-layered branch-and-bound
setup, ESR performs object-based image retrieval in sets of
100,000 or more images within seconds. An extensive evaluation
on several datasets shows that ESR is not only very
fast, but it also achieves detection accuracies that are on
par with or superior to previously published methods for
object-based image retrieval.

We introduce a system for textual entailment that is based on a probabilistic model of entailment. The model is defined using a calculus of transformations on dependency trees, which is characterized by the fact that derivations in that calculus preserve truth only with a certain probability. The calculus is successfully evaluated on the datasets of the PASCAL Challenge on Recognizing Textual Entailment.

Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature on microarray gene expression data, little attention has been paid to the uncertainty in the results obtained. Dirichlet process mixture models provide a nonparametric Bayesian alternative to the bootstrap approach for modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a Dirichlet process mixture model as a measure of the similarity of their gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles, which extend previously published cluster analyses of these data.
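
A small sketch of the visualization route just described, assuming posterior cluster-assignment samples are available (e.g., from a Gibbs sampler for the Dirichlet process mixture, which is not shown):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def coassignment_linkage(assignments):
    """Average-linkage tree from posterior co-clustering probabilities.

    assignments: (num_posterior_samples, num_genes) integer cluster labels
    drawn from the Dirichlet process mixture posterior. The dissimilarity of
    two genes is 1 - P(same cluster), estimated over the posterior samples.
    """
    S, N = assignments.shape
    same = np.zeros((N, N))
    for z in assignments:
        same += (z[:, None] == z[None, :])
    dissim = 1.0 - same / S
    np.fill_diagonal(dissim, 0.0)
    return linkage(squareform(dissim, checks=False), method="average")
```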

Maximizing some form of Poisson likelihood (either with or without penalization) is central to image reconstruction algorithms in emission tomography. In this paper we introduce NMML, a non-monotonic algorithm for maximum likelihood PET image reconstruction. NMML offers a simple and flexible procedure that also easily incorporates standard convex regularization for doing penalized likelihood estimation. A vast number of image reconstruction algorithms have been developed for PET, and new ones continue to be designed. Among these, methods based on the expectation maximization (EM) and ordered-subsets (OS) framework seem to have enjoyed the greatest popularity. Our method NMML differs fundamentally from methods based on EM: (i) it does not depend on the concept of optimization transfer (or surrogate functions); and (ii) it is a rapidly converging non-monotonic descent procedure. The greatest strengths of NMML, however, are its simplicity, efficiency, and scalability, which make it especially attractive for tomographic reconstruction. We provide a theoretical analysis of NMML, and empirically observe it to outperform standard EM-based methods, sometimes by orders of magnitude. NMML seamlessly allows integration of penalties (regularizers) into the likelihood. This ability can prove to be crucial, especially because, with the rapidly rising importance of combined PET/MR scanners, one will want to include more prior knowledge in the reconstruction.

We present the first (to our knowledge) approximation algorithm for tensor clustering, a powerful generalization of basic 1D clustering. Tensors are increasingly common in modern applications dealing with complex heterogeneous data, and clustering them is a fundamental tool for data analysis and pattern discovery. Akin to their 1D cousins, common tensor clustering formulations are NP-hard to optimize. But, unlike in the 1D case, no approximation algorithms seem to be known. We address this imbalance and build on recent co-clustering work to derive a tensor clustering algorithm with approximation guarantees, allowing metrics and divergences (e.g., Bregman) as objective functions. Therewith, we answer two open questions of Anagnostopoulos et al. (2008). Our analysis yields a constant approximation factor independent of data size; a worst-case example shows this factor to be tight for Euclidean co-clustering. However, empirically the approximation factor is observed to be conservative, so our method can also be used in practice.

In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages: 2610-2615, IEEE Service Center, Piscataway, NJ, USA, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 2009 (inproceedings)

Abstract

When children learn to grasp a new object, they often know several possible grasping points from observing a parent's demonstration and subsequently learn better grasps by trial and error. From a machine learning point of view, this process is an active learning approach. In this paper, we present a new robot learning framework for reproducing this ability in robot grasping. For doing so, we chose a straightforward approach: first, the robot observes a few good grasps by demonstration and learns a value function for these grasps using Gaussian process regression. Subsequently, it chooses grasps which are optimal with respect to this value function using a mean-shift optimization approach, and tries them out on the real system. Upon every completed trial, the value function is updated, and in the following trials it is more likely to choose even better grasping points. This method exhibits fast learning due to the data-efficiency of the Gaussian process regression framework and the fact that the mean-shift method provides maxima of this value function. Experiments were repeatedly carried out successfully on a real robot system. After fewer than sixty trials, our system has adapted its grasping policy to consistently exhibit successful grasps.
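
A rough sketch of the two components under simplifying assumptions: a hand-rolled GP with an RBF kernel and illustrative hyperparameters, plus a mean-shift-style fixed-point iteration on the GP posterior mean. With negative GP coefficients the update is only a heuristic ascent, and the paper's exact formulation may differ:

```python
import numpy as np

def rbf(a, b, ell):
    """RBF kernel matrix between row-stacked points a and b."""
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d / ell ** 2)

def fit_gp(X, y, ell=0.1, noise=1e-2):
    """GP regression weights alpha so that m(x) = sum_i alpha_i k(x, x_i)."""
    return np.linalg.solve(rbf(X, X, ell) + noise * np.eye(len(X)), y)

def mean_shift_grasp(X, alpha, x0, ell=0.1, iters=50):
    """Iterate towards a local maximum of the GP mean from start point x0."""
    x = x0.copy()
    for _ in range(iters):
        w = alpha * rbf(x[None, :], X, ell)[0]   # per-point contributions
        x = w @ X / (np.sum(w) + 1e-12)          # weighted-mean update
    return x
```

After each executed grasp, the new (grasp, outcome) pair would be appended to (X, y) and alpha refit, so later iterations steer towards increasingly reliable grasping points.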

In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages: 3121-3126, IEEE Service Center, Piscataway, NJ, USA, 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 2009 (inproceedings)

Abstract

The increasing complexity of modern robots makes it prohibitively hard to model such systems accurately, as required by many applications. In such cases, machine learning methods offer a promising alternative for approximating such models using measured data. To date, high computational demands have largely restricted machine learning techniques to offline applications. However, making robots adaptive to changes in the dynamics, and able to cope with unexplored areas of the state space, requires online learning. In this paper, we propose an approximation of support vector regression (SVR) by sparsification based on the linear independence of training data. As a result, we obtain a method which is applicable to real-time online learning. It exhibits competitive learning accuracy when compared with standard regression techniques, such as ν-SVR, Gaussian process regression (GPR) and locally weighted projection regression (LWPR).
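
A minimal sketch of the linear-independence test that drives this kind of sparsification. The criterion delta = k(x, x) - k_x^T K^{-1} k_x is the standard kernel dictionary measure; the tolerance and the surrounding SVR machinery are assumptions, not the paper's exact procedure:

```python
import numpy as np

def build_dictionary(X, kernel, tol=1e-3):
    """Greedily keep a training point only if its feature-space image is
    numerically linearly independent of the images of the points kept so far.

    kernel(A, B) must return the Gram matrix between row-stacked points, e.g.
    kernel = lambda A, B: np.exp(-0.5 * ((A[:, None] - B[None, :]) ** 2).sum(-1))
    """
    dict_idx = [0]
    for i in range(1, len(X)):
        D = X[dict_idx]
        K = kernel(D, D) + 1e-10 * np.eye(len(D))   # jitter for stability
        k = kernel(D, X[i:i + 1])[:, 0]
        delta = kernel(X[i:i + 1], X[i:i + 1])[0, 0] - k @ np.linalg.solve(K, k)
        if delta > tol:                              # not well spanned: keep it
            dict_idx.append(i)
    return dict_idx
```

The regression model is then trained only on the retained dictionary, which is what keeps updates fast enough for online use.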

A basic task of information processing is information transfer (flow). Here we study a pair of Brownian particles coupled to thermal baths at temperatures T1 and T2. The information flow in such a system is defined via the time-shifted mutual information. The information flow vanishes at equilibrium, and its efficiency is defined as the ratio of the flow to the total entropy production in the system. For a stationary state the information flows from higher to lower temperatures, and its efficiency is bounded from above by max(T1, T2)/|T1 − T2|. This upper bound is imposed by the second law, and it quantifies the thermodynamic cost of information flow in the present class of systems. It can be reached in the adiabatic situation, where the particles have widely different characteristic times. The efficiency of heat flow, defined as the heat flow over the total amount of dissipated heat, is limited from above by the same factor. There is a complementarity between heat and information flow: the set-up which is most efficient for the former is the least efficient for the latter, and vice versa. The above bound for the efficiency can be (transiently) overcome in certain non-stationary situations, but the efficiency is still limited from above. We also study another measure of information processing (transfer entropy) proposed in the literature. Though this measure does not require any thermodynamic cost, the information flow and transfer entropy are shown to be intimately related for stationary states.

Kernel methods are among the most successful tools in machine learning and are used in challenging data analysis problems in many disciplines. Here we provide examples where kernel methods have proven to be powerful tools for analyzing behavioral data, especially for identifying features in categorization experiments. We also demonstrate that kernel methods relate to perceptrons and exemplar models of categorization. Hence, we argue that kernel methods have neural and psychological plausibility, and theoretical results concerning their behavior are therefore potentially relevant for human category learning. In particular, we believe kernel methods have the potential to provide explanations ranging from the implementational via the algorithmic to the computational level.

In EMBC 2009, pages: 5304-5307, (Editors: Y Kim and B He and G Worrell and X Pan), IEEE Service Center, Piscataway, NJ, USA, 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society, September 2009 (inproceedings)

Abstract

Implicit Wiener series are a powerful tool to build
Volterra representations of time series with any degree of nonlinearity.
A natural question is then whether higher order
representations yield more useful models. In this work we
shall study this question for ECoG data channel relationships
in epileptic seizure recordings, considering whether quadratic
representations yield more accurate classifiers than linear ones.
To do so we first show how to derive statistical information on
the Volterra coefficient distribution and how to construct seizure
classification patterns over that information. As our results
illustrate, a quadratic model seems to provide no advantages
over a linear one. Nevertheless, we shall also show that the
interpretability of the implicit Wiener series provides insights
into the inter-channel relationships of the recordings.

We present a methodology for incorporating prior knowledge
on class probabilities into the registration process. By using knowledge
from the imaging modality, pre-segmentations, and/or probabilistic atlases,
we construct vectors of class probabilities for each image voxel. By
defining new image similarity measures for distribution-valued images,
we show how the class probability images can be nonrigidly registered in
a variational framework. An experiment on nonrigid registration of MR
and CT full-body scans illustrates that the proposed technique outperforms
standard mutual information (MI) and normalized mutual information
(NMI) based registration techniques when measured in terms of
target registration error (TRE) of manually labeled fiducials.

Creating autonomous robots that can learn to act in unpredictable environments has been a long-standing goal of robotics, artificial intelligence, and the cognitive sciences. In contrast, current commercially available industrial and service robots mostly execute fixed tasks and exhibit little adaptability. To bridge this gap, machine learning offers a myriad of methods, some of which have already been applied with great success to robotics problems. As a result, there is an increasing interest in machine learning and statistics within the robotics community. At the same time, there has been a growth in the learning community in using robots as motivating applications for new algorithms and formalisms. Considerable evidence of this exists in the use of learning in high-profile competitions such as RoboCup and the Defense Advanced Research Projects Agency (DARPA) challenges, and the growing number of research programs funded by governments around the world.

Recent research has shown that the use of contextual cues significantly improves performance
in sliding window type localization systems. In this work, we propose a method
that incorporates both global and local context information through appropriately defined
kernel functions. In particular, we make use of a weighted combination of kernels defined
over local spatial regions, as well as a global context kernel. The relative importance of
the context contributions is learned automatically, and the resulting discriminant function
is of a form such that localization at test time can be solved efficiently using a branch
and bound optimization scheme. By specifying context directly with a kernel learning
approach, we achieve high localization accuracy with a simple and efficient representation.
This is in contrast to other systems that incorporate context for which expensive
inference needs to be done at test time. We show experimentally on the PASCAL VOC
datasets that the inclusion of context can significantly improve localization performance,
provided the relative contributions of context cues are learned appropriately.

Direct policy search is a promising reinforcement learning framework, in particular for controlling continuous, high-dimensional systems such as anthropomorphic robots. Due to its high flexibility, policy search often requires a large number of samples to obtain a stable policy-update estimator. However, this is prohibitive when the sampling cost is expensive. In this paper, we extend an EM-based policy search method so that previously collected samples can be reused efficiently. The usefulness of the proposed method, called Reward-weighted Regression with sample Reuse, is demonstrated through a robot learning experiment.
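
For context, one reward-weighted regression step for a linear-Gaussian policy looks roughly as follows (a sketch in our own notation; sample reuse would additionally multiply each reward by an importance weight for the behavior policy that generated it):

```python
import numpy as np

def rwr_update(Phi, actions, rewards):
    """One reward-weighted regression step for a policy a ~ N(theta^T phi(s), s2).

    Phi: (n_samples, n_features) state features; actions: (n_samples,);
    rewards: (n_samples,) nonnegative (transformed) rewards. The update is a
    weighted least-squares fit in which each sample counts in proportion to
    its reward, so high-reward actions pull the policy mean towards them.
    """
    W = np.diag(rewards)
    A = Phi.T @ W @ Phi + 1e-8 * np.eye(Phi.shape[1])  # ridge for stability
    return np.linalg.solve(A, Phi.T @ W @ actions)
```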

High-speed smooth and accurate visual tracking of objects in
arbitrary, unstructured environments is essential for robotics and human
motion analysis. However, building a system that can adapt to arbitrary
objects and a wide range of lighting conditions is a challenging problem,
especially if hard real-time constraints apply like in robotics scenarios.
In this work, we introduce a method for learning a discriminative object
tracking system based on the recent structured regression framework for
object localization. Using a kernel function that allows fast evaluation
on the GPU, the resulting system can process video streams at speeds of
100 frames per second or more.
Consecutive frames in high-speed video sequences are typically very redundant,
and for training an object detection system it is sufficient to
have training labels for only a subset of all images. We propose an
active learning method that selects training examples in a data-driven
way, thereby minimizing the required number of training labels. Experiments
on realistic data show that this active learning approach is superior to
previously used dataset subsampling methods for this task.

We present a novel algorithm for the markerless tracking of deforming surfaces such as faces. We acquire a sequence of 3D scans along with color images at 40Hz. The data is then represented by implicit surface and color functions, using a novel partition-of-unity type method of efficiently combining local regressors using nearest neighbor searches. Both these functions act on the 4D space of 3D plus time, and use temporal information to handle the noise in individual scans. After interactive registration of a template mesh to the first frame, it is then automatically deformed to track the scanned surface, using the variation
of both shape and color as features in a dynamic energy minimization
problem. Our prototype system yields high-quality animated 3D models in correspondence, at a rate of approximately twenty seconds per
timestep. Tracking results for faces and other objects are presented.

Foundations and Trends in Computer Graphics and Vision, 4(3):193-285, September 2009 (article)

Abstract

Over the last few years, kernel methods have established themselves as powerful tools for computer vision researchers as well as for practitioners. In this tutorial, we give an introduction to kernel methods in computer vision from a geometric perspective, introducing not only the ubiquitous support vector machines, but also less known techniques for regression, dimensionality reduction, outlier detection and clustering. Additionally, we give an outlook on very recent, non-classical techniques for the prediction of structured data, for the estimation of statistical dependency and for learning the kernel function itself. All methods are illustrated with examples of successful application from the recent computer vision research literature.

We generalize traditional goals of clustering towards distinguishing components in a non-parametric mixture model. The clusters are not necessarily based on point locations, but on higher order criteria. This framework can be implemented by embedding probability distributions
in a Hilbert space. The corresponding clustering objective is very general and relates to a range of common clustering concepts.

A wealth of time series of microarray measurements has become available in recent years. Several two-sample tests for detecting differential gene expression in these time series have been defined, but they can only answer the question of whether a gene is differentially expressed across the whole time series, not in which intervals it is differentially expressed. In this article, we propose a Gaussian process based approach for studying these dynamics of differential gene expression. In experiments on Arabidopsis thaliana gene expression levels, our novel technique helps us uncover that the family of WRKY transcription factors appears to be involved in the early response to infection by a fungal pathogen.
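
One plausible reading of such an interval-level test, sketched with scikit-learn (window size, kernel and the exact scoring rule are our assumptions): score each window by how much better two independent GPs explain the two conditions than a single shared GP.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def interval_scores(t, y_control, y_treatment, window=8):
    """Log-marginal-likelihood gap per sliding window over the time axis.

    Large values indicate intervals where the two conditions are better
    modeled by separate GPs, i.e. candidate differential expression.
    """
    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
    def lml(tt, yy):
        gp = GaussianProcessRegressor(kernel=kernel).fit(tt, yy)
        return gp.log_marginal_likelihood_value_
    scores = []
    for start in range(len(t) - window + 1):
        sl = slice(start, start + window)
        ts = t[sl, None]
        shared = lml(np.vstack([ts, ts]),
                     np.concatenate([y_control[sl], y_treatment[sl]]))
        separate = lml(ts, y_control[sl]) + lml(ts, y_treatment[sl])
        scores.append(separate - shared)
    return np.array(scores)
```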

Recent approaches to independent component analysis (ICA) have used kernel independence measures to obtain highly accurate solutions, particularly where classical methods experience difficulty (for instance, sources with near-zero kurtosis). FastKICA (fast HSIC-based kernel ICA) is a new optimization method for one such kernel independence measure, the Hilbert-Schmidt Independence Criterion (HSIC). The high computational efficiency of this approach is achieved by combining geometric optimization techniques, specifically an approximate Newton-like method on the orthogonal group, with accurate estimates of the gradient and Hessian based on an incomplete Cholesky decomposition. In contrast to other efficient kernel-based ICA algorithms, FastKICA is applicable to any twice differentiable kernel function. Experimental results for problems with large numbers of sources and observations indicate that FastKICA provides more accurate solutions at a given cost than gradient descent on HSIC. Comparing with other recently published ICA methods, FastKICA is competitive in terms of accuracy, relatively insensitive to local minima when initialized far from independence, and more robust towards outliers. An analysis of the local convergence properties of FastKICA is provided.

Many motor skills in humanoid robotics can be learned using parametrized motor primitives from demonstrations. However, most interesting motor learning problems require self-improvement often beyond the reach of current reinforcement learning methods due to the high dimensionality of the state-space. We develop an EM-inspired algorithm applicable to complex motor learning tasks. We compare this algorithm to several well-known parametrized policy search methods and show that it outperforms them. We apply it to motor learning problems and show that it can learn the complex Ball-in-a-Cup task using a real Barrett WAM robot arm.

The pedestal effect is the improvement in the detectability of a sinusoidal grating in the presence of another grating of the same orientation, spatial frequency, and phase—usually called the pedestal. Recent evidence has demonstrated that the pedestal effect is differently modified by spectrally flat and notch-filtered noise: The pedestal effect is reduced in flat noise but virtually disappears in the presence of notched noise (G. B. Henning & F. A. Wichmann, 2007). Here we consider a network consisting of units whose contrast response functions resemble those of the cortical cells believed to underlie human pattern vision and demonstrate that, when the outputs of multiple units are combined by simple weighted summation—a heuristic decision rule that resembles optimal information combination and produces a contrast-dependent weighting profile—the network produces contrast-discrimination data consistent with psychophysical observations: The pedestal effect is present without noise, reduced in broadband noise, but almost disappears in notched noise. These findings follow naturally from the normalization model of simple cells in primary visual cortex, followed by response-based pooling, and suggest that in processing even low-contrast sinusoidal gratings, the visual system may combine information across neurons tuned to different spatial frequencies and orientations.

Journal for General Philosophy of Science, 40(1):51-58, July 2009 (article)

Abstract

We compare Karl Poppers ideas concerning the falsifiability of a theory with similar notions from the part of statistical learning theory known as VC-theory. Poppers notion of the dimension of a theory is contrasted with the apparently very similar VC-dimension. Having located some divergences, we discuss how best to view Poppers work from the perspective of statistical learning theory, either as a precursor or as aiming to capture a different learning activity.

We present a geometric method to determine confidence sets for the
ratio E(Y)/E(X) of the means of random variables X and Y. This
method reduces the problem of constructing confidence sets for the
ratio of two random variables to the problem of constructing
confidence sets for the means of one-dimensional random variables. It
is valid in a large variety of circumstances. In the case of normally
distributed random variables, the confidence sets so constructed
coincide with the standard Fieller confidence sets. Generalizations of
our construction lead to definitions of exact and conservative
confidence sets for very general classes of distributions, provided
the joint expectation of (X, Y) exists and linear combinations of
the form aX + bY are well-behaved. Finally, our geometric method
allows us to derive a very simple bootstrap approach for constructing
conservative confidence sets for ratios, which performs favorably in
certain situations, in particular in the asymmetric heavy-tailed
regime.
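
For the normal case mentioned above, the standard Fieller construction reduces to a quadratic inequality in the ratio; a sketch for paired samples (the paper's geometric derivation and its generalizations are not reproduced here):

```python
import numpy as np
from scipy import stats

def fieller_interval(x, y, level=0.95):
    """Fieller confidence set for E[Y]/E[X] from paired samples x, y.

    Returns (lo, hi) when the set is a bounded interval; returns None when
    Fieller's theorem yields the whole line or the complement of an interval
    (which happens when E[X] is not well separated from zero).
    """
    n = len(x)
    mx, my = x.mean(), y.mean()
    vxx = x.var(ddof=1) / n                  # variance of the mean of x
    vyy = y.var(ddof=1) / n                  # variance of the mean of y
    vxy = np.cov(x, y)[0, 1] / n             # covariance of the two means
    t2 = stats.t.ppf(0.5 + level / 2, df=n - 1) ** 2
    # {rho : (my - rho*mx)^2 <= t2 * Var(my - rho*mx)} is a quadratic set.
    a = mx ** 2 - t2 * vxx
    b = -2.0 * (mx * my - t2 * vxy)
    c = my ** 2 - t2 * vyy
    disc = b ** 2 - 4 * a * c
    if a <= 0 or disc < 0:
        return None
    r = np.sqrt(disc)
    return ((-b - r) / (2 * a), (-b + r) / (2 * a))
```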

In Proceedings of Multiplicity and Unification in Statistics and Probability, pages: 1-10, Multiplicity and Unification in Statistics and Probability, June 2009 (inproceedings)

Abstract

The field of machine learning has flourished over the past couple of decades. With huge amounts of data available, efficient algorithms can learn to extrapolate from their training sets to become very accurate classifiers. For example, it is straightforward now to develop classifiers which achieve accuracies of around 99% on databases of handwritten digits.
Now these algorithms have been devised by theorists who arrive at the problem of machine learning with a range of different philosophical outlooks on the subject of inductive reasoning. This has led to a wide range of theoretical rationales for their work. In this talk I shall classify the different forms of justification for inductive machine learning into four kinds, and make some comparisons between them.
With little in the way of theoretical knowledge to aid in the learning tasks, the relevance of these justificatory approaches for the inductive reasoning of the natural sciences is questionable; nevertheless, certain issues surrounding the presuppositions of inductive reasoning are brought sharply into focus. In particular, Frequentist, Bayesian and MDL outlooks can be compared.

From an information-theoretic perspective, a noisy transmission system such as a visual Brain-Computer Interface (BCI) speller could benefit from the use of error-correcting codes. However, optimizing the code solely according to the maximal minimum-Hamming-distance criterion tends to lead to an overall increase in the frequency of target stimuli, and hence a significantly reduced average target-to-target interval (TTI), leading to difficulties in classifying the individual event-related potentials (ERPs) due to overlap and refractory effects. Clearly, any change to the stimulus setup must also respect the possible psychophysiological consequences. Here we report new EEG data from experiments in which we explore stimulus types and codebooks in a within-subject design, finding an interaction between the two factors. Our data demonstrate that the traditional row-column code has particular spatial properties that lead to better performance than one would expect from its TTIs and Hamming distances alone, but nonetheless error-correcting codes can improve performance provided the right stimulus type is used.

Graph clustering methods such as spectral clustering are defined for general weighted graphs. In machine learning, however, data are often not given in the form of a graph, but in terms of similarity (or distance) values between points. In this case, first a neighborhood graph is constructed using the similarities between the points, and then a graph clustering algorithm is applied to this graph. In this paper we investigate the influence of the construction of the similarity graph on the clustering results. We first study the convergence of graph clustering criteria such as the normalized cut (Ncut) as the sample size tends to infinity. We find that the limit expressions are different for different types of graph, for example the r-neighborhood graph or the k-nearest neighbor graph. In plain words: Ncut on a kNN graph does something systematically different from Ncut on an r-neighborhood graph! This finding shows that graph clustering criteria cannot be studied independently of the kind of graph they are applied to. We also provide examples which show that these differences can be observed for toy and real data already for rather small sample sizes.
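
The contrast is easy to probe empirically; a sketch with scikit-learn, where the neighborhood size k, radius r and cluster count are placeholders to be tuned per dataset:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph
from sklearn.cluster import SpectralClustering

def cluster_both_ways(X, k=10, r=0.3, n_clusters=2):
    """Ncut-style spectral clustering of the same points on two graph types.

    Comparing the two label sets illustrates that the choice of neighborhood
    graph, not just the clustering criterion, shapes the result.
    """
    graphs = {
        "knn": kneighbors_graph(X, n_neighbors=k, mode="connectivity"),
        "radius": radius_neighbors_graph(X, radius=r, mode="connectivity"),
    }
    results = {}
    for name, G in graphs.items():
        W = 0.5 * (G + G.T).toarray()  # symmetrize the adjacency matrix
        labels = SpectralClustering(n_clusters=n_clusters,
                                    affinity="precomputed").fit_predict(W)
        results[name] = labels
    return results
```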

Learning in real-time applications, e.g., online approximation of the inverse dynamics model for model-based robot control, requires fast online regression techniques. Inspired by local learning, we propose a method to speed up standard Gaussian process regression (GPR) with local GP models (LGP). The training data are partitioned into local regions, and an individual GP model is trained for each. The prediction for a query point is performed by weighted estimation using nearby local models. Unlike other GP approximations, such as mixtures of experts, we use a distance-based measure for the partitioning of the data and for the weighted prediction. The proposed method achieves online learning and prediction in real time. Comparisons with other nonparametric regression methods show that LGP has higher accuracy than LWPR, and performance close to that of standard GPR and ν-SVR.
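
A compact sketch of this partition-and-weight scheme (illustrative hyperparameters; the real method updates local models incrementally rather than refitting from scratch, and its exact weighting may differ):

```python
import numpy as np

class LocalGP:
    """Local GP models: route each point to the nearest center, fit one exact
    GP per region, predict with a kernel-weighted average of local means."""

    def __init__(self, ell=1.0, noise=1e-2, w_gen=0.5):
        self.ell, self.noise, self.w_gen = ell, noise, w_gen
        self.models = []  # list of (center, X, y, alpha)

    def _k(self, A, B):
        d = np.sum((A[:, None] - B[None, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d / self.ell ** 2)

    def insert(self, x, y):
        """Add one sample: extend the closest model or open a new one."""
        if self.models:
            sims = [self._k(x[None], c[None])[0, 0] for c, *_ in self.models]
            j = int(np.argmax(sims))
            if sims[j] > self.w_gen:
                _, X, Y, _ = self.models[j]
                X, Y = np.vstack([X, x]), np.append(Y, y)
                K = self._k(X, X) + self.noise * np.eye(len(X))
                self.models[j] = (X.mean(0), X, Y, np.linalg.solve(K, Y))
                return
        self.models.append((x.copy(), x[None].copy(), np.array([y]),
                            np.array([y / (1.0 + self.noise)])))

    def predict(self, x):
        """Weighted average of local posterior means, weighted by the kernel
        distance between the query and each model's center."""
        num = den = 0.0
        for c, X, Y, alpha in self.models:
            w = self._k(x[None], c[None])[0, 0]
            num += w * (self._k(x[None], X)[0] @ alpha)
            den += w
        return num / den
```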

In Proceedings of the 26th International Conference on Machine Learning, pages: 801-808, (Editors: A Danyluk and L Bottou and ML Littman), ACM Press, New York, NY, USA, ICML, June 2009 (inproceedings)

Abstract

We propose a method that detects the true direction of time series by fitting an autoregressive moving average model to the data. Whenever the noise is independent of the previous samples for one ordering of the observations, but dependent for the opposite ordering, we infer the former direction to be the true one. We prove that our method works in the population case as long as the noise of the process is not normally distributed (in the latter case, the direction is not identifiable). A new and important implication of our result is that it confirms, for the case of time series, a fundamental conjecture in causal reasoning: if after regression the noise is independent of the signal for one direction and dependent for the other, then the former represents the true causal direction. We test our approach on two types of data: simulated data sets conforming to our modeling assumptions, and real-world EEG time series. Our method makes a decision for a significant fraction of both data sets, and these decisions are mostly correct. For real-world data, our approach outperforms alternative solutions to the problem of time-direction recovery.
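
A simplified sketch of the decision rule, using a plain AR fit in place of the full ARMA model and a biased HSIC statistic as the independence measure (the paper uses a proper independence test with a significance level and may abstain; this toy version always decides):

```python
import numpy as np

def hsic(a, b, sigma=1.0):
    """Biased HSIC estimate with Gaussian kernels; small means near-independent."""
    n = len(a)
    def gram(v):
        return np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2 / sigma ** 2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ gram(a) @ H @ gram(b)) / n ** 2

def ar_residuals(x, p=2):
    """Least-squares AR(p) fit; returns residuals and the lag-1 regressor."""
    X = np.column_stack([x[p - i - 1:len(x) - i - 1] for i in range(p)])
    beta, *_ = np.linalg.lstsq(X, x[p:], rcond=None)
    return x[p:] - X @ beta, X[:, 0]

def infer_direction(x, p=2):
    """Prefer the ordering whose residuals look independent of the past."""
    eps_f, lag_f = ar_residuals(x, p)
    eps_b, lag_b = ar_residuals(x[::-1], p)
    return "forward" if hsic(eps_f, lag_f) < hsic(eps_b, lag_b) else "backward"
```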

We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence
between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to
substantially improve the quality of the clustering, when compared with state-of-the-art algorithms in the literature (both spectral clustering and a previous dependence
maximization approach). We demonstrate our algorithm on image and text data.

In 8th IEEE International Conference on Development and Learning, pages: 1-7, IEEE Service Center, Piscataway, NJ, USA, ICDL, June 2009 (inproceedings)

Abstract

This paper addresses the issue of learning and representing object grasp affordances, i.e., object-gripper relative configurations that lead to successful grasps. The purpose of grasp affordances is to organize and store the whole knowledge that an agent has about the grasping of an object, in order to facilitate reasoning on grasping solutions and their achievability. The affordance representation consists of a continuous probability density function defined on the 6D gripper pose space (3D position and orientation) within an object-relative reference frame. Grasp affordances are initially learned from various sources, e.g., from imitation or from visual cues, leading to grasp hypothesis densities. Grasp densities are attached to a learned 3D visual object model, and pose estimation of the visual model allows a robotic agent to execute samples from a grasp hypothesis density under various object poses. Grasp outcomes are used to learn grasp empirical densities, i.e., grasps that have been confirmed through experience. We show the result of learning grasp hypothesis densities from both imitation and visual cues, and present grasp empirical densities learned from physical experience by a robot.

EEG connectivity measures could provide a new type of feature space for inferring a subject's intention in Brain-Computer Interfaces (BCIs). However, very little is known about EEG connectivity patterns for BCIs. In this study, EEG connectivity during motor imagery (MI) of the left and right hand is investigated in a broad frequency range across the whole scalp by combining Beamforming with Transfer Entropy and taking into account possible volume conduction effects. Observed connectivity patterns indicate that modulation intentionally induced by MI is strongest in the gamma-band, i.e., above 35 Hz. Furthermore, modulation between MI and rest is found to be more pronounced than between MI of different hands. This is in contrast to results on MI obtained with bandpower features, and might provide an explanation for the so far only moderate success of connectivity features in BCIs. It is concluded that future studies on connectivity-based BCIs should focus on high frequency bands and consider experimental paradigms that maximally vary cognitive demands between conditions.

The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that in fact the basic linear framework can be generalized to nonlinear models with additive noise. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities.
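
The core identification recipe can be sketched in a few lines, with kernel ridge regression standing in for whatever nonlinear regressor one prefers and a biased HSIC statistic as a crude independence score (bandwidths and regularization are placeholders):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def hsic(a, b, sigma=1.0):
    """Biased HSIC with Gaussian kernels, as in the time-series sketch above."""
    n = len(a)
    g = lambda v: np.exp(-0.5 * (v[:, None] - v[None, :]) ** 2 / sigma ** 2)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(H @ g(a) @ H @ g(b)) / n ** 2

def anm_direction(x, y):
    """Additive-noise-model test: regress each variable on the other with a
    nonlinear fit and keep the direction whose residuals are closer to being
    independent of the putative cause."""
    def residual_dependence(cause, effect):
        f = KernelRidge(kernel="rbf", alpha=0.1, gamma=1.0)
        f.fit(cause[:, None], effect)
        return hsic(cause, effect - f.predict(cause[:, None]))
    return ("x->y" if residual_dependence(x, y) < residual_dependence(y, x)
            else "y->x")
```

In the linear-Gaussian case both directions admit independent residuals, which is exactly why the nonlinearities (or non-Gaussian noise) act as a blessing for identifiability.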

We propose a novel bound on single-variable marginal probability distributions in factor graphs with discrete variables. The bound is obtained by propagating local
bounds (convex sets of probability distributions) over a subtree of the factor graph, rooted in the variable of interest. By construction, the method not only bounds
the exact marginal probability distribution of a variable, but also its approximate Belief Propagation marginal ("belief"). Thus, apart from providing a practical
means to calculate bounds on marginals, our contribution also lies in providing a better understanding of the error made by Belief Propagation. We show that our bound outperforms the state-of-the-art on some inference problems arising in medical diagnosis.

We show how variational Bayesian inference can be implemented for very large generalized linear models. Our relaxation is proven to be a convex problem for any log-concave model. We provide a generic double loop algorithm for solving this relaxation on models with arbitrary super-Gaussian potentials. By iteratively decoupling the criterion, most of the work can be done by solving large linear systems, rendering our algorithm orders of magnitude faster than previously proposed solvers for the same problem. We evaluate our method on problems of Bayesian active learning for large binary classification models, and show how to address settings with many candidates and sequential inclusion steps.

We study the problem of domain transfer for a supervised classification task in mRNA splicing. We consider a number of recent domain transfer methods from machine learning, including some that are novel, and evaluate them on genomic
sequence data from model organisms of varying evolutionary distance. We find that in cases where the organisms are not closely related, the use of domain adaptation methods can help improve classification performance.

Our goal is to understand the principles of Perception, Action and Learning in autonomous systems that successfully interact with complex environments, and to use this understanding to design future systems.