2002

We construct a geometric framework for arbitrary-norm Support Vector Machine
(SVM) classifiers. Within this framework, separating hyperplanes, dual descriptions
and solutions of SVM classifiers are constructed in a purely geometric
fashion. In contrast with the optimization theory usually applied to SVM classifiers, no complicated computations are needed. Each step in our
theory is guided by elegant geometric intuitions.

The authors used a recognition memory paradigm to assess the influence of color information on visual memory for images of natural scenes. Subjects performed 5-10% better for colored than for black-and-white images independent of exposure duration. Experiment 2 indicated little influence of contrast once the images were suprathreshold, and Experiment 3 revealed that performance worsened when images were presented in color and tested in black and white, or vice versa, leading to the conclusion that the surface property color is part of the memory representation. Experiments 4 and 5 exclude the possibility that the superior recognition memory for colored images results solely from attentional factors or saliency. Finally, the recognition memory advantage disappears for falsely colored images of natural scenes: The improvement in recognition memory depends on the color congruence of presented images with learned knowledge about the color gamut found within natural scenes. The results can be accounted for within a multiple memory systems framework.

We estimate the number of microarrays required to obtain reliable results from a common type of study: the pairwise comparison of different classes of samples. Current knowledge seems to suffice for the construction of models that are realistic with respect to searches for individual differentially expressed genes. Such models allow us to investigate the dependence of the required number of samples on the relevant parameters: the biological variability of the samples within each class; the fold changes in expression; the detection sensitivity of the microarrays; and the acceptable error rates of the results. We supply experimentalists with general conclusions as well as a freely accessible Java applet at http://cartan.gmd.de/~zien/classsize/ for fine-tuning simulations to their particular circumstances. Since the situation can be assumed to be very similar for large-scale proteomics and metabolomics studies, our methods and results might also apply there.

Much of our information about early spatial vision comes from detection experiments involving low-contrast stimuli, which are not, perhaps, particularly "natural" stimuli. Contrast discrimination experiments provide one way to explore the visual system's response to stimuli of higher contrast whilst keeping the number of unknown parameters comparatively small. We explored both detection and contrast discrimination performance with sinusoidal and "pulse-train" (or line) gratings. Both types of grating had a fundamental spatial frequency of 2.09 c/deg, but the pulse train ideally contains, in addition to its fundamental component, all the harmonics of the fundamental. Although the 2.09 c/deg pulse train produced on our display was measured using a high-performance digital camera (Photometrics) and shown to contain at least 8 harmonics at equal contrast, it was no more detectable than its most detectable component; no benefit from having additional information at the harmonics was measurable. The addition of broadband 1-D "pink" noise made it about a factor of four more detectable than any of its components. However, in contrast-discrimination experiments, with an in-phase pedestal or masking grating of the same form and phase as the signal and 15% contrast, the noise did not improve the discrimination performance of the pulse train relative to that of its sinusoidal components. We discuss the implications of these observations for models of early vision, in particular the implications for possible sources of internal noise.

SVMs tend to take a very long time to train on a large data
set. If "redundant" patterns are identified and deleted in pre-processing, the training time could be reduced significantly. We propose a k-nearest
neighbors (k-NN) based pattern selection method. The method tries to select the patterns that are near the decision boundary and that are correctly labeled. Simulations on synthetic data sets showed promising results: (1) By converting a non-separable problem to a separable one,
the search for an optimal error tolerance parameter became unnecessary. (2) SVM training time decreased by two orders of magnitude without any loss of accuracy. (3) The redundant SVs were substantially reduced.
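
The selection idea can be illustrated with a minimal sketch. The abstract does not state the exact criterion, so the two rules below are assumptions made for illustration: a pattern counts as "near the boundary" when its k nearest neighbors carry more than one label, and as "correctly labeled" when its own label matches the neighbor majority. The function name `knn_select` is hypothetical.

```python
from collections import Counter
import math

def knn_select(points, labels, k=3):
    """Return indices of patterns that look near the boundary and correctly labeled.

    Heuristic (an assumption, not necessarily the paper's criterion):
    - near the boundary: the k nearest neighbors carry more than one label
    - correctly labeled: the pattern's own label matches the neighbor majority
    """
    selected = []
    for i, p in enumerate(points):
        # Sort all other patterns by Euclidean distance to p.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        counts = Counter(labels[j] for j in others[:k])
        majority = counts.most_common(1)[0][0]
        if len(counts) > 1 and labels[i] == majority:
            selected.append(i)
    return selected

# Two 1-D classes; only the two patterns facing each other are kept.
points = [(0.0,), (1.0,), (2.0,), (2.9,), (4.1,), (5.0,), (6.0,), (7.0,)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
kept = knn_select(points, labels, k=3)
```

Only the selected subset would then be passed to the SVM trainer. This brute-force neighbor search is quadratic in the number of patterns, so it is a sketch of the idea rather than an efficient implementation.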

Practical experience has shown that in order to obtain the best possible performance, prior knowledge about invariances of the classification
problem at hand ought to be incorporated into the training procedure. We describe and review all known methods for doing so in support vector machines,
provide experimental results, and discuss their respective merits. One of the significant new results reported in this work is our recent achievement of the
lowest reported test error on the well-known MNIST digit recognition benchmark task, with SVM training times that are also significantly shorter than
those of previous SVM methods.

Model selection is an important ingredient of many machine
learning algorithms, in particular when the sample size is
small, in order to strike the right trade-off between overfitting
and underfitting. Previous classical results for linear regression
are based on an asymptotic analysis. We present a new
penalization method for performing model selection for
regression that is appropriate even for small samples.
Our penalization is based on an accurate estimator of the
ratio of the expected training error and the expected
generalization error, in terms of the expected eigenvalues
of the input covariance matrix.

New classification algorithms based on the notion of 'margin'
(e.g. Support Vector Machines, Boosting) have recently been developed.
The goal of this thesis is to better understand how they work, via a
study of their theoretical performance.
In order to do this, a general framework for real-valued
classification is proposed. In this framework, it appears that the
natural tools to use are Concentration Inequalities and Empirical
Processes Theory.
Thanks to an adaptation of these tools, a new measure of the size of a
class of functions is introduced, which can be computed from the data.
This makes it possible, on the one hand, to better understand the role of
the eigenvalues of the kernel matrix in Support Vector Machines, and on
the other hand, to obtain empirical model selection criteria.

This thesis presents a theoretical and practical study of Support
Vector Machines (SVM) and related learning algorithms. In the first part,
we introduce a new induction principle from which SVMs can be derived;
some new algorithms are also presented in this framework.
In the second part, after studying how to estimate the generalization
error of an SVM, we suggest choosing the kernel parameters of an SVM
by minimizing this estimate. Several applications such as feature
selection are presented. Finally, the third part deals with the incorporation
of prior knowledge in a learning algorithm; more specifically, we
study the case of known invariant transformations and the use
of unlabeled data.

The detectability of contrast increments was measured as a function of the contrast of a masking or pedestal grating at a number of different spatial frequencies ranging from 2 to 16 cycles per degree of visual angle. The pedestal grating always had the same orientation, spatial frequency and phase as the signal. The shape of the contrast increment threshold versus pedestal contrast (TvC) functions depends on the performance level used to define the threshold, but when both axes are normalized by the contrast corresponding to 75% correct detection at each frequency, the TvC functions at a given performance level are identical. Confidence intervals on the slope of the rising part of the TvC functions are so wide that it is not possible with our data to reject Weber's Law.

In this paper we investigate connections between statistical learning
theory and data compression on the basis of support vector machine
(SVM) model selection. Inspired by several generalization bounds we
construct "compression coefficients" for SVMs, which measure the
amount by which the training labels can be compressed by some
classification hypothesis. The main idea is to relate the coding
precision of this hypothesis to the width of the margin of the
SVM. The compression coefficients connect well-known quantities such
as the radius-margin ratio R^2/rho^2, the eigenvalues of the kernel
matrix and the number of support vectors. To test whether they are
useful in practice we ran model selection experiments on several
real-world datasets. We found that compression coefficients can
fairly accurately predict the parameters for which the test error is
minimized.

We introduce new concentration inequalities for functions on product spaces.
They allow us to obtain a Bennett-type deviation bound for suprema of
empirical processes indexed by upper bounded functions.
The result is an improvement on Rio's version \cite{Rio01b} of Talagrand's
inequality \cite{Talagrand96} for equidistributed variables.

In this article we describe a new code for evolving
axisymmetric isolated systems in general relativity. Such systems are described by asymptotically flat space-times, which have the property that they admit a conformal extension. We work directly in the extended conformal manifold and numerically solve Friedrich's conformal field equations, which state that Einstein's equations hold in the physical space-time. Because of the compactness of the conformal space-time, the entire space-time can be calculated on a finite numerical grid. We describe the numerical scheme in detail, especially the treatment of the axisymmetry and the boundary.

We investigate the behaviour of global and
local Rademacher averages. We present new error bounds which are
based on the local averages and indicate how data-dependent
local averages can be estimated without a priori
knowledge of the class at hand.
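
As a small illustration of the quantity involved, the sketch below computes the empirical Rademacher average of a finite function class exactly, by enumerating all sign vectors on a fixed sample. The localization rule in `localize` (keep functions whose empirical second moment is at most r) is an assumption chosen for illustration; the abstract does not specify which localization is used, and both function names are hypothetical.

```python
from itertools import product

def empirical_rademacher(function_values):
    """Exact empirical Rademacher average of a finite class on a fixed sample.

    function_values: one row per function f, holding (f(x_1), ..., f(x_n))
    on the same n sample points. All 2**n sign vectors are enumerated,
    so this is only practical for small n.
    """
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        # Supremum over the class of the signed empirical mean.
        total += max(
            sum(s * v for s, v in zip(signs, row)) / n
            for row in function_values
        )
    return total / 2 ** n

def localize(function_values, r):
    """Keep the functions whose empirical second moment is at most r (assumed rule)."""
    return [row for row in function_values
            if sum(v * v for v in row) / len(row) <= r]

rows = [[1, 1], [-1, -1], [0.1, 0.1]]
global_avg = empirical_rademacher(rows)                # whole class
local_avg = empirical_rademacher(localize(rows, 0.5))  # small-norm subset only
```

The local average is never larger than the global one, since the supremum runs over a subset of the class.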

Proceedings of the 33rd European Conference on Mathematical Psychology, pages: 44, 2002 (poster)

Abstract

The psychometric function relates an observer's performance to an independent variable, usually some physical quantity of a stimulus in a psychophysical task. Here I describe methods for (1) fitting psychometric functions, (2) assessing goodness-of-fit, and (3) providing confidence intervals for the function's parameters and other estimates derived from them. First I describe a constrained maximum-likelihood method for parameter estimation. Using Monte Carlo simulations I demonstrate that it is important to have a fitting method that takes stimulus-independent errors (or "lapses") into account. Second, a number of goodness-of-fit tests are introduced. Because psychophysical data sets are usually rather small, I advocate the use of Monte Carlo resampling techniques that do not rely on asymptotic theory for goodness-of-fit assessment. Third, a parametric bootstrap is employed to estimate the variability of fitted parameters and derived quantities such as thresholds and slopes. I describe how the bootstrap bridging assumption, on which the validity of the procedure depends, can be tested without incurring too high a cost in computation time. Finally I describe how the methods can be extended to test hypotheses concerning the form and shape of several psychometric functions. Software implementing the methods is available (http://www.bootstrap-software.com/psignifit/), as well as articles describing the methods in detail (Wichmann & Hill, Perception & Psychophysics, 2001a,b).
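
A minimal sketch of the first step, a maximum-likelihood fit that includes a lapse rate, might look as follows. It does not use the psignifit software or its interface. The logistic core, the fixed guess rate of 0.5, and the grid search are assumptions made to keep the example short, and the constraint on the lapse rate is expressed only through the range of the grid.

```python
import math

def psi(x, alpha, beta, gamma, lam):
    """Psychometric function with guess rate gamma and lapse rate lam.

    The logistic core is an illustrative choice; other sigmoids work the same way.
    """
    core = 1.0 / (1.0 + math.exp(-(x - alpha) / beta))
    return gamma + (1.0 - gamma - lam) * core

def log_likelihood(params, data, gamma=0.5):
    """Binomial log-likelihood; data is a list of (x, n_correct, n_trials)."""
    alpha, beta, lam = params
    eps = 1e-12
    ll = 0.0
    for x, k, n in data:
        # Clamp to avoid log(0) when the curve saturates numerically.
        p = min(max(psi(x, alpha, beta, gamma, lam), eps), 1.0 - eps)
        ll += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return ll

def fit(data, alphas, betas, lams, gamma=0.5):
    """Constrained ML by grid search: lam is restricted to the supplied grid."""
    grid = ((a, b, l) for a in alphas for b in betas for l in lams)
    return max(grid, key=lambda params: log_likelihood(params, data, gamma))

# Noise-free data (expected counts) generated from known parameters.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]
data = [(x, 100 * psi(x, 1.0, 0.5, 0.5, 0.02), 100) for x in xs]
best = fit(data, [0.5, 1.0, 1.5], [0.25, 0.5, 1.0], [0.0, 0.02, 0.05])
```

With the lapse rate in the model, the upper asymptote is 1 - lam rather than 1, so a few stimulus-independent errors at high stimulus levels no longer distort the threshold and slope estimates.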

We define notions of stability for learning algorithms and show
how to use these notions to derive generalization error bounds
based on the empirical error and the leave-one-out error. The
methods we use can be applied in the regression framework as well
as in the classification one when the classifier is obtained by
thresholding a real-valued function. We study the stability
properties of large classes of learning algorithms such as
regularization based algorithms. In particular we focus on Hilbert
space regularization and Kullback-Leibler regularization. We
demonstrate how to apply the results to SVM for regression and
classification.

A number of methods for speeding up Gaussian Process (GP) prediction have been proposed, including the Nyström method of Williams and Seeger (2001). In this paper we focus on two issues: (1) the relationship of the Nyström method to the Subset of Regressors method (Poggio and Girosi, 1990; Luo and Wahba, 1997) and (2) understanding in what circumstances the Nyström approximation would be expected to provide a good approximation to exact GP regression.
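
For orientation, the two predictors can be written side by side in the forms commonly given in the GP literature. The notation is an assumption of this sketch and is not taken from the paper: n training inputs with targets y, a subset of m < n inputs, noise variance \sigma^2, and k_n(x_*), k_m(x_*) the kernel vectors between a test input and the full set or the subset.

```latex
\begin{align}
  \tilde{K} &= K_{nm} K_{mm}^{-1} K_{mn}
    && \text{(Nystr\"om approximation of the kernel matrix)} \\
  \mu_{\mathrm{Nys}}(x_*) &= k_n(x_*)^{\top} \bigl(\tilde{K} + \sigma^2 I\bigr)^{-1} y
    && \text{(Nystr\"om predictive mean)} \\
  \mu_{\mathrm{SR}}(x_*) &= k_m(x_*)^{\top} \bigl(K_{mn} K_{nm} + \sigma^2 K_{mm}\bigr)^{-1} K_{mn}\, y
    && \text{(Subset of Regressors predictive mean)}
\end{align}
```

Written this way, the Subset of Regressors mean is what results when the approximate kernel is also used for the test covariances, whereas the Nyström method keeps the exact vector k_n(x_*).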

The quantification of perfusion using dynamic susceptibility contrast MR imaging requires deconvolution to obtain the residual impulse-response function (IRF). Here, a method using a Gaussian process for deconvolution, GPD, is proposed. The fact that the IRF is smooth is incorporated as a constraint in the method. The GPD method, which automatically estimates the noise level in each voxel, has the advantage that model parameters are optimized automatically. The GPD is compared to singular value decomposition (SVD) using a common threshold for the singular values and to SVD using a threshold optimized according to the noise level in each voxel. The comparison is carried out using artificial data as well as data from healthy volunteers. It is shown that GPD is comparable to SVD with a variable, optimized threshold when determining the maximum of the IRF, which is directly related to the perfusion. GPD provides a better estimate of the entire IRF. As the signal-to-noise ratio increases or the time resolution of the measurements increases, GPD is shown to be superior to SVD. This is also found for large distribution volumes.

In this paper, we examine on-line learning problems in which the target
concept is allowed to change over time. In each trial a master algorithm
receives predictions from a large set of n experts. Its goal is to predict
almost as well as the best sequence of such experts chosen off-line by
partitioning the training sequence into k+1 sections and then choosing
the best expert for each section. We build on methods developed by
Herbster and Warmuth and consider an open problem posed by
Freund where the experts in the best partition are from a small
pool of size m.
Since k >> m, the best expert shifts back and forth
between the experts of the small pool.
We propose algorithms that solve
this open problem by mixing the past posteriors maintained by the master
algorithm. We relate the number of bits needed for encoding the best
partition to the loss bounds of the algorithms.
Instead of paying log n for
choosing the best expert in each section we first pay log (n choose m)
bits in the bounds for identifying the pool of m experts
and then log m bits per new section.
In the bounds we also pay twice for encoding the
boundaries of the sections.
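
The saving in the encoding cost can be checked with a few lines of arithmetic. This is only a sketch of the bit counts named in the abstract: it assumes that each of the k+1 sections costs log m bits once the pool is known, it leaves out the cost of encoding the section boundaries, and the function names are hypothetical.

```python
import math

def bits_naive(n, k):
    """Choose one expert out of n for each of the k+1 sections."""
    return (k + 1) * math.log2(n)

def bits_pooled(n, m, k):
    """Identify the pool of m experts once, then pay log m bits per section.

    Assumption: all k+1 sections are charged log m bits; boundary costs are omitted.
    """
    return math.log2(math.comb(n, m)) + (k + 1) * math.log2(m)

# Example: 1024 experts, a pool of 4, and 100 shifts.
naive = bits_naive(1024, 100)       # 101 sections at 10 bits each
pooled = bits_pooled(1024, 4, 100)  # pool identification plus 2 bits per section
```

When k is much larger than m, the one-time cost of identifying the pool is amortized over many sections, which is where the improvement comes from.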

In Proceedings of the 15th annual conference on Computational Learning Theory, 2002 (inproceedings)

Abstract

We investigate measures of complexity of function classes based on
continuity moduli of Gaussian and Rademacher processes.
For Gaussian processes, we obtain bounds on the continuity modulus on the
convex hull of a function class in terms of the same quantity for the class
itself.
We also obtain new bounds on generalization error in terms of localized
Rademacher complexities. This allows us to prove new results about
generalization performance for convex hulls in terms of characteristics of
the base class.
As a byproduct, we obtain a simple proof of some of the known bounds on the
entropy of convex hulls.

Detection performance was measured with sinusoidal and pulse-train gratings. Although the 2.09 c/deg pulse-train, or line, gratings contained at least 8 harmonics all at equal contrast, they were no more detectable than their most detectable component. The addition of broadband pink noise designed to equalize the detectability of the components of the pulse train made the pulse train about a factor of four more detectable than any of its components. However, in contrast-discrimination experiments, with a pedestal or masking grating of the same form and phase as the signal and 15% contrast, the noise did not affect the discrimination performance of the pulse train relative to that obtained with its sinusoidal components. We discuss the implications of these observations for models of early vision, in particular the implications for possible sources of internal noise.

The tangential neurons in the fly brain are sensitive to the typical optic flow patterns generated during self-motion (see example in Fig.1). We examine whether a simplified linear model of these neurons can be used to estimate self-motion from the optic flow. We present a theory for the construction of an optimal linear estimator incorporating prior knowledge both about the distance distribution of the environment, and about the noise and self-motion statistics of the sensor. The optimal estimator is tested on a gantry carrying an omnidirectional vision sensor that can be moved along three translational and one rotational degree of freedom. The experiments indicate that the proposed approach yields accurate results for rotation estimates, independently of the current translation and scene layout. Translation estimates, however, turned out to be sensitive to simultaneous rotation and to the particular distance distribution of the scene. The gantry experiments confirm that the receptive field organization of the tangential neurons allows them, as an ensemble, to extract self-motion from the optic flow.

The problem of automatically tuning multiple parameters for pattern recognition Support Vector Machines (SVM) is considered. This is done by minimizing some estimates of the generalization error of SVMs using a gradient descent algorithm over the set of parameters. The usual methods for choosing parameters, based on exhaustive search, become intractable as soon as the number of parameters exceeds two. Some experimental results assess the feasibility of our approach for a large number of parameters (more than 100) and demonstrate an improvement of generalization performance.

We present an extension to the Mixture of Experts (ME) model, where the individual experts are Gaussian Process (GP) regression models. Using an input-dependent adaptation of the Dirichlet Process, we implement a gating network for an infinite number of Experts. Inference in this model may be done efficiently using a Markov Chain relying on Gibbs sampling. The model allows the effective covariance function to vary with the inputs, and may handle large datasets -- thus potentially overcoming two of the biggest hurdles with GP models. Simulations show the viability of this approach.

Most visualization panels today are still built around cathode-ray tubes (CRTs), certainly on personal desktops at work and at home. Whilst capable of producing pleasing images for common applications ranging from email writing to TV and DVD presentation, it is as well to note that there are a number of nonlinear transformations between input (voltage) and output (luminance) which distort the digital and/or analogue images sent to a CRT. Some of them are input-independent and hence easy to fix, e.g. gamma correction, but others, such as pixel interactions, depend on the content of the input stimulus and are thus harder to compensate for. CRT-induced image distortions cause problems not only in basic vision research but also for applications where image fidelity is critical, most notably in medicine (digitization of X-ray images for diagnostic purposes) and in forms of online commerce, such as the online sale of images, where the image must be reproduced on some output device which will not have the same transfer function as the customer's CRT. I will present measurements from a number of CRTs and illustrate how some of their shortcomings may be problematic for the aforementioned applications.
