International peer-reviewed conferences and journals:

Abstract

Many different situations related to self-control involve competition between two routes to decisions: default and frugal versus more resource-intensive. Examples include habits versus deliberative decisions, fatigue versus cognitive effort, and Pavlovian versus instrumental decision making. We propose that these situations are linked by a strikingly similar core dilemma, pitting the opportunity costs of monopolizing shared resources such as executive functions for some time, against the possibility of obtaining a better outcome. We offer a unifying normative perspective on this underlying rational meta-optimization, review how this may tie together recent advances in many separate areas, and connect several independent models. Finally, we suggest that the crucial mechanisms and meta-decision variables may be shared across domains.

Abstract

Invariant representations in object recognition systems
are generally obtained by pooling feature vectors over spatially local neighborhoods. But pooling is not local in the
feature vector space, so that widely dissimilar features may
be pooled together if they are in nearby locations. Recent
approaches rely on sophisticated encoding methods and
more specialized codebooks (or dictionaries), e.g., learned
on subsets of descriptors which are close in feature space, to
circumvent this problem. In this work, we argue that a common trait found in much recent work in image recognition or
retrieval is that it leverages locality in feature space on top
of purely spatial locality. We propose to apply this idea in its
simplest form to an object recognition system based on the
spatial pyramid framework, to increase the performance of
small dictionaries with very little added engineering. State-of-the-art results on several object recognition benchmarks
show the promise of this approach.
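The idea of locality in feature space can be made concrete with a toy NumPy sketch (sizes and variable names are invented for illustration, not taken from the paper): descriptors are first binned by their nearest center in feature space, and pooling is then performed only within each bin, so widely dissimilar descriptors are never pooled together.

```python
import numpy as np

rng = np.random.default_rng(5)

D, N, P = 2, 100, 4                   # descriptor dim, descriptors, feature-space bins
descriptors = rng.normal(size=(N, D))
centers = rng.normal(size=(P, D))     # stand-in for learned feature-space cluster centers

# Assign each descriptor to its nearest center, then pool only within
# each feature-space bin.
assign = ((descriptors[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
pooled = np.stack([descriptors[assign == p].max(0) if (assign == p).any()
                   else np.zeros(D) for p in range(P)])
print(pooled.shape)                   # one pooled vector per feature-space bin
```

In an actual recognition pipeline the same idea is applied on top of spatial binning, so each pool is local both in the image and in feature space.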

Abstract

Affective valence lies on a spectrum ranging from punishment to reward. The coding of such spectra in the brain almost
always involves opponency between pairs of systems or structures. There is ample evidence for the role of dopamine in the
appetitive half of this spectrum, but little agreement about the existence, nature, or role of putative aversive opponents such
as serotonin. In this review, we consider the structure of opponency in terms of previous biases about the nature of the
decision problems that animals face, the conflicts that may thus arise between Pavlovian and instrumental responses, and an
additional spectrum joining invigoration to inhibition. We use this analysis to shed light on aspects of the role of serotonin and
its interactions with dopamine.
Keywords: dopamine; norepinephrine; opponency; psychiatry and behavioral sciences; reinforcement learning; serotonin

Abstract

We propose an unsupervised method for learning multi-stage hierarchies of sparse
convolutional features. While sparse coding has become an increasingly popular
method for learning visual features, it is most often trained at the patch level.
Applying the resulting filters convolutionally results in highly redundant codes
because overlapping patches are encoded in isolation. By training convolutionally
over large image windows, our method reduces the redundancy between feature
vectors at neighboring locations and improves the efficiency of the overall
representation. In addition to a linear decoder that reconstructs the image
from sparse features, our method trains an efficient feed-forward encoder that
predicts quasi-sparse features from the input. While patch-based training rarely
produces anything but oriented edge detectors, we show that convolutional
training produces highly diverse filters, including center-surround filters,
corner detectors, cross detectors, and oriented grating detectors. We show that
using these filters in a multi-stage convolutional network architecture improves
performance on a number of visual recognition and detection tasks.
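The redundancy of patch-wise encoding can be illustrated with a toy one-dimensional example (the sizes are invented for the sketch): extracting every overlapping window and encoding each in isolation re-represents each interior sample once per window that contains it.

```python
import numpy as np

signal = np.arange(10.0)           # 1-D stand-in for an image row
patch = 4                          # patch size

# Patch-based view: every overlapping window is extracted and encoded
# independently of its neighbors.
patches = np.stack([signal[i:i + patch]
                    for i in range(len(signal) - patch + 1)])

# Each interior sample appears in `patch` different windows, so any
# patch-wise code represents it (at least) `patch` times over.
print(patches.shape)               # 7 windows of 4 values encode 10 samples
```

Convolutional training instead fits the dictionary to explain the whole window at once, so neighboring codes need not each re-explain the shared samples.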

Abstract

Many modern visual recognition algorithms incorporate a step of spatial ‘pooling’, where the
outputs of several nearby feature detectors are
combined into a local or global ‘bag of features’,
in a way that preserves task-related information
while removing irrelevant details. Pooling is
used to achieve invariance to image transformations, more compact representations, and better
robustness to noise and clutter. Several papers
have shown that the details of the pooling operation can greatly influence the performance, but
studies have so far been purely empirical. In this
paper, we show that the reasons underlying the
performance of various pooling methods are obscured by several confounding factors, such as
the link between the sample cardinality in a spatial pool and the resolution at which low-level
features have been extracted. We provide a detailed theoretical analysis of max
pooling and average pooling, and give extensive empirical comparisons for
object recognition tasks.
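The contrast between the two operators can be sketched in a few lines of NumPy (a toy illustration, not the paper's experimental code): for a sparsely firing feature, average pooling dilutes the activation by the many inactive locations in the pool, while max pooling reports whether the feature fired anywhere in the pool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: activations of one feature at 100 spatial locations.
# The feature fires rarely but strongly (sparse activation).
activations = np.zeros(100)
activations[rng.choice(100, size=3, replace=False)] = 1.0

avg_pool = activations.mean()   # diluted by the many zeros
max_pool = activations.max()    # detects the feature regardless of rarity

print(avg_pool, max_pool)
```

The abstract's point about sample cardinality is visible here: enlarging the pool while the feature stays rare drives the average toward zero but leaves the max unchanged.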

Abstract

Many successful models for scene or object recognition
transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations
of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a
representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger
neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal
of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several
types of coding modules (hard and soft vector quantization,
sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We
show how to improve the best performing coding scheme by
learning a supervised discriminative dictionary for sparse
coding. We provide theoretical and empirical insight into
the remarkable performance of max pooling. By teasing
apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better
recognition architectures.
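As a minimal sketch of the two-step scheme (all names and sizes here are invented for illustration), hard vector quantization codes each descriptor as a one-hot indicator of its nearest codeword, and pooling then aggregates the codes over a region by averaging or by taking the maximum:

```python
import numpy as np

rng = np.random.default_rng(1)

K, D, N = 8, 16, 50                  # codebook size, descriptor dim, descriptors per region
codebook = rng.normal(size=(K, D))   # stand-in for a learned dictionary
descriptors = rng.normal(size=(N, D))

# Coding step: hard vector quantization -> one-hot code per descriptor.
dists = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = np.zeros((N, K))
codes[np.arange(N), dists.argmin(1)] = 1.0

# Pooling step: summarize the codes over the whole region.
avg_pooled = codes.mean(0)   # histogram of codeword frequencies (bag of features)
max_pooled = codes.max(0)    # 1 if the codeword was used anywhere in the region

print(avg_pooled.sum())      # frequencies sum to 1
```

Soft quantization and sparse coding swap in different coding steps while leaving the pooling step unchanged, which is what makes the cross evaluation in the abstract possible.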

Abstract

Unsupervised learning algorithms aim to discover the structure hidden in the data,
and to learn representations that are more suitable as input to a supervised machine
than the raw input. Many unsupervised methods are based on reconstructing the
input from the representation, while constraining the representation to have
certain desirable properties (e.g. low dimension, sparsity, etc). Others are
based on approximating density by stochastically reconstructing the input from
the representation. We describe a novel and efficient algorithm to learn sparse
representations, and compare it theoretically and experimentally with a similar
machine trained probabilistically, namely a Restricted Boltzmann Machine. We
propose a
simple criterion to compare and select different unsupervised machines based on
the trade-off between the reconstruction error and the information content of the
representation. We demonstrate this method by extracting features from a dataset
of handwritten numerals, and from a dataset of natural image patches. We show
that by stacking multiple levels of such machines and by training sequentially,
high-order dependencies between the observed input variables can be captured.
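A standard way to realize the reconstruction-plus-sparsity idea (sketched here with ISTA, a generic proximal-gradient solver, not necessarily the algorithm of the abstract) is to infer a code z minimizing ||x - Wd z||² + λ||z||₁ for a fixed dictionary Wd:

```python
import numpy as np

rng = np.random.default_rng(2)

D, K = 20, 40                         # input dim, number of dictionary atoms
Wd = rng.normal(size=(D, K))
Wd /= np.linalg.norm(Wd, axis=0)      # unit-norm atoms
x = rng.normal(size=D)

lam = 0.5
L = np.linalg.norm(Wd, 2) ** 2        # Lipschitz constant of the smooth part
z = np.zeros(K)
for _ in range(200):                  # ISTA: gradient step + soft-thresholding
    grad = Wd.T @ (Wd @ z - x)
    z = z - grad / L
    z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

print(np.count_nonzero(z), "of", K, "coefficients active")
```

The soft-thresholding step is what produces exact zeros, so the resulting code is sparse rather than merely small; a trained feed-forward encoder can then be fit to predict this code cheaply.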

Abstract

We present an unsupervised method for learning a hierarchy of sparse feature detectors that are invariant to small
shifts and distortions. The resulting feature extractor consists of multiple
convolution filters, followed by a pointwise sigmoid non-linearity, and a
feature-pooling layer that computes the max of each filter output within
adjacent windows. A second level of larger and more invariant features is
obtained by training the same algorithm on patches of features from the first
level. Training a supervised classifier on these features yields 0.64% error on
MNIST, and 54% average recognition rate on Caltech 101
with 30 training samples per category. While the resulting architecture is similar to convolutional networks, the
layer-wise unsupervised training procedure alleviates the
over-parameterization problems that plague purely supervised learning procedures, and yields good performance
with very few labeled training samples.

Abstract

We introduce a view of unsupervised learning that integrates probabilistic and nonprobabilistic methods for clustering, dimensionality reduction, and feature extraction in
a unified framework. In this framework, an
energy function associates low energies to input points that are similar to training samples, and high energies to unobserved points.
Learning consists in minimizing the energies
of training samples while ensuring that the
energies of unobserved ones are higher. Some
traditional methods construct the architecture so that only a small number of points
can have low energy, while other methods
explicitly “pull up” on the energies of unobserved points. In probabilistic methods the
energy of unobserved points is pulled by minimizing the log partition function, an expensive, and sometimes intractable process. We
explore different and more efficient methods
using an energy-based approach. In particular, we show that a simple solution is to restrict the amount of information contained
in codes that represent the data. We demonstrate such a method by training it on natural image patches and by applying it to image
denoising.
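In the simplest instantiation of this view (a toy sketch, not the paper's model), the energy of a point is its reconstruction error under a limited-capacity code: training pulls energies down on the data, and the restricted code automatically leaves energies high on unobserved points, with no partition function to minimize.

```python
import numpy as np

rng = np.random.default_rng(4)

# Training data living exactly in a 2-D subspace of a 10-D space.
basis = rng.normal(size=(2, 10))
data = rng.normal(size=(200, 2)) @ basis

# Limited-capacity code: project onto the top-2 principal directions.
_, _, Vt = np.linalg.svd(data, full_matrices=False)
P = Vt[:2]                 # only a thin set of points can be reconstructed

def energy(x):
    """Reconstruction error: low on points the code can represent."""
    return np.sum((x - (x @ P.T) @ P) ** 2, axis=-1)

on_data = energy(data).mean()
off_data = energy(rng.normal(size=(200, 10))).mean()
print(on_data, off_data)   # data energies near zero, random points much higher
```

Restricting the information content of the code plays the role that the log partition function plays in probabilistic methods: it keeps the energy surface from collapsing to zero everywhere.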

Thesis:

Abstract

Telling a cow from a sheep is effortless for most animals, but requires much engineering for computers.
In this thesis, we seek to tease out basic principles that underlie many recent advances in image recognition.
First, we recast many methods into a common unsupervised feature extraction framework based on an alternation
of coding steps, which encode the input by comparing it with a collection of reference patterns, and pooling steps,
which compute an aggregation statistic summarizing the codes within some region of interest of the image.
Within that framework, we conduct extensive comparative evaluations of many coding or pooling operators proposed
in the literature. Our results demonstrate a robust superiority of sparse coding (which decomposes an input as a linear
combination of a few visual words) and max pooling (which summarizes a set of inputs by their maximum value).
We also propose macrofeatures, which import into the popular spatial pyramid framework the joint encoding of nearby
features commonly practiced in neural networks, and obtain significantly improved image recognition performance.
Next, we analyze the statistical properties of max pooling that underlie its better performance, through a simple theoretical
model of feature activation. We then present results of experiments that confirm many predictions of the model.
Beyond the pooling operator itself, an important parameter is the set of pools over which the summary statistic is computed.
We propose locality in feature configuration space as a natural criterion for devising better pools. Finally, we propose ways
to make coding faster and more powerful through fast convolutional feedforward architectures, and examine how to
incorporate supervision into feature extraction schemes. Overall, our experiments offer insights into what makes current
systems work so well, and state-of-the-art results on several image recognition benchmarks.
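The spatial pyramid framework that recurs throughout the thesis can be sketched as follows (a schematic with invented sizes, not the thesis code): codes are pooled not only globally, but over the cells of successively finer grids, and the pooled vectors are concatenated into the final image representation.

```python
import numpy as np

rng = np.random.default_rng(3)

H = W = 8                            # spatial grid of coded locations
K = 16                               # code dimension (codebook size)
codes = rng.random((H, W, K))

def pyramid_max_pool(codes, levels=(1, 2, 4)):
    """Max-pool codes over each cell of a 1x1, 2x2, 4x4 grid and concatenate."""
    feats = []
    H, W, _ = codes.shape
    for g in levels:
        hs, ws = H // g, W // g
        for i in range(g):
            for j in range(g):
                cell = codes[i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
                feats.append(cell.max(axis=(0, 1)))
    return np.concatenate(feats)

v = pyramid_max_pool(codes)
print(v.shape)                       # (1 + 4 + 16) * K = 336 dimensions
```

Coarse cells retain invariance to position while fine cells preserve rough spatial layout, which is why the concatenated vector outperforms a single global pool.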