Papers

We address an important problem in sequence-to-sequence (Seq2Seq) learning
referred to as copying, in which certain segments in the input sequence are
selectively replicated in the output sequence. A similar phenomenon is
observable in human language communication. For example, humans tend to repeat
entity names or even long phrases in conversation. The challenge with regard to
copying in Seq2Seq is that new machinery is needed to decide when to perform
the operation. In this paper, we incorporate copying into neural network-based
Seq2Seq learning and propose a new model, called CopyNet, with an
encoder-decoder structure. CopyNet nicely integrates the regular way of
generating words in the decoder with a new copying mechanism that selects
sub-sequences from the input sequence and places them at the proper
positions in the output sequence. Our empirical study on both synthetic
and real-world data sets demonstrates the efficacy of CopyNet. For
example, CopyNet outperforms regular RNN-based models by remarkable
margins on text summarization tasks.
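
As a rough illustration of the copying mechanism, the following numpy
sketch mixes a generation distribution over the vocabulary with a copy
distribution over source positions under one shared softmax (a minimal
sketch with invented scores and vocabulary, not the authors'
implementation):

    import numpy as np

    def copynet_output_dist(gen_scores, copy_scores, src_token_ids, vocab_size):
        """Mix CopyNet-style generate and copy modes into one output
        distribution over the target vocabulary.
        gen_scores:    unnormalized scores over the vocabulary
        copy_scores:   unnormalized scores over the source positions
        src_token_ids: vocabulary id of each source token"""
        # One shared softmax over both score sets lets the decoder trade
        # off generating a word against copying it from the input.
        scores = np.concatenate([gen_scores, copy_scores])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        p_vocab = probs[:vocab_size].copy()
        # Copy mass is scattered back onto the vocabulary ids of the
        # source tokens; repeated source tokens accumulate mass.
        for pos, tok in enumerate(src_token_ids):
            p_vocab[tok] += probs[vocab_size + pos]
        return p_vocab

    # Toy example: 6-word vocabulary, 3-token source sentence.
    p = copynet_output_dist(np.random.randn(6), np.random.randn(3),
                            src_token_ids=[4, 2, 4], vocab_size=6)
    print(p.sum())  # ~1.0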

Many real world graphs, such as the graphs of molecules, exhibit structure at
multiple different scales, but most existing kernels between graphs are either
purely local or purely global in character. In contrast, by building a
hierarchy of nested subgraphs, the Multiscale Laplacian Graph kernels (MLG
kernels) that we define in this paper can account for structure at a range of
different scales. At the heart of the MLG construction is another new graph
kernel, called the Feature Space Laplacian Graph kernel (FLG kernel), which has
the property that it can lift a base kernel defined on the vertices of two
graphs to a kernel between the graphs. The MLG kernel applies such FLG kernels
to subgraphs recursively. To make the MLG kernel computationally feasible, we
also introduce a randomized projection procedure, similar to the Nystr\"om
method, but for RKHS operators.
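
As an illustrative simplification (not the FLG kernel itself), the numpy
sketch below treats each graph as a zero-mean Gaussian whose covariance is
a regularized inverse Laplacian and compares two graphs by their
Bhattacharyya overlap; this toy version requires equal vertex counts, a
restriction the FLG construction removes by working in feature space:

    import numpy as np

    def laplacian(adj):
        """Unnormalized graph Laplacian L = D - A."""
        return np.diag(adj.sum(axis=1)) - adj

    def laplacian_graph_kernel(adj1, adj2, eta=0.1):
        """Bhattacharyya overlap of N(0, S_i), S_i = (L_i + eta*I)^{-1}."""
        S1 = np.linalg.inv(laplacian(adj1) + eta * np.eye(len(adj1)))
        S2 = np.linalg.inv(laplacian(adj2) + eta * np.eye(len(adj2)))
        avg = (S1 + S2) / 2.0
        return (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
                / np.linalg.det(avg) ** 0.5)

    A1 = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)  # star
    A2 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # path
    print(laplacian_graph_kernel(A1, A2))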

We study nonconvex finite-sum problems and analyze stochastic variance
reduced gradient (SVRG) methods for them. SVRG and related methods have
recently surged into prominence for convex optimization given their edge over
stochastic gradient descent (SGD); but their theoretical analysis almost
exclusively assumes convexity. In contrast, we prove non-asymptotic rates of
convergence (to stationary points) of SVRG for nonconvex optimization, and show
that it is provably faster than SGD and gradient descent. We also analyze a
subclass of nonconvex problems on which SVRG attains linear convergence to the
global optimum. We extend our analysis to mini-batch variants of SVRG, showing
(theoretical) linear speedup due to mini-batching in parallel settings.
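
For reference, the SVRG estimator analyzed here has a simple generic form:
once per epoch a full gradient is computed at a snapshot, and each inner
step corrects a stochastic gradient using that snapshot. A minimal numpy
sketch (generic SVRG with placeholder step sizes, not the paper's tuned
parameters):

    import numpy as np

    def svrg(grad_fi, x0, n, step=0.05, epochs=20, rng=None):
        """Minimal SVRG for f(x) = (1/n) * sum_i f_i(x); grad_fi(i, x)
        returns the gradient of the i-th component at x."""
        rng = rng or np.random.default_rng(0)
        x = np.array(x0, dtype=float)
        for _ in range(epochs):
            snapshot = x.copy()
            # Full gradient, computed once per epoch at the snapshot.
            full_grad = sum(grad_fi(i, snapshot) for i in range(n)) / n
            for _ in range(n):  # one pass of inner steps
                i = rng.integers(n)
                # Variance-reduced estimate: unbiased, and its variance
                # shrinks as x approaches the snapshot.
                g = grad_fi(i, x) - grad_fi(i, snapshot) + full_grad
                x -= step * g
        return x

    # Toy least-squares instance to exercise the routine.
    A = np.random.default_rng(1).normal(size=(50, 5))
    b = np.ones(50)
    gfi = lambda i, x: 2 * A[i] * (A[i] @ x - b[i])
    print(svrg(gfi, np.zeros(5), n=50))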

Probabilistic inference algorithms such as Sequential Monte Carlo (SMC)
provide powerful tools for constraining procedural models in computer graphics,
but they require many samples to produce desirable results. In this paper, we
show how to create procedural models which learn how to satisfy constraints. We
augment procedural models with neural networks which control how the model
makes random choices based on the output it has generated thus far. We call
such models neurally-guided procedural models. As a pre-computation, we train
these models to maximize the likelihood of example outputs generated via SMC.
They are then used as efficient SMC importance samplers, generating
high-quality results with very few samples. We evaluate our method on
L-system-like models with image-based constraints. Given a desired quality
threshold, neurally-guided models can generate satisfactory results up to 10x
faster than unguided models.
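
The sampler structure this relies on can be summarized in a few lines:
draw each random choice from the learned guide instead of the model's
prior, correct with an importance weight, and resample. The sketch below
uses hypothetical callback interfaces (prior_logp, guide_sample,
guide_logp, constraint_logp) purely for illustration:

    import numpy as np

    def guided_smc(prior_logp, guide_sample, guide_logp, constraint_logp,
                   n_particles=20, n_steps=5, rng=None):
        """Sequential importance sampling with a learned proposal.
          guide_sample(trace)    -> next random choice from the guide
          guide_logp(trace, c)   -> log q(c | trace)
          prior_logp(trace, c)   -> log p(c | trace) under the model
          constraint_logp(trace) -> constraint log-likelihood so far"""
        rng = rng or np.random.default_rng(0)
        traces = [[] for _ in range(n_particles)]
        for _ in range(n_steps):
            logw = np.zeros(n_particles)
            for k in range(n_particles):
                c = guide_sample(traces[k])
                # Importance weight corrects for proposing from q, not p.
                logw[k] = prior_logp(traces[k], c) - guide_logp(traces[k], c)
                traces[k].append(c)
                logw[k] += constraint_logp(traces[k])
            # Resample particles in proportion to their weights.
            w = np.exp(logw - logw.max())
            idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
            traces = [list(traces[i]) for i in idx]
        return traces

    # Toy usage: standard-normal choices; the constraint prefers traces
    # whose running sum stays near 3.
    rng = np.random.default_rng(1)
    logpdf = lambda c: -0.5 * c * c
    out = guided_smc(prior_logp=lambda tr, c: logpdf(c),
                     guide_sample=lambda tr: rng.normal(),
                     guide_logp=lambda tr, c: logpdf(c),
                     constraint_logp=lambda tr: -abs(sum(tr) - 3.0))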

We introduce $\mathtt{Katyusha}$, the first direct, primal-only stochastic
gradient method that has a provably accelerated convergence rate in convex
optimization. In contrast, previous methods are either based on dual
coordinate descent, which is more restrictive, or on outer-inner loops,
which make them "blind" to the underlying stochastic nature of the
optimization process.
$\mathtt{Katyusha}$ is the first algorithm that incorporates acceleration
directly into stochastic gradient updates.
Unlike previous results, $\mathtt{Katyusha}$ obtains an optimal convergence
rate. It also supports proximal updates, non-Euclidean norm smoothness,
non-uniform sampling, and mini-batch sampling. When applied to interesting
classes of convex objectives, including smooth objectives (e.g., Lasso,
Logistic Regression), strongly-convex objectives (e.g., SVM), and non-smooth
objectives (e.g., L1SVM), $\mathtt{Katyusha}$ improves the best known
convergence rates.
The main ingredient behind our result is $\textit{Katyusha momentum}$, a
novel "negative momentum on top of momentum" that can be incorporated into a
variance-reduction based algorithm and speed it up. As a result, since
variance reduction has been successfully applied to a fast-growing list of
practical problems, our paper suggests that in each such case, one had
better hurry
up and give Katyusha a hug.
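
Schematically, one inner iteration couples an SVRG-style gradient
estimator with a three-point momentum coupling; the numpy sketch below
shows the shape of the update (tau1, tau2, alpha, and L are placeholders,
not the paper's tuned values):

    import numpy as np

    def katyusha_step(y, z, x_tilde, full_grad, grad_fi, i,
                      tau1=0.3, tau2=0.5, alpha=0.1, L=1.0):
        """One inner step of Katyusha-style accelerated SVRG. x_tilde
        and full_grad are the snapshot point and its full gradient,
        refreshed periodically as in SVRG."""
        # Three-point coupling; the tau2 * x_tilde term is the "negative
        # momentum" that keeps the iterate anchored to the snapshot.
        x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
        # SVRG-style variance-reduced gradient estimator.
        g = grad_fi(i, x) - grad_fi(i, x_tilde) + full_grad
        z_new = z - alpha * g       # mirror-descent-style long step
        y_new = x - g / (3 * L)     # gradient-style short step
        return y_new, z_new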

This paper presents a novel approach to recurrent neural network (RNN)
regularization. Unlike the widely adopted dropout method, which is
applied to \textit{forward} connections of feed-forward architectures or RNNs,
we propose to drop neurons directly in \textit{recurrent} connections in a way
that does not cause loss of long-term memory. Our approach is as easy to
implement and apply as the regular feed-forward dropout and we demonstrate its
effectiveness for Long Short-Term Memory networks, the most popular type
of RNN cell. Our experiments on NLP benchmarks show consistent
improvements even when
combined with conventional feed-forward dropout.
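
To illustrate the placement that avoids memory loss, the numpy sketch
below applies dropout only to the candidate cell update g in one LSTM
step, so the forget-gated memory c is never zeroed (a sketch of the idea,
assuming a standard gate layout):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step_recurrent_dropout(x, h_prev, c_prev, W, U, b,
                                    p_drop=0.3, rng=None, train=True):
        """One LSTM step; W is (4H, input_dim), U is (4H, H), b is (4H,).
        Dropout hits only the candidate update g, not the cell state."""
        H = h_prev.shape[0]
        gates = W @ x + U @ h_prev + b
        i = sigmoid(gates[:H])            # input gate
        f = sigmoid(gates[H:2 * H])       # forget gate
        o = sigmoid(gates[2 * H:3 * H])   # output gate
        g = np.tanh(gates[3 * H:])        # candidate update
        if train:
            rng = rng or np.random.default_rng(0)
            mask = rng.random(H) > p_drop
            g = g * mask / (1.0 - p_drop)  # inverted dropout on g only
        c = f * c_prev + i * g             # long-term memory kept intact
        h = o * np.tanh(c)
        return h, c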

Humans have an impressive ability to reason about new concepts and
experiences from just a single example. In particular, humans have an ability
for one-shot generalization: an ability to encounter a new concept, understand
its structure, and then be able to generate compelling alternative variations
of the concept. We endow machine learning systems with this important
capacity by developing new deep generative models, models that combine the
representational power of deep learning with the inferential power of Bayesian
reasoning. We develop a class of sequential generative models that are built on
the principles of feedback and attention. These two characteristics lead to
generative models that are among the state-of-the-art in density estimation and
image generation. We demonstrate the one-shot generalization ability of our
models using three tasks: unconditional sampling, generating new exemplars of a
given concept, and generating new exemplars of a family of concepts. In all
cases our models are able to generate compelling and diverse samples---having
seen new examples just once---providing an important class of general-purpose
models for one-shot machine learning.

A significant weakness of most current deep Convolutional Neural Networks
is the need to train them using vast amounts of manually labelled data. In
this work we propose an unsupervised framework to learn a deep
convolutional neural network for single-view depth prediction, without
requiring a pre-training stage or annotated ground-truth depths. We
achieve this by training the network in a manner analogous to an
autoencoder. At training time we consider a pair of images, source and
target, with a small, known camera motion between the two, such as a
stereo pair. We train the convolutional encoder for the task of predicting
the depth map for the source image. To do so, we explicitly generate an
inverse warp of the target image using the predicted depth and known
inter-view displacement, to reconstruct the source image; the photometric
error in the reconstruction is the reconstruction loss for the encoder.
The acquisition of this training data is considerably simpler than for
equivalent systems, requiring no manual annotation, nor calibration of a
depth sensor to the camera. We show that our network, trained on less than
half of the KITTI dataset (without any further augmentation), gives
comparable performance to that of state-of-the-art supervised methods for
single-view depth estimation.
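
The training signal can be illustrated in one dimension: predicted depth
gives a disparity, the target row is inversely warped by that disparity
with differentiable linear interpolation, and the photometric error
against the source row is the loss. A toy numpy sketch (1-D, with an
assumed sign convention; real systems warp in 2-D with bilinear sampling):

    import numpy as np

    def photometric_loss(depth, src_row, tgt_row, focal, baseline):
        """Toy 1-D warping loss for a rectified stereo pair.
        depth: predicted depth per source pixel, shape (W,)
        src_row, tgt_row: corresponding image rows, shape (W,)"""
        W = depth.shape[0]
        disparity = focal * baseline / depth        # pixel shift per depth
        coords = np.clip(np.arange(W) - disparity, 0, W - 1)
        lo = np.floor(coords).astype(int)
        hi = np.minimum(lo + 1, W - 1)
        frac = coords - lo
        # Linear interpolation = differentiable sampling of the target.
        recon = (1 - frac) * tgt_row[lo] + frac * tgt_row[hi]
        # The photometric error is the training loss for the encoder.
        return np.mean((recon - src_row) ** 2)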

State-of-the-art results of semantic segmentation are established by Fully
Convolutional Networks (FCNs). FCNs rely on cascaded convolutional and
pooling layers to gradually enlarge the receptive fields of neurons, resulting
in an indirect way of modeling the distant contextual dependence. In this work,
we advocate the use of spatially recurrent layers (i.e. ReNet layers) which
directly capture global contexts and lead to improved feature representations.
We demonstrate the effectiveness of ReNet layers by building a Naive deep ReNet
(N-ReNet), which achieves competitive performance on the Stanford
Background dataset. Furthermore, we integrate ReNet layers with FCNs and
develop a novel
Hybrid deep ReNet (H-ReNet). It enjoys a few remarkable properties, including
full-image receptive fields, end-to-end training, and efficient network
execution. On the PASCAL VOC 2012 benchmark, the H-ReNet improves the results
of state-of-the-art approaches Piecewise, CRFasRNN and DeepParsing by 3.6%,
2.3% and 0.2%, respectively, and achieves the highest IoUs for 13 out of the 20
object classes.
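
The basic operation of a spatially recurrent layer is a sweep of an RNN
across the feature map; the numpy sketch below shows a single
left-to-right pass with a plain tanh RNN (the actual ReNet layer uses
gated units, sweeps in opposite directions, and stacks a vertical pass so
every output depends on the whole image):

    import numpy as np

    def renet_sweep(feat, Wx, Wh, b):
        """One horizontal sweep over a feature map.
        feat: (H, W, C); Wx: (D, C); Wh: (D, D); b: (D,).
        Returns an (H, W, D) map where each output at column w depends
        on all columns 0..w of its row."""
        H, W, C = feat.shape
        D = b.shape[0]
        out = np.zeros((H, W, D))
        for r in range(H):
            h = np.zeros(D)
            for col in range(W):   # left-to-right recurrence per row
                h = np.tanh(Wx @ feat[r, col] + Wh @ h + b)
                out[r, col] = h
        return out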

Gatys et al. recently demonstrated that deep networks can generate beautiful
textures and stylized images from a single texture example. However, their
method requires a slow and memory-consuming optimization process. We propose
here an alternative approach that moves the computational burden to a learning
stage. Given a single example of a texture, our approach trains compact
feed-forward convolutional networks to generate multiple samples of the same
texture of arbitrary size and to transfer artistic style from a given image to
any other image. The resulting networks are remarkably light-weight and can
generate textures of quality comparable to Gatys~et~al., but hundreds of times
faster. More generally, our approach highlights the power and flexibility of
generative feed-forward models trained with complex and expressive loss
functions.
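
The expressive loss in question is, at its core, a match between Gram
matrices of deep features; a minimal numpy sketch of that texture loss
(plain arrays stand in for network activations here):

    import numpy as np

    def gram_matrix(features):
        """Gram matrix of a feature map (H, W, C): channel co-occurrence
        statistics that characterize texture, as in Gatys et al."""
        H, W, C = features.shape
        F = features.reshape(H * W, C)
        return F.T @ F / (H * W)

    def texture_loss(gen_feats, ref_feats):
        """Squared Frobenius distance between Gram matrices, summed over
        layers; the feed-forward generator is trained to minimize this
        instead of running a slow per-image optimization."""
        return sum(np.sum((gram_matrix(g) - gram_matrix(r)) ** 2)
                   for g, r in zip(gen_feats, ref_feats))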

Neural network architectures with memory and attention mechanisms exhibit
certain reasoning capabilities required for question answering. One such
architecture, the dynamic memory network (DMN), obtained high accuracy on a
variety of language tasks. However, it was not shown whether the architecture
achieves strong results for question answering when supporting facts are not
marked during training or whether it could be applied to other modalities such
as images. Based on an analysis of the DMN, we propose several improvements to
its memory and input modules. Together with these changes we introduce a novel
input module for images in order to be able to answer visual questions. Our new
DMN+ model improves the state of the art on both the Visual Question Answering
dataset and the \babi-10k text question-answering dataset without supporting
fact supervision.

State-of-the-art named entity recognition systems rely heavily on
hand-crafted features and domain-specific knowledge in order to learn
effectively from the small, supervised training corpora that are available. In
this paper, we introduce two new neural architectures---one based on
bidirectional LSTMs and conditional random fields, and the other that
constructs and labels segments using a transition-based approach inspired by
shift-reduce parsers. Our models rely on two sources of information about
words: character-based word representations learned from the supervised corpus
and unsupervised word representations learned from unannotated corpora. Our
models obtain state-of-the-art performance in NER in four languages without
resorting to any language-specific knowledge or resources such as gazetteers.
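
For the first architecture, decoding under the conditional random field
reduces to Viterbi search over tag sequences; a standard decoder sketch in
numpy (not the authors' code):

    import numpy as np

    def viterbi_decode(emissions, transitions):
        """Best tag sequence under a linear-chain CRF.
        emissions:   (T, K) per-token tag scores from the BiLSTM
        transitions: (K, K) score of moving from tag i to tag j"""
        T, K = emissions.shape
        score = emissions[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            # score[i] + transitions[i, j] + emissions[t, j], max over i
            cand = score[:, None] + transitions + emissions[t]
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        tags = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            tags.append(int(back[t, tags[-1]]))
        return tags[::-1]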

Many real-world applications can be described as large-scale games of
imperfect information. To deal with these challenging domains, prior work has
focused on computing Nash equilibria in a handcrafted abstraction of the
domain. In this paper we introduce the first scalable end-to-end approach to
learning approximate Nash equilibria without prior domain knowledge. Our method
combines fictitious self-play with deep reinforcement learning. When applied to
Leduc poker, Neural Fictitious Self-Play (NFSP) approached a Nash equilibrium,
whereas common reinforcement learning methods diverged. In Limit Texas Hold'em,
a poker game of real-world scale, NFSP learnt a strategy that approached the
performance of state-of-the-art, superhuman algorithms based on significant
domain expertise.
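
Schematically, the combination amounts to mixing two policies at decision
time: a best-response policy learned by reinforcement learning and an
average policy learned by supervised learning. The sketch below uses
hypothetical q_policy/avg_policy callables purely for illustration:

    import numpy as np

    def nfsp_act(state, q_policy, avg_policy, eta=0.1, rng=None):
        """Anticipatory action selection: with probability eta act with
        the best-response (RL) policy, otherwise with the average
        (supervised) policy; the returned tag says which memory the
        transition would feed."""
        rng = rng or np.random.default_rng()
        if rng.random() < eta:
            return q_policy(state), "rl"
        return avg_policy(state), "sl"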

Model-free reinforcement learning has been successfully applied to a range of
challenging problems, and has recently been extended to handle large neural
network policies and value functions. However, the sample complexity of
model-free algorithms, particularly when using high-dimensional function
approximators, tends to limit their applicability to physical systems. In this
paper, we explore algorithms and representations to reduce the sample
complexity of deep reinforcement learning for continuous control tasks. We
propose two complementary techniques for improving the efficiency of such
algorithms. First, we derive a continuous variant of the Q-learning algorithm,
which we call normalized advantage functions (NAF), as an alternative to
the more commonly used policy gradient and actor-critic methods. The NAF
representation
allows us to apply Q-learning with experience replay to continuous tasks, and
substantially improves performance on a set of simulated robotic control tasks.
To further improve the efficiency of our approach, we explore the use of
learned models for accelerating model-free reinforcement learning. We show that
iteratively refitted local linear models are especially effective for this, and
demonstrate substantially faster learning on domains where such models are
applicable.
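
The key representational trick is to constrain the advantage to be a
negative-definite quadratic in the action, so the Q-maximizing action is
available in closed form; a numpy sketch (schematic: mu, the Cholesky
entries, and V stand in for network heads evaluated at the state):

    import numpy as np

    def naf_q_value(a, mu, L_vec, V, dim):
        """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P built
        from a lower-triangular factor so it is positive definite; the
        greedy action is therefore simply a = mu."""
        L = np.zeros((dim, dim))
        L[np.tril_indices(dim)] = L_vec
        # Exponentiate the diagonal to keep it positive.
        L[np.diag_indices(dim)] = np.exp(np.diag(L))
        P = L @ L.T
        d = a - mu
        return V - 0.5 * d @ P @ d

    # Toy 2-D action: L_vec has dim*(dim+1)/2 = 3 entries.
    print(naf_q_value(np.array([0.5, -0.2]), np.zeros(2),
                      np.array([0.1, 0.0, 0.1]), V=1.0, dim=2))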

We suggest a compositional vector representation of parse trees that relies
on a recursive combination of recurrent-neural network encoders. To demonstrate
its effectiveness, we use the representation as the backbone of a greedy,
bottom-up dependency parser, achieving state-of-the-art accuracies for English
and Chinese, without relying on external word embeddings. The parser's
implementation is available for download at the first author's webpage.

Most learning algorithms are not invariant to the scale of the function that
is being approximated. We propose to adaptively normalize the targets used in
learning. This is useful in value-based reinforcement learning, where the
magnitude of appropriate value approximations can change over time when we
update the policy of behavior. Our main motivation is prior work on learning to
play Atari games, where the rewards were all clipped to a predetermined range.
This clipping facilitates learning across many different games with a single
learning algorithm, but a clipped reward function can result in qualitatively
different behavior. Using adaptive normalization, we can remove this
domain-specific heuristic without diminishing overall performance.
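
One way to realize adaptive target normalization is to track the targets'
mean and scale online, learn in normalized space, and rescale the final
linear layer so that unnormalized predictions are preserved at every
update; a sketch of this scheme (our illustration of the idea, for a
scalar target):

    import numpy as np

    class TargetNormalizer:
        """Online normalization of learning targets with output-layer
        compensation, so predictions are unchanged by each rescaling."""
        def __init__(self, beta=0.01):
            self.mu, self.nu, self.beta = 0.0, 1.0, beta  # mean, 2nd moment

        @property
        def sigma(self):
            return np.sqrt(max(self.nu - self.mu ** 2, 1e-8))

        def update(self, target, w, b):
            old_mu, old_sigma = self.mu, self.sigma
            self.mu = (1 - self.beta) * self.mu + self.beta * target
            self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
            # Rescale the last linear layer (w, b) so the unnormalized
            # prediction sigma * (w @ h + b) + mu is exactly preserved.
            w = w * (old_sigma / self.sigma)
            b = (old_sigma * b + old_mu - self.mu) / self.sigma
            return w, b

        def normalize(self, target):
            return (target - self.mu) / self.sigma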

Despite progress in perceptual tasks such as image classification, computers
still perform poorly on cognitive tasks such as image description and question
answering. Cognition is core to tasks that involve not just recognizing, but
reasoning about our visual world. However, models used to tackle the rich
content in images for cognitive tasks are still being trained using the same
datasets designed for perceptual tasks. To achieve success at cognitive tasks,
models need to understand the interactions and relationships between objects in
an image. When asked "What vehicle is the person riding?", computers will need
to identify the objects in an image as well as the relationships riding(man,
carriage) and pulling(horse, carriage) in order to answer correctly that "the
person is riding a horse-drawn carriage".
In this paper, we present the Visual Genome dataset to enable the modeling of
such relationships. We collect dense annotations of objects, attributes, and
relationships within each image to learn these models. Specifically, our
dataset contains over 100K images where each image has an average of 21
objects, 18 attributes, and 18 pairwise relationships between objects. We
canonicalize the objects, attributes, relationships, and noun phrases in region
descriptions and question answer pairs to WordNet synsets. Together, these
annotations represent the densest and largest dataset of image descriptions,
objects, attributes, relationships, and question answers.

We show that every packing of congruent regular pentagons in the Euclidean
plane has density at most $(5-\sqrt5)/3$, which is about 0.92. More
specifically, this article proves the pentagonal ice-ray conjecture of Henley
(1986), and Kuperberg and Kuperberg (1990), which asserts that an optimal
packing of congruent regular pentagons in the plane is a double lattice, formed
by aligned vertical columns of upward pointing pentagons alternating with
aligned vertical columns of downward pointing pentagons. The strategy is based
on estimates of the areas of Delaunay triangles. Our strategy reduces the
pentagonal ice-ray conjecture to area minimization problems that involve at
most four Delaunay triangles. These minimization problems are solved by
computer. The computer-assisted portions of the proof use techniques such as
interval arithmetic, automatic differentiation, and a meet-in-the-middle
algorithm.
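
To give a flavor of the interval-arithmetic part: bounding every quantity
by a rational interval makes the computer-assisted conclusions rigorous. A
toy sketch bounding the optimal density $(5-\sqrt5)/3$ (real verified
computations use directed rounding on floats rather than exact rationals):

    from fractions import Fraction

    class Interval:
        """Toy interval arithmetic: every operation returns an interval
        guaranteed to contain the true result."""
        def __init__(self, lo, hi):
            self.lo, self.hi = Fraction(lo), Fraction(hi)

        def __add__(self, other):
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __mul__(self, other):
            ps = [self.lo * other.lo, self.lo * other.hi,
                  self.hi * other.lo, self.hi * other.hi]
            return Interval(min(ps), max(ps))

    # sqrt(5) lies in [2.2360679, 2.2360680], so (5 - sqrt(5))/3 is
    # rigorously bracketed around 0.9213.
    s5 = Interval("2.2360679", "2.2360680")
    density = Interval(5 - s5.hi, 5 - s5.lo) * Interval(Fraction(1, 3),
                                                        Fraction(1, 3))
    print(float(density.lo), float(density.hi))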

Learning efficient representations for concepts has been proven to be an
important basis for many applications such as machine translation or document
classification. Proper representations of medical concepts such as diagnosis,
medication, procedure codes and visits will have broad applications in
healthcare analytics. However, in Electronic Health Records (EHR) the visit
sequences of patients include multiple concepts (diagnosis, procedure, and
medication codes) per visit. This structure provides two types of relational
information, namely sequential order of visits and co-occurrence of the codes
within each visit. In this work, we propose Med2Vec, which not only learns
distributed representations for both medical codes and visits from a large EHR
dataset with over 3 million visits, but also allows us to interpret the
learned representations, and these interpretations were positively
confirmed by clinical experts. In the experiments,
Med2Vec displays significant improvement in key medical applications compared
to popular baselines such as Skip-gram, GloVe and stacked autoencoder, while
providing clinically meaningful interpretation.
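
The two relational signals named above can be made concrete as training
pairs: codes co-occurring within a visit, and codes in neighboring visits.
A sketch of such pair extraction (our illustration of the two relation
types, not the paper's exact objective):

    def code_cooccurrence_pairs(visits, window=1):
        """Training pairs for a Med2Vec-style objective.
        visits: one patient's visit sequence, each visit a list of
        medical code ids."""
        pairs = []
        for t, visit in enumerate(visits):
            for c in visit:
                # Co-occurrence of codes within the same visit.
                pairs += [(c, o) for o in visit if o != c]
                # Sequential order: codes in visits up to `window` away.
                lo, hi = max(0, t - window), min(len(visits), t + window + 1)
                for u in range(lo, hi):
                    if u != t:
                        pairs += [(c, o) for o in visits[u]]
        return pairs

    # Toy patient: three visits of diagnosis/medication code ids.
    print(code_cooccurrence_pairs([[1, 2], [3], [4, 5]])[:6])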

Despite widespread adoption, machine learning models remain mostly black
boxes. Understanding the reasons behind predictions is, however, quite
important in assessing trust, which is fundamental if one plans to take action
based on a prediction, or when choosing whether to deploy a new model. Such
understanding also provides insights into the model, which can be used to
transform an untrustworthy model or prediction into a trustworthy one. In this
work, we propose LIME, a novel explanation technique that explains the
predictions of any classifier in an interpretable and faithful manner, by
learning an interpretable model locally around the prediction. We also propose
a method to explain models by presenting representative individual predictions
and their explanations in a non-redundant way, framing the task as a submodular
optimization problem. We demonstrate the flexibility of these methods by
explaining different models for text (e.g. random forests) and image
classification (e.g. neural networks). We show the utility of explanations via
novel experiments, both simulated and with human subjects, on various scenarios
that require trust: deciding if one should trust a prediction, choosing between
models, improving an untrustworthy classifier, and identifying why a classifier
should not be trusted.
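
The core procedure is compact: perturb the instance, weight perturbations
by proximity, and fit a weighted linear surrogate whose coefficients are
the explanation. A minimal numpy sketch for a binary feature vector
(illustrative only; the released LIME library adds sparsity constraints
and more):

    import numpy as np

    def lime_explain(predict_fn, x, n_samples=500, kernel_width=0.75,
                     rng=None):
        """Fit a locally weighted linear surrogate around instance x
        (binary features); coefficients approximate each feature's local
        contribution to predict_fn's output."""
        rng = rng or np.random.default_rng(0)
        d = x.shape[0]
        # Perturb: randomly switch features off.
        Z = rng.integers(0, 2, size=(n_samples, d)) * x
        y = np.array([predict_fn(z) for z in Z])
        # Proximity kernel: perturbations near x get larger weight.
        dist2 = ((Z - x) ** 2).sum(axis=1)
        w = np.exp(-dist2 / kernel_width ** 2)
        # Weighted least squares (ridge-stabilized) for the surrogate.
        Zw = Z * w[:, None]
        coef = np.linalg.solve(Zw.T @ Z + 1e-6 * np.eye(d), Zw.T @ y)
        return coef

    # Toy usage: explain a linear "black box" around x = all-ones.
    f = lambda z: float(z @ np.array([2.0, 0.0, -1.0]) > 0)
    print(lime_explain(f, np.ones(3)))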