Papers

We propose a new equilibrium enforcing method paired with a loss derived from
the Wasserstein distance for training auto-encoder based Generative Adversarial
Networks. This method balances the generator and discriminator during training.
Additionally, it provides a new approximate convergence measure, fast and
stable training, and high visual quality. We also derive a way of controlling
the trade-off between image diversity and visual quality. We focus on the image
generation task, setting a new milestone in visual quality, even at higher
resolutions. This is achieved while using a relatively simple model
architecture and a standard training procedure.
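The equilibrium-enforcing idea can be sketched with the proportional balance term described above: a coefficient k is nudged so that the generator's loss tracks a fixed fraction gamma of the real-data loss, and the same quantities yield the approximate convergence measure. The loss values and hyperparameters below are illustrative placeholders, not the paper's numbers.

```python
# Hedged sketch of an equilibrium-enforcing update. L_real and L_fake stand
# for auto-encoder reconstruction losses of real and generated samples;
# all values here are illustrative.

def update_balance(k, L_real, L_fake, gamma=0.5, lambda_k=0.001):
    """Adjust the balance term k so that L_fake tracks gamma * L_real."""
    k = k + lambda_k * (gamma * L_real - L_fake)
    return min(max(k, 0.0), 1.0)  # k is kept in [0, 1]

def convergence_measure(L_real, L_fake, gamma=0.5):
    """Low when the real loss is small and the equilibrium
    gamma * L_real ~= L_fake holds."""
    return L_real + abs(gamma * L_real - L_fake)

k = 0.0
for L_real, L_fake in [(0.8, 0.2), (0.7, 0.3), (0.6, 0.35)]:
    k = update_balance(k, L_real, L_fake)
print(round(convergence_measure(0.6, 0.35), 3))
```

The trade-off between diversity and visual quality corresponds to the choice of gamma: a lower target fraction pushes reconstruction quality, a higher one pushes diversity.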

An important editing policy in Wikipedia is to provide citations for added
statements in Wikipedia pages, where statements can be arbitrary pieces of
text, ranging from a sentence to a paragraph. In many cases citations are
either outdated or missing altogether.
In this work we address the problem of finding and updating news citations
for statements in entity pages. We propose a two-stage supervised approach for
this problem. In the first step, we construct a classifier to find out whether
statements need a news citation or other kinds of citations (web, book,
journal, etc.). In the second step, we develop a news citation algorithm for
Wikipedia statements, which recommends appropriate citations from a given news
collection. Apart from IR techniques that use the statement to query the news
collection, we also formalize three properties of an appropriate citation,
namely: (i) the citation should entail the Wikipedia statement, (ii) the
statement should be central to the citation, and (iii) the citation should be
from an authoritative source.
We perform an extensive evaluation of both steps, using 20 million articles
from a real-world news collection. Our results are quite promising and show
that we can perform this task with high precision and at scale.
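The second step can be sketched as a ranking problem over the three formalized properties. The scores and weights below are hypothetical placeholders; in practice each property (entailment, centrality, authority) would be estimated by its own learned model.

```python
# Minimal sketch: rank candidate news citations for a Wikipedia statement
# by a weighted combination of the three formalized properties. Scores and
# weights are hypothetical placeholders for learned components.

def citation_score(entailment, centrality, authority,
                   weights=(0.5, 0.3, 0.2)):
    w_e, w_c, w_a = weights
    return w_e * entailment + w_c * centrality + w_a * authority

candidates = {
    "article_a": citation_score(0.9, 0.8, 0.6),
    "article_b": citation_score(0.4, 0.9, 0.9),
}
best = max(candidates, key=candidates.get)
print(best)
```

Here article_a wins because entailment carries the most weight, reflecting the intuition that a citation must actually support the statement before centrality or authority matter.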

A text-to-speech synthesis system typically consists of multiple stages, such
as a text analysis frontend, an acoustic model and an audio synthesis module.
Building these components often requires extensive domain expertise and may
contain brittle design choices. In this paper, we present Tacotron, an
end-to-end generative text-to-speech model that synthesizes speech directly
from characters. Given <text, audio> pairs, the model can be trained completely
from scratch with random initialization. We present several key techniques to
make the sequence-to-sequence framework perform well for this challenging task.
Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English,
outperforming a production parametric system in terms of naturalness. In
addition, since Tacotron generates speech at the frame level, it is
substantially faster than sample-level autoregressive methods.
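The frame-level speedup is simple arithmetic: an autoregressive decoder takes one step per generated unit, and a frame covers hundreds of samples. The sample rate and frame hop below are assumed values for illustration, not taken from the abstract.

```python
# Back-of-the-envelope sketch of why frame-level synthesis needs far fewer
# autoregressive steps than sample-level synthesis. The sample rate and
# frame hop are assumed values, purely for illustration.

sample_rate = 24000        # samples per second (assumed)
frame_hop_ms = 12.5        # frame shift in milliseconds (assumed)
seconds = 2.0              # length of utterance to synthesize

sample_steps = int(sample_rate * seconds)          # one step per sample
frame_steps = int(seconds * 1000 / frame_hop_ms)   # one step per frame

print(sample_steps, frame_steps, sample_steps // frame_steps)
```

Under these assumed numbers the frame-level decoder runs two orders of magnitude fewer sequential steps for the same audio duration.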

We use the scattering network as a generic and fixed initialization of the
first layers of a supervised hybrid deep network. We show that early layers do
not necessarily need to be learned, providing the best results to date with
pre-defined representations while being competitive with deep CNNs. Using a
shallow cascade of 1x1 convolutions, which encodes scattering coefficients that
correspond to spatial windows of very small sizes, we obtain AlexNet
accuracy on ImageNet ILSVRC2012. We demonstrate that this local encoding
explicitly learns invariance w.r.t. rotations. Combining scattering networks
with a modern ResNet, we achieve a single-crop top-5 error of 11.4% on ImageNet
ILSVRC2012, comparable to the ResNet-18 architecture, while utilizing only 10
layers. We also find that hybrid architectures can yield excellent performance
in the small sample regime, exceeding their end-to-end counterparts, through
their ability to incorporate geometrical priors. We demonstrate this on subsets
of the CIFAR-10 dataset and by setting a new state-of-the-art on the STL-10
dataset.
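A 1x1 convolution is just a per-pixel linear map across channels, so the "shallow cascade of 1x1 convolutions" over fixed scattering coefficients can be sketched with a single einsum. Channel counts and spatial sizes below are illustrative, not those of the paper.

```python
import numpy as np

# Sketch of a shallow cascade of 1x1 convolutions applied on top of fixed
# (non-learned) scattering coefficients. A 1x1 convolution mixes channels
# independently at each spatial location. Shapes are illustrative only.

rng = np.random.default_rng(0)
scattering = rng.standard_normal((17, 8, 8))   # fixed coefficients: C x H x W

def conv1x1(x, weight):
    """Per-pixel linear map over channels: (O, C) x (C, H, W) -> (O, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

w1 = rng.standard_normal((32, 17))
w2 = rng.standard_normal((64, 32))
out = conv1x1(np.maximum(conv1x1(scattering, w1), 0.0), w2)  # ReLU between
print(out.shape)
```

Because the spatial extent of each map is untouched, all learning happens in the channel mixing, which is what makes the cascade cheap relative to full spatial convolutions.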

We present a recurrent encoder-decoder deep neural network architecture that
directly translates speech in one language into text in another. The model does
not explicitly transcribe the speech into text in the source language, nor does
it require supervision from the ground truth source language transcription
during training. We apply a slightly modified sequence-to-sequence with
attention architecture that has previously been used for speech recognition and
show that it can be repurposed for this more complex task, illustrating the
power of attention-based models. A single model trained end-to-end obtains
state-of-the-art performance on the Fisher Callhome Spanish-English speech
translation task, outperforming a cascade of independently trained
sequence-to-sequence speech recognition and machine translation models by 1.8
BLEU points on the Fisher test set. In addition, we find that making use of the
training data in both languages by multi-task training sequence-to-sequence
speech translation and recognition models with a shared encoder network can
improve performance by a further 1.4 BLEU points.

Imitation learning has been commonly applied to solve different tasks in
isolation. This usually requires either careful feature engineering, or a
significant number of samples. This is far from what we desire: ideally, robots
should be able to learn from very few demonstrations of any given task, and
instantly generalize to new situations of the same task, without requiring
task-specific engineering. In this paper, we propose a meta-learning framework
for achieving such capability, which we call one-shot imitation learning.
Specifically, we consider the setting where there is a very large set of
tasks, and each task has many instantiations. For example, a task could be to
stack all blocks on a table into a single tower, another task could be to place
all blocks on a table into two-block towers, etc. In each case, different
instances of the task would consist of different sets of blocks with different
initial states. At training time, our algorithm is presented with pairs of
demonstrations for a subset of all tasks. A neural net is trained that takes as
input one demonstration and the current state (which initially is the initial
state of the other demonstration of the pair), and outputs an action with the
goal that the resulting sequence of states and actions matches as closely as
possible with the second demonstration. At test time, a demonstration of a
single instance of a new task is presented, and the neural net is expected to
perform well on new instances of this new task. The use of soft attention
allows the model to generalize to conditions and tasks unseen in the training
data. We anticipate that by training this model on a much greater variety of
tasks and settings, we will obtain a general system that can turn any
demonstrations into robust policies that can accomplish an overwhelming variety
of tasks.
Videos available at https://bit.ly/one-shot-imitation.
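The soft-attention readout the abstract credits for generalization can be sketched as follows: given an embedding of the current state, the policy scores every timestep of the single demonstration and reads out a weighted summary. Dimensions and the dot-product scoring function are illustrative assumptions.

```python
import numpy as np

# Hedged sketch of a soft-attention read over one demonstration: the
# current-state query is compared against per-timestep keys, and the
# softmax weights blend the per-timestep values. Dimensions are assumed.

def soft_attention(query, demo_keys, demo_values):
    """Softmax over key/query similarities, then a weighted sum of values."""
    scores = demo_keys @ query                     # (T,)
    scores = scores - scores.max()                 # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ demo_values                   # (d_v,)

rng = np.random.default_rng(1)
T, d_k, d_v = 50, 16, 8                            # demo length, key/value dims
demo_keys = rng.standard_normal((T, d_k))
demo_values = rng.standard_normal((T, d_v))
query = rng.standard_normal(d_k)                   # embedding of current state

context = soft_attention(query, demo_keys, demo_values)
print(context.shape)
```

Because the readout is a smooth function of similarity rather than a hard index, the same mechanism can handle demonstrations of varying length and block configurations unseen in training.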

We present a conceptually simple, flexible, and general framework for object
instance segmentation. Our approach efficiently detects objects in an image
while simultaneously generating a high-quality segmentation mask for each
instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a
branch for predicting an object mask in parallel with the existing branch for
bounding box recognition. Mask R-CNN is simple to train and adds only a small
overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to
generalize to other tasks, e.g., allowing us to estimate human poses in the
same framework. We show top results in all three tracks of the COCO suite of
challenges, including instance segmentation, bounding-box object detection, and
person keypoint detection. Without tricks, Mask R-CNN outperforms all existing
single-model entries on every task, including the COCO 2016 challenge winners.
We hope our simple and effective approach will serve as a solid baseline and
help ease future research in instance-level recognition. Code will be made
available.
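The parallel mask branch predicts one binary mask per class for each region, and only the ground-truth class's mask contributes to the loss, with a per-pixel sigmoid so classes do not compete. The sketch below illustrates that decoupling; the shapes and the hand-rolled cross-entropy are assumptions, not the released implementation.

```python
import numpy as np

# Sketch of the per-class mask loss: the branch outputs one mask per class,
# and binary cross-entropy is taken only on the ground-truth class's mask
# (per-pixel sigmoid, no inter-class competition). Shapes are illustrative.

def mask_loss(mask_logits, gt_class, gt_mask):
    """Binary cross-entropy on the ground-truth class's predicted mask."""
    logits = mask_logits[gt_class]                 # (H, W) for one class
    p = 1.0 / (1.0 + np.exp(-logits))              # per-pixel sigmoid
    eps = 1e-9
    return -np.mean(gt_mask * np.log(p + eps)
                    + (1.0 - gt_mask) * np.log(1.0 - p + eps))

rng = np.random.default_rng(2)
num_classes, H, W = 80, 28, 28                     # illustrative sizes
mask_logits = rng.standard_normal((num_classes, H, W))
gt_mask = (rng.random((H, W)) > 0.5).astype(float)

loss = mask_loss(mask_logits, gt_class=17, gt_mask=gt_mask)
print(loss > 0.0)
```

Selecting the mask by class rather than letting masks compete across classes is what lets segmentation ride on top of the existing classification branch with little overhead.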

We introduce the first goal-driven training for visual question answering and
dialog agents. Specifically, we pose a cooperative 'image guessing' game
between two agents -- Qbot and Abot -- who communicate in natural language
dialog so that Qbot can select an unseen image from a lineup of images. We use
deep reinforcement learning (RL) to learn the policies of these agents
end-to-end -- from pixels to multi-agent multi-round dialog to game reward.
We demonstrate two experimental results.
First, as a 'sanity check' demonstration of pure RL (from scratch), we show
results on a synthetic world, where the agents communicate in ungrounded
vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find
that two bots invent their own communication protocol and start using certain
symbols to ask/answer about certain visual attributes (shape/color/size). Thus,
we demonstrate the emergence of grounded language and communication among
'visual' dialog agents with no human supervision at all.
Second, we conduct large-scale real-image experiments on the VisDial dataset,
where we pretrain with supervised dialog data and show that the RL 'fine-tuned'
agents significantly outperform SL agents. Interestingly, the RL Qbot learns to
ask questions that Abot is good at, ultimately resulting in more informative
dialog and a better team.
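Learning from game reward alone can be sketched with the simplest policy-gradient recipe: sample a symbol from a softmax policy and scale the log-probability gradient by the reward. This one-agent, one-step toy stands in for the multi-round, two-agent setting purely for illustration.

```python
import numpy as np

# Minimal REINFORCE sketch for goal-driven training: a softmax policy over
# three ungrounded symbols (X, Y, Z); only rewarded choices reinforce their
# logits. A one-step, one-agent toy version, purely illustrative.

rng = np.random.default_rng(6)
logits = np.zeros(3)                       # policy over 3 symbols
lr = 0.5
for _ in range(300):
    p = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(3, p=p)
    reward = 1.0 if a == 2 else 0.0        # toy game rewards symbol Z
    grad = -p
    grad[a] += 1.0                         # gradient of log p(a) w.r.t. logits
    logits += lr * reward * grad

p = np.exp(logits) / np.exp(logits).sum()
print(p.argmax())
```

Even in this toy, the reward signal alone is enough to make a symbol acquire a stable "meaning", which is the mechanism behind the emergent protocol in the synthetic-world experiment.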

Current Deep Learning approaches have been very successful using
convolutional neural networks (CNN) trained on large graphical processing units
(GPU)-based computers. Three limitations of this approach are: 1) they are
based on a simple layered network topology, i.e., highly connected layers,
without intra-layer connections; 2) the networks are manually configured to
achieve optimal results; and 3) the implementation of the neuron model is expensive
in both cost and power. In this paper, we evaluate deep learning models using
three different computing architectures to address these problems: quantum
computing to train complex topologies, high performance computing (HPC) to
automatically determine network topology, and neuromorphic computing for a
low-power hardware implementation. We use the MNIST dataset for our experiment,
due to input size limitations of current quantum computers. Our results show
the feasibility of using the three architectures in tandem to address the above
deep learning limitations. We show that a quantum computer can find high-quality
values of intra-layer connection weights in tractable time, even as the
complexity of the network increases; a high-performance computer can find
optimal layer-based topologies; and a neuromorphic computer can represent the
complex topology and weights derived from the other architectures in low power
memristive hardware.

While humans easily recognize relations between data from different domains
without any supervision, learning to automatically discover them is in general
very challenging and needs many ground-truth pairs that illustrate the
relations. To avoid costly pairing, we address the task of discovering
cross-domain relations given unpaired data. We propose a method based on
generative adversarial networks that learns to discover relations between
different domains (DiscoGAN). Using the discovered relations, our proposed
network successfully transfers style from one domain to another while
preserving key attributes such as orientation and face identity.
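The key constraint that makes unpaired discovery possible is reconstruction: translating a sample A→B and back B→A should recover the original. The sketch below uses an orthogonal linear map and its exact inverse as stand-ins for the two generators, purely to illustrate the cycle; nothing here is the paper's network.

```python
import numpy as np

# Hedged sketch of the reconstruction (cycle) constraint used to discover
# cross-domain relations without paired data: A -> B -> A should recover
# the input. Orthogonal linear "generators" stand in for the networks.

rng = np.random.default_rng(3)
G_ab, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # maps domain A to B
G_ba = G_ab.T                                        # an ideal inverse mapping

x_a = rng.standard_normal(4)
reconstruction = G_ba @ (G_ab @ x_a)
cycle_loss = np.mean((reconstruction - x_a) ** 2)
print(cycle_loss < 1e-10)
```

In training, minimizing this reconstruction error alongside the adversarial losses is what forces the learned mappings to preserve attributes such as orientation and identity.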

Many phenomena taking place in the solar photosphere are controlled by plasma
motions. Although the line-of-sight component of the velocity can be estimated
using the Doppler effect, we do not have direct spectroscopic access to the
components that are perpendicular to the line-of-sight. These components are
typically estimated using methods based on local correlation tracking. We have
designed DeepVel, an end-to-end deep neural network that produces an estimation
of the velocity at every single pixel and at every time step and at three
different heights in the atmosphere from just two consecutive continuum images.
We compare DeepVel with local correlation tracking, finding that they give
very similar results in the time- and spatially-averaged cases. We use the
network to study the evolution in height of the horizontal velocity field in
fragmenting granules, supporting the buoyancy-braking mechanism for the
formation of intergranular lanes in these granules. We also show that DeepVel
can capture very small vortices, so that we can potentially expand the scaling
cascade of vortices to very small sizes and durations.

Despite their overwhelming capacity to overfit, deep learning architectures
tend to generalize relatively well to unseen data, allowing them to be deployed
in practice. However, explaining why this is the case is still an open area of
research. One standing hypothesis that is gaining popularity, e.g. Hochreiter &
Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the
loss function found by stochastic gradient based methods results in good
generalization. This paper argues that most notions of flatness are problematic
for deep models and cannot be directly applied to explain generalization.
Specifically, when focusing on deep networks with rectifier units, we can
exploit the particular geometry of parameter space induced by the inherent
symmetries that these architectures exhibit to build equivalent models
corresponding to arbitrarily sharper minima. Furthermore, if we are allowed to
reparametrize a function, the geometry of its parameters can change drastically
without affecting its generalization properties.
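The rectifier symmetry the argument relies on is easy to exhibit: since relu(alpha·x) = alpha·relu(x) for alpha > 0, scaling one layer up and the next down leaves the network's function unchanged while moving its parameters to an arbitrarily "sharper" region of the loss surface. A minimal demonstration:

```python
import numpy as np

# Demonstration of the positive-rescaling symmetry of rectifier networks:
# scaling the first layer by alpha and the second by 1/alpha does not
# change the function computed, for any alpha > 0.

def relu(x):
    return np.maximum(x, 0.0)

def net(x, W1, W2):
    return W2 @ relu(W1 @ x)

rng = np.random.default_rng(4)
W1 = rng.standard_normal((5, 3))
W2 = rng.standard_normal((2, 5))
x = rng.standard_normal(3)

alpha = 1000.0                               # arbitrarily large rescaling
same = net(x, alpha * W1, W2 / alpha)
print(np.allclose(same, net(x, W1, W2)))
```

Because curvature of the loss in parameter space is not invariant under this rescaling, any flatness measure defined on the parameters can be made arbitrarily large or small without changing the function or its generalization.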

We propose Significance-Offset Convolutional Neural Network, a deep
convolutional network architecture for regression of multivariate asynchronous
time series. The model is inspired by standard autoregressive (AR) models and
gating mechanisms used in recurrent neural networks. It involves an AR-like
weighting system, where the final predictor is obtained as a weighted sum of
adjusted regressors, while the weights are data-dependent functions learnt
through a convolutional network. The architecture was designed for applications
on asynchronous time series and is evaluated on such datasets: a hedge fund
proprietary dataset of over 2 million quotes for a credit derivative index, an
artificially generated noisy autoregressive series, and a household electricity
consumption dataset. The proposed architecture achieves promising results as
compared to convolutional and recurrent neural networks.
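The AR-like weighting scheme can be sketched directly: the prediction is a weighted sum of offset-adjusted past observations, with the weights obtained by normalizing data-dependent significance scores. In the paper those scores and offsets come from convolutional sub-networks; here they are given by hand, purely for illustration.

```python
import numpy as np

# Sketch of significance-offset prediction: offset-adjusted regressors are
# combined with softmax-normalized, data-dependent significance weights.
# Values, offsets, and scores below are illustrative assumptions.

def significance_offset_predict(values, offsets, significance):
    """Weighted sum of adjusted regressors; weights sum to one."""
    adjusted = values + offsets                    # offset-adjusted regressors
    s = significance - significance.max()          # numerical stability
    weights = np.exp(s) / np.exp(s).sum()
    return float(weights @ adjusted)

values = np.array([1.0, 1.2, 0.9, 1.1])            # past observations
offsets = np.array([0.05, -0.1, 0.0, 0.02])        # learned corrections (assumed)
significance = np.array([0.1, 2.0, 0.5, 1.0])      # significance scores (assumed)

pred = significance_offset_predict(values, offsets, significance)
print(round(pred, 3))
```

The normalization keeps the prediction a convex combination of the adjusted regressors, which is what makes the scheme behave like a data-dependent AR model on irregularly observed series.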

Increasing evidence suggests that a growing amount of social media content is
generated by autonomous entities known as social bots. In this work we present
a framework to detect such entities on Twitter. We leverage more than a
thousand features extracted from public data and meta-data about users:
friends, tweet content and sentiment, network patterns, and activity time
series. We benchmark the classification framework by using a publicly available
dataset of Twitter bots. This training data is enriched by a manually annotated
collection of active Twitter users that include both humans and bots of varying
sophistication. Our models yield high accuracy and agreement with each other
and can detect bots of different nature. Our estimates suggest that between 9%
and 15% of active Twitter accounts are bots. Characterizing ties among
accounts, we observe that simple bots tend to interact with bots that exhibit
more human-like behaviors. Analysis of content flows reveals retweet and
mention strategies adopted by bots to interact with different target groups.
Using clustering analysis, we characterize several subclasses of accounts,
including spammers, self promoters, and accounts that post content from
connected applications.

Divergent word usages reflect differences among people. In this paper, we
present a novel angle for studying word usage divergence -- word
interpretations. We propose an approach that quantifies semantic differences in
interpretations among different groups of people. The effectiveness of our
approach is validated by quantitative evaluations. Experiment results indicate
that divergences in word interpretations exist. We further apply the approach
to two well studied types of differences between people -- gender and region.
The detected words with divergent interpretations reveal the unique features of
specific groups of people. For gender, we discover that certain different
interests, social attitudes, and characters between males and females are
reflected in their divergent interpretations of many words. For region, we find
that specific interpretations of certain words reveal the geographical and
cultural features of different regions.

Deep neural networks coupled with fast simulation and improved computation
have led to recent successes in the field of reinforcement learning (RL).
However, most current RL-based approaches fail to generalize since: (a) the gap
between simulation and real world is so large that policy-learning approaches
fail to transfer; (b) even if policy learning is done in real world, the data
scarcity leads to failed generalization from training to test scenarios (e.g.,
due to different friction or object masses). Inspired from H-infinity control
methods, we note that both modeling errors and differences in training and test
scenarios can be viewed as extra forces/disturbances in the system. This paper
proposes the idea of robust adversarial reinforcement learning (RARL), where we
train an agent to operate in the presence of a destabilizing adversary that
applies disturbance forces to the system. The jointly trained adversary is
reinforced -- that is, it learns an optimal destabilization policy. We
formulate the policy learning as a zero-sum, minimax objective function.
Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah,
Swimmer, Hopper and Walker2d) conclusively demonstrate that our method (a)
improves training stability; (b) is robust to differences in training/test
conditions; and (c) outperforms the baseline even in the absence of the
adversary.
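The zero-sum structure can be sketched with a toy quadratic game: the protagonist descends an objective that the adversary ascends, in alternating steps, until they settle at the saddle point. The objective and step sizes below are illustrative stand-ins for the RL returns and policy updates.

```python
# Toy sketch of the zero-sum minimax training loop: the protagonist picks
# a and minimizes the objective; the adversary picks disturbance d and
# maximizes it. A quadratic stand-in replaces the RL return, purely for
# illustration of the alternating updates.

def objective(a, d):
    return (a - 1.0) ** 2 - 0.5 * (d - a) ** 2

a, d, lr = 0.0, 0.0, 0.1
for _ in range(200):
    grad_a = 2.0 * (a - 1.0) + (d - a)   # d(objective)/da
    a -= lr * grad_a                      # protagonist descends
    grad_d = -(d - a)                     # d(objective)/dd
    d += lr * grad_d                      # adversary ascends
print(round(a, 2), round(d, 2))
```

At the saddle point the adversary's best disturbance matches the protagonist's behavior, so a policy trained this way has already seen the worst-case perturbations it may face at test time.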

Traditional image and video compression algorithms rely on hand-crafted
encoder/decoder pairs (codecs) that lack adaptability and are agnostic to the
data being compressed. Here we describe the concept of generative compression,
the compression of data using generative models, and show its potential to
produce more accurate and visually pleasing reconstructions at much deeper
compression levels for both image and video data. We also demonstrate that
generative compression is orders-of-magnitude more resilient to bit error rates
(e.g. from noisy wireless channels) than traditional variable-length entropy
coding schemes.

Bellemare et al. (2016) introduced the notion of a pseudo-count to generalize
count-based exploration to non-tabular reinforcement learning. This
pseudo-count is derived from a density model which effectively replaces the
count table used in the tabular setting. Using an exploration bonus based on
this pseudo-count and a mixed Monte Carlo update applied to a DQN agent was
sufficient to achieve state-of-the-art on the Atari 2600 game Montezuma's
Revenge.
In this paper we consider two questions left open by their work: First, how
important is the quality of the density model for exploration? Second, what
role does the Monte Carlo update play in exploration? We answer the first
question by demonstrating the use of PixelCNN, an advanced neural density model
for images, to supply a pseudo-count. In particular, we examine the intrinsic
difficulties in adapting Bellemare et al.'s approach when assumptions about the
model are violated. The result is a more practical and general algorithm
requiring no special apparatus. We combine PixelCNN pseudo-counts with
different agent architectures to dramatically improve the state of the art on
several hard Atari games. One surprising finding is that the mixed Monte Carlo
update is a powerful facilitator of exploration in the sparsest of settings,
including Montezuma's Revenge.
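The pseudo-count recovered from a density model follows Bellemare et al. (2016): if rho is the model's probability of an observation before seeing it and rho' the probability just after a single update on it, the pseudo-count is rho(1 − rho')/(rho' − rho), and a typical bonus scales with its inverse square root. The density values below are illustrative.

```python
# Pseudo-count derived from a density model (Bellemare et al., 2016):
# rho is the model's probability of x before observing it, rho_prime the
# probability just after one update on x. Values below are illustrative.

def pseudo_count(rho, rho_prime):
    return rho * (1.0 - rho_prime) / (rho_prime - rho)

def exploration_bonus(rho, rho_prime, c=0.1):
    """A typical count-based bonus: large when the pseudo-count is small."""
    return c / (pseudo_count(rho, rho_prime) ** 0.5)

# A rare observation moves the density a lot relative to its mass, giving
# a small count and a large bonus; a common one gives the opposite.
rare = pseudo_count(0.0010, 0.0012)
common = pseudo_count(0.2000, 0.2001)
print(rare < common)
```

This is the sense in which the density model "replaces the count table": for a tabular model with true counts, the same formula recovers the empirical visit count.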

Most of the existing image-to-image translation frameworks---mapping an image
in one domain to a corresponding image in another---are based on supervised
learning, i.e., pairs of corresponding images in two domains are required for
learning the translation function. This largely limits their applications,
because capturing corresponding images in two different domains is often a
difficult task. To address the issue, we propose the UNsupervised
Image-to-image Translation (UNIT) framework, which is based on variational
autoencoders and generative adversarial networks. The proposed framework can
learn the translation function without any corresponding images in two domains.
We enable this learning capability by combining a weight-sharing constraint and
an adversarial training objective. Through visualization results from various
unsupervised image translation tasks, we verify the effectiveness of the
proposed framework. An ablation study further reveals the critical design
choices. Moreover, we apply the UNIT framework to the unsupervised domain
adaptation task and achieve better results than competing algorithms do in
benchmark datasets.

Sophisticated gated recurrent neural network architectures like LSTMs and
GRUs have been shown to be highly effective in a myriad of applications. We
develop an un-gated unit, the statistical recurrent unit (SRU), that is able to
learn long term dependencies in data by only keeping moving averages of
statistics. The SRU's architecture is simple, un-gated, and contains a
comparable number of parameters to LSTMs; yet, SRUs perform favorably against
more sophisticated LSTM and GRU alternatives, often outperforming one or both in
various tasks. We show the efficacy of SRUs as compared to LSTMs and GRUs in an
unbiased manner by optimizing respective architectures' hyperparameters in a
Bayesian optimization scheme for both synthetic and real-world tasks.
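The un-gated recurrence amounts to keeping exponential moving averages of statistics of the input at several decay rates, so that slow averages summarize the distant past and fast ones the recent past. In the unit the statistics are learned transformations; here the raw inputs stand in for them, purely for illustration.

```python
import numpy as np

# Sketch of the core un-gated recurrence: exponential moving averages of
# input statistics kept at several decay rates. Raw inputs stand in for
# the learned statistics, and the decay rates are illustrative.

def sru_summaries(xs, alphas=(0.0, 0.5, 0.9, 0.99)):
    """Return one moving average per decay rate over the input sequence."""
    mus = np.zeros((len(alphas), xs.shape[1]))
    for x in xs:
        for i, a in enumerate(alphas):
            mus[i] = a * mus[i] + (1.0 - a) * x   # un-gated update
    return mus

rng = np.random.default_rng(5)
xs = rng.standard_normal((100, 4))          # a sequence of 100 inputs
summaries = sru_summaries(xs)
print(summaries.shape)
```

With alpha = 0 the summary is just the last input, while alpha = 0.99 averages over roughly the last hundred steps; reading out several such scales at once is what lets the unit capture long-term dependencies without gates.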