Papers

For any positive integer $k$, there exist neural networks with $\Theta(k^3)$
layers, $\Theta(1)$ nodes per layer, and $\Theta(1)$ distinct parameters which
cannot be approximated by networks with $\mathcal{O}(k)$ layers unless they
are exponentially large: they must possess $\Omega(2^k)$ nodes. This result
is proved here for a class of nodes termed "semi-algebraic gates" which
includes the common choices of ReLU, maximum, indicator, and piecewise
polynomial functions, therefore establishing benefits of depth against not just
standard networks with ReLU gates, but also convolutional networks with ReLU
and maximization gates, sum-product networks, and boosted decision trees (in
this last case with a stronger separation: $\Omega(2^{k^3})$ total tree nodes
are required).
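
The flavor of such constructions can be seen in miniature with the classic triangle map (an illustrative sketch in the spirit of this result, not its proof; the grid-based crossing count is my own check): each composition is a tiny two-layer ReLU network, yet $k$ compositions oscillate $2^k$ times, a behaviour shallow networks can only match with exponentially many nodes.

```python
def tent(x):
    # Piecewise-linear "triangle" map on [0, 1]; expressible with two ReLUs:
    # tent(x) = 2*relu(x) - 4*relu(x - 0.5) for x in [0, 1].
    return 2 * min(x, 1 - x)

def iterated_tent(x, k):
    # k compositions: a network with Theta(k) layers and O(1) nodes per layer.
    for _ in range(k):
        x = tent(x)
    return x

def count_crossings(k, n=100_000):
    # Count sign changes of iterated_tent - 1/2 on an offset grid;
    # the k-fold composition crosses the level 1/2 exactly 2^k times.
    vals = [iterated_tent((i + 0.5) / n, k) - 0.5 for i in range(n)]
    return sum(1 for a, b in zip(vals, vals[1:]) if a * b < 0)
```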

Unsupervised methods for learning distributed representations of words are
ubiquitous in today's NLP research, but far less is known about the best ways
to learn distributed phrase or sentence representations from unlabelled data.
This paper is a systematic comparison of models that learn such
representations. We find that the optimal approach depends critically on the
intended application. Deeper, more complex models are preferable for
representations to be used in supervised systems, but shallow log-linear models
work best for building representation spaces that can be decoded with simple
spatial distance metrics. We also propose two new unsupervised
representation-learning objectives designed to optimise the trade-off between
training time, domain portability and performance.

We investigate a new method to augment recurrent neural networks with extra
memory without increasing the number of network parameters. The system has an
associative memory based on complex-valued vectors and is closely related to
Holographic Reduced Representations and Long Short-Term Memory networks.
Holographic Reduced Representations have limited capacity: as they store more
information, each retrieval becomes noisier due to interference. Our system in
contrast creates redundant copies of stored information, which enables
retrieval with reduced noise. Experiments demonstrate faster learning on
multiple memorization tasks.
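
For context, the basic Holographic Reduced Representation operations can be sketched in a few lines (standard HRR binding via circular convolution, not the paper's redundant-copy scheme); retrieval from a superposed trace is correct only up to the interference noise the abstract describes.

```python
import numpy as np

def bind(a, b):
    # HRR binding: circular convolution, computed via FFT.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

def unbind(trace, key):
    # Approximate retrieval: circular correlation with the key.
    return np.real(np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(key))))

def cos(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(0)
d = 1024
k1, v1, k2, v2 = (rng.normal(size=d) / np.sqrt(d) for _ in range(4))

# Superimpose two key-value pairs in one trace, then retrieve one of them.
trace = bind(k1, v1) + bind(k2, v2)
retrieved = unbind(trace, k1)   # noisy copy of v1
```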

Attention mechanisms in neural networks have proved useful for problems in
which the input and output do not have fixed dimension. Often there exist
features that are locally translation invariant and would be valuable for
directing the model's attention, but previous attentional architectures are not
constructed to learn such features specifically. We introduce an attentional
neural network that employs convolution on the input tokens to detect local
time-invariant and long-range topical attention features in a context-dependent
way. We apply this architecture to the problem of extreme summarization of
source code snippets into short, descriptive function name-like summaries.
Using those features, the model sequentially generates a summary by
marginalizing over two attention mechanisms: one that predicts the next summary
token based on the attention weights of the input tokens and another that is
able to copy a code token as-is directly into the summary. We demonstrate our
convolutional attention neural network's performance on 10 popular Java
projects showing that it achieves better performance compared to previous
attentional mechanisms.
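
The marginalization over the two mechanisms amounts to mixing a vocabulary distribution with a copy distribution mapped through the input tokens' identities; a minimal sketch (interface and names are mine, not the paper's):

```python
import numpy as np

def mixed_prediction(p_gen, p_copy, copy_map, lam):
    # Marginalize over two mechanisms: p_gen is a distribution over the
    # vocabulary, p_copy an attention distribution over input positions,
    # copy_map[i] the vocabulary id of input token i, and lam the
    # model-predicted probability of generating rather than copying.
    p = lam * np.asarray(p_gen, dtype=float)
    for i, tok in enumerate(copy_map):
        p[tok] += (1 - lam) * p_copy[i]
    return p

# Tiny example: a 5-token vocabulary and a 3-token input.
p_gen = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
p_copy = np.array([0.5, 0.3, 0.2])   # attention over the 3 input tokens
copy_map = [4, 0, 4]                 # vocabulary ids of the input tokens
p = mixed_prediction(p_gen, p_copy, copy_map, lam=0.6)
```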

We introduce the value iteration network (VIN): a fully differentiable neural
network with a 'planning module' embedded within. VINs can learn to plan, and
are suitable for predicting outcomes that involve planning-based reasoning,
such as policies for reinforcement learning. Key to our approach is a novel
differentiable approximation of the value-iteration algorithm, which can be
represented as a convolutional neural network, and trained end-to-end using
standard backpropagation. We evaluate VIN-based policies on discrete and
continuous path-planning domains, and on a natural-language based search task.
We show that by learning an explicit planning computation, VIN policies
generalize better to new, unseen domains.
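
The planning computation being approximated is classical value iteration; a plain tabular sketch (the paper's contribution is making this loop differentiable, expressing the expectation as a convolution and the max as max-pooling):

```python
import numpy as np

def value_iteration(R, P, gamma=0.9, iters=100):
    # Classical value iteration: R[a, s] is the reward for action a in state
    # s, P[a, s, s'] the transition probability.
    V = np.zeros(R.shape[1])
    for _ in range(iters):
        Q = R + gamma * (P @ V)   # Q[a, s] = R[a, s] + gamma * E[V(s')]
        V = Q.max(axis=0)         # greedy backup
    return V, Q

# 5-state corridor: action 0 steps left, action 1 steps right, with a
# reward for stepping right into the last state.
S, A = 5, 2
P = np.zeros((A, S, S))
for s in range(S):
    P[0, s, max(s - 1, 0)] = 1.0
    P[1, s, min(s + 1, S - 1)] = 1.0
R = np.zeros((A, S))
R[1, S - 2] = 1.0
V, Q = value_iteration(R, P)
policy = Q.argmax(axis=0)   # greedy policy: head right toward the goal
```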

Machine learning (ML) models, e.g., state-of-the-art deep neural networks
(DNNs), are vulnerable to adversarial examples: malicious inputs modified to
yield erroneous model outputs, while appearing unmodified to human observers.
Potential attacks include having malicious content like malware identified as
legitimate or controlling vehicle behavior. Yet, all existing adversarial
example attacks require knowledge of either the model internals or its training
data. We introduce the first practical demonstration of an attacker controlling
a remotely hosted DNN with no such knowledge. Indeed, the only capability of
our black-box adversary is to observe labels given by the DNN to chosen inputs.
Our attack strategy consists in training a local model to substitute for the
target DNN, using inputs synthetically generated by an adversary and labeled by
the target DNN. We then use the local substitute to craft adversarial examples,
and find that they are misclassified by the targeted DNN. To perform a
real-world and properly-blinded evaluation, we attack a DNN hosted by MetaMind,
an online deep learning API. After labeling 6,400 synthetic inputs to train our
substitute, we find that their DNN misclassifies adversarial examples crafted
with our substitute at a rate of 84.24%. We demonstrate that our strategy
generalizes to many ML techniques like logistic regression or SVMs, regardless
of the ML model chosen for the substitute. We instantiate the same attack
against models hosted by Amazon and Google, using logistic regression
substitutes trained with only 800 label queries. They yield adversarial
examples misclassified by Amazon and Google at rates of 96.19% and 88.94%. We
also find that this black-box attack strategy is capable of evading defense
strategies previously found to make adversarial example crafting harder.
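
The substitute-and-transfer loop can be sketched end to end on a toy problem (a linear "oracle" and a logistic-regression substitute stand in for the remote DNN; the synthetic-input strategy here is plain random sampling, simpler than the paper's augmentation scheme):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20

# Black-box "oracle": the attacker can only observe its output labels.
w_true = rng.normal(size=d)
def oracle(X):
    return (X @ w_true > 0).astype(float)

# Step 1: synthesize inputs, label them with the oracle, and train a local
# substitute model on those labels.
X = rng.normal(size=(500, d))
y = oracle(X)
w = np.zeros(d)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / len(X)

# Step 2: craft adversarial examples against the substitute with the fast
# gradient sign method and check that they transfer to the oracle.
Xt = rng.normal(size=(200, d))
yt = oracle(Xt)
p = 1.0 / (1.0 + np.exp(-(Xt @ w)))
grad_x = (p - yt)[:, None] * w[None, :]   # d(logistic loss)/dx
X_adv = Xt + 0.5 * np.sign(grad_x)

clean_acc = (oracle(Xt) == yt).mean()     # 1.0 by construction
adv_acc = (oracle(X_adv) == yt).mean()    # degraded: the attack transfers
```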

Image-generating machine learning models are typically trained with loss
functions based on distance in the image space. This often leads to
over-smoothed results. We propose a class of loss functions, which we call deep
perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of
computing distances in the image space, we compute distances between image
features extracted by deep neural networks. This metric better reflects
perceptual similarity of images and thus leads to better results. We show
three applications: autoencoder training, a modification of a variational
autoencoder, and inversion of deep convolutional networks. In all cases, the
generated images look sharp and resemble natural images.
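
The proposal reduces to swapping the space in which distances are measured; a minimal sketch with a stand-in feature extractor (a fixed random projection plus ReLU here, whereas the paper uses features from a pretrained deep network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature extractor phi: a fixed random projection + ReLU.
# Only the structure of the loss is illustrated here.
W = rng.normal(size=(64, 3 * 8 * 8)) / np.sqrt(3 * 8 * 8)

def phi(img):
    return np.maximum(W @ img.reshape(-1), 0.0)

def pixel_loss(x, y):
    # The usual image-space loss that leads to over-smoothed outputs.
    return np.mean((x - y) ** 2)

def deepsim_loss(x, y):
    # DeePSiM idea: measure distance between *features* of the images.
    return np.mean((phi(x) - phi(y)) ** 2)

x = rng.normal(size=(3, 8, 8))
y = rng.normal(size=(3, 8, 8))
```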

In this work we explore recent advances in Recurrent Neural Networks for
large scale Language Modeling, a task central to language understanding. We
extend current models to deal with two key challenges present in this task:
corpora and vocabulary sizes, and the complex, long-term structure of language.
perform an exhaustive study on techniques such as character Convolutional
Neural Networks or Long Short-Term Memory, on the One Billion Word Benchmark.
Our best single model significantly improves state-of-the-art perplexity from
51.3 down to 30.0 (whilst reducing the number of parameters by a factor of 20),
while an ensemble of models sets a new record by improving perplexity from 41.0
down to 23.7. We also release these models for the NLP and ML community to
study and improve upon.

Recurrent neural networks are increasingly popular models for sequential
learning. Unfortunately, although the most effective RNN architectures are
perhaps excessively complicated, extensive searches have not found simpler
alternatives. This paper imports ideas from physics and functional programming
into RNN design to provide guiding principles. From physics, we introduce type
constraints, analogous to the constraints that forbid adding meters to
seconds. From functional programming, we require that strongly-typed
architectures factorize into stateless learnware and state-dependent firmware,
reducing the impact of side-effects. The features learned by strongly-typed
nets have a simple semantic interpretation via dynamic average-pooling on
one-dimensional convolutions. We also show that strongly-typed gradients are
better behaved than in classical architectures, and characterize the
representational power of strongly-typed nets. Finally, experiments show that,
despite being more constrained, strongly-typed architectures achieve lower
training and comparable generalization error to classical architectures.
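
The "dynamic average-pooling" reading can be sketched as a gated running average of projected inputs (a deliberate simplification with scalar gates; the paper's exact parameterization differs):

```python
import numpy as np

def dynamic_average_features(xs, W, gates):
    # Each feature is a data-dependent weighted average of projected inputs
    # W @ x_t, with weights supplied by gates in [0, 1]. Taken as a convex
    # combination, the state never drifts in scale.
    h = np.zeros(W.shape[0])
    for x, g in zip(xs, gates):
        h = g * (W @ x) + (1 - g) * h
    return h

# With gates 1, 1/2, 1/3, ... this is exactly a running mean of W @ x_t.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
xs = [rng.normal(size=3) for _ in range(5)]
h = dynamic_average_features(xs, W, gates=[1.0 / t for t in range(1, 6)])
```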

We present Submatrix-wise Vector Embedding Learner (Swivel), a method for
generating low-dimensional feature embeddings from a feature co-occurrence
matrix. Swivel performs approximate factorization of the point-wise mutual
information matrix via stochastic gradient descent. It uses a piecewise loss
with special handling for unobserved co-occurrences, and thus makes use of all
the information in the matrix. While this requires computation proportional to
the size of the entire matrix, we make use of vectorized multiplication to
process thousands of rows and columns at once to compute millions of predicted
values. Furthermore, we partition the matrix into shards in order to
parallelize the computation across many nodes. This approach results in more
accurate embeddings than can be achieved with methods that consider only
observed co-occurrences, and can scale to much larger corpora than can be
handled with sampling methods.
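
The target of the factorization can be sketched as follows (PMI from a toy count matrix, factored by plain gradient descent on squared error; Swivel's actual piecewise loss, handling of unobserved pairs, and sharding are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence counts C[i, j] for a 6-token vocabulary.
C = rng.integers(1, 50, size=(6, 6)).astype(float)
C = C + C.T   # symmetric co-occurrence, all entries >= 2

# Point-wise mutual information: log( p(i, j) / (p(i) p(j)) ).
total = C.sum()
pi = C.sum(axis=1) / total
pmi = np.log(C / total) - np.log(pi)[:, None] - np.log(pi)[None, :]

# Rank-2 factorization U V^T ~ PMI by gradient descent.
k = 2
U = 0.1 * rng.normal(size=(6, k))
V = 0.1 * rng.normal(size=(6, k))
err_start = np.linalg.norm(U @ V.T - pmi)
for _ in range(2000):
    E = U @ V.T - pmi
    U, V = U - 0.05 * E @ V, V - 0.05 * E.T @ U
err_end = np.linalg.norm(U @ V.T - pmi)
```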

We propose sparsemax, a new activation function similar to the traditional
softmax, but able to output sparse probabilities. After deriving its
properties, we show how its Jacobian can be efficiently computed, enabling its
use in a network trained with backpropagation. Then, we propose a new smooth
and convex loss function which is the sparsemax analogue of the logistic loss.
We reveal an unexpected connection between this new loss and the Huber
classification loss. We obtain promising empirical results in multi-label
classification problems and in attention-based neural networks for natural
language inference. For the latter, we achieve similar performance to the
traditional softmax, but with a selective, more compact, attention focus.
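
Sparsemax has a closed form, the Euclidean projection onto the probability simplex, computable with one sort:

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex; unlike
    # softmax, the output can contain exact zeros.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cssv = np.cumsum(z_sorted)
    # Support size: the largest k with 1 + k * z_(k) > sum of top-k entries.
    support = 1 + k * z_sorted > cssv
    k_z = k[support][-1]
    tau = (cssv[k_z - 1] - 1) / k_z      # threshold
    return np.maximum(z - tau, 0.0)
```

For well-separated scores the output collapses to a one-hot vector, which is the "selective, more compact" attention behaviour the abstract mentions.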

We propose a conceptually simple and lightweight framework for deep
reinforcement learning that uses asynchronous gradient descent for optimization
of deep neural network controllers. We present asynchronous variants of four
standard reinforcement learning algorithms and show that parallel
actor-learners have a stabilizing effect on training allowing all four methods
to successfully train neural network controllers. The best performing method,
an asynchronous variant of actor-critic, surpasses the current state-of-the-art
on the Atari domain while training for half the time on a single multi-core CPU
instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds
on a wide variety of continuous motor control problems as well as on a new task
of navigating random 3D mazes using a visual input.

State-of-the-art deep neural networks (DNNs) have hundreds of millions of
connections and are both computationally and memory intensive, making them
difficult to deploy on embedded systems with limited hardware resources and
power budgets. While custom hardware helps the computation, fetching weights
from DRAM is two orders of magnitude more expensive than ALU operations, and
dominates the required power.
Previously proposed 'Deep Compression' makes it possible to fit large DNNs
(AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by
pruning the redundant connections and having multiple connections share the
same weight. We propose an energy efficient inference engine (EIE) that
performs inference on this compressed network model and accelerates the
resulting sparse matrix-vector multiplication with weight sharing. Going from
DRAM to SRAM gives EIE a 120x energy saving; exploiting sparsity saves 10x;
weight sharing gives 8x; skipping zero activations from ReLU saves another 3x.
Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to
CPU and GPU implementations of the same DNN without compression. EIE has a
processing power of 102 GOPS/s working directly on a compressed network,
corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers
of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600 mW. It
is
24,000x and 3,400x more energy efficient than a CPU and GPU respectively.
Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy
efficiency and area efficiency.

Modeling the distribution of natural images is a landmark problem in
unsupervised learning. This task requires an image model that is at once
expressive, tractable and scalable. We present a deep neural network that
sequentially predicts the pixels in an image along the two spatial dimensions.
Our method models the discrete probability of the raw pixel values and encodes
the complete set of dependencies in the image. Architectural novelties include
fast two-dimensional recurrent layers and an effective use of residual
connections in deep recurrent networks. We achieve log-likelihood scores on
natural images that are considerably better than the previous state of the art.
Our main results also provide benchmarks on the diverse ImageNet dataset.
Samples generated from the model appear crisp, varied and globally coherent.
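
The sequential prediction amounts to the standard autoregressive factorization of the image distribution over its $n \times n$ pixels, with each conditional modeled as a discrete distribution over raw pixel values:

$$p(\mathbf{x}) = \prod_{i=1}^{n^2} p(x_i \mid x_1, \ldots, x_{i-1})$$

The architectural work is in computing these conditionals efficiently along both spatial dimensions.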

Based on the assumption that there exists a neural network that efficiently
represents a set of Boolean functions between all binary inputs and outputs, we
propose a process for developing and deploying neural networks whose weight
parameters, bias terms, input, and intermediate hidden layer output signals,
are all binary-valued, and require only basic bit logic for the feedforward
pass. The proposed Bitwise Neural Network (BNN) is especially suitable for
resource-constrained environments, since it replaces either floating or
fixed-point arithmetic with significantly more efficient bitwise operations.
Hence, the BNN requires less spatial complexity, less memory bandwidth, and
less power consumption in hardware. In order to design such networks, we
propose to add a few training schemes, such as weight compression and noisy
backpropagation, which result in a bitwise network that performs almost as well
as its corresponding real-valued network. We test the proposed network on the
MNIST dataset, represented using binary features, and show that BNNs result in
competitive performance while offering dramatic computational savings.
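
The feedforward arithmetic the abstract refers to reduces, for weights and activations in {-1, +1}, to an XNOR followed by a population count:

```python
from itertools import product

def pack(v):
    # Pack a +/-1 vector into an integer bit mask (bit i set <=> +1).
    return sum(1 << i for i, s in enumerate(v) if s == 1)

def bitdot(x_bits, w_bits, n):
    # Dot product of two packed {-1, +1}^n vectors: XNOR, then popcount.
    agree = ~(x_bits ^ w_bits) & ((1 << n) - 1)   # positions that match
    return 2 * bin(agree).count("1") - n          # matches minus mismatches

def ref_dot(x, w):
    # Reference +/-1 dot product, for comparison.
    return sum(a * b for a, b in zip(x, w))
```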

In this paper we present a connection between two dynamical systems arising
in entirely different contexts: one in signal processing and the other in
biology. The first is the famous Iteratively Reweighted Least Squares (IRLS)
algorithm used in compressed sensing and sparse recovery while the second is
the dynamics of a slime mold (Physarum polycephalum). Both of these dynamics
are geared towards finding a minimum l1-norm solution in an affine subspace.
Despite its simplicity the convergence of the IRLS method has been shown only
for a certain regularization of it and remains an important open problem. Our
first result shows that the two dynamics are projections of the same dynamical
system in higher dimensions. As a consequence, and building on the recent work
on Physarum dynamics, we are able to prove convergence and obtain complexity
bounds for a damped version of the IRLS algorithm.
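
The IRLS iteration itself is short: each step solves a least-squares problem reweighted by the previous iterate's magnitudes (sketch with an explicit regularization eps, the kind of damping the convergence caveat refers to; variable names are mine):

```python
import numpy as np

def irls_l1(A, b, iters=50, eps=1e-6):
    # IRLS for min ||x||_1 subject to Ax = b.
    x = np.linalg.lstsq(A, b, rcond=None)[0]       # least-norm start
    for _ in range(iters):
        W = np.diag(np.abs(x) + eps)               # reweighting
        x = W @ A.T @ np.linalg.solve(A @ W @ A.T, b)
    return x

# Underdetermined system with a sparse solution.
rng = np.random.default_rng(0)
m, n = 10, 20
A = rng.normal(size=(m, n))
x0 = np.zeros(n)
x0[[3, 11]] = [1.0, -2.0]   # a 2-sparse signal
b = A @ x0
x_hat = irls_l1(A, b)
```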

Recurrent Neural Networks (RNN) have obtained excellent results in many
natural language processing (NLP) tasks. However, understanding and
interpreting the source of this success remains a challenge. In this paper, we
propose Recurrent Memory Network (RMN), a novel RNN architecture that not only
amplifies the power of RNN but also facilitates our understanding of its
internal functioning and allows us to discover underlying patterns in data. We
demonstrate the power of RMN on language modeling and sentence completion
tasks. On language modeling, RMN outperforms Long Short-Term Memory (LSTM)
network on three large German, Italian, and English datasets. Additionally, we
perform in-depth analysis of various linguistic dimensions that RMN captures.
On Sentence Completion Challenge, for which it is essential to capture sentence
coherence, our RMN obtains 69.2% accuracy, surpassing the previous
state-of-the-art by a large margin.

One of the core problems of modern statistics is to approximate
difficult-to-compute probability densities. This problem is especially
important in Bayesian statistics, which frames all inference about unknown
quantities as a calculation involving the posterior density. In this paper, we
review variational inference (VI), a method from machine learning that
approximates probability densities through optimization. VI has been used in
many applications and tends to be faster than classical methods, such as Markov
chain Monte Carlo sampling. The idea behind VI is to first posit a family of
densities and then to find the member of that family which is close to the
target. Closeness is measured by Kullback-Leibler divergence. We review the
ideas behind mean-field variational inference, discuss the special case of VI
applied to exponential family models, present a full example with a Bayesian
mixture of Gaussians, and derive a variant that uses stochastic optimization to
scale up to massive data. We discuss modern research in VI and highlight
important open problems. VI is powerful, but it is not yet well understood. Our
hope in writing this paper is to catalyze statistical research on this class of
algorithms.
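
The optimization problem can be stated compactly. For latent variables $z$ and data $x$, VI minimizes $\mathrm{KL}(q(z)\,\|\,p(z \mid x))$ over the chosen family, which is equivalent to maximizing the evidence lower bound (ELBO), since for any $q$:

$$\log p(x) = \underbrace{\mathbb{E}_{q}\left[\log p(x, z) - \log q(z)\right]}_{\mathrm{ELBO}(q)} + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)$$

Because the KL term is nonnegative and $\log p(x)$ is a fixed constant, maximizing the ELBO minimizes the divergence to the target.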

Recurrent neural networks (RNNs) stand at the forefront of many recent
developments in deep learning. Yet a major difficulty with these models is
their tendency to overfit, with dropout shown to fail when applied to recurrent
layers. Recent results at the intersection of Bayesian modelling and deep
learning offer a Bayesian interpretation of common deep learning techniques
such as dropout. This grounding of dropout in approximate Bayesian inference
suggests an extension of the theoretical results, offering insights into the
use of dropout with RNN models. We apply this new variational inference based
dropout technique in LSTM and GRU models, assessing it on language modelling
and sentiment analysis tasks. The new approach outperforms existing techniques,
and to the best of our knowledge improves on the single model state-of-the-art
in language modelling with the Penn Treebank (73.4 test perplexity). This
extends our arsenal of variational tools in deep learning.
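
The practical upshot of the Bayesian interpretation is that dropout masks should be sampled once per sequence and held fixed across timesteps, rather than resampled at each step; a minimal tanh-RNN sketch (the paper applies this to LSTM and GRU gates):

```python
import numpy as np

def rnn_variational_dropout(xs, Wx, Wh, keep=0.8, rng=None):
    # Sample ONE dropout mask per sequence and reuse it at every timestep
    # on the recurrent state: the variational scheme, as opposed to naive
    # per-step dropout on recurrent connections.
    rng = rng or np.random.default_rng()
    mask = (rng.random(Wh.shape[0]) < keep) / keep   # fixed for the sequence
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ (h * mask))
    return h

rng = np.random.default_rng(0)
Wx, Wh = rng.normal(size=(4, 3)), rng.normal(size=(4, 4))
xs = [rng.normal(size=3) for _ in range(6)]
h1 = rnn_variational_dropout(xs, Wx, Wh, rng=np.random.default_rng(1))
h2 = rnn_variational_dropout(xs, Wx, Wh, rng=np.random.default_rng(1))
```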