Papers

Embedding words in a vector space has gained a lot of attention in recent
years. While state-of-the-art methods provide efficient computation of word
similarities via a low-dimensional matrix embedding, their motivation is often
left unclear. In this paper, we argue that word embedding can be naturally
viewed as a ranking problem due to the ranking nature of the evaluation
metrics. Then, based on this insight, we propose a novel framework WordRank
that efficiently estimates word representations via robust ranking, in which
the attention mechanism and robustness to noise are readily achieved via the
DCG-like ranking losses. The performance of WordRank is measured in word
similarity and word analogy benchmarks, and the results are compared to the
state-of-the-art word embedding techniques. Our algorithm is very competitive
with the state of the art on large corpora, while outperforming it by a
significant margin when the training set is limited (i.e., sparse and noisy).
With 17 million tokens, WordRank performs almost as well as existing methods
using 7.2 billion tokens on a popular word similarity benchmark. Our multi-node
distributed implementation of WordRank is publicly available for general usage.

Policy gradient methods are an appealing approach in reinforcement learning
because they directly optimize the cumulative reward and can straightforwardly
be used with nonlinear function approximators such as neural networks. The two
main challenges are the large number of samples typically required, and the
difficulty of obtaining stable and steady improvement despite the
nonstationarity of the incoming data. We address the first challenge by using
value functions to substantially reduce the variance of policy gradient
estimates at the cost of some bias, with an exponentially-weighted estimator of
the advantage function that is analogous to TD(lambda). We address the second
challenge by using a trust region optimization procedure for both the policy and
the value function, which are represented by neural networks.
Our approach yields strong empirical results on highly challenging 3D
locomotion tasks, learning running gaits for bipedal and quadrupedal simulated
robots, and learning a policy for getting the biped to stand up from starting
out lying on the ground. In contrast to a body of prior work that uses
hand-crafted policy representations, our neural network policies map directly
from raw kinematics to joint torques. Our algorithm is fully model-free, and
the amount of simulated experience required for the learning tasks on 3D bipeds
corresponds to 1-2 weeks of real time.
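
As a rough illustration of the exponentially-weighted advantage estimator
mentioned above, the following is a minimal sketch, not the authors'
implementation. It assumes the commonly cited recursive form, in which one-step
TD residuals are accumulated backwards in time with decay gamma * lam. The
function name gae_advantages, the argument layout, and the default values are
hypothetical choices made for this example.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Exponentially weighted advantage estimates for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T); one more entry than rewards, the last one
             being the bootstrap value of the final state (assumed layout).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

In this sketch, lam = 0 reduces each estimate to the one-step TD residual
(lower variance, more bias), while lam = 1 gives the discounted return minus
the value baseline (higher variance, less bias).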

In this paper, we explore the inclusion of latent random variables into the
dynamic hidden state of a recurrent neural network (RNN) by combining elements
of the variational autoencoder. We argue that through the use of high-level
latent random variables, the variational RNN (VRNN) can model the kind of
variability observed in highly structured sequential data such as natural
speech. We empirically evaluate the proposed model against related sequential
models on four speech datasets and one handwriting dataset. Our results show
the important roles that latent random variables can play in the RNN dynamic
hidden state.

Learning a distinct representation for each sense of an ambiguous word could
lead to more powerful and fine-grained models of vector-space representations.
Yet while 'multi-sense' methods have been proposed and tested on artificial
word-similarity tasks, we don't know if they improve real natural language
understanding tasks. In this paper we introduce a multi-sense embedding model
based on Chinese Restaurant Processes that achieves state-of-the-art
performance on matching human word similarity judgments, and propose a
pipelined architecture for incorporating multi-sense embeddings into language
understanding.
We then test the performance of our model on part-of-speech tagging, named
entity recognition, sentiment analysis, semantic relation identification and
semantic relatedness, controlling for embedding dimensionality. We find that
multi-sense embeddings do improve performance on some tasks (part-of-speech
tagging, semantic relation identification, semantic relatedness) but not on
others (named entity recognition, various forms of sentiment analysis). We
discuss how these differences may be caused by the different role of word sense
information in each of the tasks. The results highlight the importance of
testing embedding models in real applications.

While neural networks have been successfully applied to many NLP tasks, the
resulting vector-based models are very difficult to interpret. For example, it's
not clear how they achieve compositionality, building sentence meaning
from the meanings of words and phrases. In this paper we describe four
strategies for visualizing compositionality in neural models for NLP, inspired
by similar work in computer vision. We first plot unit values to visualize
compositionality of negation, intensification, and concessive clauses, allowing
us to see well-known markedness asymmetries in negation. We then introduce three
simple and straightforward methods for visualizing a unit's salience, the
amount it contributes to the final composed meaning: (1) gradient
back-propagation, (2) the variance of a token from the average word node, (3)
LSTM-style gates that measure information flow. We test our methods on
sentiment using simple recurrent nets and LSTMs. Our general-purpose methods
may have wide applications for understanding compositionality and other
semantic properties of deep networks, and also shed light on why LSTMs
outperform simple recurrent nets.

It is widely agreed that successful training of deep networks requires
many thousand annotated training samples. In this paper, we present a network
and training strategy that relies on the strong use of data augmentation to use
the available annotated samples more efficiently. The architecture consists of
a contracting path to capture context and a symmetric expanding path that
enables precise localization. We show that such a network can be trained
end-to-end from very few images and outperforms the prior best method (a
sliding-window convolutional network) on the ISBI challenge for segmentation of
neuronal structures in electron microscopic stacks. Using the same network
trained on transmitted light microscopy images (phase contrast and DIC) we won
the ISBI cell tracking challenge 2015 in these categories by a large margin.
Moreover, the network is fast. Segmentation of a 512x512 image takes less than
a second on a recent GPU. The full implementation (based on Caffe) and the
trained networks are available at
http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

Two recent approaches have achieved state-of-the-art results in image
captioning. The first uses a pipelined process where a set of candidate words
is generated by a convolutional neural network (CNN) trained on images, and
then a maximum entropy (ME) language model is used to arrange these words into
a coherent sentence. The second uses the penultimate activation layer of the
CNN as input to a recurrent neural network (RNN) that then generates the
caption sequence. In this paper, we compare the merits of these different
language modeling approaches for the first time by using the same
state-of-the-art CNN as input. We examine issues in the different approaches,
including linguistic irregularities, caption repetition, and data set overlap.
By combining key aspects of the ME and RNN methods, we achieve a new record
performance over previously published results on the benchmark COCO dataset.
However, the gains we see in BLEU do not translate to human judgments.

In this paper, we propose the new fixed-size ordinally-forgetting encoding
(FOFE) method, which can almost uniquely encode any variable-length sequence of
words into a fixed-size representation. FOFE can model the word order in a
sequence using a simple ordinally-forgetting mechanism according to the
positions of words. In this work, we have applied FOFE to feedforward neural
network language models (FNN-LMs). Experimental results have shown that without
using any recurrent feedback, FOFE-based FNN-LMs can significantly outperform
not only the standard fixed-input FNN-LMs but also the popular RNN-LMs.
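
To make the ordinally-forgetting mechanism more concrete, here is a minimal
sketch. It assumes the recursive form z_t = alpha * z_{t-1} + e_t, where e_t is
the one-hot vector of the t-th word and alpha is a forgetting factor between 0
and 1. The function name fofe_encode, the integer word ids, and the default
alpha are hypothetical choices for illustration only.

```python
def fofe_encode(token_ids, vocab_size, alpha=0.7):
    """Encode a variable-length sequence of word ids into one fixed-size vector.

    Assumed recursion: z_t = alpha * z_{t-1} + one_hot(token_t), with z_0 = 0.
    Earlier words are scaled by alpha once per later word, so position is
    reflected in the magnitude of each entry.
    """
    z = [0.0] * vocab_size
    for idx in token_ids:
        if not 0 <= idx < vocab_size:
            raise ValueError("token id out of vocabulary range")
        # Decay everything seen so far, then add the current word.
        z = [alpha * value for value in z]
        z[idx] += 1.0
    return z
```

For example, with a three-word vocabulary and alpha = 0.5, the sequence
[0, 1, 2] is encoded as [0.25, 0.5, 1.0], whereas the reversed sequence
[2, 1, 0] is encoded as [1.0, 0.5, 0.25]. The output length depends only on
the vocabulary size, not on the sequence length.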

In this paper, we propose a deep neural network architecture for object
recognition based on recurrent neural networks. The proposed network, called
ReNet, replaces the ubiquitous convolution+pooling layer of the deep
convolutional neural network with four recurrent neural networks that sweep
horizontally and vertically in both directions across the image. We evaluate
the proposed ReNet on three widely-used benchmark datasets: MNIST, CIFAR-10 and
SVHN. The result suggests that ReNet is a viable alternative to the deep
convolutional neural network, and that further investigation is needed.

Is he/she my type or not? The answer to this question depends on the personal
preferences of the one asking it. The individual process of obtaining a full
answer may generally be difficult and time consuming, but often an approximate
answer can be obtained simply by looking at a photo of the potential match.
Such approximate answers based on visual cues can be produced in a fraction of
a second, a phenomenon that has led to a series of recently successful dating
apps in which users rate others positively or negatively using primarily a
single photo. In this paper we explore using convolutional networks to create a
model of an individual's personal preferences based on rated photos. This
introduced task is difficult due to the large number of variations in profile
pictures and the noise in attractiveness labels. Toward this task we collect a
dataset comprised of 9364 pictures and binary labels for each. We compare
performance of convolutional models trained in three ways: first directly on
the collected dataset, second with features transferred from a network trained
to predict gender, and third with features transferred from a network trained
on ImageNet. Our findings show that ImageNet features transfer best, producing
a model that attains 68.1% accuracy on the test set and is moderately
successful at predicting matches.

Learning long term dependencies in recurrent networks is difficult due to
vanishing and exploding gradients. To overcome this difficulty, researchers
have developed sophisticated optimization techniques and network architectures.
In this paper, we propose a simpler solution that uses recurrent neural networks
composed of rectified linear units. Key to our solution is the use of the
identity matrix or its scaled version to initialize the recurrent weight
matrix. We find that our solution is comparable to LSTM on our four benchmarks:
two toy problems involving long-range temporal structures, a large language
modeling problem and a benchmark speech recognition problem.
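
The initialization idea can be sketched in a few lines. This is an
illustrative toy under stated assumptions, not the authors' code: the
recurrent weight matrix starts as a (possibly scaled) identity, the activation
is a rectified linear unit, and the helper names identity_matrix and irnn_step
are hypothetical.

```python
def identity_matrix(n, scale=1.0):
    """Return an n x n matrix with `scale` on the diagonal and zeros elsewhere."""
    return [[scale if i == j else 0.0 for j in range(n)] for i in range(n)]


def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def irnn_step(h, x, w_rec, w_in, bias):
    """One recurrent step: h' = relu(w_rec @ h + w_in @ x + bias)."""
    new_h = []
    for i in range(len(h)):
        pre = bias[i]
        pre += sum(w_rec[i][j] * h[j] for j in range(len(h)))
        pre += sum(w_in[i][k] * x[k] for k in range(len(x)))
        new_h.append(relu(pre))
    return new_h
```

With an unscaled identity, zero bias, and no input, a non-negative hidden state
is copied forward unchanged at every step, which is the property that helps
information survive over long time spans at the start of training. A scale
below 1 makes the state decay geometrically instead.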

Distributional models that learn rich semantic word representations are a
success story of recent NLP research. However, developing models that learn
useful representations of phrases and sentences has proved far harder. We
propose using the definitions found in everyday dictionaries as a means of
bridging this gap between lexical and phrasal semantics. Neural language
embedding models can be effectively trained to map dictionary definitions
(phrases) to (lexical) representations of the words defined by those
definitions. We present two applications of these architectures: "reverse
dictionaries" that return the name of a concept given a definition or
description and general-knowledge crossword question answerers. On both tasks,
neural language embedding models trained on definitions from a handful of
freely-available lexical resources perform as well as or better than existing
commercial systems that rely on significant task-specific engineering. The
results highlight the effectiveness of both neural embedding architectures and
definition-based training for developing models that understand phrases and
sentences.

Several variants of the Long Short-Term Memory (LSTM) architecture for
recurrent neural networks have been proposed since its inception in 1995. In
recent years, these networks have become the state-of-the-art models for a
variety of machine learning problems. This has led to a renewed interest in
understanding the role and utility of various computational components of
typical LSTM variants. In this paper, we present the first large-scale analysis
of eight LSTM variants on three representative tasks: speech recognition,
handwriting recognition, and polyphonic music modeling. The hyperparameters of
all LSTM variants for each task were optimized separately using random search
and their importance was assessed using the powerful fANOVA framework. In
total, we summarize the results of 5400 experimental runs (about 15 years of
CPU time), which makes our study the largest of its kind on LSTM networks. Our
results show that none of the variants can improve upon the standard LSTM
architecture significantly, and demonstrate the forget gate and the output
activation function to be its most critical components. We further observe that
the studied hyperparameters are virtually independent and derive guidelines for
their efficient adjustment.

One long-term goal of machine learning research is to produce methods that
are applicable to reasoning and natural language, in particular building an
intelligent dialogue agent. To measure progress towards that goal, we argue for
the usefulness of a set of proxy tasks that evaluate reading comprehension via
question answering. Our tasks measure understanding in several ways: whether a
system is able to answer questions via chaining facts, simple induction,
deduction and many more. The tasks are designed to be prerequisites for any
system that aims to be capable of conversing with a human. We believe many
existing learning systems cannot currently solve them, and hence our aim is to
classify these tasks into skill sets, so that researchers can identify (and
then rectify) the failings of their systems. We also extend and improve the
recently introduced Memory Networks model, and show it is able to solve some,
but not all, of the tasks.

Parameter-specific adaptive learning rate methods are computationally
efficient ways to reduce the ill-conditioning problems encountered when
training large deep networks. Following recent work that strongly suggests that
most of the critical points encountered when training such networks are saddle
points, we examine how considering the presence of negative eigenvalues of the
Hessian could help us design better suited adaptive learning rate schemes. We
show that the popular Jacobi preconditioner has undesirable behavior in the
presence of both positive and negative curvature, and present theoretical and
empirical evidence that the so-called equilibration preconditioner is
comparatively better suited to non-convex problems. We introduce a novel
adaptive learning rate scheme, called ESGD, based on the equilibration
preconditioner. Our experiments show that ESGD performs as well as or better than
RMSProp in terms of convergence speed, always clearly improving over plain
stochastic gradient descent.
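
The contrast between the two preconditioners can be illustrated on a tiny
explicit Hessian. This sketch assumes the usual definitions: the Jacobi
preconditioner uses the absolute diagonal of the Hessian, and the equilibration
preconditioner uses the row norms of the Hessian, which can be estimated as
sqrt(E[(Hv)^2]) for Gaussian v. A real implementation would obtain Hv from
Hessian-vector products rather than from an explicit matrix; all function names
below are hypothetical.

```python
import math
import random


def mat_vec(h, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in h]


def jacobi_diag(h):
    """Jacobi preconditioner: absolute value of the Hessian diagonal."""
    return [abs(h[i][i]) for i in range(len(h))]


def row_norms(h):
    """Exact equilibration preconditioner: Euclidean norm of each row."""
    return [math.sqrt(sum(a * a for a in row)) for row in h]


def equilibration_diag(h, num_samples=5000, rng=None):
    """Monte Carlo estimate of the row norms using only products H @ v."""
    rng = rng or random.Random(0)
    n = len(h)
    acc = [0.0] * n
    for _ in range(num_samples):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        hv = mat_vec(h, v)
        for i in range(n):
            acc[i] += hv[i] ** 2
    return [math.sqrt(a / num_samples) for a in acc]
```

For the saddle-shaped matrix [[0, 1], [1, 0]], jacobi_diag returns zeros and so
gives no usable scaling, while row_norms returns [1.0, 1.0]. This is only a toy
case, but it hints at why off-diagonal curvature matters for the choice of
preconditioner.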

Interstellar is the first Hollywood movie to attempt depicting a black hole
as it would actually be seen by somebody nearby. For this we developed a code
called DNGR (Double Negative Gravitational Renderer) to solve the equations for
ray-bundle (light-beam) propagation through the curved spacetime of a spinning
(Kerr) black hole, and to render IMAX-quality, rapidly changing images. Our
ray-bundle techniques were crucial for achieving IMAX-quality smoothness
without flickering.
This paper has four purposes: (i) To describe DNGR for physicists and CGI
practitioners. (ii) To present the equations we use, when the camera is in
arbitrary motion at an arbitrary location near a Kerr black hole, for mapping
light sources to camera images via elliptical ray bundles. (iii) To describe
new insights, from DNGR, into gravitational lensing when the camera is near the
spinning black hole, rather than far away as in almost all prior studies. (iv)
To describe how the images of the black hole Gargantua and its accretion disk,
in the movie Interstellar, were generated with DNGR. There are no new
astrophysical insights in this accretion-disk section of the paper, but disk
novices may find it pedagogically interesting, and movie buffs may find its
discussions of Interstellar interesting.

Pixel-level labelling tasks, such as semantic segmentation, play a central
role in image understanding. Recent approaches have attempted to harness the
capabilities of deep learning techniques for image recognition to tackle
pixel-level labelling tasks. One central issue in this methodology is the
limited capacity of deep learning techniques to delineate visual objects. To
solve this problem, we introduce a new form of convolutional neural network
that combines the strengths of Convolutional Neural Networks (CNNs) and
Conditional Random Fields (CRFs)-based probabilistic graphical modelling. To
this end, we formulate mean-field approximate inference for the Conditional
Random Fields with Gaussian pairwise potentials as Recurrent Neural Networks.
This network, called CRF-RNN, is then plugged in as a part of a CNN to obtain a
deep network that has desirable properties of both CNNs and CRFs. Importantly,
our system fully integrates CRF modelling with CNNs, making it possible to
train the whole deep network end-to-end with the usual back-propagation
algorithm, avoiding offline post-processing methods for object delineation. We
apply the proposed method to the problem of semantic image segmentation,
obtaining top results on the challenging Pascal VOC 2012 segmentation
benchmark.

Inspired by recent work in machine translation and object detection, we
introduce an attention based model that automatically learns to describe the
content of images. We describe how we can train this model in a deterministic
manner using standard backpropagation techniques and stochastically by
maximizing a variational lower bound. We also show through visualization how
the model is able to automatically learn to fix its gaze on salient objects
while generating the corresponding words in the output sequence. We validate
the use of attention with state-of-the-art performance on three benchmark
datasets: Flickr8k, Flickr30k and MS COCO.
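
The deterministic variant can be pictured as a softmax-weighted average over
image region features, which is differentiable and therefore trainable with
standard backpropagation. The sketch below is a simplified illustration under
that assumption. In particular, the relevance scores are passed in directly,
whereas a real model would compute them from the decoder state; the names
softmax and soft_attention are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, scores):
    """Return attention weights and the weighted average of region features.

    features: one feature vector per image region (all the same length)
    scores:   one unnormalized relevance score per region (assumed given)
    """
    if len(features) != len(scores):
        raise ValueError("need exactly one score per region")
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return weights, context
```

Equal scores give a plain average of the regions, while one dominant score
makes the context vector approach that region's features. The weights
themselves can be plotted over the image to show where the model is looking
for each generated word.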

Training of large-scale deep neural networks is often constrained by the
available computational resources. We study the effect of limited precision
data representation and computation on neural network training. Within the
context of low-precision fixed-point computations, we observe the rounding
scheme to play a crucial role in determining the network's behavior during
training. Our results show that deep networks can be trained using only 16-bit
wide fixed-point number representation when using stochastic rounding, and
incur little to no degradation in the classification accuracy. We also
demonstrate an energy-efficient hardware accelerator that implements
low-precision fixed-point arithmetic with stochastic rounding.
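
Stochastic rounding can be sketched as follows. A value is rounded to one of
its two nearest fixed-point neighbours, with the probability of rounding up
equal to its fractional distance from the lower neighbour, so the result is
unbiased in expectation. The function name stochastic_round, the split into
word_bits and frac_bits, and the saturation to a signed range are assumptions
made for this example, not details taken from the abstract.

```python
import math
import random


def stochastic_round(x, frac_bits=8, word_bits=16, rng=None):
    """Round x to a signed fixed-point grid with step 2**-frac_bits.

    The value is rounded up with probability equal to its fractional position
    between the two neighbouring grid points, then saturated to the range of
    a signed integer with `word_bits` bits (assumed format).
    """
    rng = rng or random
    eps = 2.0 ** -frac_bits
    scaled = x / eps
    lower = math.floor(scaled)
    prob_up = scaled - lower  # in [0, 1); 0 when x is already on the grid
    q = lower + 1 if rng.random() < prob_up else lower
    # Saturate to the representable signed integer range.
    q_max = 2 ** (word_bits - 1) - 1
    q_min = -(2 ** (word_bits - 1))
    q = max(q_min, min(q_max, q))
    return q * eps
```

Unlike round-to-nearest, this scheme does not systematically discard updates
smaller than half a step: a small gradient contribution still has a
proportional chance of changing the stored value, which keeps the average
update correct over many steps.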

This article demonstrates that we can apply deep learning to text
understanding from character-level inputs all the way up to abstract text
concepts, using temporal convolutional networks (ConvNets). We apply ConvNets
to various large-scale datasets, including ontology classification, sentiment
analysis, and text categorization. We show that temporal ConvNets can achieve
astonishing performance without knowledge of words, phrases, sentences, or
any other syntactic or semantic structures with regard to a human language.
Evidence shows that our models can work for both English and Chinese.