Papers

We introduce a new language representation model called BERT, which stands
for Bidirectional Encoder Representations from Transformers. Unlike recent
language representation models, BERT is designed to pre-train deep
bidirectional representations by jointly conditioning on both left and right
context in all layers. As a result, the pre-trained BERT representations can be
fine-tuned with just one additional output layer to create state-of-the-art
models for a wide range of tasks, such as question answering and language
inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new
state-of-the-art results on eleven natural language processing tasks, including
pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI
accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question
answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human
performance by 2.0%.
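
To make the fine-tuning recipe concrete, here is a minimal sketch of adding a single classification layer on top of pre-trained BERT for a three-way inference task. The Hugging Face transformers library, the checkpoint name, and the example inputs are our assumptions for illustration, not part of the paper.

```python
# Minimal sketch: fine-tune pre-trained BERT with one added output layer for
# three-way language inference. Assumes the Hugging Face `transformers`
# library and the `bert-base-uncased` checkpoint.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)   # entailment / neutral / contradiction

# Encode a premise-hypothesis pair; BERT attends to both jointly, in all layers.
batch = tokenizer("A man is eating.", "Someone is having a meal.",
                  return_tensors="pt")
labels = torch.tensor([0])               # hypothetical gold label

loss = model(**batch, labels=labels).loss
loss.backward()                          # gradients reach every pre-trained layer
```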

In many reinforcement learning tasks, the goal is to learn a policy to
manipulate an agent, whose design is fixed, to maximize some notion of
cumulative reward. The design of the agent's physical structure is rarely
optimized for the task at hand. In this work, we explore the possibility of
learning a version of the agent's design that is better suited for its task,
jointly with the policy. We propose a minor alteration to the OpenAI Gym
framework, where we parameterize parts of an environment, and allow an agent to
jointly learn to modify these environment parameters along with its policy. We
demonstrate that an agent can learn a better structure of its body that is not
only better suited for the task, but also facilitates policy learning. Joint
learning of policy and structure may even uncover design principles that are
useful for assisted-design applications. Videos of results at
https://designrl.github.io/
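
The joint-learning idea can be sketched abstractly: concatenate the policy and design parameters into a single vector and optimize both against episode return. The evolution-strategy sketch below is purely illustrative; its toy rollout stands in for a Gym environment whose physics depend on the design vector, and it is not the authors' setup.

```python
# Illustrative sketch: jointly optimize policy weights and body-design
# parameters with a simple evolution strategy. `rollout` is a made-up
# stand-in for an environment whose dynamics depend on the design vector.
import numpy as np

rng = np.random.default_rng(0)
n_policy, n_design = 16, 4
theta = np.zeros(n_policy + n_design)        # one vector: [policy | design]

def rollout(params):
    policy, design = params[:n_policy], params[n_policy:]
    # Hypothetical episode return whose optimum depends on the design.
    return -np.sum((policy - design.mean()) ** 2) - np.sum((design - 1.0) ** 2)

sigma, lr, pop = 0.1, 0.02, 64
for step in range(200):
    eps = rng.standard_normal((pop, theta.size))
    returns = np.array([rollout(theta + sigma * e) for e in eps])
    advantage = (returns - returns.mean()) / (returns.std() + 1e-8)
    theta += lr / (pop * sigma) * eps.T @ advantage   # ES gradient estimate

print("learned design parameters:", theta[n_policy:])  # drift toward 1.0
```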

We present Optimal Completion Distillation (OCD), a training procedure for
optimizing sequence-to-sequence models based on edit distance. OCD is
efficient, has no hyper-parameters of its own, and does not require pretraining
or joint optimization with conditional log-likelihood. Given a partial sequence
generated by the model, we first identify the set of optimal suffixes that
minimize the total edit distance, using an efficient dynamic programming
algorithm. Then, for each position of the generated sequence, we use a target
distribution that puts equal probability on the first token of all the optimal
suffixes. OCD achieves state-of-the-art performance in end-to-end speech
recognition on both the Wall Street Journal and Librispeech datasets,
achieving $9.3\%$ and $4.5\%$ WER, respectively.
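
The dynamic program at the heart of OCD is compact enough to sketch. The function below is our illustration, not the released code: for a generated prefix, it returns the set of tokens that begin some minimum-edit-distance completion of the reference, which is exactly the set OCD's target distribution puts equal mass on.

```python
# Sketch of OCD's edit-distance dynamic program: which next tokens start an
# optimal suffix for a given generated prefix? Illustrative implementation.
def optimal_next_tokens(prefix, target, eos="</s>"):
    m = len(target)
    # row[j] = edit distance between the consumed prefix and target[:j]
    row = list(range(m + 1))
    for tok in prefix:
        new = [row[0] + 1]
        for j in range(1, m + 1):
            new.append(min(new[j - 1] + 1,                        # insertion
                           row[j] + 1,                            # deletion
                           row[j - 1] + (tok != target[j - 1])))  # sub/match
        row = new
    best = min(row)
    # Continuing with target[j] is optimal from any j achieving the minimum;
    # if the whole target is optimally matched, ending with EOS is optimal.
    nexts = {target[j] for j in range(m) if row[j] == best}
    if row[m] == best:
        nexts.add(eos)
    return nexts

print(optimal_next_tokens(list("sun"), list("saturday")))  # {'a', 't', 'u'}
```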

In spite of remarkable progress in deep latent variable generative modeling,
training still remains a challenge due to a combination of optimization and
generalization issues. In practice, a collection of heuristic algorithms (such
as hand-crafted annealing of KL terms) is often used in order to achieve the
desired results, but such solutions are not robust to changes in model
architecture or dataset. The best settings can often vary dramatically from one
problem to another, which requires doing expensive parameter sweeps for each
new case. Here we build on the idea of training VAEs with additional
constraints as a way to control their behaviour. We first present a detailed
theoretical analysis of constrained VAEs, expanding our understanding of how
these models work. We then introduce and analyze a practical algorithm termed
Generalized ELBO with Constrained Optimization, GECO. The main advantage of
GECO for the machine learning practitioner is a more intuitive, yet principled,
process of tuning the loss. This involves defining a set of constraints,
which typically have an explicit relation to the desired model performance, in
contrast to tweaking abstract hyper-parameters which implicitly affect the
model behavior. Encouraging experimental results on several standard datasets
indicate that GECO is a robust and effective tool for balancing
reconstruction and compression constraints.
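
One common way to realize such a constrained objective is a Lagrangian whose multiplier adapts to a moving average of the constraint violation. The PyTorch sketch below follows that pattern; the class name, constants, and the exact update rule are illustrative choices, not the paper's algorithm.

```python
# Sketch of a GECO-style step in PyTorch: minimize KL subject to a
# reconstruction constraint E[rec_err] <= tol, with a Lagrange multiplier
# adapted from a moving average of the constraint. Illustrative throughout.
import torch

class GECO:
    def __init__(self, tol, alpha=0.99, speed=1e-2):
        self.tol, self.alpha, self.speed = tol, alpha, speed
        self.log_lam = torch.zeros(())     # multiplier in log space: lam > 0
        self.c_ma = None                   # moving average of the constraint

    def loss(self, kl, rec_err):
        c = rec_err - self.tol             # constraint: want c <= 0
        c_det = c.detach()
        self.c_ma = c_det if self.c_ma is None else \
            self.alpha * self.c_ma + (1 - self.alpha) * c_det
        self.log_lam = self.log_lam + self.speed * self.c_ma  # grow while violated
        return kl + self.log_lam.exp() * c # generalized ELBO: KL + lam * constraint

# toy usage with made-up scalar losses
kl = torch.tensor(2.0, requires_grad=True)
rec_err = torch.tensor(0.9, requires_grad=True)
GECO(tol=0.5).loss(kl, rec_err).backward()
```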

Despite recent progress in generative image modeling, successfully generating
high-resolution, diverse samples from complex datasets such as ImageNet remains
an elusive goal. To this end, we train Generative Adversarial Networks at the
largest scale yet attempted, and study the instabilities specific to such
scale. We find that applying orthogonal regularization to the generator renders
it amenable to a simple "truncation trick", allowing fine control over the
trade-off between sample fidelity and variety by truncating the latent space.
Our modifications lead to models which set the new state of the art in
class-conditional image synthesis. When trained on ImageNet at 128x128
resolution, our models (BigGANs) achieve an Inception Score (IS) of 166.3 and
Fréchet Inception Distance (FID) of 9.6, improving over the previous best IS of
52.52 and FID of 18.65.
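
Both ingredients named above are short to write down. The PyTorch sketch below shows a resampling implementation of the truncation trick and a BigGAN-style orthogonal penalty; the threshold, coefficient, and the choice of which weights to regularize are our assumptions.

```python
# Illustrative PyTorch sketch: the truncation trick (resample latent
# coordinates beyond a threshold) and a BigGAN-style orthogonal penalty.
import torch

def truncated_z(batch, dim, threshold=0.5):
    """Truncation trick: lower thresholds give higher fidelity, less variety."""
    z = torch.randn(batch, dim)
    while True:
        mask = z.abs() > threshold
        if not mask.any():
            return z
        z[mask] = torch.randn(int(mask.sum()))   # resample out-of-range draws

def orthogonal_penalty(model, beta=1e-4):
    """beta * || W W^T * (1 - I) ||_F^2, summed over flattened weights."""
    reg = 0.0
    for p in model.parameters():
        if p.ndim >= 2:
            w = p.reshape(p.size(0), -1)
            gram = w @ w.t()
            off_diag = gram * (1 - torch.eye(gram.size(0), device=w.device))
            reg = reg + (off_diag ** 2).sum()
    return beta * reg

z = truncated_z(8, 128)
penalty = orthogonal_penalty(torch.nn.Linear(128, 64))
```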

The ability to exploit the opportunities offered by AI within UK Defence
calls for an understanding of systemic issues required to achieve an effective
operational capability. This paper provides the authors' views of issues which
currently block UK Defence from fully benefiting from AI technology. These are
situated within a reference model for the AI Value Train, enabling the
community to address the exploitation of such data- and software-intensive
systems in a systematic, end-to-end manner. The paper sets out the conditions
for success, including: researching future solutions to known problems and
clearly defined use cases; addressing achievable use cases to show benefit;
enhancing the availability of Defence-relevant data; enhancing Defence
'know-how' in AI; operating software-intensive supply-chain ecosystems at the
required breadth and pace; and governance, together with the integration of
software and platform supply chains and operating models.

Topological Data Analysis (TDA) is the collection of mathematical tools that
capture the structure of shapes in data. Although computational topology and
computational geometry are well-established fields, the use of TDA in time
series and signal processing is relatively new. In some recent contributions,
TDA has been used as an alternative to conventional signal processing methods,
specifically to deal with noisy signals and time series. In these
applications, TDA treats the shape of the data as its primary feature, while
other properties are assumed to be much less informative.
In this paper, we review recent developments and contributions in which
topological data analysis, especially persistent homology, has been applied to
time series analysis, dynamical systems and signal processing. We cover
problem statements such as stability determination, risk analysis, systems
behaviour, and predicting critical transitions in financial markets.
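
As one concrete pattern from this literature, a time series can be turned into a point cloud by a sliding-window (Takens) embedding and its persistent homology then computed; periodicity shows up as a long-lived loop. The sketch below uses the ripser library; the library choice and the embedding parameters are our assumptions, not prescriptions of the survey.

```python
# Sketch: persistent homology of a time series via a sliding-window (Takens)
# embedding, computed with the `ripser` library. Parameters are illustrative.
import numpy as np
from ripser import ripser

def sliding_window(x, dim=3, tau=2):
    """Embed a 1-D series into R^dim with delay tau."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i : i + n] for i in range(0, dim * tau, tau)], axis=1)

t = np.linspace(0, 8 * np.pi, 400)
signal = np.sin(t) + 0.1 * np.random.default_rng(0).standard_normal(t.size)

diagrams = ripser(sliding_window(signal))["dgms"]   # H0 and H1 diagrams
# A periodic signal appears as a long-lived 1-dimensional feature (a loop).
print("most persistent H1 lifetime:", max(d[1] - d[0] for d in diagrams[1]))
```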

Through many recent successes in simulation, model-free reinforcement
learning has emerged as a promising approach to solving continuous control
robotic tasks. The research community is now able to reproduce, analyze and
build quickly on these results due to open source implementations of learning
algorithms and simulated benchmark tasks. To carry these successes forward to
real-world applications, it is crucial to refrain from exploiting the unique
advantages of simulation that do not transfer to the real world and to
experiment directly with physical robots. However, reinforcement learning research with
physical robots faces substantial resistance due to the lack of benchmark tasks
and supporting source code. In this work, we introduce several reinforcement
learning tasks with multiple commercially available robots that present varying
levels of learning difficulty, setup, and repeatability. On these tasks, we
test the learning performance of off-the-shelf implementations of four
reinforcement learning algorithms and analyze sensitivity to their
hyper-parameters to determine their readiness for applications in various
real-world tasks. Our results show that, with a careful setup of the task
interface and computations, some of these implementations can be readily
applied to physical robots. We find that state-of-the-art learning
algorithms are highly sensitive to their hyper-parameters and their relative
ordering does not transfer across tasks, indicating the necessity of re-tuning
them for each task for best performance. On the other hand, the best
hyper-parameter configuration from one task may often result in effective
learning on held-out tasks even with different robots, providing a reasonable
default. We make the benchmark tasks publicly available to enhance
reproducibility in real-world reinforcement learning.

Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2
have been proposed and achieve state-of-the-art performance, they still suffer
from two problems: 1) low efficiency during training and inference; and 2)
difficulty modeling long-range dependencies with current recurrent neural
networks (RNNs). Inspired by the
success of Transformer network in neural machine translation (NMT), in this
paper, we introduce and adapt the multi-head attention mechanism to replace the
RNN structures and also the original attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves training efficiency. Meanwhile, any
two inputs at different times are connected directly by a self-attention
mechanism, which solves the long-range dependency problem effectively. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. On efficiency, our Transformer TTS network speeds up training by
about 4.25 times compared with Tacotron2. On performance, rigorous human tests
show that our proposed model achieves state-of-the-art quality (outperforming
Tacotron2 by a gap of 0.048 in MOS) and comes very close to human quality
(4.39 vs. 4.44 MOS).
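
The architectural substitution described here can be sketched with standard building blocks. The toy encoder below shows how self-attention constructs all phoneme positions in parallel; the sizes are illustrative, and the paper's full model additionally has an autoregressive decoder, stop-token prediction, and a WaveNet vocoder.

```python
# Toy sketch: multi-head self-attention in place of RNNs, mapping phoneme
# embeddings toward mel-spectrogram frames. Sizes are illustrative only.
import torch
import torch.nn as nn

class TinyTransformerTTS(nn.Module):
    def __init__(self, n_phonemes=80, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids):
        # All positions are processed in parallel; self-attention connects
        # any two time steps directly, regardless of their distance.
        h = self.encoder(self.embed(phoneme_ids))
        return self.to_mel(h)

mel = TinyTransformerTTS()(torch.randint(0, 80, (2, 17)))
print(mel.shape)  # torch.Size([2, 17, 80])
```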

Procedural content generation via machine learning (PCGML) is typically
framed as the task of fitting a generative model to full-scale examples of a
desired content distribution. This approach presents a fundamental tension: the
more design effort expended to produce detailed training examples for shaping a
generator, the lower the return on investment from applying PCGML in the first
place. In response, we propose the use of discriminative models (which capture
the validity of a design rather than the distribution of the content) trained on
positive and negative examples. Through a modest modification of
WaveFunctionCollapse, a commercially-adopted PCG approach that we characterize
as using elementary machine learning, we demonstrate a new mode of control for
learning-based generators. We demonstrate how an artist might craft a focused
set of additional positive and negative examples by critique of the generator's
previous outputs. This interaction mode bridges PCGML with mixed-initiative
design assistance tools by working with a machine to define a space of valid
designs rather than just one new design.
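
To see what 'elementary machine learning' means here, consider the learning kernel of WaveFunctionCollapse reduced to its essence: legal tile adjacencies are induced from positive examples, and an artist's negative examples veto pairs. The grids below are made up, and real WFC learns NxN patterns in all directions with constraint propagation; only the positive/negative mechanism is the point.

```python
# Tiny illustration of adjacency learning with positive and negative
# examples. Real WaveFunctionCollapse uses NxN patterns in all directions
# plus constraint propagation; the grids here are invented.
positive = ["SSLL",
            "SLLL",
            "SSLL"]           # S = sea tile, L = land tile
negative = ["SLS"]            # critique: isolated land strips look wrong

def horizontal_pairs(grid):
    return {(row[i], row[i + 1]) for row in grid for i in range(len(row) - 1)}

allowed = horizontal_pairs(positive) - horizontal_pairs(negative)
print(sorted(allowed))        # [('L', 'L'), ('S', 'S')]: ('S', 'L') is vetoed
```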

Collaborative reasoning over each image-question pair is critical but
underexplored for interpretable visual question answering systems. Although
recent works have attempted to use explicit compositional processes to
assemble the multiple subtasks embedded in questions, their models rely
heavily on annotations or handcrafted rules to obtain valid reasoning
processes, leading either to heavy annotation workloads or to poor performance
on compositional reasoning. In this paper, to better align the image and
language domains in diverse and unrestricted cases, we propose a novel neural
network model that performs global reasoning on a dependency tree parsed from
the question; we thus term our model the parse-tree-guided reasoning network
(PTGRN). This network
consists of three collaborative modules: i) an attention module to exploit the
local visual evidence for each word parsed from the question, ii) a gated
residual composition module to compose the previously mined evidence, and iii)
a parse-tree-guided propagation module to pass the mined evidence along the
parse tree. Our PTGRN is thus capable of building an interpretable VQA system
that gradually derives the image cues following a question-driven parse-tree
reasoning route. Experiments on relational datasets demonstrate the superiority
of our PTGRN over current state-of-the-art VQA methods, and the visualization
results highlight the interpretability of our reasoning system.
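
As a rough shape of modules ii) and iii), the sketch below composes child evidence into its parent with a gated residual update, recursively along a tree. The feature vectors, gating form, and sizes are simplified stand-ins for PTGRN's actual modules.

```python
# Sketch: evidence mined at each word is composed bottom-up along a parse
# tree with a gated residual update. A simplified stand-in, not PTGRN itself.
import torch
import torch.nn as nn

class Node:
    def __init__(self, word_feat, children=()):
        self.feat, self.children = word_feat, list(children)

class TreePropagate(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.compose = nn.Linear(2 * d, d)
        self.gate = nn.Linear(2 * d, d)

    def forward(self, node):
        h = node.feat
        for child in node.children:          # pass mined evidence upward
            c = self.forward(child)
            g = torch.sigmoid(self.gate(torch.cat([h, c])))   # gated residual
            h = h + g * torch.tanh(self.compose(torch.cat([h, c])))
        return h

d = 64
root = Node(torch.randn(d), [Node(torch.randn(d)), Node(torch.randn(d))])
print(TreePropagate(d)(root).shape)          # torch.Size([64])
```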

Statistical significance testing plays an important role when drawing
conclusions from experimental results in NLP papers. In particular, it is a
valuable tool when one would like to establish the superiority of one algorithm
over another. This appendix complements the guide for testing statistical
significance in NLP presented in \cite{dror2018hitchhiker} by proposing valid
statistical tests for the common tasks and evaluation measures in the field.
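
One test commonly applied in this setting, when per-instance scores of two systems on the same test set are available, is the paired bootstrap. The sketch below is our illustration, not code from the appendix.

```python
# Illustrative paired bootstrap for comparing two systems on the same test
# instances (higher scores = better).
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Share of bootstrap resamples in which system A fails to beat B."""
    rng = np.random.default_rng(seed)
    delta = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    idx = rng.integers(0, delta.size, size=(n_resamples, delta.size))
    return float((delta[idx].mean(axis=1) <= 0).mean())

a = np.array([0.81, 0.92, 0.75, 0.88, 0.79])   # made-up per-instance scores
b = np.array([0.78, 0.90, 0.74, 0.83, 0.80])
print(paired_bootstrap_pvalue(a, b))
```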

A generative recurrent neural network is quickly trained in an unsupervised
manner to model popular reinforcement learning environments through compressed
spatio-temporal representations. The world model's extracted features are fed
into compact and simple policies trained by evolution, achieving
state-of-the-art results in various environments. We also train our agent entirely inside
an environment generated by its own internal world model, and transfer this
policy back into the actual environment. Interactive version of paper at
https://worldmodels.github.io
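
The division of labor described above, a large world model feeding a tiny controller, can be sketched as follows: the controller is a linear map on world-model features, evolved with CMA-ES. The feature function below is a random stand-in for the trained VAE and RNN, and the reward is made up; only the overall shape mirrors the paper.

```python
# Sketch: evolve a compact linear controller on world-model features with
# CMA-ES. `features` is a random stand-in for the trained VAE+RNN model.
import numpy as np
import cma

z_dim, h_dim, act_dim = 8, 16, 2

def features(rng):                       # stand-in for encode(obs), RNN state
    return rng.standard_normal(z_dim + h_dim)

def episode_return(flat_w, seed=0):
    rng = np.random.default_rng(seed)
    W = flat_w.reshape(act_dim, z_dim + h_dim)
    total = 0.0
    for _ in range(50):
        action = np.tanh(W @ features(rng))      # compact, simple policy
        total += -np.sum((action - 0.5) ** 2)    # made-up reward
    return total

es = cma.CMAEvolutionStrategy(np.zeros(act_dim * (z_dim + h_dim)), 0.5,
                              {"verbose": -9})
for _ in range(10):
    solutions = es.ask()
    es.tell(solutions, [-episode_return(s) for s in solutions])  # CMA minimizes
print("best return found:", -es.result.fbest)
```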

In neural text generation such as neural machine translation, summarization,
and image captioning, beam search is widely used to improve the output text
quality. However, in the neural generation setting, hypotheses can finish at
different steps, which makes it difficult to decide when to end beam search to
ensure optimality. We propose a provably optimal beam search algorithm that
will always return the optimal-score complete hypothesis (modulo beam size),
and finish as soon as the optimality is established (finishing no later than
the baseline). To counter neural generation's tendency for shorter hypotheses,
we also introduce a bounded length reward mechanism which allows a modified
version of our beam search algorithm to remain optimal. Experiments on neural
machine translation demonstrate that our principled beam search algorithm leads
to improvement in BLEU score over previously proposed alternatives.
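
The stopping rule that yields the certificate is simple to state: every extension adds a log-probability of at most zero, so once the best finished hypothesis scores at least as well as every live one, no continuation can overtake it. The sketch below is our distillation of that idea with a toy scoring model; the paper's algorithm additionally covers the bounded length reward.

```python
# Sketch of beam search with an optimality-certified stop: scores are sums
# of log-probabilities, so they only decrease as hypotheses grow. Toy model.
import math

def beam_search(next_logprobs, bos, eos, beam=4, max_len=20):
    live = [(0.0, [bos])]
    best_done = (-math.inf, None)
    for _ in range(max_len):
        cands = sorted(
            ((score + lp, seq + [tok])
             for score, seq in live
             for tok, lp in next_logprobs(seq).items()),
            key=lambda c: -c[0])
        live = []
        for score, seq in cands:
            if seq[-1] == eos:
                best_done = max(best_done, (score, seq))
            elif len(live) < beam:
                live.append((score, seq))
        # Certificate: no live hypothesis can ever beat the best finished one.
        if not live or best_done[0] >= live[0][0]:
            break
    return best_done

toy = lambda seq: {"a": math.log(0.3), "b": math.log(0.5), "</s>": math.log(0.2)}
print(beam_search(toy, "<s>", "</s>"))
```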

A hallmark of variational autoencoders (VAEs) for text processing is their
combination of powerful encoder-decoder models, such as LSTMs, with simple
latent distributions, typically multivariate Gaussians. These models pose a
difficult optimization problem: there is an especially bad local optimum where
the variational posterior always equals the prior and the model does not use
the latent variable at all, a kind of "collapse" which is encouraged by the KL
divergence term of the objective. In this work, we experiment with another
choice of latent distribution, namely the von Mises-Fisher (vMF) distribution,
which places mass on the surface of the unit hypersphere. With this choice of
prior and posterior, the KL divergence term now only depends on the variance of
the vMF distribution, giving us the ability to treat it as a fixed
hyperparameter. We show that doing so not only averts the KL collapse, but
consistently gives better likelihoods than Gaussians across a range of modeling
conditions, including recurrent language modeling and bag-of-words document
modeling. An analysis of the properties of our vMF representations shows that
they learn richer and more nuanced structures in their latent representations
than their Gaussian counterparts.
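
To make the central observation concrete: with a vMF posterior and a uniform prior on the unit hypersphere, the objective is

$$\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(\mathrm{vMF}(\mu(x), \kappa) \,\|\, \mathcal{U}(\mathbb{S}^{m-1})\right),$$

where the KL term is a function of the concentration $\kappa$ (which controls the variance) alone, independent of $\mu(x)$. Fixing $\kappa$ therefore turns the KL into an additive constant, removing the gradient signal that pushes the posterior to collapse onto the prior.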

Neural network-based methods for image processing are becoming widely used in
practical applications. Modern neural networks are computationally expensive
and require specialized hardware, such as graphics processing units. Since such
hardware is not always available in real life applications, there is a
compelling need for the design of neural networks for mobile devices. Mobile
neural networks typically have a reduced number of parameters and require a
relatively small number of arithmetic operations. However, they are usually
still executed in software and use floating-point calculations. The use
of mobile networks without further optimization may not provide sufficient
performance when high processing speed is required, for example, in real-time
video processing (30 frames per second). In this study, we suggest
optimizations to speed up computations in order to efficiently use already
trained neural networks on a mobile device. Specifically, we propose an
approach for speeding up neural networks by moving computation from software to
hardware and by using fixed-point calculations instead of floating-point. We
propose a number of methods for neural network architecture design to improve
the performance with fixed-point calculations. We also show an example of how
existing datasets can be modified and adapted for the recognition task at hand.
Finally, we present the design and implementation of a field-programmable gate
array (FPGA)-based device that solves the practical problem of real-time
handwritten digit classification from a mobile camera video feed.
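
The floating-to-fixed-point move can be illustrated in a few lines: quantize weights and activations to 8-bit fixed point with a power-of-two scale, so a matrix-vector product needs only integer multiply-accumulates and a shift. The Q-format and bit widths below are our illustrative choices, not the paper's design.

```python
# Sketch: uniform int8 fixed-point quantization with a power-of-two scale.
import numpy as np

def to_fixed_point(w, frac_bits=6):
    """Quantize floats in [-1, 1] to int8 with resolution 2**-frac_bits."""
    return np.clip(np.round(w * (1 << frac_bits)), -128, 127).astype(np.int8)

def fixed_point_matvec(w_q, x_q, frac_bits=6):
    """Integer multiply-accumulate in a wide register, then rescale with a
    right shift; real hardware would also saturate to the output width."""
    acc = w_q.astype(np.int32) @ x_q.astype(np.int32)   # Q6 * Q6 -> Q12
    return acc >> frac_bits                             # back to Q6

rng = np.random.default_rng(0)
w, x = rng.uniform(-1, 1, (4, 8)), rng.uniform(-1, 1, 8)
y_fixed = fixed_point_matvec(to_fixed_point(w), to_fixed_point(x)) / (1 << 6)
print(np.max(np.abs(w @ x - y_fixed)))   # small quantization error
```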

Although exploration in reinforcement learning is well understood from a
theoretical point of view, provably correct methods remain impractical. In this
paper we study the interplay between exploration and approximation, what we
call \emph{approximate exploration}. We first provide results when the
approximation is explicit, quantifying the performance of an exploration
algorithm, MBIE-EB \citep{strehl2008analysis}, when combined with state
aggregation. In particular, we show that this allows the agent to trade off
between learning speed and quality of the policy learned. We then turn to an
exploration scheme that is successful in practice: pseudo-count-based
exploration bonuses \citep{bellemare2016unifying}. We show that choosing a density model
implicitly defines an abstraction and that the pseudo-count bonus incentivizes
the agent to explore using this abstraction. We find, however, that implicit
exploration may result in a mismatch between the approximated value function
and exploration bonus, leading to either under- or over-exploration.
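
The bonus under analysis has a one-line form: add $\beta / \sqrt{N(s, a)}$ to the reward, with counts taken over an abstraction $\phi$ of the state. The sketch below grafts that bonus onto tabular Q-learning on a toy chain; it illustrates the bonus's shape rather than MBIE-EB itself, and $\phi$ is left as the identity.

```python
# Sketch: count-based exploration bonus beta / sqrt(N(phi(s), a)) grafted
# onto tabular Q-learning on a toy chain. Not MBIE-EB itself.
import numpy as np
from collections import defaultdict

n_states, n_actions, beta = 10, 2, 0.5
Q = defaultdict(lambda: np.zeros(n_actions))
N = defaultdict(lambda: np.ones(n_actions))   # visit counts, initialized at 1
phi = lambda s: s                             # identity state abstraction

def step(s, a):                               # toy chain: action 1 moves right
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)      # reward only at the far end

s, alpha, gamma = 0, 0.2, 0.95
for _ in range(5000):
    a = int(np.argmax(Q[phi(s)] + beta / np.sqrt(N[phi(s)])))   # optimism
    s2, r = step(s, a)
    N[phi(s)][a] += 1
    target = r + beta / np.sqrt(N[phi(s)][a]) + gamma * Q[phi(s2)].max()
    Q[phi(s)][a] += alpha * (target - Q[phi(s)][a])
    s = 0 if r > 0 else s2                    # reset episode on success
print("greedy action at the start state:", int(np.argmax(Q[phi(0)])))
```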

Entity Linking (EL) is an essential task for semantic text understanding and
information extraction. Popular methods separately address the Mention
Detection (MD) and Entity Disambiguation (ED) stages of EL, without leveraging
their mutual dependency. We here propose the first neural end-to-end EL system
that jointly discovers and links entities in a text document. The main idea is
to consider all possible spans as potential mentions and learn contextual
similarity scores over their entity candidates that are useful for both MD and
ED decisions. Key components are context-aware mention embeddings, entity
embeddings and a probabilistic mention-entity map, without demanding other
engineered features. Empirically, we show that our end-to-end method
significantly outperforms popular systems on the Gerbil platform when enough
training data is available. Conversely, if the test datasets follow annotation
conventions different from those of the training set (e.g., queries/tweets vs.
news documents), our ED model coupled with a traditional NER system offers the
best or second-best EL accuracy.
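
The all-spans idea reduces to a double loop: enumerate candidate spans up to a maximum length, look up their entity candidates, and score mention detection and disambiguation with one shared similarity. The sketch below uses random embeddings and a hand-made candidate map purely for illustration; in the actual system the span and entity representations are learned and context-aware.

```python
# Sketch: score every candidate span against its entity candidates with a
# shared similarity. Embeddings and the candidate map are invented stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d = 32
tokens = ["paris", "hilton", "visited", "paris"]
tok_emb = {t: rng.standard_normal(d) for t in set(tokens)}
ent_emb = {"Paris_Hilton": rng.standard_normal(d),
           "Paris_(city)": rng.standard_normal(d)}
candidates = {("paris",): ["Paris_(city)"],
              ("paris", "hilton"): ["Paris_Hilton", "Paris_(city)"]}

def span_embedding(span):                # simplistic context-free pooling
    return np.mean([tok_emb[t] for t in span], axis=0)

for i in range(len(tokens)):
    for j in range(i + 1, min(i + 3, len(tokens)) + 1):  # spans up to length 3
        span = tuple(tokens[i:j])
        for ent in candidates.get(span, []):
            score = span_embedding(span) @ ent_emb[ent]
            print(f"span={span} -> {ent}: {score:.2f}")
```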

The accuracy and reliability of machine learning algorithms are an important
concern for suppliers of artificial intelligence (AI) services, but
considerations beyond accuracy, such as safety, security, and provenance, are
also critical elements to engender consumers' trust in a service. In this
paper, we propose a supplier's declaration of conformity (SDoC) for AI services
to help increase trust in AI services. An SDoC is a transparent, standardized,
but often not legally required, document used in many industries and sectors to
describe the lineage of a product along with the safety and performance testing
it has undergone. We envision an SDoC for AI services to contain purpose,
performance, safety, security, and provenance information to be completed and
voluntarily released by AI service providers for examination by consumers.
Importantly, it conveys product-level rather than component-level functional
testing. We suggest a set of declaration items tailored to AI and provide
examples for two fictitious AI services.

We describe the fundamental differential-geometric structures of information
manifolds, state the fundamental theorem of information geometry, and
illustrate some uses of these information manifolds in information sciences.
The exposition is self-contained by concisely introducing the necessary
concepts of differential geometry with proofs omitted for brevity.
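
For concreteness, the most basic of these structures is the Fisher information metric on a parametric family $\{p_\theta\}$,

$$g_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\right],$$

which equips the parameter space with a Riemannian structure; the dually coupled affine connections built on top of it are the setting of the fundamental theorem referred to above.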