XLA: Optimizing Compiler for Machine Learning

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear
algebra that can accelerate TensorFlow models with potentially no source code
changes.

The results are improvements in speed and memory usage: most internal benchmarks
run ~1.15x faster after XLA is enabled. These benchmarks were evaluated on a
single NVIDIA V100 GPU.

Introduction

When a TensorFlow program is run, all of the operations are executed
individually by the TensorFlow executor. Each TensorFlow operation has a
precompiled GPU kernel implementation that the executor dispatches to.

XLA provides an alternative mode of running models: it compiles the TensorFlow
graph into a sequence of computation kernels generated specifically for the
given model. Because these kernels are unique to the model, they can exploit
model-specific information for optimization. For example, let's look at an
optimization XLA does in the context of a simple TensorFlow computation:

def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)

Run without XLA, the graph launches three kernels: one for the multiplication,
one for the addition and one for the reduction. However, XLA can optimize the
graph so that it computes the result in a single kernel launch. It does this by
"fusing" the addition, multiplication and reduction into a single GPU kernel.
Moreover, this fused operation does not write out the intermediate values
produced by y*z and x+y*z to memory; instead it "streams" the results of
these intermediate computations directly to their users while keeping them
entirely in GPU registers. Fusion is XLA's single most important optimization.
Memory bandwidth is typically the scarcest resource on hardware accelerators, so
removing memory operations is one of the best ways to improve performance.
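
The difference between the two execution strategies can be illustrated in plain Python. This is only a conceptual sketch: the names `unfused` and `fused` are made up for illustration, lists stand in for GPU buffers, and the code says nothing about how XLA actually emits kernels.

```python
def unfused(x, y, z):
    """Three separate passes, each materializing a full intermediate list."""
    product = [b * c for b, c in zip(y, z)]      # "kernel" 1: y * z
    total = [a + p for a, p in zip(x, product)]  # "kernel" 2: x + (y * z)
    return sum(total)                            # "kernel" 3: reduction


def fused(x, y, z):
    """One pass: intermediate values live only in local variables."""
    accumulator = 0.0
    for a, b, c in zip(x, y, z):
        accumulator += a + b * c  # no intermediate buffer is written
    return accumulator


x, y, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]
assert unfused(x, y, z) == fused(x, y, z)
```

Both functions return the same value; the fused version simply never stores `y * z` or `x + y * z` as whole arrays, which is the memory traffic that fusion removes.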

Enable XLA for TensorFlow models

Auto-clustering

The simplest way to start using XLA in TensorFlow models is to enable
auto-clustering, which automatically finds clusters (connected subgraphs)
within the TensorFlow graph that can be compiled and executed using XLA.
Auto-clustering on GPU can be enabled by setting the TF_XLA_FLAGS environment
variable:

$ TF_XLA_FLAGS=--tf_xla_auto_jit=2 path/to/your/tf/program

Auto-clustering is currently optimized for GPU workloads, but it can also be
enabled on CPU by additionally passing the flag --tf_xla_cpu_global_jit in
TF_XLA_FLAGS.
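
When the program is launched from Python rather than from a shell, the same flags can be placed in the environment programmatically. The sketch below assumes the variable should be set before TensorFlow is first imported, which is the conservative choice; the `xla_flags` list is just an illustrative name.

```python
import os

# Assumption: set TF_XLA_FLAGS before the first `import tensorflow`, so the
# flags are already in the environment when TensorFlow reads them.
xla_flags = ["--tf_xla_auto_jit=2", "--tf_xla_cpu_global_jit"]
os.environ["TF_XLA_FLAGS"] = " ".join(xla_flags)

# import tensorflow as tf  # import only after the variable is set
```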

Explicit compilation with tf.function

Auto-clustering is a great tool for making the model faster without any changes
to the code, but it may be hard to understand what changes have been performed.

The explicit compilation API offers more fine-grained control over which
functions are compiled.
For example, a TensorFlow function that performs MNIST training can be compiled
with XLA by passing experimental_compile=True to tf.function.
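
A minimal sketch of such a training step is shown here. It is not the original example: the single dense layer, the hand-written SGD update, the random input batch, and the names `weights`, `bias`, and `train_step` are all assumptions made to keep the block self-contained. It is written with `jit_compile=True`, the name newer TensorFlow releases use for the `experimental_compile` argument.

```python
import tensorflow as tf

# Hypothetical stand-in for an MNIST classifier: one dense layer, 784 -> 10.
weights = tf.Variable(tf.zeros([784, 10]))
bias = tf.Variable(tf.zeros([10]))
learning_rate = 0.01


@tf.function(jit_compile=True)  # older releases: experimental_compile=True
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = tf.matmul(images, weights) + bias
        loss = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                labels=labels, logits=logits))
    grad_w, grad_b = tape.gradient(loss, [weights, bias])
    # Plain SGD update, kept inside the compiled function.
    weights.assign_sub(learning_rate * grad_w)
    bias.assign_sub(learning_rate * grad_b)
    return loss


# Random data shaped like a batch of flattened 28x28 images.
images = tf.random.uniform([32, 784])
labels = tf.random.uniform([32], maxval=10, dtype=tf.int32)
first_loss = train_step(images, labels)
```

Because the forward pass, the gradient computation, and the variable update all sit inside one decorated function, the whole step is handed to XLA as a unit.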

The experimental_compile API has must-compile semantics: either the entire
function is compiled with XLA, or an errors.InvalidArgumentError exception is
thrown. XLA cannot currently compile functions whose dimensions are not
inferrable, that is, functions for which the dimensions of all tensors cannot
be inferred without running the entire computation. For example, a function
whose output shape depends on the values of its input, rather than only on the
input's shape, will not compile.
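
One commonly cited case is `tf.unique`, whose output length depends on the values in its input. The sketch below first shows that data dependence in eager mode, then attempts compilation and records what happens. The name `compiled_unique` is illustrative, and the outcome of the compile attempt is deliberately not asserted: support for dynamically shaped ops has changed between releases, so a given TensorFlow version may accept this function.

```python
import tensorflow as tf

# Two inputs with the same shape and dtype ...
a = tf.constant([1, 1, 2, 2])
b = tf.constant([1, 2, 3, 4])

# ... produce outputs of different shapes, because the number of unique
# values depends on the data, not on the input shape.
unique_a, _ = tf.unique(a)
unique_b, _ = tf.unique(b)
print(unique_a.shape, unique_b.shape)


@tf.function(jit_compile=True)  # older releases: experimental_compile=True
def compiled_unique(x):
    values, _ = tf.unique(x)
    return values


try:
    compiled_unique(a)
    outcome = "compiled"
except Exception as err:  # exact error type may vary between releases
    outcome = f"rejected: {type(err).__name__}"
print(outcome)
```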

Reproducible bug reports

A bug report is much easier to reproduce if it includes dumps of the generated
XLA programs and of the auto-clustering embedding that was used.
To generate them for a TensorFlow program running with auto-clustering, launch
the program with dumping enabled and directed to /tmp/generated.
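
The launch configuration can be sketched from Python as an environment for a child process. The exact command is not shown in this text, so the variable and flag names below (`TF_DUMP_GRAPH_PREFIX`, `--tf_xla_clustering_debug`, `--xla_dump_hlo_as_text`, `--xla_dump_to`) are taken from general TensorFlow and XLA usage and should be treated as assumptions; `xla_dump_env` is a hypothetical helper.

```python
import os


def xla_dump_env(dump_dir="/tmp/generated"):
    """Return a copy of the environment with XLA and clustering dumps enabled."""
    env = dict(os.environ)
    env["TF_DUMP_GRAPH_PREFIX"] = dump_dir
    env["TF_XLA_FLAGS"] = "--tf_xla_clustering_debug --tf_xla_auto_jit=2"
    env["XLA_FLAGS"] = f"--xla_dump_hlo_as_text --xla_dump_to={dump_dir}"
    return env


env = xla_dump_env()
# Hypothetical launch (needs `import subprocess, sys`); replace the script
# path with the program being debugged:
# subprocess.run([sys.executable, "path/to/your/tf/program.py"], env=env, check=True)
```

Building the environment as a dictionary keeps the dump settings out of the parent process and makes the dump directory easy to change per run.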

When filing bugs, attach the contents of the /tmp/generated directory.

If possible, try to isolate a bug to a single XLA program by using the
replay_computation tool and iteratively running it on the generated programs.

Known Issues

Compilation with XLA can greatly improve the performance of your programs, but
the TensorFlow interop has a number of known sharp corners.

TensorArray TF/XLA Interconversion

The problem manifests itself as the error message "Support for TensorList
crossing the XLA/TF boundary is not implemented."

XLA supports tf.TensorArray. However, the interconversion between TF and
XLA representations is not implemented yet.
This error often arises when the TensorArray is used inside the compiled
block, but the derivative is taken outside.

Workaround: compile the outermost scope that takes the derivative.

Random Number Generation

XLA currently ignores TF seeds passed to random operations. This affects
stateful TF random operations, such as tf.random.normal or tf.nn.dropout. XLA
will behave as if the compilation were seeded with a new unique seed at each
run. This limitation does not apply to stateless random ops.
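
A short sketch of the stateless alternative, assuming `tf.random.stateless_normal` as the replacement for `tf.random.normal`; the function name `sample` is illustrative. The seed is an explicit argument, so repeated calls to the compiled function with the same seed return the same values.

```python
import tensorflow as tf


@tf.function(jit_compile=True)  # older releases: experimental_compile=True
def sample(seed):
    # Stateless ops take the seed as an explicit argument of shape [2].
    return tf.random.stateless_normal(shape=[3], seed=seed)


seed = tf.constant([1, 2], dtype=tf.int32)
first = sample(seed)
second = sample(seed)
same = bool(tf.reduce_all(first == second))
```

To draw different values on each call, the caller passes a different seed each time, for example one derived from a step counter.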