Create a Compute Engine VM and a TPU resource

Before you use the tools in this guide, create a Compute Engine VM instance
using the tf-1-6 image in the ml-images image family. Also create a
Cloud TPU resource. For instructions, complete the Cloud TPU quickstart.

Set up TensorBoard

The computations you'll use TensorFlow for, like training a massive deep neural
network, can be complex and confusing. To make it easier to understand, debug,
and optimize TensorFlow programs, we've included a suite of visualization tools
called TensorBoard. You can use
TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about
the execution of your graph, and show additional data, like images, that pass
through it. For more details, follow the tutorials available at tensorflow.org.

To use TensorBoard with TensorFlow and Cloud TPU, complete the following steps.

Connect to your Compute Engine VM using the gcloud compute ssh command,
with port forwarding for TensorBoard. Replace tpu-demo-vm with the name
of your VM instance:

$ gcloud compute ssh tpu-demo-vm --ssh-flag="-L 6006:localhost:6006"

Set a model_dir directory. Estimators (and TPU Estimators) handle the
heavy lifting of integrating with TensorBoard for you; however, they need
some configuration. When you build your estimator, provide a path to a
directory in your Cloud Storage bucket
where the estimator can save metadata about your model.
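As a sketch, the configuration might look like the following (the bucket path and model function are hypothetical; this assumes TensorFlow 1.x with the tf.contrib.tpu Estimator API):

```python
import tensorflow as tf

# Hypothetical Cloud Storage path; TensorBoard metadata lands here.
MODEL_DIR = "gs://my-tpu-bucket/model"

run_config = tf.contrib.tpu.RunConfig(
    model_dir=MODEL_DIR,
    save_checkpoints_steps=1000,  # checkpoints also go to model_dir
)

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,  # hypothetical model function
    config=run_config,
    train_batch_size=1024,
    use_tpu=True,
)
```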

Open TensorBoard in your browser from your local workstation. Navigate to
http://localhost:6006 in your web browser to view TensorBoard. You should
be able to see the XLA graph of your model under the Graphs
tab. Because you haven't captured a TPU profile yet, you will not be able to
see the profiler tools.

Install Cloud TPU Profiler

You need to have cloud-tpu-profiler >= 1.5.1 installed on your system to capture
a TPU profile. Install it on your VM with pip:

(vm)$ pip install --upgrade "cloud-tpu-profiler>=1.5.1"

Capturing Trace Information

Before you can use the tools, you need to capture trace information while your
model is running. Capture a trace by executing the following command on your VM:

(vm)$ capture_tpu_profile --tpu_name=$TPU_NAME --logdir=${model_dir}

By default, this captures a 2-second trace. You can set the trace duration with
the duration_ms command-line option. Open TensorBoard in your local
browser again. You should now see a Profile tab.

Overview Page

The Overview Page provides a top level view of how the workload performed as it
ran on the TPU. The page displays data in the following panels:

Performance Summary, which includes:

Step time averaged over all sampled steps

Percentage of idle Host time

Percentage of idle TPU time

Percentage utilization of the TPU matrix units

Step-time Graph, which plots a graph of step time (in milliseconds) over
all the steps sampled. The blue area corresponds to the part of the step
time waiting for input data from the host; the orange area corresponds to
the compute time.

Top 10 TensorFlow operations on TPU, which shows the TensorFlow
operations executed on the TPU that consumed the most time. Clicking
the "Show" button displays a table in which each row shows the self
time (as a percentage of the time taken by all operations), cumulative time,
category, name, and FLOP rate achieved for an operation.

Run Environment, which includes:

Number of hosts used

Type of TPU used

Number of TPU cores

Training batch size

Recommendation for Next Steps, which reports if the workload is input
bounded and, if so, suggests tools you can use to reduce the bottleneck
depending on whether the issue is the input time, the TPU time, or both.

XLA graphs

When a model is compiled, the compilation also generates a graph that represents
the XLA (Accelerated Linear Algebra)
program that runs on the TPU devices. The graph is dumped to the model_dir
directory and can be found in the "Graphs" tab in TensorBoard.

A node in the graph represents an XLA
instruction. If
an XLA instruction (for example, add) is lowered from a TensorFlow op (for
example, x/y/z), it is shown as x/y/z/add in the graph.

The XLA graph can give you more information on how a TPU is executing a
particular model and what the shapes of inputs and outputs of the different
operations are. In conjunction with Trace Viewer, this can give
you insight into where most of the runtime is spent.

Notes:

Not all XLA instructions have corresponding TensorFlow operations (for
example, instructions injected by the XLA compiler), so some nodes don't have
TensorFlow namespaces.

The TensorFlow program structure is incorporated in the XLA graph where
possible. However, the XLA program running on TPU devices is highly
optimized, so the graph structure might be quite different from the original
TensorFlow program.

There is a special XLA instruction called fusion. This instruction can
merge multiple instructions from different TensorFlow operations into a
single computation. The TensorFlow operation corresponding to the root
instruction in the fusion is used as the namespace of the entire fusion
operation.

TPU Compatibility Checker

The TensorBoard graph viewer includes the TPU Compatibility Checker—a tool
which calls out TensorFlow ops that may potentially be problematic when
compiling a model for TPU use. The tool works by scanning the model TensorFlow
graph for ops which are currently not available on the TPU.

Prerequisites

Make sure that you have configured your model to write the model graph to a
file. If you are using the tf.estimator
API, you
do this by setting the model_dir property of your Estimator.

The TPU Compatibility Checker does not check any operations that have been
explicitly assigned to a non-TPU device using manual device
placement.
For example, operations which are explicitly assigned to a GPU are skipped.
If you have such assignments, remove the manual placements for any
operations that you intend to run on the TPU.

Note: This tool examines only the TensorFlow graph and does not actually compile
the model for TPU execution. As such, results should be interpreted as a
reasonable estimate of compatibility.

Using the TPU Compatibility Checker

To access the TPU Compatibility Checker, open the Graphs tab in TensorBoard.
Note that you can upload and check any model's
graph by clicking
the Choose File button. In the configuration pane on the left, go to the
Color section and select the TPU Compatibility option:

The model graph now appears like this (using the Abalone
tutorial
as an example):

TPU-compatible operations are colored green; TPU-incompatible operations are
colored red. Graph nodes that contain both compatible and incompatible
operations show both colors in proportion to the percentage of compatible and
incompatible operations contained in them.

The right hand side of the screen displays a compatibility summary:

The percentage at the top expresses what percentage of all operations are TPU
compatible. Below that is a list of operations that are not compatible. Clicking
on one of these operations selects it in the main graph view, expanding nodes as
necessary to make the operation visible.
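The summary arithmetic is straightforward; here is a hypothetical sketch (op names and compatibility flags are invented, not taken from a real graph scan):

```python
def compatibility_summary(ops):
    """ops maps an op name to True (TPU compatible) or False.

    Returns the percentage of compatible ops and the sorted list of
    incompatible op names, mirroring the checker's summary pane.
    """
    incompatible = sorted(name for name, ok in ops.items() if not ok)
    percent = 100.0 * (len(ops) - len(incompatible)) / len(ops)
    return percent, incompatible

ops = {"dense/MatMul": True, "dense/Relu": True,
       "save/SaveV2": False, "loss/Mean": True}
print(compatibility_summary(ops))  # (75.0, ['save/SaveV2'])
```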

Interpreting the Results

The Abalone example shows how to interpret the results of the TPU Compatibility
Checker. Here again is the compatibility summary:

For the purpose of illustration, we show a theoretical compatibility summary
that has several unavailable ops.

When this model was run, no manual device placement was specified. As a result,
the Compatibility Checker checked all operations, even those that should always
be run on the CPU. The various "save" and "report_uninitialized_variables"
operations definitely fall in this category.

This leaves three operations that are potentially an issue: the GradientDescent
operation and the two AssignAdd operations in root_mean_squared_error.

Let's look at the GradientDescent node:

The incompatible operation is an AssignAdd that updates the global step count.
This operation would typically be run on the CPU, so it's not a concern.

Moving on to root_mean_squared_error, we see from the source
code
that it is only used as an additional evaluation metric:

Trace Viewer

Trace Viewer shows a timeline that displays:

Duration of the various operations in your TensorFlow model that were
executed.

Which part of the system (TPU or host machine) executed an operation.
Typically, the host machine executes infeed operations, which preprocess
training data and transfer it to the TPU, whereas the TPU executes the
actual model training.

Trace Viewer allows you to identify performance problems in your model, then
take steps to resolve them. For example, at a high level, you can identify
whether infeed or model training is taking the majority of the time. Drilling
down, you can identify which TensorFlow operations are taking the longest to
execute.

User Interface Overview

To open Trace Viewer, go to TensorBoard and click on the "Profile" tab at the
top of the screen. You will see something like this:

This screen contains the following main elements (marked with numbers above):

A Runs dropdown, which contains all of the runs for which you've
captured trace information. By default, Trace Viewer opens the most recent
run. You can open the dropdown to select a different run.

A Tools dropdown, which selects different profiling tools.

A Timeline pane, which shows the operations that the TPUs and host
machine executed over time.

A Details pane, which shows additional information for operations
selected in the Timeline pane.

Here's a closer look at the Timeline pane:

The Timeline pane contains the following elements:

A top bar, which contains various auxiliary controls.

A time axis, which shows time relative to the beginning of the trace.

Section and track labels. Each section contains multiple tracks and has
a triangle on the left that you can click to expand and collapse the
section. There is one section for every processing element in the system.
Sections and tracks will be explained in more detail below.

A tool selector, which contains various tools for interacting with the
Trace Viewer.

Events. These show the time during which an operation was executed or
the duration of meta-events, such as training steps.

A vertical tab bar. This does not have a useful purpose for TPUs. It
exists because Trace Viewer is a general-purpose tool provided by
Chrome
that is used for a variety of performance analysis tasks.

We'll discuss sections, tracks, and events next, since this is where you'll be
spending most of your time.

Sections and Tracks

Trace Viewer contains the following sections:

One section for each TPU node, labeled with the number of the TPU chip
and the TPU node within the chip (e.g. "Chip 2: TPU Core 1"). Each TPU node
section contains the following tracks:

Step: This track shows the duration of the training steps that were
running on the TPU.

TensorFlow Ops: TensorFlow operations executed on the TPU.

XLA Ops: XLA
operations. Each TensorFlow operation is translated into one or several
XLA operations. The XLA compiler then translates these XLA operations
into code that runs on the TPU.

An additional section for threads running on the host machine's CPU,
labeled "Host Threads". This section contains one track for each CPU
thread. Note: Some other information is displayed alongside the section
labels (e.g. "n-e93653ba-w-0", "pid 49"). This exists only for internal
reasons and can be ignored.

Tool Selector

The tool selector contains tools that you can use to interact with the timeline
view. Click on a tool to make it active (or use the keyboard shortcuts; see
below). The currently active tool is highlighted. You can move the tool
selector around the screen by clicking and dragging the dotted area at the
top.

Here is what the individual tools do:

Selection tool
Click on an event to select it. Click and drag to select multiple events.
Additional information about the selected event or events (name, start time,
and duration) will be displayed in the details pane.

Pan tool
Click and drag to pan the timeline view horizontally and vertically.

Zoom tool
Click and drag up to zoom in or down to zoom out along the horizontal (time)
axis. The horizontal position of the mouse cursor determines the center around
which the zoom takes place.

Note: The zoom tool has a known bug where zoom remains active
if you release the mouse button while the mouse cursor is outside the timeline
view. If this happens to you, just click briefly on the timeline view to stop
zooming.

Timing tool
Click and drag horizontally to mark a time interval. The length of the interval
appears on the time axis. To adjust the interval, drag its ends. To clear the
interval, click anywhere inside the timeline view.

Note that the interval remains marked if you select one of the other tools.

Events

Events have different colors to make it easier to distinguish them visually. The
colors themselves have no specific meaning.

Top Bar (Timeline pane)

The top bar of the Timeline pane contains several auxiliary controls:

Metadata display: Not used for TPUs.

View Options: Not used for TPUs.

Search box: Enter text to search for all events whose name contains the
text. Click the arrow buttons to the right of the search box to move
forwards and backwards through the matching events, selecting each event in
turn.

Console button: Not used for TPUs.

Help button: Click to display a quick help summary.

Keyboard Shortcuts

Here are some keyboard shortcuts you can use in Trace Viewer. Click the help
button in the top bar to see more keyboard shortcuts.

The f shortcut in particular can be highly useful. Try selecting a step and
pressing f to zoom into it.

Characteristic Events

Here are some event types that will be of interest when analyzing TPU
performance.

InfeedDequeueTuple: This TensorFlow operation runs on the TPU and
receives input data coming from the host. If this is taking a long time, it
can mean that the TensorFlow operations that preprocess data on the host
machine cannot keep up with the rate at which the TPU can consume the data.
You can see corresponding events in the host traces called
InfeedEnqueueTuple. You can look at more detailed input-pipeline
analysis using our Input Pipeline Analyzer tool.

CrossReplicaSum: This TensorFlow operation runs on the TPU and computes
a sum across replicas. Because each replica corresponds to a different TPU
node, this operation needs to wait for all TPU nodes to be done with a step.
If you see a lot of time being spent in this operation, it typically doesn't
mean that the sum itself is slow but that the TPU node is waiting for some
other TPU node. This often happens because other TPU nodes were delayed by a
slow data infeed.

Dataset Ops: When loading data with the Dataset
API, Trace Viewer
visualizes those dataset operations. The
Iterator::Filter::Batch::ForeverRepeat::Memory event in the example
corresponds to the dataset.map() operation. Examining these
operations in Trace Viewer is very helpful for debugging and mitigating
input pipeline bottlenecks.

Prefetch Threads: Use dataset.prefetch() to buffer the input data.
This technique prevents sporadic slowdowns in file access that create a
bottleneck in the input pipeline. Prefetch threads show up in Trace Viewer
when dataset.prefetch() operations are captured.
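Conceptually, a prefetch thread decouples the producer (the input pipeline) from the consumer (the infeed). The following is not the TensorFlow implementation, just a plain-Python sketch of the idea behind dataset.prefetch():

```python
import queue
import threading

def prefetch(iterable, buffer_size):
    """Yield items from `iterable`, filled in the background by a thread.

    A bounded queue lets the producer run ahead of the consumer by up to
    `buffer_size` items, smoothing over sporadic slowdowns in file access.
    """
    buf = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for item in iterable:
            buf.put(item)
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is sentinel:
            return
        yield item

print(list(prefetch(range(5), buffer_size=2)))  # [0, 1, 2, 3, 4]
```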

What Can Go Wrong

Here are some potential "gotchas" to be aware of when using Trace Viewer:

Event display limit: Trace Viewer displays a maximum of 1 million
events. If you captured more events, only the 1 million earliest events
are displayed; later events are dropped. To capture more
TPU events, you can explicitly ask capture_tpu_profile to exclude the
dataset ops with the --include_dataset_ops=False flag.

Very long events: If an event began before the capture started or ended
after the capture finished, it won't be visible in Trace Viewer. This means
that very long events can be missed.

When to start trace capture: If you start trace capture too early, the
TPU may still be starting up and you may see only a few events or no events
at all. You can add the --duration_ms flag and/or the
--num_tracing_attempts flag to increase the profiling duration and
automatically retry trace collection when no trace events are collected,
for example:

(vm)$ capture_tpu_profile --tpu_name=$TPU_NAME --logdir=${model_dir} --duration_ms=10000 --num_tracing_attempts=10

Op Profile

TensorBoard also contains the Op Profile, a tool that displays the performance
statistics of XLA operations
executed during the profiling period. Op Profile shows:

How your application uses the TPU. The TPU FLOPS utilization reported is
defined as the measured number of floating point operations per second
(FLOPS) normalized to the peak FLOPS available on the TPU.

The most time-consuming operations. These operations are potential targets
for optimization.

Details of individual operations, including the shape, padding and
expression.

Op Profile gives you insight into how well your model uses the TPU and helps
you find good targets for optimization. For example, if your model only achieves
5% of the TPU peak FLOPS, you can drill down and identify which XLA operations
take the longest to execute and how much TPU FLOPS they consume.
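That utilization figure is just measured FLOPS divided by peak FLOPS; as a quick sketch (the numbers are illustrative, not a real device specification):

```python
def flops_utilization(measured_flops: float, peak_flops: float) -> float:
    """Measured FLOPS normalized to the peak FLOPS available on the device."""
    return measured_flops / peak_flops

# A model sustaining 2 TFLOPS on a hypothetical 40-TFLOPS device achieves
# 5% utilization: the kind of result worth drilling into.
print(flops_utilization(2e12, 40e12))  # 0.05
```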

Using the Op Profile

While collecting a profile, capture_tpu_profile also collects
an op_profile.json file that contains performance statistics of XLA operations.
To open Op Profile, go to TensorBoard and click on the Profile tab at the
top of the screen. Select op_profile from the Tools dropdown. You
will see something like this:

An overview section, which shows the overall TPU utilization and the
operation that spends the most time in the profile duration. The tool also
tells you how well that operation uses the computational potential of the
chip and gives suggestions for optimization.

A control panel. You can select how many ops to show for each XLA
category by sliding the bar on the left. You can also toggle the button on
the right to list only ops within the 90th percentile of the total execution
time.

An op table, which lists the XLA operations by category, sorted by
time spent in descending order.

Op details cards. When you hover over a table entry, a card appears
showing more details about the operation, for example, the FLOPS
utilization, XLA expression and the layout.

Op Table

Each entry in the table contains multiple columns and has a triangle on the left
that you can click to expand and collapse the entry. There is one entry for each
operation category. For each category, the table shows the time, operation
category name, the name of the associated TensorFlow op and its FLOPS
utilization.

Time, which shows the total percentage of the time spent by all the
operations in that category. You can click to expand the entry and see the
breakdown of the time spent by each individual operation.

Horizontal Bar, which visualizes the time distribution across
categories.

Top 10 Ops. When you click to expand a category, the 10 operations
that consumed the most time are listed. You can further expand a fusion op
entry to see the non-fusion elementwise operations it contains.

TensorFlow Op, which shows the TensorFlow op name associated with the
XLA operation.

FLOPS, which shows the FLOPS utilization, that is, the measured FLOPS
normalized to the peak FLOPS of the device. Higher FLOPS utilization is
better because it means operations run faster. The table cell is color coded:
green for high FLOPS utilization (good) and red for low FLOPS utilization (bad).

Op Details Cards

When you hover over a table entry, a card appears on the left telling you more
details about the XLA op or the operation category. A typical card looks like
this:

Name, which shows the XLA operation name.

Category, which shows the operation category.

FLOPS utilization, which includes the value and a color coded progress
bar.

Layout (optional), which shows the shape and
layout of a tensor. Note
that layout is only shown for convolution operations. The tool also shows
whether the shape of the tensor is an exact fit for the matrix units and how
it is padded.

Interpreting the Results

For illustration, this section gives a quick interpretation of the numbers shown
in the above example. Overall, the model reaches 34% of the peak FLOPS
achievable on the device. Output fusion and convolution dominate
the execution time, and there is also a long tail of vector and scalar
operations with very low FLOPS utilization. One optimization strategy is to
transform those vector or scalar operations into convolution operations.

For convolution ops, TPU FLOPS utilization can also be low for the
following reasons:

Padding: the matrix units are only partially used.

The convolution op is memory bound.

In the following example, %convolution.11 shows lower FLOPS utilization
than %convolution.193 in the previous example.

Taking a closer look at its layout, there is a padding from 64 to 128, which
indicates that only half of the matrix units are effectively used. Therefore,
compared to the previous case which has an exact fit, the FLOPS utilization is
almost halved.
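A rough sketch of that padding arithmetic, assuming a matrix-unit dimension of 128 as in the example above:

```python
import math

def matrix_unit_utilization(dim: int, unit: int = 128) -> float:
    """Fraction of the matrix units doing useful work when a dimension
    of size `dim` is padded up to the next multiple of `unit`."""
    padded = math.ceil(dim / unit) * unit
    return dim / padded

print(matrix_unit_utilization(128))  # 1.0  -> exact fit
print(matrix_unit_utilization(64))   # 0.5  -> padded from 64 to 128, half wasted
```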

Input Pipeline Analyzer

TensorBoard provides a powerful tool to analyze the TensorFlow input
pipeline.
When a TensorFlow program reads data from files, the data is read at the
beginning of the TensorFlow graph in a pipelined manner: the read process is
divided into multiple data processing stages connected in series, where the
output of one stage is the input to the next one. This process of reading files
is called the input pipeline.

A typical pipeline for reading records from files has the following stages:

File reading

File preprocessing (optional)

Transfer of the data from the host machine to the device
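The stages above form a chain of producers and consumers; here is a toy sketch in plain Python generators (the file names and record contents are made up):

```python
def read_files(filenames):
    # Stage 1: file reading (stubbed with in-memory "files").
    fake_files = {"a.rec": ["r1", "r2"], "b.rec": ["r3"]}
    for name in filenames:
        for record in fake_files[name]:
            yield record

def preprocess(records):
    # Stage 2: optional preprocessing (here, a trivial transform).
    for record in records:
        yield record.upper()

def transfer_to_device(records):
    # Stage 3: stand-in for the host-to-device transfer.
    return list(records)

batch = transfer_to_device(preprocess(read_files(["a.rec", "b.rec"])))
print(batch)  # ['R1', 'R2', 'R3']
```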

An inefficient input pipeline can severely slow down your application. We say an
application is input bound when it spends a significant portion of its time in
the input pipeline. This tool presents an in-depth analysis of your input
pipeline performance based on the various performance data collected. At a high
level, the tool tells you whether your program is input bound. If it is, the
tool can also walk you through the device-side and host-side analysis to debug
which stage of the pipeline is the bottleneck.

User Interface Overview

The Input Pipeline Analyzer tool reads the performance analysis results from an
input_pipeline.json file that is also collected by
capture_tpu_profile. To open Input Pipeline Analyzer, select
input_pipeline_analyzer from the Tools dropdown. The analysis
contains three sections:

Summary, which tells you the overall input pipeline analysis: whether
your application is input bound and by how much.

Device-side analysis, which shows you the detailed device-side analysis
results, including the device step time and how much is spent waiting for
the input data.

Host-side analysis, which shows you the detailed analysis on the host
side, including a breakdown of input processing time on the host, and a
tabular view of details for each input operation.

How to Tell if Your Application is Input Bound

Section 1 is a summary of the overall analysis. It reports if your TPU program
is input-bound and by how much (in terms of percentage of device time spent on
waiting for input from the host). In addition, if you are using a standard input
pipeline that has been instrumented, the tool reports where most of the input
processing time is spent. For example:

Device-side Analysis

Section 2 shows the details of device-side analysis, which gives you insights on
how much time is spent in the device versus in the host and how much device time
is spent waiting for input data from the host.

Step time plotted against step number, which plots a graph of device
step time (in milliseconds) over all the steps sampled. The blue area
corresponds to the part of the step time that is waiting for input data from
the host while the orange area corresponds to the non-input time.

Step time statistics, which reports the average, standard deviation and
the range ([minimum, maximum]) of the device step time.

Range of time waiting for input data, plotted against step number, which
plots a line chart showing the fraction of device time spent waiting for input
data processing (normalized to the total device step time) over all the steps
sampled. Note that the fraction of time spent varies across TPU cores, so in
addition to the fraction averaged across all the cores, the range of the
fractions across cores is also plotted for each step. Ideally, you want this
range to be as small as possible, because the eventual step time of a
particular step is determined by the slowest core.

Fraction of time waiting for input data, which reports the average,
standard deviation, and range ([minimum, maximum]) of the fraction of
time spent on the device waiting for input data, normalized to the total
device step time.
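The figures in this section are ordinary summary statistics; a sketch with made-up per-step timings:

```python
import statistics

def step_time_stats(step_times_ms):
    """Average, standard deviation, and [min, max] of device step time."""
    return {
        "average": statistics.mean(step_times_ms),
        "stddev": statistics.stdev(step_times_ms),
        "range": (min(step_times_ms), max(step_times_ms)),
    }

def input_wait_fraction(infeed_wait_ms, step_time_ms):
    """Fraction of a step spent waiting for input, normalized to step time."""
    return infeed_wait_ms / step_time_ms

stats = step_time_stats([10.0, 12.0, 14.0])
print(stats["average"], stats["stddev"], stats["range"])  # 12.0 2.0 (10.0, 14.0)
print(input_wait_fraction(3.0, 12.0))  # 0.25
```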

Host-side Analysis

Section 3 shows the details of host-side analysis, which reports a breakdown of
the input processing time (the time spent on the Dataset API operations) on the
host into several categories:

Reading data from files on demand, which is the time spent on reading
data from files without caching, prefetching, and interleaving.

Reading data from files in advance, including caching, prefetching, and
interleaving.

Data preprocessing, for example, image decompression.

Enqueuing data to be transferred to device, which is typically called by
TensorFlow to put the data into an infeed queue before transferring to the
device.

If you want to see the statistics of individual input operations and their
categories in the execution time breakdown, you can click the button "Show Input
Op statistics". You will see a table like this:

Each table entry contains the following information:

Input Op, which shows the TensorFlow op name of the input operation.

Count, which shows the total number of instances of the operation
executed during the profiling period.

Total Time, which shows the cumulative sum of the wall clock time
spent on each of those instances.

Total Time %, which shows the total time spent on that operation as a
fraction of the total time spent in input processing.

Total Self Time, which shows the cumulative sum of the self time spent
on each of those instances. Self time measures the time spent
inside the function body, excluding the time spent in the functions it calls.
For example, Iterator::PaddedBatch::Filter::ForeverRepeat::Map is
called by Iterator::PaddedBatch::Filter, so its total self time is
excluded from the total self time of the latter.

Total Self Time %, which shows the total self time as a fraction of the
total time spent on input processing.

Category, which corresponds to the categories defined above for the
breakdown.
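The self-time bookkeeping above can be sketched as follows (the op names come from the example in the text; the timings are illustrative):

```python
def self_times(ops):
    """Compute self time per op: total time minus time spent in callees.

    `ops` maps an op name to (total_time, [names of ops it calls]).
    """
    return {
        name: total - sum(ops[child][0] for child in children)
        for name, (total, children) in ops.items()
    }

ops = {
    "Iterator::PaddedBatch::Filter":
        (10.0, ["Iterator::PaddedBatch::Filter::ForeverRepeat::Map"]),
    "Iterator::PaddedBatch::Filter::ForeverRepeat::Map":
        (4.0, []),
}
# Filter spent 10.0 total, of which 4.0 was inside Map, so 6.0 is self time.
print(self_times(ops))
```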