For example, a variable created under a MirroredStrategy is a
MirroredVariable. If no devices are passed to the strategy's constructor,
it uses all available GPUs. If no GPUs are found, it uses the available
CPUs. Note that TensorFlow treats all CPUs on a machine as a single device
and uses threads internally for parallelism.

When using distribution strategies, create all variables within the
strategy's scope. This replicates the variables across all the replicas and
keeps them in sync using an all-reduce algorithm.

Variables created inside a tf.function that is called within a
MirroredStrategy's scope are still MirroredVariables.

experimental_distribute_dataset can be used to distribute the dataset across
the replicas when writing your own training loop. If you use the compile and
fit methods available in tf.keras, then tf.keras will handle the
distribution for you.

Args:

devices: a list of device strings such as ['/gpu:0', '/gpu:1']. If
None, all available GPUs are used. If no GPUs are found, CPU is used.

cross_device_ops: optional, a descendant of CrossDeviceOps. If this is not
set, NcclAllReduce() will be used by default. One would customize this
if NCCL isn't available or if a special implementation that exploits
the particular hardware is available.

We will assume that the input dataset is batched by the
global batch size. With this assumption, we will make a best effort to
divide each batch across all the replicas (one or more workers).
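
As an illustration of the batch splitting described above, here is a minimal plain-Python sketch. It is not TensorFlow's implementation: the helper name split_global_batch is hypothetical, and it assumes that each replica receives a contiguous slice of global_batch_size // num_replicas examples, which matches the layout used in the reduce example in this document. For simplicity the sketch rejects batch sizes that do not divide evenly.

```python
def split_global_batch(batch, global_batch_size, num_replicas):
    """Split one batch into contiguous per-replica slices (illustrative only)."""
    if global_batch_size % num_replicas != 0:
        # Assumption of this sketch, not a statement about TensorFlow.
        raise ValueError("global_batch_size must be divisible by num_replicas")
    per_replica = global_batch_size // num_replicas
    return [batch[i * per_replica:(i + 1) * per_replica]
            for i in range(num_replicas)]


# A full global batch of 8 examples on 2 replicas.
full = split_global_batch(list(range(8)), global_batch_size=8, num_replicas=2)
# A final partial batch of 6 examples with the same global batch size.
partial = split_global_batch(list(range(6)), global_batch_size=8, num_replicas=2)

print(full)     # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(partial)  # [[0, 1, 2, 3], [4, 5]]
```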

In a multi-worker setting, we will first try to distribute the dataset by
detecting whether it is created from ReaderDatasets (e.g. TFRecordDataset,
TextLineDataset, etc.) and, if so, sharding the input files. Note that
there has to be at least one input file per worker. If you have fewer than
one input file per worker, we suggest that you disable distributing your
dataset using the method below.

If that attempt is unsuccessful (e.g. the dataset is created from a
Dataset.range), we will shard the dataset evenly at the end by appending a
.shard operation to the end of the processing pipeline. This will cause
the entire preprocessing pipeline for all the data to be run on every
worker, and each worker will do redundant work. We will print a warning
if this method of sharding is selected.
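
To make the cost of this fallback concrete, here is a plain-Python sketch of sharding at the end of a pipeline. It is an illustration under stated assumptions, not TensorFlow code: run_worker is a hypothetical helper, squaring stands in for an expensive preprocessing step, and the shard rule (keep positions where i % num_workers == worker_index) mirrors the semantics of Dataset.shard.

```python
def run_worker(source, num_workers, worker_index):
    """Preprocess every element, then keep only this worker's shard."""
    preprocess_calls = 0
    processed = []
    for x in source:
        preprocess_calls += 1    # every worker pays for every element
        processed.append(x * x)  # stand-in for expensive preprocessing
    # Sharding happens last: keep positions i with i % num_workers == worker_index.
    return processed[worker_index::num_workers], preprocess_calls


shard0, calls0 = run_worker(range(8), num_workers=2, worker_index=0)
shard1, calls1 = run_worker(range(8), num_workers=2, worker_index=1)

print(shard0, calls0)  # [0, 4, 16, 36] 8
print(shard1, calls1)  # [1, 9, 25, 49] 8
```

Each worker keeps only half of the elements but preprocesses all eight, which is the redundant work the warning refers to.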

Within each worker, we will also split the data among all the worker
devices (if more than one is present), and this will happen even if
multi-worker sharding is disabled using the method above.

If the above batch splitting and dataset sharding logic is undesirable,
please use experimental_distribute_datasets_from_function instead, which
does not do any automatic splitting or sharding.

You can also use the element_spec property of the distributed dataset
returned by this API to query the tf.TypeSpec of the elements returned
by the iterator. This can be used to set the input_signature property
of a tf.function.

experimental_distribute_datasets_from_function

dataset_fn will be called once for each worker in the strategy. Each
replica on that worker will dequeue one batch of inputs from the local
Dataset (i.e. if a worker has two replicas, two batches will be dequeued
from the Dataset every step).

This method can be used for several purposes. For example, where
experimental_distribute_dataset is unable to shard the input files, this
method might be used to manually shard the dataset (avoiding the slow
fallback behavior in experimental_distribute_dataset). In cases where the
dataset is infinite, this sharding can be done by creating dataset replicas
that differ only in their random seed.
experimental_distribute_dataset may also sometimes fail to split the
batch across replicas on a worker. In that case, this method can be used
where that limitation does not exist.

The dataset_fn should take a tf.distribute.InputContext instance, from
which information about batching and input replication can be accessed.

IMPORTANT: The tf.data.Dataset returned by dataset_fn should have a
per-replica batch size, unlike experimental_distribute_dataset, which uses
the global batch size. This may be computed using
input_context.get_per_replica_batch_size.
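
The arithmetic behind the per-replica batch size can be sketched in plain Python. This is a stand-in for illustration, not the library code: the function per_replica_batch_size and the worker and replica counts below are hypothetical, and the sketch assumes the global batch size must divide evenly by the number of replicas in sync.

```python
def per_replica_batch_size(global_batch_size, num_replicas_in_sync):
    """Return the batch size each replica should see (illustrative only)."""
    if global_batch_size % num_replicas_in_sync != 0:
        raise ValueError(
            "global_batch_size must be divisible by num_replicas_in_sync")
    return global_batch_size // num_replicas_in_sync


# Hypothetical cluster: 2 workers with 2 replicas each.
global_batch_size = 64
num_workers = 2
replicas_per_worker = 2
num_replicas_in_sync = num_workers * replicas_per_worker

batch_size = per_replica_batch_size(global_batch_size, num_replicas_in_sync)
# Each replica dequeues one batch per step, so a worker with two replicas
# consumes two per-replica batches from its local dataset every step.
examples_per_worker_step = batch_size * replicas_per_worker

print(batch_size)                              # 16
print(examples_per_worker_step * num_workers)  # 64
```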

To query the tf.TypeSpec of the elements in the distributed dataset
returned by this API, you need to use the element_spec property of the
distributed iterator. This tf.TypeSpec can be used to set the
input_signature property of a tf.function.

# If you want to specify `input_signature` for a `tf.function` you must
# first create the iterator.
iterator = iter(inputs)

@tf.function(input_signature=[iterator.element_spec])
def replica_fn_with_signature(inputs):
  # train the model with inputs
  return

for _ in range(steps):
  strategy.run(replica_fn_with_signature,
               args=(next(iterator),))

IMPORTANT: Depending on the tf.distribute.Strategy implementation being
used, and whether eager execution is enabled, fn may be called one or more
times (once for each replica).

Args:

fn: The function to run. The inputs to the function must match the outputs
of input_iterator.get_next(). The output must be a tf.nest of
Tensors.

input_iterator: (Optional) input iterator from which the inputs are taken.

Returns:

Merged return value of fn across replicas. The structure of the return
value is the same as the return value from fn. Each element in the
structure can either be PerReplica (if the values are unsynchronized),
Mirrored (if the values are kept in sync), or Tensor (if running on a
single replica).

make_dataset_iterator

Data from the given dataset will be distributed evenly across all the
compute replicas. We will assume that the input dataset is batched by the
global batch size. With this assumption, we will make a best effort to
divide each batch across all the replicas (one or more workers).
If this effort fails, an error will be thrown, and the user should instead
use make_input_fn_iterator, which provides more control to the user and
does not try to divide a batch across replicas.

The user could also use make_input_fn_iterator if they want to
customize which input is fed to which replica/worker etc.

Args:

dataset: tf.data.Dataset that will be distributed evenly across all
replicas.

Returns:

A tf.distribute.InputIterator which returns inputs for each step of the
computation. The user should call initialize on the returned iterator.

Returns:

An iterator object that should first be .initialize()-ed. It may then
either be passed to strategy.experimental_run() or you can call
iterator.get_next() to get the next value to pass to
strategy.extended.call_for_each_replica().

reduce

Given a per-replica value returned by run, say a
per-example loss, the batch will be divided across all the replicas. This
function allows you to aggregate across replicas and optionally also across
batch elements. For example, if you have a global batch size of 8 and 2
replicas, values for examples [0, 1, 2, 3] will be on replica 0 and
[4, 5, 6, 7] will be on replica 1. By default, reduce will just
aggregate across replicas, returning [0+4, 1+5, 2+6, 3+7]. This is useful
when each replica is computing a scalar or some other value that doesn't
have a "batch" dimension (like a gradient). More often you will want to
aggregate across the global batch, which you can get by specifying the batch
dimension as the axis, typically axis=0. In this case it would return a
scalar 0+1+2+3+4+5+6+7.
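
The two behaviors in this example can be reproduced with a small plain-Python sketch. It only mimics the arithmetic of a sum reduction. It is not the TensorFlow implementation, and reduce_sum is a hypothetical helper operating on ordinary lists.

```python
def reduce_sum(per_replica_values, axis=None):
    """Sum across replicas, and optionally across the batch dimension too."""
    if axis is None:
        # Element-wise sum across replicas; the batch dimension is kept.
        return [sum(values) for values in zip(*per_replica_values)]
    if axis == 0:
        # Reduce along the batch dimension on each replica, then across replicas.
        return sum(sum(values) for values in per_replica_values)
    raise ValueError("this sketch only supports axis=None or axis=0")


# Global batch of 8 split over 2 replicas, as in the text above.
per_replica = [[0, 1, 2, 3], [4, 5, 6, 7]]

print(reduce_sum(per_replica))          # [4, 6, 8, 10]
print(reduce_sum(per_replica, axis=0))  # 28
```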

If there is a last partial batch, you will need to specify an axis so
that the resulting shape is consistent across replicas. So if the last
batch has size 6 and it is divided into [0, 1, 2, 3] and [4, 5], you
would get a shape mismatch unless you specify axis=0. If you specify
tf.distribute.ReduceOp.MEAN, using axis=0 will use the correct
denominator of 6. Contrast this with computing reduce_mean to get a
scalar value on each replica and this function to average those means,
which will weigh some values 1/8 and others 1/4.
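
The difference between the two ways of averaging can be checked by hand. The following plain-Python sketch uses the partial batch from the paragraph above, with the example indices doubling as the values being averaged. It illustrates the arithmetic only and does not call TensorFlow.

```python
from fractions import Fraction

# Last partial batch of size 6, split across two replicas.
replica_0 = [0, 1, 2, 3]
replica_1 = [4, 5]

# Mean over the whole batch: every example has weight 1/6.
global_mean = Fraction(sum(replica_0) + sum(replica_1),
                       len(replica_0) + len(replica_1))

# Mean on each replica first, then the mean of those two means.
mean_0 = Fraction(sum(replica_0), len(replica_0))
mean_1 = Fraction(sum(replica_1), len(replica_1))
mean_of_means = (mean_0 + mean_1) / 2

# Effective weight of a single example under the mean-of-means scheme.
weight_on_replica_0 = Fraction(1, 2) / len(replica_0)
weight_on_replica_1 = Fraction(1, 2) / len(replica_1)

print(global_mean)          # 5/2
print(mean_of_means)        # 3
print(weight_on_replica_0)  # 1/8
print(weight_on_replica_1)  # 1/4
```

The mean of means overweights the two examples on the smaller replica, which is why it differs from the mean computed with the denominator of 6.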

Args:

value: A "per replica" value, e.g. returned by run to
be combined into a single tensor.

axis: Specifies the dimension to reduce along within each
replica's tensor. Should typically be set to the batch dimension, or
None to only reduce across replicas (e.g. if the tensor has no batch
dimension).

Returns:

A Tensor.

run

Executes ops specified by fn on each replica. If args or kwargs contain
tf.distribute.DistributedValues, such as those produced by a
"distributed Dataset" or experimental_distribute_values_from_function,
then when fn is executed on a particular replica, it will be executed with
the component of tf.distribute.DistributedValues that corresponds to that
replica.

Returns:

Merged return value of fn across replicas. The structure of the return
value is the same as the return value from fn. Each element in the
structure can either be tf.distribute.DistributedValues, Tensor
objects, or Tensors (for example, if running on a single replica).