Should be called after autograd.backward() and outside of record() scope,
and after trainer.allreduce_grads().

For normal parameter updates, step() should be used, which internally calls
allreduce_grads() and then update(). However, if you need access to the reduced
gradients to perform a transformation on them, such as gradient clipping, you
may want to call allreduce_grads() and update() manually, as in the sketch below.
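A minimal sketch of this pattern, assuming a toy Dense network and synthetic
data purely for illustration: the gradients are reduced first, clipped by
global norm with gluon.utils.clip_global_norm(), and only then applied.

    import mxnet as mx
    from mxnet import autograd, gluon

    # Toy model and data, placeholders for a real training setup.
    net = gluon.nn.Dense(1)
    net.initialize()
    trainer = gluon.Trainer(net.collect_params(), 'sgd',
                            {'learning_rate': 0.1})
    loss_fn = gluon.loss.L2Loss()

    data = mx.nd.random.uniform(shape=(8, 4))
    label = mx.nd.random.uniform(shape=(8, 1))

    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()

    # Reduce gradients across devices first (a no-op on a single context)...
    trainer.allreduce_grads()

    # ...then transform the reduced gradients, e.g. clip by global norm.
    grads = [p.grad() for p in net.collect_params().values()
             if p.grad_req != 'null']
    gluon.utils.clip_global_norm(grads, max_norm=1.0)

    # Finally apply the update; gradients are normalized by 1/batch_size.
    trainer.update(batch_size=data.shape[0])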

Parameters

batch_size (int) – Batch size of the data processed. Gradients will be normalized by 1/batch_size.
Set this to 1 if you have already normalized the loss manually, e.g. with loss = mean(loss).

ignore_stale_grad (bool, optional, default=False) – If true, ignores Parameters with stale gradients (gradients that have
not been updated by backward since the last step) and skips their update.
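As a short sketch of the batch_size=1 convention (reusing the toy net, loss_fn,
and data from the example above): when the loss is averaged manually, no
further 1/batch_size normalization is needed in update().

    with autograd.record():
        loss = loss_fn(net(data), label).mean()
    loss.backward()
    trainer.allreduce_grads()
    trainer.update(batch_size=1)  # loss already averaged over the batch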