For details of the update algorithm, see sgd_update and sgd_mom_update.
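
As a rough guide, the referenced update is ordinary momentum SGD with weight
decay. The NumPy sketch below is a hypothetical illustration only (the function
name sgd_mom_step is an assumption, and gradient rescaling and clipping are
omitted; sgd_mom_update remains authoritative)::

    import numpy as np

    def sgd_mom_step(weight, grad, state, lr, momentum=0.9, wd=0.0):
        # Sketch under assumptions: accumulate the weight-decayed gradient
        # into the momentum buffer, then move the weight against it.
        state[:] = momentum * state + lr * (grad + wd * weight)
        weight[:] -= state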
In addition to the SGD updates, the LBSGD optimizer uses LARS (Layer-wise
Adaptive Rate Scaling) to compute a separate learning rate for each layer of
the network, which improves stability when training with large batch sizes.
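
Conceptually, LARS scales the base learning rate for each layer by the ratio of
that layer's weight norm to its gradient norm, so layers with small gradients
relative to their weights are not starved and layers with large gradients do
not diverge. The sketch below illustrates this trust-ratio idea; the names
lars_layer_lr, eta, and eps are assumptions for illustration, not part of this
optimizer's API::

    import numpy as np

    def lars_layer_lr(weight, grad, base_lr, eta=0.001, wd=0.0, eps=1e-9):
        # Trust ratio: large weights with small gradients get a larger
        # effective rate; small weights with large gradients get a smaller
        # one. eta is the LARS trust coefficient, eps avoids division by zero.
        w_norm = np.linalg.norm(weight)
        g_norm = np.linalg.norm(grad)
        if w_norm > 0 and g_norm > 0:
            return base_lr * eta * w_norm / (g_norm + wd * w_norm + eps)
        return base_lr  # fall back to the base rate for zero norms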

This optimizer accepts the following parameters in addition to those accepted
by Optimizer.