This implements the Adam optimizer from Section 2 of the Adam
paper: https://arxiv.org/abs/1412.6980.
Adam is a first-order gradient-based optimization method based on
adaptive estimates of lower-order moments.
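As a rough illustration only (not the framework's actual implementation; the
function name adam_step is hypothetical, while the default hyperparameters are
the ones given in the paper), a single Adam step might look like::

    import numpy as np

    def adam_step(theta, grad, m, v, t,
                  learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        # Biased first- and second-moment estimates of the gradient.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias correction (t is the 1-based step count).
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        # Parameter update.
        theta = theta - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
        return theta, m, v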

In the update equations, \(\rho\) is a hyperparameter and typical values are
0.9, 0.95, and so on; \(\beta\) is the momentum term; \(\epsilon\) is a
smoothing term to avoid division by zero, usually set somewhere in the range
1e-4 to 1e-8. A sketch of an update consistent with these parameters follows
the parameter list below.

Parameters:

learning_rate (float) – global learning rate.

rho (float) – \(\rho\) in the equation, set to 0.95 by default.

epsilon (float) – \(\epsilon\) in the equation is the smoothing term to
avoid division by zero, set to 1e-6 by default.

momentum (float) – \(\beta\) in the equation is the momentum term,
set to 0.0 by default.
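The update rule itself is not reproduced above. As an assumption about its
intended form (a minimal sketch, not the library's implementation; the function
name rms_momentum_step is hypothetical), an accumulator-plus-momentum update
using the parameters listed could look like::

    import numpy as np

    def rms_momentum_step(w, grad, acc, vel,
                          learning_rate, rho=0.95, epsilon=1e-6, momentum=0.0):
        # Decaying average of squared gradients, controlled by rho.
        acc = rho * acc + (1 - rho) * grad ** 2
        # Momentum-smoothed step; epsilon keeps the division well defined.
        vel = momentum * vel + learning_rate * grad / np.sqrt(acc + epsilon)
        w = w - vel
        return w, acc, vel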

Accumulate the average of the parameters within a sliding window. The averaged
result is saved in temporary variables that can be applied to the parameter
variables of the current model by calling the apply() method, and the
restore() method restores the parameter values of the current model.

The size of the average window is determined by average_window_rate,
min_average_window, max_average_window, and the current number of updates.
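A hypothetical Python sketch of the behaviour described above (the class name,
the plain list-based window, and the assumption that the model exposes a
params dict are all illustrative; the real window is sized by
average_window_rate, min_average_window, max_average_window, and the update
count)::

    import copy

    class SlidingWindowAverage:
        def __init__(self, max_average_window=10000):
            self.max_average_window = max_average_window
            self.history = []   # parameter snapshots inside the window
            self.backup = None  # parameters saved by apply()

        def update(self, params):
            # Record a snapshot, keeping at most max_average_window of them.
            self.history.append(copy.deepcopy(params))
            if len(self.history) > self.max_average_window:
                self.history.pop(0)

        def apply(self, model):
            # Save the current parameters, then load the window average.
            self.backup = copy.deepcopy(model.params)
            n = len(self.history)
            model.params = {k: sum(s[k] for s in self.history) / n
                            for k in model.params}

        def restore(self, model):
            # Put the saved (non-averaged) parameters back.
            model.params = self.backup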
