Note: if sparse_grad is set to True, the gradient w.r.t weight will be
sparse. Only a subset of optimizers support sparse gradients, including SGD, AdaGrad
and Adam. By default lazy updates is turned on, which may perform differently
from standard updates. For more details, please check the Optimization API at:
https://mxnet.incubator.apache.org/api/python/optimization/optimization.html