On the convergence properties of a $K$-step averaging stochastic gradient descent algorithm for nonconvex optimization.

Despite their popularity, the practical performance of asynchronous stochastic gradient descent methods (ASGD) for solving large-scale machine learning problems is not as good as theoretical results indicate. We adopt and analyze a synchronous $K$-step averaging stochastic gradient descent algorithm, which we call K-AVG. We establish convergence results for K-AVG on nonconvex objectives and explain why the $K$-step delay is necessary and leads to better performance than traditional parallel stochastic gradient descent, which is the special case of K-AVG with $K=1$. We also show that K-AVG scales better than ASGD. Another advantage of K-AVG over ASGD is that it allows larger stepsizes. On a cluster of $128$ GPUs, K-AVG is faster than ASGD implementations and achieves better accuracy and faster convergence on the CIFAR dataset.
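
To make the synchronous $K$-step averaging scheme concrete, the following is a minimal single-process sketch, not the implementation evaluated on the GPU cluster. It assumes the usual reading of the scheme: in each round every worker starts from the shared iterate, takes $K$ local stochastic gradient steps, and the workers' iterates are then averaged. The names `k_avg` and `toy_grad`, the scalar double-well objective, and the Gaussian gradient noise are all illustrative assumptions and do not come from the paper.

```python
import random


def k_avg(grad, w0, num_workers, k, step_size, rounds, noise_std=0.1, seed=0):
    """Toy K-step averaging SGD on a scalar parameter (illustrative sketch)."""
    rng = random.Random(seed)
    w = w0
    for _ in range(rounds):
        local_iterates = []
        for _ in range(num_workers):
            # Each simulated worker starts the round from the shared iterate.
            w_local = w
            for _ in range(k):
                # Assumed noise model: true gradient plus Gaussian noise.
                g = grad(w_local) + rng.gauss(0.0, noise_std)
                w_local -= step_size * g
            local_iterates.append(w_local)
        # Synchronization step: average the workers' iterates.
        w = sum(local_iterates) / num_workers
    return w


def toy_grad(w):
    # Gradient of the nonconvex double-well f(w) = w**4 / 4 - w**2 / 2,
    # a hypothetical test objective with stationary points at -1, 0 and 1.
    return w ** 3 - w


if __name__ == "__main__":
    # k=1 corresponds to traditional parallel SGD (average after every step).
    w_parallel_sgd = k_avg(toy_grad, w0=2.0, num_workers=4, k=1,
                           step_size=0.05, rounds=800)
    # k=4 communicates four times less often for the same number of local steps.
    w_k_avg = k_avg(toy_grad, w0=2.0, num_workers=4, k=4,
                    step_size=0.05, rounds=200)
    print(w_parallel_sgd, w_k_avg)
```

Both calls perform the same number of local gradient steps per worker; the only difference is how often the iterates are averaged. Workers are simulated sequentially here, so the sketch illustrates the update rule only and says nothing about wall-clock scaling.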