Abstract

While techniques such as ensembling and distillation promise model quality
improvements when paired with almost any base model, they are seldom used, because
the multi-stage training setups they require are cumbersome and the extra
hyperparameters they introduce make tuning even more expensive. In this
paper we explore a variant of distillation that is relatively straightforward to
use, as it does not require a complicated multi-stage setup. We also show that
distillation can serve as a meaningful distributed learning algorithm: instead of
independent workers exchanging gradients, which requires worrying about delays and
synchronization, independent workers can exchange full model checkpoints. This can
be done far less frequently than exchanging gradients, breaking one of the
scalability barriers of stochastic gradient descent. We present experiments on Criteo
clickthrough-rate data and on the largest dataset used to date for neural language
modeling, based on Common Crawl and containing $6\times 10^{11}$ tokens. In these
experiments we show that we can scale at least $2\times$ beyond the maximum limit of
distributed stochastic gradient descent. Finally, we also show that online
distillation can dramatically reduce prediction churn between different
versions of a model.
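
The abstract sketches the key mechanism: workers exchange full model checkpoints infrequently and distill toward a stale copy of each other's predictions, rather than exchanging gradients every step. Below is a minimal single-process sketch of that codistillation idea, assuming two workers, a toy model, and illustrative hyperparameters (`SYNC_EVERY`, `DISTILL_WEIGHT`); it is not the paper's implementation, just an interpretation of the described protocol.

```python
# Minimal sketch of codistillation: two "workers" train copies of a model and,
# instead of exchanging gradients every step, periodically exchange checkpoints
# and distill toward a stale copy of each other's predictions.
# All names and values here are illustrative assumptions, not the paper's code.
import copy
import torch
import torch.nn.functional as F

torch.manual_seed(0)
NUM_STEPS, SYNC_EVERY, DISTILL_WEIGHT = 200, 50, 0.5  # assumed toy settings

def make_model():
    return torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                               torch.nn.Linear(32, 2))

workers = [make_model() for _ in range(2)]
optimizers = [torch.optim.SGD(m.parameters(), lr=0.1) for m in workers]
# Each worker holds a frozen, periodically refreshed copy of its peer.
stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]

for step in range(NUM_STEPS):
    # Toy batch; in a real distributed setting each worker reads its own shard.
    x = torch.randn(64, 10)
    y = (x.sum(dim=1) > 0).long()
    for i, (model, opt) in enumerate(zip(workers, optimizers)):
        logits = model(x)
        task_loss = F.cross_entropy(logits, y)
        with torch.no_grad():
            peer_logits = stale_peers[i](x)
        # Distillation term: match the stale peer's soft predictions.
        distill_loss = F.kl_div(F.log_softmax(logits, dim=-1),
                                F.softmax(peer_logits, dim=-1),
                                reduction="batchmean")
        loss = task_loss + DISTILL_WEIGHT * distill_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    # "Checkpoint exchange": refresh each worker's stale peer copy far less
    # often than gradient exchange would happen in distributed SGD.
    if (step + 1) % SYNC_EVERY == 0:
        stale_peers = [copy.deepcopy(workers[1]), copy.deepcopy(workers[0])]
```

Because peers only need each other's parameters every `SYNC_EVERY` steps, the communication pattern is far less frequent and less latency-sensitive than per-step gradient exchange, which is the scalability argument the abstract makes.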