We present Optimal Completion Distillation (OCD), a training procedure for
optimizing sequence-to-sequence models based on edit distance. OCD is
efficient, has no hyper-parameters of its own, and does not require pretraining
or joint optimization with conditional log-likelihood. Given a partial sequence
generated by the model, we first identify the set of optimal suffixes that
minimize the total edit distance, using an efficient dynamic programming
algorithm. Then, for each position of the generated sequence, we use a target
distribution that puts equal probability on the first token of all the optimal
suffixes. OCD achieves state-of-the-art performance on end-to-end speech
recognition on both the Wall Street Journal and Librispeech datasets,
achieving $9.3\%$ WER and $4.5\%$ WER, respectively.
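
These per-step targets can be computed exactly with a standard edit-distance dynamic program over the generated prefix and the ground-truth sequence. The sketch below is one possible Python reading of that procedure, not the authors' released code; the function name `ocd_targets` and the `eos` symbol are illustrative assumptions.

```python
import numpy as np

def ocd_targets(generated, target, eos="</s>"):
    """Return, for each prefix of `generated`, the set of optimal next
    tokens: the first tokens of all suffixes that minimize the total
    edit distance to `target` (EOS marks the empty suffix)."""
    T, U = len(generated), len(target)
    # d[i, j] = edit distance between generated[:i] and target[:j].
    d = np.zeros((T + 1, U + 1), dtype=np.int32)
    d[:, 0] = np.arange(T + 1)
    d[0, :] = np.arange(U + 1)
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            sub = 0 if generated[i - 1] == target[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # delete generated[i-1]
                          d[i, j - 1] + 1,        # insert target[j-1]
                          d[i - 1, j - 1] + sub)  # substitute / match
    targets = []
    for i in range(T + 1):
        m = d[i].min()
        # Extending a minimum-cost alignment that ends at target prefix j
        # means emitting target[j] next; if the full target is already
        # matched at minimum cost, emitting EOS is also optimal.
        opt = {target[j] for j in range(U) if d[i, j] == m}
        if d[i, U] == m:
            opt.add(eos)
        targets.append(opt)
    return targets
```

For instance, with ground truth `SAT` and generated prefix `SU`, both `A` and `T` begin optimal suffixes (`AT` after deleting the `U`, `T` after substituting `U` for `A`, each at total edit distance 1), so the OCD target distribution would place probability $1/2$ on each.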
