Abstract

Pretrained masked language models (MLMs) require finetuning for most NLP tasks. Instead, we evaluate MLMs out of the box via their pseudo-log-likelihood scores (PLLs), which are computed by masking tokens one by one. We show that PLLs outperform scores from autoregressive language models like GPT-2 in a variety of tasks. By rescoring ASR and NMT hypotheses, RoBERTa reduces an end-to-end LibriSpeech model's WER by 30% relative and adds up to +1.7 BLEU on state-of-the-art baselines for low-resource translation pairs, with further gains from domain adaptation. We attribute this success to PLL's unsupervised expression of linguistic acceptability without a left-to-right bias, greatly improving on scores from GPT-2 (+10 points on island effects, NPI licensing in BLiMP). One can finetune MLMs to give scores without masking, enabling computation in a single inference pass. In all, PLLs and their associated pseudo-perplexities (PPPLs) enable plug-and-play use of the growing number of pretrained MLMs; e.g., we use a single cross-lingual model to rerank translations in multiple languages. We release our library for language model scoring at https://github.com/awslabs/mlm-scoring.
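The mask-one-token-at-a-time computation of a PLL can be sketched as follows. This is a schematic illustration only: the scorer below is a toy uniform stand-in for a real MLM (the released library wraps actual pretrained models), and all names here are illustrative.

```python
import math

def pseudo_log_likelihood(tokens, mask_token, log_prob_fn):
    """PLL: sum over positions t of log P(token_t | sentence with t masked).

    `log_prob_fn(masked_tokens, position, token)` stands in for any MLM that
    returns the log-probability of `token` at the masked `position`.
    """
    total = 0.0
    for t in range(len(tokens)):
        # Mask exactly one position, condition on the rest of the sentence.
        masked = tokens[:t] + [mask_token] + tokens[t + 1:]
        total += log_prob_fn(masked, t, tokens[t])
    return total

# Toy stand-in scorer: uniform over a 4-word vocabulary (illustrative only).
vocab = ["the", "cat", "sat", "mat"]
uniform = lambda masked, pos, tok: math.log(1.0 / len(vocab))

pll = pseudo_log_likelihood(["the", "cat", "sat"], "[MASK]", uniform)
```

Dividing the negative PLL of a corpus by its token count and exponentiating gives the corresponding pseudo-perplexity (PPPL), mirroring how perplexity is derived from log-likelihood.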