Thieves of Sesame Street: Model Extraction on BERT-based APIs

Abstract

We study the problem of model extraction in natural language processing, where an adversary with query access to a victim model attempts to reconstruct a local copy of the model. We show that when both the adversary and victim model fine-tune existing pretrained models such as BERT, the adversary does not need to have access to any training data to mount the attack. Indeed, we show that randomly sampled sequences of words, which do not satisfy grammar structures, make effective queries to extract textual models. This is true even for complex tasks such as natural language inference or question answering.

Our attacks can be mounted with a modest query budget of less than $400.The extraction's accuracy can be further improved using a large textual corpus like Wikipedia, or with intuitive heuristics we introduce. Finally, we measure the effectiveness of two potential defense strategies---membership classification and API watermarking. While these defenses mitigate certain adversaries and come at a low overhead because they do not require re-training of the victim model, fully coping with model extraction remains an open problem.