Abstract

State-of-the-art machine learning models frequently misclassify inputs that have been perturbed in an adversarial manner. Adversarial perturbations generated for a given input and a specific classifier often seem to be effective on other inputs and even different classifiers. In other words, adversarial perturbations seem to transfer between different inputs, models, and even different neural network architectures. In this work, we show that in the context of linear classifiers and two-layer ReLU networks, there provably exist directions that give rise to adversarial perturbations for many classifiers and data points simultaneously. We show that these "transferable adversarial directions" are guaranteed to exist for linear separators of a given set, and will exist with high probability for linear classifiers trained on independent sets drawn from the same distribution. We extend our results to large classes of two-layer ReLU networks. We further show that adversarial directions for ReLU networks transfer to linear classifiers while the reverse need not hold, suggesting that adversarial perturbations for more complex models are more likely to transfer to other classifiers. We validate our findings empirically, even for deeper ReLU networks.