Universal adversarial perturbations (Moosavi-Dezfooli, Fawzi, Fawzi, Frossard)

Key points:
• The same noise can be applied across different models.
• A strong perturbation can be constructed from only a small amount of data.
• The noise can be built by summing up vectors pointing from data points in the normal direction toward the decision boundary.
• The normal vectors to the decision boundary point in a common direction across many data points, i.e., they are strongly correlated (the different decision-boundary regions admit a low-dimensional description).

From the abstract: "We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images."

1. Introduction: "Can we find a single small image perturbation that fools a state-of-the-art deep neural network classifier on all natural images? We show in this paper the existence of such quasi-imperceptible universal perturbation vectors that lead to misclassify natural images with high probability. Specifically, by adding such a quasi-imperceptible perturbation to natural images, the label estimated by the deep neural network is changed with high probability (see Fig. 1). Such perturbations are dubbed universal, as they are image-agnostic. The existence of these perturbations is problematic when the classifier is deployed in real-world (and possibly hostile) environments, as they can be exploited by adversaries." To encourage reproducible research, the authors released code and a video demonstrating the effect of universal perturbations on a smartphone.

Figure 1: When added to a natural image, a universal perturbation image causes the image to be misclassified by the deep neural network with high probability. Left images: original natural images. The labels are shown on top of each arrow. Central image: universal perturbation. Right images: perturbed images. The estimated labels of the perturbed images are shown on top of each arrow.

Ref: https://arxiv.org/abs/1610.08401 (arXiv:1610.08401v3 [cs.CV] 9 Mar 2017)

Introduction

Deep Learning is easily fooled

By adding a perturbation that is imperceptible to the human eye, a Deep Learning (DL) model can be made to misclassify! (Panels (a) and (b): original image + noise = perturbed image. All perturbed images are recognized as "ostrich, Struthio camelus".)

Figure 5 (from the paper): Adversarial examples generated for AlexNet [9]. (Left) a correctly predicted sample, (center) the difference between the correct image and the image predicted incorrectly, magnified by 10x (values shifted by 128 and clamped), (right) the adversarial example. All images in the right column are predicted to be an "ostrich, Struthio camelus". Average distortion based on 64 examples is 0.006508. Please refer to http://goo.gl/huaGPb for full-resolution images. The examples are strictly randomly chosen; there is no postselection involved. The distortion is measured by $\sqrt{\sum_i (x'_i - x_i)^2 / n}$ between the original image $x$ and the distorted image $x'$, where $n$ is the number of pixels.

Ref: https://arxiv.org/abs/1312.6199
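For reference, this distortion metric is straightforward to compute. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def average_pixel_distortion(x, x_adv):
    """Average distortion sqrt(sum_i (x'_i - x_i)^2 / n) between the
    original image x and the distorted image x', with n pixels."""
    d = np.asarray(x_adv, dtype=np.float64) - np.asarray(x, dtype=np.float64)
    return float(np.sqrt(np.sum(d ** 2) / d.size))
```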

How conventional methods construct the noise (Szegedy et al.): compute a small noise that steers the input toward a chosen class.

Background from the paper: input deformations used for data augmentation increase the robustness and convergence speed of models [9, 13], but they are statistically inefficient for a given example: they are highly correlated and are drawn from the same distribution throughout the entire training of the model. The authors instead propose a scheme that adaptively exploits the model and its deficiencies in modeling the local space around the training data, and make an explicit connection with hard-negative mining, which is close in spirit.

Formal description (Sec. 4.1): denote by $f : \mathbb{R}^m \to \{1,\dots,k\}$ a classifier mapping image pixel value vectors to a discrete label set, with an associated continuous loss function $\mathrm{loss}_f : \mathbb{R}^m \times \{1,\dots,k\} \to \mathbb{R}^+$. For a given image $x \in \mathbb{R}^m$ and target label $l \in \{1,\dots,k\}$, solve the box-constrained optimization problem:

minimize $\|r\|_2$ subject to (1) $f(x + r) = l$ and (2) $x + r \in [0, 1]^m$.

The minimizer $r$ might not be unique, but one such $x + r$ is denoted $D(x, l)$. Informally, $x + r$ is the closest image to $x$ classified as $l$ by $f$, so the task is non-trivial only if $f(x) \neq l$. In general, the exact computation of $D(x, l)$ is a hard problem, so it is approximated with a box-constrained L-BFGS: concretely, perform a line search to find the minimum $c > 0$ for which the minimizer $r$ of

minimize $c\,|r| + \mathrm{loss}_f(x + r, l)$ subject to $x + r \in [0, 1]^m$

satisfies $f(x + r) = l$. Here $x$ is the input image, $l$ the label, and $f$ the DL model. This penalty method would yield the exact solution for $D(x, l)$ in the case of convex losses; neural networks are non-convex in general, so the result is an approximation.

Key experimental observations (Sec. 4.2): (1) for all networks studied (MNIST, QuocNet [10], AlexNet [9]), very close, visually hard-to-distinguish adversarial examples misclassified by the original network could always be generated; (2) cross-model generalization: a relatively large fraction of adversarial examples are misclassified by networks trained from scratch with different hyper-parameters (number of layers, regularization, or initial weights); (3) cross-training-set generalization: a relatively large fraction are misclassified by networks trained from scratch on a disjoint training set. These observations suggest that adversarial examples are somewhat universal and not just the result of overfitting to a particular model or training-set selection, and that feeding adversarial examples back into training might improve generalization. Preliminary MNIST experiments support this: a two-layer 100-100-10 non-convolutional network reached a test error below 1.2% by continuously mixing freshly generated adversarial examples into the training set, versus 1.6% with weight decay alone and around 1.3% with carefully applied dropout.

Ref: https://arxiv.org/abs/1312.6199
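To make the formulation concrete, here is a minimal PyTorch sketch of the penalty objective $c|r| + \mathrm{loss}_f(x+r, l)$. The function name is ours, and plain Adam is used instead of the paper's box-constrained L-BFGS with a line search over c; `model` is assumed to be a batched classifier returning logits for inputs in [0, 1]:

```python
import torch

def targeted_perturbation(model, x, target, c=0.1, steps=200, lr=0.01):
    """Sketch: minimize c*|r| + loss(x + r, target) s.t. x + r in [0,1]^m.
    x is a batched image tensor (1, C, H, W); target is an int label."""
    r = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    t = torch.tensor([target])
    for _ in range(steps):
        opt.zero_grad()
        x_adv = (x + r).clamp(0.0, 1.0)      # box constraint on the image
        loss = c * r.abs().sum() + torch.nn.functional.cross_entropy(model(x_adv), t)
        loss.backward()
        opt.step()
    return (x + r).clamp(0.0, 1.0).detach()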

How conventional methods construct the noise (Goodfellow et al., published as a conference paper at ICLR 2015): accumulate a small step in the direction of the gradient of the objective function.

Findings highlighted by the paper:
• Shallow softmax regression models are also vulnerable to adversarial examples.
• Training on adversarial examples can regularize the model; however, this was not practical at the time due to the need for expensive constrained optimization in the inner loop.

"These results suggest that classifiers based on modern machine learning techniques, even those that obtain excellent performance on the test set, are not learning the true underlying concepts that determine the correct output label. Instead, these algorithms have built a Potemkin village that works well on naturally occurring data, but is exposed as a fake when one visits points in space that do not have high probability in the data distribution." These results have often been interpreted as a flaw of deep networks in particular, even though linear classifiers have the same problem.

3. The linear explanation of adversarial examples: in many problems the precision of an individual input feature is limited (digital images often use only 8 bits per pixel, discarding all information below 1/255 of the dynamic range), so for well-separated classes the classifier should assign the same class to $x$ and $\tilde{x} = x + \eta$ whenever $\|\eta\|_\infty < \epsilon$ is below that precision. Decomposing the perturbed input, consider the dot product between a weight vector $w$ and an adversarial example: $w^\top \tilde{x} = w^\top x + w^\top \eta$. The perturbation grows the activation by $w^\top \eta$, which is maximized under the max-norm constraint by $\eta = \mathrm{sign}(w)$; if $w$ has $n$ dimensions and the average magnitude of its elements is $m$, the activation grows by $\epsilon m n$. Since $\|\eta\|_\infty$ does not grow with the dimensionality of the problem while the activation change can grow linearly with $n$, many infinitesimal input changes can add up to one large output change: a sort of "accidental steganography," where a linear model attends exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude. A simple linear model can therefore have adversarial examples if its input has sufficient dimensionality; this linearity hypothesis is simpler than previous explanations invoking the supposed highly non-linear nature of neural networks, and also explains why softmax regression is vulnerable.

4. Linear perturbation of non-linear models: let $\theta$ be the parameters of a model, $x$ the input, $y$ the target associated with $x$, and $J(\theta, x, y)$ the cost used to train the network. Linearizing the cost around the current value of $\theta$ yields an optimal max-norm constrained perturbation, computed by the "fast gradient sign method":

$\eta = \epsilon \,\mathrm{sign}(\nabla_x J(\theta, x, y))$

Here ε is a small constant, θ the model parameters, x the input image, and y the label; the required gradient can be computed efficiently using backpropagation. This method reliably causes a wide variety of models to misclassify their input: with ε = .25, a shallow softmax classifier reaches an error rate of 99.9% with an average confidence of 79.3% on the MNIST test set, and in the same setting a maxout network misclassifies 89.4% of adversarial examples with an average confidence of 97.6%. With ε = .1, a convolutional maxout network on a preprocessed version of the CIFAR-10 test set gives an error rate of 87.15% with an average probability of 96.6% assigned to the incorrect labels. Other simple methods are possible as well, e.g., rotating $x$ by a small angle in the direction of the gradient also reliably produces adversarial examples.

Figure 1 (from the paper): a demonstration of fast adversarial example generation applied to GoogLeNet (Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose elements equal the sign of the elements of the gradient of the cost function with respect to the input, GoogLeNet's classification of the image changes: $x$ is classified as "panda" with 57.7% confidence, $\mathrm{sign}(\nabla_x J(\theta, x, y))$ alone reads as "nematode" with 8.2% confidence, and $x + \epsilon\,\mathrm{sign}(\nabla_x J(\theta, x, y))$ as "gibbon" with 99.3% confidence. Here ε = .007 corresponds to the magnitude of the smallest bit of an 8-bit image encoding after GoogLeNet's conversion to real numbers.

The fact that these simple, cheap algorithms generate misclassified examples is evidence for the linearity interpretation: LSTMs (Hochreiter & Schmidhuber, 1997), ReLUs (Jarrett et al., 2009; Glorot et al., 2011), and maxout networks (Goodfellow et al., 2013c) are all intentionally designed to behave in very linear ways so that they are easier to optimize, and even more nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime. For logistic regression, the simplest case, the fast gradient sign method is exact; the algorithms are also useful for speeding up adversarial training or analysis of trained networks.

Ref: https://arxiv.org/abs/1412.6572
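A minimal PyTorch sketch of the fast gradient sign method (the function name is ours; `model` is assumed to return logits for a batched input in [0, 1]):

```python
import torch

def fgsm_perturbation(model, x, y, eps=0.007):
    """Compute eta = eps * sign(grad_x J(theta, x, y)) for one input batch."""
    x = x.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    eta = eps * x.grad.sign()            # max-norm constrained perturbation
    x_adv = (x + eta).clamp(0.0, 1.0)    # keep the image in the valid range
    return eta.detach(), x_adv.detach()
```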

How conventional methods construct the noise (DeepFool): accumulate, little by little, displacements in the direction that crosses the decision boundary.

DeepFool: a simple and accurate method to fool deep neural networks
Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Pascal Frossard (École Polytechnique Fédérale de Lausanne)

Abstract: "State-of-the-art deep neural networks have achieved impressive results on many image classification tasks. However, these same architectures have been shown to be unstable to small, well sought, perturbations of the images. Despite the importance of this phenomenon, no effective methods have been proposed to accurately compute the robustness of state-of-the-art deep classifiers to such perturbations on large-scale datasets. In this paper, we fill this gap and propose the DeepFool algorithm to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers. Extensive experimental results show that our approach outperforms recent methods in the task of computing adversarial perturbations and making classifiers more robust."

From the introduction: though deep networks have exhibited very good performance in classification tasks, they have recently been shown to be particularly unstable to adversarial perturbations of the data [18]. In fact, very small and often imperceptible perturbations of the data samples are sufficient to fool state-of-the-art classifiers and result in incorrect classification (e.g., Figure 1). Formally, for a given classifier, an adversarial perturbation is defined as the minimal perturbation $r$ that is sufficient to change the estimated label $\hat{k}(x)$.

Figure 1: An example of adversarial perturbations. First row: the original image x that is classified as k̂(x): ...

Ref: https://arxiv.org/abs/1511.04599 (arXiv:1511.04599v3 [cs.LG] 4 Jul 2016)
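As an illustration of the "step toward the nearest boundary" idea, here is a minimal PyTorch sketch of a DeepFool-style loop for a multiclass classifier. The function name and hyper-parameter defaults are ours, and this is a simplification of the paper's algorithm, not a faithful reimplementation; `model` is assumed to return logits for a batched input:

```python
import torch

def deepfool(model, x, num_classes=10, overshoot=0.02, max_iter=50):
    """Sketch of DeepFool: repeatedly take the smallest step that crosses
    the nearest linearized decision boundary, until the label flips."""
    x0 = x.clone().detach()
    with torch.no_grad():
        k0 = model(x0).argmax(dim=1).item()       # original estimated label
    r_total = torch.zeros_like(x0)
    for _ in range(max_iter):
        x_adv = (x0 + (1 + overshoot) * r_total).requires_grad_(True)
        logits = model(x_adv)[0]
        if logits.argmax().item() != k0:          # label flipped: done
            break
        grads = [torch.autograd.grad(logits[k], x_adv, retain_graph=True)[0]
                 for k in range(num_classes)]
        # distance to each linearized boundary f_k(x) - f_k0(x) = 0
        best_dist, best_w = None, None
        for k in range(num_classes):
            if k == k0:
                continue
            w_k = grads[k] - grads[k0]
            dist = (logits[k] - logits[k0]).abs().item() / (w_k.norm().item() + 1e-8)
            if best_dist is None or dist < best_dist:
                best_dist, best_w = dist, w_k
        # minimal step onto (slightly past) the nearest boundary
        r_total = r_total + (best_dist + 1e-4) * best_w / (best_w.norm() + 1e-8)
    return r_total.detach(), (x0 + (1 + overshoot) * r_total).detach()
```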


Universal adversarial perturbations

Why universal perturbations matter (from the paper): the perturbation process involves the mere addition of one very small perturbation to all natural images, and can be relatively straightforward to implement by adversaries in real-world environments, while being relatively difficult to detect, as such perturbations are very small and thus do not significantly affect data distributions. Perturbing a new datapoint then only involves the mere addition of the universal perturbation to the image (and does not require solving an optimization problem or computing a gradient). The surprising existence of universal perturbations further reveals new insights on the topology of the decision boundaries of deep neural networks. The main contributions of the paper:

• We show the existence of universal image-agnostic perturbations for state-of-the-art deep neural networks.
• We propose an algorithm for finding such perturbations. The algorithm seeks a universal perturbation for a set of training points, and proceeds by aggregating atomic perturbation vectors that send successive datapoints to the decision boundary of the classifier.
• We show that universal perturbations have a remarkable generalization property, as perturbations computed for a rather small set of training points fool new images with high probability.
• We show that such perturbations are not only universal across images, but also generalize well across deep neural networks. Such perturbations are therefore doubly universal, both with respect to the data and the network architectures.
• We explain and analyze the high vulnerability of deep neural networks to universal perturbations by examining the geometric correlation between different parts of the decision boundary.

This notion of universal perturbation differs from the generalization of adversarial perturbations studied in [19], where perturbations computed on the MNIST task were shown to generalize well across different models; here, the question is the existence of universal perturbations that are common to most data points belonging to the data distribution.

2. Universal perturbations (formulation of the universal perturbation noise)

Let $\mu$ denote a distribution of images in $\mathbb{R}^d$, and let $\hat{k}$ define a classification function that outputs for each image $x \in \mathbb{R}^d$ an estimated label $\hat{k}(x)$. The main focus is to seek perturbation vectors $v \in \mathbb{R}^d$ that fool the classifier $\hat{k}$ on almost all datapoints sampled from $\mu$; that is, we seek a vector $v$ such that

$\hat{k}(x + v) \neq \hat{k}(x)$ for "most" $x \sim \mu$.

Here $\hat{k}$ is the estimated label, $x$ an image, and $\mu$ the data distribution. Such a perturbation is coined universal, as it represents a fixed image-agnostic perturbation that causes a label change for most images sampled from the data distribution $\mu$. The focus is on the case where $\mu$ represents the set of natural images, hence containing a huge amount of variability; in that context, the paper examines the existence of small universal perturbations (in terms of the $\ell_p$ norm with $p \in [1, \infty)$) that misclassify most images. To make the noise $v$ an effective perturbation that causes misclassification while remaining quasi-imperceptible, two constraints are imposed: a norm budget $\|v\|_p \le \xi$ and a required fooling probability $1 - \delta$ (restated in display form below).
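Putting the two requirements together, the search problem can be stated compactly (a restatement in display form of the constraints named above):

```latex
\text{find } v \in \mathbb{R}^d \text{ with } \hat{k}(x+v) \neq \hat{k}(x)
\text{ for ``most'' } x \sim \mu,
\quad \text{s.t.} \quad
\|v\|_p \le \xi,
\qquad
\mathbb{P}_{x \sim \mu}\!\bigl[\hat{k}(x+v) \neq \hat{k}(x)\bigr] \ge 1 - \delta .
```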

The algorithm

Per data point, compute the shortest vector to the decision boundary; project the running sum so that its $\ell_p$-norm stays within the budget while preserving the boundary-crossing direction as much as possible; stop once the fooling rate is high enough. (A runnable sketch follows this slide.)

Algorithm 1: Computation of universal perturbations.
1: input: data points X, classifier k̂, desired ℓp norm of the perturbation ξ, desired accuracy on perturbed samples δ.
2: output: universal perturbation vector v.
3: Initialize v ← 0.
4: while Err(X_v) ≤ 1 − δ do
5:   for each datapoint x_i ∈ X do
6:     if k̂(x_i + v) = k̂(x_i) then
7:       Compute the minimal perturbation that sends x_i + v to the decision boundary:
         Δv_i ← arg min_r ‖r‖_2 s.t. k̂(x_i + v + r) ≠ k̂(x_i).
8:       Update the perturbation: v ← P_{p,ξ}(v + Δv_i).
9:     end if
10:  end for
11: end while

Figure 2: Schematic representation of the proposed algorithm used to compute universal perturbations. In this illustration, data points x1, x2 and x3 are super-imposed, and the classification regions R_i (i.e., regions of constant estimated label) are shown in different colors. The algorithm proceeds by aggregating sequentially the minimal perturbations sending the current perturbed points x_i + v outside of the corresponding classification region R_i.

For each data point whose perturbed version is still correctly classified, the minimal extra perturbation is (Eq. 1):

$\Delta v_i \leftarrow \arg\min_r \|r\|_2 \ \text{s.t.}\ \hat{k}(x_i + v + r) \neq \hat{k}(x_i). \quad (1)$

To ensure that the constraint $\|v\|_p \le \xi$ is satisfied, the updated universal perturbation is further projected onto the $\ell_p$ ball of radius $\xi$ centered at 0. That is, with the projection operator

$P_{p,\xi}(v) = \arg\min_{v'} \|v - v'\|_2 \ \text{subject to}\ \|v'\|_p \le \xi,$

the update rule is $v \leftarrow P_{p,\xi}(v + \Delta v_i)$. Several passes over the data set X are performed to improve the quality of the universal perturbation. The algorithm terminates when the empirical "fooling rate" on the perturbed data set $X_v := \{x_1 + v, \dots, x_m + v\}$ exceeds the target threshold $1 - \delta$; that is, we stop whenever

$\mathrm{Err}(X_v) := \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}_{\hat{k}(x_i + v) \neq \hat{k}(x_i)} \ge 1 - \delta.$

Interestingly, in practice the number of data points m in X need not be large to compute a universal perturbation that is valid for the whole distribution µ; in particular, m can be much smaller than the number of training points (see Section 3). The algorithm solves at most m instances of the optimization problem in Eq. (1) per pass. While this optimization problem is not convex when k̂ is a standard classifier (e.g., a deep neural network), several efficient approximate methods have been devised for solving it [19, 11, 7]; the approach in [11] (DeepFool) is used for its efficiency. It should further be noticed that the objective of Algorithm 1 is not to find the smallest universal perturbation that fools most data points, but rather one such perturbation with sufficiently small norm. In particular, different random shufflings of the set X naturally lead to a diverse set of universal perturbations satisfying the required constraints, so the algorithm can be leveraged to generate multiple universal perturbations for a deep neural network (see the next section for visual examples).

3. Universal perturbations for deep nets. We now analyze the robustness of state-of-the-art deep neural network classifiers to universal perturbations using Algorithm 1. In a first experiment, the estimated universal perturbations are assessed for different recent deep neural networks on the ILSVRC 2012 [15] validation set (50,000 images), reporting the fooling ratio, that is the proportion of images that change labels when perturbed by the universal perturbation. Results are reported for p = 2 and p = ∞, with ξ = 2000 and ξ = 10 respectively; these values were chosen so that the perturbation's norm is significantly smaller than typical image norms, making the perturbation quasi-imperceptible when added to natural images.
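A minimal PyTorch sketch of Algorithm 1, under our own simplifications: it reuses the `deepfool` sketch given earlier as the inner minimal-perturbation solver, evaluates the model on the whole set X in one call (fine for a sketch, not for 10,000 ImageNet images), and handles only p = 2 and p = ∞:

```python
import torch

def project_lp(v, xi, p):
    """Projection P_{p,xi}: map v onto the l_p ball of radius xi
    (only p = 2 and p = inf are handled in this sketch)."""
    if p == 2:
        n = v.norm()
        return v if n <= xi else v * (xi / n)
    if p == float('inf'):
        return v.clamp(-xi, xi)
    raise ValueError("unsupported p in this sketch")

def universal_perturbation(model, X, xi=10.0, p=float('inf'),
                           delta=0.2, max_passes=5):
    """Sketch of Algorithm 1: aggregate minimal per-point boundary-crossing
    perturbations, projecting onto the l_p ball after each update,
    until Err(X_v) > 1 - delta."""
    v = torch.zeros_like(X[0:1])

    def fooling_rate():
        with torch.no_grad():
            clean = model(X).argmax(dim=1)
            pert = model(X + v).argmax(dim=1)
        return (clean != pert).float().mean().item()

    for _ in range(max_passes):
        if fooling_rate() > 1 - delta:
            break
        for i in torch.randperm(len(X)).tolist():   # random shuffling of X
            x_i = X[i:i + 1]
            with torch.no_grad():
                unfooled = (model(x_i + v).argmax(1) == model(x_i).argmax(1)).item()
            if unfooled:
                r, _ = deepfool(model, x_i + v)     # minimal boundary step
                v = project_lp(v + r, xi, p)
    return v
```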

Several experiments

Fooling rate of each model

Fooling rates computed for various models:
・ILSVRC data used (X: 10,000 images; Val.: 50,000 images)
・Experiments with the combinations (p, ξ) = (2, 2000) and (∞, 10)
The ℓ2-norm gives better results overall, but some models are better matched by the ℓ∞-norm.

           CaffeNet [8]  VGG-F [2]  VGG-16 [17]  VGG-19 [17]  GoogLeNet [18]  ResNet-152 [6]
ℓ2   X     85.4%         85.9%      90.7%        86.9%        82.9%           89.7%
     Val.  85.6%         87.0%      90.3%        84.5%        82.0%           88.5%
ℓ∞   X     93.1%         93.8%      78.5%        77.8%        80.8%           85.4%
     Val.  93.3%         93.7%      78.3%        77.8%        78.9%           84.0%

Table 1: Fooling ratios on the set X and the validation set.

Results are listed in Table 1. Each result is reported on the set X, which is used to compute the perturbation, as well as on the validation set (not used in the computation of the universal perturbation). For all networks, the universal perturbation achieves very high fooling rates on the validation set: the perturbations computed for CaffeNet and VGG-F fool more than 90% of the validation set (for p = ∞). In other words, for any natural image in the validation set, the mere addition of the universal perturbation fools the classifier more than 9 times out of 10. This result is moreover not specific to these architectures, as universal perturbations that cause VGG, GoogLeNet and ResNet classifiers to be fooled on natural images with probability edging 80% can also be found. These results have an element of surprise, as they show the existence of single universal perturbation vectors that cause natural images to be misclassified with high probability, albeit being quasi-imperceptible to humans. (For comparison, the average ℓ2 and ℓ∞ norms of an image in the validation set are respectively ≈ 5 × 10^4 and ≈ 250.)

To verify the imperceptibility claim, visual examples of perturbed images (misclassifications of GoogLeNet induced by the universal perturbation) are shown in Fig. 3. These images are either taken from the ILSVRC 2012 validation set or captured using a mobile phone camera. Observe that in most cases the universal perturbation is quasi-imperceptible, yet this powerful image-agnostic perturbation is able to misclassify any image with high probability for state-of-the-art classifiers. See the supp. material for the original (unperturbed) images and their ground-truth labels, and the supplementary video for real-world examples on a smartphone.

Figure 3: Examples of perturbed images and their corresponding labels. The first 8 images belong to the ILSVRC 2012 validation set, and the last 4 are images taken by a mobile phone camera. See supp. material for the original images.

While the above universal perturbations are computed for a set X of 10,000 images from the training set (i.e., on average 10 images per class), we now examine the influence of the size of X on the quality of the universal perturbation. Fig. 6 shows the fooling rates obtained on the validation set for different sizes of X for GoogLeNet. Note for example that with a set X containing only 500 images, more than 30% of the validation images can be fooled. This result is significant when compared to the number of classes in ImageNet (1,000), as it shows that a large set of unseen images can be fooled even when X contains less than one image per class! The universal perturbations computed by Algorithm 1 therefore have a remarkable generalization power over unseen data points, and can be computed on a very small set of training images.

Cross-model universality. While the computed perturbations are universal across unseen data points, we now examine their cross-model universality, i.e., to which extent universal perturbations computed for a specific architecture (e.g., VGG-19) are also valid for another architecture (e.g., GoogLeNet). Table 2 displays a matrix summarizing the universality of such perturbations across six different architectures: for each architecture, a universal perturbation is computed and the fooling ratios on all other architectures are reported in the rows of Table 2. Observe that, for some architectures, the universal perturbations generalize very well: for example, universal perturbations computed for the VGG-19 network have a fooling ratio above 53% for all other tested architectures. This shows that the universal perturbations are, to some extent, doubly universal, as they generalize well across data points and very different architectures. (In [19], adversarial perturbations were previously shown to generalize well, to some extent, across different neural networks on the MNIST problem; the result here differs in showing the generalizability of universal perturbations across different architectures on the ImageNet data set.) Such perturbations are therefore of practical relevance: in order to fool a new image on an unknown neural network, the simple addition of a universal perturbation computed on the VGG-19 architecture is likely to misclassify the data point.

Visualization of the effect of universal perturbations. To gain insights on the effect of universal perturbations on natural images, we visualize the distribution of labels on the ImageNet validation set. Specifically, we build a directed graph G = (V, E) whose vertices denote the labels, and whose directed edges e = (i → j) indicate that the majority of images of class i are fooled into label j when applying the universal perturbation; the existence of an edge i → j therefore suggests that the preferred fooling label for images of class i is j. Constructing this graph for GoogLeNet (the full graph is in the supp. material for space constraints) reveals a very peculiar topology: the graph is a union of disjoint components, where all edges in one component mostly connect to one target label (see Fig. 7 for an illustration of two connected components). This visualization clearly shows the existence of several dominant labels, and that universal perturbations mostly make natural images be classified with such labels. We hypothesize that these dominant labels occupy large regions in the image space, and therefore represent good candidate labels for fooling most natural images; note that these dominant labels are automatically found by Algorithm 1, and are not imposed a priori in the computation of perturbations.

Ref: https://arxiv.org/abs/1610.08401
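For clarity, the fooling ratio reported in Table 1 is just the fraction of labels that flip under the fixed perturbation. A minimal PyTorch sketch (the function name is ours; `loader` is assumed to yield batched image/label pairs):

```python
import torch

def fooling_ratio(model, loader, v):
    """Proportion of images whose estimated label changes when the
    universal perturbation v is added (the metric reported in Table 1)."""
    fooled, total = 0, 0
    with torch.no_grad():
        for x, _ in loader:
            clean = model(x).argmax(dim=1)
            pert = model(x + v).argmax(dim=1)
            fooled += (clean != pert).sum().item()
            total += x.size(0)
    return fooled / total
```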

The universal perturbation of each model

Universal perturbation vectors computed for various models (note that these are not unique):
・ILSVRC data used
・Results for (p, ξ) = (∞, 10)
A different perturbation is obtained for each model.

Figure 4: Universal perturbations computed for different deep neural network architectures ((a) CaffeNet, (b) VGG-F, (c) VGG-16, (d) VGG-19, (e) GoogLeNet, (f) ResNet-152). Images generated with p = ∞, ξ = 10. The pixel values are scaled for visibility.

Ref: https://arxiv.org/abs/1610.08401

Non-uniqueness of universal perturbations

Five universal perturbations are computed for a single model, shuffling the training data before each run.
・ILSVRC data is used.
・The model is GoogLeNet.

The perturbations look similar, but the normalized inner product of any pair is at most 0.1, which shows that universal perturbations are not unique.

Figure 5: Diversity of universal perturbations for the GoogLeNet architecture. The five perturbations are generated using different random shufflings of the set X. Note that the normalized inner product for any pair of universal perturbations does not exceed 0.1, which highlights the diversity of such perturbations.

Ref: https://arxiv.org/abs/1610.08401 20/32
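The non-uniqueness check is simple to reproduce once the perturbations are in hand; the sketch below computes the matrix of normalized inner products (cosine similarities). The names `v1 … v5` are placeholders for five perturbations computed with different shufflings of X.

```python
import numpy as np

def pairwise_normalized_inner_products(perturbations):
    """Return the matrix of <v_i, v_j> / (||v_i|| ||v_j||) for all pairs."""
    V = np.stack([v.ravel() for v in perturbations])      # shape (k, d)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)      # unit-norm rows
    return V @ V.T                                        # cosine matrix

# Hypothetical usage with five computed perturbations:
# G = pairwise_normalized_inner_products([v1, v2, v3, v4, v5])
# off_diag = G[~np.eye(len(G), dtype=bool)]
# print(np.abs(off_diag).max())   # the paper reports values <= 0.1
```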

Dependence on the amount of training data

The fooling rate is computed for a single model while varying the size of the image set X used to compute the perturbation.
・ILSVRC data is used (X: varied across experiments, validation: 50,000 images).
・The model is GoogLeNet.

High fooling rates are achieved even with few images. This suggests that the decision boundaries in the vicinity of many data points share a similar geometry.

Figure 6: Fooling ratio on the validation set versus the size of X (500, 1000, 2000, and 4000 images). Note that even when the universal perturbation is computed on a very small set X (compared to the training and validation sets), the fooling ratio on the validation set is large.

Ref: https://arxiv.org/abs/1610.08401 21/32
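A sketch of the data-size sweep behind Fig. 6, reusing the `universal_perturbation` sketch above; `predict`, `deepfool`, `train_images`, and `val_images` are assumed placeholders, not actual names from the paper's code.

```python
import numpy as np

def fooling_ratio(val_images, predict, v):
    """Fraction of validation images whose predicted label changes under x + v."""
    flipped = [predict(x + v) != predict(x) for x in val_images]
    return float(np.mean(flipped))

# Vary |X| as in Fig. 6 and measure generalization to the validation set:
# for n in (500, 1000, 2000, 4000):
#     X = list(train_images[:n])
#     v = universal_perturbation(X, predict, deepfool, xi=10, p=np.inf)
#     print(n, fooling_ratio(val_images, predict, v))
```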

2. A single perturbation is applicable to different models

Table 2: Generalizability of the universal perturbations across different networks. The percentages indicate the fooling rates. The rows indicate the architecture for which the universal perturbation is computed, and the columns indicate the architecture for which the fooling rate is reported.

             VGG-F    CaffeNet  GoogLeNet  VGG-16   VGG-19   ResNet-152
VGG-F        93.7%    71.8%     48.4%      42.1%    42.1%    47.4%
CaffeNet     74.0%    93.3%     47.7%      39.9%    39.9%    48.0%
GoogLeNet    46.2%    43.8%     78.9%      39.2%    39.8%    45.5%
VGG-16       63.4%    55.8%     56.5%      78.3%    73.1%    63.4%
VGG-19       64.0%    57.2%     53.6%      73.5%    77.8%    58.0%
ResNet-152   46.3%    46.3%     50.5%      47.0%    45.5%    84.0%

Ref: https://arxiv.org/abs/1610.08401 22/32

Fine-tuning with universal perturbations. We now examine the effect of fine-tuning the networks with perturbed images. We use the VGG-F architecture, and fine-tune the network based on a modified training set where universal perturbations are added to a fraction of (clean) training samples: for each training point, a universal perturbation is added with probability 0.5, and the original sample is preserved with probability 0.5.(3) To account for the diversity of universal perturbations, we pre-compute a pool of 10 different universal perturbations and add perturbations to the training samples randomly from this pool. The network is fine-tuned by performing 5 extra epochs of training on the modified training set. To assess the effect of fine-tuning on the robustness of the network, we compute a new universal perturbation for the fine-tuned network (using Algorithm 1, with p = ∞ and ξ = 10), and report the fooling rate of the network. After 5 extra epochs, the fooling rate on the validation set is 76.2%, which shows an improvement with respect to the original network (93.7%, see Table 1).(4) Despite this improvement, the fine-tuned network remains largely vulnerable to small universal perturbations. We therefore repeated the above procedure (i.e., computation of a pool of 10 universal perturbations for the fine-tuned network, then fine-tuning of the new network based on the modified training set for 5 extra epochs), and obtained a new fooling ratio of 80.0%. In general, repeating this procedure a fixed number of times did not yield any improvement over the 76.2% fooling ratio obtained after one step of fine-tuning. Hence, while fine-tuning the network leads to a mild improvement in robustness, this simple solution does not fully immunize the network against small universal perturbations.

(3) In this fine-tuning experiment, we use a slightly modified notion of universal perturbations, where the direction of the universal vector v is fixed for all data points, while its magnitude is adaptive. That is, for each data point x, we consider the perturbed point x + αv, where α is the smallest coefficient that fools the classifier. We observed that this strategy is less prone to overfitting than the strategy where the perturbation is simply added to all training points.
(4) This fine-tuning procedure moreover led to a minor increase in the error rate on the validation set, which might be due to a slight overfitting of the perturbed data.
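Measuring cross-model transfer as in Table 2 amounts to computing a perturbation on one network and evaluating its fooling rate on another. A hedged sketch, where `models` is a hypothetical dict mapping architecture names to label-prediction callables and `make_perturbation` wraps the computation of Algorithm 1 for a given model:

```python
import numpy as np

def transfer_matrix(models, val_images, make_perturbation):
    """Fooling-rate matrix: row = model used to compute v, column = model attacked.

    models: dict of name -> predict(x) callables        (assumed helpers)
    make_perturbation(predict) -> universal perturbation for that model
    """
    names = list(models)
    T = np.zeros((len(names), len(names)))
    for i, src in enumerate(names):
        v = make_perturbation(models[src])          # computed on the source model
        for j, tgt in enumerate(names):
            flips = [models[tgt](x + v) != models[tgt](x) for x in val_images]
            T[i, j] = np.mean(flips)                # cross-model fooling rate
    return names, T
```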

Misclassification follows a pattern

To understand the pattern of misclassifications, we examine a graph structure.
・The model is GoogLeNet.
・ILSVRC data is used.
・A directed graph is built with an edge i → j whenever images of class i are misclassified as class j.

The misclassifications are observed to concentrate on a small number of specific nodes. These classes presumably occupy a large fraction of the image space (counter-intuitive as that may seem).

Visualization of the effect of universal perturbations. To gain insights on the effect of universal perturbations on natural images, we now visualize the distribution of labels on the ImageNet validation set. Specifically, we build a directed graph G = (V, E), whose vertices denote the labels, and whose directed edges e = (i → j) indicate that the majority of images of class i are fooled into label j when applying the universal perturbation. The existence of an edge i → j therefore suggests that the preferred fooling label for images of class i is j. We construct this graph for GoogLeNet, and visualize the full graph in the supp. material due to space constraints. The visualization of this graph shows a very peculiar topology. In particular, the graph is a union of disjoint components, where all edges in one component mostly connect to one target label. See Fig. 7 for an illustration of two connected components. This visualization clearly shows the existence of several dominant labels, and that universal perturbations mostly cause natural images to be classified with such labels. We hypothesize that these dominant labels occupy large regions in the image space, and therefore represent good candidate labels for fooling most natural images. Note that these dominant labels are found automatically by Algorithm 1, and are not imposed a priori in the computation of perturbations.

Figure 7: Two connected components of the graph G = (V, E), where the vertices are the set of labels, and directed edges i → j indicate that most images of class i are fooled into class j.

Ref: https://arxiv.org/abs/1610.08401 25/32

4. Explaining the vulnerability to universal perturbations

The goal of this section is to analyze and explain the high vulnerability of deep neural network classifiers to universal perturbations. To understand the unique characteristics of universal perturbations, we first compare such perturbations with other types of perturbations, namely i) a random perturbation, ii) an adversarial perturbation computed for a randomly picked sample (computed using the DF and FGS methods, respectively in [11] and [5]), iii) the sum of adversarial perturbations over X, and iv) the mean of the images (or ImageNet bias). For each perturbation, we depict a phase-transition graph in Fig. 8 showing the fooling rate on the validation set with respect to the ℓ2 norm of the perturbation. Different perturbation norms are achieved by scaling each perturbation with a multiplicative factor to reach the target norm. Note that the universal perturbation is computed for ξ = 2000, and is also scaled accordingly.

Observe that the proposed universal perturbation quickly reaches very high fooling rates, even when the perturbation is constrained to be of small norm. For example, the universal perturbation computed using Algorithm 1 achieves a fooling rate of 85% when the ℓ2 norm is constrained to ξ = 2000, while other perturbations achieve much smaller ratios for comparable norms. In particular, random vectors sampled uniformly from the sphere of radius 2000 only fool 10% of the validation set. The large difference between universal and random perturbations suggests that the universal perturbation exploits geometric correlations between different parts of the decision boundary of the classifier. In fact, if the orientations of the decision boundary in the neighborhood of different data points were completely uncorrelated (and independent of the distance to the decision boundary), the norm of the best universal perturbation would be comparable to that of a random perturbation. Note that the latter quantity is well understood (see [4]), as the norm of the random perturbation required to fool a specific data point behaves precisely as Θ(√d · ‖r‖₂), where d is the dimension of the input space and ‖r‖₂ is the distance between the data point and the decision boundary (or, equivalently, the norm of the smallest adversarial perturbation). For the considered ImageNet classification task, this quantity equals √d · ‖r‖₂ ≈ 2×10⁴ for most data points, which is at least one order of magnitude larger than the universal perturbation (ξ = 2000). This substantial difference between random and universal perturbations thereby suggests redundancies in the geometry of the decision boundaries, which we now explore.

For each image x in the validation set, we compute the adversarial perturbation vector r(x) = argmin_r ‖r‖₂ subject to k̂(x + r) ≠ k̂(x). It is easy to see that r(x) is normal to the decision boundary of the classifier (at x + r(x)). The vector r(x) hence captures the local geometry of the decision boundary in the region surrounding the data point x.
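The graph construction is straightforward to sketch with the standard library; `predict`, `val_images`, and the perturbation `v` are the same assumed placeholders as before. The dominant labels then emerge as the targets with the largest in-degree:

```python
from collections import Counter, defaultdict

def fooling_graph(val_images, predict, v):
    """Edge i -> j when the majority of fooled class-i images go to class j."""
    votes = defaultdict(Counter)
    for x in val_images:
        i, j = predict(x), predict(x + v)
        if i != j:
            votes[i][j] += 1             # count fooling targets per source class
    # keep only the majority fooling target for each source class
    return {i: c.most_common(1)[0][0] for i, c in votes.items()}

# Dominant labels = fooling targets with the largest in-degree:
# edges = fooling_graph(val_images, predict, v)
# print(Counter(edges.values()).most_common(5))
```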

Discussions about the decision boundaries 26/32

Properties of the decision boundary

A single universal perturbation was shown to misclassify many images.
→ The decision boundary near data points appears to be described by a low-dimensional subspace spanned by a small number of vectors.

To examine this, we study the properties of the normal vectors to the decision boundary at each data point: the matrix N is computed for CaffeNet, and its singular-value distribution is compared with that of vectors sampled uniformly at random from the unit sphere.

To quantify the correlation between different regions of the decision boundary of the classifier, we define the matrix

N = [ r(x₁)/‖r(x₁)‖₂ … r(xₙ)/‖r(xₙ)‖₂ ]

of normal vectors to the decision boundary in the vicinity of n data points in the validation set. For binary linear classifiers, the decision boundary is a hyperplane, and N is of rank 1, as all normal vectors are collinear. To capture more generally the correlations in the decision boundary of complex classifiers, we compute the singular values of the matrix N. The singular values of N, computed for the CaffeNet architecture, are shown in Fig. 9. We further show in the same figure the singular values obtained when the columns of N are sampled uniformly at random from the unit sphere. Observe that, while the latter singular values show a slow decay, the singular values of N decay quickly, which confirms the existence of large correlations and redundancies in the decision boundary of deep networks. More precisely, this suggests the existence of a subspace S of low dimension d′ (with d′ ≪ d) that contains most normal vectors to the decision boundary in regions surrounding natural images. We hypothesize that the existence of universal perturbations fooling most natural images is partly due to the existence of such a low-dimensional subspace that captures the correlations among different regions of the decision boundary. In fact, this subspace "collects" normals to the decision boundary in different regions, and perturbations belonging to this subspace are therefore likely to fool data points. To verify this hypothesis, we choose a random vector of norm ξ = 2000 belonging to the subspace S spanned by the first 100 singular vectors, and compute its fooling ratio on a different set of images (i.e., a set of images that have not been used to compute the SVD). Such a perturbation can fool nearly 38% of these images, thereby showing that a random direction in this well-sought subspace S significantly outperforms random perturbations (recall that such random perturbations can only fool 10% of the data). Fig. 10 illustrates the subspace S that captures the correlations in the decision boundary. It should further be noted that the existence of this low-dimensional subspace explains the surprising generalization properties of universal perturbations observed in Fig. 6, where one can build relatively generalizable universal perturbations from very few images.

Unlike the above experiment, the proposed algorithm does not choose a random vector in this subspace, but rather chooses a specific direction in order to maximize the overall fooling rate. This explains the gap between the fooling rates obtained with the random-vector strategy in S and Algorithm 1.

Figure 8: Comparison between fooling rates of different perturbations. Experiments performed on the CaffeNet architecture.
Figure 9: Singular values of the matrix N containing normal vectors to the decision boundary.
Figure 10: Illustration of the low-dimensional subspace S containing normal vectors to the decision boundary in regions surrounding natural images. For the purpose of this illustration, three data points {xᵢ} (i = 1, 2, 3) are super-imposed, and the adversarial perturbations {rᵢ} that send the respective data points to the decision boundary are shown. Note that all {rᵢ} live in the subspace S.

Ref: https://arxiv.org/abs/1610.08401 27/32
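A small numpy sketch of the singular-value comparison behind Fig. 9, assuming a hypothetical `min_adversarial` helper that returns the minimal ℓ2 adversarial perturbation r(x) for one image (e.g. via DeepFool):

```python
import numpy as np

def normal_matrix(images, min_adversarial):
    """Columns are unit normals r(x)/||r(x)||_2 to the decision boundary."""
    cols = []
    for x in images:
        r = min_adversarial(x).ravel()        # minimal l_2 perturbation at x
        cols.append(r / np.linalg.norm(r))
    return np.stack(cols, axis=1)             # shape (d, n)

# Compare the spectrum of N with random unit columns (as in Fig. 9):
# N = normal_matrix(val_images, min_adversarial)
# s_normal = np.linalg.svd(N, compute_uv=False)
# R = np.random.randn(*N.shape)
# R /= np.linalg.norm(R, axis=0, keepdims=True)   # random unit columns
# s_random = np.linalg.svd(R, compute_uv=False)
# A much faster decay of s_normal than s_random indicates a low-dimensional
# subspace containing most of the normals.
```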

Properties of the decision boundary (continued)

In the matrix N constructed from CaffeNet, a small number of singular values are observed to dominate.
→ The decision boundary near data points can indeed be described in a dimension far lower than that of the data.
→ The local geometry of the decision boundary near data points may share a common structure that does not depend on the individual point.

Whether a universal perturbation can actually be constructed is non-trivial, but its existence was later proven by closely examining the decision boundary near data points (https://arxiv.org/abs/1705.09554).

(Figure 9, reproduced: singular values of the matrix N, "Normal vectors," versus random unit columns, "Random," plotted against the index; the spectrum of N decays much faster.)

Ref: https://arxiv.org/abs/1610.08401 28/32

5. Conclusions

We showed the existence of small universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations, and highlighted several properties of such perturbations.
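The 38% experiment above can be sketched by sampling a random direction in the span of the top singular vectors of N and rescaling it to ξ = 2000; `fooling_ratio`, `predict`, and the image sets are assumptions carried over from the earlier sketches.

```python
import numpy as np

def random_vector_in_S(N, k=100, xi=2000.0, rng=None):
    """Random direction in the span of the first k left singular vectors of N,
    scaled to l_2 norm xi (the random-direction-in-S experiment)."""
    rng = np.random.default_rng() if rng is None else rng
    U, _, _ = np.linalg.svd(N, full_matrices=False)   # left singular vectors
    coeffs = rng.standard_normal(k)
    v = U[:, :k] @ coeffs                 # random combination of top-k normals
    return xi * v / np.linalg.norm(v)

# Evaluate on images not used to build N (hypothetical held-out set):
# v = random_vector_in_S(N, k=100, xi=2000.0)
# print(fooling_ratio(held_out_images, predict, v.reshape(image_shape)))
```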

Summary

• Strong perturbations can be created from little data.
• A perturbation can be constructed by summing the normal vectors from data points to the decision boundary.
• The normal vectors to the decision boundary point in a common direction across many data points and are strongly correlated (different regions of the decision boundary can be described in a low-dimensional subspace).

Ref: https://arxiv.org/abs/1610.08401 32/32