
Discussion and Survey of Adversarial Examples and Robustness in Deep Learning

Adversarial examples are test images that have been perturbed slightly to cause misclassification. As these adversarial examples are usually unproblematic for humans but easily fool deep neural networks, their discovery has sparked considerable interest in the deep learning and privacy/security communities. In this article, I want to provide a rough overview of the topic, including a brief survey of relevant literature and some ideas on future research directions.

The topic of robust deep learning has received considerable attention over the last few years.
The existence of adversarial examples was first reported in [43]: the term describes small perturbations of test samples, imperceptible to humans, that result in a mis-classification, see Figure 1.
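To make this concrete, a common formalization (the notation here is chosen for illustration and does not follow any single paper) searches for a small perturbation within an epsilon-ball, measured in some L_p norm, that changes the classifier's prediction:

```latex
% For a classifier f, an input x with label y, find a perturbation \delta
% within an \epsilon-ball (L_p norm) that changes the prediction:
\tilde{x} = x + \delta, \qquad \|\delta\|_p \le \epsilon, \qquad f(\tilde{x}) \neq y
```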

In this article, I want to provide a rough overview and survey of the literature concerned with constructing adversarial examples, defending against them, as well as the theory behind their existence. Additionally, I include some thoughts on future research directions. The article is, however, not intended to be complete; new papers on adversarial examples are uploaded to ArXiv every week, so it is impossible to keep this article perfectly up-to-date. However, this article will cover the most important papers until February 2018.

Figure 1: Illustration of an adversarial example taken from [43]. The original image is "attacked" using the adversarial perturbation shown in the middle; as a result, the classifier mis-classifies the example although the change is, to the human eye, not perceivable.

Detailed comments on individual papers can also be found in my Reading Notes.

Survey

Recent literature can roughly be divided into five lines of work: attacks, defenses, detection, theory and applications. Most published work considers either attacks or defenses/detection, and, currently, it looks like the literature on defending against or detecting adversarial examples is "losing the battle". In contrast, only a few papers consider the theoretical background of why and where adversarial examples exist, or are able to give guarantees on robustness. Finally, adversarial examples have been shown to exist for a variety of tasks, including object detection, semantic segmentation, image generation and reinforcement learning, to name just a few.

First, several attacks have been proposed [5, 6, 7, 8], mostly based on a white-box scenario where the attacker has access to the full model including gradients. However, black-box attacks have also been considered, e.g. using zeroth-order optimization [16]. Interesting attack scenarios include physical attacks, usually evaluated by printing adversarial examples [11, 12]. Some works also try to draw the line between randomness and adverseness [62] or craft adversarial geometric transformations [13], which essentially raises the question of where generalization starts and robustness ends. Overall, most white-box attacks are first-order gradient-based methods that try to directly maximize the classifier's loss in the vicinity of a test sample, or use some surrogate loss instead. For example, Madry et al. [6] use projected gradient descent (PGD), which they argue is the strongest first-order adversary.
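To make this concrete, here is a minimal sketch of a PGD-style L-infinity attack in PyTorch. The function name, the step sizes and the assumption that `model` returns logits for inputs in [0, 1] are my own illustrative choices, not taken from any specific paper:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=10):
    """Minimal PGD sketch: maximize the cross-entropy loss within an
    L-infinity ball of radius epsilon around x (inputs assumed in [0, 1])."""
    # random start inside the epsilon-ball, as suggested by Madry et al. [6]
    x_adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0.0, 1.0)

    for _ in range(steps):
        x_adv = x_adv.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # ascend the loss, then project back onto the epsilon-ball and valid range
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
    return x_adv.detach()
```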

Second, several defenses against the proposed attacks have been developed [22, 23, 24, 25, 4, 26, 27, 28, 29, 33, 34, 35]. While attacks are comparably easy to benchmark, even in the light of existing defense mechanisms, defenses are relatively hard to develop and study. This is the essence of any security problem: the adversary only needs to find a single adversarial example; the defender, in contrast, has to make sure that no such example can be found at all, a considerably harder task. As a result, several defense mechanisms have already been shown to be ineffective. Generally, defense mechanisms include adversarial training (i.e. training on adversarial examples) and its variants [26, 27, 28, 29], ensemble training [22, 23, 24, 25, 4], adaptations of the architecture (e.g. saturating networks [35], bounded ReLUs [26] etc.), pre-processing [34, 33] or distillation [3]. However, there have been studies questioning, in terms of experimental evaluation, the effectiveness of distillation [7], adversarial training [10, 4] and saturating networks [17]; for other defense mechanisms, independent studies of their effectiveness are missing.
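As an illustration of the adversarial training idea (not a faithful re-implementation of any of the cited variants), a training step might replace the clean batch by adversarial examples computed on the fly, e.g. using the PGD sketch from above:

```python
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=8/255):
    """One step of (pure) adversarial training: craft adversarial examples
    on the fly and update the model on them instead of the clean batch."""
    model.eval()                               # freeze batch-norm statistics while attacking
    x_adv = pgd_attack(model, x, y, epsilon=epsilon)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```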

Third, several mechanisms to efficiently detect and avoid adversarial examples have been published [51, 52, 53]. However, their effectiveness has also been questioned in the literature [52]. After all, these detection schemes are themselves machine learning systems that try to classify adversarial examples; as such, in a white-box scenario where the adversary also knows the detector, the idea of detecting and subsequently avoiding adversarial examples seems doomed from the start.
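To illustrate why this is difficult: an adversary that knows the detector can simply fold the detector's score into its objective. The combined loss below is a hypothetical sketch in the spirit of the attacks on detectors discussed in [52]; the assumption that `detector` returns one "adversarialness" logit per input is mine:

```python
import torch.nn.functional as F

def detector_aware_loss(model, detector, x_adv, y, kappa=1.0):
    """Objective for a white-box adversary that knows the detector: increase
    the classifier's loss while keeping the detector's score low.
    `detector` is assumed to return one logit per input (high = flagged)."""
    ce = F.cross_entropy(model(x_adv), y)
    det = detector(x_adv).mean()
    return ce - kappa * det   # ascend this instead of the plain cross-entropy
```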

Fourth, there has been some work on theoretical guarantees and bounds regarding robustness against adversarial examples. An important question was foreshadowed by Szegedy et al. [85] and explicitly posed by Goodfellow et al. [5]: why do adversarial examples exist? Goodfellow et al. provide the "linear explanation", arguing that dot products (as well as convolutions) are susceptible to adversarial noise because the summation operation amplifies small per-pixel noise in high dimensions. Tanay et al. [61], however, argue that the "linear explanation" is not convincing and provide an alternative "boundary tilting" perspective, also referred to as the "manifold assumption". Here, the data is assumed to lie on a sub-manifold; while the classifier might be robust on this manifold, i.e. for most samples it is hard to find a decision boundary within a small vicinity, robustness is lost when leaving the manifold. In a slightly different direction, upper bounds on robustness are given in [42, 60]; Hein et al. [42] present an argument based on local Lipschitz continuity, while Fawzi et al. [60] present an upper bound based on the generalization risk and the data distribution. In [62], Fawzi et al. further provide an interesting perspective on the transition between random noise and adversarial examples; and in [67], Bastani et al. provide theoretical metrics for measuring robustness.
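The "linear explanation" can be illustrated with a few lines of NumPy: for a linear model with weights w, the worst-case L-infinity perturbation of size epsilon shifts the dot product by epsilon times the L1 norm of w, which grows with the input dimension. The numbers and dimensions below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                               # per-pixel perturbation budget
for d in [10, 1_000, 100_000]:
    w = rng.normal(size=d)               # weights of a hypothetical linear model
    delta = eps * np.sign(w)             # worst-case L-infinity perturbation
    # the change of the dot product equals eps * ||w||_1 and grows with d
    print(d, float(w @ delta))
```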

Finally, several works [74, 75, 11, 76, 77, 78, 79] address applications. These include structured problems such as semantic segmentation and object detection as well as reinforcement learning or generative models. Further applications such as face recognition or tasks in natural language processing are discussed in surveys [71, 72]. Physical adversarial examples [11, 12] could generally also be viewed as interesting applications.

On a final note, there are some papers (experimentally) studying properties of adversarial examples and their relationship to properties of deep neural networks. An important property is transferability, which is also the basis of many black-box attacks. Similarly, an important question is whether network architecture or complexity influences transferability as well as robustness to adversarial examples. Some works [9, 10] try to study these problems in more detail, also focusing on the relationship between capacity and robustness/transferability. Similar to transferability, it has been shown that adversarial examples can be universal, i.e., image-agnostic [8].
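Transferability is usually measured by crafting adversarial examples on a source model and checking how often they also fool a different target model. A rough sketch, reusing the PGD function from above (the function and loader names are mine):

```python
def transfer_rate(source_model, target_model, loader, epsilon=8/255):
    """Fraction of adversarial examples crafted on source_model
    that also fool target_model."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = pgd_attack(source_model, x, y, epsilon=epsilon)
        fooled += (target_model(x_adv).argmax(dim=1) != y).sum().item()
        total += y.numel()
    return fooled / total
```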

Research Directions

Personally, I find that there are several aspects of adversarial examples and robustness that are not yet fully understood or explored. This includes a common understanding of relevant threat models (e.g., white-box, gray-box and black-box attacks, as well as attacks at training time), consistent evaluation methodologies (especially for defenses) as well as theoretical guarantees. In the following, I want to provide some thoughts on possible future research directions.

First, there is no clear notion of relevant threat models. Only some works [7, 34, 16, 52], especially from the security and privacy community, try to specify the knowledge and abilities of the adversary. In particular, [34] and [52] make the abilities of the adversary in white- and black-box scenarios explicit. In the computer vision and machine learning communities, in contrast, the threat model is usually defined implicitly through the attacks used in evaluation, commonly limited to first-order gradient-based adversaries. These attacks correspond to white-box adversaries: the adversary has access to the model, its parameters, outputs and gradients (at test time) and knows about employed defenses. Other notions of gray-box adversaries, adversaries operating at training time, or more powerful adversaries in terms of higher-order optimization techniques are rarely considered.
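One way to make a threat model explicit, purely as an illustration of the dimensions involved (this is not a standardized interface from any of the cited works), is to write it down as a small structure:

```python
from dataclasses import dataclass

@dataclass
class ThreatModel:
    """Illustrative description of an adversary; fields and defaults are my own choices."""
    knows_architecture: bool   # does the adversary know the network architecture?
    knows_parameters: bool     # ... the trained weights?
    has_gradients: bool        # ... can it compute/query gradients?
    knows_defense: bool        # ... is it aware of deployed defenses/detectors?
    attack_time: str           # "test" or "training" (poisoning)
    norm: str = "Linf"         # perturbation norm
    epsilon: float = 8 / 255   # perturbation budget

WHITE_BOX = ThreatModel(True, True, True, True, "test")
BLACK_BOX = ThreatModel(False, False, False, False, "test")
```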

Second, given relevant threat models, an agreed-upon benchmark specifying datasets (including splits), attacks and defenses as well as evaluation metrics is missing. For proposed defenses, for example, there is no common and clearly defined metric of "robustness". While [67] proposes several theoretical measures of robustness, their computation is cumbersome in practice. Additionally, evaluations of attacks against a wide variety of defense mechanisms, training protocols and network architectures are missing (although some hints are given in [9, 10]). The same holds for most defenses; often, it is unclear how effective these defenses are in practice, and some defenses have already been shown to be less effective than originally advocated [7, 10, 4, 17].
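In practice, the most common empirical surrogate for robustness is accuracy under attack. A minimal sketch of such an evaluation, with an attack interface matching the PGD sketch above, could look as follows:

```python
def robust_accuracy(model, loader, attack, **attack_kwargs):
    """Accuracy on attacked inputs, the usual empirical robustness metric;
    a stronger attack can only decrease this number, so it is optimistic."""
    correct, total = 0, 0
    for x, y in loader:
        x_adv = attack(model, x, y, **attack_kwargs)
        correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total
```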

Third, there is a significant gap between theoretical insights regarding robustness and the attacks/defenses studied in practice. Attacks are generally based on iterative gradient ascent and can, as such, be studied under the umbrella of optimization. Similarly, adversarial training is commonly studied in terms of robust optimization, although other interpretations [29, 27, 28] have also been proposed. Recently, some papers also try to provide robustness guarantees (e.g., [42, 30, 59]) or trade-offs (e.g., a robustness-generalization trade-off [63, 66]). The experimental and theoretical settings are, however, usually constrained to simple models and datasets, and the conclusions can be contradictory [63, 66].
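For reference, the robust optimization view mentioned above is usually written as a min-max (saddle-point) problem; the generic notation below is mine:

```latex
% Saddle-point formulation of adversarial training / robust optimization:
% the outer minimization fits the parameters \theta, the inner maximization
% searches for the worst-case perturbation within the \epsilon-ball.
\min_{\theta} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_\infty \le \epsilon}
         \mathcal{L}\big(f_\theta(x + \delta), y\big) \right]
```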

Finally, I want to conclude with some problems that I could not find addressed in the literature. Examples include perfect-knowledge adversaries with access to the training procedure including the training data (partly known as poisoning attacks [20]); problems of authenticated or verifiable use of deep learning, including the generation of "proofs" for predictions or for rejecting inputs; and model theft or whitening/re-engineering (e.g., [21]).

About the Author

In September, I was honored to receive the MINT-Award IT 2018, sponsored by ZF and audimax, for my master thesis on weakly-supervised shape completion. For CVPR 2019, however, I am working on a different topic: adversarial robustness and generalization of deep neural networks.
18th October 2018, David Stutz
