READING

Liu et al. provide a comprehensive study on the transferability of adversarial examples, considering different attacks and models on ImageNet. In their experiments, they consider both targeted and non-targeted attacks and also provide a real-world example by attacking clarifai.com. Here, I want to list some interesting conclusions drawn from their experiments:

Non-targeted attacks transfer easily between models; targeted attacks, in contrast, generally do not transfer, meaning that an adversarial example crafted for one model is usually not assigned the target label by another model.

The level of transferability also seems to rely heavily on the hyperparameters of the trained models. In the experiments, the authors observed this on different ResNet models, which share the same architectural building blocks but differ in depth.

Considering different models, it turns out that the gradient directions (i.e. the adversarial directions used in many gradient-based attacks) are mostly orthogonal – this means that different models have different vulnerabilities. However, the observed transferability suggests that this only holds for the “steepest” adversarial direction; the gradient direction of one model is, thus, still useful to craft adversarial examples for another model.
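
To make the orthogonality claim concrete, here is a minimal sketch of how the angle between the gradient directions of two models can be measured. This is not the authors' experiment: the two "models" are hypothetical logistic classifiers with random weights (`w_a`, `w_b`), and the helper names `loss_gradient` and `cosine_similarity` are my own. The point is only that a cosine similarity close to zero means the two input gradients are nearly orthogonal.

```python
import numpy as np


def loss_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * <w, x>)) w.r.t. the input x."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
dim = 10_000  # stand-in for the (high) input dimensionality of an image

x = rng.normal(size=dim)  # toy input
y = 1.0                   # toy label in {-1, +1}

# Two independently drawn toy models (assumption: random linear classifiers).
w_a = rng.normal(size=dim) / np.sqrt(dim)
w_b = rng.normal(size=dim) / np.sqrt(dim)

g_a = loss_gradient(w_a, x, y)
g_b = loss_gradient(w_b, x, y)

# A value near 0 indicates nearly orthogonal gradient directions.
cos_ab = cosine_similarity(g_a, g_b)
print(f"cosine similarity between gradients: {cos_ab:.4f}")
```

For real networks, `loss_gradient` would be replaced by backpropagating the loss to the input; the comparison itself stays the same.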

The authors also provide an interesting visualization of the local decision landscape around individual examples. As illustrated in Figure 1, the region where the chosen image is classified correctly is often limited to a small central area. Of course, I believe that these examples are hand-picked to some extent, but they show the worst-case scenario relevant for defense mechanisms.

Figure 1: Decision boundaries, with different classes shown in different colors. The axes are in units of one pixel value; the images are computed as $x' = x + \delta_1 u + \delta_2 v$, where $u$ is the gradient direction and $v$ a random direction.
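
The construction behind such a plot can be sketched as follows. Again, this is only an illustration under my own assumptions: the classifier is a hypothetical random linear softmax model `W` instead of an ImageNet network, the function names are made up, and I orthogonalize the random direction $v$ against $u$ so that the two axes are independent. Each grid cell stores the label predicted for $x' = x + \delta_1 u + \delta_2 v$.

```python
import numpy as np


def softmax_cross_entropy_grad(W, x, label):
    """Gradient of the cross-entropy loss of a linear softmax classifier w.r.t. x."""
    logits = W @ x
    logits = logits - logits.max()  # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    probs[label] -= 1.0  # d(loss)/d(logits) = probs - one_hot(label)
    return W.T @ probs


def decision_grid(W, x, u, v, deltas):
    """Predicted label at x + d1 * u + d2 * v for every pair (d1, d2) in deltas."""
    grid = np.empty((len(deltas), len(deltas)), dtype=int)
    for i, d1 in enumerate(deltas):
        for j, d2 in enumerate(deltas):
            grid[i, j] = int(np.argmax(W @ (x + d1 * u + d2 * v)))
    return grid


rng = np.random.default_rng(0)
num_classes, dim = 5, 64

# Hypothetical toy classifier and input (assumption: random linear model).
W = rng.normal(size=(num_classes, dim)) / np.sqrt(dim)
x = rng.normal(size=dim)
label = int(np.argmax(W @ x))  # treat the clean prediction as the label

# u: normalized gradient direction of the loss at x.
u = softmax_cross_entropy_grad(W, x, label)
u /= np.linalg.norm(u)

# v: random direction, made orthogonal to u (my choice for this sketch).
v = rng.normal(size=dim)
v -= np.dot(v, u) * u
v /= np.linalg.norm(v)

deltas = np.linspace(-10.0, 10.0, 21)  # the center index 10 corresponds to delta = 0
grid = decision_grid(W, x, u, v, deltas)
print(grid)
```

Coloring `grid` by label (for example with matplotlib's `imshow`) gives a picture of the same kind as Figure 1; the center cell corresponds to the unperturbed input.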