The detection of adversarial samples remains an open problem. Interestingly, the universal approximation theorem formulated by Hornik et al. states that a single hidden layer is sufficient to approximate a function to arbitrary accuracy [21]. Since the architecture is in principle expressive enough, one can intuitively expect that improving the training phase, rather than the network itself, is key to resisting adversarial samples.

High regularization gives smoother templates, but at some point it starts to work worse. However, it is more resistant to fooling (the fooling images look noticeably different from their originals).
Low regularization gives noisier templates, but these seem to work better than the all-smooth templates. They are less resistant to fooling.
Intuitively, higher regularization leads to smaller weights, which means one must change the image more dramatically to change the score by a given amount. It is not immediately obvious if and how this conclusion translates to deeper models.
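This last point has a simple closed form for a linear classifier. A minimal NumPy sketch (the toy dimensions and weight scales are illustrative assumptions, standing in for regularized templates): the smallest L2 image change that moves a linear score w·x by a fixed amount has norm proportional to 1/||w||, so shrinking the weights by 10x makes the attacker's perturbation 10x larger.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(3072)                       # a flattened 32x32x3 "image" in [0, 1]

for reg, scale in [("low", 1.0), ("high", 0.1)]:   # toy templates: stronger reg -> smaller weights
    w = scale * rng.standard_normal(3072)
    delta_score = 10.0                     # how much we want to move the class score
    # For a linear score s = w.x, the smallest L2 change to x that moves s by
    # delta_score is (delta_score / ||w||^2) * w, whose norm is delta_score / ||w||.
    perturbation = (delta_score / np.dot(w, w)) * w
    assert np.isclose(w @ (x + perturbation) - w @ x, delta_score)
    print(f"{reg}-regularization: ||w|| = {np.linalg.norm(w):6.2f}, "
          f"needed ||perturbation|| = {np.linalg.norm(perturbation):5.2f}")
```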

One might hope that ConvNets would produce all-diffuse probabilities in regions outside the training data, but nothing in an ordinary objective (e.g. mean cross-entropy loss) explicitly enforces this constraint. Indeed, the class scores in these regions of space appear to be all over the place, and worse, a straightforward attempt to patch this up by introducing a background class and iteratively adding fooling images to it during training is not effective in mitigating the problem.
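To make the attempted patch concrete, here is a toy sketch of that loop, with uniform noise standing in for optimized fooling images and scikit-learn's MLP standing in for a ConvNet; the dataset, confidence threshold, and round count are illustrative assumptions. The point above is precisely that this kind of loop does not fix the underlying problem.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X = X / 16.0                              # scale pixels to [0, 1]
background = 10                           # label for the new "background" class
rng = np.random.default_rng(0)

for round_ in range(3):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                        random_state=0).fit(X, y)
    noise = rng.random((2000, X.shape[1]))        # rubbish inputs far off the data manifold
    conf = clf.predict_proba(noise).max(axis=1)
    fooled = noise[conf > 0.9]                    # noise the net is (over)confident about
    print(f"round {round_}: {len(fooled)} confident rubbish inputs")
    if len(fooled) == 0:
        break
    X = np.vstack([X, fooled])                    # fold them back in as the background class
    y = np.concatenate([y, np.full(len(fooled), background)])
```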

It seems that to fix this problem we need to change our objectives, our forward functional forms, or even the way we optimize our models. However, as far as I know, we have not yet found very good candidates for any of these.

Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples

We demonstrated how all of these findings could be used to target online classifiers trained and hosted by Amazon and Google, without any knowledge of the model design or parameters, but instead simply by making label queries for 800 inputs. The attack successfully forces these classifiers to misclassify 96.19% and 88.94% of their inputs.

In 1961, J.R. Lucas published “Minds, Machines and Gödel,” in which he formulated a controversial anti-mechanism argument. The argument claims that Gödel’s first incompleteness theorem shows that the human mind is not a Turing machine, that is, a computer.

Proposition 2. For a deep learning hierarchy to avoid the brittleness and random-images pathologies (on a corpus generated from an image grammar, or on a corpus of natural images), there would need to be a reasonably straightforward mapping from recognizable activity patterns on the different layers to elements of a reasonably simple image grammar, so that by looking at the activity patterns on each layer when the network was exposed to a certain image, one could read out the "image grammar decomposition" of the elements of the image. For instance, if one applied the deep learning network to a corpus of images generated from a commonsensical image grammar, then the deep learning system would need to learn an internal state in reaction to an image from which the image-grammar decomposition of the image was easily decipherable.

This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from a cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.

We showed the existence of small universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations, and highlighted several properties of such perturbations. In particular, we showed that universal perturbations generalize well across different classification models, resulting in doubly-universal perturbations (image-agnostic, network-agnostic). We further explained the existence of such perturbations with the correlation between different regions of the decision boundary. This provides insights on the geometry of the decision boundaries of deep neural networks, and contributes to a better understanding of such systems.
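For context, here is a minimal sketch of the iterative scheme on a toy binary linear classifier, where the smallest boundary-crossing step has a closed form; the cited algorithm instead uses a DeepFool-style step on a deep network and projects onto an L_p ball. The data, radius, overshoot factor, and number of passes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.standard_normal(20), 0.1            # toy linear decision boundary
X = rng.standard_normal((500, 20))             # toy "images"
labels = np.sign(X @ w + b)                    # the classifier's own predictions

def project_l2(v, xi):                         # keep the universal perturbation within radius xi
    n = np.linalg.norm(v)
    return v if n <= xi else v * (xi / n)

v, xi = np.zeros(20), 2.0
for sweep in range(5):                         # a few passes over the data
    for x, y in zip(X, labels):
        if np.sign((x + v) @ w + b) == y:      # x + v is not yet fooled
            # smallest L2 step sending x + v just across the linear boundary
            r = -1.02 * (((x + v) @ w + b) / (w @ w)) * w
            v = project_l2(v + r, xi)
    fooling_rate = np.mean(np.sign((X + v) @ w + b) != labels)
    print(f"sweep {sweep}: fooling rate of the single perturbation v = {fooling_rate:.2%}")
```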

Many machine learning models are vulnerable to adversarial examples: inputs that are specially crafted to cause a machine learning model to produce an incorrect output. Adversarial examples that affect one model often affect another model, even if the two models have different architectures or were trained on different training sets, so long as both models were trained to perform the same task. An attacker may therefore train their own substitute model, craft adversarial examples against the substitute, and transfer them to a victim model, with very little information about the victim. Recent work has further developed a technique that uses the victim model as an oracle to label a synthetic training set for the substitute, so the attacker need not even collect a training set to mount the attack. We extend these recent techniques using reservoir sampling to greatly enhance the efficiency of the training procedure for the substitute model. We introduce new transferability attacks between previously unexplored (substitute, victim) pairs of machine learning model classes, most notably SVMs and decision trees. We demonstrate our attacks on two commercial machine learning classification systems from Amazon (96.19% misclassification rate) and Google (88.94%) using only 800 queries of the victim model, thereby showing that existing machine learning approaches are in general vulnerable to systematic black-box attacks regardless of their structure.
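The reservoir sampling mentioned in the abstract is a standard streaming primitive (Vitter's Algorithm R). A minimal sketch of how it could cap the number of synthetic points kept, and hence the number of oracle queries; the stream, reservoir size, and the way it plugs into the substitute-training loop are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # item replaces a slot with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. keep only 800 of the synthetic points produced by data augmentation,
# so the victim "oracle" is queried for labels on just those inputs
synthetic_points = range(100_000)         # stands in for generated synthetic inputs
to_label = reservoir_sample(synthetic_points, 800)
print(len(to_label), "inputs to query the oracle on")
```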

Empirically we find that the Siamese architecture can intuitively help DNN models approach topological equivalence between the two feature spaces, which in turn effectively improves their robustness against AN.

In summary, we have shown that a simple, biologically inspired strategy for finding highly nonlinear networks operating in a saturated regime provides interesting mechanisms for guarding DNNs against adversarial examples without ever computing them. Not only do we gain improved performance over adversarially trained networks on adversarial examples generated by the fast gradient sign method, but our saturating networks are also relatively robust against iterative, targeted methods, including second-order adversaries.
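As a rough illustration of what "operating in a saturated regime" can mean as a training-time penalty, here is a hedged NumPy sketch that scores how far sigmoid units sit from the flat parts of the nonlinearity. This particular penalty is an illustrative stand-in, not the exact regularizer of the cited work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saturation_penalty(pre_activations):
    # sigma'(z) = sigma(z) * (1 - sigma(z)) peaks in the linear regime (z near 0)
    # and vanishes when a unit saturates, so its sum measures "non-saturation".
    s = sigmoid(pre_activations)
    return np.sum(s * (1.0 - s))

rng = np.random.default_rng(0)
z_linear_regime = 0.1 * rng.standard_normal(1000)   # units near the linear regime
z_saturated = 10.0 * rng.standard_normal(1000)      # units pushed into saturation
print(saturation_penalty(z_linear_regime))          # large penalty
print(saturation_penalty(z_saturated))              # near zero
# training would then minimize  task_loss + lambda * saturation_penalty(z)  per layer
```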

In this paper, we show experiments that suggest that current constructions of physical adversarial examples do not disrupt object detection from a moving platform. Instead, a trained neural network correctly classifies most pictures of a perturbed image taken from different distances and angles. We believe this is because the adversarial property of the perturbation is sensitive to the scale at which the perturbed picture is viewed, so (for example) an autonomous car will misclassify a stop sign only from a small range of distances.
Our work raises an important question: can one construct examples that are adversarial for many or most viewing conditions? If so, the construction should offer very significant insights into the internal representation of patterns by deep networks. If not, there is a good prospect that adversarial examples can be reduced to a curiosity with little practical impact.
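A toy NumPy sketch of the scale-sensitivity intuition: a high-frequency, sign-based perturbation crafted against a linear score loses much of its effect once the picture is "viewed" at a coarser scale, simulated here by 2x2 average pooling and upsampling. The image size, epsilon, and pooling are illustrative assumptions, not a model of a real camera or detector.

```python
import numpy as np

rng = np.random.default_rng(0)
side = 32
w = rng.standard_normal((side, side))             # toy linear "classifier" weights
delta = 0.05 * np.sign(w)                          # fast-gradient-sign-style perturbation

def view_at_coarser_scale(img):
    # 2x2 average pooling, then nearest-neighbour upsampling back to full size
    pooled = img.reshape(side // 2, 2, side // 2, 2).mean(axis=(1, 3))
    return np.kron(pooled, np.ones((2, 2)))

print("score shift at full resolution:", np.sum(w * delta))
print("score shift at coarser view:   ", np.sum(w * view_at_coarser_scale(delta)))
```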

We propose to utilize the embedding space for both classification and low-level (pixel-level) similarity learning, so as to ignore unknown pixel-level perturbations.

We proposed adversarial training regularized with a unified embedding for classification and low-level similarity learning, by penalizing the distance between clean embeddings and their corresponding adversarial embeddings. The networks trained with low-level similarity learning showed higher robustness against one-step and iterative attacks in the white-box setting.
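A minimal NumPy sketch of the kind of objective described here: cross-entropy on clean and adversarial batches plus a penalty on the distance between their embeddings. The shapes, the weighting lambda, and the exact loss form are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def similarity_regularized_loss(logits_clean, logits_adv,
                                emb_clean, emb_adv, labels, lam=0.1):
    ce = cross_entropy(logits_clean, labels) + cross_entropy(logits_adv, labels)
    # pull the clean and adversarial embeddings of the same image together
    sim = np.mean(np.sum((emb_clean - emb_adv) ** 2, axis=1))
    return ce + lam * sim

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=32)
loss = similarity_regularized_loss(rng.standard_normal((32, 10)),
                                   rng.standard_normal((32, 10)),
                                   rng.standard_normal((32, 128)),
                                   rng.standard_normal((32, 128)),
                                   labels)
print(f"combined loss: {loss:.3f}")
```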

Generating adversarial examples is a critical step for evaluating and improving the robustness of learning machines. So far, most existing methods only work for classification and are not designed to alter the true performance measure of the problem at hand. We introduce a novel flexible approach named Houdini for generating adversarial examples specifically tailored for the final performance measure of the task considered, even when it is combinatorial and non-decomposable. We successfully apply Houdini to a range of applications such as speech recognition, pose estimation and semantic segmentation. In all cases, the attacks based on Houdini achieve a higher success rate than those based on the traditional surrogates used to train the models, while using a less perceptible adversarial perturbation.

In this paper, we show that these physical adversarial stop signs do not fool two standard detectors (YOLO and Faster RCNN) in their standard configurations. Evtimov et al.'s construction relies on cropping the image to the stop sign; this crop is then resized and presented to a classifier. We argue that the cropping and resizing procedure largely eliminates the effects of rescaling and of view angle.

We investigate the robustness to adversarial attacks of new Convolutional Neural Network architectures providing equivariance to rotations. We found that rotation-equivariant networks are significantly less vulnerable to geometry-based attacks than regular networks on the MNIST, CIFAR-10, and ImageNet datasets.