Abstract

Deep learning has recently become the state of the art in many computer vision applications, and in image classification in particular.
It is now a mature technology that can be deployed in many real-world tasks.
However, it is possible to craft adversarial examples, containing perturbations imperceptible to humans, that cause a deep convolutional neural network to misclassify.
This represents a serious threat to machine learning methods.
In this paper we investigate the robustness of the representations learned by the fooled neural network.
Specifically, we use a kNN classifier over the activations of the hidden layers of the convolutional neural network to define a strategy for distinguishing correctly classified authentic images from adversarial examples.
The results show that hidden-layer activations can be used to detect incorrect classifications caused by adversarial attacks.
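As a minimal sketch of this setup, assuming PyTorch/torchvision and scikit-learn (the ResNet-50 architecture, the hooked avgpool layer, and the dummy data are illustrative assumptions, not the paper's exact configuration), hidden-layer activations can be captured with a forward hook and indexed by a kNN classifier fitted on authentic images:

```python
# Illustrative sketch only: model, layer choice, and data are stand-ins.
import torch
import torchvision.models as models
from sklearn.neighbors import KNeighborsClassifier

model = models.resnet50(weights="IMAGENET1K_V1").eval()

activations = {}
def save_activation(_module, _inputs, output):
    # Keep the flattened activation of the hooked hidden layer.
    activations["feat"] = torch.flatten(output, 1).detach()

model.avgpool.register_forward_hook(save_activation)  # capture a hidden layer

def extract_features(images):
    """Return hidden-layer activations for a batch of preprocessed images."""
    with torch.no_grad():
        model(images)
    return activations["feat"].cpu().numpy()

# Stand-ins for authentic, correctly classified training images and labels.
train_images = torch.randn(16, 3, 224, 224)
train_labels = torch.randint(0, 1000, (16,)).numpy()

# Index the authentic activations with a kNN classifier.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(extract_features(train_images), train_labels)
```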

A low kNN score indicates that the adversarial example is correctly detected, while a high score means that our approach is wrongly confident in the CNN's prediction. The results show that high-scoring adversarial examples often share visual and semantic traits with the predicted (adversarial) class, which makes them more challenging to detect.
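Continuing the sketch above, and assuming the kNN score is the fraction of the k nearest authentic neighbors whose label agrees with the CNN's prediction (the 0.5 threshold and the variable names are purely illustrative, not the paper's definition), the detection step could look like:

```python
import numpy as np

def knn_agreement_score(knn, feats, cnn_preds):
    """Per-image fraction of nearest authentic neighbors voting for the CNN's predicted class."""
    proba = knn.predict_proba(feats)  # neighbor-vote fractions, one column per known class
    scores = np.zeros(len(feats))
    for i, label in enumerate(cnn_preds):
        matches = np.flatnonzero(knn.classes_ == label)
        if matches.size:              # 0.0 if no authentic image carries this label
            scores[i] = proba[i, matches[0]]
    return scores

test_images = torch.randn(4, 3, 224, 224)            # stand-in for images to screen
with torch.no_grad():
    cnn_preds = model(test_images).argmax(dim=1).numpy()  # possibly fooled predictions
scores = knn_agreement_score(knn, extract_features(test_images), cnn_preds)
suspected_adversarial = scores < 0.5                 # low agreement -> flag the prediction
```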

[Interactive demo table: for each adversarial example the page reports the adversarial image, the generation algorithm, the actual class, the fooled class, its nearest neighbor, and the kNN score.]

This work was partially supported by the Smart News project (Social sensing for breaking news), co-funded by the Tuscany Region under the FAR-FAS 2014 program, CUP CIPE D58C15000270008.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.