Notes on “Intriguing properties of neural networks” paper

I've recently encountered a great paper describing interesting properties of neural networks. It tries to go beyond the "black box" view of ANNs and shows that individual neurons and layers can carry meaning of their own, often comprehensible even to a human. Briefly, each layer produces a space (with its neurons as a basis) in which vectors carry semantic information.

There is one more idea described in the paper. A neural network (especially a deep one), viewed as a function on the input space, is not always smooth. By smoothness here I don't mean the existence of derivatives, but rather the fact that inputs in the vicinity of training-set samples can be assigned unexpected classification labels. The authors describe a way to obtain pairs of images that are visually almost indistinguishable yet are classified by the network into different classes (e.g. a bus classified as an ostrich). By applying this procedure it is possible to modify a dataset so that the error rate goes up from, say, 5-10% to 100%! Moreover, this "corrupted" dataset also leads to high error rates for networks that were trained on different samples of the data or have a different architecture than the network used to modify the images.
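To make the idea concrete, here is a minimal sketch of a gradient-based perturbation on a toy linear softmax model. The model, weights, step size, and iteration count are all made up for illustration; the paper itself uses a box-constrained L-BFGS optimization on real deep networks, not this simple gradient step.

```python
import numpy as np

# Toy stand-in for a classifier: a linear softmax model with made-up weights.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))          # 3 classes, 8 input features
x = rng.normal(size=8)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(v):
    return int(np.argmax(W @ v))

orig = predict(x)
target = (orig + 1) % 3              # pick any class other than the original

# For this linear model, the gradient of the cross-entropy loss toward
# `target` w.r.t. the input is W.T @ (softmax(W x) - onehot(target));
# stepping against it nudges x toward being classified as `target`.
adv = x.copy()
for _ in range(500):
    p = softmax(W @ adv)
    p[target] -= 1.0
    adv -= 0.05 * (W.T @ p)
    if predict(adv) == target:
        break

print(orig, "->", predict(adv), "perturbation norm:", np.linalg.norm(adv - x))
```

On this toy model a handful of small steps is enough to flip the prediction; the striking part of the paper's result is that on real image networks the required perturbation is so small as to be visually imperceptible.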

Two improvements are proposed for training routines:

train an ANN, generate modified images, add them to the dataset, and train the ANN again on the updated data;

add a regularization term to the loss function so that instabilities in the network output are penalized.
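The first proposal can be sketched on the same kind of toy linear model. Everything here is a stand-in of my own: synthetic data, a single sign-of-gradient perturbation step instead of the paper's L-BFGS procedure, and a plain full-batch training loop.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
true_W = rng.normal(size=(3, 8))
y = (X @ true_W.T).argmax(axis=1)     # labels from a synthetic linear rule

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def train(X, y, steps=1000, lr=0.1):
    W = np.zeros((3, 8))
    for _ in range(steps):
        P = softmax(X @ W.T)
        P[np.arange(len(y)), y] -= 1.0
        W -= lr * (P.T @ X) / len(y)  # gradient of mean cross-entropy
    return W

def adversarial(W, x, y_true, eps=0.5):
    # One sign-of-gradient step that increases the loss on the true label
    # (a crude substitute for the paper's optimization-based procedure).
    p = softmax((W @ x)[None, :])[0]
    p[y_true] -= 1.0
    return x + eps * np.sign(W.T @ p)

W1 = train(X, y)
X_adv = np.array([adversarial(W1, x, t) for x, t in zip(X, y)])
# Step 2: retrain on the union of clean and adversarial examples,
# keeping the original (correct) labels for the perturbed inputs.
W2 = train(np.vstack([X, X_adv]), np.concatenate([y, y]))
```

The retrained model W2 typically recovers much of the accuracy that W1 loses on the perturbed inputs, which is the point of augmenting the dataset this way.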

After a first reading it wasn't clear to me what form the regularization term should take. Yet expressions for upper bounds on the instability are provided, so it shouldn't be hard to come up with some solution.

As a quick and dirty term I'd give gradient regularization a try, i.e. penalizing large values of the gradient with respect to the input. Yet this could possibly slow down learning, because computing second derivatives (the Hessian) becomes necessary.
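A sketch of that gradient-penalty idea on the same toy linear model: the loss gets an extra term lam * ||d(loss)/d(input)||^2. To sidestep the second-derivative machinery, the weight gradient is taken here by finite differences; in a real network this step would require double backpropagation (Hessian-vector products), which is exactly the cost I worry about. The model, lam, and step size are illustrative choices, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=8)   # a single toy input
y = 1                    # its true class
lam = 0.1                # penalty weight (arbitrary choice)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def penalized_loss(W):
    p = softmax(W @ x)
    ce = -np.log(p[y])
    g = W.T @ (p - np.eye(3)[y])      # d(ce)/dx for this linear model
    return ce + lam * np.dot(g, g)    # cross-entropy + gradient penalty

def num_grad(f, W, h=1e-6):
    # Finite-difference gradient w.r.t. the weights; a real implementation
    # would compute this analytically via double backpropagation.
    G = np.zeros_like(W)
    for i in range(W.size):
        d = np.zeros(W.size)
        d[i] = h
        d = d.reshape(W.shape)
        G.flat[i] = (f(W + d) - f(W - d)) / (2 * h)
    return G

W = rng.normal(size=(3, 8))
W_new = W - 0.01 * num_grad(penalized_loss, W)   # one penalized descent step
```

Each descent step now trades off fitting the label against flattening the model around the input, which is the intended compensation for output instability.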