Playing with adversarial attacks on Machines Can See 2018 competition

Or how I ended up on a team that won the Machines Can See 2018 adversarial competition

TL;DR

I happened to participate in the Machines Can See 2018 adversarial competition. I joined very late (about one week before the end), but I ended up on a team of 4 people, where the contributions of 3 of us (including myself) proved essential to the victory (remove any one of them, and we would have been outsiders).

The goal of the competition was to modify people's faces (subject to a 0.95 SSIM limit) so that the black-box CNN could not tell a source person from a target person.

The idea of the competition in a nutshell - modify a face so that a black-box cannot tell 2 faces apart (at least in terms of L2 / Euclidean distance)

What we tried:

Adding momentum to FGVM (it worked for a team that ranked lower, so maybe just ensembling + heuristics worked better than momentum?);

C&W attack (essentially an end-to-end attack focusing on the logits of the white-box model) - it worked for the WhiteBox (WB) but did not work for the BlackBox (BB);

An end-to-end approach based on a Siamese LinkNet (an architecture similar to UNet, but based on ResNet). It also worked on the WB but did not work on the BB;
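For reference, here is a minimal sketch of what a C&W-style attack on descriptors could look like in PyTorch. This is not our actual implementation; `white_box` (a net mapping image batches to 512-d descriptors), the loss weights, and the step counts are all illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cw_descriptor_attack(white_box, src, target_desc, steps=100, c=1.0, lr=0.05):
    """C&W-style attack: optimize a perturbation in tanh space so that the
    white-box descriptor of the perturbed image moves towards the target
    descriptor, while keeping the image close to the original."""
    # optimize in tanh space so the adversarial image stays in [0, 1]
    w = torch.atanh((src * 2 - 1).clamp(-0.999, 0.999)).detach().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        adv = (torch.tanh(w) + 1) / 2
        desc = white_box(adv)
        # pull descriptors together; penalize visible changes to the image
        loss = F.mse_loss(desc, target_desc) + c * F.mse_loss(adv, src)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ((torch.tanh(w) + 1) / 2).detach()
```

The tanh reparameterization is the standard C&W trick for keeping pixel values in a valid range without clipping inside the optimization loop.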

What we did not try (lacking time, effort or willpower):

Testing augmentations properly for student learning (we would have to modify the descriptors as well);

Doing augmentations when attacking;

About this competition in general:

It featured a small-ish dataset of 1000 combinations of 5 source + 5 target images;

The dataset for training the Student net was relatively big - 1M+ images;

The BB was provided as a number of pre-compiled Caffe models (as it goes with such stuff, of course they did not work with reasonably recent software versions - but this was resolved by the hosts in the end). It was a bit of a pain - because this BB did not accept images in batches;

The competition featured a stellar baseline (to be honest, I believe that without it there would be no people on the LB);

1. Machines Can See 2018 competition overview and how I ended up there

Competition and approaches

To be honest, I was lured by the interesting domain, by what I assumed to be a Founders Edition GTX 1080 Ti among the prizes, and by the relatively "low" competition level (it's nowhere near competing against 4000 people on Kaggle plus the whole ODS team).

As mentioned above, the goal of the competition was to fool the BB model into failing to tell the images of 2 different people apart (in terms of L2 norm or Euclidean distance). The competition was a "black-box" competition, so we had to distill our Student networks on the provided data and hope that the gradients of the BB and WB would be similar enough to perform the attack.

Essentially, if you read the scholarly literature (this and this, for example, even though such papers do not really tell you what works in REAL life) and distill what the top teams achieved, you can easily spot the following patterns:

The easiest-to-implement attacks (in modern frameworks) involve white-boxes, or knowing the internal structure of the CNN (or just its architecture) you are attacking;

Someone in the chat suggested ... timing the inference time of the BB and hence deducing its architecture...lol;

Given access to enough data you can emulate a BB with a properly trained WB;

To be honest, we were baffled: 2 different people on our team implemented completely different end-to-end solutions (without knowing about each other's work, i.e. separately), and neither worked on the BB. This essentially may mean that there was some hidden flaw in our setup that we did not notice. As with many modern CV applications, going fully end-to-end may give you stellar results (like with style transfer, deep watershed, image generation, noise and artifact reduction, etc.) or just not work. Meh.

How gradient methods work

Essentially, you mimic a BB with a WB via distillation, then you calculate the gradient of the model output w.r.t. the input images. The secret, as usual, lies in the heuristics.
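As a hedged sketch of this idea in PyTorch: an MI-FGSM-style iterative attack with momentum against a distilled Student net. The function names, step sizes, and iteration counts here are illustrative assumptions, not our exact heuristics:

```python
import torch

def momentum_gradient_attack(student, src, target_desc, eps=16/255,
                             step=2/255, iters=10, mu=0.9):
    """Iterative gradient attack with momentum: nudge the source image so
    that the Student's descriptor of it moves towards the target descriptor,
    staying within an eps-ball around the original image."""
    adv = src.clone().detach()
    g = torch.zeros_like(src)  # accumulated momentum
    for _ in range(iters):
        adv.requires_grad_(True)
        # distance between the current descriptor and the target descriptor
        loss = (student(adv) - target_desc).pow(2).sum()
        grad, = torch.autograd.grad(loss, adv)
        # normalize the gradient before accumulating momentum (MI-FGSM trick)
        g = mu * g + grad / grad.abs().mean().clamp_min(1e-12)
        # descend on the loss, keep pixels valid
        adv = (adv.detach() - step * g.sign()).clamp(0, 1)
        # project back into the eps-ball around the source image
        adv = src + (adv - src).clamp(-eps, eps)
    return adv.clamp(0, 1).detach()
```

The momentum term is what distinguishes this from plain iterative FGSM: it smooths the update direction across iterations, which is often claimed to improve transferability to the black-box.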

Target metric

The target metric was an average L2 norm (Euclidean distance) between all 25 combinations of source and target images (5*5 = 25).
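For illustration, the metric can be computed roughly like this (assuming descriptors have already been extracted as NumPy arrays; the function name is mine):

```python
import numpy as np

def mean_pairwise_l2(source_descs, target_descs):
    """Average L2 (Euclidean) distance over all source x target descriptor
    pairs. With 5 source and 5 target images this gives 5 * 5 = 25 pairs."""
    dists = [np.linalg.norm(s - t)
             for s in source_descs for t in target_descs]
    return float(np.mean(dists))
```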

Due to CodaLab's limitations, I believe that private scoring (and team mergers) was handled manually by the admins, which is kind of cringe.

Team

I joined the team after training Student nets better than anyone on the LB (AFAIK) and after some discussion with Atmyre (she helped me use the right compiled BB, as she had faced such problems herself). We shared our local scores without sharing approaches or code, and 2-3 days before the finish line:

My end-to-end models failed (hers as well);

I had superior Student models;

They had a superior variation of FGVM heuristics (their code was based on the baseline);

I had just started tackling gradient-based models and achieved somewhere around ~1.1+ locally - initially I was reluctant to just use the baseline code for personal reasons (little challenge);

They did not have plenty of computational power at that moment;

In the end we took a gamble and joined forces - I contributed my devbox / CNNs / ablation experiments and observations, and they contributed the code base they had polished for a couple of weeks;

Once again, a great shout-out to her for her organizational skills as well as priceless advice.

The team members were:

https://github.com/atmyre - she was the team captain (from what I guessed from her actions). She contributed a genetic differential evolution attack to the final submission;

https://github.com/mortido - his was the best implementation of the FGVM attack, with stellar heuristics + he trained 2 models using the baseline code;

https://github.com/snakers4 - apart from some ablation tests I contributed 3 Student models with superior scores + computing power + I had to step up during presentation and final submission phase;

https://github.com/stalkermustang;

In the end we learned a lot from each other, and I am glad that we took this gamble. Without any of the above three contributions, we would not have won.

2. Student CNN distillation

I achieved the best score in training the Student models because I used my own code instead of the baseline.
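A minimal sketch of what one distillation step boils down to: the Student regresses the descriptors the BB produced for the same images, with plain MSE. All names are illustrative; this is not the actual competition code:

```python
import torch
import torch.nn as nn

def distillation_step(student, images, bb_descriptors, optimizer):
    """One Student distillation step: regress the (precomputed) black-box
    descriptors for `images` with MSE. `bb_descriptors` is assumed to be a
    Bx512 tensor of BB outputs collected offline."""
    student.train()
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(images), bb_descriptors)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since the BB here did not accept batches, the descriptors would be precomputed image by image and then fed to the Student in batches.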

Key takeaways / what worked:

Invent a LR regime for each architecture separately;

At first just train with Adam + LR decay;

Then use folding and / or something even more clever (I did not do it here) like Cyclic Learning Rate or weight ensembling;

Monitor under- and over-fitting and model capacity carefully;

Tune your schedule manually; do not rely on fully automatic schemes. They can work as well, but if you tune everything properly your training time can be 2-3x shorter. This especially matters for gradient-heavy models like DenseNet;

The best architectures were reasonably heavy;

Training with L2 loss instead of MSE also worked, but it was less precise: in tests, models trained with MSE showed a closer L2 distance to the BB model outputs than models trained with L2 loss. Probably this is because MSE, used out-of-the-box, treats each element of the Bx512 batch kind-of separately (it allows more fine-tuning and shares information between images), whereas the L2 norm treats each pair of 512-d vectors separately;
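To make the difference between the two losses concrete, here is a small PyTorch comparison (shapes are illustrative):

```python
import torch

B, D = 4, 512  # batch of descriptor pairs, descriptor dimension
pred = torch.randn(B, D)
target = torch.randn(B, D)

# MSE: a single mean over every element of the BxD batch
mse = (pred - target).pow(2).mean()

# L2: one Euclidean distance per 512-d vector pair, then averaged
l2 = (pred - target).norm(dim=1).mean()

# the two are related but not equivalent: the mean *squared* L2 distance
# equals MSE scaled by D, while the mean L2 distance weighs pairs differently
mean_sq_l2 = (pred - target).pow(2).sum(dim=1).mean()
```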

What did not work:

Inception-based architectures (not suitable due to high down-sampling and the higher required input resolution). The third place somehow managed to use Inception-v1 with full-resolution images (~250x250), though;

VGG based architectures (over-fitting);

"Light" architectures (SqueezeNet / MobileNet - under-fitting);

Image augmentations (without modifying the descriptors - though the people from 3rd place pulled this off);

Working with full-size images;

Also, there was a batch-norm layer at the end of the networks provided by the challenge hosts. It did not help my colleagues, and I used my own code as I did not quite get why it was there;

Using saliency maps together with one-pixel attacks. I assume this is more useful for full-sized images (just compare 112x112 x search_space vs. 299x299 x search_space);

Our best models - note that the best score is 3 * 1e-4. Also, judging by model complexity, you can kind of guess that the BB was a ResNet34. In my tests, ResNet50+ performed worse than ResNet34...

MSE losses from the first place

3. Final scores and ablation analysis

Our ablation analysis looked like this:

The top solution looked like this (yes, there were jokes about just stacking ResNets; he guessed that ResNet was the BB architecture):