I believe the following semantic segmentation (semseg) CNN techniques were already standard in 2018, but I will list them anyway for completeness:

Using UNet / LinkNet style CNNs;

Using pre-trained encoders on Imagenet;

Trying various combinations of encoder-decoder architectures;

Trying augmentations of vastly different "heaviness" (you should tune capacity of your model + augmentations for each dataset);

(Funnily enough, they rarely write about this in papers).

Open solution

From an ML perspective it is great. No, really. Their write-up is amazing.

Weight illustrations in the open solution. Note that they do not show the final weights. If they did, the need to calculate n^2 distances would be less apparent.

These are my corresponding weights

And this is how the CNN sees the weights after parametrization / squeezing - no real difference regardless of how you calculate the distances

But I have a number of serious bones to pick with it. I have voiced my concerns in the gitter chat, and I will repeat them here:

This solution is blatant marketing of a US$100-per-month-minimum solution for ML, which I guess is supposed to solve scaling / experimentation problems, but I cannot see why Tensorboard + simple logs + bash scripts cannot solve this for free, without the code bulk and the extra dependencies. Ofc, using this platform is not mandatory, but why would you otherwise;

The code bulk in the repo - it contains 3-4 levels of abstraction - decorators on decorators. That is really ok if you have invested in a team of at least several people who compete in interesting competitions as their job, but when you are publishing your code for the public, it looks a bit like those obfuscated repositories with TF code. Their code is really good, but you have to dig through all of these extra layers to get to the "meat";

The advertised neptune.ml experiment tracking capabilities. Yeah, it tracks all of the hyper-params. But if you follow a link to their open spreadsheet with experiments, you cannot really understand much from there => then what is the tracking advantage? Yeah, running your model on amazon with 2 lines of code is great, but also extremely expensive. Maybe it's better to just assemble a devbox? It looks like they wrote a stellar write-up, but used this tracking only for "illustration", not like a real tool that is crucial;

The idea about distance weighting is really good, but they implemented it line by line as written in the UNet paper, by calculating n^2 distances, which in my opinion is over-engineering. Just visually, after "squishing", even one distance transform provides similar mask weights;
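For reference, this is the weight map as I understand it from the UNet paper. The notation follows the paper, not the open solution's code, which I am not quoting here:

```latex
w(x) = w_c(x) + w_0 \cdot \exp\left( -\frac{\bigl(d_1(x) + d_2(x)\bigr)^2}{2\sigma^2} \right)
```

Here w_c is the class-balancing weight, d_1(x) is the distance from pixel x to the border of the nearest object and d_2(x) is the distance to the border of the second nearest object. Getting d_2 requires a distance map per object, which is where the per-object distance cost comes from. The "one distance transform" shortcut discussed above amounts to keeping only a single distance term.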

Of these points, my biggest concern is that new ML practitioners will see this and may draw the following conclusions, which ARE HIGHLY DETRIMENTAL TO THE ML COMMUNITY FROM AN EDUCATIONAL STANDPOINT:

You need some paid proprietary tools to track experiments. No, you do not. It starts to make sense when you have a team of at least 5 people and you have a LOT of different pipelines;

What worked? Which internal structure did I explore?

Current state / my simple ablation analysis

So far I have not implemented the second stage of the pipeline, as the second stage is delayed and last time I checked the submits were frozen.

But here is a very brief table with my major tests (I ran ~100 different tests, training ca. 20-30 models till convergence).

| Architecture | Histogram based F1 score (0.5 IOU) (**) | Hard DICE (0.5) (+) | Proper F1 score (0.5 IOU) | LB / polygon F1 (*) | Comment |
|---|---|---|---|---|---|
| Best ResNet + LinkNet | .823 | ~0.9+ | NA | NA | |
| Best ResNet + UNet | .816 | ~0.9+ | NA | NA | 2-3x slower than LinkNet |
| Best InceptionResNet-based model | .786 | ~0.9+ | NA | NA | |
| Best DenseNet-based model | .803 | ~0.9+ | NA | NA | |
| Best LinkNext-based model | .820 | ~0.9+ | NA | NA | |
| Best ResNet + LinkNet + weighting tricks (***) | .874 | ~.94 | ~.9 | TBD | epoch = 0.1 of the dataset (random); 5 epochs with lr 1e-3, unfreeze encoder; 10 epochs with lr 1e-4; 20 epochs with lr 1e-5; then increase DICE weight x10 |
| Best ResNet + LinkNet + weighting tricks + faster schedule + higher size-based weight | .880+ | ~.93 | ~.9 | TBD | No encoder freeze, start with lr 1e-4; 10 epochs with lr 1e-4; 20 epochs with lr 1e-5; then increase DICE weight x10 |

(*) Challenge hosts provide a tool for local evaluation. It is based on jsons, that have to be filled with polygons + some confidence metric. The authors of the open solution claim that this weighting has a really major impact on the score. I did not test this yet, but I believe it can easily add 2-3 percentage points to the score (i.e. reduce False Positives + make the score .9 => .92-.93 for example);

(**) I just adopted a histogram-based F1 score from DS Bowl. I guess it works best when you have many objects. We did a naive implementation of a proper F1 score together with visualization - it is slow, but shows a much higher score;
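The naive "proper" F1 score mentioned in (**) is not shown in this post, so here is only a minimal sketch of what such a metric could look like. It assumes each object has already been extracted as a set of (row, col) pixel coordinates, and that a prediction counts as a true positive when its IoU with a not-yet-matched ground-truth object is at least 0.5. The names `iou` and `f1_at_iou` are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two objects given as sets of pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def f1_at_iou(gt_objects, pred_objects, threshold=0.5):
    """Object-level F1: greedy one-to-one matching at a fixed IoU threshold."""
    matched_gt = set()
    true_pos = 0
    for pred in pred_objects:
        best_iou, best_idx = 0.0, None
        # Compare against every unmatched ground-truth object (quadratic cost).
        for idx, gt in enumerate(gt_objects):
            if idx in matched_gt:
                continue
            score = iou(pred, gt)
            if score > best_iou:
                best_iou, best_idx = score, idx
        if best_idx is not None and best_iou >= threshold:
            matched_gt.add(best_idx)
            true_pos += 1
    false_pos = len(pred_objects) - true_pos
    false_neg = len(gt_objects) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    # Convention chosen here (an assumption): empty vs. empty is a perfect score.
    return 2 * true_pos / denom if denom else 1.0


gt = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
pred = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
print(f1_at_iou(gt, pred))  # one hit, one false positive, one miss
```

The double loop over predictions and ground-truth objects is what makes this kind of implementation slow on tiles with many buildings.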

Best model - the fattest ResNet152
I did some ablation tests and:
-- ResNet101 was a bit worse
-- ResNext was close, but heavier
-- VGG-family models over-fitted heavily
-- Inception family models were good, but worse than ResNet

(1) Model architecture

LinkNet and UNet based models were close with fat ResNet encoders, but UNet is 3-4x slower
Just for lulz, maybe it's worth leaving a UNet to train for a couple of days? =) But that borders on memorizing the dataset ...

Note that ideally (0) and (1) should also be tested with the whole pipeline, which I have not done yet (my friend will do the second part of the pipeline).

(2) Augs

I played with different levels of augs: small augs were the best, and the model does not overfit (therefore it is very interesting to see the second stage data - maybe the data will be different => all the training know-how will become useless, as always happens with the competitions on Kaggle ..)

(3) Training regime

-- Freeze encoder, tune the decoder with lr 1e-3 and adam for 0.1 of the dataset (randomly ofc)
-- Unfreeze, train with lr 1e-4 and adam for 1.0 of the dataset (randomly ofc)
-- Train with lr 1e-5 for 1.0 of the dataset
-- Increase DICE loss 10x and train as long as you want - this possibly may be very fragile (!) if the delayed test dataset is different
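The regime above can be written down as a small framework-agnostic lookup table. This is only a sketch: `Phase`, `SCHEDULE` and `phase_at` are hypothetical names, the base DICE weight of 1.0 is an assumption, and so is keeping lr 1e-5 in the last phase (the post does not say which lr is used there).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str
    dataset_passes: Optional[float]  # share of the dataset seen; None = open-ended
    lr: float
    freeze_encoder: bool
    dice_weight: float  # multiplier on the DICE term (base 1.0 is an assumption)


SCHEDULE = [
    Phase("decoder warm-up", 0.1, 1e-3, True, 1.0),
    Phase("full fine-tune", 1.0, 1e-4, False, 1.0),
    Phase("low lr", 1.0, 1e-5, False, 1.0),
    # lr for this phase is not stated in the post; 1e-5 is assumed.
    Phase("dice boost", None, 1e-5, False, 10.0),
]


def phase_at(passes_seen: float) -> Phase:
    """Return the phase active after `passes_seen` passes over the dataset."""
    boundary = 0.0
    for phase in SCHEDULE:
        if phase.dataset_passes is None:
            return phase
        boundary += phase.dataset_passes
        if passes_seen < boundary:
            return phase
    return SCHEDULE[-1]


for seen in (0.05, 0.5, 1.5, 3.0):
    p = phase_at(seen)
    print(seen, p.name, p.lr, p.freeze_encoder, p.dice_weight)
```

A training loop would call `phase_at` once per "epoch" and set the optimizer lr, the encoder's trainable flag and the loss weights accordingly.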

(4) Loss weighting

-- Visually I saw no difference in using only one distance transform vs. calculating distances between each object => they are squished anyway
-- UNet weighting worked best with --w0 5.0 --sigma 10.0, which means that the weights are distributed in [1;5]
-- Size weighting worked, and gave +3-5% F1 score
-- Distance weighting did not improve the result, but together with size weighting there was a slight improvement
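To make the "one distance transform" idea concrete, here is a minimal sketch using `scipy.ndimage.distance_transform_edt`. It is not the code used for the results above: the function name `single_transform_weights` is hypothetical, and the exact scaling is an assumption picked so that with w0 = 5.0 the weights fall in [1;5], matching the range quoted above. Size weighting is not included.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt


def single_transform_weights(mask, w0=5.0, sigma=10.0):
    """Per-pixel loss weights from ONE distance transform over the whole mask.

    mask: 2D array, non-zero = object (building), zero = background.
    Background pixels close to any object get weights near w0,
    far background and object pixels get weight 1.
    """
    mask = np.asarray(mask).astype(bool)
    if not mask.any():
        # No objects at all: nothing to emphasise.
        return np.ones(mask.shape, dtype=np.float64)
    # distance_transform_edt gives the distance to the nearest zero element,
    # so invert the mask to measure background -> nearest object pixel.
    dist = distance_transform_edt(~mask)
    # Assumed scaling: 1 + (w0 - 1) * gaussian, i.e. values in [1, w0].
    weights = 1.0 + (w0 - 1.0) * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    # Assumption: pixels inside objects keep the base weight.
    weights[mask] = 1.0
    return weights


demo = np.zeros((5, 5), dtype=np.uint8)
demo[2, 2] = 1
print(np.round(single_transform_weights(demo), 2))
```

Compared with the per-object version, this needs a single pass over the mask regardless of how many objects it contains.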