1. Why this particular challenge?

Now (in late 2017, after being bought by Google) Kaggle is dominated by teams of people who usually win by stacking 10-30 models; otherwise you will not get the necessary +0.5% boost;

There have been a lot of not very interesting competitions with tabular data lately (requiring either stacking 30 XGBoost models or industry expertise), i.e. the experience is neither directly transferable to other domains nor interesting in itself;

In a recent plain-vanilla CV competition (classification of 5000+ online items with 300x300 pictures), stacking started a month before the end, after a guy posted his 0.7-scoring model weights with a note "stack at your pleasure". This more or less turns such competitions into casinos with stacks of GPUs instead of stacks of chips / money;

Most competitions' domains are plainly boring and / or too "business"-related;

When I was just listening to the first fast.ai course, they focused heavily on the previous instance of a similar competition, which raised my interest in this field;

I am a novice in this field (ca. 1 year of experience, whereas many of the current people on Kaggle have a 3-4 year track record);

So I really liked the domain, the high barriers to entry and the steel nerves required even to make a submission (300+ GB of unpacked data => of 400+ participants, only a few dozen managed to build a reasonable pipeline).

In a nutshell, this competition provides ca. 1100 training videos and 667 test videos. We needed to:

Count the fish properly (weight 0.6);

Classify fish species (weight 0.3);

Measure fish length (weight 0.1);

The target metric (ranging from 0 to 1) was, in my view, unnecessarily complicated. Nobody in the community (among the people who took the metric implementation from the forum) could match their local validation with the leaderboard scores. As usual, there was an opinion that the public and the private portions of the LB are heavily unbalanced and that the scoring is bugged (casino?). But what makes participating in such competitions worthwhile is the interesting domain and the motivation to learn more about the current SOTA methods.

To make things worse, a non-expert cannot easily tell one flatfish species from another - they look really similar:

2. Model selection criteria. Key considerations

Well, without the hindsight knowledge of section 8, my decision process was the following:

We have ca. 100+ GB of videos, which, when unpacked into pictures (I did not know then that you can read frames directly from the videos - see the sketch after this list), weigh ca. 300+ GB;

The basic task - finding and classifying fish - is well covered by well-known image classification and object detection architectures;

I had worked extensively with u-net on a previous challenge (I scored 66th without resorting to stacking and such) - it has amazing powers, but it works very slowly - ca. 1 frame per second;

Since the task is really complicated and success is not guaranteed, ideally I needed the simplest possible end-to-end pipeline, consisting of no more than one model;

Also, I decided to learn pytorch as a key part of this competition (even if I fail the task, I will at least learn more about pytorch);
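As for reading frames directly from videos - a minimal sketch of how that can look, using OpenCV (my choice here for illustration; any video library would do):

```
import cv2

def iter_frames(video_path, every_n=1):
    """Yield (frame_index, BGR frame) straight from a video - no unpacking to disk."""
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield idx, frame
        idx += 1
    cap.release()
```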

All you need to know about current sota object detection algorithms

Also, after reading up on the subject and assessing the available implementations in the target frameworks - keras and pytorch (YOLO, YOLOv2, SSD) - I decided to try YOLOv2 in keras due to its simplicity, and SSD in pytorch for a challenge.

Conveniently, the convolutional part of these models' architecture is really simple (which cannot be said about the preprocessing, augmentation and evaluation parts, but those are helpfully provided by the available implementations).

3. Data EDA + how to analyze 300+GB of data efficiently?

Well, the data is videos - so the easiest approach to EDA is watching the videos with the bounding boxes drawn on top. I know a couple of basic ways to do that:
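The simplest of them, as a sketch (it reuses iter_frames from above; boxes_for_frame() is a placeholder for whatever lookup you build over the annotation file):

```
import cv2

def draw_boxes(frame, boxes, color=(0, 255, 0)):
    """Draw [x1, y1, x2, y2] boxes in place on a BGR frame."""
    for x1, y1, x2, y2 in boxes:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
    return frame

for idx, frame in iter_frames('videos/some_video.mp4', every_n=5):
    cv2.imshow('eda', draw_boxes(frame, boxes_for_frame(idx)))
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
```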

At this stage we can directly and easily see what data was provided to us:

As you can see, the fish were not annotated in the usual way (so that the fish mostly fits into the bounding box) - they were annotated head-to-tail. That may be useful if you want a multi-stage pipeline that first finds the head and tail locations, but I wanted a more or less end-to-end solution. It's really easy to fix - these fish are mostly about as wide as they are long, so we can just turn the annotations into squares that fit the whole fish.
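Assuming the annotation stores the head and the tail as two points (which is how I read it), the conversion is a few lines; the padding factor is a judgment call:

```
import math

def head_tail_to_square(x1, y1, x2, y2, pad=1.1):
    """Turn a head-to-tail annotation into a square box covering the whole fish."""
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = math.hypot(x2 - x1, y2 - y1) * pad / 2.0  # pad > 1 leaves a small margin
    return cx - half, cy - half, cx + half, cy + half
```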

When I first downloaded the implementation, the author did not provide weights in Keras format and his conversion script failed, so I had to use this one instead. Now he seems to have provided the weights in keras format;

Basically, most of these implementations base their pre-processing pipelines on Pascal VOC or MS COCO, which at this point are terribly outdated (you will see obsolete python 2 libraries and awfully long xml parsing scripts). So the first part of the exercise is bracing yourself and writing your own data generator on the basis of the provided one.

Feeding the generator (after reverse engineering the xml parsing scripts) turned out to be really easy (the link points to the final train script):
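The gist of it, assuming the annotations are first flattened into a plain CSV (the column names below are hypothetical; the real version lives in the train script linked above):

```
import cv2
import numpy as np
import pandas as pd

def fish_batches(csv_path, batch_size=16, size=416):
    """Endlessly yield (images, targets) batches from a flat CSV of annotations."""
    df = pd.read_csv(csv_path)  # expected columns: frame_path, x1, y1, x2, y2, class_id
    while True:
        sample = df.sample(batch_size)
        imgs, targets = [], []
        for _, row in sample.iterrows():
            img = cv2.imread(row.frame_path)
            h, w = img.shape[:2]
            imgs.append(cv2.resize(img, (size, size)))
            # relative [x1, y1, x2, y2, class] - the format most YOLO/SSD ports expect
            targets.append([row.x1 / w, row.y1 / h, row.x2 / w, row.y2 / h, row.class_id])
        yield np.array(imgs), np.array(targets)
```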

Also, in the case of keras YOLO, I just had to infer the format supported by their generator and make sure I utilized 2 GPUs. In keras this is not straightforward and kind of hacky, and their generator, I believe, does not support multiple workers.

This is the hacky multi-GPU keras implementation I tested and used.

```
# Save all the outputs for merging back together later
for l in range(len(outputs)):
    outputs_all[l].append(outputs[l])

# merge outputs on CPU
with tf.device('/cpu:0'):
    merged = []
    for outputs in outputs_all:
        merged.append(merge(outputs, mode='concat', concat_axis=0))

    return Model(input=model.inputs, output=merged)
```

First you create a model, then you do the model = make_parallel(model, gpu_count=2) bit, and only then you load the weights. Note that the weights of the original model and of the new model are not directly interchangeable, because the above script basically replaces your model with a 2-tower version, which is not ideal for loading weights back (you have to load them into the modified model).
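In other words, the order that worked for me (build_yolo_v2 and the file name below are placeholders):

```
model = build_yolo_v2()                        # build the single-GPU model first
model = make_parallel(model, gpu_count=2)      # then wrap it into the 2-tower version
model.load_weights('yolo_2gpu_checkpoint.h5')  # and only then load the checkpoint
```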

It's tempting to use the provided predict or predict_on_batch in Keras, but in this case, due to the custom data manipulations, I had to improvise a quick and dirty single-threaded queue-based solution.
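Roughly like this (a sketch; preprocess() stands for whatever resizing / normalization your model expects):

```
from collections import deque

import numpy as np

def predict_video(model, frames, batch_size=32):
    """Single-threaded, queue-based inference: buffer frames, flush in batches."""
    queue, results = deque(), []
    for _, frame in frames:
        queue.append(preprocess(frame))  # placeholder for model-specific preprocessing
        if len(queue) == batch_size:
            results.extend(model.predict_on_batch(np.array(queue)))
            queue.clear()
    if queue:  # flush the leftover tail
        results.extend(model.predict_on_batch(np.array(queue)))
    return results
```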

Pytorch takeaways

The pytorch SSD implementation had ample augmentations which also serve as a basis for my current work;

Also, pytorch provides multi-node and multi-gpu boilerplate out of the box - no need to reinvent the wheel;

I basically just needed to rewrite the data generator classes (a sketch follows this list) and add basic validation to the train script to ensure there was no over-fitting;

The original implementation had no proper runtime validation in the pipeline and no batch-based inference scripts (also, their inference post-processing is not multi-GPU friendly, but I should not really complain =) );
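The rewrite boiled down to something like this (same hypothetical flat CSV as above; the real class also carried the SSD-specific augmentations):

```
import cv2
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

class FishDataset(Dataset):
    """Minimal stand-in for the VOC-style dataset classes the SSD repo shipped with."""

    def __init__(self, csv_path, transform=None):
        self.df = pd.read_csv(csv_path)  # hypothetical flat annotation file
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        img = cv2.imread(row.frame_path)
        boxes = [[row.x1, row.y1, row.x2, row.y2, row.class_id]]
        if self.transform is not None:
            img, boxes = self.transform(img, boxes)
        img = torch.from_numpy(img).permute(2, 0, 1).float()
        return img, torch.tensor(boxes, dtype=torch.float)

# multi-worker loading comes for free, unlike in my keras setup
train_loader = DataLoader(FishDataset('train.csv'), batch_size=32,
                          shuffle=True, num_workers=4)
```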

5. Choosing post-processing heuristics

With the technical part out of the way, we can now handle the post-processing heuristics. The thing is - SSD and YOLO can predict bounding boxes and class probabilities, but they cannot really predict fish sequences or count fish.

Fish length is easy - I tried simple linear regression (95% accuracy), regression forests (90% due to overfitting) and CNNs (97-98% on binned data, but too complicated for such a simple task). Given that fish length accounts for only 0.1 of the score and local validation did not really work, I decided to focus more on post-processing and filtering fish counts and class probabilities.
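The linear-regression variant fits in a few lines (the single head-to-tail-distance feature and the column names are my simplification):

```
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv('train.csv')  # the same hypothetical annotation file as above
X = np.hypot(df.x2 - df.x1, df.y2 - df.y1).values.reshape(-1, 1)  # pixel length
y = df.length.values                                              # annotated length
reg = LinearRegression().fit(X, y)
print(reg.score(X, y))  # R^2 on train - mine was in the ~0.95 ballpark
```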

In hindsight, it turned out that dividing (and conquering) the task into a series of simpler tasks (find the ruler, extract it, rotate it, classify the fish, use an RNN to count fish) was the way to go, but in my case there was a steep learning curve in learning SSD / YOLO / pytorch all at the same time (on top of spending the first 2-3 weeks at a new job) - so I just stuck to naive post-processing techniques and grid searching for the best solution, until ZFTurbo suggested that I could use the train dataset to validate my post-processing technique by controlling the number of fish.

Post-processing in a nutshell - we need to decide which probabilities to keep and how to separate fish N from fish N+1.

Also, it turned out that not only I, but also some of the other people who managed to submit a reasonable solution, used more or less this technique (a sketch follows the list):

Predict fish location and class probabilities;

Apply some kind of threshold - e.g. 0.4 / 0.5 / 0.9 - to the data (if the max probability of the row is less than the threshold, then there is no fish);

Iterate over predictions sorted by video and frame number;

If there is a gap in the data (N rows with no fish), increment the fish number;

Repeat;
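The whole heuristic fits in a few lines; the threshold and the gap size are the two knobs I grid-searched (the values below are just examples), and it is applied to each video separately:

```
def assign_fish_numbers(rows, threshold=0.5, gap=8):
    """rows: per-frame records of a single video, sorted by frame number,
    each carrying its max class probability.

    A run of >= gap consecutive "no fish" frames starts a new fish.
    """
    fish_n, empty_run = 0, gap  # start "inside a gap" so the first detection opens fish #1
    for row in rows:
        if row['max_prob'] < threshold:
            row['fish_number'] = None
            empty_run += 1
        else:
            if empty_run >= gap:
                fish_n += 1
            empty_run = 0
            row['fish_number'] = fish_n
    return rows
```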

I also used a trick (kudos to Konstantin Lopuhin) - feeding 3 square crops of the image to SSD300 instead of resizing the whole picture. It gave me a 10x lower cost function on train, but no real improvement on the LB.
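The crops trick might look like this - three overlapping squares spanning the wide frame instead of one squashed resize:

```
def three_square_crops(frame):
    """Cut 3 overlapping h x h squares spanning a wide (h < w) frame."""
    h, w = frame.shape[:2]
    offsets = [0, (w - h) // 2, w - h]  # left, centre, right
    return [(x, frame[:, x:x + h]) for x in offsets]

# each crop goes through SSD300 at near-native resolution;
# predicted x-coordinates are shifted back by the crop offset before merging
```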

In hindsight, going for a more end-to-end solution was the key to getting a 0.7+ score.

6. Experiment log, or am I just unlucky, or is all of this a hoax?

When I first entered the competition, things looked optimistic.

It all started very promisingly

Then I really struggled through tons of experiments, but only marginally improved my original score (0.55 => 0.57).

As you can see above, I was disappointed that a better model (SSD300 + crops) did not really work despite all the tricks. Maybe it's a glitch somewhere, but it is kind of disheartening.

Probably I should also have paid more attention to the trick proposed by ZFTurbo - validating my post-processing script by running the model on the train dataset and comparing the number of predicted fish. I used MSE and simple histograms to match my values, but they did not really work on the LB.
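The comparison itself is trivial (a sketch of the MSE part):

```
import numpy as np

def count_mse(pred_counts, true_counts):
    """pred_counts / true_counts: {video_id: number of fish} on the train videos."""
    diffs = [pred_counts[v] - true_counts[v] for v in true_counts]
    return float(np.mean(np.square(diffs)))

# grid-search the post-processing knobs (threshold, gap) by minimising this
```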

Also, almost at the end, I noticed that SSD with 3 crops gave strange predictions - it usually predicted at least 2 classes with probabilities > 0.9, which I tried to mitigate by manually selecting only the top class. Overall, SSD had higher probabilities in its outputs.

As a finishing touch, enjoy a couple of videos:

Yolov2 + 0.5 threshold

SSD300 - NO threshold at all

7. End result, advice, limitations, take-aways

On the last day of the competition I was 14th on the public LB. I expect a major shake-up, though. Since my best model is simple enough, I expect that some overfitters (if there were any) would lose some positions, though my thresholding may be naive;

End-to-end solutions rule;

Explicitly using all the additional information available may give you a proper boost;

Steel nerves and balls are necessary till the end;

When extending the framework / model classes - invest a lot of time into testing;

Try to use tools that are as simple as possible, if you want to win money. Try to divide and conquer;

Do not be afraid to test some naive end-to-end solutions as a part of your pipeline (unless they take more than 1-2 hours to implement), even if they increase overall complexity;

8. Alternative pipelines

These are other people's pipelines that scored in the top results:

First approach - inference takes 1-2 days:

Unet to locate the fish;

Classic classification methods to classify the fish;

Naive post processing;

Update - they also used XGBoost on the outputs of the previous steps for post-processing;

Second approach (similar to mine):

SSD300 + square frame cropping;

Naive post processing;

Third approach:

Manual annotation of the ruler on which the fish is laid;

Manual annotation of the fact whether the fish is seen fully / obstructed;

SSD to locate the ruler, DenseNet to classify the fish on the manually rotated predictions;