    compile(optimizer=adam, loss='binary_crossentropy')
    return model

I used a concatenation of the Google News and GloVe embeddings, with additional features appended to each word’s representation: relative position in the sentence (order number), relative position in the question (order number), is_upper, is_lower, is_number, is_punctuation, share of uppercase letters in the word (“WORD” → 1.0, “Word” → 0.25), and frequency of the word in the train dataset.
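As a rough illustration of these per-word features, here is a minimal sketch. The function name `word_features`, its arguments, and the use of `index / n_words` as the relative position are assumptions made for this example, not the code used in the competition. Only one positional feature is shown; the second (position in the question) would be computed the same way.

```python
import string


def word_features(word, index, n_words, train_freq):
    """Build a small hand-crafted feature vector for one word.

    `train_freq` is a hypothetical dict mapping a word to its count
    in the train dataset.
    """
    n_chars = max(len(word), 1)
    is_punct = len(word) > 0 and all(ch in string.punctuation for ch in word)
    return [
        index / max(n_words, 1),                     # relative position (assumed definition)
        float(word.isupper()),                       # is_upper
        float(word.islower()),                       # is_lower
        float(word.isdigit()),                       # is_number
        float(is_punct),                             # is_punctuation
        sum(ch.isupper() for ch in word) / n_chars,  # share of uppercase letters
        float(train_freq.get(word, 0)),              # frequency in the train dataset
    ]


# Usage: the uppercase share matches the examples in the text.
print(word_features("WORD", 0, 4, {"WORD": 3})[5])  # 1.0
print(word_features("Word", 1, 4, {})[5])           # 0.25
```

Such a vector would be concatenated to the word’s embedding before it is fed to the network.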

Snapshot Ensembling

Snapshot ensembling is a commonly used technique in Kaggle competitions, previously described in Snapshot Ensembles: Train 1, get M for free.

The idea behind it is very simple: we train a single model with a cyclic learning rate, saving the weights of the model at the end of each cycle (the end of a cycle is usually in a local minimum).

In the end, we will get several models instead of just one.

Averaging predictions of the ensemble will give you a better score than a single model.
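The procedure can be sketched on a toy problem. This is a minimal illustration, not the competition code: it assumes a cosine-annealed schedule that restarts every cycle (one common choice of cyclic learning rate), and it uses a plain NumPy logistic regression in place of the neural net. All function names are made up for this example.

```python
import math

import numpy as np


def cyclic_cosine_lr(step, steps_per_cycle, max_lr):
    """Cosine-annealed learning rate that restarts at max_lr every cycle."""
    pos = (step % steps_per_cycle) / steps_per_cycle
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * pos))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_with_snapshots(X, y, n_cycles=3, steps_per_cycle=50, max_lr=0.5):
    """Train a toy logistic regression and save weights at the end of each cycle."""
    w = np.zeros(X.shape[1])
    snapshots = []
    for step in range(n_cycles * steps_per_cycle):
        lr = cyclic_cosine_lr(step, steps_per_cycle, max_lr)
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)
        w = w - lr * grad
        if (step + 1) % steps_per_cycle == 0:
            snapshots.append(w.copy())  # the learning rate is near zero here
    return snapshots


def predict_ensemble(snapshots, X):
    """Average the predicted probabilities of all snapshots."""
    return np.mean([sigmoid(X @ w) for w in snapshots], axis=0)
```

With a real neural net, the same schedule would drive the optimizer’s learning rate and each snapshot would be a saved copy of the model weights.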

During training, we converge to and escape from multiple local minima. A snapshot is taken in each local minimum.

Pseudo Labeling

The idea of pseudo labeling is to increase the amount of data available for model training by adding unlabeled examples together with the labels the model predicts for them.

It is a common approach in Kaggle competitions but, unfortunately, it is not so commonly used in industry (here are some papers on this topic: Training Deep Neural Networks on Noisy Labels with Bootstrapping, Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks).
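A minimal sketch of how pseudo labels with confidence-based weights could be built from predicted probabilities is shown here. The confidence cut-off `min_confidence` and the linear rescaling of the weights are assumptions for illustration; the exact weighting used in the competition is not given in this text.

```python
import numpy as np


def make_pseudo_labels(test_probs, min_confidence=0.9):
    """Turn predicted probabilities into pseudo labels with sample weights.

    Returns the indices of the kept test examples, their hard labels,
    and a weight in [0, 1] that grows with prediction confidence.
    """
    test_probs = np.asarray(test_probs, dtype=float)
    labels = (test_probs >= 0.5).astype(int)
    # Confidence is the probability of the predicted class, in [0.5, 1].
    confidence = np.maximum(test_probs, 1.0 - test_probs)
    keep = confidence >= min_confidence
    # Assumed weighting: rescale confidence linearly from [0.5, 1] to [0, 1].
    weights = (confidence - 0.5) * 2.0
    return np.flatnonzero(keep), labels[keep], weights[keep]


idx, pseudo_y, pseudo_w = make_pseudo_labels([0.99, 0.5, 0.02, 0.7])
print(idx, pseudo_y, pseudo_w)
```

The selected test rows would then be appended to the train set, with the weights passed as per-sample weights to the training routine and a weight of 1 for the real labels.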

The more confident our predictions are, the bigger the weight we assign during training.

Optimal Threshold Selection

The metric of this competition is the F1 score, which means that we have to submit classes (0/1, nontoxic/toxic).

The output of the model is probabilities, so we need to select the appropriate threshold to maximize the score.

There are a number of ways to do that:

· Select a threshold before training. I think this approach leads to a non-optimal threshold and a lower score. I tested it but never used it in the final submission, although the winners of the competition used a variation of this approach.

· Fit the model, predict on the train part, and select the threshold that optimizes the score on the train part. This is a straight way to overfit. I tested it but did not use it in the final submission.

· Fit the model, predict on the validation part, and select the threshold that optimizes the score on the validation part. In my pipeline, the validation part is already used to select the optimal number of epochs (a slight overfit to the validation data) and to optimize the hyperparameters of the neural net (a major overfit). Selecting a threshold “as is” on data that has already been “used” this way is not a good idea: we might completely overfit and get a low score on unseen data. Instead, I decided to select a threshold on subsets of the validation part, repeat this several times, and then aggregate the results (the idea was motivated by subsampling in statistics). It gave me good results both in this competition and in a few others where I had to select a threshold as well [code].

· Make out-of-fold predictions (OOF, meaning a separate model for each fold) and find a threshold on them. This is a good approach, but it has two problems: (1) We don’t have enough time to make OOF predictions, because we want to fit as many diverse models (CNN/LSTM) as possible. (2) Our validation split may be biased, so the predictions come out “shifted”. Threshold search is very sensitive to such a shift, and we will not get an optimal solution. However, I did use this approach for ranking the probabilities on each fold to reduce the influence of the “shift”. It worked well enough for me.
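The subsampling idea from the validation-based option can be sketched as follows. This is an illustration under stated assumptions, not the author’s linked code: the threshold grid, the subset fraction, the number of repeats, and the use of the mean as the aggregate are all choices made for this example.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1 score for binary labels given as 0/1 arrays."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(y_true, probs, grid):
    """Grid value with the highest F1 on the given data."""
    scores = [f1_score_binary(y_true, (probs >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]


def subsampled_threshold(y_true, probs, n_repeats=20, frac=0.5, seed=0):
    """Pick the best threshold on random validation subsets and average the picks."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    rng = np.random.default_rng(seed)
    grid = np.arange(0.05, 0.96, 0.01)
    size = max(1, int(frac * len(y_true)))
    picks = []
    for _ in range(n_repeats):
        idx = rng.choice(len(y_true), size=size, replace=False)
        picks.append(best_threshold(y_true[idx], probs[idx], grid))
    return float(np.mean(picks))  # assumed aggregation: mean of per-subset optima
```

Averaging over subsets smooths out the threshold that would be picked on any single split, which is the point of not selecting it on the validation part “as is”.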

What Did Not Work

During the competition, I tested tons of ideas, but only a small part of them was used in the final pipeline.

In this section, I provide an overview of some techniques which did not work for me.

A lot of stuff did not work during this competition (picture by Schmitz)

Data Augmentation and Test Time Augmentation (TTA)

The idea is to increase the size of the training dataset.

There are several ways to do that; the most commonly used are repeated translating, also known as back translation (translate the English sentence into French and then back into English), and synonym replacement.

Back translation was not an option because Internet access was not available during the 2nd stage, so I decided to focus on synonyms.

I tested two approaches:

· Split the sentence into words and, with a pre-defined probability, replace a word with the closest word by w2v embedding. Repeat a few times to get different sentences.

· Add random noise to random words in the sentence.
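The first approach can be sketched as below. This is a minimal illustration with a toy three-word vocabulary; a real run would look up neighbours in the pre-trained w2v vectors. The function names, the cosine-similarity criterion for “closest”, and the default probability are assumptions for this example.

```python
import numpy as np


def nearest_word(word, vectors):
    """Return the other vocabulary word with the highest cosine similarity."""
    target = np.asarray(vectors[word], dtype=float)
    best_word, best_sim = word, -np.inf
    for other, vec in vectors.items():
        if other == word:
            continue
        vec = np.asarray(vec, dtype=float)
        sim = target @ vec / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = other, sim
    return best_word


def augment(sentence, vectors, p=0.3, seed=0):
    """Replace each in-vocabulary word by its nearest neighbour with probability p."""
    rng = np.random.default_rng(seed)
    out = []
    for word in sentence.split():
        if word in vectors and rng.random() < p:
            out.append(nearest_word(word, vectors))
        else:
            out.append(word)
    return " ".join(out)


# Toy embedding table (hypothetical values, for illustration only).
toy_vectors = {"good": [1.0, 0.0], "great": [0.9, 0.1], "bad": [-1.0, 0.0]}
print(augment("good movie", toy_vectors, p=1.0))  # good -> great, "movie" is kept
```

Calling `augment` with different seeds gives different variants of the same sentence.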

Neither approach worked well.

I also tried combining questions, i.e. non-toxic + toxic = toxic. It didn’t work either.

Non toxic: Is there an underlying message that can be read into the many multilateral agreement exits of the Trump administration during its first year and a half?

Toxic: Lol no disrespect but I think you are ducking smart?

____________________________________________________________________

New toxic: Is there an underlying message that can be read into the many multilateral agreement exits of the Trump administration during its first year and a half?.Lol no disrespect but I think you are ducking smart?

Additional Sentence Features

I tried different features based on the sentence, but they did not help much.

Some of them reduced the score, so I decided not to include them in the final model.

Here are some of the features I’ve tested: number of words, number of upper words, number of numbers, sum/mean/max/min of numbers, number of punctuation, sum/mean/max/min of frequencies of words, number of sentences in a question, starting and ending character, etc.
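A few of these features can be computed as in the sketch below. It covers only a subset of the list, and the exact definitions (for example, how numbers and sentences are detected) are assumptions for this example rather than the ones used in the competition.

```python
import re
import string


def sentence_features(question):
    """Compute a handful of simple question-level features."""
    words = question.split()
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", question)]
    sentences = [s for s in re.split(r"[.!?]+", question) if s.strip()]
    return {
        "n_words": len(words),
        "n_upper_words": sum(w.isupper() for w in words),
        "n_numbers": len(numbers),
        "sum_numbers": sum(numbers),
        "max_numbers": max(numbers, default=0.0),
        "n_punctuation": sum(ch in string.punctuation for ch in question),
        "n_sentences": len(sentences),
        "first_char": question[:1],
        "last_char": question[-1:],
    }


print(sentence_features("WHY is 2 plus 3 equal to 5? Tell me."))
```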

Neural Net Inner Layers Output

The idea is simple: we take the output of the net’s concat layer and train a tree-based model on top of it.

I have tested this approach in recent competitions, and it always increased the score of the final ensemble, because it lets us blend diverse models (a neural net and tree-based models).

In this case, however, the increase was tiny, especially given how long it took because of the high output dimension (the concat layer has about 400 dimensions, depending on the model).
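The mechanics can be sketched on toy data. This example is only an illustration and makes several assumptions: a fixed random hidden layer stands in for the trained network’s concat layer, scikit-learn’s `GradientBoostingClassifier` stands in for whichever tree-based model was used, and the data is synthetic. In a real pipeline, the features would be the activations read from the trained net’s concat layer.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy data: 200 samples, 10 input features, binary target.
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Stand-in for a trained network: one fixed hidden layer with 16 units.
W_hidden = rng.normal(size=(10, 16))


def inner_layer_output(inputs):
    """Activations of the (stand-in) inner layer, used as features."""
    return np.tanh(inputs @ W_hidden)


# Train a tree-based model on the inner-layer activations.
features = inner_layer_output(X)
tree_model = GradientBoostingClassifier(n_estimators=50, random_state=0)
tree_model.fit(features[:150], y[:150])

# Probabilities from the tree model, ready to be blended with the net's own output.
tree_probs = tree_model.predict_proba(features[150:])[:, 1]
print(tree_probs[:5])
```

Training time grows with the number of feature columns, which is why a concat layer of about 400 dimensions makes this step expensive.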