While the winners of the competition used a mixture of machine learning, human intuition, and brute force, Damien Soukhavong (Laurae), a Competitions and Discussion Expert on Kaggle, explains in this interview how factors that deterred others from using pure machine learning methods, like the limited number of training samples, appealed to him. Read how he ingeniously minimized overfitting by testing how his XGBoost solution generalized to new image sets he created himself. Ultimately, his efforts led to an impressive leap of 242 positions from the public to the private leaderboard.

The basics

What was your background prior to entering this challenge?

I hold an MSc in Auditing, Management Accounting, and Information Systems. I have been teaching myself Data Science, Statistics, and Machine Learning since 2010. This self-teaching also helped me earn my MSc with distinction, with a thesis on data visualization. I have worked with data for about six years. Although data science can be technical, I feel I have more of a creative, design-oriented mind than a technical one!

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

I love manipulating images and generating code to process them, along with helping researchers make their research reproducible in the image manipulation field. It helped me look for convenient features that may at least generalize to unknown images unrelated to this competition.

How did you get started competing on Kaggle?

I found Kaggle by chance when a laboratory researcher asked me to look for online competitions involving data. Kaggle delighted me quickly, as it provides different competition datasets along with a convenient interface that is straightforward to use even for newcomers. There are even tutorial competitions, which are like fast introductions to data science and machine learning techniques to get you productive immediately.

What made you decide to enter this competition?

Three reasons that may look like disadvantages for entering this competition:

It is an image-based competition, requiring preprocessing of images, along with the selection of features which are not apparent at first sight.

It is about ordering images in time with a tiny dataset. Hence, one would not throw a Deep Learning model out of the box and expect it to work.

Overfitting is an issue "thanks to" the 70 training sets provided: I personally like leakage and overfitting issues, as fighting them is like avoiding the mines in a minefield (think: Minesweeper).

Let's get technical

What preprocessing and supervised learning methods did you use?

A visual overview of my methods is in the following picture:

A visual overview of methods used.

For the preprocessing method, I started by registering and masking each set of images, so that the images are aligned and contain only the information common to all the images of the same set. Then, I used a feature extraction technique to harvest general descriptive features from each image. I generated all the permutations of each set of five images along with their theoretical performance metric, raising the training sample size to 8400 (120 permutations per set, 70 sets) and reducing overfitting to a residual issue. This also turned the (5-class) ranking problem into a regression problem.
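The permutation step can be sketched as follows. This is a minimal illustration and not the author's code (which was written in R): it assumes the "theoretical performance metric" is Spearman's rank correlation between a candidate ordering and the true ordering, and the function names are hypothetical.

```python
from itertools import permutations

def spearman(order_a, order_b):
    """Spearman's rank correlation for two tie-free rankings of equal length."""
    n = len(order_a)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(order_a, order_b))
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))

def expand_set(true_order):
    """Return every candidate ordering of one image set with its score.

    true_order[i] is the day rank (1..5) of image i in the set.
    """
    return [(perm, spearman(perm, true_order))
            for perm in permutations(range(1, len(true_order) + 1))]

rows = expand_set((3, 1, 5, 2, 4))   # hypothetical true order for one set
n_training_rows = 70 * len(rows)     # 70 sets x 120 permutations per set
```

Each row pairs one candidate ordering with a continuous target, which is what lets a regression model be trained on what was originally a ranking problem.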

For the supervised learning method, I used Extreme Gradient Boosting (XGBoost) with a custom objective and custom evaluation function.

What was your most important insight into the data?

Hand labeling images was easy once you trained yourself to recognize the related features (think: the neural network in your brain), and I could have gone for a (near) perfect score if I had wanted to. However, my interest was in using pure machine learning techniques, which generalize to unknown samples. Thus, I did not explore the manual way for too long.

A simple example for hand-labeling using objects:

Simple example for hand-labeling.

I made a tutorial for recognizing and quantifying the appearance and removal of objects:

Were you surprised by any of your findings?

I have found several interesting findings that surprised me:

Leave-one-out cross-validation (LOOCV) might be a good method for validating a supervised machine learning model on the tiny training set. However, some image sets leak into others, which makes the cross-validation invalid right from the beginning!

Using the file size of pictures was leaking information out of the box… I believe one can get 0.30 (or more) using only the image file sizes with a simple predictive model.

Working with JPEG (1GB) pictures instead of TIFF (32GB) worked better than expected (I was expecting around 0.20 only).

Using a scoring model built from the three highest scoring predictions gave a slight boost on the performance metric. However, using the predicted scores from all 120 permutations of each set to create the scoring model gave worse results than random predictions (overfitting).

This shows the model used can generalize to Draper's test sets. However, one must test the model in a real situation. Does it generalize? To test this hypothesis, I took two different image sets:

Pictures of different areas that I took myself while commuting on workdays (250 pictures, 50 sets)

On the Google Earth set, computing the performance metric from my predictions gave a score of 0.112, which is better than pure randomness. There is a slight bias towards a good prediction, though. Having access to the predictions and to the right order of the pictures, I assumed a uniform distribution of the predictions and ran a Monte-Carlo simulation over it. After 100,000 trials, here are the results:

Results of Monte-Carlo simulation.

I will not say it is bad, but it is not that good either. The mean is 0.09, while the standard deviation is 0.07. Assuming a normal distribution, we have about a 90% chance of predicting better than random numbers. Reaching 0.442 (the maximum after 100,000 random tries) is clearly not straightforward.

Now, for my personal pictures, I got... only 0.048, which is disappointing. To ensure this was not a mistake, I ran the same Monte-Carlo simulation I used previously, but on the new predictions:

Results of second Monte-Carlo simulation

Clearly, having a 64% chance of predicting better than random is on the low end. The mean is 0.02, and the standard deviation is 0.07. With more samples to train the model on, overfitting should be less of an issue than it is on non-aerial imagery.
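The quoted percentages can be approximated from the reported means and standard deviations. The sketch below assumes that "better than random" means a score above zero and that the simulated scores are normally distributed. Both are interpretations, not details given in the interview.

```python
from math import erf, sqrt

def prob_above_zero(mean, std):
    """P(X > 0) for X ~ Normal(mean, std), via the standard normal CDF."""
    return 0.5 * (1 + erf(mean / (std * sqrt(2))))

p_google_earth = prob_above_zero(0.09, 0.07)  # close to the quoted 90%
p_personal = prob_above_zero(0.02, 0.07)      # rounded inputs give slightly less than the quoted 64%
```

With the rounded figures, the second probability comes out a little below 64%. The quoted value may have been computed from unrounded statistics.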

Which tools did you use?

I used macros in ImageJ for the feature extraction (more specifically: Fiji version, a supercharged ImageJ version), and R + XGBoost for the final preprocessing and supervised learning. When using XGBoost, I used my custom objective and evaluation functions to not only get the global performance metric, but also account for the ranking of predictions in the set (for voting which picture order is the most probable).

I used XGBoost version 0.47. The current XGBoost version, 0.60, refuses to run my objective and evaluation functions properly.
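The voting step, where predicted scores for the permutations of a set are turned into one final picture order, might look like the sketch below. It is an assumption-laden illustration in Python rather than the author's R code: `vote_order` and `top_k` are hypothetical names, and averaging the ranks of the three best-scored permutations is only one possible reading of the "scoring model from the three highest scoring predictions" mentioned earlier.

```python
from itertools import permutations

def vote_order(predicted_scores, top_k=3):
    """Combine the top_k best-scored permutations into one final ranking.

    predicted_scores maps each candidate permutation (a tuple of day ranks,
    one per image) to the model's predicted metric for that permutation.
    """
    best = sorted(predicted_scores, key=predicted_scores.get, reverse=True)[:top_k]
    n_images = len(best[0])
    # Average the day rank each image receives among the top_k candidates.
    mean_rank = [sum(perm[i] for perm in best) / len(best) for i in range(n_images)]
    # Convert the averages back to ranks 1..n (ties broken by image index).
    by_rank = sorted(range(n_images), key=lambda i: (mean_rank[i], i))
    final = [0] * n_images
    for rank, image in enumerate(by_rank, start=1):
        final[image] = rank
    return tuple(final)

# Toy usage with made-up scores: candidates closer to `target` score higher.
target = (2, 1, 3, 5, 4)
scores = {p: -sum((a - b) ** 2 for a, b in zip(p, target))
          for p in permutations(range(1, 6))}
result = vote_order(scores)
```

Setting `top_k=1` reduces this to simply picking the single best-scored permutation.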

How did you spend your time on this competition?

I spent about 20% of the time reading the threads on the forum, as they are a wealth of information you may not find yourself. 60% of the time was spent on preprocessing and feature engineering, and the 20% left on predictive modeling, validating, and submitting.

I also tested different models and features after my first idea (ironically, it consumed over 1 day):

Deep Learning gives no result, as it predicts random numbers and never seems to converge even after 500 epochs (I tried data augmentation, image manipulation, and many architectures such as CaffeNet, GoogLeNet, VGG-16, ResNet-19, ResNet-50…).

Neural networks on the file sizes… are incredible as they abuse leakage. They will not generalize on a real scenario.

Random Forests are… overfitting severely and hard to control. They do overfit with enough noise, and this dataset is a good one for such an issue (in pictures there can be… clouds!).

Data Visualization using Tableau, to look for the best interactions between features. It is clearly hard to notice the interactions, although XGBoost managed a great score for such a hard task.

What was the run time for both training and prediction of your solution?

Early on after entering the competition, I very quickly set up a macro in ImageJ to extract features and save them in a CSV file. It took about 1 hour, which allowed me to set up the final preprocessing and XGBoost code properly in R. Afterwards, it took only 30 minutes to make the code work properly, preprocess the data, train the model, and make my first submission. I spent 10 more minutes cross-validating using five targeted folds, built by categorizing pictures by theme, and making a new submission (which was slightly better than the former).
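One way to build such theme-based folds is sketched below. It is a hedged illustration rather than the author's procedure: the theme labels and the `themed_folds` helper are hypothetical, and the balancing rule (largest theme into the currently smallest fold) is an arbitrary choice. The point is that whole image sets sharing a theme stay in the same fold, so related sets cannot leak between training and validation.

```python
from collections import defaultdict

def themed_folds(set_themes, n_folds=5):
    """Assign whole image sets to folds so that one theme never spans two folds.

    set_themes maps a set id to a theme label (hypothetical labels here).
    Themes are placed largest first into the currently smallest fold.
    """
    by_theme = defaultdict(list)
    for set_id, theme in set_themes.items():
        by_theme[theme].append(set_id)
    folds = [[] for _ in range(n_folds)]
    for theme in sorted(by_theme, key=lambda t: (-len(by_theme[t]), t)):
        smallest = min(folds, key=len)
        smallest.extend(by_theme[theme])
    return folds

# Toy usage: 12 image sets with made-up themes.
themes = ["city", "coast", "desert", "farm", "forest", "harbor"]
set_themes = {set_id: themes[set_id % len(themes)] for set_id in range(12)}
folds = themed_folds(set_themes)
```

Because folds hold set ids, every permutation generated from a given set follows that set into the same fold.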

In total, this took me about an hour and a half. Knowing my cross-validation method was correct, I did not care about seeing such a low-performing score on the public leaderboard, which used only 17% of the testing samples (thanks to the forum threads). This ensured a push of over 242 places from the public to the private leaderboard (ending 32nd), the former being sample feedback on unknown samples, and the latter being the one where we must maximize our performance (but we cannot see this private leaderboard until the end of the competition).

Words of wisdom

The future of satellite imagery and technology will allow us to revisit locations with unprecedented frequency. Now that you've participated in this competition, what broader applications do you think there could be for images such as these?

There are many applications for satellite imagery and its associated technology. As a spatial reconstruction on a time dimension, one may:

Analyze the land usage over time

Plan the development of urban and rural areas

Manage resources and disasters in a better way (like analyzing the damage of a fire)

Work on a very large database usable for virtual reality (so you can feel how San Francisco looks from your own town for instance)

Initiate statistical studies

Regulate the environment (in a legal way)

Pinpoint tactical and safe areas for humans (moving hospitals, etc.)

Currently, there are satellites capable of recording things we cannot see as humans. One of the best known is Landsat, whose pictures are usable for finding minerals and many other diverse elements (gas, oil...). Using data science and machine learning should bring a hefty boom in predictive modeling from satellite imagery for businesses.

What have you taken away from this competition?

The benefits I have taken from this competition were:

Working with a tiny training set, as there were only 70 training sets.

Fighting overfitting, as learning the observations is easier than learning to generalize with such a tiny training set!

Rationally transforming the ranking problem into a regression problem, a skill that requires regular practice and smart ideas.

A minor benefit was using ImageJ not for pure research, but for a data science competition. I was not expecting to use it at all. I provided an example starter script for using ImageJ in this competition:

Do you have any advice for those just getting started in data science?

Several key pieces of advice for newcomers in data science:

I cannot say it enough times: efficiency is key.

Another important piece of advice: learn to tell stories (storytelling). Non-technical managers will not care about "but my XGBoost model had 92% accuracy!" when they ask you what value such a model brings to their business environment. A good story is worth thousands of words in less than ten seconds!

More specific advice about predictive modeling for newcomers:

Knowing your features and understanding how to validate your models allows you to fight overfitting and underfitting appropriately.

Understanding the "business" behind what you are doing in data science when working with a dataset has high value: domain knowledge is a starting key to success.

Creating (only) (good) predictive models in a business environment does not mean you are a good data scientist: statistical and business knowledge must be learnt one way or another.

Thinking you can go the XGBoost way in any industry is naïve. Try to explain all the interactions caught between [insert 100+ feature names] in a 100+ tree XGBoost model.

Your manager will not like seeing you spend a week of work improving your model accuracy from 55.01% to 55.02%. Where is the value?

N.B.: Many ideas you may have can fail miserably. It is valuable to be brave and scrap what you did to start from scratch. Otherwise, you might stay stuck on the wrong path forever in a specific project, trying to push something you cannot push any further. Learning from mistakes is a key factor for self-improvement.

Choosing the appropriate way to validate a model depends on the dataset, and this comes with experience.

Looking for potential leakage, as it may be what invalidates your model in the end when dealing with unknown samples.

Engineering features pushes your score higher than tuning hyperparameters in 99.99% of cases, unless you are using a horrible combination of hyperparameters right at the beginning.

Just for fun

If you could run a Kaggle competition, what problem would you want to pose to other Kagglers?

I would run a data compression problem. It would open the way for finding novel methods to optimize the loss of information to a minimum, while decreasing the feature count to a bare minimum, using supervised methods.

What is your dream job?

Being a data science evangelist and managing talents!

Bio

Damien Soukhavong is a data science and artificial intelligence trainer. A graduate with an MSc in Auditing, Accounting Management, and Information Systems, he seeks to maximize the value of data in companies using data science and machine learning, while looking for efficiency in performance. He individually mentors creative professionals and designers who are investing time to bring data science into their daily creative/design work.
