A few months ago, Yelp partnered with Kaggle on an image classification competition that ran from December 2015 to April 2016. 355 Kagglers accepted Yelp’s challenge to predict restaurant attributes using nothing but user-submitted photos. We’d like to thank all the participants who made this an exciting competition!

Dmitrii Tsybulevskii took the cake by finishing in 1st place. In this blog post, Dmitrii dishes on the details of his approach, including how he tackled the multi-label and multi-instance aspects that made this problem a unique challenge.

The Basics

I hold a degree in Applied Mathematics, and I’m currently working as a software engineer on computer vision, information retrieval and machine learning projects.

Dmitrii on Kaggle

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Yes, since I work as a computer vision engineer, I have image classification experience, deep learning knowledge, and so on.

How did you get started competing on Kaggle?

I first came to Kaggle through the MNIST competition because I was interested in image classification. Then I was drawn to other kinds of ML problems, and data science just blew my mind.

What made you decide to enter this competition?

There are several reasons behind it:

I like competitions with raw data, without any anonymized features, and where you can apply a lot of feature engineering.

The dataset is quite large and poses a rare type of problem (multi-label, multi-instance), so it was a good opportunity to learn something new.

Let’s get technical:

What preprocessing and supervised learning methods did you use?

An outline of my approach is depicted below:

Photo-level feature extraction

One of the most important things you need for training deep neural networks is a clean dataset. So, after viewing the data, I decided not to train a neural network from scratch and not to do fine-tuning. Instead, I tried several state-of-the-art pretrained neural networks and several layers from which to extract features. The best-performing nets, in decreasing order, were:

The best features were obtained from the antepenultimate layer, because the last layers of pretrained nets are too “overfitted” to the ImageNet classes, and lower-level features can give you a better result. But in this case the dimensionality of the features is much higher (50176 for the antepenultimate layer of “Full ImageNet trained Inception-BN”), so I used PCA compression with the ARPACK solver in order to compute only a few principal components. In most cases feature normalization was used.
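The compression step can be sketched with scikit-learn, whose `PCA` class accepts `svd_solver="arpack"` and then computes only the requested leading components. This is a minimal illustration, not the actual competition code: the random stand-in features, the reduced input dimensionality, the number of components, and the choice of L2 normalization after PCA are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Stand-in for photo-level CNN features (assumption). The real
# antepenultimate-layer features had 50176 dimensions; a smaller
# size is used here so the sketch runs quickly.
features = rng.normal(size=(200, 512))

# With svd_solver="arpack", only the leading n_components are computed.
# n_components must be strictly less than min(n_samples, n_features).
pca = PCA(n_components=32, svd_solver="arpack", random_state=0)
compressed = pca.fit_transform(features)

# L2-normalize each photo's feature vector (the kind of normalization
# is an assumption; the text only says normalization was used).
compressed = normalize(compressed, norm="l2")

print(compressed.shape)
```

The truncated solver avoids a full decomposition of a very wide feature matrix, which is the point of asking for only a few components.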

How did you deal with the multi-instance aspect of this problem?

In this problem we only needed bag-level predictions, which makes it much simpler than instance-level multi-instance learning. I used a paradigm called “Embedded Space”, following the paper “Multiple Instance Classification: review, taxonomy and comparative study”. In the Embedded Space paradigm, each bag X is mapped to a single feature vector that summarizes the relevant information about the whole bag X. After this transform you can use ordinary supervised classification methods.
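As a minimal sketch of the Embedded Space idea, the simplest possible mapping is to average the photo-level feature vectors of each business. Mean pooling is used here only for illustration and is not presented as the aggregation used in the winning solution; the function name `embed_bags` and the toy data are hypothetical.

```python
import numpy as np

def embed_bags(photo_features, photo_to_business):
    """Map each bag (business) to the mean of its photo feature vectors."""
    business_ids = np.unique(photo_to_business)
    embedded = np.stack([
        photo_features[photo_to_business == b].mean(axis=0)
        for b in business_ids
    ])
    return business_ids, embedded

# Toy data (hypothetical): three photos, two businesses with ids 7 and 9.
photo_features = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]])
photo_to_business = np.array([7, 7, 9])

ids, bag_vectors = embed_bags(photo_features, photo_to_business)
print(ids)          # one id per bag
print(bag_vectors)  # one fixed-length vector per bag
```

Whatever the number of photos in a bag, the output has one row per business, so any ordinary classifier can be trained on it.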

The Fisher Vector (FV) was the best-performing image classification method before the advent of deep learning in 2012. Usually an FV was used as a global image descriptor obtained from a set of local image features (e.g. SIFT), but in this competition I used FVs to aggregate the set of photo-level features into a business-level feature. With Fisher Vectors you can take the multi-instance nature of the problem into account.
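The aggregation can be sketched as follows, assuming a diagonal-covariance Gaussian mixture model has already been fitted on photo-level features (for example with scikit-learn's `GaussianMixture`). The sketch follows the common "improved Fisher Vector" formulation with power and L2 normalization. The GMM parameters and data below are made up, and the exact variant used in the competition is not stated in the text.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Encode a bag X of shape (N, D) with a diagonal GMM.

    weights: (K,), means and variances: (K, D). Returns a 2*K*D vector.
    """
    N = X.shape[0]
    diff = X[:, None, :] - means[None, :, :]              # (N, K, D)

    # Log of w_k * N(x | mu_k, diag(var_k)) for every photo and component.
    log_prob = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None], axis=2))
    log_prob -= log_prob.max(axis=1, keepdims=True)       # numerical stability
    gamma = np.exp(log_prob)
    gamma /= gamma.sum(axis=1, keepdims=True)             # posteriors, (N, K)

    z = diff / np.sqrt(variances)[None]                   # whitened deviations
    g_mu = (gamma[:, :, None] * z).sum(axis=0) / (N * np.sqrt(weights)[:, None])
    g_var = ((gamma[:, :, None] * (z ** 2 - 1)).sum(axis=0)
             / (N * np.sqrt(2 * weights)[:, None]))

    fv = np.concatenate([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                # power normalization
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv                  # L2 normalization

# Hypothetical GMM with K=2 components over D=3 dimensional photo features.
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
variances = np.ones((2, 3))

rng = np.random.default_rng(0)
fv_small = fisher_vector(rng.normal(size=(5, 3)), weights, means, variances)
fv_large = fisher_vector(rng.normal(size=(8, 3)), weights, means, variances)
print(fv_small.shape, fv_large.shape)
```

Bags with 5 and 8 photos both yield vectors of the same length, which is what lets a business with any number of photos feed a standard classifier.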

This network shares weights across the different label-learning tasks, and performs better than several binary relevance (BR) or ensemble of classifier chains (ECC) neural networks with binary outputs, because it takes the multi-label aspect of the problem into account.
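To illustrate what sharing weights across label tasks means, here is a small NumPy forward pass with one shared hidden layer and one sigmoid output per label, scored with a summed binary cross-entropy. The layer sizes, activation functions, number of labels, and loss are all assumptions made for the sketch; the text does not describe the actual architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # The hidden layer is shared by every label task.
    hidden = np.maximum(0.0, X @ W1 + b1)
    # One sigmoid output per label, so several labels can be active at once.
    return sigmoid(hidden @ W2 + b2)

def multilabel_bce(probs, Y, eps=1e-12):
    # Binary cross-entropy summed over labels, averaged over examples.
    probs = np.clip(probs, eps, 1 - eps)
    per_example = np.sum(Y * np.log(probs) + (1 - Y) * np.log(1 - probs), axis=1)
    return -np.mean(per_example)

rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 16, 8, 9    # hypothetical sizes

X = rng.normal(size=(4, n_features))          # business-level features
Y = rng.integers(0, 2, size=(4, n_labels)).astype(float)

W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_labels))
b2 = np.zeros(n_labels)

probs = forward(X, W1, b1, W2, b2)
loss = multilabel_bce(probs, Y)
print(probs.shape, loss)
```

Because every label's gradient flows through `W1`, the hidden representation is trained on all labels jointly, unlike a set of independent binary classifiers.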

Classification

At the weighting stage, the neural network has a much higher weight (6) than LR (1) and XGB (1). Finally, the 0/1 labels were obtained with simple thresholding, using the same threshold value for all labels.
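The weighting and thresholding can be sketched as a weighted average of the three models' predicted probabilities followed by a single global cut-off. The weights 6, 1, 1 come from the text; normalizing by their sum, the threshold value of 0.5, and the toy probabilities are assumptions.

```python
import numpy as np

def weighted_ensemble(prob_nn, prob_lr, prob_xgb,
                      weights=(6.0, 1.0, 1.0), threshold=0.5):
    """Blend per-label probabilities and apply one threshold to all labels."""
    w = np.asarray(weights)
    stacked = np.stack([prob_nn, prob_lr, prob_xgb])    # (3, n_businesses, n_labels)
    blended = np.tensordot(w, stacked, axes=1) / w.sum()
    labels = (blended >= threshold).astype(int)
    return blended, labels

# Toy predictions (hypothetical) for one business and two labels.
prob_nn = np.array([[0.9, 0.2]])
prob_lr = np.array([[0.1, 0.8]])
prob_xgb = np.array([[0.5, 0.6]])

blended, labels = weighted_ensemble(prob_nn, prob_lr, prob_xgb)
print(blended)  # [[0.75  0.325]]
print(labels)   # [[1 0]]
```

With these weights the neural network dominates: on the second label both LR and XGB vote above 0.5, yet the blended score stays below the threshold.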