Q&A With Job Salary Prediction First Prize Winner Vlad Mnih

I just completed a PhD in Machine Learning at the University of Toronto, where Geoffrey Hinton was my advisor. Most of my work is on applying deep learning techniques to aerial image analysis, so I have a lot of experience in training neural networks with tens of millions of parameters on big datasets.

Why did you enter?

I had a bit more spare time after completing my thesis so I decided to do a quick project before leaving Toronto. I chose this particular competition because it involved text data and, while that is not something I had a lot of experience with, it seemed like a problem where neural nets should do well (and indeed the 2nd place finisher also used a neural net).

What preprocessing and supervised learning methods did you use?

I did relatively little preprocessing and feature engineering. I used separate bags of words for the job title, description, and the raw location. I also found that stemming the words in the title and description using the Porter stemmer and encoding them using tf-idf slightly improved the performance. The other fields, like the category, contract, and source, were represented using a 1-of-K encoding. The resulting input representation had between 10000 and 15000 features depending on how many of the top words I used. I did experiment with a number of alternative features and encodings but I did not get any noticeable improvements.
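As a rough illustration of this kind of input representation (not the author's code), the sketch below builds a top-k bag of words with tf-idf weights for one text field and a 1-of-K encoding for one categorical field, then concatenates them. The helper names, the toy job ads, and the smoothed idf formula are assumptions made for the example; stemming is omitted here, though a stemmer such as NLTK's PorterStemmer could be applied to each token first.

```python
import math
from collections import Counter

def build_vocab(docs, top_k):
    """Keep the top_k most frequent tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i for i, (tok, _) in enumerate(ranked[:top_k])}

def tfidf_vectors(docs, vocab):
    """Return one dense tf-idf vector per document (smoothed idf, an assumed variant)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(
        tok for toks in tokenized for tok in set(toks) if tok in vocab
    )
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, tf in Counter(toks).items():
            if tok in vocab:
                idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
                vec[vocab[tok]] = tf * idf
        vectors.append(vec)
    return vectors

def one_hot(values):
    """1-of-K encode a list of categorical values."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [
        [1.0 if index[v] == i else 0.0 for i in range(len(categories))]
        for v in values
    ]

# Hypothetical job ads, for illustration only.
titles = ["senior software engineer", "software developer", "nurse"]
contracts = ["permanent", "contract", "permanent"]

vocab = build_vocab(titles, top_k=4)
title_features = tfidf_vectors(titles, vocab)
contract_features = one_hot(contracts)

# Concatenate the per-field blocks into a single input row per job ad.
rows = [t + c for t, c in zip(title_features, contract_features)]
```

Raising `top_k` is what would move the total feature count up or down, in the same way the number of top words does in the description above.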

For the supervised learning part, I used deep neural networks implemented on a GPU. I trained the neural nets by optimizing mean absolute error (the evaluation metric for this contest) using minibatch stochastic (sub)-gradient descent and used dropout in order to help avoid overfitting. My best single neural network achieved a score of about 3475 on the public leaderboard, but my final submission averaged the predictions of three neural networks to get down to about 3435. I did not combine neural networks with any other learning methods.
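The training recipe can be sketched on a much smaller scale. The example below is an assumption-laden toy, not the author's GPU implementation: it swaps the deep network for a linear model so that it stays short, uses made-up data and hyperparameters, and only shows the three ingredients named above, namely the sign-based subgradient of absolute error, minibatch updates, and averaging the predictions of several trained models. The `dropout` helper shows inverted dropout on an input vector and is switched off by default.

```python
import random

def mae(preds, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)

def predict(model, x):
    weights, bias = model
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sign(v):
    return (v > 0) - (v < 0)

def dropout(x, rate, rng):
    """Inverted dropout: zero each input with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def train(X, y, seed, epochs=200, batch_size=4, lr=0.05, drop_rate=0.0):
    """Minibatch subgradient descent on MAE for a linear model."""
    rng = random.Random(seed)
    weights, bias = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(weights), 0.0
            for i in batch:
                x = dropout(X[i], drop_rate, rng) if drop_rate > 0 else X[i]
                # The derivative of |pred - y| w.r.t. pred is sign(pred - y);
                # 0 is a valid subgradient at the kink where pred == y.
                g = sign(predict((weights, bias), x) - y[i])
                grad_w = [gw + g * xi for gw, xi in zip(grad_w, x)]
                grad_b += g
            weights = [w - lr * gw / len(batch) for w, gw in zip(weights, grad_w)]
            bias -= lr * grad_b / len(batch)
    return weights, bias

# Toy regression data (hypothetical): y = 3*x1 - 2*x2 + 1 on a small grid.
X = [[a / 4, b / 4] for a in range(5) for b in range(5)]
y = [3.0 * x1 - 2.0 * x2 + 1.0 for x1, x2 in X]

# Train three models that differ only in their random seed, then average them.
models = [train(X, y, seed=s) for s in (0, 1, 2)]
single_mae = mae([predict(models[0], x) for x in X], y)
ensemble_preds = [sum(predict(m, x) for m in models) / len(models) for x in X]
ensemble_mae = mae(ensemble_preds, y)
baseline_mae = mae([0.0] * len(y), y)  # error of always predicting zero
```

In a real network the same subgradient would be backpropagated through the hidden layers, and dropout would usually be applied to hidden units as well as inputs.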

This approach might sound familiar to readers of this blog because my officemates, George Dahl and Navdeep Jaitly, and their teammates recently used a nearly identical architecture in their winning entry for the Merck Molecular Activity Challenge, although there are some differences due to the particulars of that contest.

What was your most important insight into the data?

My most important insight was to simply train a powerful and flexible model by directly optimizing the loss function used to determine the winner. Some competitors used complicated ensembles of many disparate models, most of which were not optimizing the correct objective. These people needed to use leaderboard and validation error feedback much more heavily than I did since their model selection process was the only part of their pipeline that directly optimized the evaluation metric.
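A small worked example may help show why the choice of training objective matters. For a single constant prediction, squared error is minimized by the mean of the targets while absolute error is minimized by the median, so a model tuned for squared error can be noticeably off under an absolute-error metric when the targets are skewed. The salary figures below are invented for illustration and do not come from the competition data.

```python
from statistics import mean, median

def mae(pred, targets):
    """Mean absolute error of a constant prediction."""
    return sum(abs(pred - t) for t in targets) / len(targets)

def mse(pred, targets):
    """Mean squared error of a constant prediction."""
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Hypothetical, right-skewed salaries: one very high value pulls the mean up.
salaries = [20000, 22000, 25000, 27000, 120000]

mean_pred = mean(salaries)      # best constant under squared error
median_pred = median(salaries)  # best constant under absolute error

mae_of_mean = mae(mean_pred, salaries)
mae_of_median = mae(median_pred, salaries)
mse_of_mean = mse(mean_pred, salaries)
mse_of_median = mse(median_pred, salaries)
```

Here the median prediction has the lower absolute error and the mean prediction has the lower squared error, which is the mismatch that arises when a model is trained on one objective and scored on the other.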

Were you surprised by any of your insights?

I was somewhat surprised by how little improvement I got from my attempts to engineer better features. For example, I didn't get any improvement from using bigrams or from adding information derived from the normalized location or location tree. Since other competitors have reported on the competition forum that these features gave them noticeable gains in performance, I suspect that the deep nets I trained were able to learn some of these features automatically. While this is definitely a pleasing result, it is a little surprising even to neural network experts because neural nets are generally considered to be quite sensitive to the input representation.

Which tools did you use?

I used Python along with a number of open-source Python packages. I used pandas for loading and exploring the data and scikit-learn for its feature extraction pipeline, although I ended up implementing my own text vectorizers for improved memory efficiency. I also used NLTK for its implementation of the Porter stemmer. Finally, I used my own implementation of deep neural networks, which relies on Tijmen Tieleman's gnumpy library and my own cudamat library for GPU support.

What have you taken away from this competition?

I learned quite a bit about how feature engineering interacts with different neural network architectures. In particular, I thought it was really interesting that Vlado Boza placed 2nd with a completely different neural network architecture and set of features.

Vlad Mnih is a machine learning researcher based in London, England. He holds a PhD in Machine Learning from the University of Toronto and an MSc in Machine Learning from the University of Alberta.