Indiahacks Machine Learning competition is an All India machine learning competition conducted once in a year. I participated in the qualification round and secured 6th position(out of 6000 participants), which is Top 1%. Only top 60 participants were selected to participate in offline zonal round. However, I was unable to participate in zonal round since I was traveling.

Note: Code is not production ready yet, so not sharing it on github. Will share it when I get some free time.

The challenge was to predict the segment(pos, neg) based on the given features:

ID: unique identifier variable

titles: format “title:watch_time”, titles of the shows watched by the user and watch_time on different titles

genres: same format as titles

cities: same format as titles

tod: total watch time of the user spread across different time of days (24 hours format)

dow: total watch time of the user spread across different days of week (7 days format)

Model Pipeline:

Features extracted from text variables:

Titles variable: I used word embedding using word2vec, a deep learning technique which maps similar words to context after trying Bag of Words. Word embedding improved my validation score significantly.

Tod variable: Several features were extracted from total watch time column out of which I ended up using the following features:

tod_median_time

tod_min_time

dow_max_time

watch time counts at hours (0 to 23) mapped from t0 to t23

tod_start

tod_end

Days

Cities variable: Several features were extracted out of which I ended up using the following features:

cities_min_time

cities_count

Genres variable: Extracted the genres from this column and mapped each genre as a binary feature.

Apart from these, I extracted the following features:

genres_min_time

genres_max_time

genres_mean_time

genres_median_time

Other text related features which improved by validation score are:

titles length

titles count

cities strings

Features extracted from numeric columns:

dow variable:

Binary features for each day of the week starting from Monday to Sunday.

Watch time spent on each day from Monday to Sunday.

Model Evaluation:

I tried several different models which produced the following scores:

Model

Score

Logistic Regression

0.70

Linear Discriminant Analysis

0.79

K Nearest Neighbors

0.59

CART

0.55

AdaBoost

0.79

Gradient Boosting

0.80

Random Forests

0.69

Extra Trees Regressor

0.70

LightGBM(after hyper parameter tuning)

0.822

Xgboost(after hyper parameter tuning)

0.821

I ended up using LightGBM to generate my predictions.

Hyper parameter tuning:

Used hyperopt to automatically find the right hyper parameters which improved my validation score for LightGBM model