Kaggle competition rewards

Do larger competition rewards increase participation?

Posted on January 14, 2018

Kaggle is a popular online platform for data science competitions. Competition hosts can post their data online, and data scientists from around the world compete to build the best algorithm for the host’s needs.

Competition hosts benefit when they receive a large volume of high-quality submissions and walk away from the competition with a world-class model. One way they can foster engagement is by setting a competition reward. Let’s analyze the Meta Kaggle data in Python 3 to see how reward size impacts competition engagement, as proxied by submission volume.

Analysis

First, some setup:


# imports
from os import path
import pandas as pd
import numpy as np
import sqlite3

# config
# make sure sqlite db is saved to the `data` directory first
# db can be downloaded from https://www.kaggle.com/kaggle/meta-kaggle/data
path_data = 'data'
con = sqlite3.connect(path.join(path_data, 'database.sqlite'))
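The query that builds the `usd_competitions` frame used below isn't reproduced here. A hedged reconstruction of that step, demonstrated against a tiny in-memory database standing in for `database.sqlite` (the table and column names are assumptions modeled on the fields the cleaning code uses, and the USD-only reward filter is elided):

```python
import sqlite3
import pandas as pd

# hypothetical reconstruction -- the post's actual query isn't shown;
# the filter restricting to USD-denominated rewards is elided here
query = """
SELECT c.CompetitionName,
       c.DateEnabled,
       c.Deadline,
       c.RewardQuantity,
       COUNT(s.Id) AS submission_count
FROM Competitions c
LEFT JOIN Submissions s ON s.CompetitionId = c.Id
GROUP BY c.Id
ORDER BY c.Id
"""

# tiny in-memory stand-in for database.sqlite, just to show the shape
con = sqlite3.connect(':memory:')
con.executescript("""
CREATE TABLE Competitions (
    Id INTEGER PRIMARY KEY, CompetitionName TEXT,
    DateEnabled TEXT, Deadline TEXT, RewardQuantity REAL);
CREATE TABLE Submissions (Id INTEGER PRIMARY KEY, CompetitionId INTEGER);
INSERT INTO Competitions VALUES
    (1, 'demo-comp-a', '2015-01-01', '2015-04-01', 10000),
    (2, 'demo-comp-b', '2016-06-01', '2016-09-01', 25000);
INSERT INTO Submissions VALUES (1, 1), (2, 1), (3, 2);
""")

usd_competitions = pd.read_sql_query(query, con)
print(usd_competitions[['CompetitionName', 'submission_count']])
```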

# create features and clean up data
usd_competitions['date_enabled'] = pd.to_datetime(usd_competitions.DateEnabled)
usd_competitions['deadline'] = pd.to_datetime(usd_competitions.Deadline)
usd_competitions['competition_year'] = usd_competitions.date_enabled.dt.year
usd_competitions['ln_submission_count'] = np.log(usd_competitions.submission_count.fillna(1))
usd_competitions['duration'] = (usd_competitions.deadline - usd_competitions.date_enabled).dt.days
usd_competitions = usd_competitions[usd_competitions.RewardQuantity > 1]
usd_competitions['ln_reward'] = np.log(usd_competitions.RewardQuantity)

# exclude competition `flight2-final` because it doesn't sound or look like a real competition
usd_competitions = usd_competitions[usd_competitions.CompetitionName != 'flight2-final']

print('Cleaned dataset has {:,} records with {:,} columns.'.format(*usd_competitions.shape))

Plotting competition submission count over reward value, we can see a positive relationship between submission count and reward value. This is a fairly intuitive relationship – we’d expect larger rewards to draw in more competitors, with those competitors putting in more effort on average, thereby increasing competition submission volumes.

If we’re not careful, though, we might mistakenly think reward has a much greater impact on submission volumes than is warranted. Let’s investigate further by plotting submission volumes over time along with a few other factors.

We can see a strong positive relationship between submission count and competition deadline here: average submissions per competition have doubled every 1-2 years historically (remember our y-axis is on the log scale). Much of this growth is likely due to the growing popularity of Kaggle’s platform amongst data scientists (i.e., as opposed to growth in submissions per user); word of mouth, new features such as kernels, and other UX improvements have undoubtedly contributed to this growth.

Looking at the color (competition duration) and size (reward) of the plotted bubbles, we can also see some positive relationship with submission count (and with time), though to a significantly smaller extent. Let’s formalize these observations with a simple linear regression.

First, we’ll import the relevant methods from scikit-learn and define a simple utility function to fit a model and report the model score:
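The post's helper isn't reproduced here; a minimal sketch of what it describes, assuming a `fit_and_report`-style function (the name is mine) that fits `sklearn.linear_model.LinearRegression` and prints the R² and coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_and_report(X, y, names=None):
    """Fit an OLS model and report its R^2 and coefficients."""
    model = LinearRegression().fit(X, y)
    print('R^2: {:.3f}'.format(model.score(X, y)))
    names = names or ['x{}'.format(i) for i in range(X.shape[1])]
    for name, coef in zip(names, model.coef_):
        print('  {}: {:+.3f}'.format(name, coef))
    return model

# quick sanity check on noise-free data: y = 1 + 2x
X = np.arange(10.0).reshape(-1, 1)
model = fit_and_report(X, 1 + 2 * X.ravel(), names=['x'])
```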

Looking at the R-squared (the default score metric in scikit-learn’s linear regression implementation), we find that competition reward alone explains about a quarter of the variation in submission volume. Looking at the regression coefficient, we estimate that doubling the reward will increase expected submission volumes by roughly 39%.
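The raw coefficient isn't quoted above, but in a log-log regression the conversion between coefficient and "doubling" effect is mechanical: if ln(submissions) = a + b·ln(reward), multiplying the reward by k multiplies expected submissions by k^b. Back-calculating the implied coefficient from the ~39% figure (illustrative only):

```python
import numpy as np

def doubling_effect(b):
    """Pct change in expected submissions when the reward doubles,
    given log-log coefficient b: 2**b - 1."""
    return 2 ** b - 1

# back out the coefficient implied by a ~39% doubling effect
implied_b = np.log(1.39) / np.log(2)
print('implied coefficient: {:.3f}'.format(implied_b))               # 0.475
print('doubling effect: {:.1%}'.format(doubling_effect(implied_b)))  # 39.0%
```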

However, we know that submission volume is also positively correlated with time. We can expand the regression feature-set to build a more accurate model and better understand the relationship between reward and submission volumes after accounting for other factors:
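The expanded regression isn't shown; a sketch on synthetic data, with feature names and true coefficients that are illustrative assumptions loosely echoing the effect sizes the analysis reports:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# synthetic stand-in for the cleaned competition table
df = pd.DataFrame({
    'ln_reward': rng.uniform(np.log(100), np.log(1_000_000), n),
    'competition_year': rng.integers(2010, 2018, n).astype(float),
    'duration': rng.uniform(30, 120, n),
    'max_daily_submissions': rng.integers(1, 10, n).astype(float),
})

# generate ln(submissions) with illustrative effects: reward and year
# matter, duration and the submission cap don't
df['ln_submission_count'] = (0.23 * df.ln_reward
                             + 0.43 * (df.competition_year - 2010)
                             + rng.normal(0, 0.5, n))

features = ['ln_reward', 'competition_year', 'duration', 'max_daily_submissions']
model = LinearRegression().fit(df[features], df.ln_submission_count)
print('R^2: {:.2f}'.format(model.score(df[features], df.ln_submission_count)))
for name, coef in zip(features, model.coef_):
    print('{:>22}: {:+.3f}'.format(name, coef))
```

On this synthetic table the regression recovers the planted coefficients and correctly assigns near-zero weight to the two irrelevant features, mirroring the pattern described below for the real data.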

Once we adjust for deadline, competition duration, and daily submission limits, our simple model explains a much larger share of the variation in submission volume (a little more than 50%). Furthermore, under this new model, we estimate that doubling the reward will increase expected submission volumes by roughly 17% – smaller than before, but definitely not insubstantial. Similarly, we estimate that delaying a proposed competition by a year will increase expected submission volumes by roughly 54%. Other factors, such as competition duration and maximum daily submissions, appear to play a negligible role after accounting for time and reward value.
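For the year effect the conversion differs from the reward case, because competition year enters the regression in levels rather than logs: with a logged outcome, a coefficient b on year means a one-year delay multiplies expected submissions by exp(b). Back-calculating from the ~54% figure (illustrative only; the raw coefficient isn't reported above):

```python
import numpy as np

def per_year_effect(b):
    """Pct change in expected submissions per additional year,
    given level-log coefficient b: exp(b) - 1."""
    return np.exp(b) - 1

# back out the coefficient implied by a ~54% per-year effect
implied_b = np.log(1.54)
print('implied coefficient: {:.3f}'.format(implied_b))             # 0.432
print('per-year effect: {:.1%}'.format(per_year_effect(implied_b)))  # 54.0%
```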

Lessons for hosts

So what’s the lesson for potential competition hosts? First, we found that increasing the competition prize will likely generate additional submissions (roughly 17% more submissions for a doubling of the reward), in turn increasing the likelihood of attaining a better model. We’re tempted to feel sorry for hosts who ran a competition in Kaggle’s early years; they did, after all, receive a fraction of the submission volume per dollar of prize money spent. At the same time, had they held off hoping for higher volumes as the platform increased in popularity, they would have foregone years of having a world-class model – a high cost indeed!

Next steps

This analysis constitutes a very simple look at the effect of competition reward on submission volume. There are likely other important factors which I didn’t look at, including:

competition domain: computer vision, traditional ML, etc.

organization type: for-profit, public entity, non-profit

data structure and general ease-of-use (e.g., kernel-only)

Similarly, submission counts are only a proxy for what hosts actually care about, which is probably a combination of several things like final model quality, public awareness and perception, etc. Future work might investigate alternative metrics better reflecting these qualities.

Lastly, the model we used to quantify the effect of various factors on submission volumes could be further validated (e.g., verifying that OLS regression assumptions hold, k-fold cross-validation of predictive performance).