Real-time Leaderboards

Automated Submission Scoring

Kernels

Forums

Preloaded Evaluation Metrics

Résumé Capture

Portfolio Profiles

Candidate Score Report

IP License Options

Built-in Benchmarks

Product Tutorial Integrations

What is supervised machine learning?

Kaggle competitions are a powerful tool in one specific area of data science: supervised learning. This is a type of machine learning in which algorithms are trained to infer the relationship between input variables and the actual outcome value (the ‘ground truth’). Given enough labeled examples, the algorithm develops an inferred function that can be applied to new examples to generate predicted outcomes.
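To make the idea of an ‘inferred function’ concrete, here is a minimal sketch using a one-variable least-squares line. The function name `fit_line` and the toy numbers are illustrative assumptions, not anything Kaggle provides; real competition models are usually far more complex.

```python
def fit_line(xs, ys):
    """Infer a function y = slope * x + intercept from labeled examples.

    xs are the input values, ys are the ground truth outcomes.
    Ordinary least squares is used here purely as an illustration.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned callable is the "inferred function".
    return lambda x: slope * x + intercept


# Toy labeled examples (hypothetical): the outcome happens to be 2 * x + 1.
train_inputs = [1.0, 2.0, 3.0, 4.0]
train_outcomes = [3.0, 5.0, 7.0, 9.0]

predict = fit_line(train_inputs, train_outcomes)
print(predict(5.0))  # prediction for an input the algorithm has not seen
```

The algorithm is never told the rule behind the data. It only sees inputs paired with ground truth outcomes, then applies what it inferred to a new input.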

On a practical level, it means that to run a Kaggle competition a host must provide a dataset for which the target predicted value already has a ground truth outcome available. Kaggle will use this ground truth as an answer key to score the accuracy of participants’ submissions in real time over the course of the competition.

For example, let’s say a company that manages a chain of hundreds of restaurants is interested in predicting the revenue of each location on a daily basis. They believe revenue is affected by factors like location, day of the week, holidays, sporting events, and clientele, but they don’t know what impact, if any, each of these factors has on each location’s revenue. In order to develop a more accurate forecasting tool, they would provide a dataset in which those factors have been transformed into ‘features’ for each of their restaurants, and they’d include the actual daily revenue for each location over a period of several years. Kaggle would take all of this data and split it into two groups:

Training: the input and outcome data (features + actual revenue) that data scientists will use to train algorithms to infer relationships.
Test: the input data only for the remainder of the dataset, which data scientists will run their models against to predict daily revenue, based on those factors, for the given time period.
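The split described above can be sketched roughly as follows. This is a simplified illustration, not Kaggle’s actual process: the column names, the synthetic revenue figures, and the choice to split on a cutoff date are all assumptions made for the example.

```python
import random
from datetime import date, timedelta


def make_rows(n_locations=3, n_days=10, seed=0):
    """Build a tiny synthetic dataset: one row per location per day."""
    rng = random.Random(seed)
    start = date(2020, 1, 1)
    rows = []
    for loc in range(n_locations):
        for offset in range(n_days):
            day = start + timedelta(days=offset)
            rows.append({
                "location_id": loc,
                "date": day,
                "day_of_week": day.weekday(),  # an example feature
                "revenue": round(rng.uniform(1000, 5000), 2),  # ground truth
            })
    return rows


def split_train_test(rows, cutoff, target="revenue"):
    """Split rows by date (an assumed rule, chosen for this sketch).

    train:      features + ground truth, for rows before the cutoff
    test:       features only, for rows on or after the cutoff
    answer_key: the withheld ground truth for the test rows
    """
    train = [dict(r) for r in rows if r["date"] < cutoff]
    held_out = [r for r in rows if r["date"] >= cutoff]
    test = [{k: v for k, v in r.items() if k != target} for r in held_out]
    answer_key = [
        {"location_id": r["location_id"], "date": r["date"], target: r[target]}
        for r in held_out
    ]
    return train, test, answer_key


rows = make_rows()
train, test, answer_key = split_train_test(rows, cutoff=date(2020, 1, 8))
print(len(train), len(test), len(answer_key))
```

Participants would receive `train` and `test`. The `answer_key` stays with the platform and is only used for scoring.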

Competition participants submit their predictions for the test dataset to Kaggle for scoring. While the competition is live, Kaggle gives participants a score on the accuracy of their predictions that’s calculated on half of the data in the test set (the public score). The other half of the data generates a score that’s used to determine the final standings of the competition (the private score). Kaggle uses these two scores to protect hosts from receiving overfit models from winners.
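As a rough illustration of the two scores, the sketch below divides the test rows into a public half and a private half and scores a submission on each. The random assignment of rows to halves, the use of root mean squared error as the metric, and all function names are assumptions for this example; the actual metric depends on the competition.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error (an assumed metric; lower is better)."""
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))


def assign_public_private(row_ids, seed=0):
    """Randomly assign test rows to a public half and a private half."""
    ids = sorted(row_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


def score_submission(answer_key, submission, public_ids, private_ids):
    """Return (public_score, private_score) for one submission.

    answer_key and submission both map a test row id to a value.
    """
    def score(ids):
        ordered = sorted(ids)
        return rmse([answer_key[i] for i in ordered],
                    [submission[i] for i in ordered])

    return score(public_ids), score(private_ids)


# Hypothetical ground truth and a submission that is off by 1.0 on every row.
answer_key = {i: float(i) for i in range(10)}
submission = {i: float(i) + 1.0 for i in range(10)}

public_ids, private_ids = assign_public_private(answer_key.keys())
public_score, private_score = score_submission(
    answer_key, submission, public_ids, private_ids
)
print(public_score, private_score)
```

Because participants only ever see the public score while the competition is live, a model tuned to chase that number tends to do worse on the private half, which is what decides the final standings.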

Use cases for supervised learning exist in every industry, and Kaggle has direct experience running competitions in many of them. Some common applications of supervised learning include:

Insurance: Actuaries are considered some of the earliest adopters of machine learning in their work assessing risk. Liberty Mutual asked Kagglers to develop a model that predicts the risk of insuring properties based on a dataset of pre-existing hazards.

Health Care: The advent of deep learning opened the door for significant innovation in the way doctors use medical imaging and algorithms to diagnose disease. For example, in the Diabetic Retinopathy challenge Kagglers developed algorithms to detect the disease in images of the eye. The winning algorithm performs with the same level of accuracy as doctors.

Ecommerce: With the plethora of data generated by users browsing, buying, and sharing content online, there is no limit to what we can learn, and then predict, about consumers’ behavior. Expedia wanted to know what types of hotels travelers would book next based on their browsing habits, and The Home Depot wanted to make browsing easier for customers by improving their site search engine.

Manufacturing: Given the many nuances and complexities of most manufacturing processes, there’s ample opportunity for algorithms to anticipate anything from materials demand to production errors. Caterpillar manages an extensive supply chain for the manufacturing of their machines, so in their competition they asked Kagglers to predict the input cost of one core component.

In order to host a supervised learning competition, you must provide:

A dataset of features and variables

A target variable (the variable the algorithms should predict)

A ground truth label for that target variable

For all of the above, a dataset of sufficient volume and depth to support machine learning. The required volume will vary based on the nature of the competition and dataset, but as a general rule of thumb it means tens to hundreds of thousands of rows of data, and tens to hundreds of features for each row.

Pricing

Here’s a basic look at how Kaggle competitions are priced. The cost of your competition will vary based on a number of factors. We’ll only be able to share a detailed price proposal with you after learning more about your dataset and business problem.