
4.
Structure of a Data Science Competition
Build model using Training Data to
predict outcomes on Private LB Data
[Diagram: data split into Training, Public LB (validation), and Private LB (holdout); the public LB gives quick but often misleading feedback]
Data Science Competitions remind us that the purpose of a
predictive model is to predict on data that we have NOT seen.

5.
A little “philosophy”
● There are many ways to overfit
● Beware of “multiple comparison fallacy”
○ There is a cost in “peeking at the answer”
○ Usually the first idea (if it works) is the best
“Think” more, “try” less

8.
Technical Tricks -- GBM needs TLC too
● Tuning parameters
○ Learning rate + number of trees
■ Usually small learning rate + many trees work
well. I target 1000 trees and tune learning rate
○ Number of observations in leaf
■ How many observations do you need to get a good
mean estimate?
○ Interaction depth
■ Don’t be afraid to use 10+, this is (roughly) the
number of leaf nodes
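The tuning strategy above can be sketched with scikit-learn's GradientBoostingRegressor; the dataset and parameter values below are illustrative, not from the slides (note that sklearn's `max_depth` is tree depth, while R gbm's `interaction.depth` is roughly the number of splits).

```python
# Minimal sketch of the GBM tuning strategy: fix a large number of trees,
# tune the learning rate, and control leaf size for stable mean estimates.
# Synthetic data and specific values are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,    # "target 1000 trees" and tune the rate instead
    learning_rate=0.01,   # small learning rate, tuned via CV
    max_depth=10,         # allow deep interactions (don't be afraid of 10+)
    min_samples_leaf=20,  # enough observations per leaf for a good mean
    random_state=0,
)
score = cross_val_score(model, X, y, cv=3).mean()
print(round(score, 3))
```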

9.
Technical Tricks -- when GBM needs help
● High cardinality features
○ Convert into numerical with preprocessing --
out-of-fold average, counts, etc.
○ Use Ridge regression (or similar) and
■ use out-of-fold prediction as input to GBM
■ or blend
○ Be brave, use N-way interactions
■ I used 7-way interaction in the Amazon
competition.
● GBM with out-of-fold treatment of high-cardinality
features performs very well
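The out-of-fold average trick can be sketched as follows; the column names and fold count are made-up assumptions, and the key point is that each row's encoding is computed without its own target value.

```python
# Sketch of out-of-fold average (target) encoding for a high-cardinality
# categorical feature; "city" and "y" are hypothetical column names.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "city": ["a", "a", "b", "b", "b", "c", "c", "a", "b", "c"],
    "y":    [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

global_mean = df["y"].mean()
df["city_enc"] = global_mean  # fallback for categories unseen in a fold

# Encode each held-out fold using target means computed on the other
# folds only, so the encoding never "peeks" at its own target.
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    means = df.iloc[train_idx].groupby("city")["y"].mean()
    df.loc[df.index[valid_idx], "city_enc"] = (
        df.iloc[valid_idx]["city"].map(means).fillna(global_mean).values
    )

print(df)
```

The resulting `city_enc` column is a numeric feature that can be fed directly to a GBM, or blended via a ridge model as the slide suggests.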

10.
Technical Tricks -- feature engineering in GBM
● GBM can only APPROXIMATE interactions and non-linear
transformations
● Strong interactions benefit from being explicitly
defined
○ Especially ratios/sums/differences among
features
● GBM cannot capture complex features such as
“average sales in the previous period for this type of
product”
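The point about explicit interactions can be illustrated in a few lines; the feature names below are hypothetical.

```python
# Sketch of explicitly defining ratio/sum/difference features before a GBM;
# "price"/"cost" are illustrative column names, not from the slides.
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 15.0], "cost": [8.0, 12.0, 15.0]})

# A tree ensemble can only approximate price/cost through many
# axis-aligned splits, so supply the ratio and difference directly.
df["margin"] = df["price"] - df["cost"]
df["price_to_cost"] = df["price"] / df["cost"]
print(df)
```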

11.
Technical Tricks -- Glmnet
● From a methodology perspective, the opposite of
GBM
● Captures (log/logistic) linear relationship
● Works with a very small # of rows (a few hundred or
even fewer)
● Complements GBM very well in a blend
● Needs a lot more work
○ missing values, outliers, transformations (log?),
interactions
● The sparsity assumption -- L1 vs L2
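The L1-vs-L2 trade-off can be sketched with scikit-learn's ElasticNet (a stand-in for glmnet, which mixes the two penalties via `l1_ratio`); the synthetic data and penalty values are illustrative assumptions.

```python
# Sketch of the sparsity assumption: with a mostly-L1 penalty, a
# penalized linear model zeroes out uninformative coefficients.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Few rows, many features: the regime where penalized linear models shine.
X, y = make_regression(n_samples=100, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# l1_ratio=1.0 is pure lasso (sparse); l1_ratio=0.0 is pure ridge.
model = ElasticNet(alpha=1.0, l1_ratio=0.9)
model.fit(X, y)

n_nonzero = int(np.sum(model.coef_ != 0))
print(n_nonzero)  # far fewer than 50 coefficients survive the L1 penalty
```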

13.
Technical Tricks -- Blending
● All models are wrong, but some are useful (George
Box)
○ The hope is that they are wrong in different ways
● When in doubt, use average blender
● Beware of temptation to overfit public leaderboard
○ Use public LB + training CV
● The strongest individual model does not necessarily
make the best blend
○ Sometimes intentionally built weak models are good blending
candidates -- Liberty Mutual Competition
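The "average blender" amounts to a one-liner; the prediction arrays below are illustrative.

```python
# Sketch of the average blender: simply average predictions from
# diverse models, hoping they are "wrong in different ways".
import numpy as np

pred_gbm = np.array([0.80, 0.20, 0.60])     # e.g. from a GBM
pred_glmnet = np.array([0.70, 0.30, 0.40])  # e.g. from a penalized linear model

blend = (pred_gbm + pred_glmnet) / 2
print(blend)
```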

15.
Apply what we learn outside of competitions
● Competitions give us really good models, but we also need to
○ Select the right problem and structure it correctly
○ Find good (at least useful) data
○ Make sure models are used the right way
Competitions help us
● Understand how much “signal” exists in the data
● Identify flaws in data or data creation process
● Build generalizable models
● Broaden our technical horizon
● …

16.
Acknowledgement
● My fellow competitors and data scientists at large
○ You have taught me (almost) everything
● Xavier Conort -- my colleague at DataRobot
○ Thanks for collaboration and inspiration for some
material
● Kaggle
○ Thanks for all the fun we have competing!