How to ensure model obsolescence (part 1)

Fitting your test set and other terrible practices in ML

Posted on July 23, 2018

Throughout my tenure in the field of data science and machine learning, I’ve had the privilege to interview a lot of candidates. Some interviews have gone splendidly; others have gone horrendously. And, to the credit of those who have bombed, sometimes you’re just nervous and trip up, I get it. It happens to all of us. But other times—more often than I’d care to admit—I have been witness to some truly bad practices that outright scare me.

For my part, I’ve participated in my own fair share of really bad interviews from the candidate side, and I always appreciate when I’m treated gently. That said, I want to learn where I’ve gone wrong, and so I always try to make interviews—whether I’m the participant or the proctor—a learning experience. As data scientists, we should strive to learn from those who have been around longer than us, and teach those who haven’t been in the field as long.

Therefore, in this series of posts, I’m going to talk about the most common mistakes I witness data science candidates making.

Fitting your test set

If you’ve taken any number of ML classes, you know this is a no-no. And yet, it’s such a common mistake that I have to wonder where the confusion lies.

Perhaps the blame lies in the classroom, since many academics don’t have to worry about models drifting over time the way industry practitioners do, or maybe it’s just a concept that doesn’t stick until you’ve been bitten by the fallout. Nevertheless, here’s a common pattern I see in interview code (applied to everyone’s favorite dataset, Iris):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# load and split the data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# scale the data
X_train_scaled = StandardScaler().fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)
```

This is a huge no-no! Notice the very last line, where we fit a brand-new scaler on the test set in order to transform it. If it’s not quite clear why this should be avoided like the plague, let’s consider the following scenario…
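Before diving into that scenario, it’s worth spelling out the correct pattern: fit the scaler once, on the training data only, and reuse that same fitted scaler everywhere else. A minimal sketch of the fix:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# fit the scaler ONCE, on the training data only...
scaler = StandardScaler().fit(X_train)

# ...then reuse the SAME fitted scaler for both splits
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

The fitted scaler remembers the training means and standard deviations, so the test set (and any future data) is measured on the training set’s scale rather than its own.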

Example: covariate shift

Imagine you’ve fit a model to estimate whether a mechanical part is going to give out (an actual problem I worked on several years ago). Let’s say you have the following setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# initialize a random state, create several "normal" features
rs = np.random.RandomState(42)
X = rs.normal(25., 7., (1000, 3))

# if the last feature is >= 35, the part WILL FAIL
y = X[:, -1] >= 35.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=rs, test_size=0.25, stratify=y)

# transform your data, and fit/predict with a logistic regression
# Scale
X_train_scaled = StandardScaler().fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)

# Fit, predict
clf = LogisticRegression().fit(X_train_scaled, y_train)
preds = clf.predict(X_test_scaled)
print("Test accuracy: %.3f" % accuracy_score(y_test, preds))
```

On the surface, this doesn’t look too bad. In fact, if you run this exact snippet of code, you’ll get 98.8% accuracy on your test set, and that didn’t even require any significant pre-processing or hyper-parameter tuning! However, there’s a very dangerous, hidden assumption in this code. It assumes that your distributions will never change.

What are you doing? You’re fitting a scaler on your test set. That means when new data passes into your system, you’ll have to do the same. Consider the following excerpt, where you have new data flowing into your model in your production environment:

```python
# here's the new data coming in. let's pretend the last feature
# has drifted upward for some reason
X_new_data = np.random.RandomState(100).normal(25., 7., (500, 3))
X_new_data[:, -1] *= 5.

# and now let's score the model
X_new_data_scaled = StandardScaler().fit_transform(X_new_data)
preds = clf.predict(X_new_data_scaled)
```

What do you expect will happen? Well, we know from our previous analysis (and even our problem setup) that when the last feature is >= 35, we should be predicting True. And yet, if you examine your predictions here, you’ll see that we only predicted True 7% of the time, even though nearly every value was greater than 35.

Why is this?

Well, our model was fit on the Z-scores of the training data (that is, the centered and scaled features). We can visualize those like so:

```python
import pandas as pd
import seaborn as sns

sns.pairplot(pd.DataFrame(X_train_scaled))
```

The problem is, our scaled new data looks just the same, since it was scaled with respect to itself rather than to the training set:

```python
sns.pairplot(pd.DataFrame(X_new_data_scaled))
```
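We can make the same point numerically. Here’s a self-contained sketch (using stand-in arrays drawn from the same distributions as above, since it doesn’t reproduce the exact train/test split): refitting a scaler on the drifted data erases the shift entirely, while the scaler fit on the training data exposes it as huge Z-scores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rs = np.random.RandomState(42)
X_train = rs.normal(25., 7., (750, 3))                 # stand-in training data
X_new = np.random.RandomState(100).normal(25., 7., (500, 3))
X_new[:, -1] *= 5.                                     # the drifted feature

# refitting on the new data re-centers the drifted feature around zero,
# so the model can't tell anything changed
refit = StandardScaler().fit_transform(X_new)
print("refit mean:   %.3f" % refit[:, -1].mean())      # ~0

# the scaler fit on training data preserves the shift
train_scaler = StandardScaler().fit(X_train)
shifted = train_scaler.transform(X_new)
print("train-scaled: %.3f" % shifted[:, -1].mean())    # far from 0
```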

Had we looked at the distributional differences in the offending feature between our training set and the new data, we would have recognized this problem immediately. But with the current setup, we’ll never diagnose it properly in time; the mechanical parts will likely fail without us ever having issued a warning.
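One lightweight way to surface that warning in production (a sketch, not part of the original example; `check_drift` is a hypothetical helper) is to record each feature’s training mean and standard deviation, then flag incoming batches whose means have wandered too far:

```python
import numpy as np

def check_drift(train_mean, train_std, X_batch, threshold=3.):
    """Flag features whose batch mean is more than `threshold`
    training standard deviations away from the training mean."""
    z = np.abs(X_batch.mean(axis=0) - train_mean) / train_std
    return z > threshold

rs = np.random.RandomState(42)
X_train = rs.normal(25., 7., (750, 3))                 # stand-in training data
X_new = np.random.RandomState(100).normal(25., 7., (500, 3))
X_new[:, -1] *= 5.                                     # the drifted feature

flags = check_drift(X_train.mean(axis=0), X_train.std(axis=0), X_new)
print(flags)  # only the last feature should be flagged
```

A statistical test (e.g., a two-sample Kolmogorov–Smirnov test per feature) would be more rigorous, but even a crude check like this would have caught the fivefold shift above.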

Preventing this problem

Scikit-learn makes this very simple with the Pipeline object. Here’s how we can rework our model to handle the covariate shift properly:

```python
from sklearn.pipeline import Pipeline

# stuff it all into a pipe
pipe = Pipeline([
    ('std', StandardScaler()),
    ('clf', LogisticRegression())
]).fit(X_train, y_train)

# apply the pipe directly to your test data
test_preds = pipe.predict(X_test)
new_preds = pipe.predict(X_new_data)

# we should have roughly the same test accuracy:
print("Test accuracy: %.5f" % accuracy_score(y_test, test_preds))

# and we should have predicted True for more
# than 7% of the "new" data
print("{0}% predicted True".format(new_preds.sum() / new_preds.shape[0] * 100))
```

By using a pipeline, we ensure that the scaler (and the model) is only ever fit on our training data, and is reused, unchanged, to transform our test data. If you run the snippet above, we get roughly the same test accuracy (99.2%), and you’ll see that we predict True for >99% of the “new” data—way up from the previous 7%! You may just be able to warn someone about the impending mechanical failure in time!
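As a practical follow-on (not part of the original example), the fitted pipeline can be persisted as a single artifact with joblib, so the production scoring path loads one object and never refits anything:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# rebuild a small pipeline on stand-in data
rs = np.random.RandomState(42)
X = rs.normal(25., 7., (1000, 3))
y = X[:, -1] >= 35.

pipe = Pipeline([
    ('std', StandardScaler()),
    ('clf', LogisticRegression())
]).fit(X, y)

# persist scaler + model together as one artifact
path = os.path.join(tempfile.mkdtemp(), 'pipe.pkl')
joblib.dump(pipe, path)

# in production: load and predict, with no fitting anywhere in sight
loaded = joblib.load(path)
preds = loaded.predict(X[:5])
```

Shipping the scaler and model as one object removes the temptation (and even the ability) to call `fit_transform` on incoming data.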

In summary

Hopefully it’s clear now (if it wasn’t before) why you should never fit anything on your test set. You can avoid this headache by using scikit-learn’s Pipeline. The full code for this example is located in the code folder.