Author: Tatyana Kudasova, ODS Slack @kudasova

Tutorial

Nested cross-validation

Often we want to tune the hyperparameters of a model. That is, we want to find the hyperparameter values that minimize our loss function on data the model has not been trained on. The standard way to do this, as we already know, is cross-validation.

However, as Cawley and Talbot pointed out in their 2010 paper, if we use the same test set both to select the hyperparameter values and to evaluate the model, we risk optimistically biasing our model evaluation. For this reason, if a test set is used to select model hyperparameters, then we need a different test set to get an unbiased evaluation of the selected model. In essence, model selection is another training procedure, so we would need a decently sized, independent test set that we have not seen before to get an unbiased estimate of the model's performance. Often, this is not affordable. A good way to overcome this problem is to use nested cross-validation.

Nested cross-validation has an inner cross-validation nested inside an outer cross-validation. First, the inner cross-validation is used to tune the hyperparameters and select the best model. Second, the outer cross-validation is used to evaluate the model selected by the inner cross-validation.
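As a rough illustration of the two nested loops, here is a minimal sketch using scikit-learn. It assumes a single k-nearest-neighbors pipeline, 2 inner folds, 5 outer folds, and a small grid of `n_neighbors` values; these choices (and the step name `knn`) are only for illustration. Passing a `GridSearchCV` object to `cross_val_score` makes the grid search rerun inside every outer training split, so the outer test folds are never used for tuning.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Inner CV tunes hyperparameters; outer CV evaluates the tuned model.
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = Pipeline([("std", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(
    estimator=pipe,
    param_grid={"knn__n_neighbors": [1, 3, 5, 7, 9]},  # illustrative grid
    scoring="accuracy",
    cv=inner_cv,
)

# Each outer split fits a fresh grid search on its training part only,
# then scores the refitted best model on the held-out outer fold.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(nested_scores.mean(), nested_scores.std())
```

The mean of `nested_scores` is the nested estimate of accuracy; the `best_score_` of a single grid search fitted on all the data would be the potentially optimistic figure this procedure is meant to avoid.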

Imagine that we have N models and we want to use L-fold inner cross-validation to tune the hyperparameters and K-fold outer cross-validation to evaluate the models. Then the algorithm is as follows:

1. Divide the dataset into K cross-validation folds at random.

2. For each fold k=1,2,…,K (outer loop, for evaluation of the model with the selected hyperparameters):

2.1. Let test be fold k
2.2. Let trainval be all the data except those in fold k
2.3. Randomly split trainval into L folds
2.4. For each fold l=1,2,…,L (inner loop, for hyperparameter tuning):

2.4.1. Let val be fold l
2.4.2. Let train be all the data except those in test or val
2.4.3. Train each of the N models with each hyperparameter setting on train, and evaluate it on val. Keep track of the performance metrics

2.5. For each hyperparameter setting, calculate the average metric score over the L folds, and choose the best hyperparameter setting.
2.6. Train each of the N models with its best hyperparameter setting on trainval. Evaluate its performance on test and save the score for fold k

3. For each of the N models, calculate the mean score over all K folds, and report it as the estimate of generalization performance.
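The steps above can be written out as explicit loops. The sketch below is a simplified version under stated assumptions: a single model (N = 1, a scaled k-nearest-neighbors classifier), a hypothetical grid `candidates` of `n_neighbors` values, stratified folds instead of purely random ones, and accuracy as the metric. The helper `make_model` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
K, L = 5, 2
candidates = [1, 3, 5, 7, 9]  # hypothetical n_neighbors grid


def make_model(n_neighbors):
    """Build a fresh, unfitted pipeline for one hyperparameter setting."""
    return make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=n_neighbors))


outer_cv = StratifiedKFold(n_splits=K, shuffle=True, random_state=1)
outer_scores = []

for trainval_idx, test_idx in outer_cv.split(X, y):        # outer loop over k
    X_trainval, y_trainval = X[trainval_idx], y[trainval_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    inner_cv = StratifiedKFold(n_splits=L, shuffle=True, random_state=1)
    inner_scores = {c: [] for c in candidates}

    for train_idx, val_idx in inner_cv.split(X_trainval, y_trainval):  # inner loop over l
        for c in candidates:
            model = make_model(c).fit(X_trainval[train_idx], y_trainval[train_idx])
            inner_scores[c].append(model.score(X_trainval[val_idx], y_trainval[val_idx]))

    # Pick the setting with the best mean validation accuracy over the L folds.
    best = max(candidates, key=lambda c: np.mean(inner_scores[c]))

    # Refit on all of trainval with the chosen setting, then score on the test fold.
    final_model = make_model(best).fit(X_trainval, y_trainval)
    outer_scores.append(final_model.score(X_test, y_test))

print(np.mean(outer_scores))
```

With several models, the body of the outer loop would simply be repeated per model, keeping one list of outer scores for each.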

In the picture above and the code below we chose L = 2 and K = 5, but you can choose different numbers.

The data for this tutorial is breast cancer data with 30 features and a binary target variable.

In [2]:

from sklearn import datasets

# Load the data
dataset = datasets.load_breast_cancer()

# Create X from the features
X = dataset.data

# Create y from the target
y = dataset.target

In [3]:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Making train set for Nested CV and test set for final model evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=1, stratify=y)

# Initializing Classifiers
clf1 = LogisticRegression(solver='liblinear', random_state=1)
clf2 = KNeighborsClassifier()
clf3 = DecisionTreeClassifier(random_state=1)
clf4 = SVC(kernel='rbf', random_state=1)

# Building the pipelines
# (the decision tree is used without a pipeline: it does not need feature scaling)
pipe1 = Pipeline([('std', StandardScaler()), ('clf1', clf1)])
pipe2 = Pipeline([('std', StandardScaler()), ('clf2', clf2)])
pipe4 = Pipeline([('std', StandardScaler()), ('clf4', clf4)])

# Setting up the parameter grids
param_grid1 = [{'clf1__penalty': ['l1', 'l2'],
                'clf1__C': np.power(10., np.arange(-4, 4))}]
param_grid2 = [{'clf2__n_neighbors': list(range(1, 10)),
                'clf2__p': [1, 2]}]
param_grid3 = [{'max_depth': list(range(1, 10)) + [None],
                'criterion': ['gini', 'entropy']}]
param_grid4 = [{'clf4__C': np.power(10., np.arange(-4, 4)),
                'clf4__gamma': np.power(10., np.arange(-5, 0))}]

# Setting up multiple GridSearchCV objects as inner CV, 1 for each algorithm
gridcvs = {}
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

for pgrid, est, name in zip((param_grid1, param_grid2, param_grid3, param_grid4),
                            (pipe1, pipe2, clf3, pipe4),
                            ('Logit', 'KNN', 'DTree', 'SVM')):
    gcv = GridSearchCV(estimator=est,
                       param_grid=pgrid,
                       scoring='accuracy',
                       n_jobs=1,
                       cv=inner_cv,
                       verbose=0,
                       refit=True)
    gridcvs[name] = gcv
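One detail of the grids above is easy to miss: keys for a pipeline step use the `<step name>__<parameter>` form, while the decision tree, which is passed to the grid search without a pipeline, uses bare parameter names. The sketch below rebuilds two of the grids (the variable names `knn_grid` and `tree_grid` are introduced here only for illustration), checks that every key is a valid parameter of its estimator, and counts how many candidate settings each inner cross-validation has to evaluate.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("std", StandardScaler()), ("clf2", KNeighborsClassifier())])
tree = DecisionTreeClassifier(random_state=1)

knn_grid = {"clf2__n_neighbors": list(range(1, 10)), "clf2__p": [1, 2]}
tree_grid = {"max_depth": list(range(1, 10)) + [None], "criterion": ["gini", "entropy"]}

# Every grid key must be a parameter name that the paired estimator exposes.
knn_keys_ok = set(knn_grid) <= set(pipe.get_params())
tree_keys_ok = set(tree_grid) <= set(tree.get_params())

# ParameterGrid enumerates the Cartesian product of the listed values.
n_knn = len(ParameterGrid(knn_grid))    # 9 values of n_neighbors * 2 values of p
n_tree = len(ParameterGrid(tree_grid))  # 10 values of max_depth * 2 criteria

print(knn_keys_ok, tree_keys_ok, n_knn, n_tree)  # True True 18 20
```

With 2 inner folds, each grid search therefore fits 2 models per candidate, plus one refit on the full training part when `refit=True`.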