Machine Learning: Model Selection & Cross Validation

In this post, we'll go over why cross-validation is important, understand how it works, and see how it can be applied in different ways.

Suppose we need to build a machine learning system for the following problem: given a photograph, we would like to predict whether it shows a person or a bomb. Clearly this is an important problem as well as a public safety issue. As machine learning scientists, we represent the input as X and the output as Y. Y can be −1 for a person or +1 for a bomb. In order to build our system, we need to collect data from the real world to learn from.

Our dataset consists of many pairs of photographs and labels, each label marking either a person or a bomb. Once we've collected our data, we will train our system and then put it to the test in the real world, protecting our nation's shopping malls, schools, and airports. Let's look at the training process in more detail. As it turns out, there are many ways to train machine learning systems, each with different parameters and settings. For example, we could train a one-nearest-neighbor system (1-NN), 3-NN, 5-NN, a kernel regression system with a sigma of one, a kernel regression system with a sigma of two, Naive Bayes, a support vector machine, and many others.

The problem of choosing which method to use from a pool of possible methods is known as model selection. We want to choose the model that will work best at test time in the real world, when all we have is a fixed dataset. One way to choose is to train each method on our data and then test it on that same data. This is a terrible idea: you can't give your students the answer key before giving them an exam.
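To see why, here is a minimal pure-Python sketch (toy 1-D inputs, hypothetical data): a 1-nearest-neighbor classifier scores a perfect zero error on the very data it memorized, even when the labels are pure noise and there is nothing real to learn.

```python
import random

def one_nn_predict(train, x):
    """Label of the training point closest to x. train: list of (x, y) pairs."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Labels assigned at random: there is nothing to learn here.
random.seed(0)
data = [(x, random.choice([-1, 1])) for x in range(20)]

# Each point's nearest neighbor in the training set is itself (distance 0),
# so "testing" on the training data reports 0% error no matter what.
train_error = sum(one_nn_predict(data, x) != y for x, y in data) / len(data)
print(train_error)  # 0.0 -- memorization, not generalization
```

The 0% looks great on paper but says nothing about performance on photographs the system has never seen.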

Instead, we will do the following: we split our data into sections, each of which is called a fold. In this example we have four folds.

Next, we iterate over the folds as follows. In the first iteration, we train on folds 2, 3, and 4, and then we test our method on fold 1. The algorithm has never seen fold 1 before, just as our bomb detector will face unseen data in the real world. We measure the error of our method on this fold. We then swap the roles of folds 1 and 2: now we train on folds 1, 3, and 4 and test on fold 2. We repeat this process for each fold, withholding that fold from training and then computing the error on it at test time. Some folds are easier to learn than others.

Finally, we combine the four errors into a single average. This average is known as the cross-validation error. For any single method, the cross-validation error is an estimate of how the method would perform in the real world, provided the data we collected is an accurate representation of it.
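The fold loop above can be sketched in a few lines of pure Python. The toy 1-D data and the 1-nearest-neighbor rule are illustrative assumptions; the fold bookkeeping is the point.

```python
def one_nn_predict(train, x):
    """Label of the training point closest to x. train: list of (x, y) pairs."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def cross_val_error(data, k=4):
    """Average held-out error over k folds."""
    folds = [data[i::k] for i in range(k)]               # round-robin split
    errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the held-out one.
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        wrong = sum(one_nn_predict(train, x) != y for x, y in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k                               # cross-validation error

# Toy dataset: label is +1 ("bomb") for x >= 5, -1 ("person") otherwise.
data = [(x, 1 if x >= 5 else -1) for x in range(10)]
print(cross_val_error(data, k=4))
```

Each fold is held out exactly once, so every data point contributes to the error estimate without ever being tested by a model that trained on it.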

We repeat the cross-validation procedure for each method in our pool, and then select the model with the minimum cross-validation error.
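Model selection by cross-validation is then just a loop over candidates. This sketch compares k-NN classifiers with k = 1, 3, and 5 on hypothetical toy data and keeps the candidate with the lowest cross-validation error; all names and the dataset are illustrative.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def cv_error(data, k_neighbors, n_folds=4):
    """Average held-out error of k-NN over n_folds folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    total = 0.0
    for i, held_out in enumerate(folds):
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        wrong = sum(knn_predict(train, x, k_neighbors) != y for x, y in held_out)
        total += wrong / len(held_out)
    return total / n_folds

data = [(x, 1 if x >= 5 else -1) for x in range(20)]
scores = {k: cv_error(data, k) for k in (1, 3, 5)}    # one CV run per candidate
best_k = min(scores, key=scores.get)                  # model selection
print(scores, "-> selected k =", best_k)
```

The same loop works for any pool of candidates — different k, different kernel widths, or entirely different algorithms — as long as each exposes a train/predict interface.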

In this case, 5-nearest-neighbors is our best guess for which model will make the best bomb detector in the real world. Now that we have chosen our model, we can evaluate it in the real world. But what sort of performance should we expect? Exactly the same performance as the cross-validation estimate? Maybe our estimate was optimistic, or maybe it was too conservative? In fact, the 17 percent error we found during the model selection process is almost certainly optimistic. This is because model selection has biased our error estimate: we chose the best cross-validation error out of many possibilities.

Even if we had a pool of one million random classifiers, we would still expect at least one of them to have a low cross-validation error due purely to random chance. So we need to take another look at our data. We will still use cross-validation, but this time we apply it twice.

First, we separate our data into two parts: the first part will be used for model selection, and the second will be used for testing, to represent the unseen world. The important point is that the held-out data is never touched by our model selection procedure. To perform model selection, we divide the first part into folds, just like before.

In this case we have six folds. We then perform cross-validation for each of our methods to determine an error rate. This time, 3-nearest-neighbors is the method with the lowest cross-validation error. Now we can evaluate the result of model selection on our held-out test data, using all folds of the training data during training.

Now, what does this final number estimate? It is an estimate of our entire learning process: we took data, we trained multiple methods, we selected the best according to cross-validation, and finally we tested on held-out data never seen by the algorithm. In other words, we obtained an estimate of how our entire learning procedure, which includes model selection as part of training, will perform on unseen data — again, assuming the world is well represented by our dataset. This time, our estimate of 16 percent is most likely conservative, since we are only using a portion of the data we have to train the model.
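The whole two-stage procedure can be sketched as follows: hold out a test split that model selection never touches, select a model by cross-validation on the training split, retrain on the full training split, and measure test error exactly once. The dataset, the k-NN candidate pool, and the six-fold setting are illustrative assumptions, not a prescription.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def error(train, test, k):
    """Fraction of test points that k-NN (fit on train) mislabels."""
    return sum(knn_predict(train, x, k) != y for x, y in test) / len(test)

def cv_error(data, k, n_folds=6):
    """Six-fold cross-validation error of k-NN, as in the example above."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    return sum(error([p for j, f in enumerate(folds) if j != i for p in f],
                     folds[i], k) for i in range(n_folds)) / n_folds

data = [(x, 1 if x >= 20 else -1) for x in range(40)]
test_data = data[::4]                                  # held out; never used below
train_data = [p for i, p in enumerate(data) if i % 4]  # model selection sees only this

best_k = min((1, 3, 5), key=lambda k: cv_error(train_data, k))  # stage 1: selection
test_err = error(train_data, test_data, best_k)        # stage 2: one final evaluation
print("selected k =", best_k, "test error =", test_err)
```

Because the test split plays no role in choosing `best_k`, the final `test_err` is an honest estimate of the whole pipeline, not just of one classifier.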

In conclusion, what have we learned today? Cross-validation is a simple and useful method for model selection. More importantly, a held-out test set alongside cross-validation is also necessary to obtain an honest estimate of the error of our model selection procedure.