When I search for the answer online, it seems there is disagreement about what cross-validation is for. Some say k-fold cross-validation is used to get an estimate of how well a model will perform before building the model on all of the data. See below:

Some say k-fold cross-validation is used on the training set to obtain an optimal model before applying the model to the testing set. If this use of k-fold cross-validation is correct, how is the optimal instance of the model picked? k-fold cross-validation will create k instances of the model, so which one should be used on the testing set to get an accuracy measurement?

$\begingroup$ Why do you think there is a disagreement here? $\endgroup$
– The Laconic, Apr 16 at 0:03

$\begingroup$ One is splitting the dataset into training and testing sets first, then doing k-fold cross-validation on the training set to tune parameters. The other is doing k-fold cross-validation on the entire dataset to get an estimate of performance before building the model on the entire dataset. $\endgroup$
– Dexter Luu, Apr 16 at 3:21

1 Answer
1

I think the accepted answer in your link provides highly valuable insight. I'd just like to point out the two uses of CV, which you seem to interpret as a disagreement in the community (it isn't one):

Some say k-fold cross validation is used to get an estimate of how
well a model will perform before building the model on all of the data.

This is just model checking: it applies once you have decided on your model and want a statistically better estimate of the performance you would achieve with it than a single test set can give.
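A minimal sketch of this first use, with a toy dataset and a deliberately trivial "model" (the mean predictor) so it runs without any libraries: k-fold CV gives the performance estimate, and the model you actually keep is then fit on all of the data. The data, model, and error metric here are all hypothetical illustrations, not anything from the question.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs splitting range(n) into k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Toy "model": predict the mean of the training targets.
def fit(y_train):
    return sum(y_train) / len(y_train)

def mse(prediction, y_test):
    return sum((y - prediction) ** 2 for y in y_test) / len(y_test)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # toy targets

# Model checking: average the k held-out scores to estimate performance...
scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=3):
    model = fit([y[i] for i in train_idx])  # fit on k-1 folds
    scores.append(mse(model, [y[i] for i in test_idx]))
cv_estimate = sum(scores) / len(scores)

# ...then build the model you will actually use on ALL of the data.
final_model = fit(y)
```

Note that none of the k fold models is kept: they exist only to produce the k held-out scores, and the deployed model is refit on the full dataset.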

Some say k-fold cross validation is used on the training set of data to obtain an optimal model before applying the model to the testing set.

This is model selection. You can choose the best model or hyperparameter set via this procedure. But note that once you have used your data this way, i.e. exploited it to see which configuration best fits your data, any later use of the first procedure on the same data to check your model's test performance will be optimistic, since you have tuned your model(s) on the very data you aim to test with.
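The second use can be sketched the same way, which also addresses the question of how the "optimal instance" is picked: in the usual practice, none of the k fold models is kept either. CV on the training set only ranks the candidate configurations; the winner is then refit on the whole training set and scored once on the held-out test set. The data and the "shrinkage" hyperparameter below are hypothetical illustrations.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs splitting range(n) into k folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

# Toy model with one hyperparameter: a mean predictor shrunk toward zero.
def fit(y_train, alpha):
    return alpha * sum(y_train) / len(y_train)

def mse(prediction, y_test):
    return sum((y - prediction) ** 2 for y in y_test) / len(y_test)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
train, test = data[:6], data[6:]          # hold out the test set FIRST

# Model selection: k-fold CV on the TRAINING set ranks the candidates.
best_alpha, best_score = None, float("inf")
for alpha in (0.5, 1.0):                  # candidate hyperparameters
    fold_scores = []
    for tr, te in k_fold_indices(len(train), k=3):
        pred = fit([train[i] for i in tr], alpha)
        fold_scores.append(mse(pred, [train[i] for i in te]))
    cv_score = sum(fold_scores) / len(fold_scores)
    if cv_score < best_score:
        best_alpha, best_score = alpha, cv_score

# The "optimal instance" is NOT one of the k fold models: refit the winning
# configuration on the entire training set, then evaluate once on the test set.
final_model = fit(train, best_alpha)
test_error = mse(final_model, test)
```

Because the test set played no role in choosing `best_alpha`, `test_error` remains an honest estimate; reporting the best CV score itself as the performance estimate would be the optimistic mistake described above.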