If you want to learn what K-fold cross-validation is and how it is done in R, then please follow along. Open your RStudio and have fun!


What is Cross-validation

A model is usually given a known data set (the training set) on which training is done, and an unknown data set (the testing set) against which the model is tested. Cross-validation is a technique that gives insight into how the model will generalize to an unknown or unseen data set, reducing problems like overfitting.

Types of Validation Methods

Holdout method: The data set is partitioned into two parts, one called the training set and the other the testing set. The model is fitted on the training set and then predicts the target variable for the testing data.
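As a quick illustration, here is a minimal sketch of a holdout split on the built-in iris data, assuming caret's createDataPartition function for a stratified random split (the 70/30 ratio is an arbitrary choice for this sketch):

library(caret)

set.seed(42)
# Stratified 70/30 split: createDataPartition() keeps the class proportions balanced
train_idx <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]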

K-fold cross-validation: The data set is divided into k subsets. Each time, one of the k subsets is used as the test set and the other k-1 subsets together form the training set. The average error across all k trials is then computed. That means in K-fold cross-validation the model is fitted K times and also tested K times against the left-out subset of data.
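Below is a minimal manual sketch of this procedure, assuming caret's createFolds function for the splitting and a k-nearest-neighbour classifier (knn3) purely as a placeholder model:

library(caret)

set.seed(42)
# createFolds() returns a list; each element holds the test-row indices of one fold
folds <- createFolds(iris$Species, k = 5)

# Fit on the other k-1 folds, test on the held-out fold, then average the k errors
errors <- sapply(folds, function(test_idx) {
  fit  <- knn3(Species ~ ., data = iris[-test_idx, ], k = 5)
  pred <- predict(fit, iris[test_idx, ], type = "class")
  mean(pred != iris$Species[test_idx])  # misclassification rate for this fold
})
mean(errors)  # the cross-validated error estimate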

Leave-one-out cross-validation: It is a K-fold cross-validation where K is equal to the number of data points in the set (i.e. the number of rows). That implies the model will be fitted N times, where N is the number of rows. So if the number of rows is very large, this method becomes very computationally expensive.
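In caret you do not have to build this loop yourself; leave-one-out can be requested directly, as in this small sketch:

library(caret)
# method = "LOOCV" tells caret to hold out one row per model fit
loocv_control <- trainControl(method = "LOOCV")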

Summary of All Validation Techniques:

1. Holdout method: We test the model only one time, and against one fixed subset of the whole data set. Of course you can choose the subset yourself, but it is best to choose it randomly.

2. K-fold cross-validation: Here the model runs K times. a. If K = 1, that is effectively the same as the holdout method (a single train/test split). b. If K = N (the number of rows in the data), that is the same as leave-one-out cross-validation.

3. Best value of K: Choosing the best number of folds depends on the size of the data, the computational expense you can afford, and so on.

4. A lower K would result in:

a. lower computational cost,
b. less error due to variance,
c. more error due to bias (model mismatch).

A higher K would result in:

a. higher computational cost,
b. more error due to variance,
c. less error due to bias (model mismatch).

How to reduce Variance without increasing bias?

Repeat the cross-validation with the same K but different random folds, and then average the results. The downside is that this is even more expensive.
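In caret this is called repeated K-fold cross-validation, and trainControl supports it directly. A sketch with 10 folds repeated 5 times (both numbers are arbitrary choices here):

library(caret)
# 10-fold cross-validation, repeated 5 times with different random folds each time
repeated_control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)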

How to do Cross-validation in R

Now let’s have a look at how to do cross-validation in R using the package caret.

Importing Libraries

We import the library caret for cross-validation. The iris data set used below ships with base R (the datasets package), so no extra package is needed for the data.
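The import is a single line; the data() call is optional since iris is lazily loaded:

library(caret)  # provides train(), trainControl() and the fold-creation helpers
data(iris)      # built-in data set with 150 rows and a Species target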

Cross-validation (Model Parameters)

Let’s choose the model parameters. Here we are tuning mtry of a random forest and trying three values. You can also choose another model and its parameters; the function expand.grid creates a grid of all combinations of the parameters.

# One row per candidate value of mtry (the number of variables tried at each split)
parameterGrid <- expand.grid(mtry = c(2, 3, 4))

Model fitting in Cross-validation

We will pass the grid above to the model below via the tuneGrid argument, and the cross-validation settings via the trControl argument. Let’s now fit the model using the train function. To know more about the train function, type and run ?train in the console.
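A complete sketch is shown below. It assumes the randomForest package is installed (caret loads it automatically for method = "rf"); the choice of 10 folds and the Accuracy metric are assumptions for this example, not fixed requirements:

library(caret)

set.seed(42)
# 10-fold cross-validation settings
control <- trainControl(method = "cv", number = 10)

# Fit a random forest, trying each mtry value in parameterGrid on every fold
model <- train(Species ~ .,
               data      = iris,
               method    = "rf",
               metric    = "Accuracy",
               trControl = control,
               tuneGrid  = parameterGrid)

print(model)  # shows the cross-validated accuracy for each mtry value

The printed output reports one accuracy estimate per mtry value, and caret keeps the best-performing setting as the final model.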
