Ensemble Methods in R: A Practical Guide

This tutorial explains various ensemble methods in R. Ensembling is one of the most popular ways to build accurate predictive models.

What is Ensembling?

Ensembling is a procedure in which we build multiple models based on similar or dissimilar techniques and then combine them to gain an improvement in accuracy. The idea is to build a more robust predictive model that absorbs the predictions of different techniques. In layman's terms, it is like collecting opinions from all relevant people and then applying a voting system, or giving equal or higher weightage to some of them.

Ensemble Methods

There are various methods to ensemble models. Some of the popular ones are as follows -

Simple Average

Weighted Average

Majority Voting

Weighted Voting

Ensemble Stacking

Boosting

Bagging

Average and Voting

The first four methods above fall under the broader 'Average and Voting' category. In all of them, we mainly perform the following tasks -

1. Build multiple models on the training data. We can either use the same training data with different algorithms, or different splits of the same training data with the same algorithm.

2. Make predictions on the test dataset with each model and save them.

3. In the last step, we make the final prediction based on either voting or averaging. This step is explained in detail below.

1. Simple Average

In this method, we take the simple average (mean) of the predicted probabilities on the test data in the case of a classification model. For a regression model, we take the mean of the predicted values.

Ensemble : Simple Average
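The idea can be sketched in a few lines of base R. The three probability vectors below are made-up numbers standing in for three hypothetical classifiers:

```r
# predicted probabilities from three hypothetical classifiers on test data
p_rf  <- c(0.80, 0.35, 0.60)
p_gbm <- c(0.70, 0.40, 0.55)
p_svm <- c(0.90, 0.30, 0.65)

# simple-average ensemble: mean of the predicted probabilities per observation
p_ens <- rowMeans(cbind(p_rf, p_gbm, p_svm))
p_ens
# 0.80 0.35 0.60
```

For a regression model, the same `rowMeans` call would simply be applied to predicted values instead of probabilities.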

2. Weighted Average

Unlike the 'Simple Average' method, we do not assign equal weights. Instead, we apply a different weight to each algorithm. One way to calculate the weights is to build a logistic regression model. See the steps below -

Step I : Train multiple different algorithms on the training data. For example, boosted trees and a single decision tree trained on the same data set give us two classifiers.

Steps II - VI : Build a logistic regression model with the classifiers' predicted probabilities as independent variables and the original target as the dependent variable. The coefficients measure the overall importance of each classifier. It is important to take the absolute value of the coefficients (and normalise them) before using them as linear weights.

Step VII : Predict on the test data using the trained models and combine the predictions with these weights.

Final Prediction = W1 x P1 + W2 x P2, where W1 and W2 are the weights of the first and second algorithms, and P1 and P2 are their predicted probabilities.

3. Majority Voting

Every model returns a predicted class for each observation in the test data, and the final prediction is the class that receives the majority of the votes. If no class gets more than half of the votes, we may say that the ensemble could not make a stable prediction for that observation.

Ensemble : Majority Voting
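A minimal base-R sketch of the idea, with made-up class votes from three hypothetical models over four observations:

```r
# predicted classes from three hypothetical classifiers (rows) for four observations (columns)
votes <- rbind(m1 = c("A", "B", "A", "B"),
               m2 = c("A", "A", "B", "B"),
               m3 = c("B", "A", "A", "B"))

# majority vote per observation: the class with the most votes in each column
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
majority
# "A" "A" "A" "B"
```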

4. Weighted Voting

In this case, we give higher weightage to the votes of one or more models. Which models deserve higher weightage can be determined with the same logic we used in the weighted average method.
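A quick sketch in base R, with made-up votes and weights for three hypothetical models. Note how the heavily weighted model can outvote the other two:

```r
# class votes from three hypothetical models, with unequal weights
votes   <- c(m1 = "A", m2 = "B", m3 = "B")
weights <- c(m1 = 0.6, m2 = 0.25, m3 = 0.15)  # m1 is trusted more

score  <- tapply(weights, votes, sum)   # total weight per class
winner <- names(which.max(score))
winner
# "A"  -- despite two of the three models voting "B"
```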

5. Ensemble Stacking (aka Blending)

Stacking is an ensemble method where the models are combined using another data mining technique. Follow the steps below -

Train multiple algorithms on the training data. These models are known as bottom-layer models.

Perform k-fold cross-validation on the training data for each of these algorithms and save the cross-validated predicted probabilities from each.

Train logistic regression (or any machine learning algorithm) with the cross-validated predicted probabilities from step 2 as independent variables and the original target variable as the dependent variable. This trained model is the top-layer model.

Make predictions on the test data with each of the bottom-layer models.

Predict with the top-layer model, using the bottom-layer models' test-data predictions as inputs.
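The steps above can be sketched end-to-end in base R. This is a toy illustration on simulated data, with two simple `glm` models standing in for the bottom layer (for brevity, the training rows are reused as stand-in test data in the last step):

```r
set.seed(7)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- as.integer(x1 + x2 + rnorm(n) > 0)
train <- data.frame(x1, x2, y)

# Steps 1-2: out-of-fold predicted probabilities from two bottom-layer models
k <- 5
folds <- sample(rep(1:k, length.out = n))
oof_a <- oof_b <- numeric(n)
for (f in 1:k) {
  tr  <- train[folds != f, ]
  idx <- folds == f
  m_a <- glm(y ~ x1, data = tr, family = binomial)  # bottom model A
  m_b <- glm(y ~ x2, data = tr, family = binomial)  # bottom model B
  oof_a[idx] <- predict(m_a, train[idx, ], type = "response")
  oof_b[idx] <- predict(m_b, train[idx, ], type = "response")
}

# Step 3: top-layer logistic regression on the out-of-fold probabilities
top <- glm(y ~ oof_a + oof_b,
           data = data.frame(y = train$y, oof_a, oof_b),
           family = binomial)

# Steps 4-5: feed bottom-layer predictions through the top-layer model
stack_pred <- predict(top, data.frame(oof_a, oof_b), type = "response")
```

The key point is that the top-layer model is trained on *out-of-fold* predictions, never on predictions a bottom model made for rows it was trained on.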

6. Bagging

It is also called Bootstrap Aggregating. This algorithm creates multiple models using the same technique, each trained on a random sub-sample drawn from the original dataset with replacement (i.e. bootstrapping). Sampling with replacement simply means some observations may appear more than once in a sub-sample. For example, Random Forest is a bagging algorithm.

7. Boosting

Boosting refers to boosting the performance of weak models (e.g. shallow decision trees). The first model is trained on the entire training data, and each subsequent model is built by fitting the residuals (errors) of the ensemble so far, thus giving higher weight to the observations that were poorly predicted by the previous models. AdaBoost, Gradient Boosting and Extreme Gradient Boosting are examples of this ensemble technique.
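The residual-fitting idea can be sketched in base R with a hand-rolled depth-1 "stump" as the weak learner (simulated data; a real implementation would use a package such as gbm or xgboost):

```r
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.1)

# weak learner: a depth-1 "stump" that splits x at the best threshold
fit_stump <- function(x, r) {
  best <- list(sse = Inf)
  for (t in quantile(x, seq(0.1, 0.9, 0.1))) {
    left  <- mean(r[x <= t])
    right <- mean(r[x > t])
    sse   <- sum((r - ifelse(x <= t, left, right))^2)
    if (sse < best$sse) best <- list(sse = sse, t = t, left = left, right = right)
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$t, s$left, s$right)

# boosting loop: each stump is fit on the residuals of the current ensemble
pred <- rep(0, length(y))
nu   <- 0.3                         # learning rate
for (m in 1:50) {
  s    <- fit_stump(x, y - pred)    # fit the remaining error
  pred <- pred + nu * predict_stump(s, x)
}
```

Each iteration shrinks the residuals a little, so observations the earlier stumps predicted poorly dominate later fits.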

Ensembling : Weighted Average using Logistic Regression in R

In the code below, we combine various models such as random forest, extremely randomized trees, gradient boosting, support vector machine and rotation forest. We then apply linear weights calculated from a logistic regression model.
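The full listing is not reproduced here, but the weight-calculation step can be sketched in base R. The probability vectors p1 and p2 below are simulated stand-ins for two trained models' predictions:

```r
set.seed(42)
y <- rbinom(200, 1, 0.5)                       # binary target

# simulated predicted probabilities from two hypothetical base models
p1 <- plogis(2.0 * y - 1.00 + rnorm(200))
p2 <- plogis(1.5 * y - 0.75 + rnorm(200))

# logistic regression on the base-model predictions gives linear weights
meta <- glm(y ~ p1 + p2, family = binomial)
w <- abs(coef(meta)[-1])   # drop the intercept, take absolute coefficients
w <- w / sum(w)            # normalise so the weights sum to 1

# weighted-average ensemble: Final Prediction = W1*P1 + W2*P2
ensemble <- w["p1"] * p1 + w["p2"] * p2
```

With real models, p1 and p2 would be out-of-fold predicted probabilities on the training data, and the same weights would then be applied to the test-data predictions.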

We can also use a neural network to find optimal weights for stacking. It calculates the weights from the input nodes to the output node. To accomplish this, we can limit the number of hidden nodes to 1, which automatically constrains the total sum of weights to 1. This can be implemented in R via the deepnet package.

In R, there is a package called caretEnsemble which makes ensemble stacking easy and automated. It is an extension of the most popular data science package, caret. In the program below, we perform ensemble stacking manually (without the caretEnsemble package).

# Predicting the out-of-fold prediction probabilities for the training data
# In this case, level2 is the event
# rowIndex : row numbers of the data used in k-fold CV
# Sorting by rowIndex
training$OOF_dt    <- dt$pred$level2[order(dt$pred$rowIndex)]
training$OOF_logit <- logit$pred$level2[order(logit$pred$rowIndex)]
training$OOF_knn   <- knn$pred$level2[order(knn$pred$rowIndex)]

We can also use logistic regression for stacking. It is a simple linear classifier compared to GBM; sophisticated models such as GBM are much more susceptible to overfitting when used for stacking.

We should use trees instead of logistic regression as the top-layer model when we have:

Lots of data

Lots of models with similar accuracy scores

Models that are uncorrelated (compare accuracy/ROC across cross-validation samples; in the case of regression, check the correlation of residuals from the different algorithms)

Popularity of Ensemble Learning - Stacking

The use of ensemble learning is very common in data science competitions such as Kaggle. Most Kagglers already know this technique and generally use it to improve their scores. If you look at the solutions of Kaggle competition winners, you will find ensemble stacking among the top approaches for combining multiple models. Ensemble stacking not only improves the accuracy of a model but also increases its robustness.

Endnotes

In the past, I have used this technique several times in real-world data science projects. It helped improve accuracy by 10 to 20%. But we need to be very cautious with it: it can lead to overfitting, so make sure you cross-validate the result before implementing it in production. Please share your experience in the comment box below.

Deepanshu founded ListenData with a simple objective - make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, telecom, HR and health insurance.

While I love having friends who agree, I only learn from those who don't.
