Machine Learning Basics – Random Forest

A few colleagues of mine and I from codecentric.ai are currently working on developing a free online course about machine learning and deep learning. As part of this course, I am developing a series of videos about machine learning basics – the first video in this series was about Random Forests.

You can find the video on YouTube but as of now, it is only available in German. Same goes for the slides, which are also currently German only.

I did however translate my script:

Random Forest (RF) is one of the many machine learning algorithms used for supervised learning, this means for learning from labelled data and making predictions based on the learned patterns. RF can be used for both classification and regression tasks.

Decision trees

RF is based on decision trees. In machine learning decision trees are a technique for creating predictive models. They are called decision trees because the prediction follows several branches of “if… then…” decision splits – similar to the branches of a tree. If we imagine that we start with a sample, which we want to predict a class for, we would start at the bottom of a tree and travel up the trunk until we come to the first split-off branch. This split can be thought of as a feature in machine learning, let’s say it would be “age”; we would now make a decision about which branch to follow: “if our sample has an age bigger than 30, continue along the left branch, else continue along the right branch”. This we would do until we come to the next branch and repeat the same decision process until there are no more branches before us. This endpoint is called a leaf and in decision trees would represent the final result: a predicted class or value.

At each branch, the feature thresholds that best split the (remaining) samples locally is found. The most common metrics for defining the “best split” are gini impurity and information gain for classification tasks and variance reduction for regression.

Single decision trees are very easy to visualize and understand because they follow a method of decision-making that is very similar to how we humans make decisions: with a chain of simple rules. However, they are not very robust, i.e. they don’t generalize well to unseen samples. Here is where Random Forests come into play.

Ensemble learning

RF makes predictions by combining the results from many individual decision trees – so we cal them a forest of decision trees. Because RF combines multiple models, it falls under the category of ensemble learning. Other ensemble learning methods are gradient boosting and stacked ensembles.

Combining decision trees

There are two main ways for combining the outputs of multiple decision trees into a random forest:

Bagging, which is also called Bootstrap aggregation (used in Random Forests)

Boosting (used in Gradient Boosting Machines)

Bagging works the following way: decision trees are trained on randomly sampled subsets of the data, while sampling is being done with replacement. Bagging is the default method used with Random Forests. A big advantage of bagging over individual trees is that it decrease the variance of the model. Individual trees are very prone to overfitting and are very sensitive to noise in the data. As long as our individual trees are not correlated, combining them with bagging will make them more robust without increasing the bias. The part about correlation is important, though! We remove (most of) the correlation by randomly sampling subsets of data and training the different decision trees on this subsets instead of on the entire dataset. In addition to randomly sampling instances from our data, RF also uses feature bagging. With feature bagging, at each split in the decision tree only a random subset of features is considered. This technique reduces correlation even more because it helps reduce the impact of very strong predictor variables (i.e. features that have a very strong influence on predicting the target or response variable).

Boosting works similarly but with one major difference: the samples are weighted for sampling so that samples, which were predicted incorrectly get a higher weight and are therefore sampled more often. The idea behind this is that difficult cases should be emphasized during learning compared to easy cases. Because of this difference bagging can be easily paralleled, while boosting is performed sequentially.

The final result of our model is calculated by averaging over all predictions from these sampled trees or by majority vote.

Hyperparameters to be tuned

Hyperparameters are the arguments that can be set before training and which define how the training is done. The main hyperparameters in Random Forests are

The number of decision trees to be combined

The maximum depth of the trees

The maximum number of features considered at each split

Whether bagging/bootstrapping is performed with or without replacement

Training Random Forest models

Random Forest implementations are available in many machine learning libraries for R and Python, like caret (R, imports the randomForest and other RF packages), Scikit-learn (Python) and H2O (R and Python).

Other tree-based machine learning algorithms

The pros of Random Forests are that they are a relatively fast and powerful algorithm for classification and regression learning. Calculations can be parallelized and perform well on many problems, even with small datasets and the output returns prediction probabilities.

Downsides of Random Forests are that they are black-boxes, meaning that we can’t interpret the decisions made by the model because they are too complex. RF are also somewhat prone to overfitting and they tend to be bad at predicting underrepresented classes in unbalanced datasets.

Other tree-based algorithms are (Extreme) Gradient Boosting and Rotation Forests.