Decision Trees

Just as the name suggests, the random forest is made up of decision trees. Therefore, to understand how the random forest algorithm works, you first need to know how the decision tree algorithm works. See slide 1

So, what we do here is, we put our training data (which in this case has a binary label) into the decision tree algorithm to create a decision tree. We then use that tree to classify a new, unknown example from the test data set. And in this case, the tree predicts the label of that example to be a 1.
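As a toy illustration of that train-then-classify step, here is a one-level tree (a "decision stump") in plain Python. This is a deliberately minimal sketch, not the full recursive decision tree algorithm: the feature values, labels, and threshold search are all made up for the example.

```python
# Minimal sketch: a one-level decision tree ("stump") on a single feature.
# fit: find the threshold that best separates the binary labels;
# predict: route a new example down the left or right branch.

def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) minimizing training errors."""
    best = None
    for t in sorted(set(xs)):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right

# Toy training data: small feature values -> label 0, large -> label 1.
xs = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 1, 1, 1]
stump = fit_stump(xs, ys)
print(predict_stump(stump, 7.5))  # classify a new, unknown example -> 1
```

A real decision tree would apply this kind of split search recursively to the resulting subsets, but the stump already shows the core idea: learn a split from the training data, then use it to label new examples.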

Random Forest

And now, for the random forest algorithm, the idea is simply to use many different decision trees to classify a new, unknown example (instead of just having one tree). See slide 2

So here, decision tree 1 predicts the example to be a 0, decision tree 2 predicts it to be a 1, and decision tree 3 predicts it to be a 0. And then, we simply take the majority vote of all the trees as the final prediction of the random forest (when doing a regression task, you take the mean or median of the trees' predictions instead).

See slide 3

In this case then, the prediction of our random forest model would be that the example is a 0.
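The aggregation step above is easy to sketch in Python. The helper names and example values here are made up for illustration; only the voting/averaging logic is what the text describes.

```python
from collections import Counter
from statistics import mean, median

def forest_classify(tree_predictions):
    # Classification: majority vote over the individual trees' labels.
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions, use_median=False):
    # Regression: aggregate with the mean (or the median) instead.
    return median(tree_predictions) if use_median else mean(tree_predictions)

# The three trees from the example predicted 0, 1, and 0:
print(forest_classify([0, 1, 0]))        # -> 0
# A hypothetical regression forest:
print(forest_regress([2.0, 3.0, 10.0]))  # -> 5.0 (mean)
```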

So, that’s how random forests generally work. But the question now obviously is: How can we create different trees from the same training data set?

And the answer to that question (as the name also suggests) is to introduce randomness into our training data. And there are two ways in which we can introduce randomness, namely either via the examples (i.e. rows) in the data set or via the features (i.e. columns) in the data set.

See slide 4

Bootstrapping

And the approach with which we introduce randomness via the examples is called "bootstrapping". See slide 5

Here, we randomly sample with replacement from our training data set to create a bootstrapped data set of the same size (i.e. same number of examples). And because we are sampling with replacement, there are most likely duplicates in the bootstrapped data set and some examples from the training data set will be missing. So, the bootstrapped data sets will be different from each other which should lead to the creation of different decision trees.
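Sampling with replacement is a one-liner with Python's standard library. This sketch uses made-up row labels just to show that the bootstrapped data set has the same size but (most likely) contains duplicates:

```python
import random

def bootstrap(dataset, rng):
    # Draw len(dataset) rows *with replacement* from the training data.
    return [rng.choice(dataset) for _ in range(len(dataset))]

rng = random.Random(42)  # seeded so the sketch is reproducible
data = ["row0", "row1", "row2", "row3", "row4", "row5"]
sample = bootstrap(data, rng)
print(sample)                      # same size, likely with repeated rows
print(len(sample) == len(data))   # -> True
```

Running this with different seeds gives different bootstrapped data sets, which is exactly what produces different-looking trees.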

Side note: If the training data set is big (i.e. there are many examples), then the bootstrapped data set can also have fewer examples than the training data set. And if the training data set is really big, then the sampling can even be done without replacement. The critical point is just that the bootstrapped data sets should be sufficiently different from each other so that they result in different-looking trees.

Random Subspace Method

So, that’s the first approach with which we can introduce randomness into our training data. See slide 6

The other approach, which introduces randomness via the features, is called the “random subspace method”. And here the idea is simple: instead of using all the features in the data set to build the tree, we only use a random subset of the features.

And one way in which we could implement this is at the level of our bootstrapped data sets.

See slide 7

So, for each bootstrapped data set, we only consider a random subset of the features. And then, we put that data set into the decision tree algorithm. And this will give us different-looking trees. But we can increase the effectiveness of the random subspace method even further. Namely, instead of applying it at the level of the bootstrapped data sets, we can apply it at the level of the decision tree algorithm itself.
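The data-set-level variant described above can be sketched like this. The dict-of-features row format and the feature names are assumptions made for the illustration:

```python
import random

def random_feature_subset(dataset, n_features, rng):
    # dataset: list of rows, each a dict mapping feature name -> value.
    # Keep only a random subset of n_features columns for every row.
    all_features = list(dataset[0].keys())
    chosen = rng.sample(all_features, n_features)
    return [{f: row[f] for f in chosen} for row in dataset]

rng = random.Random(0)
data = [{"f1": 1, "f2": 4, "f3": 7},
        {"f1": 2, "f2": 5, "f3": 8}]
subset = random_feature_subset(data, 2, rng)
print(sorted(subset[0].keys()))  # only 2 of the 3 features remain
```

Each bootstrapped data set would get its own random draw of features before being handed to the tree-building algorithm.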

See slide 8

So, we put the respective bootstrapped data set with all the features into the algorithm. And then, at the stage where we determine all the potential splits, we only consider a random subset of the features. And this means that we will consider different features in each iteration (i.e. recursive call) of the decision tree algorithm.

So, for example, when we create the first fork in a tree, the algorithm might only consider potential splits based on “feature 1” and “feature 2”. And then, in the next iteration, it might only consider potential splits based on “feature 1” and “feature 3”. This way, we can increase the degree of randomness in the algorithm. And therefore, the resulting trees are even more likely to be different.
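The per-split variant can be sketched as a small helper that a recursive tree builder would call at every fork. The row format, feature names, and the name `candidate_splits` are made up for this sketch; the point is only that a fresh random subset of features is drawn on each call:

```python
import random

def candidate_splits(rows, feature_names, m, rng):
    # At each recursive call of the tree builder, draw a fresh random
    # subset of m features and only generate split points for those.
    features = rng.sample(feature_names, m)
    splits = []
    for f in features:
        for value in sorted({row[f] for row in rows}):
            splits.append((f, value))
    return splits

rng = random.Random(1)
rows = [{"f1": 1, "f2": 9, "f3": 4},
        {"f1": 3, "f2": 7, "f3": 4}]
names = ["f1", "f2", "f3"]
print(candidate_splits(rows, names, 2, rng))  # splits from 2 random features
print(candidate_splits(rows, names, 2, rng))  # next call draws features again
```

Because the draw happens inside the algorithm rather than once per data set, two consecutive forks can end up considering different feature pairs, just as the text describes.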

Conclusion

So, that’s how the random forest algorithm works. See slide 9

And if you want to see how to implement that from scratch, you can check out this series of posts.