Random Forest is a collection of Decision Trees, each trained on a random subset of the data (the name gives it away). A Decision Tree first compares the features and calculates which one yields the most information gain (the greatest reduction in entropy, i.e. the most homogeneous branches). This determines the root node, the first node in the tree, where the data is split based on an internal threshold (e.g. <=).

This is applied recursively, continually splitting the data on branch nodes that test other features until it reaches a leaf node, or a decision, where entropy equals 0. If the entropy of a node is greater than 0, it needs to be split further.
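The entropy and information-gain calculation described above can be sketched in a few lines. This is a toy illustration: the class labels and the perfect split below are invented, not taken from a real dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into two branches."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ['good', 'good', 'risk', 'risk']
# A perfect split leaves each branch pure (entropy 0), so the gain equals
# the parent's entropy: 1 bit for a 50/50 class mix.
print(information_gain(parent, ['good', 'good'], ['risk', 'risk']))  # 1.0
```

The tree builder evaluates this gain for every candidate feature and threshold, picks the best, and recurses on each branch until a node's entropy hits 0.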

Training the model several times (which can be done in parallel) on random samples of the dataset is what gives a Random Forest its strong prediction performance. The final prediction is made by voting on the results of the individual trees, i.e. going with the decision that appears most often across the trees. For example, if only 10 trees decide a person belongs to the 'good credit loan applicant' class, but 30 decide the person is 'high risk', then the final prediction is the latter, since majority rules.
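The majority vote from that example can be sketched in a few lines, using the tree counts from the paragraph above:

```python
from collections import Counter

# Each tree casts one vote: 10 say 'good applicant', 30 say 'high risk'
votes = ['good applicant'] * 10 + ['high risk'] * 30

# The class with the most votes wins
final_prediction, count = Counter(votes).most_common(1)[0]
print(final_prediction, count)  # high risk 30
```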

Randomness makes machine learning models more robust to noise, which is another reason Random Forest is effective. It goes a step further than simple bagging, which is random sampling with replacement (so a value can appear in more than one sample). Random Forest also randomly selects a subset of features to choose from each time a node splits, which decorrelates the trees.
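The two sources of randomness can be sketched as follows; the row indices and feature names here are made up purely for illustration:

```python
import random

random.seed(0)
rows = list(range(10))  # row indices of a hypothetical training set
features = ['income', 'debt', 'age', 'employment']

# Bagging: sample rows with replacement, so some rows repeat
# and others are left out of this tree's training sample.
bootstrap = [random.choice(rows) for _ in rows]

# Random Forest's extra step: at each split, only a random subset
# of the features is considered as split candidates.
split_candidates = random.sample(features, k=2)

print(bootstrap)
print(split_candidates)
```

Because different trees see different rows and different candidate features, their errors are less correlated, and the ensemble's vote is more reliable than any single tree.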

Random Forest is not only used for classification tasks such as determining whether a new credit loan applicant is high risk, a patient is likely to develop a chronic disease, a premium product is likely to appeal to a potential buyer, or a mechanical part is likely to fail. It is also used for regression tasks such as predicting numeric values like temperature, social media shares, and performance scores. In regression, the results of the trees are averaged to make the final prediction. The algorithm has also made its way into text and image classification tasks, as well as predicting patterns in speech.
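For regression, the vote is replaced by a simple average of the trees' numeric outputs. The tree predictions below are invented for illustration:

```python
# Each tree predicts a number (e.g. a temperature); the forest averages them.
tree_predictions = [21.5, 22.0, 20.8, 21.9, 22.3]
forest_prediction = sum(tree_predictions) / len(tree_predictions)
print(forest_prediction)  # ~21.7
```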

A Random Forest can be mathematically written as {h(x, Θ_k), k = 1, …}, where each tree h is grown from the input x and an independent, identically distributed random vector Θ_k. The algorithm can be implemented in R using the 'randomForest' package and in Python using 'RandomForestClassifier' from sklearn.ensemble.
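A minimal usage sketch of sklearn's RandomForestClassifier, on an invented toy loan dataset (the [income, debt] feature values and the 0/1 labels are made up for illustration):

```python
from sklearn.ensemble import RandomForestClassifier

# Toy training set: [income, debt] per applicant;
# 0 = good credit loan applicant, 1 = high risk (labels invented)
X = [[50, 5], [60, 2], [30, 40], [20, 35], [55, 3], [25, 45]]
y = [0, 0, 1, 1, 0, 1]

# n_estimators is the number of trees that vote on each prediction
clf = RandomForestClassifier(n_estimators=30, random_state=0)
clf.fit(X, y)

print(clf.predict([[58, 4], [22, 38]]))  # [0 1]
```

Swapping in RandomForestRegressor from the same module handles the regression case, averaging the trees instead of voting.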

However, a known drawback of Random Forest is that it can be a black box approach to machine learning: it is difficult to get insight into each feature's importance, or to trace through each tree to understand how the forest arrived at its prediction.

Copyright 2018 IDG Communications. ABN 14 001 592 650. All rights reserved. Reproduction in whole or in part in any form or medium without express written permission of IDG Communications is prohibited.