Machine Learning with Random Forests

Machines work on our commands: at each step they need guidance on where to go and what to do. The pattern is like a child who does not yet understand the facts around them well enough to make decisions, so the grown-ups decide for them. The same goes for machines: the developer writes the commands the machine executes. In Machine Learning, however, we talk about making the machine learn, enabling it to make decisions without any external help. That means a mature mind with the ability to understand facts and situations and to choose the right action.

To understand machine learning a little more deeply, I'd suggest you go through this introductory blog on Machine Learning.

In our previous blogs, we learned about the Decision Tree algorithm (Link) and its implementation (Link). Now in this blog, we will move on to the next Machine Learning algorithm, called Random Forest. Please go through those blogs before moving forward, as the Random Forest algorithm is based on Decision Trees.

What is Random Forest

‘Another algorithm for Machine Learning‘ would be the one-liner for it. But, as scholars say, explaining things is necessary at each step in the process of knowledge sharing. So let’s go deeper into this algorithm.

‘Random Forest‘, as the name suggests, is a forest, and a forest consists of trees. The trees here are Decision Trees. So the full definition would be: “A Random Forest is a random collection of Decision Trees”. Hence this algorithm is basically an extension of the Decision Tree algorithm.

Under The Hood

In this algorithm, we grow multiple decision trees to their full extent. Yes, here we do not need to prune our decision trees; there is no such limitation for the trees in a Random Forest. The catch is that we don’t provide all the data for each decision tree to consume. We provide a random subset of our training data to each decision tree. This process is called Bagging, or Bootstrap Aggregating.

Bagging is a general procedure that can be used to reduce the variance of algorithms that have high variance. In this process, sub-samples of the data set (and a subset of the attributes) are created, a decision model is trained on each, and then every model's prediction is combined into a decision by voting (classification) or by taking the average (regression). For a random forest, we usually take about two-thirds of the data, drawn with replacement (data can be repeated within and across the trees' samples; it need not be unique). At each split, only a random subset of m attributes is considered, where m is commonly the square root of the total number of attributes for classification.
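The sampling step above can be sketched in a few lines of plain Scala. This is a minimal illustration, not Smile's internal implementation: we draw n indices with replacement, so some rows appear more than once and others are left out entirely.

```scala
import scala.util.Random

// Sketch of bootstrap sampling: draw data.length elements with
// replacement, so rows may repeat and some rows are never picked
// (those become the "out of bag" rows for this tree).
def bootstrapSample[A](data: Vector[A], rng: Random): Vector[A] =
  Vector.fill(data.length)(data(rng.nextInt(data.length)))

val training = Vector(1.0, 2.0, 3.0, 4.0, 5.0)
val sample   = bootstrapSample(training, new Random(42))
// Same size as the original, but values may repeat.
```

Each tree in the forest would be trained on its own such sample.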

In a Random Forest, each decision tree predicts a response for an instance, and the final response is decided by voting. That means (in classification) the response returned by the majority of decision trees becomes the final response. (In regression, the average of all the responses becomes the final response.)
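The two aggregation rules are simple enough to show directly. In this sketch the per-tree predictions are hard-coded stand-ins for what each decision tree would return:

```scala
// Majority vote for classification: pick the most frequent prediction.
def majorityVote(predictions: Seq[Int]): Int =
  predictions.groupBy(identity).maxBy(_._2.size)._1

// Mean for regression: average the trees' numeric responses.
def average(predictions: Seq[Double]): Double =
  predictions.sum / predictions.length

val classVotes = Seq(1, 0, 1, 1, 0)   // five trees, two classes
val treeValues = Seq(2.0, 4.0, 3.0)   // three trees' responses

val finalClass = majorityVote(classVotes) // class 1 wins 3 votes to 2
val finalValue = average(treeValues)      // 3.0
```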

Advantages

Works well for both classification and regression.

Can handle large data sets with a large number of attributes, as these are divided among the trees.

It can estimate the importance of attributes, so it is also used for dimensionality reduction.

Works well, maintaining accuracy, even when data is missing.

It also works on unlabeled data (unsupervised learning) for clustering, data views, and outlier detection.

Random Forest samples its input data using what is called bootstrap sampling. For each tree, about one-third of the data is not used for training and can be used for testing instead. These samples are called out-of-bag samples, and the error measured on them is called the out-of-bag error.

The out-of-bag error shows more or less the same error rate as a separate test data set would. Hence it removes the need for a separate test data set.
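Where does the "one-third" figure come from? The chance that a given row is never picked in n draws with replacement is (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick Scala check:

```scala
// Fraction of rows expected to be out of bag for a sample of size n:
// each draw misses a given row with probability (1 - 1/n), and there
// are n independent draws.
def outOfBagFraction(n: Int): Double = math.pow(1.0 - 1.0 / n, n)

val f = outOfBagFraction(1000) // close to 1/e, i.e. roughly 0.368
```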

Disadvantages

Classification is good with Random Forest, but regression… not so much, since the predictions can never go beyond the range of responses seen in the training data.

Works as a black box. One cannot control the model’s inner workings beyond changing the input values and a few parameters.

Implementation

Now it’s time to see the implementation of the Random Forest algorithm in Scala. Here we are going to use the Smile library, just like we did for the implementation of Decision Trees.

We are going to use the same data for this implementation as we did for Decision Trees. So here we have an Array of Array of Double as the training instances and an Array of Int as the response values for those instances.

Like this:
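A minimal illustration of those shapes (the feature values and labels below are made up, not the blog's actual data set):

```scala
// Each row is one training instance; each Double is an attribute value.
val trainingInstances: Array[Array[Double]] = Array(
  Array(5.1, 3.5, 1.4, 0.2),
  Array(6.2, 2.9, 4.3, 1.3),
  Array(7.3, 2.8, 6.3, 1.8)
)

// responseValues(i) is the class label of trainingInstances(i).
val responseValues: Array[Int] = Array(0, 1, 2)

// These two arrays are what gets handed to Smile's random forest
// trainer; the exact call depends on the Smile version in use.
```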

Anuj Saxena is a software consultant with more than 1.5 years of experience. Anuj has worked on functional programming languages like Scala and functional Java, and is also familiar with other programming languages such as Java, C, C++, and HTML. He is currently working on reactive technologies like Spark, Kafka, Akka, Lagom, and Cassandra, and has used DevOps tools like DC/OS and Mesos for deployments. His hobbies include watching movies and anime, and he also loves travelling.