Exercise

What is a Random Forest

A detailed study of Random Forests would take this tutorial a bit too far. However, since it is a widely used machine learning technique, a general understanding and an illustration in R won't hurt.

In layman's terms, the Random Forest technique addresses the overfitting problem you faced with decision trees. It grows many (very deep) classification trees on the training set, each one on a random bootstrap sample of the rows and with a random subset of the variables considered at each split. At prediction time, each tree makes its own prediction and every outcome counts as a vote. For example, if you have trained 3 trees and 2 say a passenger in the test set will survive while 1 says he will not, the passenger is classified as a survivor. Each individual tree is overtrained, but because the final classification is the majority vote, the overfitting of the individual trees largely averages out.
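The voting step can be illustrated with a minimal base R sketch. The votes below are made up to match the 3-tree example, and `majority_vote` is a hypothetical helper written for this illustration, not a function from any Random Forest package.

```r
# Hypothetical helper: return the class that receives the most votes
majority_vote <- function(votes) {
  vote_counts <- table(votes)                 # count the votes per class
  names(vote_counts)[which.max(vote_counts)]  # class label with the highest count
}

# Invented votes from three trees for one passenger (1 = survives, 0 = does not)
tree_votes <- c(1, 1, 0)

predicted_class <- majority_vote(tree_votes)
print(predicted_class)
```

Note that the helper returns the class label as a character string, because it is read from the names of the vote table.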

Before starting the actual analysis, you first need to meet one big requirement: the Random Forest you are about to build cannot handle missing values, so your data frame must not contain any. Let's get to work.
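One way to check whether a data frame meets this requirement is to count the missing values per column. The sketch below uses a small invented data frame, `toy_data`, as a stand-in. Its columns and values are assumptions for illustration only.

```r
# Toy data frame standing in for the real data (columns and values are invented)
toy_data <- data.frame(
  Age  = c(22, NA, 35, 41),
  Fare = c(7.25, 71.28, NA, 8.05),
  Sex  = c("male", "female", "female", "male")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_data))
print(missing_per_column)

# TRUE only if the data frame contains no missing values at all
is_complete <- !any(is.na(toy_data))
print(is_complete)
```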

Instructions

The code to rid your entire dataset of missing data and split it into a training and a test set is provided in the sample code. Study the code chunks closely so you understand what's going on, then click Submit Answer to continue.

If you want to know how all_data itself was built from train and test, have a look at this R script.
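The general shape of such a clean-and-split step can be sketched as follows. This is not the exercise's sample code: the toy data frame, the median imputation, and the split position `n_train` are all assumptions chosen for illustration.

```r
# Toy stand-in for a combined data set: the first 4 rows play the
# training set, the last 2 rows play the test set (invented values)
toy_all <- data.frame(
  Age  = c(22, NA, 35, 41, NA, 28),
  Fare = c(7.25, 71.28, 8.05, NA, 13.00, 26.55)
)

# Replace each missing value with the median of its column
# (one simple imputation choice among many)
for (col in names(toy_all)) {
  col_median <- median(toy_all[[col]], na.rm = TRUE)
  toy_all[[col]][is.na(toy_all[[col]])] <- col_median
}

# Split by row position back into a training and a test set
n_train   <- 4  # assumed number of training rows in the toy data
toy_train <- toy_all[seq_len(n_train), ]
toy_test  <- toy_all[(n_train + 1):nrow(toy_all), ]
```

Imputing on the combined data before splitting keeps the training and test sets consistent, which is presumably why the exercise works on all_data as a whole.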