Random Forest in Python with scikit-learn

12/12/2018

The random forest algorithm combines tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. It can be applied to different machine learning tasks, in particular classification and regression. Random Forest uses an ensemble of decision trees as its basis and therefore keeps the advantages of decision trees, such as high accuracy, ease of use, and no need to scale the data. It also has a very important additional benefit: resistance to overfitting (unlike a single decision tree).
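The overfitting claim can be illustrated with a small experiment. The sketch below is not part of the Diamonds workflow: it assumes a synthetic noisy sine curve (all names and values are made up for illustration) and compares one fully grown decision tree with a forest on held-out data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.4, size=600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A single unrestricted tree memorizes the training noise
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
# A forest averages many trees, each grown on a bootstrap sample
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# (train R-squared, test R-squared) for each model
tree_scores = (tree.score(X_train, y_train), tree.score(X_test, y_test))
forest_scores = (forest.score(X_train, y_train), forest.score(X_test, y_test))
print("tree   train/test R^2:", tree_scores)
print("forest train/test R^2:", forest_scores)
```

On data like this, the single tree typically fits the training set almost perfectly while scoring worse on the test set than the forest does.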

In this tutorial, we will use the Diamonds dataset and predict diamond prices with the Random Forest Regressor. Then we will visualize and analyze the results. We will also consider hyperparameter tuning and variable importance.
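A minimal sketch of the kind of setup the following cells rely on. It assumes a small synthetic table in place of the real Diamonds data; the three columns and the price rule are made up for illustration, and the hyperparameter values are assumptions. The names `regr`, `X_train`, `X_test`, `y_train`, and `y_test` match those used in the later cells.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Diamonds data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 1000
data = pd.DataFrame({
    "carat": rng.uniform(0.2, 3.0, n),
    "depth": rng.uniform(55.0, 70.0, n),
    "table": rng.uniform(50.0, 65.0, n),
})
# Hypothetical price rule: grows non-linearly with carat, plus noise
data["price"] = 4000 * data["carat"] ** 2 + rng.normal(0, 300, n)

# Separate the features from the target and hold out 20% for testing
X = data.drop(columns="price")
y = data["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the regressor; n_estimators=100 is an assumed value
regr = RandomForestRegressor(n_estimators=100, random_state=42)
regr.fit(X_train, y_train)
print("Test R^2:", regr.score(X_test, y_test))
```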

Now we have a trained model and can evaluate it by predicting diamond prices and comparing them with the real prices from the test data. To make this comparison more illustrative, we will show it both as a table and as a plot.

In [5]:

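import warnings
warnings.filterwarnings('ignore')

# Make predictions on the test set
predictions = regr.predict(X_test)

# Work on a copy so the extra columns are not added to X_test itself
result = X_test.copy()
result['price'] = y_test
result['prediction'] = predictions.tolist()
result.head()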

As you can see from this figure, the predicted prices (red points) agree well with the real ones (blue points), especially in the region of small carat values. But to evaluate our model more precisely, we will look at the mean absolute error (MAE), mean squared error (MSE), and R-squared scores.
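These three scores are available in `sklearn.metrics`. The sketch below uses four hypothetical price pairs (made up for illustration, not taken from the tutorial's results) so that the numbers can be checked by hand; with the real model you would pass `y_test` and the predictions instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical real and predicted prices, for illustration only
y_true = np.array([500.0, 1200.0, 3000.0, 7500.0])
y_pred = np.array([550.0, 1100.0, 3300.0, 7000.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = np.sqrt(mse)                        # back in price units
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print("MAE: ", mae)   # 237.5
print("MSE: ", mse)   # 88125.0
print("RMSE:", rmse)
print("R^2: ", r2)
```

Because MSE is expressed in squared price units, it always looks large next to MAE; its square root (RMSE) is easier to compare with the prices themselves.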

The R-squared value is rather good, but the errors are high. To improve this, we should tune the hyperparameters of the algorithm. We could do this manually, but that takes a lot of time. Tools from the sklearn library can help us tune faster and more effectively. One such tool is GridSearchCV, which runs a cross-validated search over a grid of candidate parameter values and reports the best combination.

# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Find the best parameters for the model
parameters = {
    'max_depth': [70, 80, 90, 100],
    'n_estimators': [900, 1000, 1100]
}
gridforest = GridSearchCV(regr, parameters, cv=3, n_jobs=-1, verbose=1)
gridforest.fit(X_train, y_train)
gridforest.best_params_

Fitting 3 folds for each of 12 candidates, totalling 36 fits

[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 16.7min finished

Out[142]:

{'max_depth': 70, 'n_estimators': 1100}

If you pass the obtained parameters to the algorithm, you will see that the errors decrease and the R-squared score increases, which means that the model with the tuned hyperparameters predicts more accurately.
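Passing the parameters can be sketched as follows. The parameter values are the ones reported by the grid search above; everything else (the synthetic data, the variable names `best_params` and `tuned`) is an assumption for illustration, so the scores printed here say nothing about the Diamonds data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Diamonds dataset)
rng = np.random.default_rng(1)
X = rng.uniform(0.2, 3.0, size=(400, 2))
y = 4000 * X[:, 0] ** 2 + rng.normal(0, 300, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Best parameters as reported by the grid search in this tutorial
best_params = {'max_depth': 70, 'n_estimators': 1100}

# Unpack the dictionary into the constructor and refit
tuned = RandomForestRegressor(random_state=1, n_jobs=-1, **best_params)
tuned.fit(X_train, y_train)
tuned_pred = tuned.predict(X_test)

print("MAE:", mean_absolute_error(y_test, tuned_pred))
print("R^2:", r2_score(y_test, tuned_pred))
```

With the default `refit=True`, `GridSearchCV` also keeps a model already refitted with the best parameters in its `best_estimator_` attribute, so `gridforest.best_estimator_.predict(X_test)` is an alternative to rebuilding the regressor by hand.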

For this model, we used all the diamond features, but some of them influence the price more than others. If we identify the most important features, we can use only those in the calculations and in this way improve the performance of the algorithm.
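One common way to rank features is the `feature_importances_` attribute of a fitted forest. The sketch below assumes a synthetic table in which only a hypothetical `carat` column drives the price, so the expected ranking is known in advance; it is an illustration, not the result for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the Diamonds dataset): price depends on carat only
rng = np.random.default_rng(2)
n = 500
X = pd.DataFrame({
    "carat": rng.uniform(0.2, 3.0, n),
    "depth": rng.uniform(55.0, 70.0, n),
    "table": rng.uniform(50.0, 65.0, n),
})
y = 4000 * X["carat"] ** 2 + rng.normal(0, 300, n)

regr = RandomForestRegressor(n_estimators=100, random_state=2).fit(X, y)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(
    regr.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values; `sklearn.inspection.permutation_importance` is an alternative worth checking against.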

To sum up, the Random Forest algorithm has some advantages over Lasso, Ridge, or OLS regression. It doesn't require data scaling and has higher prediction accuracy. Random Forest is also less prone to overfitting, and its hyperparameters are easier to tune. Linear regression methods may be the better choice only if you are sure that the relationship you are modeling is linear.