Is Random Forest better than Logistic Regression? (a comparison)

Delving into the nature of random forests, walking through an example, and comparing them to logistic regression.

Andrew Hershy · Jul 6

Introduction: Random forests are another way to extract information from a set of data.

The appeals of this type of model are:

It emphasizes feature selection, weighing certain features as more important than others.

It does not assume a linear relationship between the features and the target, as regression models do.

It utilizes ensemble learning.

If we were to use just 1 decision tree, we wouldn’t be using ensemble learning.

A random forest takes random samples, forms many decision trees, and then averages out the leaf nodes to get a clearer model.
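To make that concrete, here is a rough sketch of the bagging idea on synthetic data (not our patient data); note that a real random forest also randomizes which features each split may consider:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy illustration of bagging: many trees fit on bootstrap samples, averaged together.
X_toy, y_toy = make_classification(n_samples=255, n_features=4, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))   # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Average the trees' predicted probabilities and threshold at 0.5.
avg_proba = np.mean([t.predict_proba(X_toy)[:, 1] for t in trees], axis=0)
ensemble_pred = (avg_proba >= 0.5).astype(int)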

In this analysis we will classify the data with random forest, compare the results with logistic regression, and discuss the differences.

Take a look at the previous logistic regression analysis to see what we'll be comparing it to.

Table of Contents:
1. Data Understanding (Summary)
2. Data Exploration/Visualization (Summary)
3. Building the Model
4. Testing the Model
5. Conclusions

Data Background: We have a sample of 255 patients and would like to measure the relationship between 4 protein levels and cancer growth.

We know:

The concentration of each protein measured per patient.

Whether or not each patient has been diagnosed with cancer (0 = no cancer; 1 = cancer).

Our goal is: to predict whether future patients have cancer by extracting information from the relationship between protein levels and cancer in our sample.

The 4 proteins we'll be looking at: Alpha-fetoprotein (AFP), Carcinoembryonic antigen (CEA), Cancer Antigen 125 (CA125), and Cancer Antigen 50 (CA50).

I received this data set to use for educational purposes from the MBA program @UAB.
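For reference, loading the data would look something like the sketch below; the file name and column names are placeholders I'm assuming, not part of the original analysis.

import pandas as pd

# Assumed file and column names: AFP, CEA, CA125, CA50 and a 0/1 label column Y.
df = pd.read_csv('cancer_proteins.csv')
print(df.shape)                    # expect (255, 5)
print(df['Y'].value_counts())      # 0 = no cancer, 1 = cancer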

Data Exploration / Visualization

Again, take a look at the logistic regression analysis to get a more in-depth understanding.
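The class balance was plotted first; a minimal sketch of such a plot, assuming the label column is named Y as in the figure title:

import matplotlib.pyplot as plt

# Bar chart of the class balance behind Figure 2 (column name 'Y' assumed).
df['Y'].value_counts().sort_index().plot(kind='bar')
plt.title('Class (Y) Distribution')
plt.xlabel('Class (0 = no cancer, 1 = cancer)')
plt.ylabel('Count')
plt.show()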

Figure 2: Class (Y) distribution.

Building the Model

To refresh on the logistic regression output: CEA and CA125 were the most predictive, with p-values below the 5% alpha level and coefficients larger than the others.

We took AFP and CA50 out of the logistic regression due to their high p-values.

However, we will keep them in for the random forest model.

The whole purpose of this exercise is to compare the 2 models, not combine them.

We will build the decision tree and visualize what it looks like:

#Defining variables and building the model
features = list(df.columns[:4])   # first four columns assumed to be the protein levels
X, y = df[features], df['Y']      # 'Y' assumed to be the 0/1 cancer label
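A minimal sketch of the fit-and-plot step, using scikit-learn's DecisionTreeClassifier and plot_tree; the depth limit and plot settings here are illustrative, not the author's exact code.

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Fit a single decision tree on the sample and draw it (Figure 3).
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

plt.figure(figsize=(12, 6))
plot_tree(tree, feature_names=features, class_names=['no cancer', 'cancer'], filled=True)
plt.show()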

Figure 3 (decision tree): root node gini = 0.492, spread = 144, 111.

The regression model told us CEA is the most predictive feature, with the highest coefficient and the lowest p-value.

The decision tree agrees with this by placing CEA at the root node (the most important node).

The tree made the decision to split the dataset by CEA at the point 3.25.

That is the point where CEA splits the target variable most purely into cancerous and non-cancerous.

Anything lower than 3.25 (n = 144) has a stronger likelihood of being non-cancerous; anything above 3.25 (n = 111) will likely be cancerous.

In general, the lower the Gini score, the more purely the data is split with respect to the target variable.
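For example, a node that holds 144 patients of one class and 111 of the other has a Gini score of roughly 0.49, i.e. far from pure:

# Gini impurity = 1 - sum of squared class proportions (illustrative counts).
n_a, n_b = 144, 111
total = n_a + n_b
gini = 1 - (n_a / total) ** 2 - (n_b / total) ** 2
print(round(gini, 3))   # 0.492 -> a very impure node; a perfectly pure node would give 0.0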

The root node is selected to be the feature with the strongest split.

The rest of the tree’s decision nodes are derivative and work in the same way.

Random Forest

Instead of stopping there and basing our model on this single tree, we will implement a random forest: taking random samples, forming many decision trees, and averaging those trees' decisions to form a new model.
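A sketch of that step with scikit-learn's RandomForestClassifier, assuming the same column names as above and the 70/30 train/test split used in the next section (the number of trees is a common default, not the author's setting):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 70/30 split, then a forest of bootstrap-sampled trees (hyperparameters assumed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)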

Testing the Model

Confusion Matrix

Edit: I was talking with a friend in biostatistics about my analysis, and the convention in that field is to treat the disease as the positive class. I arbitrarily set cancer as negative because I didn't know that at the time.

Figure 4: confusion matrix layout.

#Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

Match the matrix above to Figure 4 to learn what it is saying:

34 of our model's guesses were True Positive: The model thought the patient had no cancer, and they indeed had no cancer.

21 of our model’s guesses were True Negative: The model thought the patient had cancer, and they indeed had cancer.

14 of our model's guesses were False Negative: The model thought the patient had cancer, but they actually didn't have cancer.

8 of our model's guesses were False Positive: The model thought the patient had no cancer, but they actually did have cancer.

30% of our total data went to the testing group, which leaves 255 × 0.3 ≈ 77 instances that were tested.

The sum of the matrix is 77.

Divide the “True” numbers by the total and that will give the accuracy of our model: 55/77 = 71%.
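That arithmetic can also be read straight off the matrix in code; a quick check using the cm and predictions from above:

import numpy as np
from sklearn.metrics import accuracy_score

# Correct predictions sit on the diagonal of the confusion matrix.
print(np.trace(cm) / cm.sum())            # (34 + 21) / 77 ≈ 0.71
print(accuracy_score(y_test, y_pred))     # same accuracy computed directly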