Only high-level concepts and a bivariate model example were discussed.

In this analysis we will look at more challenging data and learn more advanced techniques and interpretations.

Table of Contents:
1. Data Background
2. Data Exploration/Cleaning
3. Data Visualization
4. Building the Model
5. Testing the Model

Data Background:

Measuring certain protein levels in the body has been proven to be predictive in diagnosing cancer growth.

Doctors can perform tests to check these protein levels.

We have a sample of 255 patients and would like to learn about 4 proteins and their potential relationships with cancer growth.

We know:
- The concentration of each protein measured per patient.
- Whether or not each patient has been diagnosed with cancer (0 = no cancer; 1 = cancer).

Our goal is:
To predict whether future patients have cancer by extracting information from the relationship between protein levels and cancer in our sample.

The 4 proteins we’ll be looking at:
- Alpha-fetoprotein (AFP)
- Carcinoembryonic antigen (CEA)
- Cancer Antigen 125 (CA125)
- Cancer Antigen 50 (CA50)

I received this data set to use for educational purposes from the MBA program @UAB.

Data Exploration / Cleaning

Let’s jump into the analysis by pulling in the data and importing the necessary modules.
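A minimal sketch of that first step, assuming pandas is used. The column names and values here are made-up stand-ins; the real data set has 255 rows and is not reproduced here.

```python
import pandas as pd

# Hypothetical stand-in for the real data set (255 patients, 4 proteins, diagnosis)
df = pd.DataFrame({
    'AFP':    [3.1, 8.7, 1.2],
    'CEA':    [2.5, 9.9, 0.8],
    'CA125':  [14.0, 60.2, 9.5],
    'CA50':   [5.5, 21.0, 2.2],
    'cancer': [0, 1, 0],
})

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # check for missing values to clean
```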

This chart, like the distribution histogram, is also separated into 20 bins.

Combining our knowledge from the distribution and the proportions of our target variable above, we can intuitively determine there likely isn’t much predictive knowledge to be gained from this protein.

Let’s take it step by step.

The majority of patients have an AFP value of under 10, which is shown in the first 2 bars in Figure 4.

Because the majority of patients are in those first 2 bars, the change in Y between them in Figure 5 matters more than the changes in Y among the other bars.

The proportion of cancerous patients increases slightly from bar 1 to bar 2.

The proportion of cancerous patients decreases from bar 2 to 3.

After bar 3, there are so few patients left to analyze that they have little effect on the trend.

From what we can see here, the target variable looks mostly independent of changes in AFP.

The most significant change (bar 1 to 2) is very slight and the changes after that are not in the same direction.
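The per-bin proportions behind this reasoning can be reproduced with a groupby over equal-width bins. A sketch with hypothetical AFP values (the article uses 20 bins; 5 keeps this toy example readable):

```python
import pandas as pd

# Hypothetical AFP values and diagnoses illustrating the per-bin proportion idea
df = pd.DataFrame({'AFP':    [2, 4, 5, 7, 9, 12, 15, 22, 30, 48],
                   'cancer': [0, 1, 0, 0, 1,  1,  0,  1,  0,  1]})

# Split AFP into equal-width bins, as the histogram does
df['bin'] = pd.cut(df['AFP'], bins=5)

# Proportion of cancerous patients in each bin, as in Figure 5
proportions = df.groupby('bin', observed=True)['cancer'].mean()
print(proportions)
```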

Let’s see how the other proteins look.

CEA

Figure 6

Figure 7

CEA appears to tell a different story.

Figure 6 shows the distribution shape is similar to AFP; however, Figure 7 shows different changes in cancer rates.

Just like with AFP (due to the similar distribution shape), the most significant change in cancer rate is between bars 1 and 2.

The change from bar 1 to bar 2 went from around 63% noncancerous to 18% noncancerous (or to put that another way, 37% cancerous to 82% cancerous).

Additionally, the change from bin 2 to bin 3 is in the same direction: more cancer.

```python
logreg.fit(X_train, y_train)

#Calculating the accuracy of the training model on the testing data
accuracy = logreg.score(X_test, y_test)
print('The accuracy is: ' + str(accuracy * 100) + '%')
```

A good way to visualize the accuracy calculated above is with the use of a confusion matrix.

Below is the conceptual framework for a confusion matrix:

Figure 14

“Confusing” is the key word for a lot of people. Try to look at one row at a time; the top row is a good place to start.

This row is telling us how many instances were predicted to be benign.

If we look at the columns, we can see the split of the actual values within that prediction.

Just remember, rows are predictions and columns are the actual values.

```python
from sklearn.metrics import confusion_matrix

confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
```

Match the matrix above to Figure 14 to learn what it is saying:

- 39 of our model’s guesses were True Positives: the model thought the patient had no cancer, and they indeed had no cancer.
- 18 of our model’s guesses were True Negatives: the model thought the patient had cancer, and they indeed had cancer.
- 14 of our model’s guesses were False Negatives: the model thought the patient had cancer, but they actually didn’t have cancer.
- 6 of our model’s guesses were False Positives: the model thought the patient had no cancer, but they actually did have cancer.

30% of our total data went to the testing group, which leaves 255 × 0.3 ≈ 77 instances that were tested. The sum of the matrix is 77. Divide the “True” numbers by the total and that gives the accuracy of our model: 57/77 ≈ 74.03%.
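The arithmetic above can be spelled out in a few lines, using the counts reported from the confusion matrix:

```python
# Accuracy arithmetic from the confusion-matrix counts reported above
tp, tn = 39, 18   # correct guesses (the "True" cells)
fn, fp = 14, 6    # incorrect guesses

total = tp + tn + fn + fp        # 77 tested instances
accuracy = (tp + tn) / total     # "True" counts over the total
print(round(accuracy * 100, 2))  # 74.03
```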

Keep in mind, we randomly shuffled the data before performing this test.

I ran the regression a few times and got anywhere between 65% and 85% accuracy.

ROC Curve

Lastly, we are going to perform a Receiver Operating Characteristic (ROC) analysis as another way of testing our model. The 2 purposes of this test are to:

1. Determine where the best “cut-off” point is.
2. Determine how well the model classifies through another metric called “Area Under the Curve” (AUC).
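As a quick preview of the AUC metric: it is the area under the curve of true positive rate plotted against false positive rate, which can be approximated with the trapezoid rule. The ROC points below are toy values, not the article's data.

```python
import numpy as np

# Toy ROC points (fp_rate ascending) to show how AUC is computed
fp_rate = np.array([0.0, 0.25, 0.5, 1.0])
tp_rate = np.array([0.0, 0.6, 0.8, 1.0])

# Trapezoid rule: area under the tp_rate-vs-fp_rate curve
auc = np.sum(np.diff(fp_rate) * (tp_rate[1:] + tp_rate[:-1]) / 2)
print(round(auc, 3))  # 0.5 is random guessing, 1.0 is a perfect classifier
```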

We will be creating our ROC curve from scratch.

Below is all the code used to format a new dataframe to calculate the ROC, cutoff point, and AUC.

Index: This dataframe is sorted on the model_probability, so I reindexed for convenience.

CA125 and CEA: The original testing data protein levels.

model_probability: The probability, output by our training data’s logistic model, that an instance is classified as “1” (cancerous) based on the input testing protein levels.

The first row is the least likely instance to be classified as cancerous, with its high CA125 and low CEA levels.

y_test: The actual classifications of the testing data we are checking our model’s performance with.

The rest of the columns are based solely on “y_test”, not our model’s predictions.

Think of these values as their own confusion matrices.

These will help us determine where the optimal cut off point will be later.

tp (True Positive): This column starts at 0.

If y_test is ‘0’ (benign), this value increases by 1.

It is a cumulative tracker of all the potential true positives.

The first row is an example of this.

fp (False Positive): This column starts at 0.

If y_test is ‘1’ (cancerous), this value increases by 1.

It is a cumulative tracker of all potential false positives.

The fourth row is an example of this.

tn (True Negative): This column starts at 32 (the total number of 1’s in the testing set). If y_test is ‘1’ (cancerous), this value decreases by 1.

It is a cumulative tracker of all potential true negatives.

The fourth row is an example of this.

fn (False Negative): This column starts at 45 (the total number of 0’s in the testing set). If y_test is ‘0’ (benign), this value decreases by 1.

It is a cumulative tracker of all potential false negatives.

The fourth row is an example of this.

fp_rate (False Positive Rate): This is calculated by taking the row’s false positive count and dividing it by the total number of actual negatives (the 32 cancerous patients, in our case). It tells us the proportion of false positives we would incur by setting the cutoff point at that row.

We want to keep this as low as possible.

tp_rate (True Positive Rate): Also known as sensitivity, this is calculated by taking the row’s true positive count and dividing it by the total number of positives.

It tells us the proportion of true positives we would capture by setting the cutoff point at that row.

We want to keep this as high as possible.

accuracy: The sum of true positives and true negatives divided by the total number of instances (77, in our case).

Row by row, we are calculating the potential accuracy based on the possibilities of our confusion matrices.
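The column logic described above can be sketched as follows. The probabilities and labels here are a made-up five-row stand-in for the real 77-instance test set, and the false positive rate divides by the count of actual cancerous patients so the ROC curve can reach 1.0.

```python
import pandas as pd

# Hypothetical sorted probabilities and labels standing in for the test set
roc = pd.DataFrame({
    'model_probability': [0.10, 0.35, 0.55, 0.70, 0.90],
    'y_test':            [0,    0,    1,    0,    1],
})

n_pos = (roc['y_test'] == 0).sum()  # benign patients ("positives" here)
n_neg = (roc['y_test'] == 1).sum()  # cancerous patients

# Cumulative trackers of the potential confusion-matrix counts at each row
roc['tp'] = (roc['y_test'] == 0).cumsum()
roc['fp'] = (roc['y_test'] == 1).cumsum()
roc['tn'] = n_neg - roc['fp']
roc['fn'] = n_pos - roc['tp']

# Rates and potential accuracy at each candidate cutoff
roc['tp_rate'] = roc['tp'] / n_pos
roc['fp_rate'] = roc['fp'] / n_neg
roc['accuracy'] = (roc['tp'] + roc['tn']) / len(roc)
```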

Figure 15

I pasted the entire dataframe because it’s worthwhile to study it for a while and make sense of all the moving pieces.

After looking it over, try to find the highest accuracy percentage.

If you can locate that, you can match it to the corresponding model_probability to discover the optimal cut-off point for our data.
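Locating that optimum programmatically is a one-liner with idxmax. The dataframe here is a hypothetical four-row slice standing in for Figure 15, with values chosen to echo the reported optimum.

```python
import pandas as pd

# Hypothetical slice of the Figure 15 dataframe
df = pd.DataFrame({'model_probability': [0.41, 0.49, 0.54, 0.60],
                   'accuracy':          [0.62, 0.70, 0.75, 0.68]})

best = df.loc[df['accuracy'].idxmax()]  # row with the highest accuracy
print(best['model_probability'])        # the optimal cut-off point
```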

```python
#Plotting
plt.plot(df['model_probability'], df['accuracy'], color='c')
plt.xlabel('Model Probability')
plt.ylabel('Accuracy')
plt.title('Optimal Cutoff')

#Arrow and Star
plt.plot(0.535612, 0.753247, 'r*')
ax = plt.axes()
ax.arrow(0.41, 0.625, 0.1, 0.1, head_width=0.01, head_length=0.01, fc='k', ec='k')
plt.show()
```

Figure 16

The model probability is 54% where the accuracy is highest, at 75%.

It may seem counter-intuitive, but that means if we use 54% instead of 50% when classifying a patient as cancerous, the model will actually be more accurate.

If we want to maximize accuracy, we would set the threshold to 54%; however, due to the extreme nature of cancer, it is probably wise to lower our threshold below 50% to ensure patients who may have cancer are checked out anyway.
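Applying a custom threshold is just a comparison against the model's predicted probabilities. The proba values below are hypothetical stand-ins for what logreg.predict_proba would return.

```python
import numpy as np

# Hypothetical cancer probabilities, as would come from
# logreg.predict_proba(X_test)[:, 1]
proba = np.array([0.30, 0.48, 0.54, 0.66])

# Accuracy-maximizing threshold (54%) vs. a more cautious threshold
# below 50% that flags borderline patients for follow-up
optimal = (proba >= 0.54).astype(int)
cautious = (proba >= 0.45).astype(int)
print(optimal.tolist(), cautious.tolist())  # [0, 0, 1, 1] [0, 1, 1, 1]
```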

In other words, false positives are more consequential than false negatives when it comes to cancer!

Lastly, let’s graph the ROC curve and find AUC:

```python
#Calculating AUC
AUC = 1-(np.
```