Machine Learning: Accuracy & Memorization vs Learning

This tutorial covers measures of accuracy in predictive analytics and machine learning scenarios. It also covers the pitfall of overfitting, where a model memorizes its sample data instead of learning a pattern that generalizes, which is what we actually want.

Please make sure you have gone through the previous articles in this series.

When creating machine learning models, the key question we all face is: "How accurately does my model generalize the inherent pattern when predicting the outcome?"

Sadly, the answer to this question isn't straightforward. Note that we want a model that can generalize; we will delve into this aspect later in this post. But going back to our key question: how do we know whether the model we have created is good enough? Let's see this with an example.

Example: We have to predict whether our potential customers are loan worthy or not. Let us say we know classification techniques and have employed one of them. Most of these techniques give us a probability of an input belonging to a particular class. In this case, let's say we get the predicted probability of each customer being loan worthy. We sort the data in descending order of probability, i.e. from most likely to be loan worthy (near 1) to least likely (near 0). In the Actual Outcome column, 1 means the customer was loan worthy and 0 means the customer was not.

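| CustNo | Probability | Actual Outcome |
|---|---|---|
| 6 | 0.9 | 1 |
| 5 | 0.7 | 1 |
| 10 | 0.65 | 0 |
| 4 | 0.6 | 0 |
| 8 | 0.55 | 1 |
| 1 | 0.5 | 0 |
| 3 | 0.4 | 1 |
| 9 | 0.3 | 1 |
| 2 | 0.2 | 0 |
| 7 | 0.1 | 0 |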

Logically, we may infer that any customer with a probability greater than 0.5 is loan worthy and any other customer is not. This value, 0.5, is called the cutoff. Using 0.5 as the cutoff, we get the following confusion matrix, where LW means loan worthy and NLW means not loan worthy:

| Predicted (cutoff 0.5) | Actual LW | Actual NLW |
|---|---|---|
| LW | 3 | 2 |
| NLW | 2 | 3 |

A confusion matrix gives us a quick peek at how our model fared in predicting classes. It is also called an error matrix.

In this case (cutoff 0.5), the model classified 2 loan worthy customers as not loan worthy (bottom left) and 2 not loan worthy customers as loan worthy (top right). In all, it misclassified 4 customers and classified 6 correctly. So our overall accuracy in this case is (3+3)/10 = 0.6.
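The tally can be reproduced in a few lines of code. The following is a minimal Python sketch, not code from the original article: the names `CUSTOMERS`, `confusion_counts`, and `accuracy` are illustrative assumptions, and the rule "predict loan worthy when the probability is strictly greater than the cutoff" is taken from the text.

```python
# Customer table from the example: (CustNo, predicted probability, actual outcome).
# Actual outcome 1 = loan worthy (LW), 0 = not loan worthy (NLW).
CUSTOMERS = [
    (6, 0.9, 1), (5, 0.7, 1), (10, 0.65, 0), (4, 0.6, 0), (8, 0.55, 1),
    (1, 0.5, 0), (3, 0.4, 1), (9, 0.3, 1), (2, 0.2, 0), (7, 0.1, 0),
]


def confusion_counts(rows, cutoff):
    """Return (LW predicted LW, NLW predicted LW, LW predicted NLW, NLW predicted NLW)."""
    lw_as_lw = nlw_as_lw = lw_as_nlw = nlw_as_nlw = 0
    for _cust_no, prob, actual in rows:
        predicted_lw = prob > cutoff  # strictly greater than the cutoff, as in the text
        if predicted_lw and actual == 1:
            lw_as_lw += 1
        elif predicted_lw:
            nlw_as_lw += 1
        elif actual == 1:
            lw_as_nlw += 1
        else:
            nlw_as_nlw += 1
    return lw_as_lw, nlw_as_lw, lw_as_nlw, nlw_as_nlw


def accuracy(lw_as_lw, nlw_as_lw, lw_as_nlw, nlw_as_nlw):
    """Share of customers whose predicted class matches the actual outcome."""
    total = lw_as_lw + nlw_as_lw + lw_as_nlw + nlw_as_nlw
    return (lw_as_lw + nlw_as_nlw) / total


counts = confusion_counts(CUSTOMERS, cutoff=0.5)
print(counts)             # (3, 2, 2, 3)
print(accuracy(*counts))  # 0.6
```

Passing a different `cutoff` to `confusion_counts` gives the counts for any other threshold on the same table.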

Though we picked 0.5 as the cutoff by intuition, it is worthwhile to check whether we get better accuracy with other cutoffs. Let's inspect the confusion matrices for cutoffs 0.3 and 0.6.

| Predicted (cutoff 0.3) | Actual LW | Actual NLW |
|---|---|---|
| LW | 4 | 3 |
| NLW | 1 | 2 |

Accuracy = (4+2)/10 = 0.6

| Predicted (cutoff 0.6) | Actual LW | Actual NLW |
|---|---|---|
| LW | 2 | 1 |
| NLW | 3 | 4 |

Accuracy = (2+4)/10 = 0.6

Looking at the accuracy values, all three cutoffs classify 6 of the 10 customers correctly, so statistically speaking a cutoff of 0.5 still looks like a reasonable choice. Now we introduce another dimension to our problem: the cost of misclassification. Accuracy treats all types of misclassification as equal. This may not be true in business. In our case, wrongly giving a loan to a person who will default is much more costly than denying a loan to a customer who would pay back on time.

Let us say that wrongly classifying a not loan worthy customer as loan worthy costs us $50, and wrongly classifying a loan worthy customer as not loan worthy costs us $10. Now let's inspect how our model fares in terms of misclassification cost for the cutoffs discussed earlier.

| Cutoff | # LW classified as NLW | # NLW classified as LW | Cost of LW classified as NLW ($) | Cost of NLW classified as LW ($) | Total cost ($) | Overall accuracy |
|---|---|---|---|---|---|---|
| 0.6 | 3 | 1 | 30 | 50 | 80 | 0.6 |
| 0.5 | 2 | 2 | 20 | 100 | 120 | 0.6 |
| 0.3 | 1 | 3 | 10 | 150 | 160 | 0.6 |

Observe that the total cost of misclassification is highest ($160) for cutoff 0.3 and lowest ($80) for cutoff 0.6.

So we incur the lowest cost when we choose cutoff 0.6, even though accuracy alone could not tell the three cutoffs apart.
Cost of misclassification is an important consideration when preferring one type of error over another. Usually, the costs of misclassification should be taken as an input from the business.
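To make the cost comparison concrete, here is a hedged Python sketch that recomputes the total misclassification cost from the customer table for any cutoff. The $50 and $10 costs and the "probability strictly greater than the cutoff" rule come from the example; the constant and function names are hypothetical and chosen only for illustration.

```python
# Customer table from the example: (CustNo, predicted probability, actual outcome).
CUSTOMERS = [
    (6, 0.9, 1), (5, 0.7, 1), (10, 0.65, 0), (4, 0.6, 0), (8, 0.55, 1),
    (1, 0.5, 0), (3, 0.4, 1), (9, 0.3, 1), (2, 0.2, 0), (7, 0.1, 0),
]

COST_NLW_AS_LW = 50  # dollars lost by lending to a customer who is not loan worthy
COST_LW_AS_NLW = 10  # dollars lost by turning away a loan worthy customer


def misclassification_cost(rows, cutoff):
    """Total cost of both error types when predicting LW for prob > cutoff."""
    lw_as_nlw = sum(1 for _, prob, actual in rows if actual == 1 and not prob > cutoff)
    nlw_as_lw = sum(1 for _, prob, actual in rows if actual == 0 and prob > cutoff)
    return lw_as_nlw * COST_LW_AS_NLW + nlw_as_lw * COST_NLW_AS_LW


def cheapest_cutoff(rows, cutoffs):
    """Among the candidate cutoffs, return the one with the lowest total cost."""
    return min(cutoffs, key=lambda cutoff: misclassification_cost(rows, cutoff))


# For example, cutoff 0.5 makes 2 errors of each type: 2 * $10 + 2 * $50 = $120.
for cutoff in (0.3, 0.5, 0.6):
    print(cutoff, misclassification_cost(CUSTOMERS, cutoff))
print("cheapest:", cheapest_cutoff(CUSTOMERS, (0.3, 0.5, 0.6)))
```

Keeping the two costs as named constants makes it easy to rerun the comparison when the business supplies different figures.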

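We can also compare cutoffs using an ROC curve, which plots the true positive rate against the false positive rate at every cutoff; the area under the curve (AUC) summarizes how well the model separates the two classes across all cutoffs. We can also find the best cutoff using cost curves. But we will leave these for another discussion.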
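As a small preview only, the sketch below computes ROC points and the area under them for the ten customers in the example, in plain Python. It is an illustration under stated assumptions rather than a full treatment: the function names are hypothetical, each distinct probability is used as a threshold with "predict loan worthy when probability >= threshold", and the area uses the trapezoidal rule.

```python
# Customer table from the example: (CustNo, predicted probability, actual outcome).
CUSTOMERS = [
    (6, 0.9, 1), (5, 0.7, 1), (10, 0.65, 0), (4, 0.6, 0), (8, 0.55, 1),
    (1, 0.5, 0), (3, 0.4, 1), (9, 0.3, 1), (2, 0.2, 0), (7, 0.1, 0),
]


def roc_points(rows):
    """Return (false positive rate, true positive rate) pairs, starting at (0, 0)."""
    positives = sum(1 for _, _, actual in rows if actual == 1)
    negatives = len(rows) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct probability at a time.
    for threshold in sorted({prob for _, prob, _ in rows}, reverse=True):
        true_pos = sum(1 for _, prob, actual in rows if prob >= threshold and actual == 1)
        false_pos = sum(1 for _, prob, actual in rows if prob >= threshold and actual == 0)
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under points ordered by increasing false positive rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(CUSTOMERS)
print(points[-1])                          # (1.0, 1.0): everyone predicted loan worthy
print(round(area_under_curve(points), 2))  # 0.68
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so this model sits between the two.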

Memorization vs Learning

Earlier, when we talked about creating a model that generalizes, we emphasized that generalizing a pattern is an important aspect. We can build a really complex model that does very well on sample data but fails miserably when faced with unseen data. On the other hand, we may build a really simple model that does not do as well on our sample but consistently gives similar accuracy on unseen values. Which model would you choose?

In our college days we always had two kinds of students. The first kind would memorize all the texts, all possible problems, and their solutions; these students would do very well in exams. The second kind would understand the concepts and then try to solve the exam problems using that understanding. These students probably did not do as well in exams as the first kind. But now you have to pick one of these students for a vacancy. Which kind of student would you pick?

That is precisely the difference between memorization and learning. The more general the model is, the better its chances of doing well when faced with unknown values. Overfitting and underfitting are the two extremes. As data scientists, we must be able to find the golden spot between them. There are measures that can help us find that golden spot for a specific model. We will discuss those when we cover the different types of techniques in detail.
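The gap between memorizing and learning can be seen in a small experiment. The Python sketch below is an assumption-laden illustration, not a method from this series: it generates synthetic data in which the true label is 1 when `x > 0.5`, flips 20% of the labels as noise, and compares a "memorizer" (which copies the label of the closest sample point) with a simple one-cutoff rule. All names, the noise level, and the sample sizes are hypothetical.

```python
import random


def make_data(n, rng, noise=0.2):
    """Synthetic points: label is 1 when x > 0.5, then flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data


def accuracy(predict, data):
    """Share of points for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def memorizer(train):
    """Complex model: answer with the label of the closest sample point."""
    def predict(x):
        return min(train, key=lambda row: abs(row[0] - x))[1]
    return predict


def threshold_rule(train):
    """Simple model: one cutoff on x, chosen for best accuracy on the sample."""
    candidates = [i / 20 for i in range(21)]
    best = max(candidates, key=lambda c: accuracy(lambda x: int(x > c), train))
    return lambda x: int(x > best)


rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)  # plays the role of unseen data

for name, model in (("memorizer", memorizer(train)), ("threshold rule", threshold_rule(train))):
    print(name, "sample:", accuracy(model, train), "unseen:", accuracy(model, test))
```

The memorizer reproduces every sample label, noise included, so its sample accuracy is perfect by construction; the number to watch is how much each model's accuracy changes between the sample and the unseen data.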