The Best Metric to Measure Accuracy of Classification Models

Measuring the accuracy of a model for a classification problem (categorical output) is more complex and time-consuming than for a regression problem (continuous output). Let’s walk through the key testing metrics for a classification problem with the help of an example.

By Jacob Joseph, CleverTap.

Unlike evaluating models that predict a continuous or discrete dependent variable, such as Linear Regression models, evaluating the accuracy of a classification model can be complex and time-consuming. Before measuring the accuracy of a classification model, an analyst would first measure its robustness with the help of metrics such as AIC-BIC, AUC-ROC, AUC-PR, the Kolmogorov-Smirnov chart, etc. The next logical step is to measure its accuracy. To understand the complexity behind measuring accuracy, we need to know a few basic concepts.

Model Output

Most classification models output a probability for each observation in the dataset.

E.g., a classification model like Logistic Regression outputs a probability between 0 and 1 rather than the actual value of the target variable, such as Yes/No.
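As a rough illustration of why the output lies between 0 and 1, logistic regression passes a linear score through the logistic (sigmoid) function. The sketch below assumes a hypothetical one-feature model; the intercept, the coefficient, and the `amount` feature are made up for illustration and do not come from the article’s dataset.

```python
import math


def logistic(score: float) -> float:
    """Map a real-valued score to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical intercept and coefficient, chosen only for illustration.
INTERCEPT = -3.0
COEF_AMOUNT = 0.002


def fraud_probability(amount: float) -> float:
    """Probability of the positive class (Fraud) for the hypothetical model."""
    return logistic(INTERCEPT + COEF_AMOUNT * amount)


for amount in (100.0, 1500.0, 4000.0):
    print(f"amount={amount:>7.1f} -> P(Fraud)={fraud_probability(amount):.3f}")
```

Whatever the score, the result is a probability rather than a Yes/No label, which is why a separate translation step is needed.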

The next logical step is to translate this probability into the target/dependent variable and test the accuracy of the model. To understand the implications of this translation, let’s go through a few basic concepts for evaluating a classification model with the help of the example below.

The fraud observations constitute just 0.1% of the entire dataset, a typical case of imbalanced classes. Imbalanced classes arise in classification problems where the classes are not represented equally. Suppose you created a model that predicted 95% of the transactions as Non-Fraud, and all of those Non-Fraud predictions turned out to be accurate. That high accuracy for Non-Frauds shouldn’t get you excited: Frauds are just 0.1% of the observations, whereas Predicted Frauds constitute 5%, so most of the transactions flagged as fraud are not actually fraud.
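The arithmetic behind that caution can be sketched as follows. The dataset size of 100,000 is a hypothetical figure picked to keep the numbers round; only the percentages (0.1% fraud, 95% predicted Non-Fraud, all of them correct) come from the scenario described above.

```python
# Hypothetical dataset size; the percentages come from the scenario in the text.
n_total = 100_000
actual_fraud = n_total // 1000               # 0.1% of observations are fraud
predicted_non_fraud = n_total * 95 // 100    # 95% are predicted as Non-Fraud
predicted_fraud = n_total - predicted_non_fraud  # the remaining 5%

# Every Non-Fraud prediction is correct, so no fraud is missed:
# all actual frauds must sit among the transactions flagged as fraud.
correct_non_fraud = predicted_non_fraud
caught_frauds = actual_fraud
false_alarms = predicted_fraud - caught_frauds  # non-frauds flagged as fraud

overall_accuracy = (caught_frauds + correct_non_fraud) / n_total
share_of_flagged_that_is_fraud = caught_frauds / predicted_fraud

print(f"Flagged as fraud:        {predicted_fraud}")
print(f"  of which truly fraud:  {caught_frauds}")
print(f"  of which false alarms: {false_alarms}")
print(f"Overall accuracy:        {overall_accuracy:.1%}")
print(f"Flagged that are fraud:  {share_of_flagged_that_is_fraud:.1%}")
```

Under these assumptions the model looks accurate overall, yet only a small fraction of the flagged transactions are genuine frauds.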

Assuming you were able to translate the output of your model to Fraud/Non-Fraud, the predicted result could be compared to the actual result and summarized as follows:

a) True Positives: Observations where the transactions were actually fraud and were predicted to be fraud

b) True Negatives: Observations where the transactions weren’t actually fraud and weren’t predicted to be fraud

c) False Positives: Observations where the transactions weren’t actually fraud but were predicted to be fraud

d) False Negatives: Observations where the transactions were actually fraud but weren’t predicted to be fraud

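The Confusion Matrix is a popular way to represent the summarized findings.

| | Predicted: Fraud | Predicted: Non-Fraud |
|---|---|---|
| Actual: Fraud | True Positives (TP) | False Negatives (FN) |
| Actual: Non-Fraud | False Positives (FP) | True Negatives (TN) |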
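The four cells can be tallied by comparing each actual label with its predicted label. The following is a minimal sketch; the function name `confusion_counts` and the six labels are hypothetical, chosen only to show the counting.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="Fraud"):
    """Count TP, FN, FP and TN by comparing actual and predicted labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1   # fraud, predicted fraud
        elif a == positive:
            counts["FN"] += 1   # fraud, predicted non-fraud
        elif p == positive:
            counts["FP"] += 1   # non-fraud, predicted fraud
        else:
            counts["TN"] += 1   # non-fraud, predicted non-fraud
    return {cell: counts[cell] for cell in ("TP", "FN", "FP", "TN")}


# Hypothetical labels, for illustration only.
actual = ["Fraud", "Fraud", "Non-Fraud", "Non-Fraud", "Non-Fraud", "Fraud"]
predicted = ["Fraud", "Non-Fraud", "Fraud", "Non-Fraud", "Non-Fraud", "Fraud"]

print(confusion_counts(actual, predicted))
```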

Typically, a classification model outputs the result in the form of probabilities as shown below:

First 5 rows of the dataset:

| Observation | Actual | Predicted |
|---|---|---|
| 1 | Non-Fraud | 0.45 |
| 2 | Non-Fraud | 0.10 |
| 3 | Fraud | 0.67 |
| 4 | Non-Fraud | 0.60 |
| 5 | Non-Fraud | 0.11 |

Suppose we assume 0.5 as the cut-off probability, i.e., observations with a probability of 0.5 and above are marked as Fraud and those below 0.5 are marked as Non-Fraud.

Accordingly, the first 5 rows above become:

| Observation | Actual | Predicted |
|---|---|---|
| 1 | Non-Fraud | Non-Fraud |
| 2 | Non-Fraud | Non-Fraud |
| 3 | Fraud | Fraud |
| 4 | Non-Fraud | Fraud |
| 5 | Non-Fraud | Non-Fraud |
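The cut-off step for these five rows can be sketched as below. The probabilities and actual labels are the ones listed above; the helper name `to_label` is an assumption made for this example.

```python
CUTOFF = 0.5

# First 5 rows of the dataset, as listed above.
actual = ["Non-Fraud", "Non-Fraud", "Fraud", "Non-Fraud", "Non-Fraud"]
probabilities = [0.45, 0.10, 0.67, 0.60, 0.11]


def to_label(probability, cutoff=CUTOFF):
    """Mark an observation as Fraud when its probability is at or above the cut-off."""
    return "Fraud" if probability >= cutoff else "Non-Fraud"


predicted = [to_label(p) for p in probabilities]

for obs, (a, p, label) in enumerate(zip(actual, probabilities, predicted), start=1):
    print(f"{obs}  actual={a:<9}  probability={p:.2f}  predicted={label}")
```

The predicted label depends on the chosen cut-off: with a cut-off of 0.65, for instance, observation 4 (probability 0.60) would be marked as Non-Fraud instead of Fraud.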

Let’s summarize the model’s results for the entire dataset with the help of the confusion matrix:

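| | Predicted: Fraud | Predicted: Non-Fraud |
|---|---|---|
| Actual: Fraud | TP = 90 | FN = 10 |
| Actual: Non-Fraud | FP = 10 | TN = 999,890 |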

All the cells in the above matrix are non-zero. So is this result ideal?

Wouldn’t we love a scenario wherein the model accurately identifies the Frauds and the Non-Frauds, i.e., zero entries in the FP and FN cells?

A BIG YES.

Now consider a scenario wherein, as a marketing analyst, you would like to identify users who are likely to buy but haven’t bought yet. This class of users would share the characteristics of the users who bought. Such users would fall under False Positives: users who were predicted to transact but didn’t transact in reality. Hence, in addition to non-zero entries in TP and TN, you would prefer a non-zero entry in FP too. Thus, the model accuracy depends on the goal of the prediction exercise.
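In the marketing scenario, the False Positives are exactly the list the analyst wants. A minimal sketch, assuming hypothetical user records of the form (user id, actually bought, predicted to buy):

```python
# Hypothetical records: (user_id, actually_bought, predicted_to_buy)
users = [
    ("u1", True, True),    # True Positive: predicted to buy and bought
    ("u2", False, True),   # False Positive: predicted to buy, hasn't bought yet
    ("u3", False, False),  # True Negative
    ("u4", True, False),   # False Negative
    ("u5", False, True),   # False Positive
]


def likely_buyers_not_yet_bought(records):
    """Return the ids of users predicted to buy who have not bought (the FP cell)."""
    return [user_id for user_id, bought, predicted in records
            if predicted and not bought]


prospects = likely_buyers_not_yet_bought(users)
print(prospects)
```

Here a non-empty False Positive cell is useful output rather than an error, which is the sense in which the goal of the exercise shapes how the matrix is read.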

Key Testing Metrics

Since we are now comfortable with the interpretation of the Confusion Matrix, let’s look at some popular metrics used for testing classification models:

i) Sensitivity/Recall

Sensitivity, also known as the True Positive Rate or Recall, is calculated as,