Classification metrics


It is best to start studying classification metrics with the case of binary classification (i.e. when there are two classes). We will consider an example of binary image classification, where the goal is to determine whether there is a dog in the picture or not. If there is a dog in the picture, we will mark it with the tag “dog”, and if not, with “no dog”.

To describe the main classification metrics, it is worth introducing the following definitions:

True Positives (TP): True positives are the cases when the actual class of the data point was 1 (True) and the predicted class is also 1 (True). (There is a dog in the picture and we tag it as “dog”.)

True Negatives (TN): True negatives are the cases when the actual class of the data point was 0 (False) and the predicted class is also 0 (False). (There are no dogs in the picture and we tag it as “no dog”.)

False Positives (FP): False positives are the cases when the actual class of the data point was 0 (False) and the predicted class is 1 (True). False, because the model has predicted incorrectly, and positive, because the predicted class was the positive one (1). (There are no dogs in the picture and we tag it as “dog”.)

False Negatives (FN): False negatives are the cases when the actual class of the data point was 1 (True) and the predicted class is 0 (False). False, because the model has predicted incorrectly, and negative, because the predicted class was the negative one (0). (There is a dog in the picture and we tag it as “no dog”.)

Together they make up the confusion matrix:

                     Predicted: dog    Predicted: no dog
    Actual: dog      TP                FN
    Actual: no dog   FP                TN
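As an illustration, here is a minimal sketch (with assumed toy labels, encoding 1 = “dog” and 0 = “no dog”) that counts the four cells of the confusion matrix:

```python
# Toy ground-truth and predicted labels: 1 = "dog", 0 = "no dog" (assumed example data)
actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

# Count each cell of the confusion matrix
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

print(tp, tn, fp, fn)  # -> 3 3 1 1
```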

The most common classification metric is Accuracy. It is defined as the ratio between the number of correct predictions and the total number of predictions made, or, in terms of the confusion matrix:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
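A minimal sketch of this ratio, assuming the four confusion-matrix counts are already known:

```python
def accuracy(tp, tn, fp, fn):
    # Accuracy = correct predictions / all predictions
    return (tp + tn) / (tp + tn + fp + fn)

# Example counts (assumed): 3 TP, 3 TN, 1 FP, 1 FN
print(accuracy(3, 3, 1, 1))  # -> 0.75
```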

Another popular metric is Precision:

    Precision = TP / (TP + FP)

Precision tells you how precise/accurate your model is: out of the data points predicted positive, how many are actually positive. (In our case: what fraction of the pictures we tagged as “dog” actually contain dogs.)

Often, a metric called Recall is used together with Precision:

    Recall = TP / (TP + FN)

Recall is the fraction of relevant instances that have been retrieved out of the total number of relevant instances. (In our case: what fraction of the pictures containing dogs we tagged as “dog”.)

If you want to seek a balance between Precision and Recall, you can use the F1-Measure (or F1-Score), which is the harmonic mean of these two quantities:

    F1 = 2 · Precision · Recall / (Precision + Recall)
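The three metrics can be sketched together as follows (the example counts are assumed):

```python
def precision(tp, fp):
    # Fraction of positive predictions that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were found
    return tp / (tp + fn)

def f1_score(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# Example counts (assumed): 3 TP, 1 FP, 1 FN
p = precision(3, 1)    # 0.75
r = recall(3, 1)       # 0.75
print(f1_score(p, r))  # -> 0.75
```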

Multiclass case

All these metrics can also be defined in the multi-class setting. Here, the metrics can be "averaged" across all the classes in many possible ways. Some of them are:

micro: Calculate metrics globally by counting the total true positives, false positives, and false negatives over all classes.

macro: Calculate metrics for each "class" independently, and find their unweighted mean. This does not take label imbalance into account.
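The difference between the two averaging schemes can be sketched on toy multi-class labels (the class names and data below are assumptions for illustration), here using Precision:

```python
# Toy multi-class ground truth and predictions (assumed example data)
actual    = ['cat', 'dog', 'dog', 'bird', 'cat', 'dog']
predicted = ['cat', 'dog', 'cat', 'bird', 'cat', 'bird']

classes = sorted(set(actual))

def counts_for(cls):
    # One-vs-rest TP/FP for a single class
    tp = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
    fp = sum(1 for a, p in zip(actual, predicted) if a != cls and p == cls)
    return tp, fp

# micro: pool the counts over all classes, then compute a single precision
total_tp = sum(counts_for(c)[0] for c in classes)
total_fp = sum(counts_for(c)[1] for c in classes)
micro_precision = total_tp / (total_tp + total_fp)

# macro: compute precision per class, then take the unweighted mean
per_class = []
for c in classes:
    tp, fp = counts_for(c)
    per_class.append(tp / (tp + fp) if (tp + fp) else 0.0)
macro_precision = sum(per_class) / len(classes)

print(micro_precision)  # pooled over all classes
print(macro_precision)  # unweighted mean over classes
```

Note that macro averaging gives every class equal weight regardless of how many samples it has, which is why it ignores label imbalance.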

“src” - the mode in which the plugin will work with the project. Currently the Classification metrics plugin supports two modes: 1) “image” and 2) “object”. In “image” mode we operate with image tags:
And in “object” mode, with object tags:
Note: in “object” mode the plugin works correctly only when each image contains exactly one object.