Area Under the ROC Curve

The ROC curve and the Area Under the ROC Curve (AUC) are widely used metrics for binary (i.e., positive or negative) classification problems, such as those solved with Logistic Regression.

Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can therefore evaluate a classifier by comparing the predicted probabilities with the ground-truth positive/negative labels.

Assume that we have a table containing predicted scores (i.e., probabilities) and truth labels, as follows:

| probability (predicted score) | truth label |
|---|---|
| 0.5 | 0 |
| 0.3 | 1 |
| 0.2 | 0 |
| 0.8 | 1 |
| 0.7 | 1 |

Once the rows are sorted by probability in descending order, AUC measures how often positive (label=1) samples are ranked above negative (label=0) samples. If most positive rows receive larger scores than negative rows, the AUC is large, which indicates that the classifier performs well.
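To make the ranking interpretation concrete, the following is a minimal Python sketch that computes AUC for the five rows above by counting (positive, negative) pairs in which the positive sample has the larger score. It is only an illustration and not Hivemall's implementation: the function name pairwise_auc is hypothetical, and counting a tied pair as half a correctly ranked pair is an assumption (the sample table contains no ties).

```python
from itertools import product


def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Hypothetical helper for illustration only; ties count as 0.5 (an assumption).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    correctly_ranked = 0.0
    for pos_score, neg_score in product(positives, negatives):
        if pos_score > neg_score:
            correctly_ranked += 1.0
        elif pos_score == neg_score:
            correctly_ranked += 0.5

    return correctly_ranked / (len(positives) * len(negatives))


# Rows of the sample table: (probability, truth label)
scores = [0.5, 0.3, 0.2, 0.8, 0.7]
labels = [0, 1, 0, 1, 1]

print(round(pairwise_auc(scores, labels), 4))  # 0.8333
```

There are 3 positive and 2 negative rows, giving 6 pairs. Only the pair (0.3, 0.5) is ranked incorrectly, so the result is 5/6. This pairwise loop is quadratic in the number of rows and is meant for understanding the metric, not for large tables.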

Compute AUC on Hivemall

In Hivemall, the function auc(double score, int label) computes AUC from pairs of predicted probability and truth label.

Sequential AUC computation on a single node

For instance, the following query computes the AUC for the table shown above: