2.4 Subgroup Search and Evaluation in the ROC and TP/FP Space

The goal of this section is to clarify the relation between the ROC
space which is usually used for evaluating classifier performance, and
the TP/FP space which is being searched by the qg heuristic in
the SD algorithm.

Evaluation of induced subgroups in the ROC space (ROC: Receiver
Operating Characteristic,
Provost & Fawcett, 2001) shows their
performance in terms of TPr and FPr, where TPr is the
sensitivity of a classifier measuring the fraction of positive
cases that are classified as positive, and FPr is the false
alarm measuring the fraction of incorrectly classified negative
cases:
,
and
.
A point in the ROC space shows
classifier performance in terms of false alarm rate FPr (plotted on the
X-axis) that should be as low as possible, and sensitivity TPr
(plotted on the Y-axis) that should be as high as possible (see
Figure 5 in Section 3.2).

The ROC space is appropriate for measuring the success of subgroup
discovery, since subgroups whose TPr/FPr tradeoff is close to the
diagonal can be discarded as uninteresting. Conversely, interesting
rules/subgroups are those sufficiently distant from the diagonal.
Those rules which are most distant from the diagonal define the points
in the ROC space from which a convex hull is constructed. The area
under the ROC curve defined by subgroups with the best TPr/FPr
tradeoff can be used as a quality measure for comparing the success of
different learners or subgroup miners. In subgroup construction, the
data analyst can try to achieve the desired TPr/FPr tradeoff by
building rules using different data mining algorithms, by different
parameter settings of a selected data mining algorithm or by applying
a cost-sensitive data mining algorithm that takes into the account
different misclassification costs.

The qg measure in the SD algorithm that needs to be maximized,
tries to find subgroups that are as far as possible from the diagonal
of the ROC space in the direcion of the left upper corner (with TPr
equal to 100% and FPr equal to 0%). Note, however, that the actual
computation, as implemented in Algorithm SD, is not performed in terms
of TPr and FPr, as assumed in the ROC analysis, but rather in
terms of TP and FP in the so-called TP/FP space. The reason is
the improved computational efficiency of computing the qg value
which is used as a search heuristic for comparing the quality of rules
for a given, fixed domain. For a fixed domain, the TP/FP space is as
appropriate as the ROC space: the ROC space is namely equivalent to
the normalized TP/FP space where Pos and Neg are normalization
constants for Y and X axes, respectively. The TP/FP space and
the ROC space are illustrated in Section 3.2 by
Figures 4 and 5,
respectively.