Classifiers that are deployed in the field can be used and evaluated
in ways that were not anticipated when the model was trained. The
ultimate evaluation metric may not have been known to the modeler at
training time, additional performance criteria may have been added,
the evaluation metric may have changed over time, or the real-world
evaluation procedure may have been impossible to simulate. In each of
these cases, an unforeseen way of measuring model utility can degrade
performance.
Our objective is to provide experimental support for modelers who face
potential "cross-metric" performance deterioration. First, to
identify model-selection metrics that lead to stronger cross-metric
performance, we characterize the expected loss when the selection
metric is held fixed and the evaluation metric is varied. Second, we
show that the amount of data available to the selection metric has a
substantial effect on which selection metric performs best. In
addressing both questions, we hypothesize that whether classifiers are
calibrated to output probabilities influences cross-metric
performance. Examining the role of calibration, our experiments
demonstrate that cross-entropy is the highest-performing selection
metric when little data is available for selection.
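To make the setup concrete, below is a minimal sketch of cross-metric model selection, assuming scikit-learn; the dataset, split sizes, and candidate models are illustrative assumptions, not the paper's experimental setup. Candidates are ranked on a deliberately small validation set by cross-entropy, then scored on a held-out test set under several possible evaluation metrics.

```python
# Sketch of "cross-metric" model selection: rank candidates on a small
# validation set with one selection metric (cross-entropy), then evaluate
# the chosen model under different metrics on held-out test data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0
)
# Deliberately small validation set (100 points): the size of the
# selection set is itself a variable of interest here.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.9, random_state=0
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for model in candidates.values():
    model.fit(X_train, y_train)

# Selection step: lowest cross-entropy (log loss) on the validation set wins.
def selection_loss(model):
    return log_loss(y_val, model.predict_proba(X_val))

best = min(candidates.values(), key=selection_loss)

# Evaluation step: the deployed metric may differ from the selection metric.
probs = best.predict_proba(X_test)[:, 1]
preds = (probs >= 0.5).astype(int)
print("accuracy:", accuracy_score(y_test, preds))
print("F1:      ", f1_score(y_test, preds))
print("AUC:     ", roc_auc_score(y_test, probs))
```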
With these results, modelers may be better positioned to choose
selection metrics that remain robust when it is uncertain which
evaluation metric will ultimately be applied.
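As a companion sketch, and again assuming scikit-learn with illustrative settings, the snippet below shows one common way a classifier can be calibrated to output probabilities before a probability-sensitive selection metric such as cross-entropy is applied; it is one possible instantiation of the calibration question raised above, not the paper's procedure.

```python
# Compare cross-entropy of a raw classifier against a calibrated copy.
# Platt scaling ("sigmoid") is fit via internal cross-validation.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

raw = RandomForestClassifier(n_estimators=100, random_state=0)
raw.fit(X_train, y_train)

calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="sigmoid",
    cv=3,
)
calibrated.fit(X_train, y_train)

# Cross-entropy rewards well-calibrated probabilities, so the calibrated
# model typically scores better under this selection metric.
print("raw        :", log_loss(y_val, raw.predict_proba(X_val)))
print("calibrated :", log_loss(y_val, calibrated.predict_proba(X_val)))
```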