Evaluation Framework

The two challenge datasets, from RadboudUMC and UMCUtrecht, will be combined and divided into a training set and a test set; the test set will be released shortly before the ISBI 2016 challenge event.

For the manuscript submission, algorithms should be evaluated on the training data, either through cross-validation or a hold-out experiment. The following two strategies will be used to evaluate the performance of the algorithms:

Slide-based Evaluation: Algorithms will be assessed on their ability to discriminate between slides containing metastasis and normal slides. Receiver operating characteristic (ROC) analysis will be performed at the slide level, and the area under the ROC curve (AUC) will be the measure used to compare the algorithms.

Lesion-based Evaluation: For the lesion-based evaluation, a free-response receiver operating characteristic (FROC) curve will be used. The FROC curve is defined as the plot of sensitivity versus the average number of false positives per image.

As this challenge evaluates algorithms for both WSI classification and metastasis localization/detection, there will be two main leaderboards for comparing the algorithms.

Results format

First Evaluation: Participants must provide a single CSV file in which the first column contains the name of the image and the second column contains the probability that the image contains metastasis.
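As a minimal sketch of this submission format, the snippet below writes such a two-column CSV with Python's standard library; the slide names and probabilities are hypothetical placeholders, not actual challenge data:

```python
import csv

# Hypothetical slide names mapped to predicted metastasis probabilities.
predictions = {
    "Test_001": 0.92,
    "Test_002": 0.07,
    "Test_003": 0.51,
}

with open("slide_predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for name, prob in predictions.items():
        # Column 1: image name; column 2: probability of containing metastasis.
        writer.writerow([name, prob])
```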

Second Evaluation: For each tumor region detected in a whole-slide image (WSI), you are required to provide the X and Y coordinates of the detected region, together with a confidence score representing the probability that the detected region is tumor. The results must be submitted as one CSV file per WSI, with the same filename as the WSI it refers to. Each row in the CSV file should correspond to one detected tumor region: the first column should contain the confidence score for the detection, and the second and third columns should contain the X and Y coordinates of the detection. See the figure below for an example CSV output.
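The per-WSI detection format can be sketched in the same way; the slide name and the detection coordinates below are hypothetical:

```python
import csv

# Hypothetical detections for a WSI named "Test_001":
# one (confidence, x, y) tuple per detected tumor region.
detections = [
    (0.98, 12340, 45670),
    (0.61, 8810, 30125),
]

# One CSV per WSI, named after the slide it refers to.
with open("Test_001.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for confidence, x, y in detections:
        # Column 1: confidence score; columns 2-3: X and Y coordinates.
        writer.writerow([confidence, x, y])
```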

Description of the second evaluation metric

This challenge evaluates the performance of the algorithms for lesion detection/localization. Detection/localization performance is summarized using free-response receiver operating characteristic (FROC) curves. FROC analysis is similar to ROC analysis, except that the false positive rate on the x-axis is replaced by the average number of false positives per image. In this challenge, a detection is considered a true positive if its location lies within an annotated ground-truth lesion.

If there are multiple findings for a single ground-truth region, they will be counted as a single true positive, and none of them will be counted as a false positive.

All detections that do not fall within a specified distance of the ground-truth annotations will be counted as false positives.

The final score used to rank teams on the second leaderboard is defined as the average sensitivity at six predefined false positive rates: 1/4, 1/2, 1, 2, 4, and 8 false positives per whole-slide image.
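Given an FROC curve sampled at a set of operating points, this score can be computed by reading off (or interpolating) the sensitivity at the six evaluation points and averaging. The sketch below assumes linear interpolation between sampled points, which is one reasonable choice but not necessarily what the official evaluation code does:

```python
import numpy as np

def froc_score(avg_fps, sensitivities, eval_points=(0.25, 0.5, 1, 2, 4, 8)):
    """Average sensitivity at the six predefined FP-per-WSI rates.

    avg_fps: average false positives per WSI at each threshold (increasing).
    sensitivities: corresponding sensitivity values.
    """
    # Linearly interpolate the FROC curve at each evaluation point.
    interp_sens = np.interp(eval_points, avg_fps, sensitivities)
    return float(np.mean(interp_sens))
```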

Evaluation code in Python and Matlab

For more detailed information about the challenge evaluation metrics, please refer to the Camelyon16 challenge forum. We have also provided participants with evaluation code in Python and Matlab; links are available inside the forum.