In recent years, increasing effort has gone into evaluating computer vision algorithms in general, and edge detection algorithms in particular. Most evaluation techniques use only a few test images, leaving open the question of how broadly their results can be interpreted. Our research tests the consistency of evaluation based on the receiver operating characteristic (ROC) curve, and demonstrates why consistent edge detector evaluation is difficult to achieve. We show how easily the ROC framework can be manipulated to rank three modern edge detectors in any order by making minor changes to the test imagery. We also note that at least some of the inconsistency results from the erratic behavior of the algorithms themselves, suggesting that it is still possible to create better edge detectors.
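For readers unfamiliar with ROC-based evaluation, the sketch below illustrates the basic idea: sweep a detector's response threshold and, at each setting, record the false positive rate against the true positive rate measured on ground-truth edge maps. This is a simplified, pixel-wise illustration under our own assumptions (the function name, the synthetic data, and exact pixel-level matching), not the matching protocol of any particular evaluation study; real evaluations typically match detections to ground truth within a spatial tolerance.

```python
import numpy as np

def roc_curve_for_edges(edge_strength, ground_truth, thresholds):
    """Trace a pixel-wise ROC curve for an edge detector.

    edge_strength: 2-D float array of per-pixel edge responses.
    ground_truth:  2-D boolean array marking true edge pixels.
    thresholds:    iterable of response thresholds to sweep.

    Returns a list of (false_positive_rate, true_positive_rate)
    points, one per threshold.
    """
    gt = ground_truth.astype(bool)
    n_pos = gt.sum()          # true edge pixels
    n_neg = gt.size - n_pos   # non-edge pixels
    points = []
    for t in thresholds:
        detected = edge_strength >= t
        tp = np.logical_and(detected, gt).sum()
        fp = np.logical_and(detected, ~gt).sum()
        points.append((fp / n_neg, tp / n_pos))
    return points

# Example with synthetic data (hypothetical detector output):
rng = np.random.default_rng(0)
strength = rng.random((64, 64))          # stand-in edge responses
truth = rng.random((64, 64)) < 0.1       # stand-in ground truth
for fpr, tpr in roc_curve_for_edges(strength, truth,
                                    np.linspace(0.0, 1.0, 11)):
    print(f"FPR={fpr:.3f}  TPR={tpr:.3f}")
```

Because each threshold yields one (FPR, TPR) point, a detector's curve depends on the choice of test images and ground truth; small changes to either can shift curves enough to reorder detectors, which is the inconsistency this work examines.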