Transcription

Methodologies for Evaluation of Standalone CAD System Performance
Berkman Sahiner, PhD, US FDA/CDRH/OSEL/DIAM
AAPM CAD Subcommittee in Diagnostic Imaging

INTRODUCTION: CAD: CADe AND CADx
- CADe: identify portions of an image to reveal abnormalities during interpretation by the reader
- CADx: provide an assessment of disease; specify disease severity, type, or stage to the reader
- Standalone assessment: assessment of the performance of the device alone
- Assessment of the effect of CAD on the reader is the subject of the next talk

CAD SYSTEM ASSESSMENT: Measure the performance of your system
- Inform users, regulators, the scientific community, and yourself
- Establish its effectiveness for use
- Compare with other systems with a similar intended use
- If you can't assess it, you will not know how to improve it

STANDALONE VERSUS WITH READERS
- The effect of CAD on the radiologists' performance is the ultimate test
- Currently, CAD devices in radiology are intended for use by radiologists, not for standalone or triage use
- The effect of CAD on the readers' performance may be more burdensome to assess than standalone performance

STANDALONE VERSUS WITH READERS: Merits of standalone assessment
- Potential impact at an early stage of development, prior to testing with readers
- Potentially large datasets, more amenable to subset analysis
- Reader variability is eliminated

COMPONENTS OF CAD ASSESSMENT
- Dataset
- Reference standard
- Mark-labeling
- Assessment metric

DATASETS
- Training: in theory, known properties of abnormals and normals may suffice for CAD; in practice, many parameters are determined using a training dataset
- Test: used for performance assessment
- Mixing training and test sets introduces optimistic bias into CAD assessment

DATASETS
- Images and data components used as inputs to the CAD system
- Other images necessary for the reference standard
- Other data to provide context and perform subgroup analysis: age, demographics, disease type, lesion size, concomitant diseases

TRAINING DATASET
- Ideally, covers the spectrum of the intended task
- May not need to be representative: a subgroup may be over-represented if thought to be more difficult or more important
- May include phantom images and electronically altered images

TEST DATASET
- Independent of the training dataset used at any stage of development
- Should include the range of abnormalities for the target population
- Image acquisition and patient preparation parameters should be consistent with those in the target population
- Should be large enough for adequate statistical power to demonstrate the study objectives

ENRICHMENT
- For a low-prevalence disease, enhance the dataset with cases containing the disease
- Will not affect sensitivity, specificity, or the area under the ROC curve
- In an observer study, may affect the readers' behavior

SPECTRUM OF DIFFICULTY
- If the spectrum of difficulty for the test cases differs from that of the intended population, test results may be biased
- Bias may be acceptable if two modalities are being compared and both are affected similarly by the spectrum bias

STRESS TESTING
- Study differences between competing modalities using cases selected to challenge those differences*
- Example in CADe: excluding obvious cases because they will be detected both with and without CAD
* RF Wagner et al., "Assessment of Medical Imaging Systems and Computer Aids: A Tutorial Review," Acad Radiol 14, (2007)

TEST DATASET REUSE
- Can I keep using the same test dataset while trying to improve my CAD system?
- Starting over with a completely new dataset is burdensome and does not promote enlarging the dataset, i.e., reducing uncertainty in performance estimates
- Danger: tuning the CAD system, explicitly or implicitly, to the test dataset

TEST DATASET REUSE
- Risks and benefits need to be weighed depending on:
- The stage of CAD algorithm design; e.g., an early-stage CAD design for a new modality should acknowledge dataset reuse
- How dataset reuse occurred; e.g., were detailed results reported back to algorithm design?

COMMON SEQUESTERED DATASET
- Some public datasets are available, but they are not sequestered
- A sequestered dataset for independent testing must ensure that CAD systems are not tuned to the sequestered dataset
- The dataset evolves over time and does not become obsolete

DATASET SUMMARY
- Very critical in both design and assessment
- For assessment purposes, training does not need to be optimal; the training dataset may not have to follow the distribution of the intended population
- An independent test dataset is essential
- Prevalence enrichment is often necessary

REFERENCE STANDARD
- Disease status: ideally, independent of the modality that the CAD is designed for
- Location and extent of disease: ideally, additional data or images are used to complement the images targeted by CAD

REFERENCE STANDARD: DISEASE STATUS
- Disease status is often known by biopsy, follow-up, or another method with very high accuracy
- In mammography, however*: 11-gauge vacuum-assisted biopsy: % rate of discordance; 14-gauge vacuum-assisted biopsy: % rate of discordance
- If long-term follow-up is missing, negative cases may have uncertainty
- In other situations, the imaging modality targeted by the CAD may itself be the standard of care (e.g., CT for pulmonary embolism)
* ES Burnside et al., "A probabilistic expert system that provides automated mammographic histologic correlation: Initial experience," AJR 182, (2004)

REFERENCE STANDARD: LOCATION AND EXTENT
- Required in CADe if truth location is part of the assessment, which is generally the case for standalone CADe assessment
- Other imaging data are often available to locate the disease: breast cancer: images acquired during biopsy; colon cancer: optical colonoscopy
- In other situations, additional imaging data may not be available (e.g., CT for pulmonary embolism)

VARIABILITY IN LOCATION AND EXTENT

LACK OF GOLD STANDARD
- Expert panel: combine the expert readers' interpretations into a reference standard
- Example: each reader first reads independently; interpretations are then merged using an adjudication method (majority vote, independent arbiter)
- Uncertainty in the truth remains

REFERENCE STANDARD - SUMMARY
- In practice, a perfect reference standard may be difficult to establish for many CAD applications
- Practical scenario: use as much information as possible, but recognize that the reference standard may not be perfect
- Expert panels may be beneficial, or may be the only option in some applications, at the cost of additional uncertainty in the truth

MARK-LABELING
- Rules for declaring a mark a TP or an FP
- Applies to CADe only
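The majority-vote adjudication with an independent arbiter, described under LACK OF GOLD STANDARD above, can be sketched in a few lines. This is a minimal illustration, not part of the talk: the function name and the callable-arbiter interface are hypothetical.

```python
from collections import Counter

def majority_adjudicate(reader_calls, arbiter=None):
    """Merge independent reader interpretations for one case.

    reader_calls: list of binary calls (1 = disease present, 0 = absent),
    one per reader. If there is no strict majority, the decision is
    deferred to an independent arbiter (here, a callable returning 0 or 1).
    """
    counts = Counter(reader_calls)
    pos, neg = counts[1], counts[0]
    if pos > neg:
        return 1
    if neg > pos:
        return 0
    if arbiter is None:
        raise ValueError("tie with no arbiter available")
    return arbiter(reader_calls)

# Three readers, strict majority: reference standard is "positive".
print(majority_adjudicate([1, 1, 0]))  # -> 1
# Two readers disagree: the arbiter breaks the tie.
print(majority_adjudicate([1, 0], arbiter=lambda calls: 1))  # -> 1
```

The resulting labels carry the panel's uncertainty with them, which is why the slides stress that the reference standard "may not be perfect."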

MARK-LABELING: BY A HUMAN
- A human may be a good judge of whether a mark points to an FP
- May be subjective
- The labeler should not have a stake in the outcome of the assessment, to reduce bias
- May be burdensome if repeated mark-labeling is desired

MARK-LABELING: AUTOMATED
- Compare the computer mark to the reference standard mark using an automated rule: overlap of the computer and reference standard marks; centers of the computer and reference standard marks; distance of centroids
- Some methods are better at the task than others

MARK-LABELING
(Figure: example of a reference mark.)

MARK-LABELING
- Most studies do not report the mark-labeling protocol
- Among randomly selected publications on CADe: nodule detection on CT: 47/58 (81%) did not report the mark-labeling protocol; polyp detection in CT colonography: 9/21 (43%) did not report the mark-labeling protocol

MARK-LABELING SUMMARY
- It is important to specify the mark-labeling method in a study
- It can have a major effect on the reported performance of the CADe system*
- Methods that have the potential to label clearly unhelpful marks as TPs should be avoided
* M Kallergi et al., "Evaluating the performance of detection algorithms in digital mammography," Med Phys 26, (1999)

MARK-LABELING SUMMARY
- If a parameter is used in mark-labeling, e.g., Area(intersection)/Area(union) > P_i/u, it is helpful to study how performance is affected when the mark-labeling parameter is modified (figure: sensitivity as a function of P_i/u)
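The intersection-over-union criterion above is easy to make concrete. A minimal sketch, assuming the computer mark and the reference annotation are available as binary masks (NumPy arrays); the function names and the 8x8 toy masks are illustrative, and the threshold sweep mirrors the suggested sensitivity-versus-P_i/u study:

```python
import numpy as np

def iou(mark, ref):
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(mark, ref).sum()
    union = np.logical_or(mark, ref).sum()
    return inter / union if union else 0.0

def label_mark(mark, ref, p_iu=0.3):
    """Label a computer mark TP if its IoU with the reference mark exceeds p_iu."""
    return "TP" if iou(mark, ref) > p_iu else "FP"

# Toy example: a 4x4 reference lesion and a partially overlapping computer mark.
ref = np.zeros((8, 8), bool); ref[2:6, 2:6] = True
mark = np.zeros((8, 8), bool); mark[3:7, 3:7] = True

# Sweep the labeling parameter to see where the TP/FP decision flips.
for p in (0.1, 0.3, 0.5, 0.7):
    print(p, label_mark(mark, ref, p))
```

Here IoU = 9/23 ≈ 0.39, so the same mark is a TP at P_i/u = 0.3 but an FP at P_i/u = 0.5; reporting sensitivity across such a sweep makes the dependence on the labeling parameter explicit.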

PERFORMANCE MEASURES: BINARY OUTPUT
- Many CAD systems internally produce continuous (or multi-level) scores; if so, assume a threshold has been applied
- CADx system with binary output: positive / negative
- CADe system that marks potential lesions: mark / no mark

CADx: TRUE- AND FALSE-POSITIVE FRACTIONS
- TPF = (number of units correctly called positive) / (total number of positive units)
- FPF = (number of units incorrectly called positive) / (total number of negative units)
- Unit: 2D or 3D image, region of interest, or case

CADe: LESION AND NON-LESION LOCALIZATION FRACTIONS
- Lesion localization fraction (LLF) ~ sensitivity
- Non-lesion localization fraction (NLF) ~ number of FPs per unit
- LLF = (number of correctly marked locations) / (total number of abnormalities)
- NLF = (number of incorrectly marked locations) / (total number of negative units)
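The four fractions above are simple counts. A minimal sketch under those definitions (function names are illustrative), assuming per-unit binary truth and output for CADx, and already-labeled marks for CADe:

```python
def tpf_fpf(truth, output):
    """CADx: truth and output are parallel lists of 0/1, one entry per unit."""
    pos = sum(truth)
    neg = len(truth) - pos
    tp = sum(1 for t, o in zip(truth, output) if t == 1 and o == 1)
    fp = sum(1 for t, o in zip(truth, output) if t == 0 and o == 1)
    return tp / pos, fp / neg

def llf_nlf(n_correct_marks, n_abnormalities, n_incorrect_marks, n_negative_units):
    """CADe: localization fractions computed from labeled marks."""
    return (n_correct_marks / n_abnormalities,
            n_incorrect_marks / n_negative_units)

# Two positive and two negative units; one of each called positive.
print(tpf_fpf([1, 1, 0, 0], [1, 0, 1, 0]))  # -> (0.5, 0.5)
# 8 of 10 lesions marked; 30 FP marks over 100 negative images.
print(llf_nlf(8, 10, 30, 100))              # -> (0.8, 0.3)
```

Note that the CADe fractions depend entirely on the mark-labeling rule used upstream, which is why the slides insist that the rule be reported.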

(TPF, FPF) AND (LLF, NLF) PAIRS
- Always report them in pairs
- Should always be accompanied by uncertainty estimates or confidence intervals
- TPF, FPF, LLF are binomial: normal approximation (Wald interval); more accurate: Agresti-Coull* or Jeffreys** interval
- NLF is Poisson: normal approximation (Wald interval); more accurate: Jeffreys** interval
* A Agresti and BA Coull, "Approximate is better than 'exact' for interval estimation of binomial proportions," American Statistician 52, (1998)
** LD Brown et al., "Interval estimation in exponential families," Statistica Sinica 13, (2003)

COMPARISON OF TWO STANDALONE SYSTEMS A AND B
- System A is better if TPF_A is significantly higher than TPF_B and FPF_A is significantly lower than FPF_B
- In practice, a high bar to achieve

COMPARISON OF TWO CADx SYSTEMS
- Often, both members of the (TPF, FPF) pair are higher for one system than for the other: higher TPF but also higher FPF, or lower TPF but also lower FPF
- Instead of (TPF, FPF) at a fixed threshold, use the continuous scores for each unit (image) and compare ROC curves
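The binomial intervals named above have closed forms. A stdlib-only sketch of the Wald and Agresti-Coull 95% intervals (the Jeffreys interval would use quantiles of Beta(x + 1/2, n - x + 1/2), e.g. via scipy.stats.beta.ppf, and is omitted here to keep the example self-contained; function names are illustrative):

```python
from math import sqrt

Z = 1.959964  # two-sided 95% normal quantile

def wald_interval(x, n):
    """Normal-approximation (Wald) 95% CI for a binomial proportion x/n."""
    p = x / n
    half = Z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

def agresti_coull_interval(x, n):
    """Agresti-Coull 95% CI: add z^2 pseudo-trials, then apply the Wald formula."""
    n_t = n + Z ** 2
    p_t = (x + Z ** 2 / 2) / n_t
    half = Z * sqrt(p_t * (1 - p_t) / n_t)
    return max(0.0, p_t - half), min(1.0, p_t + half)

# Example: 45 of 50 lesions localized (LLF = 0.9).
print(wald_interval(45, 50))
print(agresti_coull_interval(45, 50))
```

For proportions near 0 or 1 and modest n (common for high-sensitivity CADe systems), the Wald interval is known to undercover, which is the motivation for preferring Agresti-Coull or Jeffreys in the slide.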

FIGURES OF MERIT
- Area under the ROC curve (AUC); partial area under the curve
- Important to pre-specify which part of the ROC curve you are interested in before performing the comparison
- Point estimates should always be accompanied by confidence intervals

ROC ANALYSIS
- Numerous methods in the literature:
- To fit the data and estimate uncertainties: parametric
- To estimate FOMs and their uncertainties: both parametric and non-parametric
- To statistically compare the FOMs of two systems: both parametric and non-parametric

LOCATION-SPECIFIC ROC ANALYSIS
- ROC: scores; location-specific ROC: (mark, score) pairs
- LROC, AFROC, FROC, EFROC
- (Figure: FROC curve of LLF (sensitivity) versus NLF (FPs per image); figure of merit: area under the FROC curve up to an FPPI threshold; uncertainty estimated by bootstrapping*)
* FW Samuelson and N Petrick, "Comparing image detection algorithms using resampling," IEEE Int Symp on Biomedical Imaging: 1-3, (2006)
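The resampling idea cited above can be illustrated with the ordinary non-parametric (Mann-Whitney) AUC; the FROC analogue would resample cases and recompute the area under the FROC curve instead. A sketch, assuming the two systems score the same positive and negative cases so that resampling is paired (function names and the case layout are illustrative):

```python
import random

def auc(pos_scores, neg_scores):
    """Non-parametric (Mann-Whitney) area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_auc_diff(pos_a, neg_a, pos_b, neg_b, n_boot=2000, seed=0):
    """Bootstrap 95% interval for AUC(A) - AUC(B).

    Scores are paired per case, so positive and negative cases are
    resampled once and the same resampled cases are scored by both systems.
    """
    rng = random.Random(seed)
    n_pos, n_neg = len(pos_a), len(neg_a)
    diffs = []
    for _ in range(n_boot):
        ip = [rng.randrange(n_pos) for _ in range(n_pos)]
        iq = [rng.randrange(n_neg) for _ in range(n_neg)]
        d = (auc([pos_a[i] for i in ip], [neg_a[j] for j in iq])
             - auc([pos_b[i] for i in ip], [neg_b[j] for j in iq]))
        diffs.append(d)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```

If the resulting interval excludes zero, the AUC difference between the two standalone systems is significant at the 5% level; because resampling is done by case, the same machinery carries over to FROC figures of merit.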
