Results and Discussion

The clinical experiment took place at Stanford University Hospital during spring 1996. The gold standard was established by an independent panel of two expert radiologists: E. Sickles, M.D., Professor of Radiology, University of California at San Francisco, and Chief of Radiology, Mt. Zion Hospital; and D. Ikeda, Assistant Professor and Chief, Breast Imaging Section, Department of Radiology, Stanford University. They evaluated the test cases independently and then collaborated to reach agreement. Most of the detected items were seen by both radiologists; any finding seen by only one radiologist was still included. The other type of discrepancy to be resolved was the class of a detected lesion: because the same abnormality may be classified differently, the two radiologists were asked to agree on a class.

The statistical analysis focuses on the screening and management of patients and on how they are affected by the choice among analog, digital, and lossy compressed digital images. In all, 57 studies figure in what we report. According to the gold standard, the numbers of studies of the four management types RTS, F/U, C/B, and BX were 13, 1, 18, and 25, respectively.

For each of the four possible outcomes, the analog original is compared to each of four technologies: digitized from the analog original, and wavelet compressed to three different bit rates (1.75, 0.4, and 0.15 bpp). The McNemar 2 × 2 statistics based on the generic table of Table 2 for assessing differences between technologies were therefore computed 48 times: 16 per radiologist, one for each combination of outcome and competing image modality (original digital and the three lossy compressed bit rates). For example, the 2 × 2 tables for a single radiologist (A) comparing analog to each of the other four modalities are shown in Fig. 14.
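As an illustration of how such a 2 × 2 table can be tallied, the sketch below pairs each study's management call under two modalities against the gold standard and counts agreements and disagreements. This is our own illustrative code, not the authors' analysis software; the function name and label strings are assumptions.

```python
def mcnemar_table(calls_a, calls_b, gold):
    """Tally the 2x2 McNemar table of agreement with the gold standard.

    calls_a, calls_b, gold: equal-length sequences of management calls
    per study, e.g. "RTS", "F/U", "C/B", "BX" (labels are illustrative).

    Returns (a, b, c, d):
      a = both modalities agree with gold,
      b = modality A right, modality B wrong,
      c = modality A wrong, modality B right,
      d = both wrong.
    Only b and c (the discordant cells) enter the McNemar statistic.
    """
    a = b = c = d = 0
    for x, y, g in zip(calls_a, calls_b, gold):
        a_right, b_right = (x == g), (y == g)
        if a_right and b_right:
            a += 1
        elif a_right:
            b += 1
        elif b_right:
            c += 1
        else:
            d += 1
    return a, b, c, d
```

In the study's setting, `calls_a` would hold one radiologist's analog decisions and `calls_b` the same radiologist's decisions on a competing modality, restricted to studies of one gold-standard outcome.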

For none of these tables, for any radiologist, was the exact binomial attained significance level (p-value) 0.05 or less. For our study and for this analysis, there is nothing to choose in terms of being "better" among the analog original, its digitized version, and the three levels of compression, one of them rather extreme. We admit freely that this limited study had insufficient power to detect small differences in management; the larger the putative difference, the better our power to have detected it. Figure 15 summarizes the performance of each radiologist on analog vs. uncompressed digital and on analog vs. lossy compressed digital using the independent gold standard. In all cases, columns are "digital" and rows are "analog". Figure 15A treats analog vs. original digital, and Figs. 15B-15D treat analog vs. lossy compressed digital at bit rates of 1.75 bpp, 0.4 bpp, and 0.15 bpp, respectively.
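The exact binomial attained significance level referred to above can be computed from the two discordant cells alone: under the null hypothesis that the modalities do not differ, each discordant study is equally likely to favor either modality, so the smaller discordant count follows a Binomial(n, 1/2) tail. A minimal sketch, assuming a two-sided test (the function name is ours):

```python
from math import comb

def exact_mcnemar_p(b, c):
    """Two-sided exact (binomial) McNemar p-value.

    b, c: the discordant counts from the 2x2 table (one modality
    right where the other is wrong). Under H0, min(b, c) successes
    out of n = b + c follow Binomial(n, 0.5); the two-sided p-value
    doubles the lower tail, capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With the small discordant counts typical of 57 studies split over four management types, p-values above 0.05 are unsurprising, which is consistent with the power limitation the text acknowledges.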

Consider as an example the analog vs. digital comparison of Fig. 15A. Radiologist A made 23 "mistakes" in the 57 analog studies and 20 in the digital studies. The most frequent mistake, seven for both technologies, was classifying a gold standard "biopsy" as "additional assessment". Radiologist B made 26 "mistakes" on analog studies and 28 on digital. In both cases, the most frequent mistake was to call for "biopsy" what should, by the gold standard, have been "additional assessment": 15 such mistakes with analog and 14 with digital. Radiologist C made 19 "mistakes" on analog studies and 19 on digital. With the former, the most frequent mistake, made eight times, was to judge "biopsy" when "additional assessment" was correct. With digital, the most frequent mistakes were cases judged "additional assessment" that should have been "biopsy" (five) or "return to screening" (five). On this basis, we cannot say that analog and digital differ beyond chance. However, we note here, as elsewhere, that radiological practice varies considerably by radiologist.
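The per-radiologist tallies above amount to a confusion count between judged and gold-standard management classes. A sketch of that bookkeeping, with our own function name and illustrative labels, might look like:

```python
from collections import Counter

def mistake_summary(judged, gold):
    """Summarize disagreements with the gold standard.

    judged, gold: equal-length sequences of management calls per study.
    Returns (total_mistakes, most_common), where most_common is the
    ((gold_class, judged_class), count) pair occurring most often,
    or None if there were no mistakes.
    """
    confusions = Counter(
        (g, j) for j, g in zip(judged, gold) if j != g
    )
    total = sum(confusions.values())
    most = confusions.most_common(1)
    return total, (most[0] if most else None)
```

Applied to one radiologist's 57 analog calls, this would recover figures like "23 mistakes, most frequently gold-standard 'biopsy' judged as 'additional assessment'".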

The primary conclusion from these data and analyses is that the variability among judges exceeds by a considerable amount, in both main effects and interactions, the variability in performance attributable to imaging modality or compression, within very broad limits. In other words, the differences among analog, digital, and lossy compressed images lie within the noise of the differences among radiologists, and are therefore more difficult to evaluate. This suggests variations in the statistical analysis that will be explored in other experiments.