Statistical Size and Power

The size of a test is the probability of incorrectly rejecting the null hypothesis if it is true. The power of a test is the probability of correctly rejecting the null hypothesis if it is false. For a given hypothesis and test statistic, one constrains the size of the test to be small and attempts to make the power of the test as large as possible.

Given a specified size, test statistic, null hypothesis, and alternative, statistical power can be estimated using the common (but sometimes inappropriate) assumption that the data are Gaussian. As data are gathered, however, improved estimates can be obtained by modern computer-intensive statistical methods. For example, the power and size can be computed for each test statistic described earlier to test the hypothesis that digital mammography of a specified bit rate is equal or superior to film-screen mammography, with the statistic given and the alternative hypothesis suggested by the data. In the absence of data, we can only guess at the behavior of the data to be collected in order to approximate the power and size. We consider a one-sided test with the "null hypothesis" that, whatever the criterion [management or detection sensitivity, specificity, or predictive value positive (PVP)], the digitally acquired mammograms or lossy compressed mammograms of a particular rate are worse than analog. The "alternative" is that they are better. In accordance with standard practice, we take our tests to have size 0.05. Here we focus on sensitivity and specificity of management decisions, but the general approach can be extended to other tests and tasks.
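As a rough illustration of the Gaussian approximation mentioned above, the following Python sketch computes the power of a generic one-sided z-test of size 0.05. It is not the authors' computation: the function name `gaussian_power`, its arguments, and the standard error of 0.03 are hypothetical choices made only for this example, and the test statistic is assumed to be approximately normal with unit variance.

```python
from statistics import NormalDist


def gaussian_power(effect, std_error, size=0.05):
    """Approximate power of a one-sided z-test that rejects for large values.

    effect    : true difference under the alternative (hypothetical input)
    std_error : standard error of the estimated difference (hypothetical input)
    size      : probability of rejecting the null hypothesis when it is true
    """
    if std_error <= 0:
        raise ValueError("std_error must be positive")
    std_normal = NormalDist()
    # Critical value of the test, about 1.645 when size is 0.05.
    critical = std_normal.inv_cdf(1.0 - size)
    # Assumption: under the alternative the statistic is roughly
    # normal with mean effect / std_error and unit variance.
    return 1.0 - std_normal.cdf(critical - effect / std_error)


if __name__ == "__main__":
    # The standard error 0.03 is an invented value for illustration only.
    for effect in (0.0, 0.02, 0.07):
        print(effect, round(gaussian_power(effect, std_error=0.03), 3))
```

With a zero effect the function returns the size of the test, which is a quick sanity check on any power calculation of this kind.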

Approximate computations of power derive from 2 by 2 agreement tables of the form of Table 1. In this table, the rows correspond to one technology (for example, analog) and the columns to the other (digital, say). "R" and "W" correspond to "right" (agreement with the gold standard) and "wrong" (disagreement with the gold standard). So, for example, the count N(1,1) is the number of cases where a radiologist was right when reading both the analog and digital images.

The key idea is twofold. In the absence of data, a guess as to power can be computed using standard approximations. Once preliminary data are obtained, however, more accurate estimates can be obtained by simulation techniques that take advantage of the estimates inherent in the data. Table 2 shows the possibilities and their corresponding probabilities. The right-hand column and bottom row are sums of what lies, respectively, to the left of and above them. Thus, ψ is the value for one technology and ψ + h is the value for the other; h = 0 denotes no difference, which is the null hypothesis. The four entries in the middle of the table are parameters that define probabilities for a single study. They are meant to be average values across radiologists, as are the sums just cited. Our simulations allow for what we know to be the case: radiologists differ greatly in how they manage and in how they detect.
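To make the structure of Table 2 concrete, the sketch below builds the four cell probabilities from the two margins and the both-wrong probability, then estimates power by simulation. Several things here are assumptions rather than the authors' method: the code names `psi`, `h`, and `y` stand for the row-technology value, the difference between technologies, and the both-wrong probability; the one-sided McNemar-type z statistic is a stand-in for whichever test statistic is actually used; and all readings are treated as independent draws, so the variability among radiologists that the text emphasizes is not modeled.

```python
import random


def table2_probabilities(psi, h, y):
    """Cell probabilities of the 2 by 2 agreement table.

    Rows are one technology (marginal "right" probability psi), columns
    the other (marginal "right" probability psi + h), and y is the
    probability of being wrong with both. Keys are (row, column).
    """
    cells = {
        ("R", "R"): 2 * psi + h - 1 + y,  # right with both technologies
        ("R", "W"): 1 - psi - h - y,      # right with rows, wrong with columns
        ("W", "R"): 1 - psi - y,          # wrong with rows, right with columns
        ("W", "W"): y,                    # wrong with both technologies
    }
    if min(cells.values()) < -1e-12:
        raise ValueError("psi, h, y do not define a valid table")
    return cells


def simulated_power(psi, h, y, n_cases, n_trials=2000, seed=0):
    """Monte Carlo rejection rate of a one-sided McNemar-type z-test.

    Assumption: readings are independent draws from the table, which
    ignores differences among radiologists.
    """
    rng = random.Random(seed)
    cells = table2_probabilities(psi, h, y)
    keys = list(cells)
    weights = [max(0.0, cells[k]) for k in keys]  # guard against rounding
    critical = 1.645  # approximate 0.95 quantile of the standard normal
    rejections = 0
    for _ in range(n_trials):
        draws = rng.choices(keys, weights=weights, k=n_cases)
        n_rw = draws.count(("R", "W"))
        n_wr = draws.count(("W", "R"))
        discordant = n_rw + n_wr
        if discordant == 0:
            continue  # no discordant pairs, so no evidence to reject
        z = (n_wr - n_rw) / discordant ** 0.5
        if z > critical:
            rejections += 1
    return rejections / n_trials


if __name__ == "__main__":
    # Parameter values echo the magnitudes quoted for the pilot study;
    # the number of cases (300) is invented for illustration.
    print(table2_probabilities(0.60, 0.07, 0.05))
    print(simulated_power(0.60, 0.07, 0.05, n_cases=300, n_trials=1000))
```

Setting `h` to 0 in `simulated_power` gives an estimate of the size of the test rather than its power, which should come out near 0.05 if the normal approximation is adequate.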

Two fundamental parameters are y and R. The first is the chance (on average) that a radiologist is "wrong" for both technologies; R is the number of radiologists. These key parameters can be estimated from the counts of the 2 by 2 agreement table resulting from the pilot experiment, and then improved as additional data are acquired.
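The estimation step can be sketched as simple plug-in proportions from the counts of the agreement table. This is a minimal illustration under stated assumptions, not the authors' estimator: the function name, the argument names, and the example counts are all hypothetical, rows are taken to be one technology and columns the other, and the counts are pooled over radiologists.

```python
def estimate_parameters(n_rr, n_rw, n_wr, n_ww):
    """Plug-in estimates from the counts of a 2 by 2 agreement table.

    n_rr : right with both technologies, i.e. N(1,1)
    n_rw : right with the row technology, wrong with the column technology
    n_wr : wrong with the row technology, right with the column technology
    n_ww : wrong with both technologies
    Returns (row-technology value, difference h, both-wrong probability y).
    """
    total = n_rr + n_rw + n_wr + n_ww
    if total <= 0:
        raise ValueError("the table must contain at least one case")
    row_value = (n_rr + n_rw) / total  # proportion right, row technology
    h_hat = (n_wr - n_rw) / total      # column proportion minus row proportion
    y_hat = n_ww / total               # proportion wrong with both
    return row_value, h_hat, y_hat


if __name__ == "__main__":
    # Invented counts, chosen only to resemble the magnitudes in the text.
    print(estimate_parameters(n_rr=32, n_rw=28, n_wr=35, n_ww=5))
```

Because the estimate of h depends only on the difference between the two discordant counts, it is noisy when the number of cases is small, which is consistent with the caution expressed about the pilot estimates.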

In our small pilot study of management, we found sensitivity of about 0.60 and specificity of about 0.55. The respective estimated values of h varied from more than 0.02 to about 0.07; y was about 0.05. These numbers are all corrupted by substantial noise. Indeed, the variability associated with our estimation of them is swamped by the evident variability among radiologists. For a test of size 0.05, by varying the parameters by amounts like those we saw, the power might be as low as 0.17 with 18 radiologists, or as high as 1.00 with only 9 radiologists. The power is very sensitive to the three parameters. No matter how many