When Tests Cry Wolf

A test might be “sensitive,” but is it “specific” enough to be valid and useful?

Many studies try to evaluate the use of tests to find disease of one sort or another. “Tests” are of course not limited to laboratory studies or X-rays, but for the purposes of this type of research may include a given historical finding, an abnormality on physical examination, a set of “high-yield criteria,” or anything we can add to our prior knowledge of patients to further discriminate between those who do or do not have the disease in question. The accuracy of a test is its ability to do this job, and is typically described in terms of its sensitivity and its specificity. Sensitivity is the ability of the test to find the disease when it’s present. If your dog has a sensitive nose, it can find a bone when one is around.

Specificity is the ability of the test to be negative when disease is absent; this is best conceptualized as the avoidance of false positive results — the test doesn’t flag a disease that isn’t there. If just about every time a patient has crushing chest pain, shortness of breath, and 4 mm ST elevation the cause turns out to be an MI, you can say those findings are specific for MI, because they’re almost never anything else. Their presence won’t falsely identify MI when it’s not there. In general, the first responsibility of a clinician is to find disease, especially when it’s serious. So, if all else is equal, we first want our tests to be sensitive. But no matter how sensitive a test is, it isn’t of any value if it has very little specificity, because it will not only pick up the disease when it’s present, but mislead us into thinking it’s there when it’s not. After a while, we’ll stop trusting such a test (the boy who cried wolf too often was really sensitive to the presence of wolves, but people stopped listening because he was so non-specific).

One of the important characteristics of sensitivity and specificity is that they are, for the most part, independent of the prevalence of the disease in the population being tested. A test that’s 95% sensitive will find disease, when it’s present, 95% of the time, while one that’s 75% specific will be negative in three-quarters of the people who don’t have the disease. These characteristics are primarily of interest from a research point of view, since they tell us how well a test performs. Studies of tests should generally concentrate on reporting sensitivity and specificity.
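To make the arithmetic concrete, here is a minimal sketch (the counts are made up for illustration, not taken from any study): the same 95%-sensitive, 75%-specific test applied to two populations with very different prevalence yields identical sensitivity and specificity.

```python
def sensitivity(tp, fn):
    """Proportion of diseased patients the test finds (true positive rate)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of disease-free patients the test calls negative."""
    return tn / (tn + fp)

# Population A: 100 diseased, 900 healthy (10% prevalence)
print(sensitivity(tp=95, fn=5))      # 0.95
print(specificity(tn=675, fp=225))   # 0.75

# Population B: 500 diseased, 500 healthy (50% prevalence)
print(sensitivity(tp=475, fn=25))    # 0.95
print(specificity(tn=375, fp=125))   # 0.75
```

Prevalence changed fivefold, but the test’s accuracy, as the researcher measures it, did not budge.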

Researchers often stress predictive values instead, which is usually inappropriate. Positive predictive value (PPV) is the proportion of all positive tests, in a given group being tested, that come from patients with the disease, and negative predictive value (NPV) is the proportion of negative tests that come from patients without the disease. PVs are tremendously dependent upon the population studied, and thus, from a research point of view, can be quite misleading.
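In code, using hypothetical counts from a made-up 2×2 table (none of these numbers come from the article), the definitions look like this:

```python
# Hypothetical 2x2 table: rows are test result, columns are disease status.
tp, fp = 90, 50    # positive tests: with and without the disease
tn, fn = 850, 10   # negative tests: without and with the disease

ppv = tp / (tp + fp)   # share of positive tests that are real disease
npv = tn / (tn + fn)   # share of negative tests that are truly disease-free

print(ppv)   # 90/140, about 0.64
print(npv)   # 850/860, about 0.99
```

Note that both denominators are rows of test results, not columns of disease status — which is exactly why these numbers shift whenever the mix of diseased and healthy patients shifts.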

In a study of abdominal trauma, for example, the authors showed that in a two-year period during which they performed 200 triple-contrast CTs, there was only one patient who proved to have a surgical lesion when the CT was falsely negative. They concluded that this test, with a negative predictive value of almost 100%, should be widely used. Partially hidden, however, was the fact that only two of the 200 patients tested required surgery since they excluded patients with high clinical likelihood, who underwent immediate laparotomy or DPL instead. Thus, the sensitivity of the test was only 50% (1 of 2, one true positive and one false negative), and its NPV was so good only because the baseline NPV in the group, even without any tests, was 99% (198/200). They only missed one spleen, but they’d have missed at most two if instead of triple-contrast CT they had consulted a dermatologist . . . or an astrologer for that matter.

While sensitivity and specificity are typically the most important characteristics in research, we as clinicians mostly need to know how to apply a test to an individual patient, not so we’ll know how good the test is (as the researcher is trying to determine), but so we’ll know whether or not our one patient has the disease (i.e. how well can we trust a positive or negative test). In order to do this, we need to know not only about the test’s accuracy (sensitivity and specificity), but also the prevalence of the disease in patients just like the one we’re evaluating, the prior probability of disease even before the test is done.

Let’s look at a simple example. If a certain HIV test is 99.99% specific, that means it will be falsely positive in only 1 out of 10 thousand patients who are really HIV-negative. If we do the test among 1 million people in Rwanda, where (conservatively) 30% truly carry the virus, and assume that the test is also 99.99% sensitive, there will be just over 300,000 positive tests, all but 70 of which come from truly infected patients, and almost 700,000 negative tests, all but 30 of which come from patients free of the virus. The test would have great screening value, since a clinician there could be overwhelmingly confident in counseling patients about their results, knowing that only about once in every 10,000 patients would he or she be wrong, regarding both positive and negative results.
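A quick sketch of the arithmetic behind those counts (assuming, as above, 99.99% sensitivity and specificity and 30% prevalence):

```python
population = 1_000_000
diseased = 300_000               # 30% prevalence
healthy = population - diseased  # 700,000
sens = spec = 0.9999

tp = sens * diseased   # ~299,970 infected patients correctly flagged
fn = diseased - tp     # ~30 infected patients missed
tn = spec * healthy    # ~699,930 uninfected patients correctly reassured
fp = healthy - tn      # ~70 false alarms

ppv = tp / (tp + fp)   # ~0.9998: nearly every positive is real
npv = tn / (tn + fn)   # ~0.99996: nearly every negative is real
print(ppv, npv)
```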

This would be very different in a place like rural Utah, however, where, for the sake of argument, let us imagine that only one person out of a population of one million truly carries the virus. In this setting, that person’s test would almost certainly be positive, but so would 100 others, one in every 10 thousand people without the disease. Thus, a clinician could be awfully confident reassuring someone with a negative test result. But even without the test you could pretty well reassure everyone, and be right 99.9999% of the time! On the other hand, the PPV of the test would be only 1%, meaning that almost all of the positives would be false, and 100 patients would be mistakenly treated as though they were HIV-positive in order to find the 1 real case.
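The same calculation, sketched as a one-line Bayes function, reproduces the roughly 1% PPV in this setting (and the near-perfect PPV in the Rwanda setting):

```python
def ppv(prevalence, sens, spec):
    """Positive predictive value: true positives over all positives."""
    true_pos = prevalence * sens
    false_pos = (1 - prevalence) * (1 - spec)
    return true_pos / (true_pos + false_pos)

print(ppv(1 / 1_000_000, 0.9999, 0.9999))  # ~0.01: about 1 real case per 100 positives
print(ppv(0.30, 0.9999, 0.9999))           # ~0.9998: the same test at 30% prevalence
```

Identical sensitivity and specificity, wildly different predictive values: only the prior probability changed.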

Therefore, such a testing scheme would obviously be inappropriate in this latter environment, even though the test is still far more accurate than just about any clinical test available, and even though its performance characteristics are no different than in Rwanda, where it worked so well. Understanding this paradox is critical to understanding why we need researchers to tell us about test accuracy. We must then apply those results to individual patients, when appropriate, according to the predictive values they would attain, after (and only after) first estimating the likelihood, before the test is done, that a given patient has the disease in question.