This is actually the second entry in this series;† the first was Part V of the Homeopathy and Evidence-Based Medicine series, which began the discussion of why Evidence-Based Medicine (EBM) is not up to the task of evaluating highly implausible claims. That discussion made the point that EBM favors equivocal clinical trial data over basic science, even if the latter is both firmly established and refutes the clinical claim. It suggested that this failure in calculus is not an indictment of EBM’s originators, but rather an understandable lapse on their part: it never occurred to them, even as recently as 1990, that EBM would soon be asked to judge contests pitting low-powered, bias-prone clinical investigations and reviews against facts of nature elucidated by voluminous and rigorous experimentation. Thus, although EBM correctly recognizes that basic science is an insufficient basis for determining the safety and effectiveness of a new medical treatment, it overlooks the necessary place of basic science in that exercise.

This entry develops the argument in a more formal way. In so doing it advocates a solution to the problem that has been offered by several others, but so far without real success: the adoption of Bayesian inference for evaluating clinical trial data.

Many readers will recognize that the term “prior probability” comes from Bayesian statistical analysis. They may correctly conclude that at least part of the reason to prefer Bayesian over “frequentist” statistical evaluations of clinical trials—which have been dominant throughout the careers of every physician now alive—is that the former require considering evidence external to the trial in question. That, of course, is what we should be doing in any case, but it helps to have a formal reminder. Bayes’ Theorem shows how our existing view (the prior probability) of the truth of a matter can be altered by new experimental data. Prior probability must be estimated from all existing evidence: basic science, previous clinical trials, funding sources, investigators’ identities and histories, and other factors. How conclusions based on such evidence might be altered by new data is illustrated by this statement of Bayes’ Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Where:

P stands for probability;

A is the hypothesis in question;

| stands for “given”; and

B is the data generated by the trial at hand.

Thus P(A|B), the probability of the hypothesis given the data (also called the “posterior probability”), is proportional to P(B|A), the probability of the data given the hypothesis, and also to P(A), the “prior probability” of the hypothesis. P(A|B) is inversely proportional to P(B), the probability of producing the data.
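As a rough numerical sketch of the relationship just described, the short Python snippet below computes P(A|B) for several prior probabilities while holding the likelihoods fixed. The function name `posterior` and the likelihood values are hypothetical choices made only for illustration; they do not come from any trial. P(B) is expanded with the law of total probability, which requires one extra assumed quantity, the probability of the data if the hypothesis is false.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) expanded by the law of total probability over A and not-A.
    p_data = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / p_data


# Hypothetical likelihoods (assumptions for illustration only):
# the observed data are 10 times more probable if A is true than if it is false.
P_B_GIVEN_A = 0.50
P_B_GIVEN_NOT_A = 0.05

for prior in (0.5, 0.1, 0.001, 0.000001):
    post = posterior(prior, P_B_GIVEN_A, P_B_GIVEN_NOT_A)
    print(f"P(A) = {prior:<8g} -> P(A|B) = {post:.6f}")
```

With the same data in every row, only the prior changes, so any difference in the printed posterior probabilities comes entirely from P(A).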

We might not know P(B), but it does not depend on the hypothesis, so for a given set of data it can be treated as a constant. Thus on the right side of the equation we can direct our attention to the terms in the numerator, which predict certain things: if the prior probability of a hypothesis is high, it will not require much in the way of confirming data to reassure us of that opinion. If the prior probability is small, it will require a large amount of credible, confirming data to convince us to take the hypothesis seriously. If the prior probability is exceedingly small, it will require a massive influx of confirming data to do so (yes, extraordinary claims really do require extraordinary evidence). The simplest result, albeit one that many find discomfiting, is found if P(A) approaches zero: no amount of “confirming data,” especially of the error-prone sort generated by a clinical trial, should convince us to accept the hypothesis.

It turns out that the last assertion, although undeniably true, is not necessary to make the case for the superiority of Bayesian statistics in clinical research. I didn’t know that until I had read the following two articles:

Goodman SN. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med. 1999.

Goodman SN. Toward evidence-based medical statistics. 2: The Bayes factor. Ann Intern Med. 1999.

Dr. Goodman makes the arguments for Bayesian inference in far more compelling ways than I can. Thus after a brief introduction I’ll quote him liberally—and ask his forgiveness in advance for whatever embarrassing inaccuracies I will be or may have already been guilty of spouting.

Dr. Goodman observes that it is the subjective nature of prior probabilities (“measuring ‘belief’ ”) that explains why clinical trial literature has shied away from Bayesian statistics, instead favoring the familiar “frequentist statistics” with its P values, confidence intervals, and hypothesis tests: tools that are widely assumed to provide objective measures of evidence for hypotheses by looking exclusively at data from trials. Those tools don’t provide such objective measures, however, nor can they. As any scientist and most physicians know, it is foolish to evaluate trial results without considering external knowledge. What most don’t know, however, is that “frequentist statistics” are irrational tools for the job that they have been assigned to do, and that they include methods that are not compatible even with each other. These points are introduced in the abstract of the first article cited above:

“An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain “error rates,” without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used—the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.”

The “intense debate” that Dr. Goodman refers to is over a problem central to science: that of “inductive” vs. “deductive” reasoning. As many will recall from college philosophy courses, “inductive” reasoning uses observations to generate hypotheses: if the first 10,000 swans one sees are white, then a reasonable (tentative) hypothesis is that all swans are white. That is the way science, including clinical trials, usually works. The obvious problem with it is that it can’t be conclusive: the 10,001st swan might be black. “Deductive” reasoning begins with a principle and makes predictions: if at least some swans are white, then the next one we see has a probability > zero of being white. Deductive reasoning is logically sound, but has obvious limitations as a tool for learning about nature.

It turns out that “frequentist statistics” not only lacks a formal way to consider external evidence, but is inappropriate for evaluating clinical trials for a more fundamental reason: it applies only to deductive inference. Thus:

“…when physicians are presented with a single-sentence summary of a study that produced a surprising result with P = 0.05, the overwhelming majority will confidently state that there is a 95% or greater chance that the null hypothesis is incorrect. This is an understandable but categorically wrong interpretation because the P value is calculated on the assumption that the null hypothesis is true. It cannot, therefore, be a direct measure of the probability that the null hypothesis is false. This logical error reinforces the mistaken notion that the data alone can tell us the probability that a hypothesis is true.” (emphasis added)

On the other hand,

“Determining which underlying truth is most likely on the basis of the data is a problem in inverse probability, or inductive inference, that was solved quantitatively more than 200 years ago by the Reverend Thomas Bayes.”

Dr. Goodman explains this expertly, but for me it still required a few reads before it began to sink in. (Please, dear reader, if you hope to see Evidence-Based Medicine become synonymous with Science-Based Medicine, tackle these articles.) The final sentence in the abstract above introduces the second point about Bayesian statistics that I had not previously appreciated, which follows from its application to inductive inference: it is not necessary to dwell on Prior Probability estimates to appreciate the superiority of the Bayesian method. Another term in the theorem, known as the Bayes Factor, is calculated entirely from objective data but is a more useful and accurate “measure of evidence” than the familiar “P value.” The Bayes Factor is illustrated in this statement of Bayes’ Theorem (from Dr. Goodman’s second article):

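$$\text{Posterior odds of the null hypothesis} = \text{Prior odds of the null hypothesis} \times \text{Bayes factor}$$

Where

$$\text{Bayes factor} = \frac{\text{Prob(Data, given the null hypothesis)}}{\text{Prob(Data, given the alternative hypothesis)}}$$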
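To give a feel for the numbers, here is a minimal Python sketch that converts a two-sided P value into a minimum Bayes factor using the formula exp(−Z²/2), where Z is the number of standard errors from the null effect. It assumes a normally distributed test statistic; the function name `min_bayes_factor` and the list of P values are my own choices for illustration. The "minimum" means this is the strongest evidence against the null hypothesis that the data could possibly supply, obtained by comparing the null with the single best-supported alternative.

```python
import math
from statistics import NormalDist


def min_bayes_factor(p_value):
    """Minimum Bayes factor exp(-z^2 / 2) for a two-sided P value."""
    # z is the standard normal score whose two-sided tail area equals p_value.
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return math.exp(-z * z / 2.0)


for p in (0.10, 0.05, 0.01, 0.001):
    bf = min_bayes_factor(p)
    print(f"P = {p:<5} -> minimum Bayes factor = {bf:.3f} (about 1/{1 / bf:.1f})")
```

Under these assumptions, P = 0.05 corresponds to a minimum Bayes factor of roughly 0.15: at best, the data are about seven times more probable under the most favorable alternative than under the null, which is considerably weaker evidence than "1 in 20" suggests.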

The abstract of Goodman’s second article continues the discussion (emphasis added):

“Bayesian inference is usually presented as a method for determining how scientific belief should be modified by data. Although Bayesian methodology has been one of the most active areas of statistical development in the past 20 years, medical researchers have been reluctant to embrace what they perceive as a subjective approach to data analysis. It is little understood that Bayesian methods have a data-based core, which can be used as a calculus of evidence. This core is the Bayes factor, which in its simplest form is also called a likelihood ratio. The minimum Bayes factor is objective and can be used in lieu of the P value as a measure of the evidential strength. Unlike P values, Bayes factors have a sound theoretical foundation and an interpretation that allows their use in both inference and decision making. Bayes factors show that P values greatly overstate the evidence against the null hypothesis. Most important, Bayes factors require the addition of background knowledge to be transformed into inferences—probabilities that a given conclusion is right or wrong. They make the distinction clear between experimental evidence and inferential conclusions while providing a framework in which to combine prior with current evidence.”

At this point I must stop, but let me suggest a fun project: pick a “CAM” study of an implausible hypothesis such as homeopathy, “distant healing,” or whatever, that has been evaluated by “frequentist statistics” and purports to demonstrate an effect “significant at P = 0.04” or so. Now, using your new knowledge of inductive inference, re-evaluate the data using a few points from a range of prior odds of the null hypothesis being true, say from 8 to 1 up to 99,999 to 1 (odds that are far more favorable to homeopathy, for example, than established knowledge warrants). You needn’t even make calculations: both Goodman and Ioannidis provide tables and nomograms that can help you estimate the answers.
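For readers who would rather compute than read a nomogram, the project can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the minimum Bayes factor exp(−Z²/2) for a two-sided P value (the most generous reading of the data for the alternative hypothesis), and the function names and the particular prior odds sampled are hypothetical choices of mine.

```python
import math
from statistics import NormalDist


def min_bayes_factor(p_value):
    """Minimum Bayes factor exp(-z^2 / 2) for a two-sided P value."""
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return math.exp(-z * z / 2.0)


def posterior_prob_null(prior_odds_null, bayes_factor):
    """Posterior odds of null = prior odds of null x Bayes factor."""
    posterior_odds = prior_odds_null * bayes_factor
    # Convert odds back to a probability.
    return posterior_odds / (1.0 + posterior_odds)


bf = min_bayes_factor(0.04)  # the "significant at P = 0.04" result
print(f"Minimum Bayes factor for P = 0.04: {bf:.3f}")

# Prior odds in favor of the null hypothesis, from 8:1 up to 99,999:1.
for prior_odds in (8, 99, 999, 99_999):
    prob = posterior_prob_null(prior_odds, bf)
    print(f"prior odds {prior_odds:>6,} to 1 -> P(null | data) at least {prob:.5f}")
```

Because the minimum Bayes factor is the best case for the alternative, each printed value is a lower bound on the probability that the null hypothesis is true. Even at prior odds of only 8 to 1, the null remains roughly as likely as not after the "significant" result.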

Here is a short additional bibliography, including a couple of shameless pitches for offerings by two of your humble bloggers: