Good Bayes Gone Bad

A reader recently linked to a book about information and inference, which definitely leans toward the Bayesian rather than frequentist view of inference. I do too. But I’m not the avowed anti-frequentist that some Bayesians are.

The book contains what is, in my opinion, at least one example of Bayesian analysis gone wrong. Chapter 37 (in section IV) discusses Bayesian inference and sampling theory. It begins with an example of a clinical trial.

We are trying to reduce the incidence of an unpleasant disease called microsoftus. Two vaccinations, A and B, are tested on a group of volunteers. Vaccination B is a control treatment, a placebo treatment with no active ingredients. Of the 40 subjects, 30 are randomly assigned to have treatment A and the other 10 are given the control treatment B. We observe the subjects for one year after their vaccinations. Of the 30 in group A, one contracts microsoftus. Of the 10 in group B, three contract microsoftus. Is treatment A better than treatment B?

The author begins by undertaking a frequentist analysis of the null hypothesis that the two treatments have the same effectiveness. He applies a chi-square test (pretty standard), as well as a variant (Yates’s correction). He first tests using a critical value (at 95% confidence), concluding that the uncorrected chi-square rejects the null hypothesis (but not by much) while the corrected chi-square fails to do so (but not by much). Then he estimates a p-value, getting 0.07 — which is not significant at 95% confidence but is at 90% confidence. The overall result is that there’s some evidence that treatment A is better, but it’s certainly not conclusive. Incidentally, he also warns that since the observed numbers are small, the chi-square test is imperfect (but that’s not relevant to the point we’ll address).
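For readers who want to check those numbers, the standard 2×2 chi-square (with and without Yates's correction) can be reproduced in a few lines. This is my own sketch, not the book's code; the closed-form 2×2 formula and the df = 1 survival function are standard.

```python
from math import erfc, sqrt

# 2x2 table: rows = treatment (A, B), cols = (sick, well)
a, b = 1, 29   # group A: 1 of 30 contracted the disease
c, d = 3, 7    # group B: 3 of 10 contracted the disease

def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square statistic for a 2x2 table, optional Yates correction."""
    n = a + b + c + d
    dev = abs(a * d - b * c)
    if yates:
        dev = max(dev - n / 2, 0)
    return n * dev ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_df1(x):
    # survival function of the chi-square distribution with 1 degree of freedom
    return erfc(sqrt(x / 2))

plain = chi2_2x2(a, b, c, d)                  # ~5.93, above the 3.84 cutoff
corrected = chi2_2x2(a, b, c, d, yates=True)  # ~3.33, below the cutoff
print(p_value_df1(plain), p_value_df1(corrected))  # ~0.015 and ~0.068
```

The corrected statistic gives a p-value of about 0.068, matching the 0.07 quoted above.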

Then he takes a Bayesian approach to evaluate the difference in effectiveness of treatment A and treatment B. He begins by saying:

OK, now let’s infer what we really want to know. We scrap the hypothesis that the two treatments have exactly equal effectivenesses, since we do not believe it.

Remember that.

Let θ_A be the probability of getting “microsoftus” with treatment A, while θ_B is the probability with treatment B. He adopts a uniform prior, that all possible values of θ_A and θ_B are equally likely (a standard choice and a good one). “Possible” means between 0 and 1, as all probabilities must be.

He then uses the observed data to compute posterior probability distributions for θ_A and θ_B. This makes it possible to compute the probability that θ_A < θ_B (i.e., that you're less likely to get the disease with treatment A than with B). He concludes that the probability is 0.990, so there's a 99% chance that treatment A is superior to treatment B (the placebo).
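With a uniform prior, the posteriors are Beta distributions: Beta(2, 30) for treatment A (1 case in 30) and Beta(4, 8) for B (3 cases in 10). A quick Monte Carlo sketch (mine, not the book's calculation, which is exact) approximately reproduces the 0.990 figure:

```python
import random

random.seed(0)

# Uniform Beta(1,1) priors are conjugate to the binomial likelihood:
# group A: 1 case in 30  -> theta_A | data ~ Beta(1+1, 1+29) = Beta(2, 30)
# group B: 3 cases in 10 -> theta_B | data ~ Beta(1+3, 1+7)  = Beta(4, 8)
N = 100_000
hits = sum(
    random.betavariate(2, 30) < random.betavariate(4, 8)
    for _ in range(N)
)
p_A_better = hits / N
print(f"P(theta_A < theta_B | data) ~= {p_A_better:.3f}")  # close to 0.99
```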

However, there’s a problem with this analysis. He assumed that θ_A and θ_B are different, so he has rejected the null hypothesis by assumption. Given that, it’s no surprise that his result resoundingly favors treatment A! I’ll also point out that his model — that θ_A and θ_B are unequal — incorporates two parameters rather than the one in the null hypothesis (θ_A = θ_B = θ), but he hasn’t included any inference penalty for the extra parameter (as would be required by any good information criterion) because he rejects the null by assumption.

Let’s do the analysis again. For the frequentist side, instead of using a chi-square test (a bit dicey with such small numbers) we’ll use the exact test, the hypergeometric distribution. Under the null hypothesis (that A and B have the same effect), the probability of getting k_A cases in n_A samples for A, and k_B cases in n_B samples for B, when we have N = n_A + n_B total samples with K = k_A + k_B cases, is

P(k_A, k_B) = C(n_A, k_A) C(n_B, k_B) / C(N, K).

The “combination” operator is the usual, given by

C(n, k) = n! / [k! (n − k)!].

We compute the frequentist p-value by summing the probabilities for the observed, and more extreme, cases, i.e.,

p = P(1, 3) + P(0, 4) = [C(30, 1) C(10, 3) + C(30, 0) C(10, 4)] / C(40, 4) ≈ 0.042.

Using this exact test, the p-value is less than 5%, so by the 0.05 standard we actually end up rejecting the null hypothesis (but not by much). Hence it’s likely (one could even say “statistically significant”) that the treatment is effective, but that doesn’t mean it’s conclusively proved.
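The exact-test p-value described above can be sketched directly from the hypergeometric formula, using the two splits in the observed tail (1-and-3, and the more extreme 0-and-4):

```python
from math import comb

def hyperg(kA, nA, kB, nB):
    """P(kA cases among nA, kB among nB | totals fixed) under the null."""
    return comb(nA, kA) * comb(nB, kB) / comb(nA + nB, kA + kB)

# observed: 1 of 30 in group A, 3 of 10 in group B (4 cases total);
# the only more extreme split in this tail is 0 and 4
p_value = hyperg(1, 30, 3, 10) + hyperg(0, 30, 4, 10)
print(f"exact p-value = {p_value:.4f}")  # about 0.042, just under 0.05
```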

Now let’s take a Bayesian approach. But instead of just assuming that the null hypothesis is false, let’s compare the null and alternate hypotheses. Under the null hypothesis, there’s a single probability θ for both treatments A and B; under the alternate there are two different probabilities θ_A and θ_B. I too will adopt the uniform prior probability for the θ values, that all possible values are equally likely.

I won’t give the full calculation, but I will give the final result. The probability of getting the observed result (the “data” D) under the null hypothesis is

P(D | null) = C(30, 1) C(10, 3) ∫₀¹ θ^4 (1 − θ)^36 dθ ≈ 0.00096.

The probability of getting the observed result under the alternate hypothesis is

P(D | alt) = C(30, 1) ∫₀¹ θ_A (1 − θ_A)^29 dθ_A × C(10, 3) ∫₀¹ θ_B^3 (1 − θ_B)^7 dθ_B = (1/31) × (1/11) ≈ 0.0029.

We see that the given data are more likely under the alternate hypothesis — that treatment A differs from treatment B — than under the null that there’s no difference. But it’s not overwhelmingly more likely. Clearly it’s likely that the treatment is effective, but it’s far from proved conclusively.
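These marginal likelihoods have closed forms, since ∫₀¹ θ^a (1 − θ)^b dθ = a! b! / (a + b + 1)!. A short sketch of the arithmetic:

```python
from math import comb, factorial

def beta_int(a, b):
    # integral from 0 to 1 of theta^a (1 - theta)^b dtheta = a! b! / (a + b + 1)!
    return factorial(a) * factorial(b) / factorial(a + b + 1)

# Null: one shared theta with a uniform prior; 1 + 3 = 4 cases, 36 non-cases
p_D_null = comb(30, 1) * comb(10, 3) * beta_int(4, 36)

# Alternate: independent theta_A and theta_B, each with a uniform prior
p_D_alt = comb(30, 1) * beta_int(1, 29) * comb(10, 3) * beta_int(3, 7)

print(f"P(D|null) = {p_D_null:.5f}")               # ~ 0.00096
print(f"P(D|alt)  = {p_D_alt:.5f}")                # ~ 0.0029
print(f"Bayes factor = {p_D_alt / p_D_null:.2f}")  # ~ 3
```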

To translate these likelihoods into a probability that the alternate hypothesis is true, we’d have to have a “prior probability” that the alternate is true. Let this prior probability be π. Then the probability that the alternate hypothesis is true is

P(alt | D) = π P(D | alt) / [π P(D | alt) + (1 − π) P(D | null)].

If we use a 50-50 prior π = 1/2 (equal chance that the treatment works and that it doesn’t) we get P(alt | D) ≈ 0.75. With this prior, the chance that treatment A is actually having an effect is only 75%, a far cry from the 99% previously claimed.

And it’s well to bear in mind that even the 75% result depends on the prior, and although a 50-50 prior doesn’t assume much, it is an assumption and has no real justification. Perhaps the best we can say is that the data enhance the likelihood that the treatment is effective, increasing the odds ratio by about a factor of 3. But, the odds ratio after this increase depends on the odds ratio before the increase — which is exactly the prior we don’t really have much information on!
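The dependence on the prior is easy to see numerically. Taking the Bayes factor of about 3 from the likelihoods above, here is a small sketch of how the posterior probability moves with π:

```python
def p_alt_posterior(pi, bayes_factor=3.05):
    """P(alt | data) for a prior probability pi that the alternate is true."""
    odds = pi / (1 - pi) * bayes_factor   # posterior odds = prior odds * BF
    return odds / (1 + odds)

for pi in (0.1, 0.25, 0.5, 0.75):
    print(f"pi = {pi:.2f} -> P(alt|D) = {p_alt_posterior(pi):.2f}")
# pi = 0.50 gives about 0.75; a more skeptical prior pulls the answer down sharply
```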

In case you feel a bit cheated because I didn’t show the details of how to do the calculation, don’t feel too bad, because the author of the aforementioned text does so himself! Immediately after computing that the probability is 99% that the treatment is effective (which frankly I disagree with!), he considers the example of just one person given treatment A and one person given treatment B. Then he performs exactly the hypothesis comparison I’ve discussed for the example with 30 patients given treatment A and 10 given treatment B. A bit ironic, eh? You can read all about it there.

29 responses to “Good Bayes Gone Bad”

However, there’s a problem with this analysis. He assumed that \theta_A and \theta_B are different, so he has rejected the null hypothesis by assumption. Given that, it’s no surprise that his result resoundingly favors treatment A!

Wait, what? Surely rejecting the null only commits us to the statement that \theta_A != \theta_B, and says nothing about which is larger.

[Response: Indeed. But since the data favor A, if you *assume* the null is false then it’s A (rather than B) that gets an exaggerated significance. Essentially, it *is* much more likely that A is better than B, than that it is *worse* — but it’s still a mistake to reject their *equivalence* by assumption.]

For 2×2 contingency tests, exact tests are preferable to Chi-squared tests, but I have gotten used to doing ‘approximate’ exact tests by brute force because there are fewer restrictions on numbers of categories. I used freeware called “RxC” on this example and came out with p = 0.0413 (which is quite close to your p-value). I might note, though, that this is a two-tailed test. Should we be doing a one-tailed test?

[Response: It so happens that because the numbers are integers, the other “tail” doesn’t kick in — the case P(4,0) is *not* “more extreme” than the observed result.]

I don’t know Bayes from Adam, so I’m generally confused here, but there’s something else specific about this example that confuses me. Can’t we (shouldn’t we) use the background incidence of microsoftus to inform the prior? Does it make a difference?

[Response: No information is available about the background incidence (since it’s a hypothetical problem). If we had that info, then we might incorporate it into devising a prior. But even then we have to be careful … ]

Yes, I just read that chapter of the book, and it jarred with me too… good observation!

BTW about the probability of a chemical substance taken at random from a sales catalog, say, having any clinical effect on a specified disease, I would guess that is very small. Silver bullets are rare. Now if this were a drug that already did well on other mammals…

The model \theta_A=\theta_B is unrealistic and unphysical. Any treatment is likely to have some effect, however small. A correct prior would then be relatively uninformative about the absolute value of \theta, but it would prefer \theta’s with small difference. In a fundamental(ist) bayesian sense then, I see this as a prior encoding problem. (But read further.)

Your two models could be interpreted as a mixture prior of “diagonal delta” at \theta_A=\theta_B, and then an uninformative background. From the fundamentalist viewpoint, this hides part of the prior encoding problem by taking the delta limit, and leaves the other part, \pi, unresolved.

But in practice, nobody would believe an analysis if results were relative to a complex prior that is an approximation of the even more complex subjective prior.

In practice I would therefore use the MacKay model, show the posterior of \theta_A-\theta_B graphically, test prior sensitivity and say that evidence is not quite enough, although there seems to be something in there. Testing sensitivity to the prior is essential here.

Note that \theta = const. over the interval (0,1) is not the only obvious choice for an uninformative prior. (I’m too lazy to check, but if the prior is conjugate beta, parameters (0,0) and (0.5, 0.5) are also used instead of (1,1) that leads to the \theta = const.)

Note that the chi-square approximation and the permutation test both have hidden priors.

BTW, IMO all p-values 0.01-0.07 are on the same ballpark. If there is little data, all analysis is sensitive to assumptions.

Before someone asks what use is bayes if one cannot trust the prior, I’d say that uninformative prior families can be empirically tested on a large scale, as has been done for GLM’s between L1, L2, and Cauchy. Even more importantly, sometimes data is the prior, as in hierarchical designs, and in models for dependencies over time or space.

“The model \theta_A=\theta_B is unrealistic and unphysical. Any treatment is likely to have some effect, however small.”

I can’t agree with this. Does a homeopathic remedy have an effect as compared to placebo? Chemically, they may be identical. If you have no knowledge of the remedy, you cannot assume that it has an effect.

I agree about homeopathy, but it is a special case. The typical case is placebo against an active molecule. Even if you were testing homeopathy *while* being physically ignorant of its implausibility, would you a priori want to assign a finite probability to a zero-measure set?

But the issue is kind of beside the point, although it may interestingly highlight some of the cognitive conventions of people used to more bayesian and more frequentist lines of thought.

I guess my main point was that a practical bayesian would keep it simple, and observe that there are too few data and a high sensitivity to the (artificial) prior. This is quite close to what a frequentist would do.

The posterior summary of the bayesian corresponds to a frequentist confidence interval. When H0 is rejected there is little difference. When H0 is not rejected, the confidence interval reveals the power of the experiment. This is often practically important, although it is sometimes a reason to avoid confidence intervals in publications. It is easier to say “p=0.17” than to acknowledge that the estimate is all over the place because of high measurement errors or inadequate data.

Nothing happening on Open Thread 18, so I strayed here. A brief contribution to muddy the waters still further, by examining what it is you really want to test.

θ_A = θ_B is unrealistic. What you probably want to test is the null hypothesis θ_A ≥ θ_B against the alternative θ_A < θ_B, or perhaps θ_A ≥ θ_B − δ vs. θ_A < θ_B − δ, where δ (> 0) is some minimal practical difference between θ_A and θ_B that you are interested in.

I believe that there’s quite a literature out there on such questions. I don’t know it myself, but the key word bioequivalence may get you into it.

Regarding Bayes vs. other approaches most statisticians are pragmatists and recognise that there are occasions when each approach is useful/appropriate.

[Response: Sorry I didn’t respond to your earlier emails; I’ve been so swamped (with both personal and professional things) I just haven’t had time for anything but keeping my head above water.

I think this is one of those cases in which a pragmatist would acknowledge the usefulness of the frequentist approach.]

θ_A = θ_B is unrealistic. What you probably want to test is the null hypothesis θ_A ≥ θ_B against the alternative θ_A < θ_B, or perhaps θ_A ≥ θ_B − δ vs. θ_A < θ_B − δ, where δ (> 0) is some minimal practical difference between θ_A and θ_B that you are interested in.

A section in the middle above got omitted. It should have read ‘… vs. θ_A < θ_B − δ, where δ (greater than zero) is some minimal …’

On the same day that you published this entry I sent a note to one of the UK Cabinet Ministers concerned with climate change.

My note argues for a Plan B which takes the role of shorter lived greenhouse forcing agents more urgently to cut the rise in global temperatures quickly, to avoid unexpected climate feedbacks or lack of progress in cutting CO2 emissions.

Plan A, the current UK Government Plan, relies on mitigating the emissions of long lived greenhouse gases, especially carbon dioxide, during this century with an emphasis on keeping the level below an all important peak level.

Plan B has the advantage of creating room for policy changes as the climate story unfolds. Plan A has no escape route.

My note identifies Professor David MacKay (and Professor Myles Allen) as key influences in Plan A. David MacKay is Chief Scientific Advisor to the UK Department of Energy and Climate Change and is also the author of the book you discuss.

Despite your criticisms, which sound valid, I will download Professor MacKay’s book – the download is free of charge. I liked one of his other books “Sustainable Energy – without the hot air”.

But can I hope that an eminent commentator, such as yourself, will comment on the short/long term mitigation argument in a way that focuses this debate? Now’s the time.

Unless I have missed something, I don’t really see the irony in MacKay discussing the Bayesian version of the “traditional” hypothesis. He is just showing that the Bayesian machinery allows you to address that question as well, should you want to (ISTR Jaynes doing something similar). The mechanism is the same, the important thing is to ask the right question (which seems to be the key contention).

I also don’t see why the frequentist test is preferable in this situation. The advantage of the Bayesian approach is that it gives a direct answer to the question you really want answered, whereas the frequentist approach doesn’t. The Bayesian approach tells you which hypothesis is more likely given the data, whereas the frequentist approach only tells you the frequency of a statistic at least as large (in some imaginary replication of the experiment) given that the null hypothesis is true, and then the statistician needs to make a choice (not without a subjective element) as to whether that probability is too high to be confident that the null hypothesis is false. Since that certainly isn’t a direct answer to the question, it isn’t surprising that p-values are so often misinterpreted as being the answer to the question we want answered!

As a Bayesian I would start by just looking at the Bayes factor, which is about three, which (at least according to Wikipedia) is evidence for the alternate hypothesis being on the borderline between “barely worth mentioning” and “substantial” (note the prior \pi isn’t involved), which seems reasonable as an initial finding.

As to the problem of the influence of \pi, you could always assume an uninformative hyper-prior over \pi and integrate it out of your computation. You would then have the desired probability assuming that you knew nothing about the true value of \pi. The ability to explicitly say there are things we know we don’t know is one of the best features of the Bayesian approach (Rumsfeld, 2002).
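As a quick check of this suggestion (my sketch, using the marginal likelihoods from the post above): the posterior probability is linear in π in both numerator and denominator, so marginalising over a hyperprior on π just substitutes its mean, and a uniform hyperprior (mean 1/2) gives back exactly the 50-50 answer:

```python
# P(alt | D) with a hyperprior p(pi), after marginalising pi out, is
#   E[pi] * P(D|alt) / (E[pi] * P(D|alt) + (1 - E[pi]) * P(D|null)),
# because both numerator and denominator are linear in pi.
p_D_alt, p_D_null = 0.0029, 0.00096  # marginal likelihoods from the post

e_pi = 0.5  # mean of a uniform hyperprior on pi
p_H1 = e_pi * p_D_alt / (e_pi * p_D_alt + (1 - e_pi) * p_D_null)
print(round(p_H1, 2))  # 0.75 again: the flat hyperprior buys nothing extra here
```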

Ian Jolliffe is quite right, both Bayesian and frequentist methods have the advantages and disadvantages. Having said which I do enjoy pretending to be a Bayesian fundamentalist occasionally as I am greatly outnumbered by my frequentist colleagues. ;o)

Ask yourself this, if you saw those data, would you honestly believe that both the treatments (which you know to be different in terms of their biological/chemical/medical fundamentals) had precisely the same effect, to 10 decimal places?

As others have pointed out, no reasonable person would assign significant probability to such a belief even at the outset of the test. The only possible weakness I can see in MacKay’s example is that the prior doesn’t assign much weight to the treatments being rather similar, which I suspect in medical testing may be more appropriate. But this is rather more subtle than just rejecting the “nil hypothesis”.

[Response: I don’t agree. It seems to me that a chemical about which we have no information has a vanishingly small probability of affecting the likelihood of contracting a disease. Does aspirin reduce your chance of contracting AIDS?

Clearly we have different priors — I see no reason a priori to believe there’s any effect, for good or ill. And if we do the test — the Bayesian test — to ask the question “is there an effect?” — we get a suggestive but inconclusive result.

This is one of those cases where I think the frequentist approach is appropriate. Chance of given result under the null hypothesis: a little over 4%. Null rejected at 95% confidence. Treatment probably works, but we can’t be dead-sure without more data.]

In practice, it wouldn’t be a chemical about which we have no knowledge. The fact that it has gone as far as a clinical trial strongly suggests that there is some a-priori reason to believe it will be effective, probably from chemoinformatic evidence, or in-vitro or animal testing.

Both the Bayesian and frequentist tests suggest that the “treatment probably works, but we can’t be sure without more data” (as the Bayes factor suggests the evidence in favour is just about substantial).

The frequentist approach is certainly appropriate, but there is little to suggest it is “better” than the Bayesian approach AFAICS. Conversely the Bayesian approach makes all of the assumptions explicit and gives a direct answer to the question of interest.

Tamino, out of curiosity I googled for ASA and HIV – it seems that Aspirin inhibits production of HIV through nuclear factor kappa binding. So it’s a close call. :) This was of course completely irrelevant, but it demonstrates how messy things are in biology.

Well, if you want to say “vanishingly small” you can choose a prior for that too. But you probably wouldn’t want to completely rule out the possibility that the drug works, or you might as well not do the test!

>He assumed that θ_A and θ_B are different, so he has rejected the null hypothesis by assumption.

No he didn’t. He merely didn’t assume that they are the same. With appropriate data his analysis could have led to the conclusion that they are the same. Bayesian analysis does not make use of null hypotheses.

>I’ll also point out that his model — that θ_A and θ_B are unequal — incorporates two parameters rather than the one in the null hypothesis (θ_A = θ_B = θ), but he hasn’t included any inference penalty for the extra parameter (as would be required by any good information criterion)

He doesn’t need any penalty. Information criteria (AIC, BIC, etc.) are not used in fully Bayesian model fitting.

[Response: This quote is from the author:

“We scrap the hypothesis that the two treatments have exactly equal effectivenesses, since we do not believe it.”]

I think it would be fair to say that MacKay didn’t reject the null hypothesis, he just thought a different null hypothesis was more appropriate (which I would agree with). Ian Jolliffe’s suggestion of adding a delta representing a practically significant level of effectiveness is even better.

It is also fine that there is no inference penalty for the extra parameter as the parameter is integrated out to get the marginal likelihood rather than inferred. It is the optimization of parameters that makes the inference penalty necessary.

[Response: Come on, people — what part of

“We scrap the hypothesis that the two treatments have exactly equal effectivenesses, since we do not believe it.”

don’t you understand?]


MacKay thinks the question we should be asking is “is treatment A beneficial”, if so the correct null hypothesis is that treatment A is harmful (compared to the placebo) or as useless as the placebo.

H0: θ_A ≥ θ_B
H1: θ_A < θ_B

so the possibility that the probabilities are equal is still there in the null hypothesis. It is the θ_A > θ_B part of H0 that explains why you get the result of 0.99 instead of 0.75 (the data don’t give much support for that).

The difference is in the question posed, not the machinery. It is “is A better than a placebo” rather than “is A different from a placebo”. MacKay doesn’t assume that θ_A is not equal to θ_B in his analysis, he only scraps it as a suitable null hypothesis on its own.

Neither question is “wrong” it is more a matter of which you think is more
interesting.

To say “he has rejected the null hypothesis by assumption” seems, to me, to sound too much like he has set up a null hypothesis and rejected it, which, of course he hasn’t (and I do not believe you mean this).

But what, then, does it mean to reject a null hypothesis by assumption?

How can he reject a null hypothesis without setting up a null hypothesis?

Well, it seems that it is your null hypothesis, and it seems it is being insinuated as the (only) logical starting point. Why should this be the case?

Again, as Blaise Egan said, to not assume equality is not the same as to assume inequality. The latter is a narrower assumption, and in this case the former means to make no assumption: they could be equal or either theta greater than the other. (If the thetas are allowed to be continuous, then of course it is basically impossible for them to come out equal, but quite possible for them to come out similar enough to be deemed functionally equivalent — in general, if not in this instance.)

Perhaps the “since we do not believe it”, the author’s reason to not assume equality, is part of the point? If this is thought to suggest that the author wants to force inequality, then I disagree with this interpretation (based on what I’ve read here and assuming inequality to mean a greater-than-negligible difference). Again, to scrap a hypothesis is not the same as to assume its opposite.

Hence, I don’t see how your response’s quote contradicts what Blaise Egan says.

BIC is not typically usable in a Bayesian model, because the number of parameters is not well defined. DIC is quite widely used instead. There are also other ways to validate models (posterior predictive checks, CV, etc.)

“Full Bayesian” in the sense of not trying different models is a fantasy. In almost all practical settings different models should be tried. (The present example does not count.)

You cannot really integrate \pi out above. Technically you can, but using an “uninformative” hyperprior for a single parameter does not help much.

I agree that the equal thetas (null) model should be compared to the different thetas model (rather than assumed to be wrong), but I think the Bayesian test used here inherently favours the null model. With the “50-50” prior on the hypothesis, the posterior probability that H1 is true is the ratio of marginal likelihoods (i.e. the likelihoods integrated over the prior distributions on the thetas), or the Bayes factor.
But marginal likelihoods – hence Bayes factors – can be very sensitive to the parameter priors (especially with small sample sizes). Bayes factors inherently penalize additional parameters, and the degree of penalty increases with the “vagueness” of the prior (because extreme parameter values – which may actually be very unlikely and fit the data very poorly – are given the same prior weight as realistic values). Therefore the use of uniform priors – while uninformative about the parameter values – may “bias” results of model comparison toward the simpler models (e.g. nulls).
The uniform prior says that infection rates of 100% are just as likely (a priori) as zero infection rates (for both treatments). Even basic information about the process of interest would allow more informative (about possible parameter values) priors, that likely would make a more powerful comparison (and still be objective).
Using a probit model in WinBUGS with an essentially uniform prior (standard normal prior on probit coefficients), I get a posterior probability of 0.86 for the alternative model (not sure why this is different to Tamino’s result!). If I use a prior with marginally less weight on extreme values, P(H1 | data) increases to > 0.9, and if I use a U-shaped prior that puts more weight on extreme values, P(H1|data) = 0.5. All of this is without changing the 50-50 prior on models.
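The prior-sensitivity point can be illustrated without WinBUGS (this is my own sketch, not the commenter's model): give every θ a symmetric Beta(a, a) prior instead of the uniform Beta(1, 1), and watch the Bayes factor move. The binomial coefficients cancel in the ratio, so they are omitted.

```python
from math import exp, lgamma

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bayes_factor(a):
    """BF for the two-theta model vs the shared-theta model,
    when every theta gets a symmetric Beta(a, a) prior."""
    log_h0 = log_beta(4 + a, 36 + a) - log_beta(a, a)
    log_h1 = (log_beta(1 + a, 29 + a) - log_beta(a, a)
              + log_beta(3 + a, 7 + a) - log_beta(a, a))
    return exp(log_h1 - log_h0)

for a in (0.5, 1.0, 2.0, 5.0):
    bf = bayes_factor(a)
    print(f"Beta({a}, {a}) priors: BF = {bf:.2f}, "
          f"P(H1|D) at 50-50 = {bf / (1 + bf):.2f}")
# a = 1.0 (uniform priors) reproduces the Bayes factor of about 3 from the post
```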

Yep, and that is good news. In Bayesian, the problem has a name and we can discuss it in its own terms. There are real-life situations (like this one) where it just isn’t very clear what an appropriate prior looks like. All you can do then is ‘what if’ for all possibly plausible priors, a sensitivity analysis as it were. If the result depends on it, you need more data.

Yes. Naive Bayes and Bayes estimators also have Bayes in the name but aren’t Bayesian. Empirical Bayes is also not fully Bayesian. AIC and BIC both have a Bayesian derivation. They are model selection tools, not model fitting tools. They assume flat priors. Automatic model selection using flat priors is not a very Bayesian thing to do. (Bayesians would do model averaging by preference or, if forced to choose models, would use slab-and-spike priors.) But we’re FITTING here, not choosing models.

[Response: All I did was apply the exact same analysis as before to different numbers — you didn’t object before. Now you want to insist on an informative prior?]

>I really think the current discussion points out the inherent difficulties in Bayesian analysis for some problems.

>We have very knowledgeable, well informed and well intentioned analysts with widely divergent opinions on the proper choice of Prior. That sort of thing happens a lot less with Frequentist analysis.

Almost all frequentist analyses have a corresponding Bayesian analysis that uses a flat prior and gets the same result. So the real difference is that Bayesian methods make the prior explicit and available for discussion whereas frequentist methods assume a key component of the model but hide it from debate. That’s not a model of science that I like.
