Wednesday, July 6, 2011

Another example of null-value assessment by estimation or model comparison

The recent article in Perspectives on Psychological Science (see blog post here) showed an example of estimating the bias of a coin, and assessing the credibility that the coin is fair. In that article I showed that the Bayes factor can change dramatically when the alternative hypothesis is changed, but the Bayesian estimate of the bias barely changes at all.

Here I show an analogous example of estimating the mean and standard deviation for data described by a normal distribution. In other words, it's an example to which frequentists would apply a single-sample t test. The issue is assessing whether the mean is credibly non-zero. The results show that the Bayes factor can change dramatically when the alternative hypothesis is changed, but the Bayesian estimate of the mean barely changes at all. This fact is not news; what is new is illustrating it in BUGS as hierarchical model comparison, using the style of programming used in the book.

One other novelty in this post is a demonstration of the model comparison preferring the null even when the uncertainty in the estimate is large. That conclusion seems very suspicious, and once again the estimation approach yields the more informative conclusion.

For purposes of illustration, I set up a hierarchical model comparison in BUGS, analogous to the example in Section 10.2.1 (p. 244) of the book. Both models are the ordinary Bayesian estimation of mu and sigma of a normal likelihood, exactly as in Chapter 15 (e.g., p. 396) of the book. All that differs between models is the prior on mu. One model represents the null hypothesis and puts an extremely narrow "spike" prior on mu, normal(mean=0,SD=0.01). The other model represents an alternative hypothesis and puts a relatively wide prior on mu, such as normal(mean=0,SD=20). For both models, the prior on sigma is uniform(low=0,high=10), which is diffuse relative to the data that will be entered. Altogether, the hierarchical model estimates five parameters: mu_alt and sigma_alt in the alternative model, mu_null and sigma_null in the null model (with mu_null residing close to zero because of the spike prior), and the model index parameter (which is given a 50/50 prior).

The data set was merely a random sample of 40 values from a normal distribution, re-scaled so that the sample had a mean of 0.8 and a standard deviation of 2.0. The resulting posterior distribution (20,000 steps in the MCMC chain) looks like this:
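The qualitative behavior of this model comparison can be reproduced outside of BUGS. Here is a minimal numerical sketch in Python (my own illustration, not the BUGS program from the book): the null model's spike prior is approximated as mu fixed at 0, sigma gets the uniform(0,10) prior in both models, and each model's marginal likelihood is computed by simple grid integration. The data are a random sample of 40 values rescaled to a sample mean of 0.8 and SD of 2.0, as described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random sample of 40 values, re-scaled so the sample mean is exactly 0.8
# and the sample SD is exactly 2.0, as in the post.
y = rng.normal(size=40)
y = (y - y.mean()) / y.std(ddof=1) * 2.0 + 0.8

n = len(y)
ybar = y.mean()
ssd = np.sum((y - ybar) ** 2)  # sum of squared deviations from the mean

def log_marglik_alt(prior_mu_sd, sigma_hi=10.0):
    """Grid-approximated log marginal likelihood of the alternative model:
    normal likelihood, normal(0, prior_mu_sd) prior on mu,
    uniform(0, sigma_hi) prior on sigma."""
    mu = np.linspace(-15.0, 15.0, 3001)       # integrand is negligible beyond this
    sigma = np.linspace(0.05, sigma_hi, 400)
    M, S = np.meshgrid(mu, sigma)
    # sum_i (y_i - mu)^2 = ssd + n*(ybar - mu)^2
    ss = ssd + n * (ybar - M) ** 2
    loglik = -n * np.log(S) - 0.5 * n * np.log(2 * np.pi) - ss / (2 * S ** 2)
    logprior = (-0.5 * (M / prior_mu_sd) ** 2
                - np.log(prior_mu_sd * np.sqrt(2 * np.pi))
                - np.log(sigma_hi))
    lp = loglik + logprior
    m = lp.max()
    dmu, dsig = mu[1] - mu[0], sigma[1] - sigma[0]
    return m + np.log(np.exp(lp - m).sum() * dmu * dsig)

def log_marglik_null(sigma_hi=10.0):
    """Null model, with the spike prior approximated as mu fixed at 0."""
    sigma = np.linspace(0.05, sigma_hi, 4000)
    ss = ssd + n * ybar ** 2
    lp = (-n * np.log(sigma) - 0.5 * n * np.log(2 * np.pi)
          - ss / (2 * sigma ** 2) - np.log(sigma_hi))
    m = lp.max()
    return m + np.log(np.exp(lp - m).sum() * (sigma[1] - sigma[0]))

bf_null = np.exp(log_marglik_null() - log_marglik_alt(prior_mu_sd=20.0))
print(f"Bayes factor in favor of the null (alt prior SD = 20): {bf_null:.2f}")
```

With the wide SD=20 prior on mu in the alternative model, the Bayes factor comes out around 3 in favor of the null, consistent with the MCMC result reported below, even though the sample mean is more than two standard errors from zero.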

Notice that the model index, displayed in the central plot, substantially prefers the null model, 0.751 to 0.249. In other words, because the prior on the model index was 50/50, the Bayes factor is .751/.249, about 3.0 in favor of the null hypothesis. But notice that the estimate of mu, shown in the upper right, has a 95% HDI from 0.1814 to 1.453, clearly excluding the null value. From these results, should we conclude that the null hypothesis is credible, as suggested by the Bayes factor, or should we conclude that the null value is not among the credible values, as suggested by the parameter estimate? To me, the latter makes more sense.

When the alternative prior is changed to be a bit less diffuse, as normal(mean=0,SD=1.5), the posterior instead looks like this:

Now the model index substantially prefers the alternative model: the posterior model probabilities are 0.225 (null) to 0.775 (alternative), a Bayes factor of about 3.4 in favor of the alternative. Thus the model comparison has completely flip-flopped. But the 95% HDI of the estimated mu (upper right) has changed only slightly, now going from 0.1413 to 1.377.

As a final demo, consider a case in which there is a small sample (N=10) with a sample mean of zero:

Here, the model comparison overwhelmingly prefers the null hypothesis (with a Bayes factor of .967/.033, about 29 in favor of the null), even though the estimate of mu (upper right) is extremely uncertain, with a 95% HDI ranging from -1.6 to +1.6, which is 1.6 SD's of the sample data! From these results, do we strongly believe the null hypothesis, as the Bayes factor suggests, or do we instead believe that the null value is among the credible values, with a large uncertainty? To me, the latter makes more sense.
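The same back-of-the-envelope computation shows why a small, utterly inconclusive sample can produce a big Bayes factor for the null. Treating sigma as known and assuming a sample SD of 2.0 (the post does not state the SD for this demo, so that value is my assumption), a sample mean of exactly zero gives:

```python
from math import sqrt, pi, exp

def norm_pdf(x, sd):
    """Density of a normal(0, sd) distribution at x."""
    return exp(-0.5 * (x / sd) ** 2) / (sd * sqrt(2 * pi))

n, ybar, s, prior_sd = 10, 0.0, 2.0, 20.0   # s and prior_sd are assumptions
se = s / sqrt(n)
bf = norm_pdf(ybar, se) / norm_pdf(ybar, sqrt(se ** 2 + prior_sd ** 2))
print(bf)   # strongly favors the null despite huge estimation uncertainty
```

When ybar = 0 this ratio reduces to sqrt(1 + n * prior_sd**2 / s**2), so the Bayes factor for the null grows without bound as the alternative prior gets wider, no matter how little the data constrain mu.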

4 comments:

Well, you know of course that my perspective is somewhat different from yours. I feel that when estimation makes more sense, one should do estimation. When model selection makes more sense, one should do model selection and calculate a Bayes factor (possibly with a robustness check). Trying to compare methods that seek to answer different questions is comparing apples and oranges.

It's comparing apples and torafugu. Both are edible, but while apples are generally healthy, torafugu can be poisonous if not prepared very carefully.

As I concluded in the PoPS article (and in the book), null assessment by Bayesian model comparison can be swallowed *if* the alternative hypothesis is very carefully constructed so that it accurately represents a theoretically meaningful alternative. (And, as you mention, a robustness check can help.) But even then, the Bayes factor provides no estimate of the parameter value and hence no information about uncertainty in the parameter value. Usually we want the apple even if we've survived the torafugu.

Bayesian model comparison is most delicious and nutritious when applied to cases other than null-value assessment, because both models are probably theoretically meaningful (instead of one model being "anything else" like a generic alternative hypothesis). But even then, the priors of the models must be equally informed if the model comparison is to be healthy.

I'm not sure whether researchers want that apple at all. In experimental psychology, the question of interest is almost always one of model selection: does the effect exist? The question of how big the effect is, in case it exists, is theoretically not very pressing. I believe this is also the reason why confidence intervals have not replaced p-values: researchers want to know whether their manipulation was effective or not; whether the effect is 30 msec or 50 msec is not relevant.

Also, if we agree that the Bayes factor is the right answer in case the prior is appropriate, then the scientific problem becomes how to specify an appropriate prior. If you know the needle is hidden in a haystack, you wouldn't go look for it on the kitchen table instead.

Depending on the specific domain, scientists can be interested in both questions: Is there a non-zero effect? and What's the magnitude of the effect? The issue is not which question is more appropriate or more popular to ask, because in fact both can be legitimately asked in different applications. The issue is what's the most useful way to address the questions.

The parameter-estimation approach is more informative than the model-comparison approach, and the model-comparison approach can be misleading. These were among the points I made with the examples in the original blog post and in the articles. To recap the blog post, when model comparison and parameter estimation disagree, it's the parameter estimation that is more sensible. When model comparison and parameter estimation agree, it's the parameter estimation that is more informative, because it reveals what parameter values are credible. A special case of the latter situation is when the sample size is small and the Bayes factor strongly prefers the null hypothesis even though the parameter estimation reveals huge uncertainty in the parameter value.

You (E.J.) correctly point out that the model-comparison approach relies on the scientific establishment of an appropriate prior for the alternative (non-null) hypothesis. There are various ways to set an alternative-hypothesis prior. One way relies on mathematical criteria to define an "uninformed" prior. Unfortunately, different mathematical criteria lead to different priors, and, more fundamentally, that method has limited usefulness because the uninformed prior may have little resemblance to any alternative hypothesis that scientists actually care about. Another way to establish a prior relies on elicitation from experts, who intuit a prior that expresses their beliefs. That method might be useful if the theoretical debate can be reduced to ad hominem arguments about competing intuitions. Another, and most useful, way to establish a scientifically informed alternative-hypothesis prior is by using a posterior distribution derived from previous data sets. The posterior distribution from previous data is computed through parameter estimation. Thus, even the proper use of the model-comparison approach relies on using parameter estimation.

One more thing that parameter estimation is useful for, besides serving as a prior for subsequent model comparisons, is power analysis. The posterior distribution provides parameter values for generating simulated data for analyzing power. The Bayes factor from model comparison provides no way to assess power.
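The power analysis described in this paragraph can be sketched very simply (my illustration of the general idea, not the book's actual procedure): draw (mu, sigma) values from the posterior, simulate a new data set of the planned size for each draw, and count how often the new interval estimate excludes zero. The posterior sample below is a hypothetical stand-in for real MCMC draws, and a classical 95% interval is used as a crude stand-in for the HDI.

```python
import random, statistics, math

random.seed(0)

# Hypothetical posterior sample for (mu, sigma); in practice these would be
# MCMC draws from the parameter-estimation step.
posterior = [(random.gauss(0.8, 0.3), abs(random.gauss(2.0, 0.2)))
             for _ in range(500)]

n_planned = 40   # planned sample size for the replication
hits = 0
for mu, sigma in posterior:
    sim = [random.gauss(mu, sigma) for _ in range(n_planned)]
    m = statistics.mean(sim)
    se = statistics.stdev(sim) / math.sqrt(n_planned)
    # Crude 95% interval as a stand-in for the 95% HDI.
    if m - 1.96 * se > 0 or m + 1.96 * se < 0:
        hits += 1

power = hits / len(posterior)
print(f"estimated power to exclude 0 at N={n_planned}: {power:.2f}")
```

Because the simulated parameter values are drawn from the posterior rather than fixed at a point estimate, the resulting power estimate automatically incorporates the uncertainty in the parameters, which is exactly the information the Bayes factor discards.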