Friday, April 29, 2011

Here's a simple example to illustrate some of the advantages of Bayesian data analysis over maximum likelihood estimation (MLE) with null hypothesis significance testing (NHST). I haven't seen this example anywhere else, but please let me know if similar things have previously appeared "out there".

Figure 1.

Suppose we have data consisting of <x,y> pairs for every individual; e.g., X is time and Y is amount of carbon dioxide in a sample. An example of this type of data is shown in Figure 1. We might try to describe this sort of data with simple linear regression. Instead, we'll describe the data with an exponential growth curve:

Y = exp( b1 ( X - b0 ) ) + N(0,s)

where b1 is the growth rate, b0 is the X value at which the predicted Y value is one, and s is the standard deviation of normally distributed noise. Notice that s is the same at all values of X. (The data in the figure were generated from the model using particular values for b0, b1, and s. But you don't need to know the generating values.)

Goals of the analysis:

Estimate the values of b0, b1, and s that are most credible given the sample of data.

Predict values of Y for future X; i.e., values of X larger than any yet observed.

We can attempt to address those goals by Bayesian analysis or by MLE+NHST. Let's see what happens.

Goal: Estimate the values of b0, b1, and s that are most credible given the sample of data.

The MLE is the specific combination of <b0,b1,s> values that maximizes the probability of the data:

Πi p( yi | xi, b0, b1, s ) =Πi N( yi - exp( b1 ( xi - b0 ) ) , s )

The MLE is graphed as the red line in Figure 1. It can be seen that the MLE provides a reasonable description of the trend in the data. But it provides no sense of what other parameter values are reasonable, given the data.

By contrast, a Bayesian analysis inherently reveals the entire 3-dimensional conjoint distribution of credible parameter values. Figure 2 (below) shows aspects of the posterior distribution, using a very diffuse prior.

Figure 2.

Notice in the left panel of Figure 2 that the credible values of b0 and b1 trade off with each other, in a curious curved fashion. A random subset of these credible values are plotted as the blue curves in Figure 1.

The MLE point estimate is plotted as the red circle in the left panel of Figure 2. Notice that the MLE reveals nothing about what other values of b0 and b1 are reasonable, given the data. Sometimes people approximate the likelihood function at the MLE with a multivariate Gaussian, with covariance matrix set to the second derivatives of the likelihood function (either analytically computed or numerically approximated). But the posterior is obviously not Gaussian, so the asymptotic (large N) approximation is not good in this case, even when N=100.

The right panel of Figure 2 shows clearly that the credible values of b1 are larger than zero. 100% of the posterior falls above zero, and the 95%HDI falls way above zero; even a wide ROPE would be excluded by the HDI. The posterior also reveals the gradation of credibility across the range of b1.

How would NHST decide whether b1 is non-zero? We need to determine whether p<.05. And to determine p, we need to get a sampling distribution of b1. To get a sampling distribution, we need to determine what the intended sampling process was. In this case, there happened to be N=100 data points, but we don't know whether the experimenter intended to stop when N=100, or intended to stop at the end of the week and just happened to get 100 observations, or intended to collect 150 but was unexpectedly interrupted, or whatever. For each sampling process, there is a different sampling distribution and a different p value! But suppose we select some sampling process. Next, we need to figure out a null hypothesis from which to generate simulated samples. The null hypothesis will be the model with b1=0, because the "null" we're asking about is b1=0. But what values should be use for b0 and for s? We could use the MLE estimates of b0 and s that we happened to observe in our real data, but those values are just from a random sample and plausible null hypotheses could use different values of b0 and s. For each value of b0 and s, there is a different sampling distribution and a different p value (I think; correct me if I'm wrong). But suppose we select some values for b0 and s. Then we need to generate the sampling distribution. Maybe someone could derive it mathematically, but in lieu of that derivation, it can be approximated by simulating many random samples from the null hypothesis and for every random sample computing the MLE. Finally, we compute the percentile of the actual MLE b1 in the sampling distribution of MLE b1, which yields (1 minus) the p value. All that work to get something that is ill defined, and tells us only whether we'd reject b1=0.

It does not tell us what other values of b1 we would or would not reject. To determine the NHST confidence interval, we need to repeat the above process for every candidate value of b1! The confidence interval is the range of b1 values that we would not reject, so we have to go through the sampling distribution computations, and all their assumptions, for every candidate value of b1. And even when we determine the limits of the confidence interval, we still don't have any gradation of confidence over that range.

Figure 3 shows the data replotted with an expanded Y scale. The MLE is plotted by the red bars, which show the 95% HDI of its predicted values Y. In other words, the MLE value of s specifies a spread of variability in the data, and that spread is plotted as vertical red bars, spanning the MLE curve that was shown in Figure 1.

Figure 3 also shows the Bayesian posterior predictions, as the vertical blue bars. Notice that the Bayesian predictions reveal the increasing uncertainty in prediction as the probed X values extrapolate farther beyond the observed values. Notice also that the mean of the Bayesian predictions (marked by the blue horizontal tic marks) curves away from the MLE prediction in this case! This is caused by the asymmetric spread of credible curves around the MLE curve.

Saturday, April 23, 2011

Exercise 18.3 (p. 513) has you consider an ANOVA model that does not assume that every group has the same underlying variance. But the model is applied to data that, it turns out, have fairly homogeneous variances, and therefore the basic conclusions do not change from the model with homogeneous variances assumed. The graph at right shows some data in which the assumption of homogeneous variances does matter. There are four groups with apparently quite different variances. For these data, Groups 2 and 3 have small variances, and therefore they might be credibly different if considered alone. But groups 1 and 4 have large variances, so if the model is forced to accommodate all four groups simultaneously with a single variance (i.e., a single precision parameter), then the estimated variances of Groups 2 and 3 will be too large. Consequently, the estimated means of Groups 2 and 3 will be uncertain because a wide range of candidate mean parameter values can accommodate the data when the variance parameter is large. And, therefore, the means of Groups 2 and 3 will not be credibly different. Simultaneously, the estimated variances of Groups 1 and 4 will be too small. Consequently, the estimated means of Groups 1 and 4 will be too certain because only a relatively narrow range of candidate mean parameter values can accommodate the data when the variance parameter is small. And, therefore, the means of Groups 1 and 4 will appear to be credibly different.

Below are histograms of the differences between group deflections when the estimated variances are forced to be equal across groups. Group 2 vs 3 is lower left; Group 1 vs 4 is upper right. Notice that the comparison of Groups 2 and 3 suggests (inappropriately) that they are not credibly different, and the comparison of Groups 1 and 4 suggests (inappropriately) that they are credibly different. Click the image to display it enlarged, but don't forget to click the "back" button of your browser to come back to this page.

Below are histograms of the differences between group deflections when the variances are allowed to be non-homogeneous across groups. Group 2 vs 3 is lower left; Group 1 vs 4 is upper right. Notice that the comparison of Groups 2 and 3 suggests (appropriately) that they are credibly different, and the comparison of Groups 1 and 4 suggests (appropriately) that they are not credibly different. Click the image to display it enlarged, but don't forget to click the "back" button of your browser to come back to this page.

By the way, if you run a traditional, NHST-style ANOVA on these data, using R's aov() function, and follow it with pairwise comparisons using R's TukeyHSD() function, you get the graph displayed at the right. Notice that it says that the difference between Groups 1 and 4 (labeled 4-1) is significant, but the difference between groups 2 and 3 (labeled 3-2) is not significant. In other words, aov() assumes homogeneity of variances.

Friday, April 15, 2011

[For these editors, there might be versions for MacOS and Unix, but if you are going to run the programs from Doing Bayesian Data Analysis that call BRugs, then you'll need to run the 32-bit Windows version of R and (I presume) run the editor in Windows too. But if you don't use BRugs, and instead use JAGS with rjags, then you should get the version of the editor that is appropriate for your operating system.]