Wednesday, December 28, 2011

For many months I've been meaning to create the code for a split-plot design. The basic split-plot design has each subject participate in every level of factor B, but only one level of factor A. In this blog post, I report an example of a hierarchical Bayesian approach to a split-plot design, coded in JAGS (not BUGS). As this is my first go at this sort of design, please consider this a preliminary version of the analysis, and please provide feedback regarding any errors or infelicities. I'm also looking for people to do analogous NHST analyses for comparison, and I'm looking for other data sets for additional examples.

To make things concrete, consider an example provided by Maxwell & Delaney (2004, Designing Experiments and Analyzing Data: A Model Comparison Perspective, 2nd Edition, Erlbaum; Ch. 12). (As I've mentioned elsewhere, if you must learn NHST, their book is a great resource.) A perceptual psychologist is studying response times for identifying letters flashed on a screen. The letters can be rotated off of vertical by zero degrees, four degrees, or eight degrees. Every subject responds many times to letters at each of the three angles. The experimenter (for unknown reasons) analyzes only the median response time of each subject at each angle. Thus, each subject contributes only one datum at each level of angle (factor B). There are two types of subjects: "young" and "old." Age of subject is factor A. The structure of the data is shown below. Y is the median response time, in milliseconds. There are 10 subjects per Age group.

Y

Subj

Angle

Age

450

1

B1(Zero)

A1(Young)

510

1

B2(Four)

A1(Young)

630

1

B3(Eight)

A1(Young)

390

2

B1(Zero)

A1(Young)

480

2

B2(Four)

A1(Young)

540

2

B3(Eight)

A1(Young)

...

...

...

...

510

10

B1(Zero)

A1(Young)

540

10

B2(Four)

A1(Young)

660

10

B3(Eight)

A1(Young)

420

11

B1(Zero)

A2(Old)

570

11

B2(Four)

A2(Old)

690

11

B3(Eight)

A2(Old)

600

12

B1(Zero)

A2(Old)

720

12

B2(Four)

A2(Old)

810

12

B3(Eight)

A2(Old)

...

...

...

...

510

20

B1(Zero)

A2(Old)

690

20

B2(Four)

A2(Old)

810

20

B3(Eight)

A2(Old)

Split-plot designs are difficult to analyze in NHST because it is challenging to select an appropriate "error term" for constructing F ratios of different effects. Ch. 12 of Maxwell & Delaney is devoted largely to explaining the various options for error terms and the different F ratios (and different p values) that result. Split-plot designs get even more difficult when they are unbalanced, that is, when there are different numbers of subjects in the different levels of factor A. And, of course, when we do multiple comparisons we have to worry over which corrections to use.

In a Bayesian approach, on the other hand, there is no need to worry over the selection of error terms because we never construct F ratios. And there is no difficulty with unbalanced designs because the parameters are estimated based on whatever data are provided. The only challenge, in the hierarchical Bayesian approach that I prefer, is the construction of sum-to-zero parameterized effects. In all the ANOVA designs described in the book, the parameter values are converted so the effects, as deflections from baseline, sum to zero. (In case you haven't seen it yet, there is an important update regarding the sum-to-zero computations in the book's ANOVA programs, linked here.) With some effort, I figured out a way to convert the parameter estimates to sum-to-zero versions. It is done in R, outside of JAGS.

The model involves main effects for factor A and factor B, an interaction effect for AxB, and a main effect for subject within level of A. The basic model specification looks like this: for ( i in 1:Ntotal ) { y[i] ~ dnorm( mu[i] , tau ) mu[i] <- base + a[aLvl[i]] + s[sLvl[i]] + b[bLvl[i]] + axb[aLvl[i],bLvl[i]] }
The index i goes over rows of the data table, and Ntotal is the total number of data points. The data values, y[i], are assumed to be normally distributed around the predicted mean mu[i]. The predicted mean is the overall baseline, base, plus a deflection a[aLvl[i]] due to being in level aLvl of factor A, plus a deflection for each subject, plus a deflection for factor B, plus a deflection due to interaction of factors A and B. There is a hierarchical prior on each type of deflection, so that the variance of the deflections is estimated.

The tricky part comes in converting the parameter values to sum-to-zero values. Essentially, at each step in the chain the predicted mu is computed, then the cell means for the AxB table are computed. From the cell means, the baseline, A, B, and AxB deflections are computed. And, from the cell means, the within-level deflections of the subjects are computed. (The difficult part for me was figuring out that I should compute the cell means of the AxB table first, and compute everything else, including the subject effects, relative to those cell means.)

The first figure (below) shows marginals of the posterior standard deviations on the various effects and the noise standard deviation. The estimates of the standard deviations are not of primary interest, but here they are:

January 8, 2012: It was pointed out to me that the SD estimates for aSD (age), bSD (Angle), and axbSD are not very stable, even for chains of length 100,000. The reason is that only a few values inform the estimates. Specifically, for aSD, only two values inform it, namely a[1] and a[2], so it is very uncertain. And for bSD, only three values inform it, namely b[1], b[2], and b[3], so it too is very uncertain. I therefore included a warning message in the display. . Estimates of the effects (graphs below) are stable.

The next plot (below) shows the baseline, factor A deflections, factor B deflections, and AxB interaction deflections. Notice that the deflections do indeed sum to zero within each type:

The deflections are informative primarily when they are entered into contrasts. For example, the main effect of factor A is quite large:

This agrees with the conclusion from NHST analysis from Maxwell and Delaney, who reported F(1,18)=7.28, p=.0147. But the Bayesian posterior seems much more powerful than the NHST p value. Did I do something wrong, or is the Bayesian analysis simply much more powerful in this design?

We can also do contrasts, such as comparing young versus old (A1 vs A2) at the angle of zero (B1) only. The result is shown in the left panel below, where it can be seen that a difference of zero is within the 95% HDI:

This result agrees with the conclusion from NHST reported by Maxwell & Delaney (pp. 602-605), with F=3.16 and p=.092, or F=3.08 and p still >.05, depending on the choice of error term.

The right panel above shows a contrast that investigates the difference of quadratic trends across the age groups. A quadratic trend for the three levels of angle has contrast coefficients of 1,-2,1. The question is whether the contrast is different across the two levels of age. The Bayesian analysis shows that a difference of zero is within the 95% HDI. This result agrees with conclusion from NHST reported by Maxwell & Delaney (pp. 605-607), with F(1,54)=3.101 and p>.05, or F(1,18)=4.038 with p>.05, depending again on the choice of error term.

As usual, the above contrasts would have to be corrected for multiple comparisons in an NHST approach.

It is seamless to analyze an unbalanced design in the Bayesian approach. For example, the analysis runs completely unchanged (with different results) if we simply delete the last subject from the data table, so that there are 10 subjects in level A1 but only 9 subjects in level A2.

The data file is linked here, and the program is linked here. To run the program, just put it in the same folder as the data file and make sure that R has that folder as its working directory. The program uses JAGS and rjags, not OpenBUGS and BRugs. You must first install JAGS. JAGS runs on all platforms, including Mac and Linux! And then in R you must install rjags; just type install.packages("rjags").

Thursday, December 8, 2011

It's greatly appreciated when people go to the effort to write a nice review on Amazon. It's appreciated not only by the author :-) but crucially also by prospective readers who are trying to decide whether the book is worth getting. Here are some excerpts from some recent reviews on Amazon.com

This is one of the best written and accessible statistics books I've come across. Obviously, a lot of thinking went into coming up with examples and intuitive explanation of various ideas. I was consistently amazed at author's ability to not just say how something is done but why it is done that way using simple examples. I've read far more mathematically sophisticated explanations of statiscal modeling but, in this book,I felt I was allowed to peek into the mind of previous authors as to what they were really thinking when writing down their math formulas. (Posted November 11, 2011 by Davar314, San Francisco, CA)

As far as I am concerned, if you write a book this good, you get to put whatever you like on the cover - puppies, Angelina Jolie, even members of the metal band "Das Kruschke". While reading "DBDA" - reading *and* stepping through the code examples - will not make you a "Bayesian black-belt", it's impressive how much information it *will* give you - the book is almost 700 pages, after all - and you don't need (but it helps) to have tried to get the hang of the "Bayesian stuff" with other books to appreciate how friendly and effective this one is. (The author's explanation of the Metropolis algorithm is a good example). At the risk of sounding grandiose, the book just might do for Bayesian methods what Apple's original Mac did for the personal computer; here's hoping. (Posted December 7, 2011 by Dimitri Shvorob)

Click here for the full text of all the reviews on Amazon. Thanks again, reviewers, for the nice comments and for helping prospective readers.

Monday, November 14, 2011

The programming language R was recently updated to version 2.14. Unfortunately, BRugs is lagging behind, so it does not yet work with R 2.14. As of minutes ago, Uwe Ligges tells me, "I hope to get a new version to CRAN 'soon', i.e. within few weeks." Meanwhile, keep your older version of R installed if you want to use BRugs!

If you are using RStudio, and have R 2.14 installed in addition to R 2.13, you have to tell RStudio to use R 2.13. In RStudio, do this:
Tools
-> Options
-> R General
-> R Version "Change..." button
-> browse to something like C:\Program Files\R\R-2.13.0
Click apply/okay.
Then quit RStudio and restart it.

Here's the idea: Consider an MCMC chain that is strongly autocorrelated. Autocorrelation produces clumpy samples that are unrepresentative, in the short run, of the true underlying posterior distribution. Therefore, if possible, we would like to get rid of autocorrelation so that the MCMC sample provides a more precise estimate of the posterior sample. One way to decrease autocorrelation is to thin the sample, using only every nth step. If we keep 50,000 thinned steps with small autocorrelation, then we very probably have a more precise estimate of the posterior than 50,000 unthinned steps with high autocorrelation. But to get 50,000 kept steps in a thinned chain, we needed to generate n*50,000 steps. With such a long chain, the clumpy autocorrelation has probably all been averaged out anyway! In fact, Link and Eaton show that the longer (unthinned) chain usually yields better estimates of the true posterior than the shorter thinned chain, even for percentiles in the tail of the distribution, at least for the particular cases they consider.

The tricky part is knowing how long of a chain is long enough to smooth out short-run autocorrelation. The more severe the autocorrelation is, the longer the chain needs to be. I have not seen rules of thumb for how to translate an autocorrelation function into a recommended chain length. Link and Eaton suggest monitoring different independent chains and assaying whether the estimates produced by the different chains are suitably similar to each other.

For extreme autocorrelation, it's best not to rely on long-run averaging out, but instead to use other techniques that actually get rid of the autocorrelation. This usually involves reparameterization, as appropriate for the particular model.

Link and Eaton point out that the inefficiency of thinning has been known for years, but many practitioners have gone on using it anyway. My book followed those practitioners. It should be pointed out that thinning does not yield incorrect results (in the sense of being biased). Thinning merely produces correct results less efficiently (on average) than using the full chain from which the thinned chain was extracted. There are at least a couple more mathematically advanced textbooks that you might turn to for additional advice. For example, Jackman says in his 2009 book, Bayesian Analysis for the Social Sciences, "High levels of autocorrelation in a MCMC algorithm are not fatal in and of themselves, but they will indicate that a very long run of the sampler may be required. Thinning is not a strategy for avoiding these long runs, but it is a strategy for dealing with the otherwise overwhelming amount of MCMC output. (p. 263)" Christensen et al. say in their 2011 book, Bayesian Ideas and Data Analysis, "Unless there is severe autocorrelation, e.g., high correlation with, say [lag]=30, we don't believe that thinning is worthwhile. (p.146)" Unfortunately, there is no sure advice about how long a chain is needed. Longer is better. Perhaps if you're tempted to thin by n to reduce autocorrelation, just use a chain n times as long without thinning.

Tuesday, November 1, 2011

There are quite a few books on Bayesian statistics, but what makes this book stand out is the author’s focus of the book—writing for real people with real data. Clearly a master teacher, the author, John Kruschke, uses plain language to explain complex ideas and concepts. This 23-chapter book is comprehensive, covering all aspects of basic Bayesian statistics, including regression and analysis of variance, topics that are typically covered in the first course of statistics for upper level undergraduate or first-year graduate students. A comprehensive website is associated with the book and provides program codes, examples, data, and solutions to the exercises. If the book is used to teach a statistics course, this set of materials will be necessary and helpful for students as they go through the materials in the book step by step.

My thanks to Cody Ding for taking the time and effort to write a review (and for such a nice review too!).*Ding, C. (2011). Incorporating our own knowledge in data analysis: Is this the right time? (Book Review) PsycCRITIQUES, 56(36). DOI: 10.1037/a0024579

Wednesday, October 26, 2011

Bayesian methods can be used in general data-analytic models, in psychometric models, and in models of mind. What is the difference? In all three applications, there is Bayesian estimation of parameter values in a model. What differs between models is the source of the data and the meaning (semantic referent) of the parameters, as described in the diagram below:

As an example of a generic data-analytic model, consider data about ice cream sales and sleeve lengths, measured at different times of year. A linear regression model might show a negative slope for the line that describes a trend in the scatter of points. But the slope does not necessarily describe anything in the processes that generated the ice cream sales and sleeve lengths.

As an example of a psychometric model, consider multidimensional scaling (MDS). The data are similarity ratings (or confusion matrices) from a human observer, and the parameters are coordinates of items in a geometric representation of mental constructs that produced the ratings. Note that if the MDS model is applied to non-behavioral data, such as inter-city road distances, then it is not a psychometric model.

As an example of a Bayesian model of mind, consider models of visual perception of concavity. The data are the light beams reflecting off the curved object in the world and falling on the retina. The model in the mind has parameters that represent the shape of the object in the world being viewed and the angle of the light falling on it. The prior has strong bias for light falling from above (e.g., from the sun and sky). The posterior estimate of shape and lighting then tends to produce interpretations of shapes consistent with overhead lighting, unless there is strong evidence to the contrary. Notice that the parameter values are inside the head, so there must be additional assumptions regarding how to measure those parameter values.

Thursday, October 20, 2011

[The programs are now all available in JAGS, so you don't need to use this translator any more. See the more recent blog post.]

Fantastic contribution from Thomas G. Smith:

FakeBRugs - Pretend that rjags is BRugs
This tiny library of functions is enough to get you through many of the examples in Dr. John Kruschke's book, Doing Bayesian Data Analysis. The functions translate from the BRugs calls used in the book, to rjags calls that work on Unix-y environments. This is not a complete translation layer!!! It's just enough to do most of the things the book asks you to do!

Wednesday, October 19, 2011

A new article in Psychological Science† correctly points out flaws of p values and of procedures that produce bias in data selection and analysis. The article makes several reasonable conclusions. Unfortunately it also makes two wrong conclusions about issues of fundamental importance with major ramifications.

One conclusion is that p values are okay but need to be corrected for the researcher's stopping intention. I refute that claim by reductio ad absurdum. A second conclusion is that Bayesian analysis is a "nonsolution" to the problem of researchers having too much wiggle room. I dispute that claim by clarifying what problems any analysis method can or cannot address, by denying flexibility attributed to Bayesian analysis that isn't really available, and by claiming the actual flexibility in Bayesian analysis as an important advantage for scientific research.

The first issue stems from the fact, correctly pointed out in the article, that p values depend on the stopping intention of data collector. Here's the idea. Suppose a researcher collects some data and computes a summary statistic such as t or F or χ2. The p value is the probability of the observed statistic, or a value more extreme, in the space of possible data that might have been obtained if the null hypothesis were true and the intended experiment were repeated ad infinitum. The space of possible data that might have been obtained depends on the stopping intention. So, if the data collector intended to stop when the sample size N reached a certain number, such as 23, then the space of possible data includes all data sets for which N=23. But if the data collector intended to stop at the end of the week (and just happened to get N=23), then the space of possible data includes all data sets that could have been collected by the end of the week, some of which have N=23, and some of which have smaller N or larger N. Because the two spaces of possible data sets are not the same, the p values are not the same. The p value can depend quite dramatically on the stopping intention. For example, if the researcher intended to stop when N=100 but was unexpectedly interrupted when N=23, the p value is much smaller than if the intention was to stop when N=23. Or, if the researcher intended to stop when N=23 but got an unexpected windfall of data so that N=100, perhaps because a new volunteer assistant showed up, then the p value is much larger than if the researcher intended to stop at N=100. Therefore, to correctly determine the p value for a set of data, we must know the reason that data collection ceased.

Here is an example of the correct use of p values. A lab director has some research in mind and decides that N=30 is adequate. The director tells the lab administrator to collect data from 30 subjects. The administrator knows that the lab typically recruits about 30 subjects in a week, and therefore tells the data-collecting assistant to run subjects for a week. The assistant dutifully collects data, and at the end of the week the lab director happens to be present as the last datum is collected, which happens to be N=30. As far as the lab director can tell, data collection ceased intentionally when N=30. When the lab director analyzes the data, under the intention of stopping when N=30, a particular p value is computed. But when the assistant analyzes the data, under the intention of stopping at the end of the week, a differentp value is computed. In fact, for the lab director p<.05, but for the assistant p>.05. Which p value is correct? Are the results "significant"?

Here is another example of the correct use of p values. Two competing labs are pursuing the same type of research in a well established experimental paradigm. The two labs independently have the same idea for an identical experiment design, and the researchers go about collecting data. In one lab, they intend to collect data for a week, and they happen to get N=30. In the other lab, they intend to stop data collection when N=30. Moreover, by chance, the data in the two labs happen to be identical. (That's not so far fetched; e.g., perhaps the data are binary choices for each subject, and so the data can be summarized as the number of "heads" out of N subjects.) The two labs compute p values for their identical data sets. The two p values are different because the data were collected with different stopping intentions; in fact p<.05 for one lab but p>.05 for the other lab. Which lab got "significant" data?

The problem is that the data bear no signature of the stopping intention of the researcher. Indeed, researchers go out of their way to make sure that the data are not influenced by their stopping intention. Each datum collected is supposed to be completely insulated from any data collected before or after. The last datum collected has no trace that it was the last or the first or any position in between.

Not only is the intention opaque to the data, it's often opaque to the researcher. Collaborators on a project might have differing sampling intentions (as in the example of the director and assistant, above). Or the sampling intention might change midway through data collection. ("Let's collect N=100." Then on Friday afternoon, "Well, we've got N=94, that's good enough.") Or, as often happens, some subjects have to be deleted from the data set because of procedural errors or failure to respond, despite the data-collector's intention about sample size or stopping time. These are all very realistic sampling procedures, and we tolerate them because we know that the data are completely unaffected by the intention of the researcher.

Therefore it's strange that the interpretation of the data in terms of p values should depend crucially on something that has no impact on the data, namely, the stopping intention. But, the correct calculation of p values does depend on the stopping intention.

So, what should be done about this problem? Given the examples above, it seems clear that we should treat p values as inherently ill-defined because they depend on intentions that are irrelevant to the data.

But here is the #1 recommendation from the new article in Psychological Science:

Requirements for authors: ...1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

Presumably this requirement is declared so that researchers can use the stopping rule to calculate the true and correct p value, determined appropriately for the idiosyncratic intentions of the researchers. Here's a plausible example. "We decided to collect 100 observations, but by Friday we had 94 and figured that was close enough, so we stopped, and then had to delete 5 subjects because of later-discovered transcription errors. A post-experiment inquisition of the lab team revealed that one of our assistants was covertly intending to quit the job on Monday, which would have limited our data collection if we had not decided to stop on Friday. Therefore, running a large Monte Carlo simulation that incorporated the estimated recruitment rate of subjects during the week, and the estimated probability of deciding to stop on Friday for different values of N achieved on Friday, and the estimated probability of transcription errors, and the probability of an assistant quitting on Monday, we determined that p=..."

To ease the reporting of true and correct p values, it would be extremely helpful to have critical values for commonly-used statistics (t, F, etc.) under various typical stopping intentions that researchers adopt. All the critical values in contemporary books and computer programs assume the unrealistic convention that N was fixed in advance. Instead, we should have tables of critical values for stopping after a certain duration, with varieties of sampling rates during that interval. (In fact, I've already seeded this process with an example in this article.) Therefore researchers could obtain the true p values for their data if they intended to stop after a particular duration.

It would also be helpful to have correction factors for unexpected interruptions of data collection, or unexpected windfalls of data collection. The researcher would merely have to enter the intended sample size and the probability of being interrupted or enhanced (and the probability of each magnitude of enhancement) at any particular time during data collection, and the correction factor would produce the true and correct p value for the data.

Federal funding agencies have been especially keen lately to support collaborative efforts of teams of researchers. It is likely that different team members may harbor different intentions about the data collection (perhaps covertly or subconsciously, but harbored nonetheless). Therefore it would extremely useful to construct tables of true and correct critical values for cases of parallel mixed intentions, when one collaborator intends to stop at a fixed sample size, and the other collaborator intends to stop at the end of the week. Clearly, the construction of these tables should be a major funding priority for the granting agencies.

I look forward to a new industry of publications that reveal appropriate corrections for different sampling intentions. Fortunately, we already have a model of this industry in the extensive literature about correcting for false alarms in multiple comparisons. Depending on the types of intended comparisons, and whether the intentions are planned or post-hoc, we have a cornucopia of corrections for every possible set of intended comparisons. Unfortunately, all of those corrections for multiple comparisons have been based on the sampling assumption that data collection stopped at fixed sample size! Therefore, every one of the corrections for multiple comparisons will have to be reworked for different stopping intentions. It will be a great day for science when we have a complete set of corrections for all the various intentions regarding multiple comparisons and intentions regarding stopping collection of data, because then we will know true and correct p values for our data, which were completely insulated from those intentions.

Oops! Sorry, I slipped into sarcasm. But hey, it's my blog. I should reiterate that I agree with many of the points made by the authors of the Psychological Science article. Just not the point about p values and stopping intentions. And one more point, regarding a different analysis method that the authors dismissed as a "nonsolution." Here is the relevant excerpt:

Nonsolutions: ...Using Bayesian statistics. We have a similar reaction to calls for using Bayesian rather than frequentist approaches to analyzing experimental data (see, e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by case basis, providing yet more researcher degrees of freedom.

It's important to be clear that statistical analysis of any kind can only deal with the data its given. If the design and procedure garner biased data, no analysis can fully undo that bias. Garbage in, garbage out. If the problem of "too many researcher degrees of freedom" stems from design-and-procedure problems, then it needs design-and-procedure solutions. To say that Bayesian analysis is a nonsolution to a design-and-procedure problem is like saying that a meat grinder is a nonsolution to rancid meat (and that therefore the meat grinder is useless) (and leaving the reader to make the leap that therefore the meat grinder is useless).1

The authors argue that Bayesian analysis "increases researcher degrees of freedom" (and is therefore bad) in two ways. First, "it offers a new set of analyses (in addition to all the frequentist ones)". The tacit assumption of this statement seems to be that researchers would try frequentist and Bayesian approaches and just report the one that gave the most flattering conclusion. No, this wouldn't fly. Bayesian analyses provide the most complete inferential information given the data (in the normative mathematical sense), and analysts can't just slip into frequentist mode because it's flattering. In fact, reporting a p value is embarrassing, not flattering, because p values are ill defined.

Second, say the authors, "Bayesian statistics require making additional judgments (e.g., the prior distribution)...". Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. In actuality, priors are overt and explicitly agreeable to a skeptical scientific audience. Typically they are set to be noncommittal so that they have minimal influence on the posterior distribution. When there is considerable previous research to inform a prior, then a strong prior can give great inferential leverage to small samples. And not using strong prior information when it is available can be a serious blunder; consider random drug testing and disease diagnosis, which must take into account the base rates, i.e., the priors.

Bayesian analysis does, in fact, give analysts more flexibility than traditional frequentist analysis. It gives the analyst the flexibility to use a model that actually describes the trends and distributions represented in the data, instead of being shoe-horned into linear models and normal distributions that may have little resemblance to the data. (Of course, if the analyst wants linear models with normal distributions, Bayesian analysis provides the richest possible inference about their parameters without ever computing a p value.) With Bayesian analysis, researchers can actually get useful parametric descriptions of complex data, involving multi-level non-linear models with non-normal distributions at various levels throughout the model. This is the flexibility that scientific theorizing needs.

1 Revision to parenthetical remark made on Oct. 23, 2011, in response to personal communication from Uri Simonsohn. Struck-out version was in the original post, which could have inadvertently been interpreted as saying that the authors themselves explicitly made the meat-grinder-is-useless argument. They did not.

Wednesday, October 5, 2011

From time to time I've been asked whether autocorrelation in MCMC chains is okay if the chains are converged, as indicated by the BGR statistic being close to 1.0. The answer is: No. Autocorrelation in the chains implies that the MCMC sample is clumpy. A clumpy sample is not representative of a smooth distribution.

Here is an example of a case in which the BGR statistic is nicely behaved near 1.0, but there is still notable autocorrelation. It arises from doing multiple linear regression (see Fig. 17.4, p. 458 of the book) on two predictors that are highly correlated. The regression coefficients are denoted b[1] and b[2]. Here are chains that have no thinning:

Notice that the BGR is at 1.0 throughout, but the ACF has notable autocorrelation.

There is a separate question of how much autocorrelation can be tolerated. This depends on the particular parameterization and the summary measure of the MCMC sample that is being considered. But all of that worrying can be avoided if it is easy to thin the chain and get rid of autocorrelation, as it is in the example above.

If mere thinning doesn't do the trick (because the MCMC sampling takes a long time), then sometimes transforming the data can help (see the book, e.g., p. 459). Otherwise, reparameterizing is the usual way to go. You can actually change the model, or sometimes you can transform the parameters after they've been sampled and the transformed versions aren't autocorrelated (e.g., this blog post regarding Bayesian ANOVA).

Monday, October 3, 2011

All of a sudden it just makes sense! Everyone knows that "lightbulb moment", when previously accumulated knowledge or facts become condensed into a lucid concept, where something previously opaque becomes crystal clear. This book is laden with such moments. This is the most accessible statistics text for a generation and I predict (based on prior knowledge) that it will be a major factor in moving scientists of every shape and size towards the Bayesian paradigm. Even if you're sceptical, you're likely to learn more about frequentist statistics by reading this book, than by reading any of the tomes offered by so called popularisers. If you are a social scientist, laboratory scientist, clinical researcher or triallist, this book represents the single best investment of your time. Bayesian statistics offer a single, unified and coherent approach to data analysis. If you're intimidated by the use of a scripting language like "R" or "BUGS", then don't be. The book repays your close attention and has very clear instructions on code, which elucidate the concepts and the actual mechanics of the analysis like nothing I've seen before. All in all, a great investment. The only serious question that can be raised about the design and implementation of a book such as this is: why wasn't it done before?

Click here to see the review at Amazon.com. My great appreciation goes to R. Dunne for taking the effort to post the review.

Well, sort of. From this brief article in The Guardian (UK), it seems that the judge was most concerned about dubious values for the probabilities being used for the terms in Bayes' rule, not the validity of Bayes' rule itself. The judge seems to have been worried that garbage in yields garbage out. But the article also gives the impression that the judge outlawed the use of Bayes' rule in court, unless the component probabilities could be proven. That's like outlawing the use of modus ponens, unless the antecedent can be proven.

Wednesday, September 28, 2011

For users of Linux / MacOS, the windows() command in R is not recognized. The analogous command is X11(). Therefore, you could go through the programs and replace every occurrence of windows(...) with X11(...). Or, you could just tell R to do it for you, by putting this line at the top of the file (but after any remove all objects command):

Saturday, September 10, 2011

Though risking hyperbole, I would describe this book as revolutionary, at least in the context of psychology. It is, to my knowledge, the first book of its kind in this field to provide a general introduction to exclusively Bayesian statistical methods. In addition, it does so almost entirely by way of Monte Carlo simulation methods. While reasonable minds may disagree, it is arguable that both the general Bayesian framework advocated here, and the heavy use of Monte Carlo simulations, are destined to be the future of all data analysis, whether in psychology or elsewhere. If this is so, then Doing Bayesian Data Analysis might be something of a harbinger, rousing psychology to the new realitites of data-analysis in the 21st century. ... Doing Bayesian Data Analysis introduces psychology to new ways of thinking and new ways of talking about and presenting data-analysis. Anyone serious about data analysis in psychology ought to read this book. At the very least, it will serve as a welcome new perspective on the field. More probably, or so it seems to me, the ideas and methods presented here will eventually be seen as the foundations for new approaches to statistics that will becomes commonplace in the near future.

I think the reviewer has a remarkably clear view of the intellectual landscape, of where psychology (and science generally) is going, and of where the book attempts to situate itself. Thank you, Mark, for such a perceptive review. Now may everyone else see as clearly as you!

When plotting the distribution formed by an MCMC chain, the book uses a histogram. I've added a new option to the plotPost.R function so that it can make a curve instead. Get the new plotPost.R program from the book's web site.

Notice that the "breaks" argument was removed from the command when using showCurve=T, because "breaks" only applies to histograms.

Why use one form or the other? A histogram reminds the viewer that the plot represents a sample of points, not a known mathematical function (even though there is a mathematical function underlying the sample). This was the motivation for using histograms in the book. Also, the curve uses a kernel density estimation technique that blurs the distribution, so the curve will tend to be not quite peaked enough and a little too wide in its tails. On the other hand, the curve might be easier to view and more aesthetically pleasing. Ultimately the choice is yours. Now you have the choice!

Sunday, August 28, 2011

I have reviewed a number of statistics texts for academic journals over the years, and have authored published reviews of some six books specifically devoted to Bayesian analysis. I consider John Kruschke's "Doing Bayesian Data Analysis" to be the best text available for learning this branch of statistics.

Learning how to craft meaningful statistical tests and models based on Bayesian methods is not an easy task. Nor is it an easy task to write a comprehensive basic text on the subject -- one that actually guides the reader through the various Bayesian concepts and mathematical operations so that they have a solid working ability to develop their own Bayesian-based analyses.

There are now quite a few texts to choose from in this area, and some are quite good. But Kruschke's text, in my opinion, is the most useful one available. It is very well written, the concepts unique to the Bayesian approach are clearly presented, and there is an excellent instructors manual for professors who have adopted the book for their classes. Kruschke uses R and WinBUGS for showing examples of the methods he describes, and provides all of the code so that the reader can adapt the methods for their own projects.

"Doing Bayesian Data Analysis" is not just an excellent text for the classroom, but also -- and I think foremost -- it is just the text one would want to work through in order to learn how to employ Bayesian methods for oneself.

Saturday, August 27, 2011

"All said and done, the criticisms I have raised here are relatively minor. This is the best introductory textbook on Bayesian MCMC techniques I have read, and the most suitable for psychology students. It fills a gap I described in my recent review of six other introductory Bayesian method texts (Smithson, 2010). I look forward to using it in my own teaching, and I recommend it to anyone wishing to introduce graduate or advanced undergraduate students to the emerging Bayesian revolution."

Friday, August 26, 2011

"A person would have to make an effort not to learn this material after following this tutorial. The book is relentlessly clear. Topics are explained analytically as well as visually and code is provided with which the reader can see and change every assumption made."

Despite this brilliant and insightful assessment, Dan later states "The cover has puppies on it. Yes, puppies. Had paper grocery bags not disappeared from supermarkets, I would have covered my copy to avoid the strange looks my thoroughly quantitative colleagues gave me as I spent weeks working through the book."

Well, the solution to this problem is just a Post-It away! See photo at right.

Thank you, Dan, for working through the book and writing such a thoughtful review.

P.S. As explained at this other blog entry, the happy puppies are named Prior, Likelihood, and Posterior. The Posterior puppy has half-up ears, a compromise between the perky ears of the Likelihood puppy and the floppy ears of the Prior puppy. (The puppy on the back cover is named Evidence. MCMC methods make it unnecessary to explicitly compute the evidence, so that puppy gets sleepy with nothing much to do.)

Monday, August 1, 2011

Bayesian analysis has recently attracted attention because of its application to data from experiments that investigate extrasensory perception (a.k.a. non-local perception, psi phenomena, etc.). There have been Bayesian analyses and re-analyses of data that were initially analyzed by NHST (see e.g., "Feeling the Future" by Bem (2011), a critique by Wagenmakers et al. (2011), a rejoinder by Bem, Utts & Johnson, and related discussion by Kruschke (2011). In their exchange, Wagenmakers et al. and Bem et al. emphasized Bayesian model comparison, not Bayesian parameter estimation, which are explained and contrasted by Kruschke (2011). In this blog post I show a Bayesian estimation approach to meta-analysis of ESP data.

The data. The specific motivation for this post started when my attention was attracted to a recent article,

Storm et al. (2010) provide summary data from 67 experiments. In all of the experiments, data were in the form of dichotomous correct/wrong judgments out of a fixed pool of choices. In 63 of the 67 experiments, there were 4 choices, hence chance responding was 1/4. (The other experiments: For May (2007) chance was 1/3; for Roe & Flint (2007) chance was 1/8; for Storm (2003) chance was 1/5; and for Watt & Wiseman (2002) chance was 1/5. See Storm et al. (2010) for reference info.) Storm et al. (2010) computed summary data for each experiment by summing across trials and subjects, yielding a total correct out of total trials for each experiment. Storm et al. (2010) also divided the experiments into three types of procedure, or categories: Ganzfeld (C1), Non-Ganzfeld noise reduction (C2), and free response (C3).

The New Analysis. The 63 experiments that have chance performance at 1/4 are exactly the type of data that can be plugged into the hierarchical model in Figure 9.7 (p. 207) of the book, reproduced at right, using program BernBetaMuKappaBugs.R. Trial i in experiment j has response yji (1=correct, 0=wrong), as shown at the bottom of the diagram. The estimated underlying probability correct for experiment j is denoted θj. The underlying probabilities correct in the 63 experiments are described as coming from an overarching beta distribution that has mean μ and "certainty" or "tightness" κ. The model thereby estimates probability correct for individual experiments and, at the higher level, across experiments. In particular, the parameter μ indicates the underlying accuracy across experiments. We are primarily interested in the across-experiment level because we want to know what we can infer by combining information from many experiments. But the estimates at the individual-experiment level are also interesting, because they experience shrinkage from the simultaneous estimation of other experiment parameters in the hierarchical model. The constants in the top-level prior were set to be vague and non-committal; in particular A=1 and B=1 so that the beta prior on μ was uniform. The MCMC chains were burned-in and thinned so they were nicely converged with minimal autocorrelation. The MCMC sample contained 10,000 points.

Results. First I'll show results for each category, then combined across categories. For category C1 (Ganzfeld), here are histograms of the marginals on each parameter:

Click image to view it enlarged, but click "back" in your browser to come back here!

The across-experiment μ (top left) is clearly above chance, with 100% of the posterior sample falling above 0.25. The 95% HDI goes from 0.288 to 0.366.

The estimates for the 29 individual experiments reflect shrinkage; for example, C1:25 has a mean posterior θ of 0.402, but in the data the proportion correct was 23/51 = 0.451. In other words, the estimate for the experiment has been shrunken toward the central tendency of all the groups. The estimates for the 29 individual experiments also reflect the sample sizes in each experiment; for example, C1:19 had only 17 trials, and its HDI is relatively wide, whereas C1:14 had 138 trials, and its HDI is relatively narrow. And, of course, experiments with smaller samples can experience more shrinkage than experiments with larger samples.

Results from category C2 are similar but less strong, and results from C3 do not exclude μ=0.25:

The results for C3 (above) are not as strong as they could be because the three experiments that were excluded (because they did not have chance at 1/4) all had results very different from chance! Thus, the results for C3 are artificially weaker than they should be, due to a selection bias.

The differences between μ1, μ2, and μ3 are shown here:

Although μ3 might be deemed to be different from the others, it is not a strong difference, and, as mentioned above, the magnitude of μ3 was artificially reduced by excluding three experiments with high outcomes. Therefore it is worth considering the results when all 63 experiments are included together. Here are the results:

Using a Skeptical Prior: It is easy to incorporate skepticism into the prior. We can express our skepticism in terms of chance performance in a large number of fictitious previous experiments. For example, we might equate our skepticism with there having been 400 previous experiments with chance performance, so in the hyperprior we set A=100 and B=300. This hyperprior insists that μ falls close to 0.25, and only overwhelming evidence will shift the posterior away μ=0.25. A skeptic would also assert that all experiments have the same (chance) performance, not merely that they are at chance on average (with some above chance and others below chance). Hence a skeptical prior on κ would emphasize large values that force the θj to be nearly the same.

Bayesian estimation vs Bayesian model comparison. A general discussion of Bayesian estimation and Bayesian model comparison can be found in Kruschke (2011) and at other posts in this blog. In the present application, a major advantage of the estimation approach is that we have an explicit distribution on the parameter values. The analysis explicitly reveals our uncertainty about the underlying accuracy in each experiment and across experiments. The hierarchical structure also lets the estimate of accuracy in each experiment be informed by data from other experiments. On the other hand, Bayesian model comparison often provides only a Bayes factor, which tells us only about the relative credibility of the point null hypothesis and another specific non-null prior hypothesis, without telling us what the parameter values could be.

The Bayesian estimation approach is also very flexible. For example, data from individual subjects could be modeled as well. That is, instead of collapsing across all trials and subjects, every subject could have an estimated subject-accuracy parameter, and the subjects within an experiment are modeled by an experiment-level beta distribution that has mean θj, and the higher levels of the model are as already implemented here. Doing model comparisons with such a model can become unwieldy.

If our goal were to show that the null hypothesis is true, then Bayesian model comparison is uniquely qualified to express the point null hypothesis, but only in comparison to a selected alternative prior. And even if the Bayes factor in model comparison favors the point null, it provides no bounds on our uncertainty in the underlying accuracy. Only the estimation approach provides explicit bounds on the uncertainty of the underlying accuracies, even when those accuracies are near chance.

The file drawer problem. The file-drawer problem is the possibility that the data in the meta-analysis suffer from a selection bias: It could be that many other experiments have been conducted but have remained unpublished because the results were not "significant". Therefore the studies in the meta-analysis might be biased toward those that show significant results, and under-represent many non-significant findings. Unfortunately, this is a problem with biased sampling in the data, and the problem cannot be truly fixed by any analysis technique, Bayesian or otherwise. Storm et al. (2010) describe a couple methods that estimate how many unpublished null results would be needed to render the meta-analysis non-significant (in classical NHST). You can judge for yourself if those numbers are impressive or not. The file drawer problem is a design issue, not an analysis issue. One way to address the file drawer problem is by establishing the procedural convention that all experiments must be publicly registered before the data are collected. This procedure is an attempt to prevent data from being selectively filtered out of public view. Of course, it could be that people in the experiments would be able to sense that the experiment had been pre-registered (through non-local perception), which would interfere with their ability to exhibit further ESP in the experiment itself.

So, does ESP exist? The Bayesian estimation technique says this: Given the data (which might be biased by the file drawer problem), and non-committal priors (which do not reflect a skeptical stance toward ESP), the underlying probability correct is almost certainly greater than chance. Moreover, the estimation provides explicit bounds on the uncertainty for each experiment and across experiments. For skeptical priors, the conclusion might be different, depending on the degree of skepticism, but it's easy to find out. No analysis can solve the file-drawer problem, which can only be truly addressed by experiment design procedures, such as publicly registering an experiment before the data are collected (assuming that the procedure does not alter the data).

Friday, July 29, 2011

In the book, the default length of an MCMC chain differs from one program to another, ranging from around 2,000 to 10,000 total steps combined across chains (e.g., 3 chains with 1,000 steps each is 3,000 total steps). But how many steps are needed for a stable estimate of the HDI limits? Is 2,000 enough? Is 10,000 enough?

The issue motivating the question is that the MCMC chain is just one random sample from the true underlying posterior distribution. If we seeded the random chain differently (or used a few more or less burn-in steps, or thinned slightly more or less) we would have a different random MCMC chain with slightly different HDI limits because of random noise in the MCMC sample. When the MCMC chain gets very long, then in the limit it approximates the true posterior extremely well. But for chains of realistic length (not the unobtainable "in the limit" chain), how stable is the MCMC estimate of the HDI?

This blog entry reports some Monte Carlo simulations to illustrate the answer. Suppose that the true posterior is a known distribution, so we can easily generate random values from it. What we'll do is generate an MCMC chain from the true posterior, and estimate the (95%) HDI from the MCMC chain. We'll do that many times and keep track of the estimated HDI's, so we can see how much the estimated HDI's vary. For each MCMC chain we'll also determine the 2.5%ile and the 97.5%ile, to compare with the 95% HDI.

Here are results when the true posterior is a gamma distribution and the MCMC chain has length 1,000:

The top panel above shows the true gamma distribution from which the MCMC chains were generated. Each MCMC chain had 1,000 values in it, and for each chain the 95% HDI was estimated. This was done 20,000 times. The middle row of histograms shows the HDI limits across the 20,000 replications. The bottom row of histograms shows the percentiles across the 20,000 replications. The vertical dashed lines in the histograms show the true values of the HDI.

Notice first that the 2.5%ile and 97.5%ile, in the bottom row, are very poor estimates of the HDI. This is because the true distribution is skewed, and the equal-tailed credible interval is quite different than the HDI.

Notice next that there is indeed quite a lot of variation across the 20,000 replications. For example, the histogram of "HDI low" spans the true value of 0.0799, but different chains give different estimates. (The histogram displays the 95% HDI across the 20,000 replications going from 0.02559 to 0.1976. This is not the HDI of the MCMC chain! It is the HDI, across replications, of the estimated lower value of the HDI of the MCMC chain.)

Notice finally that there is a bit of bias in the estimate of the limits of HDI. Specifically, the estimated HDI is a little bit narrower than the true HDI. This makes intuitive sense: The estimated HDI finds the narrowest 95% interval in the finite MCMC sample. Because the tails of the distribution have more points sampled just inside the HDI than just outside the HDI, there is, for small chains, a higher chance of finding a narrowest 95% interval just inside the true HDI. As the chain length gets larger, this small-sample bias is reduced, as can be seen in the next figures

Here are the results when the chain length is 5,000:

And here are the results when the chain length is 10,000:

When the chains have a length of 10,000, the bias in the HDI estimate is only slight. There is still, of course, some imprecision in the estimate of the HDI limits (and width). It is up to the user to ascertain whether this amount of imprecision is tolerable for the particular application, or whether a longer MCMC chain should be computed.

Conclusion: When the limits of the HDI are close to the decision boundary (such as the ROPE), or when the HDI width is only marginally narrow enough for the desired precision, then be sure to use a very long MCMC chain so that the HDI limits are stable. The graphs above help illustrate how long is "very long." In general, while the book uses chains on the order of 5,000 steps for pedagogical purposes, real research reports should probably use at least 10,000 steps or much more as patience permits.