Who is allowed to cheat? I.J. Good and that after dinner comedy hour….

It was from my Virginia Tech colleague I.J. Good (in statistics), who died five years ago (April 5, 2009), at 93, that I learned most of what I call “howlers” on this blog. His favorites were based on the “paradoxes” of stopping rules. (I had posted this last year here.)

“In conversation I have emphasized to other statisticians, starting in 1950, that, in virtue of the ‘law of the iterated logarithm,’ by optional stopping an arbitrarily high sigmage, and therefore an arbitrarily small tail-area probability, can be attained even when the null hypothesis is true. In other words if a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead” (Good 1983, 135) [*]

This paper came from a conference where we both presented, and he was extremely critical of my error statistical defense on this point. (I was like a year out of grad school, and he a University Distinguished Professor.)

One time, years later, after hearing Jack give this howler for the nth time, “a Fisherian [but not a Bayesian] can cheat, etc.,” I was driving him to his office, and suddenly blurted out what I really thought:

“You know Jack, as many times as I have heard you tell this, I’ve always been baffled as to its lesson about who is allowed to cheat. Error statisticians require the overall and not the ‘computed’ significance level be reported. To us, what would be cheating would be reporting the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size. True, we are forced to fret about how stopping rules alter the error probabilities of tests, while the Bayesian is free to ignore them, but why isn’t the real lesson that the Bayesian is allowed to cheat?” (A published version of my remark may be found in EGEK p. 351: “As often as my distinguished colleague presents this point…”)

To my surprise, or actually shock, after pondering this a bit, Jack said something like, “Hmm, I never thought of it this way.”

By the way, the story of the “after dinner Bayesian comedy hour” on this blog did not allude to Jack but to someone who gave a much more embellished version. Since it’s Saturday night, let’s once again listen in on the comedy hour that unfolded at my dinner table at an academic conference:

Did you hear the one about the researcher who gets a phone call from the guy analyzing his data? First the guy congratulates him and says, “The results show a statistically significant difference at the .05 level—p-value .048.” But then, an hour later, the phone rings again. It’s the same guy, but now he’s apologizing. It turns out that the experimenter intended to keep sampling until the result was 1.96 standard deviations away from the 0 null—in either direction—so they had to reanalyze the data (n=169), and the results were no longer statistically significant at the .05 level.

Much laughter.

So the researcher is tearing his hair out when the same guy calls back again. “Congratulations!” the guy says. “I just found out that the experimenter actually had planned to take n=169 all along, so the results are statistically significant.”

Howls of laughter.

But then the guy calls back with the bad news . . .

It turns out that failing to score a sufficiently impressive effect after n’ trials, the experimenter went on to n” trials, and so on and so forth until finally, say, on trial number 169, he obtained a result 1.96 standard deviations from the null.

It continues this way, and every time the guy calls in and reports a shift in the p-value, the table erupts in howls of laughter! From everyone except me, sitting in stunned silence, staring straight ahead. The hilarity ensues from the idea that the experimenter’s reported psychological intentions about when to stop sampling are altering the statistical results.

The allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter may be called the argument from intentions. When stopping rules matter, however, we are looking not at “intentions” but at real alterations to the probative capacity of the test, as picked up by a change in the test’s corresponding error probabilities. The analogous problem occurs if there is a fixed null hypothesis and the experimenter is allowed to search for maximally likely alternative hypotheses (Mayo and Kruse 2001; Cox and Hinkley 1974). Much the same issue is operating in what physicists call the look-elsewhere effect (LEE), which arose in the context of “bump hunting” in the Higgs results.

The optional stopping effect often appears in illustrations of how error statistics violates the Likelihood Principle LP, alluding to a two-sided test from a Normal distribution:

Xi ~ N(µ,σ) and we test H0: µ=0, vs. H1: µ≠0.

The stopping rule might take the form:

Keep sampling until |m| ≥ 1.96(σ/√n),

with m the sample mean. When n is fixed the type 1 error probability is .05, but with this stopping rule the actual significance level differs from, and is greater than, .05. In fact, ignoring the stopping rule allows a high or maximal probability of error. For a sampling theorist, this example alone “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1977, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. As Birnbaum (1969, 128) puts it, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations.” From the error-statistical standpoint, ignoring the stopping rule allows readily inferring that there is evidence for a non-null hypothesis even though it has passed with low if not minimal severity.
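The inflation is easy to see in a small simulation. What follows is only a minimal sketch, not anything taken from the cited papers: it assumes σ = 1 is known, that the null µ = 0 is true, that the experimenter checks after every single observation, and that sampling is capped at a hypothetical max_n. The function names, the cap, the seed and the number of simulated experiments are all illustrative assumptions. With max_n = 1 the rule reduces to an ordinary fixed-sample test.

```python
import math
import random


def stops_with_rejection(max_n, rng, z_crit=1.96, sigma=1.0):
    """Sample under a true null (mu = 0), checking after every observation.

    Returns True if |sample mean| >= z_crit * sigma / sqrt(n) at some n <= max_n.
    """
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)  # one more draw from N(0, sigma^2)
        if abs(total / n) >= z_crit * sigma / math.sqrt(n):
            return True  # nominally significant: the experimenter stops here
    return False  # hit the cap without ever reaching nominal significance


def rejection_rate(max_n, n_sims=2000, seed=0):
    """Estimate the actual type 1 error probability of the try-and-try-again rule."""
    rng = random.Random(seed)
    hits = sum(stops_with_rejection(max_n, rng) for _ in range(n_sims))
    return hits / n_sims


if __name__ == "__main__":
    for cap in (1, 10, 100):
        print(cap, rejection_rate(cap))
```

Under these assumptions the estimated rate should sit near .05 when the cap is 1 and climb well above it as the cap grows, which is the sense in which the “computed” and the overall significance levels come apart.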

Peter Armitage, in his comments on Savage at the 1959 forum (“Savage Forum” 1962), put it thus:

I think it is quite clear that likelihood ratios, and therefore posterior probabilities, do not depend on a stopping rule. . . . I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then “Thou shalt be misled if thou dost not know that.” If so, prior probability methods seem to appear in a less attractive light than frequency methods, where one can take into account the method of sampling. (Savage 1962, 72; emphasis added; see [ii])

H is not being put to a stringent test when a researcher allows trying and trying again until the data are far enough from H0 to reject it in favor of H.

Stopping Rule Principle

Picking up on the effect appears evanescent—locked in someone’s head—if one has no way of taking error probabilities into account:

In general, suppose that you collect data of any kind whatsoever — not necessarily Bernoullian, nor identically distributed, nor independent of each other . . . — stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place. (Edwards, Lindman, and Savage 1962, 238-239)

This is called the irrelevance of the stopping rule or the Stopping Rule Principle (SRP), and is an implication of the (strong) likelihood principle (LP), which is taken up elsewhere in this blog.[i]

To the holder of the LP, the intuition is that the stopping rule is irrelevant; to the error statistician the stopping rule is quite relevant because the probability that the persistent experimenter finds data against the no-difference null is increased, even if the null is true. It alters the well-testedness of claims inferred. (Error #11 of Mayo and Spanos 2011 “Error Statistics”.)
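The clash between the two outlooks can be put in numbers with a standard textbook illustration that is not part of this post: a coin is tossed and 9 heads and 3 tails are observed, testing θ = 0.5 against θ > 0.5. The data, the hypotheses and the function names below are all assumptions chosen for illustration. One experimenter fixed n = 12 in advance (binomial); the other planned to stop at the third tail (negative binomial). A rough sketch:

```python
from math import comb, isclose


def binomial_likelihood(theta, heads=9, tails=3):
    """Likelihood when the total number of tosses was fixed in advance."""
    n = heads + tails
    return comb(n, heads) * theta**heads * (1 - theta)**tails


def negative_binomial_likelihood(theta, heads=9, tails=3):
    """Likelihood when tossing continued until the `tails`-th tail appeared."""
    n = heads + tails
    # The last toss must be a tail; the other tails fall among the first n - 1.
    return comb(n - 1, tails - 1) * theta**heads * (1 - theta)**tails


def binomial_p_value(heads=9, tails=3):
    """P(at least `heads` heads in heads + tails tosses | theta = 0.5)."""
    n = heads + tails
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2**n


def negative_binomial_p_value(heads=9, tails=3):
    """P(at least `heads` heads before the `tails`-th tail | theta = 0.5).

    Equivalent to at most tails - 1 tails in the first heads + tails - 1 tosses.
    """
    m = heads + tails - 1
    return sum(comb(m, j) for j in range(tails)) / 2**m


# Likelihood ratios agree, so the LP holder sees the same evidence.
lr_fixed = binomial_likelihood(0.75) / binomial_likelihood(0.5)
lr_stop = negative_binomial_likelihood(0.75) / negative_binomial_likelihood(0.5)
print(isclose(lr_fixed, lr_stop))      # True

# Tail areas differ, so the error statistician does not.
print(binomial_p_value())              # 299/4096, about 0.0730
print(negative_binomial_p_value())     # 67/2048, about 0.0327
```

The two likelihood functions differ only by a constant factor, which is why the SRP treats the sampling plans as evidentially equivalent, while the two tail-area probabilities fall on opposite sides of .05.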

A Funny Thing Happened at the Savage Forum[i]

While Savage says he was always uncomfortable with the argument from intentions, he is reminding Barnard of the argument that Barnard promoted years before. He’s saying, in effect, Don’t you remember, George? You’re the one who so convincingly urged in 1952 that to take stopping rules into account is like taking psychological intentions into account:

The argument then was this: The design of a sequential experiment is, in the last analysis, what the experimenter actually intended to do. His intention is locked up inside his head. (Savage 1962, 76)

But, alas, Barnard had changed his mind. Still, the argument from intentions is repeated again and again by Bayesians. Howson and Urbach think it entails dire conclusions for significance tests:

A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenter, embodying his stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support. . . . For scientists would not normally regard such personal intentions as proper influences on the support which data give to a hypothesis. (Howson and Urbach 1993, 212)

It is fallacious to insinuate that regarding optional stopping as relevant is in effect to make private intentions relevant. Although the choice of stopping rule (as with other test specifications) is determined by the intentions of the experimenter, it does not follow that taking account of its influence is to take account of subjective intentions. The allegation is a non sequitur.

We often hear things like:

[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1,…, xn) [in Armitage’s example] if the stopping rule is not given. . . . [D]ata should be able to speak for itself. (Berger and Wolpert 1988, 78)

But data do not speak for themselves, unless sufficient information is included to correctly appraise relevant error probabilities. The error statistician has a perfectly nonpsychological way of accounting for the impact of stopping rules, as well as other aspects of experimental plans. The impact is on the stringency or severity of the test that the purported “real effect” has passed. In the optional stopping plan, there is a difference in the set of possible outcomes; certain outcomes available in the fixed sample size plan are no longer available. If a stopping rule is truly open-ended (it need not be), then the possible outcomes do not contain any that fail to reject the null hypothesis. (The above rule stops in a finite number of trials; it is “proper”.)

Does the difference in error probabilities corresponding to a difference in sampling plans correspond to any real difference in the experiment? Yes. The researchers really did do something different in the try-and-try-again scheme and, as Armitage says, thou shalt be misled if your account cannot report this.

We have banished the argument from intentions, the allegation that letting stopping plans matter to the interpretation of data is tantamount to letting psychological intentions matter. So if you’re at my dinner table, can I count on you not to rehearse this one…?

The equivalent stopping rule can be framed in terms of the corresponding 95% “confidence interval” method, given the normal distribution above (their term and quotes):

Keep sampling until the 95% confidence interval excludes 0.

Berger and Wolpert concede that using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that μ ≠ 0, and has done so honestly” (pp. 80-81). This seems to be a striking admission, especially as the Bayesian interval assigns a probability of .95 to the truth of the interval estimate (using a “noninformative prior density”):

µ = m ± 1.96(σ/√n)

But, they maintain (or did back then) that the LP only “seems to allow the experimenter to mislead a Bayesian. The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist.” Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?
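To make the question concrete, here is a rough sketch of the interval version of the rule. It assumes σ = 1 is known, the true mean is 0, a flat (noninformative) prior on µ so that the posterior is N(m, σ²/n), and a hypothetical cap of 500 observations per run; the names, the cap, the seed and the number of runs are illustrative assumptions only.

```python
import math
import random


def normal_cdf(x):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_until_interval_excludes_zero(max_n, rng, sigma=1.0, z=1.96):
    """Return (n, m, half_width) at stopping, or None if the cap is reached."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, sigma)  # the true mean really is 0
        m = total / n
        half_width = z * sigma / math.sqrt(n)
        if abs(m) > half_width:  # interval m +/- half_width excludes 0
            return n, m, half_width
    return None


def flat_prior_posterior_mass(half_width, n, sigma=1.0):
    """Posterior probability of [m - hw, m + hw] when mu | data ~ N(m, sigma^2/n).

    The stopping rule does not enter this calculation at all.
    """
    se = sigma / math.sqrt(n)
    return normal_cdf(half_width / se) - normal_cdf(-half_width / se)


rng = random.Random(1)
runs = [sample_until_interval_excludes_zero(500, rng) for _ in range(200)]
stopped = [r for r in runs if r is not None]

for n, m, hw in stopped[:3]:
    print(n, (m - hw, m + hw), flat_prior_posterior_mass(hw, n))
```

Every run that stops reports an interval that excludes the true value 0, yet the flat-prior posterior mass on that interval stays at about .95, since nothing in the posterior calculation registers that the experimenter tried and tried again.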

[*] It was because of these “conversations” that Jack thought his name should be included in the “Jeffreys-Lindley paradox”, so I always call it the Jeffreys-Good-Lindley paradox. I discuss this in EGEK 1996, Chapter 10, and in Mayo and Kruse (2001). See a recent paper by my colleague Aris Spanos (2013) on the Jeffreys-Lindley paradox.

[i] There are certain exceptions where the stopping rule may be “informative”. Other posts may be found on LP violations, and an informal version of my critique of Birnbaum’s LP argument. On optional stopping, see also Irony and Bad Faith.

[ii] I found, on an old webpage of mine, (a pale copy of) the “Savage forum”:


18 thoughts on “Who is allowed to cheat? I.J. Good and that after dinner comedy hour….”

“Does this mean that while the real error probabilities are poor, Bayesians are not impacted, since, from the perspective of what they believe, there is no misleading?”

People can believe whatever they want. Scientific evidence/proof demands more. That is why the Higgs evidence required a small p-value. To brush aside the concern over procedure and its relation to error probabilities is junk science. It is unfortunate that such notions have become fashionable.

Byrd: I agree with you. But trying to give the Bayesian view (expressed here) as much of a generous interpretation as possible, I have tried, e.g., in lengthy discussions with Good, to imagine feeling that if we remove concern with the method being wrong with high probability, then the supposed cheating “really” disappears. If one is not reporting on error probabilities of one’s method, then one cannot be reporting misleading error probabilities. Or, as Berger and Wolpert (1988) roughly put it (this is not a quote), given what was believed, one hasn’t been led astray.

So, it is even alright to commit the Texas sharpshooter’s fallacy if I believe I am a good shot? It seems failure to consider stopping rules is to fail to set up the statistical problem correctly, and not just for a frequentist. For any scientist or other problem solver.

John: Good point. By the same reasoning, yes. Naturally, any Bayesian who cares to block this can also tell you how they can block it, but this misses the real problem, so directly captured by error probs (plus, in my opinion, SEV*).

* The reason there’s a need for this evidential rationale (or something akin to it) is that otherwise it can be dismissed as a concern with mere long-runs. I love the Texas sharpshooter example because it seems so patently obvious that the misleading inference concerns the actual skill of this particular sharpshooter (and what can be expected from him in future shoots that are not rigged).

John: The reason the particle physics community required such a strenuous test of the null (and unlikely) hypothesis is that the detectors and the accelerator cost several billion dollars. Since there was no chance of replication beyond ATLAS and CMS, they had to get it right the first time. The LEE had to be taken into account as Mayo rightly notes. So rather than trying to model all the interdependencies, one tightens the detection threshold.

This isn’t a critique of their analysis, which was impressive to say the least. Just an explanation for why such an extremal value was required for the detection claim. If the statistical statement involved likelihood ratios, posterior PDFs or Bayes factors, it would have been similarly dramatic.

Not to be rude, but who uses stopping rules like these? I am familiar with “observed till we got N total events” or “observed till we got n events of one type”. While I imagine there is a way to go back and forth between the two formulations, I have found counting problems make the issue with optional stopping more apparent.

I realize this is rather presumptuous of me, but could you go through the solution for a simple Binomial process with optional stopping? Or point me to it in the literature, for I want to understand what I am missing in your argument.

West: I don’t think anyone checks *with each new sample* whether nominal significance has been achieved, but lots of researchers do effectively the same thing, just with a larger number of samples between checks.

Researchers often decide when to stop data collection on the basis of interim data analysis. Notably, a recent survey of behavioral scientists found that approximately 70% admitted to having done so (John, Loewenstein, & Prelec, 2011). In conversations with colleagues, we have learned that many believe this practice exerts no more than a trivial influence on false-positive rates.

Thanks for citations. Good ones to read. Kruschke says Bayesian methods do not suffer from the problem (or less so) because the methods will tend to accept a true null 75-80% of the time when doing sequential tests… I would say that is suffering. Not sure why he thinks that situation is an improvement over what happens when applying significance tests. Good news is he wishes to mitigate it.

John: That’s a curious defense, because the problem here is being guaranteed to reject a true null! The same thing can happen without sequential trials by the way. Readers might look at the Savage Forum pasted here for Savage’s attempt to wriggle out of this. But again, I’m prepared to accept that if you don’t evaluate methods by their error probabilities then this is not “misleading”.

My bewilderment might be a function of being in astrophysics rather than in medicine. Having a decision theory based stopping rule makes sense when doing randomized drug trials with sick patients. I am racking my brain to come up with a recent physics experiment that uses a similar rule and have come up empty. Admittedly my attempt is biased itself and suffers from low number statistics.

Although I am often very critical about Bayesians I would like to point out that one has to be very careful in criticising them with this example. If, as a Bayesian, you use an ‘uninformative’ prior you almost certainly believe the null hypothesis is false. If you don’t, then you have a lump of probability on the null being true. This will have the effect of shrinking inferences towards the null. Thus the Bayesian can complain ‘I am conservative all the time that I am testing null hypotheses. The frequentist makes an adjustment when and only when (s)he uses optional stopping and then has the gall to criticise me because I don’t exceptionally use such an adjustment. This is like saying that I am a reckless driver because I don’t check the weather forecast in order to decide whether I should wear my seatbelt or not when my policy, in fact, is to wear it all the time’.

The Bayesian point of view is that if you allow yourself optionally to stop before an experiment reaches its full term you will on average adjust your inferences more than if you ran the experiment to full term. However the additional adjustment only occurs in those cases where you stopped early.

Personally, I think this is just an example of using one framework as a means of criticising the other, which is based on a completely different system. I don’t regard the frequentist criticism of Bayesian statistics as being right any more than I regard the Bayesian criticism of P-values as being right. See my recent exchange with David Colquhoun http://www.dcscience.net/?p=6518

Stephen’s description of the Bayesian Way Out underscores the key problem with their approach (used as he describes) to evaluating what is warranted by the data: It’s all a matter of believing the null false or putting a lump of prior on it! We are back to the idea that such a Bayesian cannot be misled, given what he believed.
As for Stephen’s last remark, I’m entirely prepared to say that evaluating the Bayesian here, according to whether he can avoid misleading interpretations with high probability, might be seen to apply to it a competing approach which he has no interest in satisfying. That was my generous construal of why Jack claims that it is the significance tester who can cheat and not the Bayesian.

Stephen Senn and I are often on the same side of the fence, and he is a very sagacious statistician, but here I think he’s got it wrong. First, as for “one has to be very careful in criticising them with this example,” we should remember that this is the Bayesian’s example, not ours. They even have a name for it: the Stopping Rule Principle SRP (which says it’s irrelevant that you tried and tried again, in this example–I’m putting it quickly, see references).
They laugh at us for taking “intentions” into account, thereby precluding us from being relevant to science (as Howson and Urbach say), on the grounds that science doesn’t usually take intentions into account. That’s why the guffaws at my dinner table. But when we look at this howler closely, we think, “wait a minute, who is allowed to cheat here?” For numerous references to the “argument from intentions” see EGEK 346-8, 363. Chapter 10 Why You Cannot Be Just a Little Bayesian. http://www.phil.vt.edu/dmayo/personal_website/EGEKChap10.pdf

Second, this particular example is not a matter of stopping before you planned, but following an open-ended rule that is sure to stop (it’s a proper stopping rule) and reject the true null. The analogous problem occurs with confidence intervals. (Senn may be thinking of a different kind of example).

All that said, I repeat myself in granting that in a “real”(?) sense the Bayesian can say (as do Berger and Wolpert in one place) that the Bayesian cannot be misled here, given what he believed. I grant as well that if one does not care to report error probabilities associated with a procedure, then one cannot incorrectly report them, nor cheat as regards to them, nor any such thing. This is what I learned from long conversations with IJ.

It is probably useful to bear in mind a few facts about the practical results one would get from using the optional stopping rules.

First, the experiment is only certain to stop with a significant outcome if one is prepared to spend an almost infinite amount of time and money on gathering samples. In practical terms it is not guaranteed to stop.

Second, if you use a t-test for the stopping (I haven’t played with the unrealistic circumstance of known variance) then the smallest possible P-value at stopping gets larger and larger as the sample size gets larger. It is never very small and so I can say that the evidence gained from the optional stopping protocol will never be very convincing when the null is true.

Third, the optional stopping protocol increases the power to detect true effects when the null is false. A sensible accounting of errors will take into account more than just type I errors.

As often as I have discussed this howler, I have only mentioned in passing that the most embellished versions, OK the funnier ones, if one is inclined to laugh at such things, are based on a so-called principle of “irrelevant censoring” (rather than optional stopping). This allows the person holding forth to go through vivid (OK, hilarious, if one is inclined to laugh at such things), if incredibly childish, rounds in which the person’s Final Will has to be read to determine if he intended to use such and such instrument, which might have been out of order beyond a certain range…and then there can be the discovery of a Revised Will, etc. etc. followed by the discovery that the instrument was working after all, or would have been working, or if it had not been working, another instrument was at hand and would have been called in, but it too might have been broken on that day, and so on. You get the idea.

I think that Jack Good’s original description of the problem was misleading and you were right to pick him up on it but Good’s formulation is not at all the way I usually hear it from Bayesians.
Here’s my discussion from chapter 19 of Statistical Issues in Drug Development

”
The following is an argument I first heard from Don Berry. Suppose that two physicians run a clinical trial of an experimental treatment to be compared with a control with a ceiling of 100 patients* and that the outcome for each patient is success or failure.

Dr A decides to look at his results and carry out some ‘test’ after 50 patients have been treated. If the result is significant he will stop. If not he will continue. Dr B decides that she will treat 100 patients and then stop. There will be no interim analysis. Both physicians run their trials. Because he fails to obtain significance with 50 patients, Dr A continues to the end. When the final results are examined, it is seen that the results are identical: identical in every detail. Thus, it is not only the case that the numbers of successes and failures cross-classified by treatment group are the same for the 100 patients in each trial but it is also the case that if (say) patient 23 in A’s trial received the experimental treatment, then so did patient 23 in B’s trial and if patient 42 had a successful outcome in B’s trial, then so did patient 42 in A’s trial and so forth. Thus, if Dr B had looked at her data after 50 patients according to A’s strategy she would also have continued.
A and B now perform an analysis of their respective trials (which, but for the inspection strategy, are identical). As it turns out, the results for B’s trial are significant. However, A, faced with the same data, has to pay a penalty for having looked and (as it turns out) when this penalty is paid, his results are not significant. …..

If we consider that they ought to come to the same conclusion under all circumstances in which the trial results are identical, whatever the trial results might be, and that, since the frequentist approach lets them down in this respect, they ought therefore to be Bayesians (if this is the only alternative), then it follows that they must have the same posterior distribution and also the same utilities. However, since they have seen the same data, they must have had the same prior distribution to have reached the same posterior distribution. Although there is now no difficulty in explaining their posterior distributions, there is no means of explaining their differing behaviour. Two Bayesians with the same prior distribution and the same utilities ought to design the same experiment. The fact that A was prepared to stop the trial under certain circumstances, whereas B would have continued regardless cannot now be explained.

It should also be stressed that frequentists are not alone in thinking that formal stopping rules are important. Many Bayesians can be found who think this also. For example, Freedman (1996) describes failure to have a formal stopping rule as being one of the ten pitfalls to be avoided in conducting sequential trials, and Kadane (1996) have devoted a whole book to expounding a formal ethical approach to conducting clinical trials which determines whether the next patient can be randomized or not. For Bayesians, such rules, of course, must reflect priors and utilities as well as evidence……

It is important to understand, however, that this does not quite let the frequentists off the hook, the reason being that it is of course a natural consequence of the Bayesian position that two persons having seen the same data ought not necessarily to come to the same conclusion. What the frequentist must do to defend his position is to accept that there is a subjective element to it after all, but in that case a standard objection to the Bayesian approach – that it is subjective – is no longer valid. That would amount to the pot calling the kettle black.’

* If you don’t like the choice between a one stage and two-stage sequential design the example is easily adapted to accommodate an infinite design.

Stephen: I’m running out in a minute, –I know you wrote about that rather different example on this blog. As I see it, it assumes a meaning of “same results” that we’d reject and reject the priors too. The issue at hand is precisely what counts as same results, the LP says one thing and error probabilities say another. The analogous example to the one here is forming the corresponding confidence interval so as to never include 0, even when 0 is the true value.


Unauthorized use and/or duplication of this material without express and written permission from this site’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Deborah G. Mayo and Error Statistics Philosophy with appropriate and specific direction to the original content.