I was thinking about the phenomenon of p-hacking (or data dredging), and how it's much more likely to give apparently significant results than proper hypothesis testing would give.

In particular, I'm considering the case where a researcher keeps increasing the sample size until statistical significance is reached. (As in, "Oh we ran the test with 20 participants, and the data are suggestive but not significant, so let's test another person," repeated until significance is reached.)

My question is, if this process is allowed to continue indefinitely, what's the probability of eventually hitting a statistically significant result (at some predefined level of significance)?

For example, when I modeled it as a thousand runs of 100 coin flips, 54 had few enough total heads for p<0.05, while 194 reached "too few heads" significance at some point during the run (i.e. 140 random walks stumbled into and then back out of significance). When I did a thousand runs of a thousand coin flips, 47 had "too few" heads overall while 346 reached significance at some point, meaning an additional 150-ish of the random walks that never stumbled into significance in the first 100 steps managed to do so at least once in the subsequent 900.
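For anyone who wants to reproduce this, here's a rough sketch of that experiment in Python. My assumptions (not stated in the original runs): a one-sided "too few heads" test using the normal approximation to the binomial with z_crit = 1.645 for p < 0.05, only applied once at least 20 flips are in, so the exact counts won't match the numbers above, but the pattern should:

```python
import math
import random

def simulate(n_runs, n_flips, z_crit=1.645, min_k=20, seed=0):
    """Simulate n_runs sequences of n_flips fair coin flips.

    After each flip (once at least min_k flips are in), test "too few
    heads" with a one-sided normal approximation to the binomial:
    significant when z = (heads - k/2) / (0.5*sqrt(k)) < -z_crit,
    i.e. roughly p < 0.05 for z_crit = 1.645.
    Returns (fraction ever significant, fraction significant at the end).
    """
    rng = random.Random(seed)
    ever = final = 0
    for _ in range(n_runs):
        heads = 0
        hit = False
        z = 0.0
        for k in range(1, n_flips + 1):
            heads += rng.random() < 0.5  # True counts as 1
            if k >= min_k:
                z = (heads - k / 2) / (0.5 * math.sqrt(k))
                if z < -z_crit:
                    hit = True
        ever += hit
        final += z < -z_crit  # z from the final step k = n_flips
    return ever / n_runs, final / n_runs

for n in (100, 1000):
    ever, final = simulate(1000, n)
    print(f"N={n}: significant at the end {final:.3f}, ever significant {ever:.3f}")
```

The "ever significant" fraction grows with run length while the "significant at the end" fraction stays near 0.05, which is the whole problem with optional stopping.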

Is "eventually stumbling into significance" the kind of tail event that will almost surely happen as the runs are allowed to get arbitrarily long, or is there some limit strictly less than 1?

Also, is there a known expression for the probability of stumbling into significance at some point on a walk of length N (i.e. an expression which would give something near 19.4% for N=100 and something near 34.6% for N=1000)?


If the underlying process is exactly zero-mean or fair with respect to what you're testing (e.g. fair coin flips), then stumbling into significance will almost surely happen if you are allowed to keep going arbitrarily long, but a multiplicative decrease in the number of runs that have not yet hit significance requires you to multiplicatively increase how long you go. I.e. the inverse of the proportion of runs that have never hit significance should grow asymptotically polynomialish in the run length, with the exponent of the polynomial depending on your threshold on p.

Basic intuition:

Do 100 flips. Did you ever hit significance?

No? Okay, do 10000 more flips, which is so much more data that it should completely wash out and make negligible the result of the 100 flips and give you another almost independent "chance" to find p < 0.05. Did you ever hit significance?

No? Okay, do 1000000 more flips, which is so much more data that it should completely wash out and make negligible the result of the 10000 flips and give you another almost independent "chance" to find p < 0.05. Did you ever hit significance?

... etc

I don't know an exact expression though.
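The "fresh chance per order of magnitude" intuition can be checked numerically. In this rough sketch (my assumptions: a one-sided "too few heads" normal-approximation test at roughly p < 0.05, only applied once k ≥ 20), the fraction of runs that have never hit significance shrinks by a roughly constant factor each time the run length is multiplied by 10, consistent with polynomial decay in N:

```python
import math
import random

def first_hit(rng, n_flips, z_crit=1.645, min_k=20):
    """First step k at which a fair-coin walk looks significant under a
    one-sided "too few heads" normal-approximation test (roughly p < 0.05),
    or None if that never happens within n_flips."""
    heads = 0
    for k in range(1, n_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1
        if k >= min_k and (heads - k / 2) / (0.5 * math.sqrt(k)) < -z_crit:
            return k
    return None

rng = random.Random(1)
checkpoints = (100, 1000, 10000)
runs = 1000
survivors = {n: 0 for n in checkpoints}  # runs with no significance by step n
for _ in range(runs):
    t = first_hit(rng, max(checkpoints))
    for n in checkpoints:
        if t is None or t > n:
            survivors[n] += 1

prev = None
for n in checkpoints:
    frac = survivors[n] / runs
    ratio = "" if prev is None else f"  (x{frac / prev:.2f} vs previous decade)"
    print(f"P(never significant by N={n:>5}) = {frac:.3f}{ratio}")
    prev = frac
```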

On the other hand, if the underlying process is slightly biased rather than exactly fair, such as coins biased in favor of heads, then you don't get this sort of long asymptotic tail. Roughly, stumbling into significance on the "correct" side will happen with rapidly increasing chance as you start reaching the amount of data needed to distinguish the bias from noise, whereas doing so on the "wrong" side will only happen with some total probability strictly less than 1.
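A quick numerical sketch of this asymmetry (my assumptions: a coin with a 55% heads bias, one-sided normal-approximation tests on each side at roughly p < 0.05, only applied once k ≥ 20). The correct side is hit in almost every run, while wrong-side hits are essentially confined to the early, noisy part of the walk:

```python
import math
import random

runs, n_flips, p_heads, z_crit = 1000, 2000, 0.55, 1.645
rng = random.Random(2)
right = wrong = wrong_early = 0
for _ in range(runs):
    heads = 0
    hit_right = hit_wrong = hit_wrong_early = False
    for k in range(1, n_flips + 1):
        heads += rng.random() < p_heads  # True counts as 1
        if k < 20:
            continue
        z = (heads - k / 2) / (0.5 * math.sqrt(k))
        if z > z_crit:          # "too many heads" (the correct side)
            hit_right = True
        elif z < -z_crit:       # "too few heads" (the wrong side)
            hit_wrong = True
            if k <= 500:
                hit_wrong_early = True
    right += hit_right
    wrong += hit_wrong
    wrong_early += hit_wrong_early

print(f"ever significant on the correct side: {right / runs:.3f}")
print(f"ever significant on the wrong side:   {wrong / runs:.3f}")
print(f"   ...of which already by step 500:   {wrong_early / runs:.3f}")
```

Once the drift dominates the noise, the walk essentially never returns to the wrong-side boundary, so the wrong-side probability plateaus strictly below 1.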

As a result, I think the "increase your sample size until significance" issue alone, unlike publication bias and most other experimenter degrees of freedom, is dealable-with if you pay attention to confidence intervals on effect magnitudes rather than only the sign of the result. If someone really has no other degrees of freedom, is obligated to publish everything, and the underlying process is truly zero-mean, then repeatedly running "continue until significance" studies eventually forces them to publish results where they had to collect arbitrarily much data before hitting significance, and therefore to publish confidence intervals on effect magnitudes that are arbitrarily small and close to zero. And if the process wasn't truly zero-mean, then you can only reach the "wrong" conclusion with bounded probability, and will eventually have to publish studies that, taken together, make you strongly confident in the effect in the correct direction and with the correct magnitude.
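To make the confidence-interval point concrete, here's a small illustration (assuming the usual Wald interval and the standard normal critical values 1.645 one-sided and 1.96 two-sided). A "too few heads" study that first reaches significance at step k is reporting an observed frequency pinned to the significance boundary, with an interval that collapses toward zero effect as k grows:

```python
import math

# If a "too few heads" result first becomes significant at step k, the
# observed heads frequency sits right at the significance boundary,
#     p_hat ~= 1/2 - 1.645 * 0.5 / sqrt(k),
# and a 95% Wald confidence interval for the heads probability has
# half-width ~= 1.96 * 0.5 / sqrt(k).  Both collapse toward 0.5 and 0:
for k in (100, 10_000, 1_000_000):
    p_hat = 0.5 - 1.645 * 0.5 / math.sqrt(k)
    half_width = 1.96 * 0.5 / math.sqrt(k)
    print(f"k = {k:>9,}: p_hat ~ {p_hat:.4f}, 95% CI half-width ~ {half_width:.5f}")
```

So a reader looking at effect magnitudes, not just significance stars, sees the late-stopping studies reporting effects indistinguishable from zero.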

Yeah, I know there are ways to deal with sequential testing, just as there are ways to adjust for multiple comparisons. I was just curious about the probabilities when things are being done improperly, whether through naivete or dishonesty. (For example, as you increase the number of separate ("sufficiently" independent, whatever that may turn out to mean) questions you ask about totally random data, the likelihood of hitting p<0.05 for at least one of them goes to 1.)
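The parenthetical is easy to quantify in the idealized case of exactly independent tests of true nulls:

```python
# If you ask m independent questions of pure-noise data, each tested at
# alpha = 0.05, the chance that at least one comes back "significant" is
# 1 - (1 - alpha)^m, which goes to 1 as m grows:
alpha = 0.05
for m in (1, 5, 20, 100):
    print(f"m = {m:>3}: P(at least one p < 0.05) = {1 - (1 - alpha) ** m:.3f}")
```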


So from what I understand, stumbling into significance will almost surely happen, but the number of flips needed is infinite in expectation.

The only part I'm still shaky on is the interpretation of the statement (where S_k is the sum of k iid random variables with expectation 0 and variance 1):

P(lim sup S_k/sqrt(k) = ∞) = 1

So by expanding out the definitions for lim/sup, my understanding is that this is equivalent to saying:

For every real number w and every integer n, there exists an integer k >= n such that S_k/sqrt(k) > w, with probability 1.

To me, this seems to be equivalent to: For every real number w, there is an integer k such that S_k/sqrt(k) > w, with probability 1.

However, this would be written as P(sup S_k/sqrt(k) = ∞) = 1, which seems like it has a different interpretation from the original equation and isn't the same. So I'm not sure if I've messed up something here.

In other words, it is true that for any sequence E(n), lim sup E = ∞ implies sup E = ∞, right? (Where the limit is n → ∞ and the sup is over the natural numbers, or over the natural numbers > n.)
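For what it's worth, for a sequence whose terms are all finite (as S_k/sqrt(k) is, almost surely), the two conditions actually coincide, which is why both probability statements hold. A sketch:

```latex
\[
  \limsup_{k\to\infty} a_k \;=\; \lim_{n\to\infty} \sup_{k\ge n} a_k .
\]
% If \limsup_k a_k = \infty then certainly \sup_{k\ge 1} a_k = \infty.
% Conversely, if \sup_{k\ge 1} a_k = \infty and every a_k is finite,
% then for each fixed n the finitely many terms a_1,\dots,a_{n-1} are
% bounded, so \sup_{k\ge n} a_k = \infty too, and the limit over n is
% also \infty. Applied pathwise to a_k = S_k/\sqrt{k}:
\[
  P\Bigl(\sup_{k} S_k/\sqrt{k} = \infty\Bigr)
  \;=\;
  P\Bigl(\limsup_{k\to\infty} S_k/\sqrt{k} = \infty\Bigr) \;=\; 1 .
\]
```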

lim sup S_k/√k = ∞ means that for any M and N, the supremum of S_k/√k over k > N is greater than M. This means that no matter how far you go into the sequence (i.e. no matter how big N is), you can always find a later k at which the value is arbitrarily large (i.e. bigger than any M). In other words, there is no point in the sequence after which it remains bounded.

By contrast, consider e^(-k) cos k. Here, the farther you go into the sequence, the smaller its bounds. For instance, after k=1, we know the sequence will never again get larger than 1/e. The limit of the sequence as k→∞ is 0, and therefore so is the limit superior (and the limit inferior).

Or consider e^(-k) + cos k. Here, there is no limit at infinity, because as k gets large, the function approaches cos k, which oscillates between -1 and 1. However, the limit superior equals 1, because no matter how deep we get into the sequence, we will never reach a point where we can't get arbitrarily close to 1 just by waiting until k gets really close to a multiple of 2π. The limit inferior is -1 for the same reason.