Tuesday, December 8, 2015

Many rules of statistics are wrong

There are two kinds of people who violate the rules of statistical inference: people who don't know them and people who don't agree with them. I'm the second kind.

The rules I hold in particular contempt are:

The interpretation of p-values: Suppose you are testing a hypothesis, H, so you've defined a null hypothesis, H0, and computed a p-value, which is the probability, under H0, of an effect at least as large as the one you observed.

According to the conventional wisdom of statistics, if the p-value is small, you are allowed to reject the null hypothesis and declare that the observed effect is "statistically significant". But you are not allowed to say anything about H, not even that it is more likely in light of the data.

I disagree. If we were really not allowed to say anything about H, significance testing would be completely useless, but in fact it is only mostly useless. As I explained in this previous article, a small p-value indicates that the observed data are unlikely under the null hypothesis. Assuming that they are more likely under H (which is almost always the case), you can conclude that the data are evidence in favor of H and against H0. Or, equivalently, that the probability of H, after seeing the data, is higher than it was before. And it is reasonable to conclude that the apparent effect is probably not due to random sampling, but might have explanations other than H.
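To make the step from "more likely under H" to "higher probability of H" concrete, here is a minimal sketch of the Bayesian update. It assumes H and H0 are the only two candidates, and the prior and both likelihoods are hypothetical numbers chosen for illustration. Note that it works with plain likelihoods, not with the p-value itself, which is a tail probability.

```python
def posterior_prob_h(prior_h, like_h, like_h0):
    """Bayes's rule, assuming H and H0 are the only two candidates."""
    numer = prior_h * like_h
    denom = numer + (1.0 - prior_h) * like_h0
    return numer / denom


prior = 0.5      # hypothetical prior probability of H
like_h0 = 0.05   # hypothetical: the data are unlikely under H0
like_h = 0.5     # hypothetical: the data are more likely under H

posterior = posterior_prob_h(prior, like_h, like_h0)
print(f"prior={prior:.2f}  posterior={posterior:.2f}")

# If the data were LESS likely under H than under H0, the update
# would go the other way.
reversed_case = posterior_prob_h(prior, like_h=0.01, like_h0=0.05)
print(f"reversed case posterior={reversed_case:.2f}")
```

Whatever the prior, the posterior moves toward H exactly when the data are more likely under H than under H0. How far it moves depends on the ratio of the two likelihoods, which a p-value alone does not give you.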

Correlation does not imply causation: If this slogan is meant as a reminder that correlation does not always imply causation, that's fine. But based on responses to some of my previous work, many people take it to mean that correlation provides no evidence in favor of causation, ever.

I disagree. As I explained in this previous article, correlation between A and B is evidence of some causal relationship between A and B, because you are more likely to observe correlation if there is a causal relationship than if there isn't. The problem with using correlation to infer causation is that it does not distinguish among three possible relationships: A might cause B, B might cause A, or any number of other factors, C, might cause both A and B.

So if you want to show that A causes B, you have to supplement correlation with other arguments that distinguish among possible relationships. Nevertheless, correlation is evidence of causation.
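A small simulation illustrates both halves of this argument. It is only a sketch under made-up assumptions: linear effects of size 1, standard normal noise, and a fixed random seed. All three causal structures produce a clear correlation, while the no-relationship case produces almost none, so correlation is evidence of some causal link but cannot say which one.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(kind, n, rng):
    """Draw n (a, b) pairs from one of four hypothetical structures."""
    a_vals, b_vals = [], []
    for _ in range(n):
        if kind == "A causes B":
            a = rng.gauss(0, 1)
            b = a + rng.gauss(0, 1)
        elif kind == "B causes A":
            b = rng.gauss(0, 1)
            a = b + rng.gauss(0, 1)
        elif kind == "C causes both":
            c = rng.gauss(0, 1)
            a = c + rng.gauss(0, 1)
            b = c + rng.gauss(0, 1)
        else:  # no causal relationship at all
            a = rng.gauss(0, 1)
            b = rng.gauss(0, 1)
        a_vals.append(a)
        b_vals.append(b)
    return a_vals, b_vals


rng = random.Random(17)
kinds = ["A causes B", "B causes A", "C causes both", "no relationship"]
results = {}
for kind in kinds:
    a_vals, b_vals = simulate(kind, 5000, rng)
    results[kind] = pearson(a_vals, b_vals)
    print(f"{kind:16s} r = {results[kind]:+.2f}")
```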

I disagree. I think regression provides evidence in favor of causation for the same reason correlation does, but in addition, it can distinguish among different explanations for correlation. Specifically, if you think that a third factor, C, might cause both A and B, you can try adding a variable that measures C as an independent variable. If the apparent relationship between A and B is substantially weaker after the addition of C, or if it changes sign, that's evidence that C is a confounding variable.

Conversely, if you add control variables that measure all the plausible confounders you can think of, and the apparent relationship between A and B survives each challenge substantially unscathed, that outcome should increase your confidence that either A causes B or B causes A, and decrease your confidence that confounding factors explain the relationship.
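Here is a sketch of that check on simulated data. Everything in it is an assumption made for illustration: C drives both A and B, A has no direct effect on B, the effects are linear, and the helper names are hypothetical. The regression with two predictors is solved by hand from the 2x2 normal equations so the example needs only the standard library.

```python
import random


def center(xs):
    """Subtract the mean from each value."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]


def slope_simple(y, x):
    """Slope from regressing y on x alone."""
    yc, xc = center(y), center(x)
    return sum(u * v for u, v in zip(xc, yc)) / sum(u * u for u in xc)


def slopes_two_predictors(y, x1, x2):
    """Slopes from regressing y on x1 and x2 (2x2 normal equations)."""
    yc, u, v = center(y), center(x1), center(x2)
    s11 = sum(p * p for p in u)
    s22 = sum(q * q for q in v)
    s12 = sum(p * q for p, q in zip(u, v))
    s1y = sum(p * r for p, r in zip(u, yc))
    s2y = sum(q * r for q, r in zip(v, yc))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


rng = random.Random(3)
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]      # A depends on C only
b = [2 * ci + rng.gauss(0, 1) for ci in c]  # B depends on C only, not on A

naive = slope_simple(b, a)
controlled, c_slope = slopes_two_predictors(b, a, c)
print(f"slope of B on A, no controls:       {naive:.2f}")
print(f"slope of B on A, controlling for C: {controlled:.2f}")
```

In this setup the apparent effect of A on B mostly disappears once C is in the model, which is the pattern that points to confounding. If the slope had survived, that would count as evidence against C as the explanation, though only for confounders you actually measured.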

By providing evidence against confounding factors, regression provides evidence in favor of causation, but it is not clear whether it can distinguish between "A causes B" and "B causes A". The received wisdom of statistics says no, of course, but at this point I hope you understand why I am not inclined to accept it.

In this previous article, I explore the possibility that running regressions in both directions might help. At this point, I think there is an argument to be made, but I am not sure. It might turn out to be hogwash. But along the way, I had a chance to explore another bit of conventional wisdom...

Methods for causal inference, like matching estimators, have a special ability to infer causality: In this previous article, I explored a propensity score matching estimator, which is one of the methods some people think have special ability to provide evidence for causation. In response to my previous work, several people suggested that I try these methods instead of regression.

Causal inference, and the counterfactual framework it is based on, is interesting stuff, and I look forward to learning more about it. And matching estimators may well squeeze stronger evidence from the same data, compared to regression. But so far I am not convinced that they have any special power to provide evidence for causation.

Matching estimators and regression are based on many of the same assumptions and vulnerable to some of the same objections. I believe (tentatively for now) that if either of them can provide evidence for causation, both can.

Quoting rules is not an argument

As these examples show, many of the rules of statistics are oversimplified, misleading, or wrong. That's why, in many of my explorations, I do things experts say you are not supposed to do. Sometimes I'm right and the rule is wrong, and I write about it here. Sometimes I'm wrong and the rule is right; in that case I learn something and I try to explain it here. In the worst case, I waste time rediscovering something everyone already "knew".

If you think I am doing something wrong, I'd be interested to hear why. Since my goal is to test whether the rules are valid, repeating them is not likely to persuade me. But if you explain why you think the rules are right, I am happy to listen.

14 comments:

I bet the rules are quoted and interpreted in such an extreme way because people who have learned a little statistics are feeling smug about it, because there really is a naive tendency to make mistakes that the rules are designed to point out.

There is a funny site that you have probably seen where Tyler Vigen shows strong correlations that are coincidences. You could argue that the correlation is evidence of causation, but that would require a definition of evidence a bit weaker than I think most people would assume.

My favorite is the correlation between the production of opium in Afghanistan and a picture of Mount Everest.

https://twitter.com/tylervigen/status/603204482856591360

In this case, none of the relationships you mentioned are likely to obtain. "A might cause B, B might cause A, or any number of other factors, C, might cause both A and B." Instead, in this case C caused A and D caused B, but they still look similar on a plot.

Makes me wonder how the clearly coincidental correlations mentioned were found. If I gather all of the data I can find and am able to pull some spurious relationships out, what hypothesis am I testing? Did I do many, many tests to get a hit? How relevant is the demonstrable existence of pure coincidence to the interpretation of a well designed experiment?
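The "many, many tests" question can be explored with a quick simulation. This is a sketch under assumed numbers: 100 unrelated series of 10 independent random points each, with every pair compared. It says nothing about how any particular published coincidence was actually found.

```python
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
num_series, length = 100, 10  # hypothetical sizes

# Every series is pure noise, so any correlation is a coincidence.
series = [[rng.gauss(0, 1) for _ in range(length)]
          for _ in range(num_series)]

pairs = list(combinations(range(num_series), 2))
best_r = max(abs(pearson(series[i], series[j])) for i, j in pairs)

print(f"pairs tested: {len(pairs)}")
print(f"strongest |r| found among unrelated series: {best_r:.2f}")
```

With thousands of comparisons on short series, a very strong correlation turns up by chance alone. That is a statement about the search, and it does not apply in the same way to a single comparison chosen in advance.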

Sometimes I feel like people know just enough statistics to be a little afraid of it, so they take the hard line textbook interpretation.

Thanks for the post, and the response. I was not aware of the opium/Everest correlation. Definitely going to use that in my science literacy class. This stuff boggles my mind.

"Quoting rules is not an argument." Excellent point (especially re. the correlation/causation maxim you mention earlier). This speaks to a larger trend in citing uncertainty as a means of rejecting any and all evidence out there. Great post.

"Assuming that they are more likely under H (which is almost always the case), you can conclude that the data are evidence in favor of H and against H0."

So you don't accept a rule because you choose to assume something else? Not much of an argument. And also unsubstantiated, as there are plenty of examples to the contrary. I'd say generally, merely a significant p-value in social science research where data are noisy, analysis is likely p-hacked or a garden of forking paths, etc. provides fairly weak evidence against H0; I strongly disagree that you can comment directly on H from the p-value without lifting a finger on modeling H. Please see Wagenmakers 2007 p. 792-793, where p = 0.05 can even indicate that H0 is likely to be true, or for another example Nickerson 2000, p. 249-251. There are many criticisms out there. I'd also highly recommend Schmidt and Hunter paper below as general overview.

The main thing is that you are wanting a p-value to be some kind of likelihood ratio or Bayes Factor, which it is not. A p-value is completely one-sided and only concerns the probability of the data under H0, not even the probability of H0. Overall, you are disagreeing with the interpretation of p-values by mis-interpreting them even more than the mess that brings about the current reproducibility crisis, Ioannidis' "most published research is false", etc.

Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. What if there were no significance tests, 37-64. Accessed at: http://www.phil.vt.edu/dmayo/personal_website/Schmidt_Hunter_Eight_Common_But_False_Objections.pdf

It's true that you have to make some additional assumptions in order to say anything about H, and it sounds like you object to that.

But the assumptions are very weak, and nearly always true in practice. So if you are trying to do something practical, like guide decision-making under uncertainty, why would you not accept reasonable assumptions? Especially when the alternative is to provide no guidance whatsoever?

I don't object to making additional assumptions so that you can say something about H. If you want to compare hypotheses, then yes go ahead and compare them (with Bayes Factor or likelihood ratios, for god's sakes, something!).

You can 'say' or decide what you want, but what is the number that backs up such a statement? I would argue not the p-value. If there are tools that do exactly what you want (to compare H0 to H, for example), why not use them?

Using the p-value does not stand up to even basic additional scrutiny. For example, you get a p < .05 and you say H is 'more likely'. Then someone (a reviewer, a skeptic, an interested friend) asks, well how much more likely? 1.2 times, 20 times? How do you respond to that? Seems vital to me.

Alex, I think we are agreeing. If all you know is the p-value, the conclusions you can reach about H are pretty weak, and qualitative, even with my additional assumptions. That's why I say that traditional NHST is mostly useless.

But if the p-value is small, you can usually conclude that the observed effect is probably not due to random sampling, and you can turn your attention to other possible sources of error.
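The "how much more likely?" question raised in this exchange has a direct answer once a specific alternative is written down. The sketch below uses a made-up coin example: 140 heads in 250 flips, H0 that the coin is fair, and a hypothetical alternative H that the probability of heads is 0.56. Picking H to match the observed proportion is an assumption that favors H, so the ratio should be read as an upper-end illustration.

```python
import math


def binom_log_likelihood(k, n, p):
    """Log-likelihood of k successes in n trials, up to a constant.

    The binomial coefficient is omitted because it is the same under
    both hypotheses and cancels in the ratio.
    """
    return k * math.log(p) + (n - k) * math.log(1 - p)


heads, flips = 140, 250   # hypothetical data
p_h0 = 0.5                # H0: fair coin
p_h = 0.56                # H: hypothetical specific alternative

log_ratio = (binom_log_likelihood(heads, flips, p_h)
             - binom_log_likelihood(heads, flips, p_h0))
likelihood_ratio = math.exp(log_ratio)

print(f"data are about {likelihood_ratio:.1f} times "
      f"more likely under H than under H0")
```

In this example the ratio comes out at roughly 6, which is a number you can hand to a skeptic in a way that "p < .05" is not.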

A couple of comments: 1) Correlation / causation. Even though this may sound pedantic, I think semantics make a difference here. The word "imply" is often used in the sense of "logically implying". When used in this sense, it is in fact true that the belief that correlation implies causation is the logical fallacy of affirming the consequent. That being said, if you talk about "evidence" (and not "implication") of correlation in favor of causation you are correct. I don't know how most people typically interpret "imply" - in the logical sense, or in the sense of shifting evidence. Maybe you like this quote from XKCD: Correlation does not imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing "look over there."

2) Regression, matching, (and weighting) all have the same underlying causal assumptions, namely "ignorability" (sometimes also called "selection on observables" or "unconfoundedness"). In fact, Angrist and Pischke in their book "Mostly harmless econometrics" formally prove that all three estimators are in the same class. You can re-express regression as a particular weighting scheme, and you can do the same with matching. I am not sure how widespread the belief is that you cite - if you were to go to a conference like Atlantic Causal Inference Conference, I would think that all participants would know that there is nothing magical about matching, and that all these methods share the same underlying assumption. There are practical advantages and disadvantages to all these methods though.

3) Regarding reversing regressions - this is not a theoretically sound way to determine causal direction (or provide evidence for one or the other regression direction). Judea Pearl proved I think in the 80s that the models that you are suggesting are all in the same Markov equivalence class, and that parameters yielded by those models cannot be used to distinguish which one might be the true causal model. Apologies for the self-promotion but my paper on reversing arrows in mediation models also shows this point. That being said, if you are willing to make certain untestable assumptions about distributions of disturbance terms, you can use methods that you suggest (reversing regressions) to determine causal direction. The work of Bernhard Schoelkopf is important in this domain. Unfortunately, the assumptions needed to make these methods work will by definition always be untestable, and thus be subject to debate.

If I'm interpreting him correctly, he would disagree with you that you can say much about H given an adequately small p-value. However, he would also say that this implies that the p-value is a very limited measure since, among other reasons, p-values tend to shrink exponentially fast as the sample size grows.
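The point about sample size is easy to check. This sketch assumes a one-sample z-test with known standard deviation and a small, fixed observed effect of 0.1 standard deviations; both are illustrative assumptions, and the function name is hypothetical.

```python
import math


def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a z-test for a mean, with known sd."""
    z = effect_in_sd * math.sqrt(n)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


effect = 0.1  # hypothetical observed effect, in standard deviations
p_values = {n: two_sided_p(effect, n) for n in (100, 1000, 10000)}

for n, p in p_values.items():
    print(f"n = {n:6d}   p = {p:.3g}")
```

The effect is the same size in every row; only the sample size changes. So a tiny p-value by itself says little about how large or how important the effect is.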

Alright, the 'old' mantra wants to warn of the 'trivial' case: "Correlation between A and B does not imply A causes B directly, or vice versa."

Your new version is, I believe: 'Correlation between A and B implies *something* is causing it'.

Both true, but I believe it's very important to include "directly" and "something" in the respective versions. Just saying "correlation is evidence of causation." is prone to be misinterpreted, just as the simplified "correlation does not imply causation".

I knew "Think Python" from long ago, and recently I discovered the rest of your books, which are great, thank you.

I just wanted to comment on the "correlation does not imply causation" thing. As I see it, this statement usually refers to heavily autocorrelated series, that is, series that actually contain only a few independent points. It is very easy to find spurious correlations in this kind of series, as in the global warming and number of pirates example. When you have two samples of n=1000 points each, with no autocorrelation, and find a 0.9 correlation, then there is almost certainly a causal link behind it.
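The contrast between autocorrelated and independent series can be seen in a short simulation. It is a sketch under assumed settings: random walks stand in for heavily autocorrelated series, white noise stands in for independent points, and every pair of series is unrelated by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def white_noise(n, rng):
    """n independent standard normal points."""
    return [rng.gauss(0, 1) for _ in range(n)]


def random_walk(n, rng):
    """Cumulative sum of white noise: a heavily autocorrelated series."""
    total, walk = 0.0, []
    for step in white_noise(n, rng):
        total += step
        walk.append(total)
    return walk


def mean_abs_corr(make_series, n, num_pairs, rng):
    """Average |r| over pairs of independently generated series."""
    rs = [abs(pearson(make_series(n, rng), make_series(n, rng)))
          for _ in range(num_pairs)]
    return sum(rs) / len(rs)


rng = random.Random(7)
noise_r = mean_abs_corr(white_noise, 100, 200, rng)
walk_r = mean_abs_corr(random_walk, 100, 200, rng)

print(f"typical |r|, unrelated white-noise series: {noise_r:.2f}")
print(f"typical |r|, unrelated random walks:       {walk_r:.2f}")
```

Unrelated random walks routinely show sizable correlations, while unrelated white-noise series of the same length almost never do.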