Sorry for the total lack of any semblance of statistical knowledge or understanding. That said:

If we have two variables, A and B, we can compute a Pearson's r for them. Knowing how many samples we have, we can use a lookup table to find a p value. So, assuming good data (taken using a reliable measurement instrument from a representative sample), what does this p value tell us? Specifically, is it the probability of none of the following being true:

Changes in A cause changes in B,

OR

Changes in B cause changes in A,

OR

There is some variable C, and changes in C cause changes in A and B?

Are there other possibilities? In short, is a statistically significant correlation "acceptably" likely to imply one of the above types of causation, assuming decent sampling but no imposition of treatment, or are there other complications?
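For concreteness, the computation described above (Pearson's r together with its p-value, which statistical software computes directly rather than via a lookup table) can be sketched as follows. This is just an illustration on simulated data, assuming scipy is available:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = 0.5 * a + rng.normal(size=100)  # B partly driven by A, purely for illustration

# pearsonr returns the sample correlation and the two-tailed p-value
r, p = pearsonr(a, b)
print(f"r = {r:.3f}, p = {p:.3g}")
```

Note that the p-value here is computed under the null hypothesis of no correlation; it says nothing by itself about which of the causal scenarios listed above (if any) produced the association.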

Correlation doesn't imply causation in any of the ways you describe. All it says is that increases or decreases in A are associated with increases or decreases in B.
– Michael Chernick, Sep 1 '17 at 19:49

Okay. Wikipedia lists seven possibilities: "A causes B (direct causation); B causes A (reverse causation); A and B are consequences of a common cause, but do not cause each other; A and B both cause C, which is (explicitly or implicitly) conditioned on; A causes B and B causes A (bidirectional or cyclic causation); A causes C which causes B (indirect causation); there is no connection between A and B and the correlation is a coincidence"...
– Jean Luc Picard, Sep 1 '17 at 19:53

Of those, the first, second, and third are mentioned in my question. The fifth is a combination of the first and the second. The fourth seems like something a good sampling methodology should avoid. The sixth looks different, but I question the conceptual utility of any definition of causation where A causes C causes B does not imply that A causes B; it seems like we can find some intervening factor for virtually any case of causality. I thought the seventh was what our p value gave a probability of. With which of these assumptions do you disagree?
– Jean Luc Picard, Sep 1 '17 at 19:59

Here is an example of an eighth possibility for your consideration: that correlation sometimes represents an arbitrary analytical decision. Give a paper map of a river to people and ask them to digitize locations along it. Each person creates a set of $(x,y)$ coordinates. In some of those sets the river might be oriented to run horizontally, in which case the correlation coefficient is close to zero; in some of them it might seem to run to the northeast, with a correlation coefficient close to $1$; in yet others to the northwest, with correlation close to $-1$.
– whuber♦, Sep 1 '17 at 21:38

(Continued) There are other possibilities as well. The point is that correlation per se is a description of a set of $(x,y)$ data: nothing more. Thus, the value of $r$ by itself does not logically imply anything whatsoever about causation or even the existence of hypothetical variables that have any meaning or importance for a study. A p-value provides some indication of whether a particular probability model (the null hypothesis) may be appropriate or not. Probability models, in themselves, have no causal elements.
– whuber♦, Sep 1 '17 at 21:40

2 Answers

I'm simplifying a little bit, but basically: the p-value you're referring to is the probability that, at a given sample size, two unrelated random sets of numbers would show a correlation at least as large in absolute value as the one you've observed.

Example: Say we roll a pair of dice 6 times. This generates six (x, y) points. If the order of the rolls doesn't matter, what are the chances that you'd end up with the following points:

1,1; 2,2; 3,3; 4,4; 5,5; 6,6?

Pretty low, right? This dataset has a correlation coefficient of 1 and an n of 6. If you look up the associated p-value, you'll find it's < 0.00001: if the points really were independent random rolls, a correlation this strong would occur less than 0.001% of the time.

Now for some actual dice rolls - here are numbers I generated randomly in Excel:

1,6; 2,3; 3,4; 3,2; 5,2; 6,4.

The correlation coefficient of this data set is r = -0.425 (Excel's RSQ function reports r² = 0.1806, which is easy to mistake for r itself). Again, the sample size is 6. The associated p-value for a two-tailed test is about 0.40.

If these were data from an experiment, we'd say that the association between x and y is weak enough to be adequately explained by random chance: if there were no true association, we'd observe a correlation at least this strong in roughly 40% of samples of this size.
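As a sanity check, Pearson's r and the two-tailed p-value for the six listed points can be computed directly. A sketch using scipy (note that Excel's RSQ function returns r², not r, which is a common source of confusion):

```python
from scipy.stats import pearsonr

x = [1, 2, 3, 3, 5, 6]
y = [6, 3, 4, 2, 2, 4]

# r is approximately -0.425, so r^2 is approximately 0.1806,
# and the two-tailed p-value is about 0.40
r, p = pearsonr(x, y)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}, p = {p:.4f}")
```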

A significant correlation does not necessarily imply causation. To establish a causal relationship, you generally need a randomized experiment. That is one reason why tobacco companies were so difficult to sue in the 1960s - no scientist was "randomly assigning" humans to be smokers or non-smokers and then waiting to see who got lung cancer. The plaintiffs' primary evidence came from observational studies, with no randomized treatments. However, after decades of strong correlation between smoking and lung cancer, along with plausible medical explanations for how tobacco smoke harms the body, courts were finally convinced that smokers' lung cancer was caused by smoking and not something else. Statisticians are the same way - in the absence of experimental data, causation is only suggested by strong, repeated, and prolonged correlations, usually accompanied by a plausible mechanism for how x might be acting on y.

This page might be helpful if you're still wrapping your head around correlation coefficients.

To learn more about one- vs. two-tailed tests, check this Wikipedia page.

"A significant correlation does not necessarily imply causation." I assume that means any causation, not just any specific causation. Is there a simple explanation of why this is? If A correlates strongly with B in a well-designed sample, what possible scenarios exist beyond some combination of A causing B, B causing A, and a third factor C causing both A and B?
– Jean Luc Picard, Sep 1 '17 at 20:47

The alternate scenario is random chance. With all of the things we can measure in this world, there is no limit to the random associations people have uncovered. This, of course, does not imply causation. See this humorous page for examples.
– Anson Call, Sep 1 '17 at 20:52

Then the p value (communicates/represents/is used as a proxy for) something other than the probability that what we're observing is random chance? If so, what?
– Jean Luc Picard, Sep 1 '17 at 21:09

The p-value, if computed correctly, is always the probability of finding data as different / unusual / extreme as the data being considered, assuming the hypothesis is true that the data are not different / not correlated / not associated. That's the p-value by definition... But improbable things do get found, especially when you look many times. The probability of finding something improbable is high if you look often enough.
– Sal Mangiafico, Sep 2 '17 at 2:17

I'll add a couple of points that may be valuable in this conversation. 1) A significant / interestingly-large correlation (S/IL correlation) may be local to the sample you're studying. And, 2) As you do more correlation tests, the chance of finding a S/IL correlation increases just by random chance.

Here's an experiment you can perform. Find about 10 co-workers. Design a survey of about 20 questions. Their height. Their weight. What month they were born in. How dark their hair is on a scale of 1 to 5. How much they like dark chocolate vs. milk chocolate on a scale of 1 to 5. How much they smoke on a scale of 1 to 5. And so on. And then perform correlation tests for each pair of questions.

Illustrating point 2 above, you are likely to find S/IL correlations just by chance for at least some of these pairs, because you are testing 19 + 18 + 17 + ... + 1 = 190 combinations of these questions.
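The multiple-testing point is easy to simulate: give 10 people 20 answers of pure noise and count how many of the 190 pairwise correlation tests come out "significant" at p < 0.05 by chance alone. A sketch, assuming numpy and scipy are available:

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
answers = rng.normal(size=(10, 20))  # 10 people, 20 unrelated "survey questions"

# Count pairs of questions whose correlation test gives p < 0.05
false_hits = sum(
    1
    for i, j in combinations(range(20), 2)
    if pearsonr(answers[:, i], answers[:, j])[1] < 0.05
)
print(f"{false_hits} of 190 pairs 'significant' at p < 0.05 by chance alone")
# At the 0.05 level we expect about 0.05 * 190 = 9.5 false positives
```

Even though every variable here is pure noise, a handful of pairs will typically clear the 0.05 bar, which is exactly the trap described above.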

And you will be tempted to tell yourself "just so" stories about these correlations. "Oh, of course people who like dark chocolate more are more likely to smoke more, because both are bitter." "Oh, of course blonder people drink more alcohol, because as everyone says, 'blonds have more fun.'"

But of course, for at least some of these correlations, those explanations will be bullshit. As Feynman put it, "The first principle is that you must not fool yourself — and you are the easiest person to fool." That applies to all of us, especially when we are analyzing our hard-earned data.

Illustrating point 1 above, if you went out into the larger world with one of these S/IL correlations in hand, you would quickly find that it doesn't hold up. It just happened that, among your 10 co-workers, the smokers liked dark chocolate more or the blonds drank more alcohol. There's no causation here; there's just a random correlation between two variables in a small group.