Note: This concatenates and revises four previous posts. It’s over 4000 words; you have better things to do than read it. Even if you’re interested, the eventual paper will be more useful and more correct.

I’m going to start by posing three questions. One has to do with baseball and personality; the other two with statistics and causation. Most people, though not me, find baseball and personality more interesting, so let’s pose that question first.

In Major League Baseball, do younger brothers typically steal at a higher or lower rate than their older brothers, or are they the same on average?

Suppose you wanted to know the relationship between risk-taking and digit ratio (index finger length divided by ring finger length). Suppose you had the pretty good idea of giving 152 Caucasian experimental subjects a choice of six lotteries, ordered by riskiness.

One approach would be to give everyone the same choice of lotteries. You could then draw two scatter plots of lottery choice against digit ratio: one for women, one for men. If you really wanted a P-value, you could test, for each gender, whether the relationship between lottery choice and digit ratio was significantly different from what you’d see under random assignment.
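That test can be done with a plain permutation test: shuffle the lottery choices, recompute the association, and see how often a shuffled association is at least as strong as the observed one. Here is a minimal sketch using the sample correlation as the test statistic; the data below are made up for illustration, not from the study.

```python
import random

def permutation_test(choices, ratios, n_perm=10_000, seed=0):
    """Two-sided permutation test of the association between lottery
    choice (1..6, ordered by riskiness) and digit ratio, using the
    sample correlation as the test statistic."""
    rng = random.Random(seed)
    n = len(choices)

    def corr(xs, ys):
        mx = sum(xs) / n
        my = sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    observed = corr(choices, ratios)
    shuffled = list(choices)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real association
        if abs(corr(shuffled, ratios)) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one-smoothed P-value

# Made-up illustrative data: 30 subjects, one gender's subsample.
rng = random.Random(1)
ratios = [rng.uniform(0.88, 1.02) for _ in range(30)]
choices = [rng.randint(1, 6) for _ in range(30)]
print(permutation_test(choices, ratios))
```

Because the null hypothesis here literally is “choices look like random assignment”, the permutation test matches the question being asked, with no modelling assumptions beyond the choice of test statistic.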

Another approach would be to use three different sets of lotteries (called 50-50, 75-25, and 25-75). Furthermore, devise three different “frames” (wordings) for the lotteries; for each subject, randomly assign one wording. Throw the result into a regression, with indicators for the set of lotteries and for the frame. You get something like this:
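As a rough sketch of what that regression looks like, here is a fit with simulated data and hypothetical variable names (not the study’s actual data or code): each lottery set and each frame enters as a dummy indicator, i.e. a constant additive shift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 152

# Simulated stand-ins for the study's variables.
choice = rng.integers(1, 7, size=n).astype(float)  # lottery chosen, 1..6
ratio = rng.normal(0.95, 0.03, size=n)             # digit ratio
female = rng.integers(0, 2, size=n)
lottery_set = rng.integers(0, 3, size=n)           # 50-50, 75-25, 25-75
frame = rng.integers(0, 3, size=n)                 # three wordings

def dummies(codes, levels):
    """Indicator columns for levels 1..levels-1 (level 0 is baseline)."""
    return np.column_stack([(codes == k).astype(float)
                            for k in range(1, levels)])

X = np.column_stack([
    np.ones(n),               # intercept
    ratio,
    female,
    dummies(lottery_set, 3),  # constant shift per lottery set
    dummies(frame, 3),        # constant shift per frame
])
beta, *_ = np.linalg.lstsq(X, choice, rcond=None)
print(dict(zip(["intercept", "digit_ratio", "female",
                "set_75_25", "set_25_75", "frame_2", "frame_3"], beta)))
```

Note what the indicator coding buys you: one coefficient per lottery set and per frame, which assumes each shifts every subject’s choice by the same constant amount.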

The problem with this second approach is that the regression model is wrong. This is because all models are wrong. In particular, these graphs:

make it very hard to believe that the difference between the lotteries is well-represented by a constant; the graphs aren’t even the same shape. (No, an ordered probit doesn’t solve this problem.) There isn’t any good reason to think frame and gender effects are constant, either.

The difference between the first approach and the second is the difference between “hey, this is cool” and “hey, this might be cool but I don’t trust any of the numbers”. (Also, the first approach is easier.) As it stands, there may well be something real going on, but it’s hard to say more than that.

From the study: “Questions… were sometimes ambiguously worded, allowing us not only to diagnose whether students had a correct or incorrect understanding of a carbon-related process but also to uncover their ways of reasoning about carbon-related processes.”

Sample question: “Once carbon enters a plant, it can be converted into energy for plant growth. True or false?”

Chi and Snyder recruited sixty right-handed Sydney students to perform a problem-solving task. Some of them had “transcranial direct current stimulation” during the task (either right-brain positive or left-brain positive), while others had “sham” stimulation. In the right-positive group, 12 of 20 solved the problem within the time limit, compared to 5 of 20 in the left-positive group and 4 of 20 in the sham group. A Fisher’s exact test comparing the right-positive and sham groups gives P = 0.022. Cool result.
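That P-value is easy to check by hand. Fisher’s exact test sums the hypergeometric probabilities of every 2×2 table (with the observed margins) that is no more likely than the observed 12/20-vs-4/20 split; a minimal implementation using only the standard library:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def p(k):  # P(top-left cell = k) under fixed margins
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = p(a)
    lo = max(0, row1 + col1 - n)
    hi = min(row1, col1)
    # Small tolerance guards against float round-off in the comparison.
    return sum(p(k) for k in range(lo, hi + 1)
               if p(k) <= p_obs * (1 + 1e-9))

# Right-positive (12 of 20 solved) vs sham (4 of 20 solved):
print(round(fisher_exact_two_sided(12, 8, 4, 16), 3))  # → 0.022
```

(The same number comes out of `scipy.stats.fisher_exact`; the hand-rolled version just makes the computation explicit.)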

Look at this table though:

The randomisation is very gender-unbalanced, with a Fisher’s exact test P-value of only 0.013! This doesn’t cast doubt on the authors’ results: there was little difference between women’s and men’s performance overall, with 11 of 30 women solving the problem compared to 10 of 30 men (though note the possibility of Simpson’s paradox; I would also be a bit worried that the study seems to have been single-blind rather than double-blind). But the gender counts are interesting in themselves. What seems to be happening is that women have more psychic powers than men. They want to do well on the test, so they subconsciously arrange to put themselves into the group that has the best shot at solving the problem. There’s no other rational explanation, right?