Note: Concatenates and revises four previous posts. It’s over 4000 words, you have better things to do than read this. Even if you’re interested, the eventual paper will be more useful and more correct.

I’m going to start by posing three questions. One has to do with baseball and personality; the other two with statistics and causation. Most people, though not me, find baseball and personality more interesting, so let’s pose that question first.

In Major League Baseball, do younger brothers typically steal at a higher or lower rate than their older brothers, or are they the same on average?

There’s a bunch of social psychology research on the relationships between birth order and aspects of personality. In particular, previous research has found a link between being a younger sibling and increased risk-taking. There are several evolutionary just-so stories for this: later-born children get less attention from their parents, so they rebel more, etc. How can we see risk-taking in baseball data? Well, one possibility is in base stealing. There’s inherent risk in trying to steal a base. If you’re on first or second base, you’re in a good position. You could just rely on the next batter to get a hit and advance you. Or: you could try to steal a base. If it works, you’ve put your team in a better position. If it doesn’t work, you look like a fool.

Now, nobody thinks that the rate at which a player tries to steal bases is completely determined by risk-taking. In particular, it’s also got to do with how fast you run. Are speed and other additional variables things you need to control for? Spoiler–the answer is: it depends.

The third question is: What does Simpson’s paradox have to do with all of this? Simpson’s paradox states that, roughly speaking, what’s true for the parts is not necessarily true for the whole. Conversely, what’s true for the whole is not necessarily true for the parts. Now, mathematically, this isn’t much of a paradox: you can give examples where it’s easy to see what’s going on, and I’ll do so. But a corollary is that whether or not you control for things matters, and further, controlling for things doesn’t necessarily give you a better answer: in fact, it might give you a worse one. If we statisticians did a better job of teaching our intro classes, this wouldn’t be news to anyone. But we don’t, so it is.

I don’t have objective metrics, but I feel safe in asserting that most academic work in sabermetrics–that’s the study of baseball statistics–doesn’t measure up to the best publicly available fan work. (Note, however, that much of the best fan work is done by people who, one way or another, get paid to sabermetrise.) Whenever a journal article gets press, usually because of a surprising or counter-intuitive result, the fan community–in particular, Phil Birnbaum, whose blog is linked on the screen–will run various fine-toothed combs over the paper; often, little of the article will remain standing. (Cue my house whine about how I wish peer review usually worked that well, etc.) Sometimes, a mutually beneficial dialogue between fans and academics is instigated. This is a result of such a dialogue; I’ll leave it to you to decide whether or not it benefits anyone.

I should mention that I have a hypothesis that the more publicity an academic finding receives, the less likely it is to be true. In my less charitable moods, I call this the Superfreakonomics principle.

One day the topic of discussion on Phil’s blog was a New York Times article on a paper by Frank Sulloway and Richard Zweigenhaft. They had dug through baseball statistics for evidence of siblings behaving differently.

Previous studies had claimed that younger siblings took more risks than their older counterparts. You can theorise various sociological or evolutionary just-so stories to explain this — maybe younger brothers had to fight harder for food or for the love of their parents or something. Doctors Sulloway and Zweigenhaft examined whether the phenomenon might mean that baseball-playing brothers would try to steal bases at significantly different rates.

Here’s what the Times reported that they found. “For more than 90 percent of sibling pairs who had played in the major leagues throughout baseball’s long recorded history, including Joe and Dom DiMaggio and Cal and Billy Ripken, the younger brother (regardless of overall talent) tried to steal more often than his older brother.”

Ninety percent? Is that really plausible? Look at the names the Times quotes. If you know a bit about baseball, you know that Joe DiMaggio was Dom’s older brother, and Joe didn’t steal much—maybe two or three times a season. A quick check of Baseball-Reference.com verifies that Dom tried to steal at a higher rate than Joe. But Joe and Dom had an even older brother who made the majors—Vince. And Vince stole at a higher rate than Joe. We can think of plenty of other counterexamples. This ninety percent figure can’t be right, can it? Maybe the Times reported it wrong. What did the paper actually say?

Well, the abstract said this: Consistent with their greater expected propensity for risk taking, younger brothers were 10.6 times more likely to attempt the high-risk activity of base stealing and 3.2 times more likely to steal bases successfully (odds ratios).

10.6 times more likely? Does that mean that the probability of a younger brother having a higher steal attempt rate than his sibling is 10.6 times the probability of an older brother having a higher steal attempt rate? (Steal attempt rate, by the way, is just number of attempted steals divided by number of opportunities.) That would take us back to the 90%, which we’ve already asserted is implausible. But wait: the parenthesized words “odds ratios” suggest we might be calculating something different.

The meaning of the statement is clarified in Sulloway and Zweigenhaft’s second paper:

“It may be seen that the common odds ratio is 10.58… Within each call-up sequence subgroup, the odds ratio indicates that younger siblings attempted more stolen bases per opportunity…” So 10.58 is an estimate of something like the odds of a younger brother having a higher rate of stealing, divided by the odds of an older brother having a higher rate of stealing. Now, one might estimate this by, firstly, finding the proportion of younger brothers who try to steal more often (per opportunity) than their sibling; secondly, finding the proportion of older brothers who try to steal more often than their sibling (which may or may not be one minus the first proportion, depending on how you treat sets of three or more brothers); then finally, converting each proportion to odds and finding the quotient of the two. Doing this for the 95 or so pairs of position-player brothers in the database gives an odds ratio around 2.5. I note a minor lesson here: odds ratios are hard to understand, and odds, probabilities, or, where possible, counts are more transparent.

Younger brothers have attempted to steal at a higher rate than their older brothers 58 times, while the opposite occurred 37 times. (A two-tailed sign test gives a P-value of 0.040. Note that steal attempt rate is only one of several base stealing statistics used by the authors, so multiple testing caveats may apply.)
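The arithmetic behind these two numbers is easy to check. What follows is a minimal sketch rather than anyone's published calculation: it assumes the 58 and 37 counts make up the whole sample with no ties, it uses an exact binomial sign test, and the function name is my own.

```python
from math import comb

# Counts quoted in the text: pairs in which the younger (resp. older)
# brother had the higher steal attempt rate. Assumes no ties.
younger_higher = 58
older_higher = 37
n_pairs = younger_higher + older_higher

# Odds that a younger brother out-steals his sibling, and the same for older.
odds_younger = younger_higher / older_higher
odds_older = older_higher / younger_higher
odds_ratio = odds_younger / odds_older  # equals (58/37) squared


def sign_test_two_tailed(k, n):
    """Exact two-tailed sign test p-value under H0: each outcome has chance 1/2."""
    k = max(k, n - k)  # fold onto the upper tail
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


p_value = sign_test_two_tailed(younger_higher, n_pairs)
print(f"odds ratio = {odds_ratio:.2f}, two-tailed p = {p_value:.3f}")
```

Note how the odds ratio squares the 58-to-37 split, which is part of why it looks more dramatic than the raw counts do.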

S&Z claim this calculation is inappropriate. Why? Because older brothers and younger brothers may not have, on average, the same ability. In particular, an older brother who gets called up to the majors may help a younger brother of marginal ability to get called up. So there might be a number of weak younger brothers in the sample (I don’t recall any named example). Ability might affect one’s propensity to steal, so, S&Z argue, we should control for ability somehow. This may or may not be helpful; let’s assume for now that it might be. The central issue is the nature of the control.

Let’s work through S&Z’s argument for their control, sentence by sentence.

Comparisons between older and younger brothers are potentially biased by call-up sequence to the major leagues.

What exactly is biased here? Are we just looking for a comparison between pairs of brothers that were both position players in the Major Leagues? If so, no bias exists–we can exactly calculate the number we want. Are we trying to make a numerical generalisation about a wider population than Major League position-playing brothers? Hopefully not, otherwise we have all sorts of insoluble problems.

Are we trying to estimate a causal effect? Suppose we’re trying to estimate the causal effect of birth order on base-stealing proclivity. Usually when we’re trying to estimate causal effects, we have to worry about confounders. But when considering this causal effect, I can’t think of any reasonable confounders. What could possibly be a common cause of birth order and base-stealing proclivity? Perhaps fathers who liked to steal bases when they used to play ball procreate more, but it seems silly to worry about this. Similarly, reverse causality doesn’t look like a problem: maybe a base-stealing son might encourage parents to have more kids, but come on. This is about as close to a natural experiment as we ever get. Estimating the causal effect of birth order on base-stealing proclivity, conditional on brothers being Major League position players, does not obviously benefit from controls or adjustments. Well, about that conditional…

One would expect that owners and managers would be more likely to take a chance on the brother of an already successful major-league player than on someone lacking such a family connection.

I don’t know if one would expect this, but it’s a possibility. One might look into it by trying to find examples of such brothers, but I wouldn’t assume this a priori.

If this is the case, brothers called up later are apt to be less talented on average than brothers called up first–displaying a regression toward the mean in athletic ability.

Regression toward the mean is the wrong concept here: we have two populations with different means. The key thing to note is that brothers called up later will be less talented on average even if there is no manager bias. This is simply because better players are called up at younger ages. If a younger brother is called up earlier than his elder, the younger brother is very probably better. If an older brother is called earlier than his younger brother, that isn’t as informative, but we’d still guess the older brother is marginally more likely to be the better player: for one thing, we have a bit of evidence that the younger brother isn’t exceptional.

Consistent with this hypothesis, when we examined career longevity we found that brothers who were called up first tended to play for significantly more years (r = .23, n = 700, p < .0001) and in more total games (r = .19, n = 700, p < .0001) than brothers who were called up in a subsequent year.

We would expect to see this even if there was no managerial bias. More interesting is the issue of whether older brothers play more, less, or the same as their younger siblings. Both the paper and Phil present a bit of evidence that there is a difference, except Phil says older siblings play for longer, while S&Z say younger siblings play for longer if you control for call-up order. Any such difference, however, might not be because of manager bias. The whole premise of the study is that birth order might affect how you play baseball, so surely differences in birth order might affect the length of your career via ability rather than via selection. An older brother in the pros might induce a more modestly talented younger brother to stick with baseball instead of switching to another sport. However, it seems unnecessary to mathematically adjust for this kind of difference, when we can just say at the end that all our numbers are conditional on siblings making it to the majors, then let readers decide if that affects their judgement as to whether any of this has to do with risk-taking.

Call-up sequence and its relationship to athletic talent introduces a potential bias in athletic performance by birth order, as older brothers were more likely than younger brothers to be called up first owing to the difference in age.

The clause after the comma is true, but the clause before doesn’t follow. Older brothers are usually called up first, yes. Players called up first are usually better, yes. But how do we get from here to call-up introduces bias? Suppose call-up order causes changes in proclivity to steal. Well, birth order is the major determinant of call-up order. So controlling for call-up order repeats Fisher’s classic mistake of controlling for an intermediate variable.

I don’t think that’s the authors’ argument here, though. So suppose the causal effect of call-up order on stealing is negligible. Why would we not simply compare stealing rates for pairs of brothers? The authors might argue that differences in stealing caused by birth order could be mediated by factors other than risk-taking. In particular, such differences could be mediated through ability. So one could try to separate out the risk-taking path and the ability path. I would think this task hopeless, because risk-taking and ability interact in non-trivial ways, but some would disagree.

Let’s say we’re trying to control for ability. Doing this by using call-up order as a proxy for ability is the worst way I can think of doing this. Intuitively, this is because call-up order is much more strongly associated with birth order than with ability. So instead of removing the effect of ability, we end up removing most of the effect of birth order. It’s actually a bit worse than this, as I’ll discuss when I eventually get around to talking about Simpson’s paradox.

The extent of this effect turns out to be pronounced, with older brothers being 6.6 times more likely than their younger brothers to receive a call first to the major leagues (r = .70, n = 682, p < .0001).

No argument with the numbers. I’m not sure whether “effect” is meant to indicate that call-up order has a causal influence on anything; no mechanism for this is given.

We are therefore comparing somewhat more talented athletes who were called up first (and who, by virtue of their relative age, tend to be older brothers) with somewhat less talented athletes who were called up second (and who tend to be younger brothers).

Why do we care? One brother will be more talented, one less so. Why does it matter if it’s usually the first-called one? If it were usually the older one, then we might conceivably want to control for ability, but we’d use a real measure of ability, and not call-up order.

To correct for this bias, we have controlled performance data for call-up sequence in all analyses involving birth order. This control is equivalent to directly comparing older and younger brothers who were called up first (n = 304), and then separately comparing older and younger brothers who were called up during the same year (n = 74), and finally comparing all older and younger brothers who were called up after another sibling (n = 304).

This. Instead of comparing two groups that were at least close in average ability, we’re now comparing pairs of groups that are entirely different in average ability.

Firstly, S&Z use “within-family comparisons”. So start by comparing each player’s steal attempt rate to his brother’s, and see if it’s higher or lower. So each brother has a 1 or 0 attached.

Next, find odds ratios, stratifying by call-up order. Start with players called up before their brothers. Divide the odds an older brother called up before his brother attempts to steal more than his sibling by the odds a younger brother called up before his brother attempts to steal more than his sibling. Next, consider players called up in the same year as their brothers. Divide the odds an older brother called up in the same year as his brother attempts to steal more than his sibling by the odds a younger brother called up in the same year as his brother attempts to steal more than his sibling. Finally, consider players called up after their brothers. Divide the odds an older brother called up after his brother attempts to steal more than his sibling by the odds a younger brother called up after his brother attempts to steal more than his sibling.

The Mantel-Haenszel estimator assumes a common odds ratio exists: that the “true” odds ratio is the same for all three groups. Note that this analysis uses each pair of brothers twice: once for each brother. This creates huge dependencies, which complicates inference. This isn’t the major problem.

This is. The analysis compares younger brothers called up first to elder brothers called up first–slanted in favour of the exceptional younger players. It compares youngers called up at the same time to elders called up at the same time–also slanted in favour of the exceptional younger players. And it compares youngers called up later to elders called up later–slanted against the weak elders. In each case, the stratification means the youngers are favoured by the comparison. This is visible in S&Z’s paper. In their Table 4, they repeat the analysis I described for steal attempt rates, this time for years and games in the majors. You get the odds ratios 8.75 to 1 for years and 6.41 to 1 for games. These are huge numbers, and S&Z’s attempts to explain the difference through risk-taking don’t wash.

Let’s assume we have 100 pairs of brothers, with no twins. Suppose that the set of older brothers has the same distribution of talent as the set of younger brothers. Let’s also assume that ability is independent of family and birth order, and that steal attempt rate is an increasing function of ability. We’ll make it a deterministic function, though it could have a random component as well. That means that if you control for speed, your chance of being the brother who steals more is the same regardless of whether you’re the older brother or the younger brother.

We know that older brothers will be called up first more frequently than younger brothers. However, talented younger brothers are sometimes called up first. In particular, younger brothers who are unusually fast are more likely to be called up first, as well as more likely to try to steal.

Let’s say 90% of the time, the older brother is called up first, and the other 10% of the time, the younger brother is called up first. The following set of frequencies is plausible (first column is older brothers, second column is younger brothers):

Let’s check this satisfies our assumptions. The older brother column sums to 100, as does the younger brother column. Of the older brothers, 50 have a higher steal attempt rate, while of the younger brothers, 50 have a higher steal attempt rate. Ten of 100 younger brothers are called up first, and the youngsters who are called up first are more likely to have a higher steal attempt rate.

Now, what’s the odds ratio? First, let’s just look at those who are called up first. The odds ratio is (41/49)/(1/9), which is about 7.5. If you just look at those who are called up second, the odds ratio is also 7.5. If you want to combine these two odds ratios, it would seem that 7.5 would be the logical choice. But this should be 1!
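The frequency table itself is missing from this copy, so the counts below are an assumption on my part: one set of numbers that passes every check listed above (each column sums to 100, 50 higher-stealers in each column, ten younger brothers called up first, stratum odds ratios of about 7.5). The orientation of the odds ratio, younger over older, is also my choice. With that caveat, here is a rough sketch of the stratified and pooled calculations, using the standard Mantel-Haenszel formula for the common odds ratio.

```python
# HYPOTHETICAL counts (the original table is not available): chosen only to
# satisfy the checks in the text. Each pair is (steals MORE than his brother,
# steals LESS than his brother).
table = {
    "called up first":  {"older": (49, 41), "younger": (9, 1)},
    "called up second": {"older": (1, 9),   "younger": (41, 49)},
}


def odds_ratio(younger, older):
    """Odds a younger brother steals more, over the same odds for an older one."""
    (y_more, y_less), (o_more, o_less) = younger, older
    return (y_more / y_less) / (o_more / o_less)


def mantel_haenszel(strata):
    """Mantel-Haenszel common odds ratio across the call-up strata."""
    num = den = 0.0
    for cell in strata.values():
        y_more, y_less = cell["younger"]
        o_more, o_less = cell["older"]
        n = y_more + y_less + o_more + o_less
        num += y_more * o_less / n
        den += y_less * o_more / n
    return num / den


def pooled(strata):
    """Odds ratio after ignoring call-up order and adding the strata together."""
    younger = tuple(sum(c["younger"][i] for c in strata.values()) for i in (0, 1))
    older = tuple(sum(c["older"][i] for c in strata.values()) for i in (0, 1))
    return odds_ratio(younger, older)


for name, cell in table.items():
    print(name, round(odds_ratio(cell["younger"], cell["older"]), 2))
print("Mantel-Haenszel:", round(mantel_haenszel(table), 2))
print("pooled:", pooled(table))
```

Both strata and the Mantel-Haenszel estimate come out near 7.5, while the pooled table, which is the one that matches the assumptions we started from, gives exactly 1.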

This is a version of Simpson’s paradox, but it’s the other way around from the way we usually think about Simpson’s paradox. Whatever, it’s still not really much of a paradox.

Let’s assume you’ve heard of Simpson’s paradox (or the Yule-Simpson reversal, if you insist on accuracy in nomenclature), which states that a relationship that holds across each of a number of groups may change, disappear, or be reversed when the groups are aggregated. The way it’s usually presented is as a reminder of correlation-does-not-prove-causation, that you need to account for confounders. Take the canonical Berkeley sex bias study, as presented by Freedman, Pisani, and Purves:

An observational study on sex bias in admissions was done by the Graduate Division at the University of California, Berkeley. During the study period, there were 8,442 men who applied for admission to graduate school and 4,321 women. About 44% of the men and 35% of the women were admitted…

Each major did its own admissions to graduate work. By looking at them separately, the university should have been able to identify the ones which discriminated against the women. At that point a puzzle appeared. Major by major, there did not seem to be any bias against women. Some majors favored men, but others favored women. On the whole, if there was any bias, it ran against the men.

In this example, the correct comparisons are between male and female applicants to the same major. Now, many statisticians would jump on the use of “correct” in the previous sentence. They might argue that looking at the whole isn’t any more right or wrong than looking at the parts; it’s just different. This is of course mathematically true. But statisticians, unlike almost any others who call themselves scientists, have the luxury of never having to talk about causation, unless they really want to. Just about everyone else looks at the examples, says “Ah, to get the right estimate you have to control for confounders”, and scuttles off to run regressions with twenty independent variables.

The example here will helpfully drive home the other half of the lesson. If you care about the whole group, then drawing inferences from subgroups is the wrong thing to do. If you want to find the causal influence of X on Y, and Z has negligible causal influence on Y, then controlling for Z doesn’t help. In this example, it can hurt a lot: taking you from something like the right answer to an implausible answer. This will typically happen when X and Z are closely associated.
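A small simulation makes the same point without hand-picked counts. Everything in it is an assumption made for illustration: abilities are independent standard normals, the brother with more ability always steals more, better players debut earlier, and the younger brother starts a fixed number of years behind. Birth order has no effect on stealing by construction, so the honest answer is an odds ratio of 1.

```python
import random


def simulate(n_pairs=20000, age_gap=3.0, seed=0):
    """Count (steals more, steals less) by call-up order and birth order."""
    rng = random.Random(seed)
    counts = {s: {"older": [0, 0], "younger": [0, 0]} for s in ("first", "second")}
    for _ in range(n_pairs):
        ability_old = rng.gauss(0, 1)
        ability_young = rng.gauss(0, 1)
        # Assumed call-up rule: more able players debut sooner (plus noise),
        # and the younger brother is age_gap years behind.
        debut_old = -2 * ability_old + rng.gauss(0, 1)
        debut_young = age_gap - 2 * ability_young + rng.gauss(0, 1)
        older_first = debut_old < debut_young
        # Steal attempt rate is increasing in ability, so compare abilities.
        older_more = ability_old > ability_young
        counts["first" if older_first else "second"]["older"][0 if older_more else 1] += 1
        counts["second" if older_first else "first"]["younger"][1 if older_more else 0] += 1
    return counts


def pooled_odds_ratio(counts):
    """Younger-over-older odds ratio, ignoring call-up order."""
    y_more = sum(c["younger"][0] for c in counts.values())
    y_less = sum(c["younger"][1] for c in counts.values())
    o_more = sum(c["older"][0] for c in counts.values())
    o_less = sum(c["older"][1] for c in counts.values())
    return (y_more * o_less) / (y_less * o_more)


def mantel_haenszel(counts):
    """Common odds ratio after stratifying on call-up order."""
    num = den = 0.0
    for c in counts.values():
        y_more, y_less = c["younger"]
        o_more, o_less = c["older"]
        n = y_more + y_less + o_more + o_less
        num += y_more * o_less / n
        den += y_less * o_more / n
    return num / den


counts = simulate()
print("pooled:", pooled_odds_ratio(counts))
print("stratified:", mantel_haenszel(counts))
```

Under these assumptions the pooled odds ratio stays close to 1, while the estimate stratified on call-up order lands far above it, even though no birth-order effect was built in.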

Some pictures might help to clarify. Let’s posit a model with three variables: birth order, steal attempt rate, and call-up order. An arrow from one variable to a second means that the first variable has a causal effect on the second. We’re not going to draw a connection between call-up order and steal attempt rate because S&Z make no claim as to a direct causal connection between those two variables. Put this bluntly, it’s obvious that, if the goal is to estimate the effect of birth order on steal attempt rate, call-up order doesn’t help you at all. It’s off in another direction entirely.

But, you might say, there are other variables! Like ability, for example. Let’s set aside call-up order for a moment. The pink line means I don’t know what the heck the relationship between risk-taking and ability is: whether there’s an arrow or not, or, if there is an arrow, whether it goes one way, the other, or both ways. (By the way, that pink line is one reason why the clever covariate selection methods of Judea Pearl, Don Rubin et al. are often difficult to use in practice. It’s hard to write down a single correct directed acyclic graph.) One problem is that ability and risk-taking aren’t well-defined. How do we measure risk-taking? We don’t; we measure steal attempt rate instead. What can we get out of measuring steal attempt rate? Well, we could get an idea of how much of the effect of birth-order flows through A along the top path, and how much flows through R along the bottom path. If there really is no direct causal link between A and R, this could tell us at least something about the effect of B on R.

Let’s pretend the pink line doesn’t exist. Let’s reinstate C and try to work out why it’s there at all. It’s there as a kind of proxy for A. The first problem with this is that B causes C, and conditioning on an effect is usually bad. The second problem is that C is much more closely associated with B than with A, so C is a horrible proxy for A in this problem. Together, these two problems open the door for a Simpson’s paradox reversed situation to come up, and that’s exactly what happened.

OK, OK, we now know not to control for call-up order when it’s an effect and not a cause. But you don’t care about baseball anyway. Why does any of this matter? Let’s look at a fictitious example.

In the Land of 400 Robots, there are 400 robots. There are also two political parties, the Robopublicans and the Digicrats. It’s well-known that rich robots are more likely to be Robopublicans, while poor robots are more likely to be Digicrats. The difference here looks like it’s 20%, or if you like odds ratios, 1.5 squared, which is 2.25.

One day, the robot census decides to collect data about where the robots live, in addition to wealth and party. Here are the results.

It now looks like rich urban robots were a couple of percent less likely to be Robopublicans than poor urban robots. Also, rich rural robots were a couple of percent less likely to be Robopublicans than poor rural robots. The common odds ratio for wealth is 0.92. Now, knowing nothing else about the Land of 400 Robots, it’s reasonable to suspect that wealth has a small negative effect on the chance of being Robopublican, and the large positive effect you had in the previous table was due to the confounding effect of location, which itself has a huge effect of the order of 30%. But: this isn’t proof of a causal effect. Let’s say the census also collected gender information.

Now what does it look like the effect of wealth is? For every combination of location and gender, rich robots are 20% more likely to be Robopublicans than poor robots, which was our original estimate. (Note that the zeroes make the odds ratios infinite, which is another reason to be careful about using odds ratios.) Furthermore, for every combination of wealth and gender, urban or rural doesn’t appear to make any difference. Now, we’re wary of jumping to causal conclusions, because we don’t know if there are still more confounders that could change our estimates again. So we go talk to the robots, ask them why they selected the party that they did. If it seems this model is justified, great. If not, we try again.