It’s interesting that causality doesn’t come up until the comments, when Stephen Conn discusses Galileo. Now, you can get a long way in science without thinking about causality, or only thinking about causality in a very simple way — that’s the basis for the success of statistics over the last century. And indeed, if you only care about prediction without intervention, it’s not necessary to tease out cause. But if you care about what happens after an intervention, you need to know something about the causal structure, and at this point in history, finding causal structure is not something machines are good at doing. (To be fair, humans aren’t always good at this either.) The philosophical difficulty is that almost everything is an intervention — changing Google’s search algorithm is an intervention — though often, the intervention changes the system by a negligible amount. Intervention is a continuum, and given the current state of knowledge in statistics and machine learning, this is something that should keep statisticians awake at night.

Phil: “Suppose you’re a Martian who has just immigrated to North America, and you have no idea how baseball works. All you’ve got is a database full of statistics, and a black-box graph theory algorithm to try to figure out cause-and-effect relationships. What would you conclude?”

… while we found some evidence that winning affects payroll and payroll affects winning, the evidence suggests the effect of winning on payroll is the more direct, larger, and more lasting in magnitude one.

Suppose a second Martian came to North America to verify the results of the first one. Having some time on her hands, she takes the unprecedented step of actually watching a baseball game. When, in the first innings, Ichiro gets a hit, she thinks that it makes a lot more sense to say that ABs cause singles, rather than that singles cause ABs as the black box claimed. But as the innings continues, she sees that singles can cause ABs as well: by extending the innings. Somewhere around the bottom of the third, she realises that there is no way the black box can work out these sorts of causal relationships based on annual data: you would need gamelogs and a super-black-box.

Since even Martians have not yet developed super-black-boxes, she simplifies the problem. She throws out all the performance variables. Instead, for each team-year, she only considers two variables:

– team payroll (standardised in some way, perhaps fraction of total payroll);
– team wins.

She then builds two predictive models:

– year N wins as a function of year N payroll (if you wanted to be careful, you’d use preseason payroll)
– year (N+1) payroll as a function of year N wins and year N payroll.

These are not definitive models. It could be that good managers can talk their bosses into high payrolls, and, over and above that, are good at player selection. Furthermore, there are different ways to change payroll. Increasing your payroll by acquiring free agents will have a different effect from giving your existing players a pay rise. Finally, all the box score variables we dropped might matter after all: perhaps wins due to “luck” affect payroll differently from wins due to improved performance. The black box, however, either ignores these problems or solves them in obviously wrong ways. It’s better to directly model the relationships you care about.
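As a rough illustration, the two models above could be fit by ordinary least squares. This is only a sketch under stated assumptions: the linear form, the function names `fit_ols` and `fit_payroll_models`, and the array layout (one row per team, one column per consecutive season) are all made up for the example, and payroll is assumed to be standardised already.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit of y on the columns of X plus an intercept.

    Returns [intercept, slope_1, slope_2, ...].
    """
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coef

def fit_payroll_models(payroll, wins):
    """Fit both models from arrays of shape (n_teams, n_years).

    Assumption: payroll is already standardised, e.g. as each team's
    share of total league payroll in that season.
    """
    payroll = np.asarray(payroll, dtype=float)
    wins = np.asarray(wins, dtype=float)

    # Model 1: year-N wins as a function of year-N payroll.
    wins_model = fit_ols(payroll.ravel(), wins.ravel())

    # Model 2: year-(N+1) payroll as a function of year-N wins
    # and year-N payroll. The last season has no "next year", so drop it.
    X_next = np.column_stack([wins[:, :-1].ravel(), payroll[:, :-1].ravel()])
    payroll_model = fit_ols(X_next, payroll[:, 1:].ravel())

    return wins_model, payroll_model
```

The wins coefficient in the second model is the quantity of interest for "winning affects payroll": it describes how next year's payroll moves with this year's wins once this year's payroll is held fixed. A straight line is the simplest choice, not necessarily the right one, and none of the caveats above go away.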

(Meta-note: Now that classes are over for the semester I’ll try to post something every weekend. Annoy me if I don’t.)

Here’s a study by Cornaglia and Feldman of the relationship between baseball salaries and marital status. They overinterpret their results, but I’m not worried about that aspect here. What I’ll focus on is their regression to find the “direct effect of marriage on earnings by initial ability”. It’s a regression for causal inference, so you know I’ll find something to object to. What might be more constructive is a proposal for how I would study the issue.

This is what I’m thinking:

Is there a difference between the distribution of salaries of single players and the distribution of salaries of married players? What is the difference? Is it just a shift on a log scale, or does the distribution change shape? (Histograms would be informative.) Is the difference significant for some reasonable definition of significant?

By far the most obvious common cause is age: clearly this affects both marital status and salary, whereas experience, for example, will be much less directly related to marital status. Can any differences between single and married players be explained by differences in age? To answer this, compare the salaries for 23-year-old single players to 23-year-old married players. Then compare 24-year-old single players to 24-year-old married players, and so on. Comparisons of centres are important, but we also care about comparisons between each pair of distributions: if there are differences, are they for all players, or only for parts of the salary distribution?
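The age-by-age comparison might look something like the sketch below. It works on log salaries and reports the married-minus-single gap at the quartiles as well as the median, so you can see whether a difference shows up across the whole distribution or only in part of it. The function name `compare_by_age`, the input layout (three parallel arrays), and the choice of quartiles are assumptions for illustration.

```python
import numpy as np

def compare_by_age(age, married, salary, min_group=1):
    """Compare log-salary distributions of married vs single players by age.

    age, married, salary: parallel arrays, one entry per player-season.
    Ages with fewer than min_group players in either group are skipped.
    """
    age = np.asarray(age)
    married = np.asarray(married, dtype=bool)
    log_salary = np.log(np.asarray(salary, dtype=float))

    rows = []
    for a in np.unique(age):
        m = log_salary[(age == a) & married]
        s = log_salary[(age == a) & ~married]
        if len(m) < min_group or len(s) < min_group:
            continue  # nothing to compare at this age
        # Gap (married minus single) at the 25th, 50th and 75th percentiles.
        q25, q50, q75 = (np.quantile(m, [0.25, 0.5, 0.75])
                         - np.quantile(s, [0.25, 0.5, 0.75]))
        rows.append({
            "age": int(a),
            "n_single": len(s),
            "n_married": len(m),
            "q25_gap": float(q25),
            "median_gap": float(q50),
            "q75_gap": float(q75),
        })
    return rows
```

If the three gaps are roughly equal within an age, that is consistent with a simple shift on the log scale; if they diverge, the shape is changing. With small groups at the youngest and oldest ages the quantiles will be noisy, so a larger `min_group` is probably sensible in practice.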

If the age-matched comparison suggests there really are differences, let’s try to quantify them. Build a model for “deserved” salary based on age, experience, and output. The output part would be hard to build from scratch; fortunately, the good people at Fangraphs have done the work for us. Pair single and married players who are very close in predicted salary. Are there systematic differences between the pairs? Do they hold throughout the distribution of predicted salary, or only for one end or the other?
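The pairing step could be sketched as a greedy nearest-neighbour match on predicted log salary. This assumes the "deserved" salary model has already been fit somehow and its predictions are available as an array; the function names `match_pairs` and `paired_gaps`, the greedy strategy, and the `caliper` cutoff are all hypothetical choices for illustration, and a more careful analysis might use an optimal matching instead.

```python
import numpy as np

def match_pairs(pred_log_salary, married, caliper=0.1):
    """Greedily pair each married player with the closest unused single player.

    pred_log_salary: predicted ("deserved") log salary for each player,
        from whatever model was built on age, experience and output.
    caliper: largest allowed difference in predicted log salary for a pair.
    Returns a list of (married_index, single_index) tuples.
    """
    pred = np.asarray(pred_log_salary, dtype=float)
    married = np.asarray(married, dtype=bool)
    single_idx = list(np.flatnonzero(~married))

    pairs = []
    for i in np.flatnonzero(married):
        if not single_idx:
            break  # no single players left to match
        gaps = np.abs(pred[single_idx] - pred[i])
        j = int(np.argmin(gaps))
        if gaps[j] <= caliper:
            # Each single player is used at most once.
            pairs.append((int(i), int(single_idx.pop(j))))
    return pairs

def paired_gaps(pairs, log_salary):
    """Actual log-salary difference (married minus single) within each pair."""
    log_salary = np.asarray(log_salary, dtype=float)
    return np.array([log_salary[m] - log_salary[s] for m, s in pairs])
```

Plotting the output of `paired_gaps` against the pairs' predicted salary would show whether any systematic difference holds throughout the range or only at one end. Greedy matching depends on the order in which married players are processed, which is one reason to treat this as a first look only.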