Sabermetric Research

Phil Birnbaum

Thursday, October 30, 2008

Review of "Super Crunchers"

I wrote this review of "Super Crunchers" in January, and for some reason never posted it. Brian Burke's comment in the previous post, which deals with some of the same subject matter, reminded me that I still had it.

-----

I've just finished Ian Ayres' book "Super Crunchers", and I'm a little disappointed. It's an excellent book, and I enjoyed it. But it's not as sabermetric as I thought; I was hoping for lots of meaty examples, like the one in the introduction where statistician Orley Ashenfelter came up with a formula, based on temperature and rainfall, to predict how good a year's wine crop will be. But most of the examples are more commonplace.

"Why Thinking-by-Numbers Is the New Way to Be Smart" is the book's subtitle. And there are many flavors of "knowledge through numbers" discussed. For instance, there's a chapter on traditional statistics, like the normal distribution and standard deviations. There's a bit on the "false positives" problem, where if a disease is very rare, most of the people who test positive for it won't actually have it.

There's an entire chapter on how to make decisions by randomization. In choosing the title for his book, Ayres bought two sets of internet ads: one that used "Super Crunchers," and another that used "The End of Intuition." The first title got 60% more hits than the second, and the rest is history.

In Mexico, this kind of technique was used to test a certain anti-poverty program. Certain households, randomly chosen, were offered cash incentives if they took their children to health clinics and kept them in school. The results showed the program worked – the randomly-chosen group had better results than the non-chosen group. Ayres calls this method of policymaking "Government by Chance."

And there's lots of stuff on just plain, regular analysis of data, like Steven Levitt's study concluding that legal abortion reduces crime (since the babies who would otherwise have been born are more likely to commit crimes because they grow up poor). There's also predicting of trends, like when Wal-Mart knows that, after hurricanes, demand for Pop-Tarts skyrockets.

Those chapters are interesting, but the best parts of the book – and the largest – are where Ayres talks about how well data analysis works in situations where you'd think an informed, intuitive, expert judgment would be better. One example is the Moneyball claim that players can be better scouted by their statistical record than by the opinions of the scouts who watch them.

A significant portion of the discussion involves medicine. The impression you get is that the medical professionals fly by the seat of their pants, like baseball scouts, trying to figure out what's best. But, as Ayres demonstrates, the data can be a lot more accurate than intuition. And, again like the scouts, doctors are defiant when outsiders' knowledge competes with theirs.

For instance, in the 1840s, a researcher found that mortality rates dropped by about 80% when doctors washed their hands between patients. The doctors didn't want to wash their hands, so they dismissed the findings, and patients died.

It sounds like we should know better now, but apparently not. In 1999, a doctor named Don Berwick went on a crusade to convince hospitals to implement certain basic procedures that would have a huge effect. Hand washing was one of them, but there were others, like formal procedures to double-check drug doses. There were six reforms in total, and they saved some 122,000 lives.

Berwick compares these procedures in hospitals to formal FAA procedures that guide aviation flights – no matter how experienced the pilot, procedures have to be followed.

What's amazing to me is that even though the data was out there, and the research was done, nobody bothered to change the way they were doing things. Apparently baseball is not unique in clinging to tradition and scorning new knowledge.

This is also the case in diagnosis. There are now computer systems out there that will take patient information – such as symptoms, genetic history, etc. – and produce a list of possible causes. Even the best of doctors can't sift through 11,000 diseases in their heads, and, if you rely on the doctor's memory and intuition, misdiagnoses, perhaps fatal ones, will be made. Ayres writes,

"... about 10 percent of the time, [the] Isabel [software] helps doctors include a major diagnosis that they would not have considered but should have."

Ten percent is a LOT. In my opinion, any doctor who doesn't use this software, and misdiagnoses a patient, should meet a swift and unpleasant death. Or, worse, a lawsuit. There's just no excuse for refusing to use all reasonable methods to check your diagnosis. Especially out of arrogance.

Diagnosing rare medical conditions seems reasonably complex, and you might be comfortable with the idea that computers can do it better than humans. But it turns out that formulas can often beat humans even when the formulas are very simple. For instance, experts were asked to predict how US Supreme Court Justice Sandra Day O'Connor would rule on certain cases. They competed against a flowchart that the researchers had devised in advance, a chart that fits on less than one page of the book.

The flowchart beat the experts.

Ayres makes much of the fact that all this is "number crunching." My view is that it doesn't matter that there are numbers involved. What there is, rather, is evidence. The rules of logic and evidence, and the scientific method, are what make the knowledge, not the arithmetic. I'd argue that the people in the book shouldn't be called "Super Crunchers." They should just be called "scientists."

---------

By the way, it seems to me that the book's main baseball discussion isn't quite correct. In a scouts-versus-sabermetricians discussion, Ayres quotes Michael Lewis quoting Bill James:

"The naked eye was inadequate for learning what you need to know to evaluate players. Think about it. One absolutely cannot tell, by watching, the difference between a .300 hitter and a .275 hitter. The difference is one hit every two weeks."

But James didn't say that to justify sabermetrics – he used that to justify *baseball records*, even traditional ones such as batting average. By this standard, baseball people have been "super crunchers" for over 100 years.

The scouts-vs.-records debate has little to do with which formulas you use to measure productivity, but a lot to do with whether the statistical records of prospects have an importance beyond scouts' impressions. The Jeremy Brown debate – Billy Beane liked him because he could hit, the scouts hated him because he was fat – could have just as easily happened 40 years ago, without Bill James.

And one last point: on page 210, Ayres quotes Ben Polak and Brian Lonergan's statistic that rates players based on changes in win probability. He says "they have done [Bill] James one better." Of course, they have not.

Wednesday, October 29, 2008

Billy Beane: use sabermetrics to improve health care

In this New York Times Op-Ed, Billy Beane teams up with Newt Gingrich and John Kerry to suggest that a sabermetric approach could help save American health care.

"Remarkably, a doctor today can get more data on the starting third baseman on his fantasy baseball team than on the effectiveness of life-and-death medical procedures. Studies have shown that most health care is not based on clinical studies of what works best and what does not — be it a test, treatment, drug or technology. Instead, most care is based on informed opinion, personal observation or tradition ...

"One success story is Cochrane Collaboration, a nonprofit group that evaluates medical research. Cochrane performs systematic, evidence-based reviews of medical literature. In 1992, a Cochrane review found that many women at risk of premature delivery were not getting corticosteroids, which improve the lung function of premature babies.

"Based on this evidence, the use of corticosteroids tripled. The result? A nearly 10 percentage point drop in the deaths of low-birth-weight babies and millions of dollars in savings by avoiding the costs of treating complications."

I have no doubt that the authors are right. Baseball's "doctors" – sportswriters and broadcasters – believe that Derek Jeter is an above-average fielder, despite mountains of evidence and tens of studies pointing the other way. If experts can be so wrong, for so long a time, *despite* the statistical evidence, doesn't it seem like doctors are going to be just as wrong when there is *no* statistical evidence?

The authors recommend:

"Working closely with doctors, the federal government and the private sector should create a new institute for evidence-based medicine. This institute would conduct new studies and systematically review the existing medical literature to help inform our nation’s over-stretched medical providers."

Absolutely, this is a good idea. But I think the process can be sped up. My suggestion: open the database to everyone, and offer prizes for the most significant findings, as judged by the new institute.

Why do I think this would work?

1. People are smart. The more people who work on this, the more knowledge is going to come out of the process. Sabermetrics wouldn't have happened if anyone had waited for "formal" baseball researchers to get to it. Linux wouldn't be as advanced as it is today if not for the open-source movement, and thousands of talented volunteers contributing continuous improvement.

2. People respond to incentives. The Netflix contest, where competitors scour a database of customer preferences to try to predict each customer's other preferences, attracted hundreds of competitors. That's very similar to what's involved here. People love an interesting challenge with recognition and prizes at the end of it.

3. You don't need to be a doctor. This is an exercise, mostly, in finding relationships in data. It's almost exactly what sabermetricians do every day. Sure, you'd need to know a bit of medicine. For instance, you'd need to know that corticosteroids are thought to improve the lung function of premature babies. Once you know that, then you take a look at whether the data show it. How do you find that out? Well, I'd bet there would be no shortage of "hints" like these, just as there is no shortage of advice on what companies to invest in, or tips on what third basemen are due for breakout seasons.

4. I'd bet that, often, doctors are unaware of many of the new findings in medicine. It probably takes a while for a good idea to become standard, as it takes time for more and more doctors to become aware of it. But with a big monthly prize, and the resultant publicity, any new knowledge that comes out of this process will be hard to ignore, especially if the medical establishment gets on board.

There is a certain mindset that if something is important, you need experts and academics and government to do it. That's true when you need extremely specialized knowledge or equipment. But when it's software development, or data analysis, there's no shortage of brilliant laymen willing to take a stab at it.

I think the lay public would help, enthusiastically, to figure this stuff out. Releasing the data could save a lot of lives.

Sunday, October 19, 2008

NYT on referee home-field bias in soccer

The article doesn't link to the actual academic articles it cites, but does mention one author, Peter Dawson. I've found two papers on the subject by Dawson and others; this one, which I can't download (if you can, can you send it to me?), and another one, which I can.

That's because the paper just tries to figure out how many penalties (yellow and red cards) were given to home and visiting teams.But even if it turns out that visiting teams were called for a lot more infractions than home teams, that wouldn't tell you anything about the referees – because it could be that visiting teams just commit more infractions!

There could be many reasons for this. Perhaps the cheers of the home crowd frustrate the visitors, and they become more aggressive. Perhaps visiting teams trail more often, due to home field advantage, and have to become more physical in an attempt to come from behind. Maybe the visiting players just miss their wives. Who knows?

But it seems reasonable to assume that whatever makes players better at home could easily lead them to commit fewer fouls at home, too. So just counting penalties, I think, doesn't tell you much about the refs.

Still, one of the more interesting findings in the paper is this: if there's a running track in between the football pitch and the crowd, the referee calls more penalties against the home team (statistically significant at 3 SDs). The authors interpret that as meaning that, the farther the officials are from the screaming, potentially hostile fans, the more willing they are to incur their wrath by punishing their team.

It sounds reasonable, except that it's possible that teams that play in those stadiums just happen to be teams that are more aggressive in general. It would have been better to control for that, perhaps by including a variable for penalties taken on the road.

Saturday, October 11, 2008

Blacks in baseball: the peak was 20%, not 27%

Remember that statistic that said that the percentage of (American) blacks in major league baseball hit a high of 27% back in the early 1970s? It turns out that isn't true. According to Carl Bialik, of the Wall Street Journal, the 27% figure applied only to full-time non-pitchers with at least 50 games that season. If you include everyone, the actual high was 20%.

Today's figure is around 8%, so there's still a sizeable drop to explain – just not as large as originally purported.

The originator of the "27%" figure, John Loy, used it in a study where he looked for evidence of "stacking" (which means restricting black players to certain positions). He found that African-American players were disproportionately represented in the outfield, and suggested that teams put blacks in the outfield because that way they'll have less interaction with their (mostly white) teammates.

But didn't Bill James note (maybe in his 1987 rookie study?) that black players appeared to keep their foot speed a lot longer than white players did? I remember he once mentioned that Rick Monday was drafted partially because he was so fast. Today, of course, Monday isn't associated with speed at all – he's remembered mostly in connection with flag burning and breaking Tango's heart.

Anyway, if Bill was right, that would certainly explain the effect – you have to be reasonably fast to play the outfield, but not to catch, play first base, or designated hit. So there would appear to be some segregation by race, when it's really by speed. I haven't read the Loy study – he might have corrected for this. I'm just saying.

There's another effect Bialik mentions:

" ... other research suggests that latent racism within the game tended to reserve bench spots for white players."..."[SABR's Mark] Armour found that black players consistently have outperformed their contemporaries in total "win shares," a statistic developed by baseball numbers pioneer Bill James that represents players' total contribution to a team's success. ... One reason black players were, on average, better than white players was that they needed to be to make the roster."

I think Bill James debunked this one a long time ago too. If blacks are slightly better than whites, on average, the effect is magnified at the extremes of ability.

Suppose that, on average, whites average 100 "points" of ability, but blacks average 103. And suppose both races are normally distributed with standard deviation of 10. Finally, let's say blacks are 15% of the population.

In that case, about 24% of blacks will be at 110 or higher – but only 16% of whites. In terms of population, 21% of the 110+ players will be black.

But now, let's look at the star players, the ones above 130. Only 0.35% of blacks will achieve this mark. But for whites, it's a lot less -- 0.13%. So in that group, about 31% of players will be black.
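The arithmetic can be checked with a few lines of Python. This is only a sketch under the assumptions stated above (normal distributions, means of 100 and 103, SD of 10, blacks at 15% of the population); the function name `black_share_above` is made up for illustration.

```python
from statistics import NormalDist


def black_share_above(cutoff, black_mean=103, white_mean=100, sd=10, black_pop=0.15):
    """Return (fraction of blacks above cutoff, fraction of whites above cutoff,
    black share of everyone above cutoff) under the toy assumptions above."""
    p_black = 1 - NormalDist(black_mean, sd).cdf(cutoff)
    p_white = 1 - NormalDist(white_mean, sd).cdf(cutoff)
    # Weight each tail by that group's share of the overall population.
    weighted_black = black_pop * p_black
    weighted_white = (1 - black_pop) * p_white
    return p_black, p_white, weighted_black / (weighted_black + weighted_white)


for cutoff in (110, 130):
    p_black, p_white, share = black_share_above(cutoff)
    print(f"{cutoff}+: blacks {p_black:.2%}, whites {p_white:.2%}, black share {share:.0%}")
```

The point of the exercise is that a small difference in the averages produces a larger difference in representation the farther out in the tail you look.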

The authors went through the Mitchell Report page by page, finding all players where there were allegations of steroid use, and the specific seasons in which the players were supposed to have taken them. Then, they tried to find out if, in those seasons, the accused batters showed any evidence of better hitting, as compared to the large population of all other players not accused.

They adjusted each season for the age of the player. To get that adjustment, here's what they did:

-- calculate the average RC27 for hitters at each age
-- subtract that from the mean to get an age adjustment
-- correct each player-season RC27 by the amount of the age adjustment.
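As I read those three steps, they amount to something like the sketch below. This is my reconstruction, not the authors' code: the sign convention (adjustment = overall mean minus the age average, added to each player-season) is an assumption, and the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def age_adjust(player_seasons):
    """player_seasons: list of (player, age, rc27) tuples.
    Returns the same list with RC27 shifted by a per-age adjustment."""
    by_age = defaultdict(list)
    for _, age, rc27 in player_seasons:
        by_age[age].append(rc27)

    # Step 1: average RC27 at each age.
    age_mean = {age: sum(vals) / len(vals) for age, vals in by_age.items()}

    # Step 2: adjustment = overall mean minus that age's average (assumed sign).
    overall_mean = sum(rc27 for _, _, rc27 in player_seasons) / len(player_seasons)
    adjustment = {age: overall_mean - mean for age, mean in age_mean.items()}

    # Step 3: correct each player-season by the adjustment for its age.
    return [(player, age, rc27 + adjustment[age]) for player, age, rc27 in player_seasons]


# Made-up player-seasons, purely for illustration.
seasons = [("A", 27, 6.0), ("B", 27, 5.0), ("C", 35, 4.0), ("D", 35, 3.0)]
adjusted = age_adjust(seasons)
print(adjusted)
```

Note that the adjustment is built only from players who were actually in the league at each age.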

Their conclusions: the accused players outperformed expectations by somewhere between 6 and 12 percent (and even more when Barry Bonds was included in the sample). When the player was compared to his own career only, as opposed to all other players in MLB, the effect was smaller – 3 to 8 percent.

A couple of months ago, J.C. Bradbury criticized the study on the grounds that the authors had not properly separated out the steroid seasons from the non-steroids seasons. His criticism is here, and is followed by SSK's rebuttal, and J.C.'s response to the rebuttal.

My criticisms of the study are a little different. I have two main objections.

First, comparing the accused players to all the others is meaningless. In general, the players accused of juicing tend to be better than average. There are various reasons this could be the case. It could be that power hitters gain the most benefit from steroids, and are therefore most likely to be users. Or it could be that power hitters are more likely to be accused of using, even if they're innocent of the charges.

In any case, suppose that you find that David Segui hit better than, say, Jose Vizcaino, in the years when Segui was said to be on the juice. Why does that qualify as evidence of anything?

The other regressions, the ones where the players are compared only to their own career trajectories, are better – but that brings up my second objection, that the aging adjustments are flawed.

SSK created their aging adjustments by simply observing MLB-wide performance levels at each age. As Bill James pointed out back in the 1982 Abstract, that doesn't work – it severely underestimates the effects of aging because it ignores players who decline so much that they drop out of the league.

Suppose that, after the age of 30, players lose one "unit" of productivity per year. And suppose that once you go below 2, you're out of the league. And suppose there are four 30-year-olds in the league, with productivities of 8, 6, 4, and 2, respectively.

The first year, they perform at 8, 6, 4, 2, for an average of 5.
The second year, the last guy is released. The other three players are at 7, 5, 3, for an average of 5.
The third year, the top three guys are 6, 4, 2, for an average of 4.
The fourth year, they're at 5 and 3, for an average of 4.
The fifth year, they're at 4 and 2, for an average of 3.
The sixth year, the remaining player is at 3.
The seventh and last year, he's at 2.

If you look at these numbers, the average decline is half a unit per season (it fluctuates between a decline of 1 and a decline of 0). But the real decline is 1 unit per year. By ignoring the retired players, you wind up thinking the effects of aging are much smaller than they actually are.
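Here's a minimal sketch of that toy league in Python, assuming the values the year-by-year listing implies: four players starting at 8, 6, 4, and 2, each losing one unit per year, and released once they fall below 2. The function name `observed_averages` is made up for illustration.

```python
def observed_averages(start_levels, decline=1, cutoff=2):
    """League-average productivity by year, counting only players still active."""
    levels = list(start_levels)
    averages = []
    while levels:
        averages.append(sum(levels) / len(levels))
        # Everyone ages one year; anyone who would fall below the cutoff is released.
        levels = [x - decline for x in levels if x - decline >= cutoff]
    return averages


averages = observed_averages([8, 6, 4, 2])
apparent_decline = (averages[0] - averages[-1]) / (len(averages) - 1)
print(averages)          # league average in each of the seven years
print(apparent_decline)  # average year-to-year drop in the league average
```

Every individual player in the simulation loses a full unit each year, but the league average drops by only half of that, because the weakest players keep disappearing from the sample.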

What does this mean for the SSK study? It means that the authors would be too conservative in projecting the effects of steroids. If juiced player X went from 5 units one year to 5.5 units the next, SSK would figure they're 1 unit above where they should be (0.5 unit gain, plus 0.5 units of staying put against the aging current). But, really, X should be pegged at 1.5 units (because the current is really 1 unit, not 0.5 units).

This bias means the results in the paper are probably underestimated. The accused players actually did even better than the authors think they did.

And there's another bias.

If I understand the paper correctly, the authors applied the league values to individual players arithmetically. That is, if the average hitter declined 0.3 runs (per 27 outs) between age 32 and 33, that figure is used to adjust all players. But that number should be higher for better players and lower for worse players, shouldn't it? If the average player drops (say) 0.3 runs, and Barry Bonds is (say) twice as good, shouldn't his expected drop be 0.6? Shouldn't the decline be as a percentage of performance, rather than a fixed number? In fact, since RC27 has increasing returns (instead of being linear), shouldn't Bonds drop even more than twice as much as what the average player drops? Maybe he drops 0.7, or 0.8, or even more.

So if you expect Barry Bonds to drop 0.3, but he really should be dropping 0.7, you're again underestimating the benefit he gets from steroids. Of course, this might only apply to older players, but my impression is that the accused individuals were mostly in the declining phase of their careers. So the study might again underestimate the outperformance of the players accused in the Mitchell Report.

But there's also a third bias, and this one goes the other way – it might lead to overestimates of any outperformance.

The authors assumed that aging curves are the same for all hitters. But, as I think Bill James pointed out a long time ago, power hitters tend to stay active longer, as power and walks are skills that tend to increase well into a player's 30s. Those hitters are less affected by aging than the average player.

So even though the average drop in MLB may be 0.3 runs, that figure might be a combination of 0.1 runs for the power hitters, and 0.5 runs for everyone else. In that case, if the batters mentioned in the Mitchell Report tend to be power hitters – and I think they do – applying league-wide aging patterns will tend to overestimate what their decline "should have" been, and thus exaggerate the discrepancy of their real-life age-adjusted performance. This would provide evidence for the hypothesis that the players are users – but it would be false evidence.

That's the third of the three biases. Here they are again, in summary:

-- the league-wide aging curve ignores players who drop out of the league, so it understates aging, which understates the accused players' outperformance;
-- the aging adjustment is a fixed number rather than a percentage of performance, which again understates the outperformance of the better hitters;
-- the aging curve is assumed to be the same for all hitters, even though power hitters seem to age more slowly, which overstates their outperformance.

Because of these biases, I'd argue that when the regressions find statistically significant coefficients, that does not indicate good evidence of steroid use. It does, perhaps, indicate that the accused players are different from the general population in some way. But that way could be:

The aging adjustments are just too biased, and too rough, to isolate any measure of steroid accusations.

Of course, we don't have a perfect method of making aging adjustments. We don’t even have an excellent one, or a very good one. That means that fixing the study would be a lot of work – you'd have to come up with a model of how players age, and show that it applies, without bias, to various types of players, including those types of players who happen to be over-represented in the Mitchell Report.

That's not likely to happen. Is there any way to get a study like this to work?

I think there might be. For every accused player in the Mitchell report, use the Bill James "paired players" method and find the most similar player not accused, where "similar" includes age, position, era, and recent performance. Then, compare the two players in the alleged steroid year, and see if the accused player outperformed the innocent one.

If you do that, you can ignore the issue of aging completely, since the players are the same age and the same profile. And even if some of your assumptions aren't completely accurate, there probably isn't any reason for them to be biased against the Mitchell players in particular, but not for the players almost exactly similar. So your results are more likely to be meaningful.
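A rough sketch of what that comparison might look like is below. It is a simple nearest-match stand-in, not Bill James's actual paired-players method, and the field names (`season`, `age`, `position`, `prior_rc27`, `rc27`), the one-year age window, and all of the data are hypothetical.

```python
def closest_match(accused, candidates):
    """Pick the non-accused player most similar to `accused`:
    same season and position, age within one year, closest prior RC27."""
    same_profile = [
        c for c in candidates
        if c["season"] == accused["season"]
        and c["position"] == accused["position"]
        and abs(c["age"] - accused["age"]) <= 1
    ]
    if not same_profile:
        return None
    return min(same_profile, key=lambda c: abs(c["prior_rc27"] - accused["prior_rc27"]))


def paired_differences(accused_players, candidates):
    """RC27 of each accused player minus RC27 of his matched comparison player."""
    diffs = []
    for a in accused_players:
        match = closest_match(a, candidates)
        if match is not None:
            diffs.append(a["rc27"] - match["rc27"])
    return diffs


# Made-up players, purely for illustration.
accused = [{"name": "X", "season": 2000, "age": 33, "position": "1B",
            "prior_rc27": 6.0, "rc27": 7.0}]
pool = [
    {"name": "P", "season": 2000, "age": 33, "position": "1B", "prior_rc27": 6.2, "rc27": 5.5},
    {"name": "Q", "season": 2000, "age": 34, "position": "1B", "prior_rc27": 4.0, "rc27": 4.0},
    {"name": "R", "season": 2000, "age": 25, "position": "1B", "prior_rc27": 6.0, "rc27": 8.0},
]
print(paired_differences(accused, pool))
```

If the accused players really did benefit, the differences should be positive more often than not; a sign test or a paired t-test on that list would be one way to put a number on it.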

Of course, then you're not using regression, and it might be harder to get confidence intervals and such. But, I think, you'd be more likely to get closer to the real answer.

Do power pitchers have an advantage in the post-season?

Yesterday, Bill James, over at his website (subscription required), ran a matched-pair study to see if power pitchers do better in the post-season than control (finesse) pitchers. He finds that they do. Given equal W-L records, starts, and runs saved, the strikeout pitchers wind up outperforming the control pitchers, by a fairly significant amount.

In response, Tango argues that Bill's method doesn't properly match the pitchers, because, even though they appear to have similar records and careers in all respects other than Ks and BBs, the study didn't control for BABIP (batting average allowed on balls in play). To match the power pitchers in ERA, the control pitchers would have had to have had a better BABIP. And, given that a good BABIP is mostly luck, that would explain why the control pitchers did worse in the playoffs – their luck just went back to normal.

I agree with Tango's analysis (and with mgl, who is critical of the study in the comments to Tango's post). Another way to put it is that Bill's study is legitimate – it truly does find that, all else being equal, power pitchers do indeed outperform control pitchers in October. But the reason they do so is simply that for a control pitcher to have the same regular-season record as a power pitcher, he has to have been lucky with respect to balls in play. And the luck doesn't carry forward into the future.
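The selection effect is easy to see in a simulation. The sketch below makes the extreme assumption that BABIP is pure luck: every pitcher has the same true BABIP of .300, and each faces 500 balls in play in two separate samples. All of the names and numbers are illustrative, not taken from Bill's study or Tango's post.

```python
import random


def simulate_babip_regression(n_pitchers=2000, balls_in_play=500,
                              true_babip=0.300, seed=1):
    """Select the pitchers with the best first-sample BABIP and return
    (their average first-sample BABIP, their average second-sample BABIP)."""
    rng = random.Random(seed)

    def observed():
        # Every ball in play falls for a hit with the same probability.
        hits = sum(rng.random() < true_babip for _ in range(balls_in_play))
        return hits / balls_in_play

    samples = [(observed(), observed()) for _ in range(n_pitchers)]
    samples.sort(key=lambda s: s[0])       # lowest (best) first-sample BABIP first
    lucky = samples[: n_pitchers // 10]    # the "luckiest" 10 percent
    first = sum(s[0] for s in lucky) / len(lucky)
    second = sum(s[1] for s in lucky) / len(lucky)
    return first, second


first, second = simulate_babip_regression()
print(f"selected group, first sample:  {first:.3f}")
print(f"selected group, second sample: {second:.3f}")
```

Pitchers picked for a good BABIP in the first sample look ordinary in the second, which is the same thing that happens to control pitchers who were matched to power pitchers on regular-season results.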

Twenty years ago, Bill's conclusion would have been valuable to bettors and GMs – it would have told us something new. But, in today's world, sophisticated sabermetricians are already controlling for balls-in-play luck. So, in this case, Bill's study just gives us another confirmation of what we already know about predicting pitcher performance.