Sabermetric Research

Phil Birnbaum

Friday, December 29, 2006

NFL: what can you predict from a team's first two games?

In 2006, the Chicago Bears won their first two games by wide margins: 26-0 on the road, then 34-7 at home. What should this have meant for predicting the rest of the season? Do teams who destroy the opposition in their first two games go on to dominate the league thereafter? Or does their initial dominance turn out to be temporary?

In another excellent study on his blog, Doug Drinen takes on this question.

First, he created a version of Bill James' similarity scores to find the 20 other seasons whose first-two-game scores were most similar to the Bears' 2006 start. (Most similar season: the 1996 Packers, who won their first two games 34-3 and 39-13.) In those 20 seasons, the teams wound up winning an average of 9.8 games.
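Drinen doesn't publish his exact formula, but a James-style similarity score on first-two-game scores works something like this sketch. The scoring scheme (start at 1000, subtract a point per point of difference) is invented for illustration, not Drinen's actual weighting:

```python
def similarity(season_a, season_b):
    """A James-style similarity score: start at 1000 and subtract a
    penalty for each point of difference between the two seasons'
    first two game scores. The weighting here is invented for
    illustration. Each season is (pf_g1, pa_g1, pf_g2, pa_g2)."""
    return 1000 - sum(abs(a - b) for a, b in zip(season_a, season_b))

bears_2006 = (26, 0, 34, 7)     # won 26-0, then 34-7
packers_1996 = (34, 3, 39, 13)  # Drinen's most similar season

print(similarity(bears_2006, packers_1996))  # 978
```

With a measure like this in hand, you just score every historical season against the Bears' start and keep the top 20.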

So the answer to the original question is that the two blowouts don't mark the Bears as one of the best ever -- they predict the Bears are simply a bit better than average. Teams that started like they did won only 7.8 of their next 14 games (.557). The study shows that the Bears are probably a 9-7 team that got a bit lucky.

It's fun to look at the Bears' twenty comparables. The best was the 1984 Dolphins (14-2). But two of those teams won only three games after their two blowouts -- the 1987 Raiders (5-10) and the 1992 Buccaneers (5-11).

Drinen ran the algorithm for all 32 teams, and gives his method's predictions for the rest of 2006. They seem pretty good to me -- but if you asked a bunch of experts for their own predictions two games into the season, you'd probably get something similar.

I don’t know if this means anything or not, but it’s intriguing that the Colts, who have scored a lot of points and also given up a lot, project better than the Chargers, Ravens, and Bears, who have scored a lot and given up almost none.

And I especially like this comment:

In some sense, this exercise is just a whole lot of work to get … the same results you’d get by running a simple regression … But I like this method better, because it’s not a black box.

You say the Bears should expect to win X games this year. Your friend calls BS: haven’t you seen how dominant they’ve looked? If regression is what you’ve got, it’s tough to give a decent counterargument unless he understands regression. But this method lays the reasoning right out there in a crystal clear way … [it's] the same kind of information that your regression was taking into account, but it’s just so much more transparent here.

I agree with Doug 100%. No matter how much you study a regression, even if you've seen hundreds of them in your lifetime, there's always the nagging doubt that the results don't mean what they seem to mean. This study lays it all out in a way that's comprehensible to anyone.

Tuesday, December 26, 2006

NFL: division rival games no closer than out-of-division games

Another nice Doug Drinen study, this time checking conventional wisdom on whether NFL teams in the same division play each other more closely than you'd expect given their records. That is, can you "throw the records out the window" when a mediocre team plays a better division rival?

Drinen lets you "draw your own conclusions." My conclusion is that the records of the interdivision games are pretty much the same as the records of the intradivision games. That is, you can't "throw out the records" because they mean the same thing, regardless of which divisions the teams came from.

For instance, when two teams five games apart in the standings (e.g., 7-3 vs. 2-8) face each other, the results for the team with the better record are:

.737 when teams are in different divisions (101-36)
.707 when teams are in the same division (99-41)

The effect goes in favor of the "throw out the records" theory, but it's pretty small. And, besides, 5-games-apart is only one of many categories. For a 4-game spread, it goes the other way:

.644 when teams are in different divisions (163-90)
.716 when teams are in the same division (184-73)

Skimming the various categories, you'll find that some go one way, some go the other way, and none are way off. You can see the full results at the above link. (And, by the way, only games from week 6 and on were included in the study.)

If you don't trust your eye, it wouldn't be difficult to run a statistical test on the two lists – that chi-squared thing where you add up "(expected-actual)^2 / expected" comes to mind. I'm pretty sure you'd find the difference to be far from statistically significant.
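As a sketch of what that test looks like, here's the chi-squared statistic computed by hand for the 5-games-apart category above, treated as a 2x2 table of wins and losses, in and out of division:

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 table: the sum of
    (observed - expected)^2 / expected, where each expected count
    comes from the row and column totals."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    return sum((obs - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))

# The 5-games-apart category: wins-losses for the better team,
# out of division (101-36) vs. in division (99-41).
stat = chi_squared_2x2([(101, 36), (99, 41)])
print(round(stat, 2))  # 0.31 -- nowhere near the 3.84 cutoff
# for 5% significance at one degree of freedom
```

Sure enough, the statistic for this category is tiny; the other spread categories would get the same treatment.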

Friday, December 22, 2006

Academic peer review can't be counted on

Berri, of course, is one of the economist co-authors of "The Wages of Wins," a book that has had its share of rave reviews, and also its share of criticism.

On the positive side, writer Malcolm Gladwell famously lauded the book in a New Yorker review, and fellow sports economist J.C. Bradbury was similarly praiseful in an academic review. But there were critical reviews from Roland Beech and myself, and critiques of the book's methodology and conclusions appeared in (among other places) an APBRmetrics forum and King Kaufman's column in Salon.

And that's where peer review comes in.

In a recent post on his blog, Berri makes the specific point that his critics have not been peer-reviewed, which is why he is skeptical of the points they make.

He writes,

"Ultimately it is the research in academic forums that we take seriously, and we often are quite skeptical of findings that have not been exposed to this peer review process ... The route those who disagree must follow is ultimately the same academic route as everyone else. He or she will have to demonstrate that they have empirical evidence that comes to a different conclusion. And this empirical evidence would be submitted to a peer review process before it could be published in an academic forum."

"Had [Beech's] review simply appeared on his website ... we would have been inclined to either ignore his comments or respond on our own website ...

"In the end it is easy to sit back and make claims on a website. There is no peer review process. No one will refuse to publish your work because you misstate facts or fail to provide any evidence at all or because your evidence does not support your claims. In an academic setting, one expects a higher standard."

But, is academic peer review really a higher standard? I'm not so sure. Certainly academia is well-versed in the complex statistical techniques some of these studies use. But many of the academic papers I've reviewed in this blog over the last few months nonetheless have serious flaws, flaws large enough to cast doubt over the studies' conclusions. These papers were all peer-reviewed, and all made it to respected journals, without those flaws being spotted.

And sometimes they're obvious flaws. In "The Wages of Wins," the authors quote a study (co-authored by Berri) that checks whether basketball players "rise to the occasion" by playing better in the playoffs. After a regression on a bunch of factors, the study finds that players' playoff statistics actually fall relative to regular season performance. "The very best stars ... tended to perform worse when the games mattered most."

But what they failed to recognize was the obvious fact that, in the playoffs, players are facing only the best opponents. So, of course their aggregate performance should be expected to drop.

I looked up the original study (I got it free from my library, but here's a pay link). It's a nine-page peer-reviewed paper, published in a respected journal. It's got 34 references, acknowledgements of help from three colleagues, and it was presented to a room full of economists at an academic conference.

And nobody caught the most obvious reason for the findings. I'd bet that if Berri had posted his findings to any decent amateur sabermetrics website, it would have been pointed out to him pretty quickly.

Another example: a few years back, three economists found that overall league HBP rates were a few percent higher in the AL than the NL. They wrote a paper about it, and concluded that NL pitchers were less likely to hit batters because they would come to bat later and face retribution.

It's an intriguing conclusion, but wrong. It turned out that HBP rates for non-pitchers were roughly the same in both leagues; the difference was that because NL pitchers hit so poorly, they seldom got plunked.

Think about it. The difference between the AL and NL turned out to be the DH – but no peer reviewer thought of the possibility! I think it's fair to say that wouldn’t happen in the sabermetric community. Again, if you were to post a summary of that paper at, say, Baseball Think Factory, the flaw would be uncovered in, literally, about five minutes.

The point of this is not to criticize these authors for making mistakes – all of us can produce a flawed analysis, or overlook something obvious. (I know I have, many times.) The point is that if peer review can't pick up those obvious flaws, it's not doing its job.

So why is academic peer review so poor? As commenter "Guy" writes in a comment to the previous post about a flawed basketball study:

"But this raises a larger issue that we've discussed before, which is the failure of peer review in sports economics. This paper was published in The Journal of Labor Economics, and Berri says it is "One of the best recent articles written in the field of sports economics." Yet the error you describe [in the post] is so large and so fundamental that we can have no confidence at all in the paper's main finding.... How does this paper get published and cited favorably by economists?"

It's a very good question – but if I were an academic sports economist, I wouldn't wait for an answer. If I cared about the quality of my work, I'd continue to consult colleagues before submitting it -- but I'd also make sure I got my paper looked at by as many good amateur sabermetricians as I could find. It's good to get published, but it's more important to get it right.

Wednesday, December 20, 2006

Do teams "choose to lose" to improve their draft position?

Like other sports leagues, the NBA awards the best draft picks to teams that perform the worst, in order to even out team quality over time.

Up until 1984, the worst teams in each of the two conferences flipped a coin for the number one pick. After that, draft choices went to teams in reverse order of their finish the previous season.

But because this system awarded better picks to worse teams, the NBA worried that this drafting method gave teams an incentive to lose. And so, for 1985, the league changed the rule so that the draft order became a lottery among all non-playoff teams. Once a team knew it was going to miss the playoffs, it would have no further incentive to lose – its draft position would wind up the same either way.

The new system, of course, didn't promote competitive balance as well as the previous one. Therefore, in 1990, the NBA changed the system once more. The draft order would still be determined by lottery, but the worst teams would get a higher probability of winning than the less-bad teams. There would still be some incentive to lose, the theory went, but much less than under the pre-1985 system.

(It's important to understand that the question isn't about whether players deliberately throw games. Teams can decide to increase their chance of losing in other ways -- sitting out stars, playing their bench more, giving good players more time to come back from injury, trying out a different style of defense, playing a little "safer" to avoid getting hurt, and so on.)

The repeated changes to the system provided a perfect natural experiment, and in a paper called "Losing to Win: Tournament Incentives in the National Basketball Association," economists Beck A. Taylor and Justin G. Trogdon check to see if bad teams actually did respond to these incentives – losing more when the losses benefited their draft position, and losing less when it didn't matter. (A subscription is required for the full paper – I was able to download it at my public library.)

The study ran a regression on all games in three different seasons, each representing a different set of incentives: 1983-84, 1984-85, and 1989-90. They (logistically) regressed the probability of winning on several variables: team winning percentage, opposition winning percentage, and dummy variables for home/away/neutral court, whether the team and opposition had clinched a playoff spot, and whether the team and opposition had been mathematically eliminated from the playoffs (as of that game). For the "eliminated" variables, they used different dummies for each of the three seasons. Comparing the different-year dummy coefficients would provide evidence of whether the teams did indeed respond to the incentives facing them.
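A logistic model of the general kind described would look roughly like this sketch. The variable names follow my reading of the setup, and the coefficient values are invented for illustration; they are not the paper's estimates:

```python
import math

def win_probability(coefs, game):
    """Logistic model of the general form described in the paper:
    P(win) = 1 / (1 + exp(-z)). Coefficient values here are
    invented placeholders, not Taylor and Trogdon's estimates."""
    z = coefs["intercept"]
    z += coefs["team_pct"] * game["team_pct"]       # W% to date
    z += coefs["opp_pct"] * game["opp_pct"]         # opponent W% to date
    z += coefs["home"] * game["home"]               # 1 if home court
    z += coefs["eliminated"] * game["eliminated"]   # 1 if out of the race
    z += coefs["opp_eliminated"] * game["opp_eliminated"]
    return 1 / (1 + math.exp(-z))

coefs = dict(intercept=0.0, team_pct=2.0, opp_pct=-2.0,
             home=0.4, eliminated=-1.0, opp_eliminated=1.0)
game = dict(team_pct=0.400, opp_pct=0.600, home=1,
            eliminated=1, opp_eliminated=0)
print(round(win_probability(coefs, game), 3))  # 0.269
```

The key design choice is that "team_pct" is the *season-to-date* winning percentage, recomputed before every game; that choice is what causes the trouble discussed below.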

One of the study's findings was that once teams were eliminated in 1983-84, when the incentive to lose was the strongest, they played worse than you would expect. That year, eliminated teams appeared to be .220 worse than expected from their W-L record.

.220.

That number is huge. Teams mathematically eliminated from the playoffs already have pretty bad records. Suppose they're .400 teams. By the results of this study, after they're eliminated, the authors have them becoming .180 teams! It seems to me, unscientifically, that if these teams – and that's the average team in this situation, not just one or two -- were actually playing .180 ball in a race to the bottom, everyone would have noticed.

The authors don't notice the .180 number explicitly, which is too bad – because if they had, they might also have noticed a flaw in their interpretation of the results.

The flaw is this: in choosing their "winning percentage" measure for their regression, Taylor and Trogdon didn't use the season winning percentage. Instead, they used the team's winning percentage up to that game of the season. For a team that started 1-0, the second entry in the regression data would have pegged them as a 1.000 team.

What that means is that the winning percentages used in the early games of the season are an unreliable measure of the quality of the team. For the late games, the winning percentages will be much more reliable.

For games late in the season, there will be a much higher correlation of winning percentage with victory. And games where a team has been eliminated are all late in the season. Therefore, the "eliminated" variable isn't actually measuring elimination – it's measuring a combination of elimination and late-season games. The way the authors set up the study, there's actually no way to isolate the actual effects of being eliminated.

For instance: the regression treats a 1-2 team the same as a 25-50 team – both are .333. But the 1-2 team is much more likely to win its next game than the 25-50 team. The study sees this as the "not yet eliminated team" playing better than the "already eliminated" team, and assumes it's because the 25-50 team is shirking.
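The reliability problem is easy to see in a quick simulation, assuming nothing more than independent coin-flip game outcomes:

```python
import random

random.seed(1)

def record_pct(true_pct, games):
    """Season-to-date winning percentage after a given number of games,
    treating each game as an independent coin flip."""
    wins = sum(random.random() < true_pct for _ in range(games))
    return wins / games

# 10,000 teams whose true talent is .450. Early in the season, a big
# fraction of them show a record of .333 or worse by luck alone; after
# 75 games, almost none do. So a late-season ".333 team" is far more
# likely to be genuinely bad than an early-season one.
trials = 10000
early = sum(record_pct(0.450, 3) <= 1/3 for _ in range(trials)) / trials
late = sum(record_pct(0.450, 75) <= 1/3 for _ in range(trials)) / trials

print(early, late)  # early is many times larger than late
```

Since eliminated teams only exist late in the season, the "eliminated" dummy soaks up this reliability difference along with any real shirking effect.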

The same pattern holds for the "clinch" variable. Teams that have clinched are .550 teams whose records accurately reflect their talent. Those are better than .550 teams of the 11-9 variety, and that's why teams that have clinched appear to be .023 points better than expected.

The same is true for the "opposition clinched" dummy variable (which comes in at .046 points), and the "opposition eliminated" variable (at .093 points).

All four of the indicator variables for "clinched" and "eliminated" are markers for "winning percentages are more reliable because of sample size." And it's clear from the text that the authors are unaware of this bias.

I'm not sure we can disentangle the two causes, but perhaps we can take a shot.

Suppose a .333 team is facing a .667 team. The first week of the season, the chance of the 1-2 team beating the 2-1 team is maybe .490. The last game of the season, the chance of the (likely to be) 27-54 team beating the (likely to be) 54-27 team is maybe .230. The middle of the season, maybe it's .350, which is what the regression would have found for a "base" value. (I'm guessing at all these numbers, of course.)

So even if eliminated and clinched teams played no differently than ever, the study would still find a difference of .120 just based on the late-season situation. The actual difference the study found was an "eliminated facing clinched" difference of .266 (.046 for "opposition clinched" plus .220 for "team eliminated"). Therefore, by our assumptions, the real effect is .266 minus .120. That's about .150 points – still a lot.
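The arithmetic of that back-of-the-envelope estimate, with all of the guessed inputs labeled as such:

```python
# All inputs here are guesses from the text, not measured values.
base_mid_season = 0.350   # .333 team beats .667 team, mid-season
late_season = 0.230       # same matchup, last game of the year
late_season_bias = base_mid_season - late_season   # about .120

measured = 0.046 + 0.220  # "opposition clinched" + "team eliminated"
real_effect = measured - late_season_bias
print(round(real_effect, 3))  # about .15 -- "still a lot"
```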

But that's a back-of-the-envelope calculation, and I may have done something wrong. I'd be much more comfortable just rerunning the study, but using full-season winning percentage instead of only-up-to-the-moment winning percentage.

Here are how the predicted marginal winning percentage changes for "eliminated" compare to the other seasons:

1983-84: -.220 (as discussed above)
1984-85: -.069
1989-90: -.192

and the changes for "opposition eliminated":

1983-84: +.237
1984-85: +.093
1989-90: +.252

The middle year, 1984-85, is the year the authors expect the "eliminated" effect to be zero – because, once eliminated, there's no further way to improve your draft choice by losing. The results partially conform to expectations – the middle year shows a significantly lower effect than the other two.

The results for that middle year are not statistically significant, in the sense of being different from zero. The authors therefore treat it as zero – "nonplayoff teams were no more likely to lose than playoff-bound teams." I don't agree with that conclusion, as I complained here. However, the effect as seen is not that much different from our (probably unreliable) estimated size of the late-season effect. Subtract the two, and it might turn out that eliminated teams actually played no worse after being eliminated – just as the authors hypothesize.

The standard errors of these middle-year estimates, though, are pretty high. As commenter Guy points out (in a comment to the post I linked to a couple of sentences ago), it would be better if the authors used more than one year's worth of data in studies like these. Although transcribing every NBA game for three seasons must have been a hell of a lot of work – is there a Retrohoop? – I agree that one season just isn't enough.

And, also, it's possible that 1984-85 unfolded in such a way to make the coefficients look different. If, that year, teams played the early part of the season entirely consistently with their eventual record, that would cause the "late-season" factor in the coefficients to be small. That is, if the standings after one week were exactly the same as the standings at season's end, the difference in reliability between late games and early games would be zero. That could account for all the apparent difference in the "eliminated" effect. It's unlikely, but possible – and I don't think there's any easy way to figure out a confidence interval for the effect without running a simulation.

As it stands, my personal feeling is that the authors have found a real effect, but I can't justify that feeling enough that anyone should take my word for it.

My bottom line is that the authors had a great idea, but they failed in their execution. Rerunning the study for more than one season per dummy, and using full-season winning percentages instead of just season-to-date, would probably give a solid answer to this important question.

(Hat tip to David Berri at The Sports Economist.)

Tuesday, December 19, 2006

Dollars increase wealth, but cents don't

Is there a correlation between twenty-dollar bills and money? I think there is. Here's what I did: I took a thousand random middle-class people off the street. I counted how many twenties they had in their wallet, and how much money they had altogether. Then, I ran a regression.

It turns out that there is a strong relationship between twenties and money. The result:

Coefficient of twenties = $19.82 (p=.0000)

Because the coefficient for twenties was very strongly statistically significant, we can say that every twenty dollar bill increases wealth by about $20.00.

I was so excited by this conclusion that I wondered whether a similar result holds for pocket change. So I added quarters to the regression. The result:

Coefficient of quarters = $0.26 (p=.25)

As you can see, the p-value is only .25, much higher than the .05 we need for statistical significance. Since the coefficient for quarters turns out not to be statistically significant, we conclude that there is no evidence for any relationship between quarters and money.

This is surprising – in the popular press, there is a widespread theory that quarters are worth $0.25. But, as this study shows, no statistically significant effect was found, using 1,000 people and the best methodology, so we have to conclude that quarters aren't worth anything.

--------

That sounds ridiculous, doesn't it? We have a regression that shows that quarters are worth about 25 cents, but we treat them as if they're worth zero just because the study wasn't powerful enough to show statistical significance.

But we're biased here because of the choice of example.

So suppose that instead of adding quarters to the equation, we had added something else, something that we could agree was completely irrelevant. Say, number of siblings.

And, suppose that we got exactly the same results for siblings: for each sibling in our random subject's family, he winds up with a 26-cent increase in pocket money, at the same significance level of 0.25. (The result is certainly not farfetched: one in four times, we'd find an effect of at least this magnitude.)

In this case, if we said "there is no evidence for any relationship between siblings and money," that would be quite acceptable.

What's the difference between quarters and siblings? The difference is that there is a good reason to believe that the quarters result is real, but there is no good reason to believe that the siblings result is. By "good reason," I don't mean just our prior intuitive beliefs. Rather, I mean that there's a good reason based partly on the results of the study itself.

The study showed us that twenty-dollar bills were highly significant. We therefore concluded that there was a real relationship between twenties and wealth. But we know, for a fact, that 80 quarters equal one twenty. It is therefore at least reasonable to expect that the effect of 80 quarters should equal the effect of a twenty – or, put another way, that the effect of one quarter should be 1/80 the effect of a twenty. And that was almost exactly what we found.

How does it make sense to accept that twenty dollar bills have an effect, but 1/80ths of twenty-dollar bills do not? It doesn't.

If the convention in these kinds of studies is to treat any non-significant coefficient as zero, I think that's wrong. A reasonable alternative, keeping in mind the "sibling" argument, might be that if a factor turns out to be statistically insignificant, and there is no other reason to suggest there should be a link, only then can you go ahead and revert to zero. But if there are other reasons – like if you're analyzing cents, and you know dollars are significant – reverting to zero can't be right.

Which brings me to a real example: a study attempting to figure out how much an NFL player's production correlates with his draft position. The authors broke production into something similar to dollars and cents.

"Dollars" were the most obvious attributes of the player's skill. Did he make the NFL? Did he play regularly? Did he make the Pro Bowl?

"Cents" is what was left after that. Given that he played regularly but didn't make the Pro Bowl, was he a very good regular or just an average one? If he made the Pro Bowl, was he a superstar Pro Bowler, or simply an excellent player?

The study found strong significance for "dollars" – players drafted early were much more likely to play regularly than players drafted late. They were also more likely to make the Pro Bowl, or to make an NFL roster at all.

But it found less significance for the "cents." The authors did find that players with more "cents" were likely to be better players, but the result was significant only at the 13% level (instead of the required 5%). From this, they concluded

"there is nothing in our data to suggest that former high draft picks are better players than lower draft picks, beyond what is measured in our broad ["dollar"] performance categories."

And that's got to be just plain wrong. There is not "nothing in the data" to suggest the effect is real. There is actually strong evidence in the data – the significance of the other, broader, measure of skill. If you assume that dollars matter, you are forced to admit that cents matter.

There's an expression, "absence of evidence is not evidence of absence." This is especially true when you find weak evidence of a strong effect. If you find a correlation that's significant in football terms, but not significant in statistical terms, your first conclusion should be that your study is insufficiently powerful to be able to tell if what you found is real. Ideally, you would do another study to check, or add a few years of data to make your study more powerful. But it seems to me that you are NOT entitled to automatically conclude that the observed effect is spurious based on the significance level alone, especially when it leads to a logical implausibility, such as that dollars matter but cents don't.

In this particular study, I think that if you do accept the coefficient for cents at face value, instead of calling it zero, you reach the completely opposite conclusion than the authors do.

-----

The reason I'm writing about this again is that I've just found another occasion of it, in the study discussed here (full review to come). The authors come up with an estimate of a coefficient for three separate NBA seasons. The coefficient is (to oversimplify a bit) the amount by which you'd expect a team that's been eliminated from the playoffs to underperform in winning percentage.

Their three results are .220 (significant), .069 (not significant), and .192 (significant).

My conclusion would be to say that the effect of being eliminated in the middle season is much lower than the effect in the other two seasons, and then to check whether the difference between seasons was statistically significant. I would point out, for the record, that the .069 is not significantly different from zero.

But the study's authors go further – they say that the .069 should be arbitrarily treated as if it were actually .000:

"Our results show that teams that were eliminated from the playoffs [in the .069 year] were no more likely to lose than noneliminated teams." [emphasis mine.]

That is simply not true. The results show those teams were .069 more likely to lose than noneliminated teams.

That is: given that our prior understanding of how basketball works, given the structure of the study (which I'll get to in a future post), given that the study shows a strong effect for other years, and given that the effects are in the same direction, there is certainly enough evidence that the .069 is likely closer to the "real" value than zero is.

In this case, the conclusions of the study -- that the second year is different from the other two -- don't really change. But the conclusion turns out much more punchy when the authors say there is no effect, instead of a small one.

Saturday, December 16, 2006

NHL faceoff skill adjusted for strength of opposition

Last season, Yanic Perreault led the NHL in faceoff winning percentage with 62.2% (559-340). But might that be because of the quality of opposition he faced? Maybe Perreault took more faceoffs against inferior opponents, and that inflated his numbers.

Meanwhile, Sidney Crosby was one of the worst in the league, winning faceoffs at only a 45.5% rate. Was he really that bad, or were his numbers lowered because he faced a lot of skilled opposition front-line centers?

In this study, Javageek tries to figure that out. He assumed that each player in the league has an intrinsic faceoff winning percentage. Then, he assumed that when a player faces another, his chance of winning is determined by a pythagorean projection. (The log5 method, explained here, might have been a better choice, but I don't think it matters a whole lot.)
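For reference, here's what the log5 calculation looks like; the pythagorean version Javageek actually used gives similar numbers:

```python
def log5(p_a, p_b):
    """Bill James's log5: the probability that a player with intrinsic
    winning percentage p_a beats one with intrinsic percentage p_b."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + p_b * (1 - p_a))

# A Perreault-like .622 faceoff man against a league-average opponent:
print(round(log5(0.622, 0.500), 3))  # 0.622 -- average opposition
# leaves his raw rate unchanged

# Against a weaker .455 opponent, he should beat his raw rate:
print(round(log5(0.622, 0.455), 3))  # 0.663
```

A handy property of log5 is that a .500 opponent always leaves a player's percentage unchanged, which is why the adjustment only matters to the extent a player's opposition differs from average.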

He then took 121 faceoff men, and looked at their records against each other. That's 7,260 possible faceoff pairs. Javageek figured out (confession: I didn't really read the algebra) that the question could be answered by solving 121 equations in 121 unknowns. He did that, and came up with an adjusted faceoff percentage for each player, corrected for the quality of opposition.

Bottom line: the adjustment doesn't matter much. Javageek didn't give any metrics comparing actual vs. theoretical, but a look at the two charts shows that in most cases, they're almost the same. Find any player in the left (theoretical) column, and he can be found not too far away in the right (actual) column. Yanic Perreault still leads, with an adjusted 63.8%, and Sidney Crosby drops to 43.8%.

It's important to note that these are still not the most appropriate estimators of the players' actual faceoff skills – you still have to regress to the mean to get an estimate of their true talent. You'd probably want to use this technique to do that.
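A standard way to do that regression to the mean is to add some number of "phantom" league-average faceoffs to each player's record. The right number would have to be estimated from year-to-year data; the 300 used below is just a placeholder:

```python
def regress_to_mean(wins, losses, league_mean=0.5, prior_n=300):
    """Shrink an observed faceoff percentage toward the league mean by
    adding prior_n 'phantom' league-average faceoffs. The value 300 is
    an invented placeholder, not an estimated figure."""
    return (wins + prior_n * league_mean) / (wins + losses + prior_n)

# Perreault's 559-340 season, shrunk toward .500:
print(round(regress_to_mean(559, 340), 3))  # 0.591
```

Note how a player with very few faceoffs would get pulled nearly all the way to .500, while a 900-faceoff season keeps most of its signal.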

One thing that bothers me a bit about the results is that the top players become less extreme after the adjustment, but the bottom players become more extreme. You'd expect both halves of the data to be less extreme -- closer to the mean -- after you've adjusted out some of the luck. We don't see that with the bottom players, and I'm not sure what to make of that.

Thursday, December 14, 2006

NFL teams win more often with 13 points than with 14 points

NFL teams who score exactly 13 points in a game win more often than teams who score exactly 14 points.

That bit of information comes from a Chase Stuart post at pro-football-reference.com here.

Overall, teams who scored 13 points were .285 in those games. Teams who scored 14 were only .199. The same unexpected result holds for 20 and 21 points: 20-point teams were .566, but 21-point teams were .461. (There are other such pairs, too: see the study for details.)

It turns out the reason for the anomaly is that the lower-scoring teams held down opposition offenses much better than the higher-scoring teams. For instance, the 20-point teams held their opponents to 19 or less 56% of the time, but the 21-point teams did that only 41% of the time.

So it's not that the extra point hurts you, it's that it somehow makes the opposition score more. Why might that be? Stuart suggests it might be time of possession. A 20-point team (most likely) scored on four possessions, while a 21-point team scored on three. The three-possession game gave the other team more time to score and beat them.

I'm not sure that's right: teams get roughly an equal number of possessions, regardless of time – so a team with low time of possession contributed to that with bad offense and/or bad defense. That is, it seems to me that how much you score causes time of possession, not the other way around.

Another theory by Doug Drinen is that a team that's behind by four or more late in the game won't go for a field goal. Instead, they'll gamble on fourth down. Therefore, teams with more field goals are more likely to be leading (and eventually winning) than those with fewer field goals. Put another way, losing teams score more touchdowns and fewer field goals than you would expect, which makes touchdown teams slightly less likely to win than field-goal teams with identical points.

I think Doug's right, and that's most of the answer. Also related is the fact that teams are more likely to kick game-winning field goals with no time left than score game winning touchdowns.

There's also a follow-up post from Doug, in which he finds something even more interesting – teams who score 13 points are not just more likely to win that particular game, but are also more likely to win games in the future! He writes,

Home teams that scored 13 points won 46.8% of the rest of their games (N=389).
Home teams that scored 14 points won 44.8% of the rest of their games (N=422).

However, the effect in this study wasn't as strong as in the single-game study (as you might expect), and, in fact, it didn't hold for 20 and 21 points at all (with future winning percentages .481 and .494, respectively).

And, even in the 13/14 case, winning was much more important than points. Teams that lost with 13 or 14 points had a much lower future winning percentage than teams who won with 13 or 14 points. That strongly suggests that it's the likelihood of winning that "causes" the team to score only 13, and the likelihood of losing that "causes" the 14.

Again, I think Doug's don't-go-for-3-when-behind theory is the right one.

(Thanks to curling aficionado Bob Timmermann for an e-mail linking to the study.)

Wednesday, December 13, 2006

Estimated salary differences for NFL positions

I always thought it was conventional wisdom in football that quarterbacks were paid much, much more than the offensive linemen who protect them. At least, that's what I inferred from "The Blind Side." The book talks about how left tackles used to be underrewarded until recently, when the NFL suddenly realized that left tackle is the second most important position on the team.

But, according to the Massey/Thaler study of the NFL draft, the salary differential isn't that huge. Actually, the study doesn't give any detail on actual free-agent salaries by position, but it does give estimated salaries, based on their regression. Since they used dummy variables for each position, the differences between positions should be reasonably reliable. I'd expect the actual numbers to be not too far off either.

Here's their chart of estimated salary (base plus bonus) by position, for 1996-2002, for hypothetical sixth-year players who made the Pro Bowl in each of their first five years. Players in their sixth year should all be earning free-market salaries. (All salaries are in 2002 dollars.)

DB $6,192,617
DL $7,103,115
LB $6,117,273
OL $5,985,725
QB $9,208,248
RB $6,071,787
TE $5,781,453
WR $6,779,927

The best offensive linemen project to make almost two-thirds as much as the best quarterbacks. I would have thought the difference would be bigger than that. In terms of fame, all the members of the offensive line combined probably get 10% of the media mentions that the quarterback does.

There's also a chart that breaks out the more recent years, 2000-2002; in that time frame, offensive linemen earn only 54% as much as quarterbacks. The "Blind Side" story would lead you to expect the gap to be narrowing, not widening.

Those numbers are somewhat unrealistic, because they assume the player made the Pro Bowl all of his first five years in the NFL. Here's the same chart, from the same study, but for (again hypothetical) 6th-year players who started 8 or more games each of their first five years but never made the Pro Bowl:

Monday, December 11, 2006

Curlometrics

Sabermetrics has come to curling.

A Bob Weeks column in today's Globe and Mail describes a curling sabermetrics project (and book) by Dallas Bittle and Gerry Geurts. The pair operate the curlingzone.com website, on which the article is reprinted here.

For those who know even less about curling than I do, here's a quick summary. This is from memory. There's probably lots of stuff wrong in it, because I don't watch curling much.

Curling is like shuffleboard on ice. Teams take turns sliding rocks towards a bull's-eye area at the other end of the rink, eight rocks for each team. (The area with the bull's-eye is called the "house," and the bull's-eye itself is the "button.") After all sixteen rocks are thrown, the team with the rock closest to the button gets a point. It gets an additional point for each additional rock that's closer than any of the opponent's rocks. So if the rocks closest to the button are, in order, red, red, red, and yellow, the red team gets three points. Sometimes there are no rocks left in scoring territory, in which case both teams score zero.
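The scoring rule just described can be written as a short function. This is a minimal sketch of my own (not from the article), taking each team's rock distances from the button, counting only rocks still in the house:

```python
def score_end(red_distances, yellow_distances):
    """Score one end of curling. Each list holds distances from the
    button for that team's rocks still in the house (closer = better).
    Returns (scoring_team, points); (None, 0) if the house is empty."""
    if not red_distances and not yellow_distances:
        return (None, 0)          # no rocks in scoring territory
    if not yellow_distances:
        return ("red", len(red_distances))
    if not red_distances:
        return ("yellow", len(yellow_distances))
    best_red, best_yellow = min(red_distances), min(yellow_distances)
    if best_red < best_yellow:
        # Red scores one point per rock closer than yellow's best.
        return ("red", sum(1 for d in red_distances if d < best_yellow))
    return ("yellow", sum(1 for d in yellow_distances if d < best_red))

# The example from the text: closest rocks in order are red, red, red, yellow.
print(score_end([1.0, 2.0, 3.0], [4.0]))  # ('red', 3)
```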

I think it's called "curling" because if the rock is spun when slid, it will hook to one side instead of sliding in a straight line. This allows curlers to curl one rock in behind another that would otherwise be in the way. When the rock is sliding, the other team's players can sweep the ice in front of the sliding stone. This causes the rock to curl less, and also to go faster. The player who threw the stone (or the team's "skip," which is the player/manager/captain guy) will watch the stone's progress and scream at the sweepers to tell them when to sweep more vigorously ("hard, hard, haaaaaaaaaard!").

Each sixteen-rock sequence is called an "end," which I think of as like an inning in baseball. A game is ten ends. The team that scored points in the previous end has to go first in the next end. This is a disadvantage, as the team with the last rock usually can find a way to score. The advantage of last rock is called "the hammer." When the team without the hammer is the one that scores, it's called a "steal."

There are four players on each team; each throws two consecutive stones. The positions are named "first", "second", "third", and "skip." The skip is the best player on the team – he or she goes last, and therefore gets all the glory. The guy that goes first doesn't seem to me to have a very interesting job, but I'm sure real curlers would disagree.

In Canada, the biggest men's tournament of the year is called the "Brier," or, officially, the "Tim Hortons Brier." Women have the "Scott Tournament of Hearts." (Tim Hortons makes the best coffee on earth. Scott is the toilet paper company. "Hearts" are little red shapes stereotypically associated with women – I don't know why men don't get their own stereotype, like the "Tim Hortons Tournament of Tonka Toys" or some such.)

"Bittle's newest measure is something he calls the power triad, a combination of three statistics. There's hammer efficiency [how often a team scores more than one when they have the hammer], steal efficiency (how often it steals a point...), and scoring differential.

"They also devised new ways to score individuals ... that ... present a more detailed look at a player's skills."

I couldn't find anything about those statistics on the website, but that might be because curlstat.com, which is linked to, is down. But there's a link to the book ($18.95 Canadian).

Also on the website is a heading, "Curling with Math." It's got one article in it, calculating the best strategy for a given situation in the ninth end. One interesting thing is that it gives some win probabilities for the tenth end:

Odds of winning if tied with hammer (x) = 75.7%
Odds of winning if one down with hammer (y) = 39.5%
Odds of winning if two down with hammer (z) = 11.7%

These aren't probabilities for the end itself, because I think they include the chances of tying the game in the 10th and winning in extra ends.
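Those numbers are enough to illustrate the kind of ninth-end decision the article analyzes. For instance, a team tied in the ninth with hammer can either take a single point (going one up, but surrendering the hammer) or "blank" the end (staying tied, but keeping the hammer). A sketch using the article's figures:

```python
# Tenth-end win probabilities from the curlingzone article.
p_tied_with_hammer = 0.757      # x
p_one_down_with_hammer = 0.395  # y

# Option A: take one point in the ninth -> up one, opponent gets hammer.
# Your win probability is the complement of theirs (one down, with hammer).
take_one = 1 - p_one_down_with_hammer   # 0.605

# Option B: blank the ninth -> still tied, and you keep the hammer.
blank = p_tied_with_hammer              # 0.757

print(f"take one: {take_one:.3f}, blank: {blank:.3f}")
```

By these numbers, blanking is clearly better: being tied with the hammer beats leading by one without it, which is exactly the kind of non-obvious result win probabilities make easy to see.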

The fact that the article casually throws win probabilities around suggests that the field of curling sabermetrics is reasonably well-advanced. I'm looking forward to reading more about it.

Friday, December 08, 2006

Underhanded free throws

Would NBA foul shooters hit for a higher percentage if they threw underhanded?

In this column (print subscription required) from this week's Sports Illustrated, Rick Reilly says they would.

Using himself as guinea pig, Reilly took a bunch of shots overhand and found he hit 63%. After tutoring in underhand by hall-of-fame NBA player (and underhanded free thrower) Rick Barry, and a couple of weeks of practice, he was hitting 78%.

Reilly points out that if Ben Wallace, a career 49% shooter, learned to throw underhand and raised himself to 69%, he'd have made 60 more shots last season.

Why don't players try it? Players don't like how it looks. "I would shoot negative percentage before I shot like that," Reilly quotes Shaquille O'Neal as saying. He says Wilt Chamberlain did it for a few years, improved, but then went back to overhand. "I felt silly – like a sissy," Chamberlain wrote.

Reilly writes,

"I ... asked [a few players] a simple question: 'What would it take to get you to shoot free throws like Rick Barry?' Not one called me back. Or e-mailed. Or texted. ... None.

"Do you know why? Because NBA players care more about looking cool on SportsCenter than winning games for their teams."

But: does the technique actually work better?

It worked for Barry himself, whose career mark was 90%, second all-time. In 1979, he went 160-for-169. (This page links to a video of Barry taking an underhanded shot.)

And here is an article from "Discover" arguing that it does work, for reasons of physics.

So there's fairly convincing evidence that shooting underhand can work. And it can probably create a lot of wins. The Wages of Wins says that it takes an extra 30 points to add one win. If Reilly and Barry are correct, Ben Wallace could create two extra wins for his team just by switching to the "granny shot." Two wins is worth several million dollars in salary, isn't it?
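The back-of-envelope arithmetic behind that claim, as a quick sketch (the 300-attempt figure is implied by Reilly's numbers, not stated in the column):

```python
# Ben Wallace's hypothetical improvement, per Reilly's column.
old_pct, new_pct = 0.49, 0.69   # career pct vs. hypothetical underhand pct
extra_makes = 60                # Reilly's "60 more shots last season"

# The 60-shot figure implies roughly 300 attempts:
# 60 / (0.69 - 0.49) = 300 (an inference, not a number from the column).
implied_attempts = round(extra_makes / (new_pct - old_pct))

# The Wages of Wins rule of thumb: about 30 extra points buys one win.
points_per_win = 30
extra_wins = extra_makes / points_per_win   # each free throw is 1 point

print(implied_attempts, extra_wins)  # 300 2.0
```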

And even if Wallace and Shaq refuse, why doesn't someone try it? Is there really such a strong cultural taboo? Is this another market irrationality?

The Massey/Thaler study of the NFL draft is quite readable, and you can safely ignore the more complex-looking math (as I did).

Massey and Thaler (M&T) start by identifying the market value of a draft choice. Their method is ingenious – they just look at all cases where teams traded draft choices for other draft choices. They find that draft choice values are very consistent; indeed, teams have internalized these rules, to the extent that they develop charts of their relative values. Each team has almost the same chart, and so when M&T superimpose actual trades on their theoretical curve, they fit almost perfectly. That is, all thirty teams in the league have independently reached the same conclusions about what draft choices are worth – or at least act as if they have. It turns out, for instance, that teams think the first pick is worth the 10th and 11th picks combined, or the sum of the last four picks of the first round.

But M&T conclude that all thirty teams are wrong.

Here's what they did. They divided all drafted players into one of five groups, based on their status for a given season: (1) not on the roster; (2) on the roster but did not start any games; (3) started between 1 and 8 games; (4) started more than 8 games but didn’t make the Pro Bowl; and (5) made the Pro Bowl.

Then, they ran a regression on free agent salaries, to predict what a player in each group at each position should earn. Just for fun, here are the values for quarterbacks:

$0 ........... not on the roster
$1,039,870 ... on the roster but didn't start
$1,129,260 ... started between 1 and 8 games
$4,525,227 ... started more than 8 games
$9,208,248 ... made the Pro Bowl

Then, for each draft position, they computed the average free-agent value for the player, and compared it to the salary he was actually paid. So, a Pro Bowl quarterback draftee who made only $4 million would have earned the team a surplus of $5,208,248.
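The surplus calculation is then straightforward. A minimal sketch using the quarterback values above (the $4 million salary is the hypothetical from the text, and the category labels are my own shorthand):

```python
# Estimated free-agent value by performance category for quarterbacks,
# from the Massey/Thaler regression (2002 dollars).
qb_value = {
    "not on roster": 0,
    "no starts": 1_039_870,
    "1-8 starts": 1_129_260,
    "8+ starts": 4_525_227,
    "Pro Bowl": 9_208_248,
}

def surplus(category, actual_salary, values=qb_value):
    """Value the team captures: what a free agent of this caliber
    would cost, minus what the drafted player was actually paid."""
    return values[category] - actual_salary

print(surplus("Pro Bowl", 4_000_000))  # 5208248
```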

As it turns out, for their first five seasons in the league (free agency begins in year six), drafted players produced an average surplus of about $470,000 per year. The surprise is that you'd expect the early picks to be the most valuable. But they're not. The surplus is highest roughly between picks 25 and 75 (about $700,000). It's lower for the first few picks. In fact, the number one pick in the draft produces a surplus of only about $500,000.

That's because there's a rookie salary scale that's based on draft position, and it's very steep – first picks make a lot more than, say, tenth picks. And so, although first picks turn out to be better players than later picks, they are also paid much more. The pay difference is higher than the ability difference, and so first picks don't turn out to be such big bargains after all.

And this is why M&T argue that teams are irrational. To get a single first pick overall, teams are willing to give up a 27th pick, plus a 28th pick, plus a 29th pick, plus a 30th pick. Any one of those four later picks is worth more than the number one pick. To trade four more valuable picks for one less valuable pick, the authors say, is clearly irrational – caused by "non-regressive predictions, overconfidence, the winner's curse, and false consensus."

I'm somewhat convinced, but not completely.

My problem with the analysis is that the authors (admittedly) use "crude performance measures" in their salary regression. Their five performance categories are extremely rough. Specifically, the fourth category, starters who don't make the Pro Bowl, contains players of extremely different capabilities. If you treat them the same, then you are likely to find that (say) the 20th pick is not much different from the 30th pick – both will give you roughly the same shot as a regular. It may turn out that the 20th pick will give you a significantly *better* regular, but the M&T methodology can't distinguish the players unless one of them makes the Pro Bowl.

(For readers who (like me) don't know football players very well, consider a baseball analogy. Suppose AL shortstops Derek Jeter and Carlos Guillen go to the All-Star game. A study like this would then consider Miguel Tejada equal to Angel Berroa, since each started half their team's games, and neither was an All-Star. Of course, Tejada is really much, much better than Berroa.)

In fact, the study does note the difference in quality, but ignores it. The salary regression includes a term for where players were picked in the draft. They found that, holding everything else equal, including the player's category, players drafted early make more money than players drafted late. Part of that, no doubt, is that players drafted early can negotiate better first-year contracts. But, presumably, for years 2-5, players are paid on performance, so if early draftees make more than late draftees in the same category, that does suggest that players drafted earlier are better.

But M&T don't consider this factor. Why? Because the regression coefficient doesn't come out statistically significant. For their largest sample, the coefficient is only 1.7 standard deviations above the mean, short of the 2 SDs or so required for significance. And so, they ignore it entirely.

This may be standard procedure for studies of this sort, but I don't agree with it. First, there's a very strong reason to believe that there is a positive relationship between the variables (draft choice and performance). Second, a significant positive relationship was found between draft choice and performance in another part of their study (draft choice vs. category). Third, it could be that the coefficient is strongly positive in a football sense (I can't tell from the study -- they don't say what the draft variable is denominated in). Fourth, the coefficient was close to statistical significance. Fifth (and perhaps this is the same as the first point), ignoring the coefficient assumes that all non-Pro-Bowl starters are the same, which is not realistic. And, finally, and most importantly, using the coefficient, instead of rejecting it and using zero instead, might significantly affect the conclusions of the study.

What the authors have shown is that if you consider Miguel Tejada equal to Angel Berroa, draft choice doesn't matter much. That's true, but not all that relevant.

There's a second reason for skepticism, too. The author's draft-choice trade curve is based on trades that actually happened. Most of those trades involve a team moving up only a few positions in the draft – maybe half a round. But a team won't make that kind of trade just for speculation; they'll make it because there's a specific player they're interested in. It's quite possible that a first pick is worth four later picks only in those cases when the first pick is particularly good. By evaluating trades as exchanges of average picks for average picks, the authors might be missing that possibility.

It wouldn't be hard to check – just find all those trades, see the players that were actually drafted in the positions traded, and see how the actual surpluses worked out. It could be that there's no significant difference between random trades and real trades – but shouldn't you check to be sure?

M&T do give one real-life example. In the 2004 draft, the Giants had the fourth pick (expected to be Philip Rivers). They could trade up for the first pick (Eli Manning), or they could trade down for the seventh pick (Ben Roethlisberger). Which trade, if any, should they have made? According to the historical trends the authors found, they should have traded down – a seventh pick is actually worth more than a fourth, and the Giants would have even received an extra second-round pick as a bonus! But, in this specific case, the Giants would have to consider the relative talents of the actual three players involved. The authors assume that Manning, Rivers, and Roethlisberger are exactly as talented as the average first, fourth, and seventh picks. But not every draft is the same, not every team is equal in their player evaluations, and, most importantly, you can't assume that untraded draft choices are the same as traded draft choices.

So I'm not completely convinced by this study. But I'm not completely unconvinced either. I think there's enough evidence to show that high draft picks aren't all they're cracked up to be. But, because the authors' talent evaluations are so rough precisely where they're most important, I think there's a possibility their actual numbers may be off by quite a bit.

Friday, December 01, 2006

Alan Ryder on NHL "Offensive Engagement"

Last season, it seemed like Dean McAmmond's linemates couldn't get anything done without him. McAmmond got a point on 88% of the even-strength goals the Blues scored when he was on the ice.

Chris Campoli was the Dean McAmmond of defensemen. Campoli led the league in this category among blueliners by scoring or assisting on 52% of his plus-minus "plusses".

These figures are from Alan Ryder's latest article on globesports.com. Ryder believes that players who excel in this category, which he calls "offensive engagement" (OE), are showing themselves to be capable of playing higher in the depth chart than they already play. Of course, many players high in this category (Jagr at 85%, Sundin at 80%) are already treated like star players. But Ryder argues that guys like McAmmond show themselves to be better than their point totals indicate – "close your eyes and imagine an eagle trying to soar with the turkeys."

Seems logical, but there's probably a fair bit of random luck in these stats, too. Campoli's league-high OE is based on 21 points out of 40 plusses. If the league average for defensemen is 35% (I'm guessing here, because Ryder doesn't tell us), Campoli is only 7 points above average. That's 2.3 standard deviations – significant for a randomly-chosen player, but perhaps just luck for the league leader.
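That back-of-envelope significance check, treating each of Campoli's 40 plusses as an independent trial with an assumed 35% chance of him figuring in the goal (the 35% is my guess from above, not Ryder's number):

```python
import math

n = 40          # Campoli's plusses
successes = 21  # goals he scored or assisted on
p = 0.35        # assumed league-average OE for defensemen

expected = n * p                   # 14 points
sd = math.sqrt(n * p * (1 - p))    # binomial standard deviation, ~3.0
z = (successes - expected) / sd    # ~2.3 standard deviations

print(f"expected {expected:.0f}, sd {sd:.2f}, z = {z:.2f}")
```

A z of 2.3 would be notable for one player chosen in advance, but picking the league leader after the fact means you've already selected for the luckiest tail of the distribution.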

In any case, it shouldn't be too hard to do a study to check if players high in OE tend to show improvement in the future.