Turning the table

We recently showed an example in which a data table clarified the data. Last week, the New York Times ran one that did the opposite.

The accompanying article boldly claimed that

the 40-yard dash stands above them all as having the strongest correlation to success in the NFL. The three-cone drill, the shuttle run, the bench press -- none correlate to NFL success. The 40 is king.

Further, it cited Bill Barnwell of FootballOutsiders.com, who created an "index" using both 40 time and body weight that is "an even better predictor than 40 time alone". In other words, this formula

does the trick.

The data table, shown above, presumably clinched the case.

We were mystified when we put the data to the test, however. Among the set of 15 running backs, the Index did not predict the Yards Per Carry at all! The Index explained only 8% of the variation in Yards Per Carry between the backs.
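For readers who want to check a figure like the 8% themselves, here is a minimal sketch in Python. With a single predictor and a straight-line fit, the share of variation explained (R-squared) is the square of the Pearson correlation, so 8% corresponds to a correlation of only about 0.28 in absolute value. The function names are ours, and no values from the Times table are reproduced here; the Index and Yards Per Carry columns would have to be typed in from the table.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant column has no defined correlation")
    return sxy / sqrt(sxx * syy)


def r_squared(xs, ys):
    """Share of variation in ys explained by a straight-line fit on xs."""
    return pearson_r(xs, ys) ** 2
```

Calling `r_squared(index_column, yards_per_carry_column)` on the 15 rows would give the proportion quoted above; both argument names are placeholders for lists you supply.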

The data table obscures this bivariate relationship. Because the table is sorted by the Index, we would expect the Yards Per Carry column to fall in roughly the same order if the Index were predictive. But it is hard to tell the trend from the noise in a table.

What went wrong? It turned out neither 40 Time nor Body Weight had any relationship with Yards Per Carry.

These variables did not explain the range of Yards Per Carry attained by this set of running backs.

Finally, we found a strong correlation between 40 Time and Body Weight. (The heavier you are, the slower you run!) This meant that both variables contained similar information, so any formula combining the two would be unlikely to perform significantly better than either variable alone.
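The overlap argument can be made concrete for the simplest case, a linear combination of two predictors. The sketch below uses the textbook formula for the R-squared of a two-predictor linear regression, written in terms of the three pairwise correlations. The correlation values in the usage lines are made up for illustration and are not taken from the Times table, and Barnwell's index is not necessarily a linear formula, so treat this as an intuition aid rather than a test of his index.

```python
def two_predictor_r_squared(r_y1, r_y2, r_12):
    """R-squared of a linear regression of y on x1 and x2.

    r_y1, r_y2: correlations of each predictor with the outcome.
    r_12: correlation between the two predictors.
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors are perfectly correlated; use just one")
    # The three correlations must be able to occur together, i.e. the
    # 3x3 correlation matrix must have a non-negative determinant.
    det = 1 + 2 * r_y1 * r_y2 * r_12 - r_y1**2 - r_y2**2 - r_12**2
    if det < 0:
        raise ValueError("these three correlations cannot occur together")
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


# Hypothetical values: each predictor alone explains 9% of the variation.
separate = two_predictor_r_squared(-0.3, -0.3, 0.0)     # predictors unrelated
overlapping = two_predictor_r_squared(-0.3, -0.3, 0.9)  # predictors overlap heavily
```

With unrelated predictors the two contributions simply add up to 18%. With the heavily overlapping pair, the combined figure comes out only slightly above the 9% that either predictor achieves alone. The exception is when the two predictors pull in different directions on the outcome despite being correlated with each other; in that case a combination can do noticeably better, which the same function will show if you feed it such values.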

So we are left to turn the table on the table. More pertinent evidence is needed to prove the case.

The entire analysis suffers from survivorship bias, as only the top running backs are examined, and no adjustment is made for the widely varying lengths of their careers. Apparently, there is more data available in a book. There is no indication of how the model shown above was validated.
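One reason a top-only sample is weak evidence either way can be seen in a small simulation. The sketch below generates synthetic pairs with a known correlation, then recomputes the correlation using only the cases with the highest predictor values. Everything here is an assumption made for illustration: the data are random numbers, not football statistics, and the sample size, the true correlation of 0.5, and the 10% cutoff are arbitrary choices. It also selects on the predictor, which is only one of several ways a "top players" list might be formed.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n=5000, rho=0.5, top_fraction=0.10, seed=0):
    """Return (correlation in full sample, correlation in top slice by x)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        # y is built so that its correlation with x is rho in the population.
        y = rho * x + sqrt(1 - rho**2) * rng.gauss(0, 1)
        pairs.append((x, y))
    # Keep only the cases with the largest x, mimicking a "top of the list" table.
    pairs.sort(key=lambda p: p[0], reverse=True)
    top = pairs[: max(2, int(n * top_fraction))]
    full_r = pearson_r(*zip(*pairs))
    top_r = pearson_r(*zip(*top))
    return full_r, top_r


full_r, top_r = simulate()
```

Under these assumptions the correlation within the top slice comes out well below the correlation in the full sample, even though the underlying relationship is the same. A table restricted to the best performers can therefore neither establish nor rule out a relationship that holds across all running backs.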

Reference: "The Race of Truth: 40-Yard Times Can Tell the Future", New York Times, April 27, 2008.

Comments

You've got three problems with your analysis:

1) Y/C isn't the best measure of an RB's skill. Football Outsiders' DPAR and DVOA have some problems as well, but they're much better metrics.

2) You're mixing all RBs together. There are plenty of RBs who will never be successful for reasons other than their 40-yard index score. If you eliminate all RBs likely to fail for those reasons, or at least group them together based on those criteria, Barnwell's index becomes much more predictive.

3) It's not a linear variable. RBs with index scores above ~93.0 have the physical tools to succeed in the NFL. Those below 88.0 or so almost certainly do not. In between, it becomes much more a matter of the situation the RB is in.

I agree with the first comment. This post seems like a pretty reckless interpretation of both the article and the table.

First, let's look at the article:

The statement made in the Times article is that Barnwell's equation yields "a number that is, on average, about 100 for an N.F.L. running back, with big, fast players having higher numbers and small, slow players having lower numbers." The claim is that there is a range of index values, with 100 as the average, and that the index has some relationship to body weight and speed.

Now the table:

The table shows the top 15 players in the last 10 years, all of whom have an index higher than 115. This is an extremely small subset, with an average index of 119, almost 20% higher than the average N.F.L. running back. The table clearly states that this is a small subset of the highest index values.

And now let's look at your analysis:

You say that the table presumably contains all the data needed to "clinch the case" and prove that the average index is 100, and that higher numbers are equated with more success. The table itself does not claim to contain enough data for a proof, nor does the article. In fact, nowhere does it say that the table is meant to explicitly prove the correlation between index and success.

You then claim that the index should directly predict yards per carry. Nowhere in the article or the table is this stated, so it's puzzling where you get that assumption from. (Yes, yards per carry is one measure of success, but it's unclear how you get from there to expecting a linear relationship among the top 15 values in a large data set.) You then make a chart to disprove your own assumption.

After that, you go on to extrapolate a relationship between 40 time and weight, again using only the top 15 data points from a large data set. We already know that the index has, by definition, some relationship with weight, but claiming that weight alone predicts speed is a surprisingly broad claim given only 15 data points.

What are the flaws here? You take a small set of maximum values and treat it as a complete set of data. You base an analysis on only 15 data points. You claim to disprove linear relationships where none are claimed to exist. You don't make any attempt to acquire the full set of data.