The Winning Ways of Winners

Beatpaths Historical Algorithm Comparison

Every year I pick an “official” algorithm. This year’s official algorithm was a basic vanilla algorithm, mostly because I realized I didn’t like the sudden dramatic shifts in rankings in the “beatflukes” variant that I used last year. Throughout the season, a couple more variants were developed and discussed in the comment threads. The following was submitted by MOOSE:

In preparation for the Super Bowl, I have run historical numbers for all games, including the playoffs, for the Super Bowl contenders (from 1970 forward). The purpose of these trials was to evaluate how well each algorithm predicted the result of the Super Bowl. For each Super Bowl, the rating for each contending team was calculated through its respective Conference Championship game, and the difference between the two ratings was used to determine how strongly favored a team should be. If the difference was greater than 4, I classified it as the algorithm guaranteeing a victory for the favored team. If the difference fell between 3 and 4, I classified it as the algorithm considering the game a very likely victory for the higher-rated team. In games where one team is rated clearly superior to the other, predicting victory should yield a highly successful pick rate.
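MOOSE's classification rule can be sketched in a few lines. This is a hypothetical helper, not code from the actual system; in particular, how a difference of exactly 3 or 4 is bucketed is an assumption, since the description above doesn't say.

```python
def classify_matchup(rating_a: float, rating_b: float) -> str:
    """Classify a Super Bowl by the contenders' rating difference.

    Hypothetical helper following the thresholds described above;
    the handling of exact boundary values (3 and 4) is an assumption.
    """
    diff = abs(rating_a - rating_b)
    if diff > 4:
        return "guaranteed"    # the algorithm "guarantees" the favorite
    if diff >= 3:
        return "very likely"   # a very likely victory for the favorite
    return "no strong pick"    # below the confidence thresholds
```

For instance, a spread of 7.54 (this year's weighted rating difference) would land in the "guaranteed" bucket, while the standard method's average difference of 2.72 falls below both thresholds.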

The first thing that stands out is that the weighted system has by far the best picking percentage in games it feels confident about. Beyond that, it is also confident about more games than either of the other systems, though not by a statistically significant amount. When it is confident, however, it is significantly more emphatic than the other methods, with an average rating difference of 3.50. The standard method, on the other hand, which is designed to be more conservative, is more reluctant to strongly back a team and shows an average difference of only 2.72.

We must take some things into consideration before we can fully understand these results. First, the three methods do not select the same games as their favorites. While there is a lot of agreement, about one third of all games are disputed: one method sees a closer contest, or sometimes favors the other team outright. Also, an incorrect prediction of how a game will turn out doesn't necessarily mean the prediction was poor. Sometimes upsets happen, especially in a one-game elimination format such as the NFL playoffs. We will examine these as necessary.

Taking a look at the standard version's favorite picks, you quickly find that the 3-3 record isn't as bad as it looks. The three games it got correct were the ’79 Steelers over the LA Rams in the final Super Bowl of the Steeler dynasty, the ’85 Bears over the overmatched Patriots in a blowout, and last year's Colts over the Bears, which, minus an opening kickoff returned for a touchdown, would also have been a blowout. All three were solid wins. The standard method calls the ’99 Rams' win over the Titans a huge upset, but it wasn't perceived that way at the time, and certainly isn't considered that way in retrospect, given that it was the Greatest Show on Turf's only Super Bowl victory. This one definitely counts against the method. Another loss was the 1983 Raiders over the Redskins. The ’Skins were defending Super Bowl champs and steamrolled through the regular season at 14-2. The Raiders were no pushovers, though, having come in at 12-4 themselves. You can see why the system backed the Redskins, but perhaps it shouldn't have been so adamant about it. The final loss, however, nobody could see coming. The 2001 Patriots' defeat of the Rams is considered one of the biggest upsets in NFL history. Any system ever devised will get this one wrong.

From there, the standard system’s record improves greatly. In an interesting twist, the final two games on its list are both of the SF/CIN Super Bowls. Both also have the exact same rating spread of 3.16. However, San Francisco was favored in the 1981 game while Cincinnati was favored in 1988.

The iterative method appears to pick the wrong games to back. Like the standard method, it only gets three of its favorites correct, and agrees that the ’79 Steelers over the Rams was the most obvious pick. Its other two correct choices were the ’89 49ers over the Broncos, a blowout, and the 2003 Patriots over the Panthers, a crazy game decided on a last-minute field goal. The incorrect picks included the already-mentioned ’01 Patriots, ’83 Raiders, and ’88 Bengals, in addition to the ’77 Cowboys' win over the Broncos and the ’05 Steelers over the Seahawks. Having been born in 1977, I'm not really aware of the popular sentiment going into that game, but both teams entered with 12-2 records, the best in their conferences. The ’05 Steelers game wasn't an upset either, especially since Roethlisberger had been injured during the season and the Steelers had gone 15-1 the previous year. This is also considered an AFC-dominant era. The method improves its accuracy from there, but not by enough to save face after poor choices at the top.

The apparent winner in this challenge is the weighted method. While some of its choices are questionable, the results are hard to argue with. While the 2004 Patriots were a very strong team, the Eagles weren't thought to be completely outmatched. Yet the weighted method considers this the single most lopsided matchup ever played. The result of the game was close, but the pick was right. Its next favorite game was the ’01 Patriots/Rams game, which, as already discussed, was a huge upset. After that, the weighted method gets 10 consecutive picks right before incorrectly picking the ’02 Raiders over the Buccaneers. Rounding out the favorite picks, the weighted method falls into the same trap as the other two in picking the ’77 Broncos. The method's five extra picks suffered two losses: the ’05 Steelers/Seahawks matchup, and the ’71 game in which the Cowboys defeated the Dolphins.

The question of interest becomes, “Where does this year's Super Bowl fall?” In both the standard and the weighted method, the upcoming game is the most lopsided Super Bowl ever. According to the iterative version, it is the third-biggest difference. The iterative method lists a difference of 6.62, the standard 6.68, and the weighted a whopping 7.54. How does this help us more closely evaluate the results we have already examined? Since all three algorithms agree that this is a one-sided matchup, let's eliminate all games where they disagree. This immediately reduces the list to nine games. In each of these nine, the three agreed on which team was the heavy favorite, so they all posted the same record: 7-2. Here are the games in question.
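The "eliminate all games where they disagree" step is just a consensus filter across the three algorithms' picks. A minimal sketch, with placeholder game labels and team names rather than the actual nine games:

```python
# Each entry: (game, standard_pick, iterative_pick, weighted_pick).
# Placeholder data for illustration only -- not the real picks.
picks = [
    ("SB-A", "Team1", "Team1", "Team1"),  # all three agree -> kept
    ("SB-B", "Team1", "Team2", "Team1"),  # disputed -> dropped
    ("SB-C", "Team3", "Team3", "Team3"),  # all three agree -> kept
]

# Keep only the games where all three algorithms back the same team.
consensus = [p for p in picks if p[1] == p[2] == p[3]]
```

Applied to the 1970-forward history, this filter is what reduces the list to the nine consensus games discussed above.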

Again, I’m sure we’ve already discussed and can all agree that the 2001 game was an epic upset. But there, once more, is that strange 1977 game that each method gets completely wrong. What’s even more strange is that in the final week of the regular season, Dallas defeated Denver and all three versions beatlooped that victory away. Overall, the three methods simply agreed that Denver’s victories were over more successful opponents, and by enough of a margin to strongly back them in the Super Bowl despite the Cowboys’ victory.

In the remaining seven games, the favored team wins by an average of 18.7 points. Obviously the two extreme blowouts skew this average, but only one of the seven games is not a double-digit victory. Considering the results of this analysis, the Giants' mediocrity through most of the season, and the Patriots entering the final game unbeaten, a Giants win would be the most incredible Super Bowl upset ever.

6 Responses to Beatpaths Historical Algorithm Comparison

In 1977, the Cowboys were not that much better than Denver overall, and had only beaten them 14-6, at home, four weeks prior. So I don't think that's really all that much of an anomaly. The Broncos could very well have won that game. Morton was in a position to redeem himself against his old team, had made the Super Bowl before, the AFC dominated the 1970s, and the team was playing good football.
2001 was an epic upset and really shouldn’t be worrisome.

I’m not sure I’d read too much into these sample sizes. Honestly, I’d be more interested in each algorithm’s performance over a broad swath of games – say, week 9 until the end of the regular season. I hope this doesn’t sound like sour grapes – I really would be interested to see this approach applied over more games.

It is interesting to see the weighted algorithm perform well here, though — mainly because we know from looking at its overall rankings that it's not a very accurate algorithm in terms of overall team ranking (’Skins at #2, Bears at #5, Broncos at #30, et cetera). I wonder if some link can be drawn between these results and the “guts and stomps” piece Football Outsiders did. Maybe the blowouts are junk data in general, but blowouts by elite teams do give us some special insight into which teams are capable of really dominating. Just an idea.

How it works for all versions now is that each path's strength is the sum of the weights of its links. In the standard system, each link is 1 (or 2 for an in-division season sweep, or even 3 if there's a third defeat in the playoffs). In the weighted system, each link is the point differential for the matchup. In the iterative system, it is the remaining strength of the link.
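A minimal sketch of those three weighting schemes. The function names are mine, not the system's; the iterative system's "remaining strength" comes out of its own resolution process, so here it would simply be the value that process leaves on the link.

```python
def standard_link_weight(defeats: int) -> int:
    # 1 for a single win over the opponent, 2 for an in-division
    # season sweep, 3 if a third defeat is added in the playoffs
    return min(defeats, 3)

def weighted_link_weight(point_differential: int) -> int:
    # the link simply carries the matchup's point differential
    return point_differential

def path_strength(link_weights: list) -> float:
    # in all versions, a beatpath's strength is the sum of its link weights
    return sum(link_weights)
```

So, in the standard system, a path through one season sweep and two single wins would score `path_strength([2, 1, 1]) == 4`, while in the weighted system the same path's strength depends on the margins of those games.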

I decided not to use a multiplicative system because if a link were close to 0, it could cause a team to be ranked below a team it has a beatwin over, which goes against the general purpose of these rankings. In other systems, multiplication causes teams with sweeps to get more credit than they deserve, and it creates a high frequency of teams exceeding the 10-point scale.
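The near-zero-link problem is easy to see numerically. A sketch with made-up link weights:

```python
# Hypothetical beatpath with one link of nearly zero strength.
weak_path = [3.0, 0.05, 2.5]

# Additive (the approach chosen above): the weak link dents the
# total, but the path still carries real weight.
additive = sum(weak_path)        # 5.55

# Multiplicative: one near-zero link collapses the whole path,
# which is how a team could end up ranked below a team it holds
# a beatwin over.
multiplicative = 1.0
for w in weak_path:
    multiplicative *= w          # 0.375
```

Under multiplication the path is worth less than any single one of its strong links, which is exactly the ordering violation the additive design avoids.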

With the current measurements and scale, no team has ever exceeded 10 during the regular season and the only team ever to exceed 10 in the playoffs is the ’89 49ers, who fell back below 10 after winning the Super Bowl.

I expect though, that over the offseason I will take a look at the methods and see if there is a better way.