The Details

FiveThirtyEight has an admitted fondness for the Elo rating — a simple system that judges teams or players based on head-to-head results — and we’ve used it to rate competitors in basketball, baseball, tennis and various other sports over the years. The sport we cut our teeth on, though, was professional football. Way back in 2014, we developed our NFL Elo ratings to forecast the outcome of every game. The nuts and bolts of that system are described below.

Game predictions

In essence, Elo assigns every team a power rating (the NFL average is around 1500). Those ratings are then used to generate win probabilities for games, based on the difference in quality between the two teams involved, plus adjustments for changes at starting quarterback, the location of the matchup (including travel distance) and any extra rest days either team had coming into the contest. After the game, each team’s rating changes based on the result, in relation to how unexpected the outcome was and the winning margin. This process is repeated for every game, from kickoff in September until the Super Bowl.

For any game between two teams (A and B) with certain pregame Elo ratings, the odds of Team A winning are:

\begin{equation*}
Pr(A) = \frac{1}{10^{\frac{-ELODIFF}{400}} + 1}
\end{equation*}

ELODIFF is Team A’s rating minus Team B’s rating, plus or minus the difference in several adjustments:

A home-field adjustment of 55 points at base, depending on who was at home, plus 4 points of Elo for every 1,000 miles traveled. This means the Giants get a 55-point Elo bonus when “hosting” the Jets (despite both teams calling MetLife Stadium home), while the Patriots would get a 65-point Elo bonus when, say, the Chargers come to visit. There is no base home-field adjustment for neutral-site games such as the Super Bowl¹ or international games, although the travel-distance adjustment is included for the Super Bowl.

A rest adjustment of 25 Elo points whenever a team is coming off of a bye week (including when top-seeded teams don’t play during the opening week of the playoffs). Our research shows that teams in these situations play better than would be expected from their standard Elo alone, even after controlling for home-field effects.

A playoff adjustment that multiplies ELODIFF by 1.2 before computing the expected win probabilities and point spreads for playoff games. We found that, in the NFL playoffs, favorites tend to outplay underdogs by a wider margin than we’d expect from their regular-season ratings alone.

A quarterback adjustment that assigns every team and each individual QB a rolling performance rating, which can be used to adjust a team’s “effective” Elo upward or downward in the event of a major injury or other QB change. (See below for more details about how this adjustment works.)
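The adjustments listed above can be sketched in a few lines of Python. This is a minimal illustration rather than FiveThirtyEight's actual code: the function and parameter names are hypothetical, the win-probability curve assumes the standard Elo logistic form on a 400-point scale, and applying the playoff multiplier after all other adjustments is our reading of the description.

```python
HOME_FIELD = 55.0             # base home-field bonus, in Elo points
TRAVEL_PER_1000_MILES = 4.0   # bonus per 1,000 miles the opponent traveled
REST_BONUS = 25.0             # bonus for a team coming off a bye
PLAYOFF_MULTIPLIER = 1.2      # stretches ELODIFF in playoff games


def elo_diff(rating_a, rating_b, *, home=None, miles_a=0.0, miles_b=0.0,
             bye_a=False, bye_b=False, qb_adj_a=0.0, qb_adj_b=0.0,
             playoff=False):
    """Adjusted ELODIFF from Team A's point of view (hypothetical helper).

    home is "a", "b", or None for a neutral site.
    """
    diff = rating_a - rating_b
    if home == "a":
        diff += HOME_FIELD
    elif home == "b":
        diff -= HOME_FIELD
    # Each team is penalized for its own travel distance.
    diff += TRAVEL_PER_1000_MILES * (miles_b - miles_a) / 1000.0
    diff += REST_BONUS * (int(bye_a) - int(bye_b))
    diff += qb_adj_a - qb_adj_b
    if playoff:
        diff *= PLAYOFF_MULTIPLIER
    return diff


def win_probability(diff):
    """Team A's win probability, assuming the standard Elo logistic curve."""
    return 1.0 / (10.0 ** (-diff / 400.0) + 1.0)


# Patriots-style example: evenly rated teams, visitor travels ~2,500 miles.
example = elo_diff(1500, 1500, home="a", miles_b=2500)
print(example, round(win_probability(example), 3))
```

Keeping every adjustment on the Elo scale means they can simply be summed before the single conversion to a probability.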

We also tested effects for weather and coaches (including both head coaches and coordinators) but found that neither improved the predictive value of our model in backtesting by enough to warrant inclusion.

Fun fact: If you want to compare Elo’s predictions with point spreads like the Vegas line, you can also divide ELODIFF by 25 to get the spread for the game. Just be sure to include all of the many adjustments above to get the most accurate predicted line.
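As a quick sketch of that conversion (the function name is hypothetical, and the sign convention is an assumption: a positive number here means Team A is favored by that many points, whereas a Vegas line would list the favorite with a negative number):

```python
def elo_point_spread(elo_diff):
    """Expected margin for Team A implied by an adjusted ELODIFF."""
    return elo_diff / 25.0


# A 65-point ELODIFF works out to a bit under a field goal.
print(elo_point_spread(65))
```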

Once the game is over, the pregame ratings are adjusted up (for the winning team) and down (for the loser). We do this using a combination of factors:

The K-factor. All Elo systems come with a special multiplier called K that regulates how quickly the ratings change in response to new information. A high K-factor tells Elo to be very sensitive to recent results, causing the ratings to jump around a lot based on each game’s outcome; a low K-factor makes Elo slow to change its opinion about teams, since every game carries comparatively little weight. In our NFL research, we found that the ideal K-factor for predicting future games is 20 — large enough that new results carry weight, but not so large that the ratings bounce around each week.

The forecast delta. This is the difference between the binary result of the game (1 for a win, 0 for a loss, 0.5 for a tie) and the pregame win probability as predicted by Elo. Since Elo is fundamentally a system that adjusts its prior assumptions based on new information, the larger the gap between what actually happened and what it had predicted going into a game, the more it shifts each team’s pregame rating in response. Truly shocking outcomes are like a wake-up call for Elo: They indicate that its pregame expectations were probably quite wrong and thus in need of serious updating.

The margin-of-victory multiplier. The two factors above would be sufficient if we were judging teams based only on wins and losses (and, yes, Donovan McNabb, sometimes ties). But we also want to be able to take into account how a team won: whether they dominated their opponents or simply squeaked past them. To that end, we created a multiplier that gives teams (ever-diminishing) credit for blowout wins by taking the natural logarithm of their point differential plus 1 point.

\begin{equation*}
\text{MOV Multiplier} = \ln(\text{Winner Point Diff} + 1) \times \frac{2.2}{\text{Winner Elo Diff} \times 0.001 + 2.2}
\end{equation*}

This factor also carries an additional adjustment for autocorrelation, which is the bane of all Elo systems that try to adjust for scoring margin. Technically speaking, autocorrelation is the tendency of a time series to be correlated with its past and future values. In football terms, that means the Elo ratings of good teams run the risk of being inflated because favorites not only win more often, but they also tend to put up larger margins in their wins than underdogs do in theirs. Since Elo gives more credit for larger wins, this means that top-rated teams could see their ratings swell disproportionately over time without an adjustment. To combat this, we scale down the margin-of-victory multiplier for teams that were bigger favorites going into the game.²

Multiply all of those factors together, and you have the total number of Elo points that should shift from the loser to the winner in a given game. (Elo is a closed system where every point gained by one team is a point lost by another.) Put another way: A team’s postgame Elo is simply its pregame Elo plus or minus the Elo shift implied by the game’s result — and in turn, that postgame Elo becomes the pregame Elo for a team’s next matchup. Circle of life.
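Putting the three factors together, the postgame shift might be computed as in the sketch below. The function names are hypothetical, the win probability assumes the standard 400-point Elo curve, and the tie multiplier follows the value given in the footnotes (2.2 times the natural log of 2). Whether the update uses the fully adjusted ELODIFF is not spelled out, so the code simply takes whatever pregame ELODIFF it is given.

```python
import math

K = 20  # K-factor from the text


def win_probability(elo_diff):
    """Team A's pregame win probability (standard Elo logistic curve)."""
    return 1.0 / (10.0 ** (-elo_diff / 400.0) + 1.0)


def mov_multiplier(point_diff, winner_elo_diff):
    """Margin-of-victory multiplier with the autocorrelation adjustment."""
    return math.log(point_diff + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2)


def elo_shift(elo_diff_a, points_a, points_b):
    """Elo points added to Team A (and subtracted from Team B)."""
    prob_a = win_probability(elo_diff_a)
    if points_a == points_b:
        result_a = 0.5
        multiplier = 2.2 * math.log(2)  # tie case, per the footnote
    else:
        result_a = 1.0 if points_a > points_b else 0.0
        # The multiplier is computed from the winner's perspective.
        winner_elo_diff = elo_diff_a if result_a == 1.0 else -elo_diff_a
        multiplier = mov_multiplier(abs(points_a - points_b), winner_elo_diff)
    return K * multiplier * (result_a - prob_a)


# Evenly matched teams, Team A wins by a touchdown.
shift = elo_shift(0, 27, 20)
print(round(shift, 1))
```

Because the same shift is added to one team and subtracted from the other, the system stays closed, as the paragraph above describes.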

We also adjust each starting quarterback’s rating based on his performance in the game, adjusting for the quality of the opposing defense. (Read on for more details about how that process works.)

Elo does have its limitations. Aside from changes at quarterback, it doesn’t know about trades or injuries that happen midseason, so it can’t adjust its ratings in real time for the absence of an important non-QB player. Over time, it will theoretically detect such a change when a team’s performance drops because of the injury, but Elo is always playing catch-up in that department. Normally, any time you see a major disparity between Elo’s predicted spread and the Vegas line for a game, it will be because Elo has no means of adjusting for key changes to a roster and the bookmakers do. (But this should be much less frequent after the addition of our QB adjustments, since oddsmakers don’t tend to shift lines much — or at all — in response to changes at non-QB positions.)

The quarterback adjustment

New for 2019, we added a way to account for changes in performance — and personnel — at quarterback, the game’s most important position. Here’s how it works:

Both teams and individual quarterbacks have rolling ratings based on their recent performance.

Performance is measured according to “VALUE,” a regression between ESPN’s Total QBR yards above replacement and basic box score numbers (including rushing stats) from a given game, adjusted for the quality of opposing defenses.

The defensive adjustment works by computing a rolling rating of the QB VALUE each team allows, subtracting the league average from the VALUE an opponent usually gives up per game, and using the difference to adjust a QB’s performance in the game in question. For example, if a team usually gives up a VALUE 5 points higher than the average team, we would adjust an individual QB’s performance downward by 5 points of VALUE to account for the easier opposing defense.

Because these ratings roll forward with each game, short-term “hot” and “cold” streaks by individual QBs carry predictive value, and they can trigger a nonzero pregame QB adjustment even when a team has had the same starter for each of its previous 20 games.

The rolling rating represents the VALUE we’d expect a quarterback (whether at the individual or team level) to produce against a passing defense of average quality in the next start. To convert between VALUE and Elo, the rolling rating can be multiplied by 3.3 to get the number of Elo points a QB is expected to be worth compared with an undrafted rookie replacement.

The quarterback Elo adjustment is applied before each game by comparing the starting QB’s rolling VALUE rating with the team’s rolling rating and multiplying by 3.3.

For example, when Aaron Rodgers was injured midway through the 2017 season, he had a rolling VALUE rating of 66. The Green Bay Packers’ team rolling VALUE rating was 68, and backup Brett Hundley had a personal rating of 14. So when adjusting the Packers’ Elo for their next game with Hundley starting instead of Rodgers, we would have applied an adjustment of 3.3 * (14 - 68) = -176⁴ to Green Bay’s base Elo rating of 1586 heading into its Week 7 game against the Saints. This effectively would have left the Packers as a 1409 Elo team with Hundley under center (before applying adjustments for home field, travel and rest), dropping Green Bay’s win probability from 63 percent to 39 percent for the game despite playing at home. In cases like these, the QB adjustment can have a massive effect!
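The two pieces of arithmetic in this section, the defensive adjustment to a single-game VALUE and the pregame Elo adjustment, can be sketched as follows. The function names are hypothetical, and the inputs are the rounded ratings quoted in the text.

```python
VALUE_TO_ELO = 3.3  # Elo points per point of VALUE, from the text


def defense_adjusted_value(raw_value, opp_value_allowed, league_avg_allowed):
    """Adjust one game's VALUE for the quality of the opposing defense.

    A defense that allows more VALUE than average is "easy," so the
    QB's performance is marked down by the difference.
    """
    return raw_value - (opp_value_allowed - league_avg_allowed)


def qb_elo_adjustment(starter_rating, team_rating):
    """Pregame Elo adjustment: starter's rolling VALUE vs. the team's."""
    return VALUE_TO_ELO * (starter_rating - team_rating)


# A defense that gives up 5 more VALUE than average costs the QB 5 points.
print(defense_adjusted_value(60.0, 55.0, 50.0))

# Hundley (14) starting for a Packers team rated 68.
print(round(qb_elo_adjustment(14, 68), 1))
```

With the rounded inputs, the Packers adjustment comes out to about -178 Elo points, close to the roughly 177-point drop implied by the 1586 and 1409 figures above. The small gap is presumably due to rounding in the published ratings.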

You can track these quarterback ratings on a team-by-team and division-by-division basis using this interactive page, which shows the relative quality of every QB in the league. The average team QB VALUE rating going into the 2019 season was about 49.5 (or about 163 Elo points), a leaguewide number that has increased substantially over the history of the NFL as passing has become more prevalent and efficient. So a rolling rating that would have made a QB one of the best in football in the 1990s would rank as only average now, even though the zero-point in our ratings remains the replacement-level performance of an undrafted rookie starter.

One last note on these ratings involves how they are set initially. We’ll explain preseason team Elo ratings below, but here is how preseason ratings are set for the quarterback adjustment:

Before a season, each starting quarterback is assigned a preseason rating based on either his previous performance or his draft position (in the case of rookies making their debut start).

For veterans with between 10 and 100 career starts, we take their final rating from the end of the previous season and revert it toward the rating of the average NFL QB start by one-fourth before the following season.

For players with fewer than 10 or more than 100 starts, we don’t revert their ratings at all.

For rookies making their starting debuts, we assign them initial ratings based on draft position. An undrafted rookie is always assigned a rating of zero for his first start. The first overall pick, by comparison, gets a rating of +113 Elo points before his first start.

Preseason QB ratings are also assigned at the team level. These consist of one-third weight given to the team’s previous end-of-season rolling QB rating and two-thirds weight given to the preseason rolling rating of the team’s projected top starter.
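A rough sketch of the veteran and team-level rules above follows. The function names are hypothetical, and the league-average rating is passed in as a parameter because its exact value is not given here. Rookie ratings are left out, since only the endpoints of the draft-position scale (zero for an undrafted rookie, +113 Elo points for a first overall pick) are stated.

```python
def preseason_qb_rating(prev_rating, career_starts, league_avg_rating):
    """Preseason rating for a veteran QB.

    QBs with 10 to 100 career starts are reverted one-fourth of the way
    toward the average; everyone else keeps the previous rating.
    """
    if 10 <= career_starts <= 100:
        return prev_rating + (league_avg_rating - prev_rating) / 4.0
    return prev_rating


def preseason_team_qb_rating(team_prev_rating, starter_preseason_rating):
    """One-third prior team rating, two-thirds projected starter."""
    return team_prev_rating / 3.0 + 2.0 * starter_preseason_rating / 3.0


# A QB rated 80 with 50 starts, in a league where the average start is 40.
starter = preseason_qb_rating(80.0, 50, 40.0)
print(starter, preseason_team_qb_rating(60.0, starter))
```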

Pregame and preseason ratings

So all of that is how Elo works at the game-by-game level and what goes into our quarterback adjustments. But where do teams’ preseason ratings come from, anyway?

We use two sources to set teams’ initial ratings going into a season:

At the start of each season, every existing team carries its Elo rating over from the end of the previous season, except that it is reverted one-third of the way toward a mean of 1505. That is our way of hedging for the offseason’s carousel of draft picks, free agency, trades and coaching changes. We don’t currently have any way to adjust for a team’s actual offseason moves, aside from changes at quarterback, but a heavy dose of regression to the mean is the next-best thing, since the NFL has built-in mechanisms (like the salary cap) that promote parity, dragging bad teams upward and knocking good ones down a peg or two.

For seasons since 1990, we also use Vegas win totals to help set preseason Elo ratings, converting over-under expected wins to an Elo scale. (This addition to the model helped significantly improve predictive accuracy in backtesting, by a little more than half the improvement that adding the QB adjustment did.) As a side note, this is partly why we mix the projected starting QB’s rolling rating into the preseason team QB rating: we assume that changes at quarterback are “baked into” Vegas over/unders and must be adjusted for to avoid double-counting the improvement added by an upgrade at QB.

These two factors are combined, with one-third weight given to regressed Elo and two-thirds weight given to Vegas-wins Elo. This blend is what forms a team’s preseason Elo rating.
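The blend can be sketched as below. The function names are hypothetical, and the conversion from Vegas win totals to the Elo scale is not described in detail, so the sketch takes an already-converted Vegas-wins Elo as an input.

```python
LEAGUE_MEAN = 1505.0  # mean that ratings are reverted toward


def regressed_elo(end_of_season_elo):
    """Revert last season's final rating one-third of the way to the mean."""
    return end_of_season_elo + (LEAGUE_MEAN - end_of_season_elo) / 3.0


def preseason_elo(end_of_season_elo, vegas_wins_elo):
    """One-third regressed Elo, two-thirds Vegas-wins Elo."""
    return regressed_elo(end_of_season_elo) / 3.0 + 2.0 * vegas_wins_elo / 3.0


# A team that finished at 1650 but that Vegas pegs as roughly a 1560 team.
print(round(regressed_elo(1650.0), 1), round(preseason_elo(1650.0, 1560.0), 1))
```

For seasons before 1990, where no Vegas win totals are used, the regressed rating alone would presumably serve as the preseason rating.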

Note that we specified “existing” teams when describing the end-of-season ratings carried over from the previous year. Expansion teams have their own set of rules. For newly founded clubs in the modern era, we assign them a rating of 1300, which is effectively the Elo level at which NFL expansion teams have played since the 1970 AFL merger. We also assigned that number to new AFL teams in 1960, letting the ratings play out from scratch as the AFL operated in parallel with the NFL. When the AFL’s teams merged into the NFL, they retained the ratings they’d built up while playing separately.

For new teams in the early days of the NFL, things are a little more complicated. When the NFL began in 1920 as the “American Professional Football Association” (they renamed it “National Football League” in 1922), it was a hodgepodge of independent pro teams from existing leagues and opponents that in some cases were not even APFA members. For teams that had not previously played in a pro league, we assigned them a 1300 rating; for existing teams, we mixed that 1300 mark with a rating that gave them credit for the number of years they’d logged since first being founded as a pro team.

This adjustment applied to 28 franchises during the 1920s, plus the Detroit Lions (who joined the NFL in 1930 after being founded as a pro team in 1929) and the Cleveland Rams (who joined in 1937 after playing a season in the second AFL). No team has required this exact adjustment since, although we also use a version of it for historical teams that discontinued operations for a period of time.

Not that there haven’t been plenty of other odd situations to account for. During World War II, the Chicago Cardinals and Pittsburgh Steelers briefly merged into a common team that was known as “Card-Pitt,” and before that, the Steelers had merged with the Philadelphia Eagles to create the delightfully monikered “Steagles.” In those cases, we took the average of the two teams’ ratings from the end of the previous season and performed our year-to-year mean reversion on that number to generate a preseason Elo rating. After the mash-up ended and the teams were re-divided, the Steelers and Cardinals (or Eagles) received the same mean-reverted preseason rating implied by their combined performance the season before.

Season simulations

Now that we know where a team and quarterback’s initial ratings for a season come from and how those ratings update as the schedule plays out, the final piece of our Elo puzzle is how all of that fits in with our NFL interactive graphic, which predicts the entire season.

At any point in the season, the interactive lists each team’s up-to-date Elo rating (as well as how that rating has changed over the past week and how any changes at QB alter the team’s effective Elo), plus the team’s expected full-season record and its odds of winning its division, making the playoffs and even winning the Super Bowl. This is all based on a set of simulations that play out the rest of the schedule using Elo to predict each game.

Specifically, we simulate the remainder of the season 100,000 times using the Monte Carlo method, tracking how often each simulated universe yields a given outcome for each team. It’s important to note that we run these simulations “hot” — that is, a team’s Elo rating is not set in stone throughout the simulation but changes after each simulated game based on its result, which is then used to simulate the next game, and so forth. This allows us to better capture the possible variation in how a team’s season can play out, realistically modeling the hot and cold streaks that a team can go on over the course of a season.
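A stripped-down version of a “hot” simulation might look like the sketch below. It is a simplification under several assumptions: the names are hypothetical, only the base home-field adjustment is applied, the update uses just the K-factor and forecast delta (no simulated margins, so no margin-of-victory multiplier), and the number of simulations is kept small.

```python
import random


def win_probability(elo_diff):
    """Win probability from ELODIFF (standard Elo logistic curve)."""
    return 1.0 / (10.0 ** (-elo_diff / 400.0) + 1.0)


def simulate_season(ratings, schedule, n_sims=1000, k=20.0,
                    home_field=55.0, seed=0):
    """Average simulated wins per team over the remaining schedule.

    ratings  -- dict of team name to current Elo
    schedule -- list of (home_team, away_team) tuples, in playing order
    """
    rng = random.Random(seed)
    total_wins = {team: 0 for team in ratings}
    for _ in range(n_sims):
        sim = dict(ratings)  # fresh copy: each universe starts from today
        for home, away in schedule:
            p_home = win_probability(sim[home] - sim[away] + home_field)
            home_won = rng.random() < p_home
            # "Hot" step: update ratings from the simulated result so the
            # next simulated game sees the new ratings.
            shift = k * ((1.0 if home_won else 0.0) - p_home)
            sim[home] += shift
            sim[away] -= shift
            total_wins[home if home_won else away] += 1
    return {team: wins / n_sims for team, wins in total_wins.items()}


ratings = {"A": 1600.0, "B": 1500.0, "C": 1400.0}
schedule = [("A", "B"), ("B", "C"), ("C", "A")]
print(simulate_season(ratings, schedule))
```

Updating the ratings inside each simulated universe is what widens the spread of outcomes: a team that wins early in a given simulation becomes a slightly bigger favorite in its later games.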

Our simulations also project which quarterback will start each game by incorporating injuries, suspensions and starters being rested. For example, we might know that a quarterback is out for Weeks 1 and 2 but back for certain in Week 3. Or our forecast might have some uncertainty around a quarterback’s injury and project that he has only a 10 percent chance of playing next week but a 50 percent chance of playing the following week, and so on. In cases where we don’t know for sure which quarterback will start a game, the team’s quarterback adjustment is a weighted average of the possible starting quarterback adjustments.
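The weighted average described here is straightforward to sketch. The function name and the input format, a list of (start probability, Elo adjustment) pairs, are assumptions made for illustration.

```python
def expected_qb_adjustment(candidates):
    """Probability-weighted QB adjustment for an uncertain starter.

    candidates -- list of (start_probability, elo_adjustment) pairs whose
                  probabilities should sum to 1
    """
    total_prob = sum(prob for prob, _ in candidates)
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("start probabilities must sum to 1")
    return sum(prob * adj for prob, adj in candidates)


# Regular starter (adjustment 0) has a 10 percent chance to play;
# otherwise a backup worth -150 Elo points starts.
print(expected_qb_adjustment([(0.1, 0.0), (0.9, -150.0)]))
```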

Late in the season, you will find that the interactive allows you to experiment with different postseason contingencies based on who you have selected to win a given game. This is done by drilling down to just the simulated universes in which the outcomes you chose happened and seeing how those universes ultimately played out. It’s a handy way of seeing exactly what your favorite team needs to get a favorable playoff scenario or just to study the ripple effects each game may have on the rest of the league.
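Drilling down to matching universes amounts to filtering stored simulations and recounting. The sketch below assumes a hypothetical storage format in which each simulated season records the winner of every game and the set of playoff teams; the real interactive may store its simulations quite differently.

```python
def conditional_playoff_odds(simulations, chosen_winners, team):
    """Playoff odds for a team, given user-selected game outcomes.

    simulations    -- list of dicts with "winners" (game id to team) and
                      "playoff_teams" (set of team names)
    chosen_winners -- dict of game id to the team selected to win
    Returns None when no simulated season matches the selections.
    """
    matching = [
        sim for sim in simulations
        if all(sim["winners"].get(game) == winner
               for game, winner in chosen_winners.items())
    ]
    if not matching:
        return None
    made_it = sum(team in sim["playoff_teams"] for sim in matching)
    return made_it / len(matching)


sims = [
    {"winners": {"g1": "A"}, "playoff_teams": {"A", "C"}},
    {"winners": {"g1": "A"}, "playoff_teams": {"C"}},
    {"winners": {"g1": "B"}, "playoff_teams": {"A"}},
]
print(conditional_playoff_odds(sims, {"g1": "A"}, "A"))
```

One consequence of this approach is that very unlikely combinations of selections leave few matching universes, so the resulting odds become noisier.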

The complete history of the NFL

In conjunction with our Elo interactive, we also have a separate dashboard showing how every team’s Elo rating has risen or fallen throughout history. These charts will help you track when your team was at its best (or worst) along with its ebbs and flows in performance over time. The data in the charts goes back to 1920 (when applicable) and is updated with every game of the current season.

An important disclaimer: The historical interactive ratings will differ from the ratings found in our current-season prediction interactive because the historical ratings do not contain our quarterback adjustments. (If you’re interested in looking at the historical QB adjustment data, it’s available on our data homepage.)

Model Creators

Nate Silver is the founder and editor in chief of FiveThirtyEight. | @natesilver538

Footnotes

1. Unless a team somehow makes the Super Bowl in its host year.

2. Special note: In the case of a tie, the multiplier becomes 1.525, or 2.2 times the natural log of 2 (which, based on the formula above, effectively assumes the absolute margin of victory in any game must be at least 1).

3. For seasons before game-level sack logs are complete (pre-1981), the sack term is zeroed out.

4. After rounding.
