Life through nerd-colored glasses

Well, the Academy Awards are over (congrats to the winners; I know you’ve all been eagerly awaiting my approval) — how did my grand forecasting experiment turn out?

For some context, I’ll compare my forecast to some other statistically driven models: Nate Silver’s forecast, and the 3 other models covered along with mine in the Wall Street Journal. Briefly:

Nate Silver and Ben Zauzmer each developed a model largely similar to mine, making predictions by looking at precursor awards ceremonies and weighting each ceremony by its historical accuracy. Not surprisingly, our predictions are very similar.

David Rothschild from Microsoft Research used a fancier / more mysterious combination of market data and crowdsourcing to make predictions. Most of his predictions are quoted at high confidence. He was also able to make predictions for all 24 categories (including the 3-4 categories where it’s hard to find relevant data from precursor ceremonies).

Farsite also seemed to aggregate information from a broader pool than precursor ceremonies. Their predictions tended to fall between the “precursor models” and the Microsoft model.

I wrote about my predictions in the major categories on this blog, and published my final set of predictions on a google spreadsheet and NYTimes ballot before the ceremony. Unfortunately, some last-minute debugging changed one of my screenwriting predictions an hour before the ceremony (the revised prediction was correct, the original was wrong). Since I had previously published a different prediction, I will count that category as a mis-prediction here. Note also that I skipped the 3 short-subject categories and Production Design category, as I didn’t have enough relevant data to make a prediction. I’ve excluded those categories from the following analysis.

Here are the results:

Top 6 Categories (Directing, Acting, Picture)

Beaumont: 5/6

Zauzmer: 5/6

Farsite: 5/6

Nate Silver: 4/6

Microsoft Research: 4/6

Only Farsite correctly predicted that Christoph Waltz would win for Best Supporting Actor. Ben and I lucked out when Ang Lee won Best Director (everyone else predicted a Spielberg win). We both claimed this category was essentially a toss-up, so we can’t claim to be too prescient here. However, our prediction seems more plausible than the Farsite/Microsoft predictions, which had Spielberg as a >3:1 favorite (again, though, this is only 1 data point).

Top 20 Categories (excluding Production Design and short subject categories)

Beaumont: 17/20

Zauzmer: 17/20

Microsoft: 17/20

Farsite/Silver: No Predictions

We each made 1-2 additional mistakes in the minor categories, albeit in different categories: I mis-called the Animated category and (thanks to the bug discussed above) mis-predicted Adapted Screenplay up until an hour before the awards. Ben missed Original Screenplay and Cinematography. Microsoft missed the Makeup Category.

Calibration

The raw accuracy is the most interesting statistic, but not the entire story. A good model should also be calibrated — a prediction made with X% confidence should be correct about X% of the time. Predictions substantially more or less accurate than this are mis-calibrated.

At first glance, it seems that the Microsoft model is best calibrated. The average prediction confidence in this model is about 80%, comparable to its overall accuracy (17/20, or 85%). By contrast, Ben’s and my models make predictions at ~55% confidence on average. In other words, our models were too conservative, given how well they did.
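As a rough back-of-the-envelope check, the comparison above is just observed accuracy minus average stated confidence. The sketch below uses only the approximate figures quoted in this post; the function name `calibration_gap` is my own label, not part of any of the models.

```python
def calibration_gap(mean_confidence, n_correct, n_total):
    """Observed accuracy minus average stated confidence.

    Positive: the model was under-confident (too conservative).
    Negative: the model was over-confident.
    """
    accuracy = n_correct / n_total
    return accuracy - mean_confidence


# Figures quoted in this post; the confidences are approximate averages.
models = {
    "Microsoft": (0.80, 17, 20),
    "Beaumont": (0.55, 17, 20),
    "Zauzmer": (0.55, 17, 20),
}

for name, (conf, correct, total) in models.items():
    gap = calibration_gap(conf, correct, total)
    print(f"{name}: accuracy {correct / total:.0%}, "
          f"mean confidence {conf:.0%}, gap {gap:+.0%}")
```

A gap of a few points (Microsoft) is easy to explain by chance over 20 categories; a gap of roughly 30 points is harder to wave away. A single average hides a lot, though, which is why the simulation-based plots are more informative.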

In a few previous posts, I’ve discussed a method for visualizing model calibration. In essence, the idea is to use a model’s prediction confidence to simulate outcomes. For each simulation, we can plot the differences between the model prediction and simulated outcome. Finally, we overplot the actual outcome on top of these simulations. If the model is well-calibrated, the “reality line” should overlap the simulations. Let’s take a look at that:

Microsoft Model

Each line in these plots shows how many mis-predictions were made at a given confidence level or greater (for example, the Microsoft model made 3 mistakes at confidence > 0%, 2 mistakes at confidence >50%, 1 mistake at confidence > 70%, and no mistakes at confidence >~75%). The red lines show 1000 simulations, the black line is the average simulation, the light/dark bands are the central 40%/80% of simulations, and the blue line is the actual performance.
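For anyone who wants to reproduce this kind of plot, here is a minimal sketch of the simulation step. It assumes each stated confidence is exactly right, draws simulated outcomes accordingly, and counts misses at or above each confidence threshold. The confidences and outcomes below are made up for illustration (they are not any model's real numbers), and the helper names are hypothetical.

```python
import random


def misses_above(confidences, correct, threshold):
    """Count mis-predictions made at confidence >= threshold."""
    return sum(1 for p, ok in zip(confidences, correct)
               if p >= threshold and not ok)


def simulate_miss_curves(confidences, thresholds, n_sims=1000, seed=0):
    """Simulate outcomes assuming each stated confidence is exactly right.

    Each prediction with confidence p is treated as correct with
    probability p. Returns one miss-count curve per simulation.
    """
    rng = random.Random(seed)
    curves = []
    for _ in range(n_sims):
        outcome = [rng.random() < p for p in confidences]
        curves.append([misses_above(confidences, outcome, t)
                       for t in thresholds])
    return curves


def central_band(curves, index, fraction):
    """Bounds of the central `fraction` of simulated counts at one threshold."""
    values = sorted(curve[index] for curve in curves)
    tail = (1.0 - fraction) / 2.0
    lo = values[int(tail * (len(values) - 1))]
    hi = values[int((1.0 - tail) * (len(values) - 1))]
    return lo, hi


# Made-up confidences and outcomes, purely for illustration.
confidences = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.45, 0.4]
actual = [True, True, True, True, False, True, True, False]
thresholds = [0.0, 0.25, 0.5, 0.75]

curves = simulate_miss_curves(confidences, thresholds)
reality = [misses_above(confidences, actual, t) for t in thresholds]
for i, t in enumerate(thresholds):
    lo, hi = central_band(curves, i, 0.8)
    print(f"confidence >= {t:.2f}: actual misses {reality[i]}, "
          f"central 80% of simulations {lo}-{hi}")
```

If the actual miss counts sit well below the simulated band, the model is under-confident; well above, over-confident.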

These plots confirm and quantify what I said above: the models that Ben and I put together are too conservative, while the Microsoft model seems nicely calibrated. I have to admit I am both impressed by how well the “magic” Microsoft model calibrated itself, and curious about the under-confidence of my own model. I’ll be chewing on that in the coming days. Here are a few things I’ll be thinking about:

My predictions did better than I expected, based on testing on historical data. I was expecting to miss ~5 categories. After catching my screenwriting bug, I only missed 2. That’s mild evidence that this year’s Oscars played out more like the precursor awards than usual, which explains some of the under-confidence.

I used regularized regression to optimize my model; the “regularization” means the overall confidence of the model is adjusted up or down to match historical data. I’m starting to wonder if there’s any asymmetry in that process such that, given the relative simplicity/inflexibility of my model, the regularization prefers under-confidence. Who knows.
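To make the worry concrete, here is a toy sketch of one common form of regularization, an L2 penalty on a logistic regression weight. This is an assumption for illustration only: the post doesn't spell out the exact penalty or model, and the data below is made up. In this simplified setup the penalty can only pull the weight toward zero, which pulls the predicted probability toward 50%.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_weight(xs, ys, l2_penalty, lr=0.5, steps=5000):
    """Fit p = sigmoid(w * x) by gradient descent with an L2 penalty on w."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean log-loss, plus the gradient of (penalty / 2) * w^2.
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / n
        grad += l2_penalty * w
        w -= lr * grad
    return w


# Toy history (made up): a precursor winner (x = 1) took the Oscar 8 times in 10.
xs = [1.0] * 10
ys = [1.0] * 8 + [0.0] * 2

for penalty in (0.0, 0.1, 1.0):
    w = fit_weight(xs, ys, penalty)
    print(f"penalty {penalty}: w = {w:.2f}, "
          f"confidence in precursor winner = {sigmoid(w):.0%}")
```

With no penalty the fitted confidence matches the 80% historical rate; any positive penalty lands below it. Whether something similar happens in a richer model with a tuned penalty is exactly the open question.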

So… yay math?

So at the end of the day, is this worth it (I’m asking in the sense of ‘is there a significant edge to model-based predictions’, and not ‘is this a waste of time’)? Certainly, basing predictions on precursor awards is a huge advantage over ignoring the data: my Oscar guesses from previous years were usually under ~50% accurate, not 85%.

What about the harder question of whether modeling is better than the naive strategy of looking at the precursor awards, and predicting whichever film has won the most awards? That simple strategy largely yields the same set of predictions. The few categories where modeling matters are close calls, where there is no obvious nominee with a plurality of precursor wins. The best example from this year was the Best Actress Category; both Jennifer Lawrence and Jessica Chastain won precursor awards. However, the mathematical models all realized that Jennifer Lawrence was more successful in the more influential awards (e.g. SAG), and correctly predicted her as a clear favorite.
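The Best Actress case can be sketched in a few lines. The precursor wins are the ones mentioned in this post; the ceremony weights are hypothetical stand-ins for historical accuracy (not the fitted values from any of the models), and a plain weighted sum is a simplification of what the real models do.

```python
from collections import Counter

# Precursor wins for Best Actress, as described in this post.
precursor_wins = {
    "Golden Globe (Comedy)": "Jennifer Lawrence",
    "SAG": "Jennifer Lawrence",
    "Golden Globe (Drama)": "Jessica Chastain",
    "Critics Choice": "Jessica Chastain",
}

# Hypothetical weights standing in for each ceremony's historical accuracy.
weights = {
    "Golden Globe (Comedy)": 0.3,
    "SAG": 0.7,
    "Golden Globe (Drama)": 0.4,
    "Critics Choice": 0.5,
}


def naive_tally(wins):
    """Count precursor wins per nominee."""
    return Counter(wins.values())


def weighted_tally(wins, ceremony_weights):
    """Sum ceremony weights per nominee."""
    totals = Counter()
    for ceremony, nominee in wins.items():
        totals[nominee] += ceremony_weights[ceremony]
    return totals


naive = naive_tally(precursor_wins)
weighted = weighted_tally(precursor_wins, weights)
print("naive counts:", dict(naive))
print("weighted favorite:", max(weighted, key=weighted.get))
```

The naive count is a two-all tie, so it gives no guidance; the weighted tally breaks the tie toward whoever won the more predictive ceremony, which under these assumed weights is the SAG winner.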

Anyways, that’s a rather large brain dump. This was fun. And I FINALLY eked out an Oscar pool victory against my brother. Mission accomplished. Next year, I’m going after Microsoft.

UPDATE

Last post, I described my mildly obsessive strategy for making predictions in my Oscar pool this year. I’ve been driven to such measures by the repeated and humiliating losses to my brother Jon for the past quarter-score.

To recap that post: the common wisdom among Oscar pundits is that the “precursor” awards which happen before the Academy Awards (e.g. the Golden Globes, Screen Actors / Directors / Producers Guild, BAFTA, and the Critics Choice awards) tend to correlate with who wins Oscars. From what I understand, Jon looks over these when choosing a winner. I tried the same thing last year, and was on track to win until Meryl Streep won Best Actress in an upset (jerk). That night I made a vow: never again.

So, I decided to take the same approach this year, but make it more systematic. I grabbed 20 years’ worth of award data from the Internet Movie Database, and built a model that takes into account the degree to which each precursor ceremony predicts the Oscars (different ceremonies do better in different categories). My theory is that this might provide an edge for close calls. Our Oscar pool also incorporates an interesting twist, in that we have some freedom to down-weight predictions that we aren’t sure of. My model makes probabilistic estimates, and thus also gives a strategy for how to weight each prediction.
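The first ingredient in a model like this is a table of how often each ceremony's winner went on to win the Oscar, broken out by category. Here is a minimal sketch of that bookkeeping. The records are invented placeholders (single letters instead of real nominees), and the record layout is an assumption, not the actual format of the IMDb data.

```python
from collections import defaultdict

# Made-up records for illustration:
# (category, ceremony, precursor winner, Oscar winner), one per year.
history = [
    ("Best Actor", "SAG", "A", "A"),
    ("Best Actor", "SAG", "B", "B"),
    ("Best Actor", "Golden Globe", "A", "C"),
    ("Best Actor", "Golden Globe", "B", "B"),
    ("Best Director", "DGA", "D", "D"),
    ("Best Director", "DGA", "E", "F"),
]


def hit_rates(records):
    """Fraction of years each ceremony's winner matched the Oscar winner,
    computed separately for every (category, ceremony) pair."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, ceremony, precursor_winner, oscar_winner in records:
        key = (category, ceremony)
        totals[key] += 1
        hits[key] += precursor_winner == oscar_winner
    return {key: hits[key] / totals[key] for key in totals}


rates = hit_rates(history)
for (category, ceremony), rate in sorted(rates.items()):
    print(f"{category} / {ceremony}: {rate:.0%}")
```

Raw hit rates like these are only a starting point; turning them into per-nominee probabilities takes a further modeling step.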

Now that all of the precursor awards have taken place, I’m able to apply the model for the 2013 awards. Here are the main results (who really cares about Best Live Action short? Sorry, guy nominated for Best Live Action Short), with some punditry for good measure:

Best Picture

Argo (58%)

Les Miserables (12%)

Argo has swept the dramatic awards, and Les Mis won the Golden Globe for Best Picture (Comedy/Musical). The last time a movie swept the dramatic awards and lost Best Picture was when Brokeback Mountain lost to Crash.

Best Actor

Daniel Day Lewis (65%)

Hugh Jackman / Denzel Washington (10%)

Another straightforward call, as Daniel Day Lewis has swept the dramatic acting categories this year. Plus, people love that guy. A DDL loss would be a repeat of 2002, when Denzel Washington (Training Day) unexpectedly beat Russell Crowe (A Beautiful Mind), who also swept the dramatic awards. The SAG and Critics Choice awards best predict this category.

Best Actress

Jennifer Lawrence (70%)

Jessica Chastain (20%)

Sorry, Quvenzhané Wallis. You may be who the Earth is for, but you aren’t who this award is for. Choosing between Lawrence and Chastain is tricky. Lawrence won the Golden Globe comedy award and the SAG award. Chastain won the Golden Globe drama award and the Critics Choice award. The SAG is the best predictor, and thus the model prefers Jennifer Lawrence. If I hadn’t used a model, I would have guessed Jessica Chastain, thinking that dramatic movies have a better shot at winning Oscars than rom-coms. C’mon, math…

Supporting Actress

Anne Hathaway (87%)

Everyone else (2-5%)

The easiest prediction of the bunch. She’s been unbeatable in other ceremonies, so there’s no reason not to pick her based on the data.

Supporting Actor

Tommy Lee Jones (47%)

Christoph Waltz (30%)

This seems to be the most controversial category. The New York Times is predicting that Robert De Niro will win, based on his aggressive Oscar campaigning and his icon status (two factors not present in my model, which thinks he has a 5% shot). Tommy Lee Jones won the SAG, which best predicts the acting categories. Christoph Waltz won both the Golden Globe and BAFTA. Historically, this is a difficult category to predict based on precursor ceremonies.

Director

Who knows? Ben Affleck has swept the other ceremonies, but was notoriously not nominated for an Oscar this year. Thus, there’s very little award information to go off of.

My model gives a slight preference to Ang Lee/Life of Pi (45%) over Steven Spielberg/Lincoln (40%), based on which ceremonies each was nominated for. My model isn’t really precise to within 5%, so it isn’t a statistically significant edge. Personally, I’m inclined to think that Steven Spielberg will win (since, you know, he’s Steven Spielberg, and it’s been a while since he won. Poor little guy.)

Animated Feature

Another tossup between Wreck-It Ralph (47%) and Brave (40%). Both BAFTA and the Critics Choice awards tend to predict the correct winner about 90% of the time but, this year, they awarded different films (BAFTA -> Brave, and Critics Choice -> Wreck-It Ralph). I love Pixar, but their recent movies aren’t as good as they were 5 years ago, and I loved Wreck-It Ralph. I’m rooting for that movie, and its sweet 80s Nintendo soundtrack.

Foreign Film

Amour (56%)

A Royal Affair (15%)

Historically, this is a hard award to predict from precursor ceremonies. This year Amour won BAFTA, the Critics Choice Award, and the Golden Globe, so I think its odds are pretty good. It is also nominated for Best Picture, for which it doesn’t stand a chance. I think voters will feel bad and give it the Foreign Oscar instead.

Original Screenplay

Django Unchained (58%; originally 73%)

Zero Dark Thirty (36%); originally Flight (15%)

Update (Feb 24, 7PM): I noticed an error in my Writers Guild Award data (1 hour before the ceremony!). I had incorrectly stored the Original Screenplay winner as Flight instead of Zero Dark Thirty, and the Adapted Screenplay winner as Silver Linings Playbook instead of Argo. This changes the predictions in the two writing categories.

Adapted Screenplay

Argo (65%); originally Silver Linings Playbook (65%)

Lincoln, Silver Linings Playbook (11% each); originally Argo, Lincoln (12% each)

Update: See above.

That’s about it for the major-ish categories. Most of the minor categories don’t have many equivalent awards in other ceremonies, so the model predictions aren’t very compelling (sorry, lady nominated for Best Makeup and Hairstyling).

This was an interesting exercise. One of the key lessons (if you see this stuff as teaching moment material) is that Oscar winners are fairly hard to predict from the other awards: typical forecast accuracies are around 60% (clearly better than random guessing from a field of 5-7 nominees, but nowhere near a lock).

Perhaps models with more information could do better (genre information seems particularly relevant). However, even the people who do this stuff for a living usually only get ~75% of the categories right. So maybe it’s just hard to predict.

Or maybe we just haven’t seen the Nate Silver of Oscar forecasting yet.

Update

Apparently, Nate Silver is the Nate Silver of Oscar forecasting. His method and conclusions are largely the same as what’s posted here. That’s encouraging.