Tuesday, March 31, 2015

I've been busy enjoying the Tournament and the discussions over at Kaggle, but I thought I'd take some time to run predictions for the Final Four.

#7 MSU vs. #1 Duke: Duke by 3.5 points

MSU has had (another) amazing Tournament run, but my predictor still considers them the weakest team in the field by a substantial margin.

#1 Wisconsin vs. #1 Kentucky: UK by 1.5 points

The predictor agrees with most pundits that Wisconsin is the second best team, and the most likely to knock off UK. 1.5 points is basically a toss-up. Wisconsin's ratings rose slightly following a good win over Arizona, and UK's dropped after a relatively poor performance against Notre Dame.

Monday, March 23, 2015

I'm just back from watching UCLA win two games in Louisville and am not yet caught up, but here's a quick update from the Machine Madness side of the competition.

Perhaps not surprisingly, Monte McNair leads the competition with 56 points, and I suspect he will win if Kentucky wins out. (Monte's in the Top Ten in the Kaggle competition right now.) BlueFool is in second place just a point behind Monte and is the only competitor with Duke as champion, so she'll likely win if that happens. Jason Sumpter is in third place but has Kentucky as champion, so he'll need some breaks to beat out Monte -- specifically, I think he needs Xavier to beat Wisconsin.

Nothing But Neural Net (great name, btw) is the only competitor with Wisconsin as champion. Likewise, I'm the only competitor with Arizona, so obviously we'll be rooting for those teams to win out.

Thursday, March 19, 2015

I'm off in Louisville to watch the first round games, so updating the blog is difficult, but I wanted to wish good luck to everyone in both the Kaggle and the Machine Madness contests. Enjoy the games!

Monday, March 9, 2015

When the NCAA Tournament rolls around there's an inevitable flurry of blog posts and news articles about some fellow or another who has predicted the Tournament outcome by running a Tournament simulation a million times! Now that's impressive!

Or maybe not.

These simulations are nothing more than taking someone's win probabilities (usually Pomeroy or Sagarin, since these are available with little effort) and then rolling a die against those probabilities for each of the 63 games. On a modern computer you can do this a million times in a second with no real strain.
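If you're curious what that amounts to in code, here's a minimal sketch (in Python, with the `teams` list and the `win_prob` lookup left as placeholders for whatever ratings you've borrowed):

```python
import random
from collections import Counter

def simulate_bracket(teams, win_prob):
    """Simulate one 64-team single-elimination bracket.

    teams    -- list of 64 team names in bracket order (placeholder)
    win_prob -- function (a, b) -> probability that team a beats team b,
                looked up from whatever ratings you borrowed
                (Pomeroy, Sagarin, ...)
    """
    current = teams
    while len(current) > 1:
        nxt = []
        for a, b in zip(current[::2], current[1::2]):
            # "Roll a die" against the borrowed win probability
            nxt.append(a if random.random() < win_prob(a, b) else b)
        current = nxt
    return current[0]  # the simulated champion

# The headline-grabbing part: do it a million times and tally the champions.
# champions = Counter(simulate_bracket(teams, win_prob) for _ in range(1_000_000))
```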

More importantly, though, does running this sort of simulation a million times actually reveal anything interesting?

Imagine that we decided to do this for just the title game. In our little thought experiment, the title game this year has (most improbably) come down to Duke versus Furman, thanks in no small part to Furman's huge upset of the University of Kentucky in their opening round game.

(Furman -- one of the worst teams in the nation, with only 5 wins in the lowly Southern Conference -- has somehow won through to their conference title game and actually does have a chance to get to the Tournament. If this happens, they'll undoubtedly be the worst 16 seed and will be matched up against UK in Louisville. So this is a totally plausible scenario.)

We look up the probability of Duke beating Furman in our table of Jeff Sagarin's strengths (or Ken Pomeroy's, whoever it was) and we see that Duke is favored to win that game 87% of the time. So now we're ready to run our simulation.

We run our simulation a million times. No, wait. We want to be as accurate as possible for the Championship game, so we run it ten million times.

(We have plenty of time to do this while Jim Nantz narrates a twenty minute piece on the unlikely Furman Paladins and their quixotic quest to win the National Championship. This includes a long interview with a frankly baffled Coach Calipari.)

We anxiously watch the results tally as our simulation progresses. (Or rather we don't, because the whole thing finishes before we can blink, but I'm using some dramatic license here.) Finally our simulation is complete, and we proudly announce that in ten million simulated games, Duke won 8,700,012 of them! Whoo hoo!

But wait.

The sharp-eyed amongst you might have noticed that Duke's 8,700,012 wins out of 10,000,000 is almost exactly the same percentage as the original winning probability we borrowed from Ken Pomeroy. (Or Jeff Sagarin, whoever it was.) Well, no kidding. It had better be, or our random number generator is seriously broken.

Welcome to the Law of Large Numbers. To quote Wikipedia: "[T]he average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed." The more times we run this "simulation" the closer we'll get to exactly 87%.
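If you want to watch the Law of Large Numbers at work on our thought experiment, a toy loop like this is all it takes (the 87% is, of course, just the number we borrowed):

```python
import random

p_duke = 0.87  # the borrowed win probability

for n in (100, 10_000, 10_000_000):
    wins = sum(random.random() < p_duke for _ in range(n))
    print(f"{n:>10,} trials: Duke wins {wins / n:.4%}")  # creeps toward 87%
```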

This is why the whole notion of "simulating" the tournament this way is silly. The point of doing a large number of trials (simulations) is to reveal the expected value. But we already know the expected value: it's the winning probability we stole from Jeff Sagarin. (Or Ken Pomeroy, whoever it was.) It's just a waste of perfectly good random numbers to get us back to the place we started.

To be fair, there's one reason that it makes some sense to do this for the entire Tournament. If for some reason you want to know before the Tournament the chances of a particular team winning the whole thing, then this sort of simulation is a feasible way to calculate that result. (Or if you're Ed Feng you create this thing.) And if that's your goal, I give you a pass.

On the other hand, if you're doing all this simulation to fill out a bracket for (say) the Machine Madness competition, then it makes more sense to run your simulation for a small number of trials. The number of trials is essentially a sliding control between Very Random (1 trial) at one end and Very Boring (1 billion trials) at the other. Arguably it is good meta-strategy in pool competitions not to predict the favorite in every game, so by lowering the number of trials you can inject some randomness into your entry. (I don't think this is necessarily a good approach, but at least it is rational.)
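To make that sliding control concrete, here's one hypothetical way to use it, reusing the `simulate_bracket` sketch from above: simulate a handful of brackets and take a majority vote, so a small trial count keeps some upsets in your entry while an enormous one collapses back to the chalk.

```python
from collections import Counter

def pick_champion(teams, win_prob, n_trials):
    """Majority-vote champion over n_trials simulated brackets.

    n_trials = 1             -> Very Random: whatever one simulated bracket says
    n_trials = 1_000_000_000 -> Very Boring: effectively always the favorite
    """
    tally = Counter(simulate_bracket(teams, win_prob) for _ in range(n_trials))
    return tally.most_common(1)[0][0]
```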

Wednesday, March 4, 2015

I've recently put up a few posts about the Kaggle competition including one about reasonable limits to performance in the contest. So it's natural to wonder how I'm doing / have done in the Kaggle competition.

Fair enough.

Last year, my entry ended up finishing at 60th on the Kaggle leaderboard, with a score of 0.57. At one point that was exactly at the median benchmark, but apparently the post-contest cleanup of DQed entries changed that slightly. 2014 wasn't a particularly good year for my predictor. Here are the scores for the other seasons since 2009:

| Year | Score |
|------|-------|
| 2009 | 0.46  |
| 2010 | 0.53  |
| 2011 | 0.62  |
| 2012 | 0.52  |
| 2013 | 0.51  |

2014 was my worst year since 2011. (2011 was the Chinese Year of the Upset, with a Final Four of a #3, #4, #8 and #11 seed.) Ironically, I won the Machine Madness contest in 2011 because my strategy in that contest includes predicting some upsets; this led to correctly predicting Connecticut as the champion.

My predictor isn't intended specifically for the Tournament. It's optimized for predicting Margin of Victory (MOV) for all NCAA games. This includes the Tournament, but those games are such a small fraction of the overall set of games that they don't particularly influence the model. There are some things I could do to (hypothetically) improve the performance of my predictor for the Kaggle competition. For one thing, I could build a model that tries to predict win percentages directly, rather than translating from predicted MOV to win percentage. Secondly, since my underlying model is a linear regression, I implicitly optimize RMSE. I think it's likely that a model that optimizes on mean absolute error would do better[1] but I haven't yet found a machine learning approach that can create a model optimized on mean absolute error with performance equaling linear regression.
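For illustration only (this is a standard recipe, not necessarily exactly what my model does), the usual way to turn a predicted MOV into a win probability is to treat the actual margin as normally distributed around the prediction; a standard deviation of around 11 points is the figure commonly quoted for college basketball:

```python
from math import erf, sqrt

SIGMA = 11.0  # assumed spread of actual margins around the predicted MOV

def mov_to_win_prob(predicted_mov, sigma=SIGMA):
    """P(actual margin > 0) if margins are Normal(predicted_mov, sigma)."""
    return 0.5 * (1.0 + erf(predicted_mov / (sigma * sqrt(2.0))))

# e.g. a predicted 3.5-point edge comes out to roughly a 62% win probability
# print(mov_to_win_prob(3.5))
```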

I haven't put much effort into building a "Tournament optimized" predictor because (as I have pointed out previously) there is a large random element to the Tournament performance. Any small gains I might make by building a Tournament-specific model would be swamped by the random fluctuations in the actual outcomes.

[1] I say this because RMSE weights outliers more heavily. Although there are a few matchups in the Tournament between teams with very different strengths (i.e., the 1-16 and 2-15 matchups in particular), in general you might suppose that there are fewer matchups of this sort than in the regular season, and that being slightly more wrong on these matchups won't hurt you much if you're also slightly more correct on the rest of the Tournament games. That's just speculation on my part, though.

Monday, March 2, 2015

The first stage of the Kaggle competition involves Kagglers testing out their models against data from the past few basketball seasons, and these scores appear on the first stage leaderboard. Invariably new Kagglers make some fundamental mistakes and end up submitting entries with unreasonably good performance. The administrators of the contest have taken to removing these entries to avoid discouraging other competitors. The line for removing entries is somewhat fuzzy, and it begs the question[1] "What is a reasonable long-term performance for a Tournament predictor?" There are probably many ways to answer this question,[2] but here's one approach that I think is reasonable: Calculate the performance of the best possible predictor over an infinite number of Tournaments.

I am reminded at this point of an old joke.

A man is sitting in a bar complaining to his friend -- who happens to be a physicist -- about his awful luck at the racing track, and wishing he had some better way to know what horse was going to win each race.

"Well, that strikes me as a rather simple physics problem," his friend says. "I'm sure I could build a model to predict the outcome.""Really?" says the man, visibly excited. "That's fantastic. We'll both get rich!"So the physicist goes off to build his model. After a week, the man has still heard nothing, so he calls his friend. "How are you doing on the model?" he asks."Well," says the physicist. "I admit that it is turning out to be a bit more complicated than I imagined. But I'm very close.""Great," says the man. "I can't wait!"But another week goes by and the man hears nothing, so he calls again."Don't bother me," snaps the physicist. "I've been working on this day and night. I'm very close to a breakthrough!"So the man leaves his friend alone. Weeks pass, when suddenly the man is awakened in the middle of the night by a furious pounding on his front door. He opens the door and sees his friend the physicist. He looks terrible -- gaunt and strained, his hair a mess -- and he is clutching a sheaf of crumpled papers. "I have it!" he shouts as the door opens. "With this model we can predict the winner of any horse race!"The man's face lights up. "I can't believe you did it," he says. "Tell me how it works.""First of all," says the physicist, "we assume that the horses are perfect spheres racing in a vacuum..."

Like the physicist, we face a couple of difficulties. For one thing, we don't have the best possible predictor. For another, we don't have an infinite set of Tournaments. No matter, we shall push on.

We don't have the best possible predictor (or even know what its performance would be) but we do have some data from the best known predictors and we can use that as a substitute. The Vegas opening line is generally acknowledged to be the best known predictor (although a few predictors do manage to consistently beat the closing line, albeit by small margins). The Vegas opening line predicts around 74% of the games correctly "straight up" (which is what the Kaggle contest requires). I'm personally dubious that anyone can improve upon this figure significantly[3] but for the sake of this analysis let's assume that the best possible predictor can predict an average game[4] correctly 80% of the time.

We also don't have an infinite number of Tournaments to predict, but we can assume that the average score on an infinite number of Tournament games will tend towards the score on an average Tournament game. For the log-loss scoring function, the best score in the long run comes from predicting our actual confidence (the 80% from above). If we predict an infinite number of games at 80% and get 80% of those games correct, our score is:

`-(0.80*log(0.80) + (1-0.80)*log(1-0.80))`

which turns out (fortuitously) to be just about 0.50. (If we use a performance of 74%, the score is about 0.57.)
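Here's that arithmetic in a couple of lines, if you want to check it or plug in other accuracies:

```python
from math import log

def expected_log_loss(p):
    """Long-run log loss when you predict p and are right a fraction p of the time."""
    return -(p * log(p) + (1 - p) * log(1 - p))

print(expected_log_loss(0.80))  # ~0.500
print(expected_log_loss(0.74))  # ~0.573
```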

This analysis suggests that the theoretical best score we can expect predicting a large number of Tournament games is around 0.50 (and probably closer to 0.57). This agrees well with last year's results -- the winner had a score of about 0.52 and the median score was about 0.58.

As far as "administrative removal" goes, there are 252 scored games in the Kaggle stage one test set. That's not an infinite set of games, but it is enough to exert a strong regression towards the mean. The Kaggle administrators are probably justified in removing any entry with a score below 0.45.

On a practical level, if your predictor is performing significantly better than about 0.55 for the test data, it strongly suggests that you have a problem. The most likely problems are that you are leaking information into your solution or that you are overfitting your model to the test data.

Or, you know, you could be a genius. That's always a possibility.

[1] Yes, I know I'm misusing "beg the question".

[2] I suspect that a better approach treats the games within the Tournament as a normal distribution and sums over the distribution to find the average performance, but that's rather too much work for me to attempt.

[3] If for no other reason than that Vegas has a huge financial incentive to improve this number if they could.

[4] The performance of the Vegas line is an average over many games. Some games (like huge mismatches) the Vegas line predicts better than 74%; some (like very close matchups) it predicts closer to 50%. I'm making the simplifying assumption here that the average over all the games corresponds to the performance on an average game. Later on I make the implicit assumption that the distribution of Tournament games is the same as the distribution of games for which we have a Vegas line. You can quarrel with either of these assumptions if you'd like. A quick analysis of the Tournament games since 2006 shows that the Vegas line is only right 68% of the time, suggesting that Tournament games may be harder to predict than the average game.