This week, thanks to Amazon, who replaced my unreadable Kindle copy of David W Miller's Fitting Frequency Distributions: Philosophy and Practice with a dead-tree version hefty enough to be used as a weapon (assuming you had the strength to wield it), I've been reminded of the importance of motivating my distributional choices with a plausible narrative. It's not good enough, he contends, to find that, say, a Gamma Distribution fits your data set really well; you should be able to explain why it's an appropriate choice from first principles.

This has troubled me for a while when it comes to game margins, as the Normal distribution is most often motivated by an argument that the variable in question arises from the sum of a large number of independent random variates - which seems hard to apply to AFL football with any finesse. Exactly what accumulated, independent "errors" lead to an excess of one team's score over another relative to expectations? Defensive lapses perhaps? Granted, it's been found that the Normal distribution is broadly applicable to a range of phenomena where the accumulated errors explanation seems equally tenuous or completely non-viable, but it would be satisfying to build a narrative that led to, rather than assumed, the Normal distribution for game margins.

So I found myself this week wondering what that simpler, more empirical motivation might be.

Scoring shots, I surmised, might be thought of as "rare events" - and if you're a Demons fan, they're especially rare in all-too-many weeks - and they're clearly integers, which leads naturally to the Poisson Distribution as a candidate. One of the more famous uses of that distribution was in fitting the number of soldiers in the Prussian army killed accidentally by horse kicks each year over a 20 year period.

These Scoring Shot opportunities are converted into Goals - and the unconverted opportunities into Behinds - a bit like a Binomial Distribution, but with some potential correlation between the conversion of one shot and another, which leads directly to the Beta-Binomial with an overdispersion parameter as a promising choice for the Goals distribution. The need to account for the potential correlation between Scoring Shot conversions for a given team stems purely from empirical observation - while teams might, on average, convert about 53% of their opportunities in the long run, on some days or for some portions of games that conversion rate appears to be elevated or depressed. Knowing that a team had converted one particular Scoring Shot might prompt you to lift your estimation of how likely they were to convert another.

What I'm proposing is summarised in the equations that appear above, where, in the Scoring Shot equation, lambda is the expected number of Scoring Shots for a team and, in the Goals equation, rho is the probability of converting any single Scoring Shot and theta is the overdispersion parameter that allows for correlation between the conversion of one shot and another.
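For reference, the two distributions can be written out as below. This is a reconstruction from the definitions in the text rather than a copy of the original equations, and it assumes the (probability, theta) parameterisation of the Beta-Binomial, in which the two Beta shape parameters are rho times theta and (1 - rho) times theta. Under that assumption a larger theta means less overdispersion. The symbols S (Scoring Shots), G (Goals) and B (the Beta function) are labels chosen here for illustration.

```latex
% Scoring Shots for a team: Poisson with mean lambda
P(S = s) = \frac{e^{-\lambda}\,\lambda^{s}}{s!}, \qquad s = 0, 1, 2, \ldots

% Goals given s Scoring Shots: Beta-Binomial with conversion probability rho
% and dispersion parameter theta (assumed parameterisation)
P(G = g \mid S = s) = \binom{s}{g}\,
  \frac{B\bigl(g + \rho\theta,\; s - g + (1-\rho)\theta\bigr)}
       {B\bigl(\rho\theta,\; (1-\rho)\theta\bigr)}, \qquad g = 0, 1, \ldots, s

% Team score: 6 points per goal, 1 point per behind
\text{Score} = 6G + (S - G)
```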

That's all very well theoretically, but this entire post was inspired by an empiricist, so let's fit these distributions to some real data, in particular the scoring data for home teams and away teams from 2000 to Round 14 of 2014. I'll assume that the Expected Scoring Shot parameter - the lambda in the Poisson distribution - varies independently across games and between the home and the away teams in any given game, but I'll also assume that there is some (negative) correlation between the Scoring Shot production of the home and the away team in any single game, so that an unexpected excess of Scoring Shot production by one team tends to lead to an unexpected paucity of Scoring Shot production by the other. (Technically, I'm therefore assuming that Scoring Shot production for the two teams in a game can be modelled as a bivariate Poisson with a mean vector equal to the Expected Scoring Shot rates and a covariance matrix with the mean vector entries on the diagonal - the Poisson distribution having its mean equal to its variance - and some non-zero entries on the off-diagonals.)
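Written out, the mean vector and covariance matrix described in that parenthesis look like this. The subscripts H and A (home and away) and the symbol r for the correlation between the two teams' Scoring Shot counts are notation introduced here, not taken from the post.

```latex
\mathbb{E}\begin{pmatrix} S_H \\ S_A \end{pmatrix}
  = \begin{pmatrix} \lambda_H \\ \lambda_A \end{pmatrix},
\qquad
\operatorname{Cov}\begin{pmatrix} S_H \\ S_A \end{pmatrix}
  = \begin{pmatrix}
      \lambda_H & r\sqrt{\lambda_H \lambda_A} \\
      r\sqrt{\lambda_H \lambda_A} & \lambda_A
    \end{pmatrix},
\qquad r < 0
```

The off-diagonal entry is just the correlation multiplied by the two standard deviations, which for Poisson variables are the square roots of the lambdas.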

While the parameters of the Scoring Shot production distribution are assumed to vary from game to game, those of the Goals distribution are assumed to be the same for all home teams across games, and the same for all away teams across games.

This assumption allows me to fit Beta-Binomials to home team and away team Scoring Shot data across the entire 15 season period, which I do using the mle2 function of the bbmle package in R to obtain the following estimates:
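To make the fitting step concrete, here is a minimal Python sketch of a Beta-Binomial maximum likelihood fit. It is not the R code used for the post: it swaps mle2 for a crude grid search, it runs on synthetic data generated inside the script (the real game data isn't reproduced here), and it assumes the parameterisation in which the Beta shape parameters are prob * theta and (1 - prob) * theta. All function and variable names are made up for the example.

```python
import math
import random


def log_beta(x, y):
    """Log of the Beta function, via log-gamma."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def betabinom_logpmf(k, n, prob, theta):
    """Log P(k goals from n shots); shapes assumed to be prob*theta and (1-prob)*theta."""
    a, b = prob * theta, (1.0 - prob) * theta
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + a, n - k + b) - log_beta(a, b)


def fit_by_grid(games, probs, thetas):
    """Return the (prob, theta) grid point with the highest log-likelihood."""
    best = None
    for prob in probs:
        for theta in thetas:
            ll = sum(betabinom_logpmf(k, n, prob, theta) for n, k in games)
            if best is None or ll > best[0]:
                best = (ll, prob, theta)
    return best[1], best[2]


# Synthetic data only: 1,000 made-up games with known "true" parameters.
rng = random.Random(2014)
true_prob, true_theta = 0.536, 200.0
games = []
for _ in range(1000):
    n = rng.randint(18, 35)  # scoring shots in this made-up game
    p_game = rng.betavariate(true_prob * true_theta, (1 - true_prob) * true_theta)
    k = sum(rng.random() < p_game for _ in range(n))  # goals kicked
    games.append((n, k))

probs = [0.50 + 0.002 * i for i in range(41)]        # 0.500 to 0.580
thetas = [25.0, 50.0, 100.0, 200.0, 400.0, 800.0]    # coarse dispersion grid
prob_hat, theta_hat = fit_by_grid(games, probs, thetas)
print(f"estimated conversion probability: {prob_hat:.3f}, dispersion: {theta_hat:.0f}")
```

The conversion probability is pinned down well by a sample of this size, while the dispersion parameter is much harder to estimate, which is why the sketch only searches a coarse grid for it.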

Home team Conversion probability: 53.6%

Home team Dispersion parameter: 228.9

Away team Conversion probability: 53.5%

Away team Dispersion parameter: 220.2

All of which means that, empirically, home teams convert their Scoring Shots at a marginally higher rate than away teams, and that the standard deviation of the goals produced from a given quantity of opportunities is slightly smaller for home teams than for away teams (a consequence of the larger dispersion parameter for home teams).
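The size of that difference can be checked directly. The sketch below uses the standard Beta-Binomial variance, n p (1 - p) (theta + n) / (theta + 1), which holds under the assumption that the fitted parameters use the prob/theta parameterisation (shape parameters prob * theta and (1 - prob) * theta). The choice of 25 Scoring Shots is purely illustrative.

```python
import math


def goals_sd(n_shots, prob, theta):
    """Standard deviation of goals from n_shots under a Beta-Binomial(prob, theta)."""
    variance = n_shots * prob * (1 - prob) * (theta + n_shots) / (theta + 1)
    return math.sqrt(variance)


n_shots = 25  # illustrative number of Scoring Shots
home_sd = goals_sd(n_shots, 0.536, 228.9)
away_sd = goals_sd(n_shots, 0.535, 220.2)
# Plain Binomial benchmark (no overdispersion) at the home conversion rate.
binomial_sd = math.sqrt(n_shots * 0.536 * (1 - 0.536))

print(f"home SD of goals:  {home_sd:.3f}")
print(f"away SD of goals:  {away_sd:.3f}")
print(f"plain Binomial SD: {binomial_sd:.3f}")
```

Both standard deviations sit only a little above the plain Binomial figure, so the fitted overdispersion is modest, and the gap between home and away teams is smaller still.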

So, I now have empirical distributions to use for Goals scored but, before I can run any simulations, I need to decide on the lambdas and the covariance matrix to use for the bivariate Poisson that will generate Scoring Shots for that distribution.

For the lambdas I'll use the all-game averages for home and for away teams which turn out to be 26.7 Scoring Shots per game for home teams and 24.4 Scoring Shots per game for away teams. For the Poisson distribution to be appropriate it needs to be assumed that the variance of home team Scoring Shot production is the same as the mean, and that the same is true of the away team Scoring Shot production. Absent an empirical model for Expected Scoring Shot production for a series of games, I can't test the appropriateness of this assumption directly from the data I have. More on that in a future blog.

Lastly, I'm going to set the correlation between home team and away team Scoring Shot production to be equal to the correlation between these two statistics across the period in question, which is -0.36. This estimate, I recognise, is not a direct estimate of the correlation we really need, which should be calculated about the Expected Scoring Shot production in each game rather than about the all-game average Scoring Shot production, but again this is all I have for now. More too on this in future.

Returning to our empirical bent, we now ask what happens if we simulate 1,000,000 games where the Expected Scoring Shot production for the home and the away teams is as per the bivariate Poisson parameterised as described, and the conversion of those opportunities follows the Beta-Binomials with the empirical parameters also just described? Specifically, what does the distribution of the home team score less the away team score (ie the game margin) look like?
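A simulation along these lines can be sketched in Python as follows. Two things in it are assumptions rather than the method used for the post: the post doesn't say how the negatively correlated Poisson counts were generated, so this sketch uses a Gaussian copula (correlated standard Normals pushed through the Poisson quantile function), and it treats the fitted Beta-Binomial parameters as prob/theta values. Because the -0.36 is applied to the underlying Normals, the correlation between the resulting counts will be slightly weaker than -0.36. The sketch also runs 20,000 games rather than 1,000,000 to keep the run time short.

```python
import math
import random

HOME = {"lam": 26.7, "prob": 0.536, "theta": 228.9}
AWAY = {"lam": 24.4, "prob": 0.535, "theta": 220.2}
LATENT_CORR = -0.36  # applied to the latent Normals (copula assumption)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k = 0
    term = math.exp(-lam)
    cdf = term
    while cdf < u and k < 1000:  # cap guards against rounding when u is ~1
        k += 1
        term *= lam / k
        cdf += term
    return k


def team_score(shots, prob, theta, rng):
    """Convert Scoring Shots to a score via a Beta-Binomial draw for goals."""
    if shots == 0:
        return 0
    p_game = rng.betavariate(prob * theta, (1 - prob) * theta)
    goals = sum(rng.random() < p_game for _ in range(shots))
    return 6 * goals + (shots - goals)  # 6 points per goal, 1 per behind


def simulate_margins(n_games, seed=1):
    rng = random.Random(seed)
    margins = []
    for _ in range(n_games):
        z_home = rng.gauss(0.0, 1.0)
        z_away = LATENT_CORR * z_home + math.sqrt(1 - LATENT_CORR ** 2) * rng.gauss(0.0, 1.0)
        shots_home = poisson_quantile(normal_cdf(z_home), HOME["lam"])
        shots_away = poisson_quantile(normal_cdf(z_away), AWAY["lam"])
        home = team_score(shots_home, HOME["prob"], HOME["theta"], rng)
        away = team_score(shots_away, AWAY["prob"], AWAY["theta"], rng)
        margins.append(home - away)
    return margins


def moments(values):
    """Mean, standard deviation, skewness and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2


margins = simulate_margins(20_000)
mean, sd, skew, kurt = moments(margins)
print(f"mean {mean:.2f}, sd {sd:.2f}, skewness {skew:.3f}, kurtosis {kurt:.3f}")
```

With only 20,000 games the skewness and kurtosis will bounce around more than in a million-game run, so expect agreement with the figures quoted in the post to the first decimal place or so rather than the third.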

It looks, as the density and QQ-plot below show, a lot like a Normal Distribution, except perhaps in the tails.

As further confirmation of the Normal-like nature of the distribution we find that the skewness is 0.008 (the Normal's is zero), and the kurtosis is 3.039 (the Normal's is 3). So it's quacking a fair bit like a duck (with a Normally-distributed beak). The standard deviation of the game margin is 35.9 points per game, which is consistent with estimates of this parameter that we've derived empirically in the past. The mean of the distribution - the "handicap" that a Bookmaker would set, if you like - is 8.6 points (2.3 extra scoring shots at about 3.7 points per shot) and the home team score exceeds the away team score with probability 58.9%.
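The mean and the home win probability can be sanity-checked with a few lines of arithmetic. The sketch below computes the expected margin from the lambdas and conversion rates quoted earlier, then asks what a Normal distribution with that mean and the simulated standard deviation of 35.9 would imply for the chance of a home win. The half-point continuity correction (treating a win as a margin above 0.5) is an assumption made here to account for margins being whole numbers.

```python
import math


def points_per_shot(prob):
    """Expected points from one Scoring Shot: 6 for a goal, 1 for a behind."""
    return 6 * prob + 1 * (1 - prob)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


home_expected = 26.7 * points_per_shot(0.536)
away_expected = 24.4 * points_per_shot(0.535)
expected_margin = home_expected - away_expected

margin_sd = 35.9  # standard deviation reported from the simulation
# P(margin >= 1) under a Normal approximation with continuity correction.
p_home_win = 1.0 - normal_cdf((0.5 - expected_margin) / margin_sd)

print(f"expected home score:   {home_expected:.1f}")
print(f"expected away score:   {away_expected:.1f}")
print(f"expected margin:       {expected_margin:.2f}")
print(f"Normal-approx P(home win): {p_home_win:.3f}")
```

That the Normal approximation lands so close to the simulated win probability is another small piece of evidence for the Normal-like shape of the margin distribution.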

So, if we simulate the results using a fixed set of parameters for the bivariate Poisson and for the Beta-Binomial we obtain a distribution for the game margin that looks suspiciously Normal. What if we use the Beta-Binomial with the same parameters and we use the same correlation between home team and away team Scoring Shot production for the bivariate Poisson, but systematically vary the expected number of Scoring Shots for the home and the away teams (the lambdas in the earlier equations)? Specifically, I allowed the expected number of Scoring Shots to vary between 15 and 40 for the home and the away teams, and then simulated 10,000 games with each set of parameters.

For these simulations, the skewness for every set of Scoring Shot parameters was always in the range (-0.09, +0.08) and the kurtosis in the range (2.98, 3.08). These ranges are very consistent with Normal-like distributions.

Another interesting aspect of the simulations is the empirical standard deviation of the game margin (ie of home score less away score).

Across the entire range of simulated Expected Scoring Shots for the home and the away teams, the empirical standard deviation ranged from about 27 to about 45 points, but for the more realistic set of Expected Scoring Shots, which are those boxed in the chart above, the empirical standard deviations range only from about 33 to 39 points. These boxed games correspond to Expected Scores of 84-113 points for the home team and 73-102 points for the away team.
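These standard deviations can also be approximated without simulation. The sketch below applies the law of total variance to each team's score (Scoring Shots Poisson, goals Beta-Binomial given shots) and then combines the two teams using the -0.36 correlation between their Scoring Shot counts. It assumes the prob/theta parameterisation of the Beta-Binomial and that, given the shot counts, the two teams convert independently. It is an analytic cross-check written for this purpose, not the calculation behind the chart.

```python
import math

HOME_PROB, HOME_THETA = 0.536, 228.9
AWAY_PROB, AWAY_THETA = 0.535, 220.2
SHOT_CORR = -0.36  # correlation between home and away Scoring Shot counts


def score_variance(lam, prob, theta):
    """Variance of a team's score, where score = shots + 5 * goals."""
    # E[Var(goals | shots)] uses E[S * (theta + S)] = lam*theta + lam + lam^2.
    expected_goal_var = prob * (1 - prob) * (lam * theta + lam + lam ** 2) / (theta + 1)
    # Var(E[score | shots]) = (1 + 5*prob)^2 * Var(shots), and Var(shots) = lam.
    return 25 * expected_goal_var + (1 + 5 * prob) ** 2 * lam


def margin_sd(lam_home, lam_away):
    """Approximate standard deviation of home score minus away score."""
    var_home = score_variance(lam_home, HOME_PROB, HOME_THETA)
    var_away = score_variance(lam_away, AWAY_PROB, AWAY_THETA)
    shot_cov = SHOT_CORR * math.sqrt(lam_home * lam_away)
    score_cov = (1 + 5 * HOME_PROB) * (1 + 5 * AWAY_PROB) * shot_cov
    return math.sqrt(var_home + var_away - 2 * score_cov)


for lam_home, lam_away in [(15, 15), (26.7, 24.4), (40, 40)]:
    print(f"lambdas ({lam_home}, {lam_away}): margin SD ~ {margin_sd(lam_home, lam_away):.1f}")
```

Because the covariance is negative, it adds to the variance of the margin rather than subtracting from it, which is part of why the margin standard deviation is as large as it is.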

In summary then, characterising team scores as resulting from:

a bivariate Poisson distribution with fixed covariance and game-specific lambdas for the home and away teams' expected Scoring Shot production

a Beta-Binomial distribution with fixed conversion rate and overdispersion parameters

produces realistic simulations of game scores that are consistent with empirical evidence and that generate Normal-like distributions for game margins with standard deviations similar to those we've estimated previously. That's what I set out to create.

I plan to explore this approach to characterising game scores more in future blogs, but for today I'll finish with one more interesting table from the simulations, this one recording the probability of a draw for games with varying Expected Scoring Shots for the home and the away teams.

Implications of note from this table are that:

Draws are more likely in low-scoring games than in high-scoring games. For example, a game in which the home and away teams are each expected to generate 25 scoring shots has an estimated 1.09% chance of finishing in a draw whereas a game in which both teams are expected to generate 35 scoring shots has only an estimated 0.95% chance of finishing in a draw. Even a game where the home team and away team are somewhat mismatched, such that the home team is expected to generate 25 scoring shots and the away team only 20, has a higher chance of finishing in a draw than that high-scoring game.

Even in the drabbest of low-scoring games - played, say, in torrential rain - where each team is expected to generate only 15 scoring shots, the probability of a draw is only 1.5%.
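A rough check on these draw probabilities is to treat the margin as Normal and take the probability that it falls between -0.5 and +0.5. The sketch below does that, with the mean and standard deviation of the margin derived analytically from the model parameters quoted earlier. The Normal approximation, the prob/theta parameterisation of the Beta-Binomial and the half-point band are all assumptions of this sketch, so its output is an approximation to, not a reproduction of, the simulated table.

```python
import math

HOME_PROB, HOME_THETA = 0.536, 228.9
AWAY_PROB, AWAY_THETA = 0.535, 220.2
SHOT_CORR = -0.36


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def score_variance(lam, prob, theta):
    """Variance of score = shots + 5 * goals (Poisson shots, Beta-Binomial goals)."""
    goal_var = prob * (1 - prob) * (lam * theta + lam + lam ** 2) / (theta + 1)
    return 25 * goal_var + (1 + 5 * prob) ** 2 * lam


def draw_probability(lam_home, lam_away):
    """Normal-approximation probability that the margin is exactly zero."""
    pts_home, pts_away = 1 + 5 * HOME_PROB, 1 + 5 * AWAY_PROB
    mean = lam_home * pts_home - lam_away * pts_away
    cov = pts_home * pts_away * SHOT_CORR * math.sqrt(lam_home * lam_away)
    var = (score_variance(lam_home, HOME_PROB, HOME_THETA)
           + score_variance(lam_away, AWAY_PROB, AWAY_THETA)
           - 2 * cov)
    sd = math.sqrt(var)
    return normal_cdf((0.5 - mean) / sd) - normal_cdf((-0.5 - mean) / sd)


for lam_home, lam_away in [(15, 15), (25, 25), (25, 20), (35, 35)]:
    p = draw_probability(lam_home, lam_away)
    print(f"lambdas ({lam_home}, {lam_away}): P(draw) ~ {100 * p:.2f}%")
```

The ordering it produces matches the pattern described above: the fewer Scoring Shots expected, the narrower the margin distribution and the more probability mass sits at zero.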

(Footnote: Back in 2010 I did write a series of blogs on modelling AFL Scoring commencing with this post from April 24. Those posts, while empirically motivated, unlike this post made no distributional assumptions save for assuming that scoring shots were converted with a fixed probability.)