Planning Games Using The Central Limit Theorem – Game Planning With Science! Part 4

In Part 4 of “Game Planning With Science”, I’m going to wrap up the statistics primer I started in Part 3. This time, I’ll cover one of the most fascinating aspects of statistics: the Central Limit Theorem. Why does one aspect of statistics deserve its own post? BECAUSE IT’S FRIGGIN’ RAD, THAT’S WHY! Also (and probably more importantly) it allows us to make predictions when planning games, even if we don’t have a lot of data.

The article image for “Video Game Art Pipelines” is from GraphicStock. Used under license.

“Given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population. Furthermore, all of the samples will follow an approximate normal distribution pattern, with all variances being approximately equal to the variance of the population divided by each sample’s size.” (Source: Investopedia)

This concept is known as the central limit theorem.

I had to learn about the central limit theorem from three separate sources and sit through a graduate school lecture about it before I finally understood it. If you’re feeling a wee bit perplexed at the moment, I hear ya. Khan Academy has a great video about it. If you want to experiment with the central limit theorem yourself, there’s a nifty simulation here.

Great, Now Give It to Me in English

The main takeaway is that you don’t need to know the probability distribution of a given variable (for example, the average height of an American men) to make predictions about it. Take a random sample of that variable (for instance, you sample the height of 4 random American men). Calculate the mean value of that sample and repeat that process (typically at least 30 times).

Graph those means. The resulting graph will resemble a normal distribution. As the number of such samples means approaches infinity (otherwise known as the limit), the graph becomes increasingly normal. The mean and standard deviation of the graph will be the same as the those of the population itself.

I said “In ENGLISH“

In plain English, a population can have any number of distributions. But random samples of that population will always combine form a normal distribution. This knowledge enables us to use the formulas typically for normal distributions to analyze variables that might not have normal distributions.

I’ll put it even more plainly. The mean is a representation of a “typical” instance of a variable. A typical American man, a typical high school, or a typical feature in a game, for example. We can use the fact that means will always have normal distributions to make predictions about the outcomes of those typical variables.

Applying the Central Limit Theorem to Averages

Why is this useful? Lets say you have 40 features remaining in your backlog. You’ve completed 50 features previously with a mean development time of 10 days and a standard distribution of 5 days. You need to achieve a mean development time of 8 days per feature moving forward in order to hit a particular milestone.

Boot up Excel and enter the following formula: “=NORM.DIST(8, 10, 5/sqrt(4o),0)“.

The answer is a roughly 2.05% chance of seeing that improved mean velocity under your current MO.

Dissecting NORM.DIST

Let’s walk through what’s happening in that Excel function:

NORM.DIST() is Excel’s function to determine the probability of a particular outcome on a normal distribution

The first parameter (8) is your target mean

The second (10) is your observed mean based on past data

The third (5/sqrt(4o)) is your standard deviation divided by the square root of the number of variables for which you need a mean. In short, the larger the group, the smaller the impact of standard deviations on the mean of that group. But it’s not a linear relationship

The final parameter, called “Cumulative” is set to 0, or false, to indicate that you want the probability of a particular value, not a range*. I’ll give an example with that parameter set to 1 later in the post.

The NORM.DIST function returns the probability of seeing a given value (the red streak) from a defined normal distribution

But Wait, There’s More: The Central Limit Theorem Also Applies to Sums

Now, imagine that you have a backlog with 100 features. Each has approximately the same variance in development time. You record the time it took you to complete all 100. We then hop into a trans-dimensional DeLorean and go skipping across the multiverse. We visit your studio in 100 parallel universes and record the time it took each of them to complete the same feature set.

If we were to plot a probability distribution based on all of those total elapsed outcomes, guess what shape it would take?

Yet again, we see the awesome power of the central limit theorem. It also applies to large sums. If you have the mean and standard deviation of a population, you can calculate the probability of seeing a particular sum from that population.

Applying the Central Limit Theorem to Sums

Back to the features example. You have 40 features to develop with a target delivery date 430 days from today. You know that the average feature takes 10 days to develop with a standard deviation of 5 days. The probability that I will deliver on or before that deadline can be calculated in Excel: “=NORM.DIST(540,40*10,5*sqrt(40),1)“, which returns 82.86%.

The first parameter (430) is the number of days until your target date

The second (40*10) is the number of features you need to develop, multiplied by the mean time to complete one feature. This is known as the expected value.

The third (5*sqrt(40))is the standard deviation multiplied by the square root of the number of features. Unlike averages, sums get more volatile as they get larger. They scale at the square root of the standard deviation.

1 sets the “cumulative” parameter to “true”

The Output

The NORM.DIST() functions returns the probability of you completing all of your features on or before the target date: 82.86%. It’s not the probability of delivering exactly 430 days from now. It’s the total probability of all of the outcomes between negative infinity and 430. In mathematical terms, it’s the total area under the curve between negative infinity and 430.

When the “Cumulative” parameter is set to 1, The NORM.DIST function returns the probability of seeing a range of values (the red area) from a defined normal distribution

Further, there are really only two categorical possibilities: you will make the date or you won’t. These two scenarios cover 100% of the possible outcomes (the red and blue areas of the graph above). If you want the probability of overrunning the target date, simply subtract the output of the NORM.DIST() function from 100%: 100%-82.86 = 17.14%.º

Going From Probability to Date

What if, instead of calculating the probability of hitting some date, you wanted to calculate a date that you can hit with some probability? For instance, what if you want to give a publisher a date that you can hit with 95% certainty? Easy enough: just use Excel’s NORM.INV() function.

Using the stats for feature development, you want to be 95% confident you can hit a date or deliver early. In Excel, you would enter “=NORM.INV(95%,40*10,5*sqrt(50))“, which returns 458.1543 days. This is the upper bound of a range of probable outcomes, the lower bound being negative infinity.

In other words, you are 95% confident that you can deliver the project somewhere between negative infinity and 458 days from now. Put differently, your are 95% confident, based on your available data, that 458 days is your worst case scenario.

NORM.INV returns the upper bound of a range that encapsulates a given probability

What NORM.INV is and is not Telling You

Again, please keep in mind that NORM.INV is not telling you you have an 95% chance of delivering exactly 458 days from now. It is telling you that the range of outcomes, from negative infinity to 458 days covers 95% of your probable outcomes. You can be 95% certain the actual outcome will fall somewhere in that range. Similar to the second NORM.DIST example, the total area under the normal distribution curve, up to and including 458, is equal to 95%.

What Are The Odds?: Confidence Intervals

The above formulas make an important simplification. Namely, that the mean and standard deviation you are seeing in your data are the true mean and standard deviation of the population. In other words, that the mean and standard deviation you see in the data are the “correct” values.

You can’t prove that the values you’re seeing are universally true. But you can quantify your confidence that you are in the ball park using a simple, two-step process.

Finding Your t-Statistic

First, you need to calculate a t-statistic. This is simply an estimate of how far your observed values are from the true value. Once, again, excel comes to the rescue. Use the TINV function: “=TINV(Probability, n-1)“, where

Probability is 100%, less your desired confidence level. If you want to be 95% confident, enter 5% here. If you want to be 99% confident, enter 1%.

n-1 is your sample size (the number of observations you have) minus 1. This value is called the degrees of freedom

In the features example, I want to be 95% sure of my forecast. I have a sample size of 50, so I would enter “=TINV(5%, 49)“, which gives me a t-statistic of 2.009

Calculating the Confidence Interval

Now, take that statistic and plug it into the following formulas:

Where X is the target mean (in the example above, 10 days) +/- the t-statistic (in this case, 2.009) multiplied by the observed standard deviation (5) over the square root of the number of observations (50). What this equation provides is your margin of error based on the your desired confidence level and the number of observations you have to work with. This margin of error is called your confidence interval.

A confidence interval (the red area) provides a range outcomes, centered on the mean, which encapsulates a desired probability (in our example, 95%)

One More For The Road: Confidence Intervals In Context

Once again, lets use the data points from the features example. With those values, the confident interval equations result in 8.57 and 11.42. While your observed mean is 10, you are 95% confident that the true mean falls somewhere between 8.57 and 11.42. That is your confidence interval.

Finally, to project this confidence interval across all 40 features, multiply the mean and the standard deviation by the number of features you’re trying to track.

Based on the data you have on hand, you can say with 95% confidence that you will finish the 40 features somewhere between 343 and 456 days in the future. You will deliver somewhere within a 113 day window.

That’s a wide, wide delivery window. But it’s quantifiable – as the physicist Marie Curie once said, “Nothing in life is to be feared, it is only to be understood.” By using the confidence interval formula, you now understand, to a quantifiable level of accuracy (95%, to be exact) over what range of time you will deliver. In investor terms, this variability in the forecasted delivery date is your quantified measure of risk based on what you currently know.

If you want to narrow the interval, you need to either gather more data to reduce the t-statistic, reduce the variability in feature development time, or accept a lower confidence level (eg, 80% instead of 95%).

Necessary Caveats

These techniques do come with a caveat. You need to make two simplifying assumptions about the world. For any one variable (time per feature, for instance, etc), you need to assume:

That individual samples have similar variances

That individual samples are independent, meaning the time it takes to develop one is not impacted by the time to develop another

These assumptions may not completely align with reality. The 2nd assumption is troublesome in particular, especially when it comes to feature development. One feature may very well impact the time required to deliver another.

But remember that you are trying to understand the real world via a reasonable approximation, not model it exactly. Unless you develop every feature simultaneously or don’t ever take dependencies into account when prioritizing work, then you can reasonably assume that the respective individual time required to develop most features is independent. This means that the variance you’re seeing in your data will capture (and thus account for) any instances protracted development due to dependencies.

Where Do We Go From Here?

The more data you have, the more confident you can be that the values you calculate represent the real world. This is known as the law of large numbers. So measure, benchmark, and repeat. Then refine your estimates over time.

Some of the grizzled veterans out there might be shaking their fists at the screen. How in the name of all that is holy do I expect you to have data this specific in the middle of the insanity of game development?

Key Takeaways

No matter the probability distribution of a single variable, if you take consistently-sized, random samples of that variable and plot the sums or averages of those samples on a graph, the plot will form a normal distribution with a sufficiently large number of samples (>30).

Because of the central limit theorem, even if you don’t know the probability distribution of a variable you are trying to measure you can still calculate the probability that an average variable will have some statistical value.

Key Excel Formulas

=NORM.DIST() returns the probability of an outcome being less than a designated value, given a particular mean and standard deviation

Also listed as =NORMDIST() in older versions of Excel

=NORM.INV() returns a maximum probable outcome based on a specific probability, expected value, and standard deviation

Also listed as =NORMINV() in older versions of Excel

=TINV() returns the appropriate t-statistic based on the desired confidence level and the degrees of freedom (sample size minus 1)

Key Equations

Confidence Interval using t-statistics:

Note: this post was based primarily on the lectures of Professor Ronen Gradwohl of Northwestern University’s Kellogg School of Management

* If you read Part 3, then you may recall that the probability of seeing any one particular value is 0 on a continuous distribution. When you set the “cumulative” parameter to false Excel is actually calculating the probability of seeing a very narrow range of outcomes.

† There’s an important takeaway here. For averages, you divide the standard deviation by the number of variables to decrease its impact. For sums, you multiply to increase it. In other words, averages get less variant as they increase in size under the central limit theorem, while sums get more variant.

º This is known as the compliment rule.

The distribution graphs in this post were developed using an Excel spreadsheet downloaded from Excel Is Fun!