29 May 2011

Probability Management with Sample Distributions

This is a reprise of an earlier post. Not only was it not up to my usual quality, it was written before we had a good term for the kind of probability distribution that's central to Probability Management and avoiding the Flaw of Averages. "Sample Distribution" captures the key characteristic and fits with the 'sample' and 'sampling' that are an important part of it. Also, Sam Savage likes it; if the guy who invented the discipline approves, I know we have a winner.

A sample distribution quantifies the uncertainty in an uncertain variable. It's always a list of numbers--in programming terms a one-dimensional array or vector. Each element of the vector is a sample value drawn without bias from the possible values of the uncertain variable. For the rest of this article, when I use 'distribution' without qualification, it refers to a sample distribution.

When building a sample distribution, we do whatever it takes to make the probability of each sample in the distribution be close to 1/N, where there are N samples in the distribution. We're dealing with the real world, not a mathematical abstraction, so "close to" is appropriate.
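A minimal sketch of that idea in Python, using made-up observed values (the data and the resampling scheme are illustrative assumptions, not from the post): each of the N elements carries the same 1/N weight, drawn without bias from the observed data.

```python
import random

random.seed(1)

# Hypothetical observed values of an uncertain variable (assumed data).
observed = [12, 15, 9, 22, 18, 14, 11, 25, 16, 13]

# Build a sample distribution of N values: every element is drawn
# with equal probability from the data, so each sample's weight in
# the distribution is close to 1/N.
N = 1000
sample_dist = [random.choice(observed) for _ in range(N)]
```

The mean of `sample_dist` lands close to the mean of the observed data, and gets closer as N grows.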

As with all distributions, a sample distribution has a shape--what it looks like when used as the value set for a histogram.


Unlike parametric distributions--distributions defined by a random number generator, a formula, and some parameters--a sample distribution has a characteristic rank order. Rank order has no effect on the shape, or ordinary statistical measures like mean and median, but it's the whole show when correlation is a factor. Fiddling with rank order is a creative way to manufacture correlation.

A sample distribution can capture any parametric distribution to whatever level of precision is desired, but most sample distributions cannot be reduced to parametric distributions without introducing significant errors. A sample distribution is taken from the real world and can be as precise as needed by choosing how many samples to take and how carefully.

Another important difference is that you can do whatever arithmetic you want with sample distributions and get correct results. Anything you can do with ordinary numbers, you can do with sample distributions, using the same operations. This is not true of parametric distributions.

If you want to multiply two uncertain variables expressed as sample distributions, let's say marketPrice × sales to get revenue, all you need to do is get your computer to multiply each element of the one with the corresponding element of the other (in Excel you use an array formula or one of many add-ins). The result is the correct probability distribution for the revenue.

In general, whatever math you would do if you were dealing with single numbers like averages or actuals, you do with the corresponding elements of the distributions. Let's take a simple case of multiplication with only two uncertain variables and a few samples:

ua = [3,2,1,3]
ub = [5,7,9,4]
(ua × ub) = [15,14,9,12]

If you want to see the Flaw of Averages at work, compare mean(ua) × mean(ub) with mean(ua × ub).
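The comparison is quick to run. Using the vectors from the example, the elementwise calculation and the averages-first calculation disagree, which is the Flaw of Averages in miniature:

```python
ua = [3, 2, 1, 3]
ub = [5, 7, 9, 4]

# Elementwise product: the sample distribution of ua × ub.
product = [x * y for x, y in zip(ua, ub)]   # [15, 14, 9, 12]

def mean(v):
    return sum(v) / len(v)

mean_of_product = mean(product)            # 12.5
product_of_means = mean(ua) * mean(ub)     # 2.25 * 6.25 = 14.0625
```

Plugging the averages into the formula overstates the true mean of the product here by more than 12 percent.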

This is a Monte Carlo Simulation with four trials. We've taken four plausible cases of the real world data and done the calculation (run the model) for each case, preserving the results. Usually, we have hundreds or thousands of samples and trials. The more trials, the more precise the results. Since the simulation takes time, you don't want more trials than you need to get the precision you want, or than the source data warrants.
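A small sketch of that trials-versus-precision tradeoff, with assumed uniform inputs chosen only for illustration (for independent uniforms on [1,3] and [4,6], the true mean of the product is 2 × 5 = 10):

```python
import random

random.seed(4)

def mean(v):
    return sum(v) / len(v)

# One trial: draw one plausible value of each input and multiply.
# More trials give a tighter estimate of the product's mean.
def simulate(trials):
    return mean([random.uniform(1, 3) * random.uniform(4, 6)
                 for _ in range(trials)])

rough = simulate(10)        # noisy estimate
precise = simulate(100_000) # lands very close to 10
```

Past the point where the estimate is as precise as your source data, extra trials just burn time.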

The other problem sample distributions solve is dealing with related variables. In the example, you would not expect sales and price to be independent. It depends on the market, but you might find that changes in demand produce related changes in market price and sales. So, in the real world, the two are connected, and you can't just, willy-nilly, make separate random choices of price and sales and multiply them. You'd be modeling results that would never happen in the real world.

In the two distributions, the corresponding elements would have to be values that happened together in the real world.
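A sketch of the difference, with hypothetical paired observations (assumed data in which demand pushes price and sales up together). Multiplying corresponding elements keeps each pair as a state of the world that actually happened; an independent shuffle mixes states that never occur together.

```python
import random

random.seed(5)

def mean(v):
    return sum(v) / len(v)

# Hypothetical paired observations: price and sales rose together.
price = [9.0, 10.0, 11.0, 12.0, 13.0]
sales = [80, 100, 120, 140, 160]

# Correct: multiply corresponding elements of the two distributions.
revenue_paired = [p * s for p, s in zip(price, sales)]

# Wrong: break the pairing with an independent shuffle of sales.
shuffled = sales[:]
random.shuffle(shuffled)
revenue_independent = [p * s for p, s in zip(price, shuffled)]

# Breaking the pairing typically understates the mean revenue and
# clips the extremes, because highs no longer reinforce each other.
```

With these numbers the paired mean revenue is 1360, while any reshuffling of sales can only pull the mean down and shrink the top end.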

If your calculations matter, if they're going to be used to help make decisions, the results should accurately reflect the uncertainty in your data and deliver not just numbers, but the probabilities associated with them. The easiest and most reliable way to do this is to calculate with sample distributions.