This entry is a follow-on to my most recent entry. The idea is that random variables are the way to describe the uncertain quantities that arise in managing development efforts. They are a natural extension of the fixed variables we all grew up with. In fact, a fixed variable is a random variable that has probability one of taking a given value and probability zero of taking any other value. In this entry, I explain how you can (with computer assistance) calculate with random variables.

Suppose you want to add two random variables, v1 and v2. This need might arise if you have two serialized tasks in a Gantt chart, each described by a distribution as explained in the previous entry, and you would like to know how long it would take to complete both of them, i.e., the sum of their durations.

How would you proceed? First note that the sum would be another random variable. Therefore what you need is the probability distribution of the sum. There is no simple formula for that distribution, but there is an effective, commonly used numerical approach known as Monte Carlo simulation.

The idea behind Monte Carlo simulation is to use a pseudorandom number generator to take a sample value of v1 and a sample value of v2 and then add them. For more detail, follow this link. Note that the values are selected according to the probability distribution of each variable: the more likely values are taken more often. Now save that sum, do the same thing many times, say 100,000 times, and store each of the sums. You can then compute the probability of each sum by counting its frequency in the collection of saved sums (some sums are more frequent than others) and dividing by the number of samples (in practice you have to round the sums to get the counts). What you get is an approximation of the distribution of the sum.
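The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the Focal Point implementation; the uniform sampling functions here are stand-ins for whatever distributions v1 and v2 actually have.

```python
import random
from collections import Counter

N = 100_000

# Stand-in sampling functions for v1 and v2; any distribution works here.
def sample_v1():
    return random.uniform(3, 7)

def sample_v2():
    return random.uniform(1, 7)

# Sample each variable, add, and save the sum -- many times.
sums = [sample_v1() + sample_v2() for _ in range(N)]

# Round each sum, count frequencies, and divide by the number of samples
# to approximate the probability distribution of the sum.
dist = {s: c / N for s, c in Counter(round(s, 1) for s in sums).items()}

# The approximated probabilities sum to one.
print(sum(dist.values()))
```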

Let's look at an example. Suppose v1 has a triangular distribution with L = 3, E = 4, H = 7, and v2 has a triangular distribution with L = 1, E = 6, H = 7, both shown in Figure 2.

Figure 2

The distribution of the sum, found using the Monte Carlo simulator in Focal Point, is given by Figure 3.

Figure 3

First note that the sum is not another triangular distribution; it is
smoother. This is to be expected from the mathematics of probability. On
the other hand, the distribution of the sum makes sense. For example,
we would expect the most likely value of the sum to be near 10, the sum of
the two most likely values, and the simulation found 9.98. The
discrepancy is due to chance and would diminish with more samples. Also
note that the probability is zero below 4, the sum of the lows, and above 14,
the sum of the highs.
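The example can be reproduced with Python's stdlib `random.triangular` (which takes low, high, mode). This is a sketch of the same simulation, not the Focal Point tool:

```python
import random
from collections import Counter

N = 100_000

# v1: triangular with L=3, E=4, H=7; v2: triangular with L=1, E=6, H=7.
sums = [random.triangular(3, 7, 4) + random.triangular(1, 7, 6)
        for _ in range(N)]

# The support of the sum runs from 3+1=4 up to 7+7=14.
print(min(sums), max(sums))

# Approximate the most likely value by rounding to 0.1-wide bins
# and taking the most frequent bin.
counts = Counter(round(s, 1) for s in sums)
mode, _ = counts.most_common(1)[0]
print(mode)  # lands near 10, give or take sampling noise
```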

For fun, here is the distribution of the product of the variables:

The reader can check that this looks sensible. Note also that the product does not have a triangular distribution; the peak is much smoother.
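One sanity check the reader can run: sample the product the same way as the sum. Under the same two triangular distributions, the product's support runs from 3 × 1 = 3 up to 7 × 7 = 49.

```python
import random

N = 100_000

# Same two triangular variables as before, multiplied instead of added.
products = [random.triangular(3, 7, 4) * random.triangular(1, 7, 6)
            for _ in range(N)]

# The product can range from 3*1=3 up to 7*7=49.
print(min(products), max(products))
```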

So random variables can be used in place of fixed variables in any computation. They have all of the utility of fixed variables and also let us express uncertainty. They may seem foreign at first, but they are worth the trouble to learn. Like anything else, they become intuitive after a while.

One of the common criticisms of estimation methods is that
the calculation is no better than the assumptions: garbage in, garbage out (affectionately known as GIGO). That is, if you make poor or
dishonest assumptions, then you will get misleading forecasts. It is especially egregious
when someone games the system by intentionally feeding in assumptions
that lead to false forecasts in order to get a desired business decision.

However, estimation is an essential part of any disciplined
funding decision process (such as program portfolio management). The funding decision
relies on estimates of the costs and benefits. But for the reasons just described,
estimation is suspect.

So, what to do? I suggest the answer is not to abandon
estimation; the answer is to not input garbage, or if you do, to detect it as
soon as possible to minimize the damage.

First note that future costs and benefits are uncertain,
so any serious approach to the GIGO
problem must treat the assumptions as random variables with probability distributions and work from there. Generally, this allows one to use the limited information at hand to enter the assumptions and calculate
the forecasts.

Douglas Hubbard, in How to Measure Anything, gives us one way to proceed. Briefly, when an uncertain
value is needed, ask the subject matter expert (SME) to give not one but three values:
low, high, and expected. The three values may be used to specify random
variables with triangular distributions [ref].

In this case, the greater the difference between the high
and low values, the wider the triangular distribution of the estimate, reflecting
the uncertainty of the SME who is honestly making the assumptions.
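This relationship is easy to see by simulation. The sketch below compares two hypothetical SME answers with the same expected value but different low-high spreads; the three-point values are made up for illustration.

```python
import random

def p10_p90_width(low, expected, high, n=100_000):
    """Width of the 10%-90% range of a triangular(low, mode=expected, high)
    estimate, approximated by sorting a Monte Carlo sample."""
    s = sorted(random.triangular(low, high, expected) for _ in range(n))
    return s[(9 * n) // 10] - s[n // 10]

# Same expected value (5), different spreads between low and high.
narrow = p10_p90_width(4, 5, 6)
wide = p10_p90_width(1, 5, 12)
print(narrow, wide)
# The wider the SME's low-high spread, the wider the resulting range.
```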

One can use the random variables as values in the estimation
algorithm via Monte Carlo simulation: repeatedly
replace the single values with sampled values from the triangular distributions
and assemble the distribution of the estimated value. Note that the estimate is
again only as good as the assumptions; however, we can assess our faith in the
estimate by the width of the 10%-90% range of its distribution.

For example, one might estimate the total time for
completing a project by entering, for each task, the least time,
the most time, and the most likely time. Then one could apply Monte Carlo
simulation, or more elementary methods, to roll up the estimates and compute the distribution
of the time to complete.
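The rollup just described can be sketched as follows. The three-point task estimates are hypothetical, and tasks are assumed to be serialized (so durations simply add):

```python
import random

# Hypothetical three-point estimates (least, most likely, most), in days,
# for three serialized tasks.
tasks = [(3, 4, 7), (1, 6, 7), (2, 3, 9)]

N = 100_000

# Each trial samples every task's triangular distribution and adds them.
totals = sorted(
    sum(random.triangular(lo, hi, mode) for lo, mode, hi in tasks)
    for _ in range(N)
)

# The 10%-90% range of the total expresses our faith in the estimate.
p10, p90 = totals[N // 10], totals[(9 * N) // 10]
print(f"80% of simulated runs complete between {p10:.1f} and {p90:.1f} days")
```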

Hubbard goes further by suggesting that as actuals for the
assumptions become available, one should review whether they fall within the 10%-90% range of
the initial distributions. If they do,
fine. If they don't, questions are asked about the underlying reasoning and
beliefs. Over time the organization becomes more capable and accountable at
making good assumptions.

Further, we can also deal with the garbage-in, garbage-out
problem by using actual data whenever possible. There are at least two techniques.

In the first, as actuals for the assumptions become available,
they can be used to replace the distributions. For example, if
month-by-month sales projections are captured as triangular distributions to
forecast sales volumes, the distributions are replaced by the actual sales numbers as they come in. One should also update the remaining triangular
distributions to reflect the actual sales trends. The resulting estimate will usually
have a narrower distribution.
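A minimal sketch of this replacement step, with hypothetical month-by-month sales assumptions (the trend-based update of the remaining distributions is omitted for brevity):

```python
import random

# Hypothetical three-point sales assumptions (low, likely, high), in units,
# for three months.
assumptions = [(80, 100, 140), (90, 120, 160), (100, 130, 180)]

# Month 0 has shipped; its actual replaces the distribution.
actuals = {0: 95}

def month_value(m, lo, mode, hi):
    # Use the known actual when available, otherwise sample the assumption.
    return actuals[m] if m in actuals else random.triangular(lo, hi, mode)

N = 100_000
totals = sorted(
    sum(month_value(m, lo, mode, hi)
        for m, (lo, mode, hi) in enumerate(assumptions))
    for _ in range(N)
)
p10, p90 = totals[N // 10], totals[(9 * N) // 10]
print(f"forecast total: 10%-90% range {p10:.0f} to {p90:.0f} units")
```

Fixing month 0 at its actual removes that month's uncertainty from the rollup, which is why the resulting distribution narrows.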

A second technique is Bayesian trend analysis. In this case
we use actuals as evidence for the estimate. For example, if a project were on
track, then we would expect certain measures, such as burn-down rate and
test coverage, to reflect that. If a project were going to ship on time, the number of
unimplemented requirements would be going to zero; similarly, the code coverage
measure would be trending toward the target. So these measures are evidence of a healthy
project. Using Bayesian trend analysis, we
can turn the reasoning around and update the initial (prior) estimate of the
time for completion using the actuals as evidence for an improved estimate. The
result is an improved probability distribution of the time to complete the
project. As more actuals become available, the distribution becomes narrower,
increasing the certainty of the forecast.
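A toy version of this update can make the idea concrete. Everything here is hypothetical: a flat prior over candidate completion times, a linear burn-down model, and Gaussian noise around the expected progress are all simplifying assumptions, not part of any particular tool.

```python
import math

# Candidate completion times (weeks) and a flat prior over them.
weeks = list(range(8, 17))
prior = {w: 1 / len(weeks) for w in weeks}

# Hypothetical evidence: after 4 weeks, 45 of 100 story points are done.
done, total, elapsed = 45, 100, 4

def likelihood(w, done, total, elapsed, sd=8.0):
    """P(observed progress | completion takes w weeks), assuming a linear
    burn-down with Gaussian noise around the expected progress."""
    expected = total * elapsed / w
    return math.exp(-((done - expected) ** 2) / (2 * sd ** 2))

# Bayes: posterior is proportional to prior times likelihood, normalized.
post = {w: prior[w] * likelihood(w, done, total, elapsed) for w in weeks}
z = sum(post.values())
post = {w: p / z for w, p in post.items()}

best = max(post, key=post.get)
print(f"posterior mode: {best} weeks")
```

Feeding in later actuals the same way concentrates the posterior further, which is the narrowing described above.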

This way one can detect early whether the system is being gamed
and, at the same time, use the actuals to estimate the likelihood of an on-time
delivery.

So generally, one can use actuals not only to improve the
estimation process as Hubbard suggests, but also to apply Bayesian techniques
to improve the estimates of the program variables.