Prediction Intervals

Consider a class of people, all of different ages (when
measured to the nearest day). If I randomly sample 19
people, how do I predict the next (20th) person's age?

Consider first all 20 people. The 20th person is
equally likely to be the youngest, second
youngest, third youngest, etc., third oldest,
second oldest, oldest. That is, in the ordered
list of all 20 people, the 20th person selected
is equally likely to occupy any of the positions
1, 2, 3, . . . , 20.

Examine the picture below.

In this picture the first 19 people have been
isolated from the remaining 20th person. If the
20th person is the youngest, then the 20th person
"fits in" the first gap, to the left of the
smallest value. If the 20th person is second
youngest, then the 20th person fits in the 2nd
gap, and so on. Rephrasing the point made above:
The 20th person is equally likely to fall into
each of the 20 gaps formed by the first 19
people.

Since there are 20 gaps, each gap carries a
probability of 1/20 or 0.05 (5% if you like).
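The equal-gaps claim can be checked by simulation. The sketch below is only an illustration: the population of ages is made up (the class data are not reproduced here), and the function names are hypothetical. It repeatedly draws 20 people, treats the last one drawn as the 20th person, and records which of the 20 gaps that person falls into.

```python
import random
from collections import Counter

def gap_index(first, new_value):
    """Gap number (1 = left of the smallest, n + 1 = right of the largest)."""
    return sum(1 for x in first if x < new_value) + 1

def simulate_gap_frequencies(population, n=19, trials=20000, seed=0):
    """Estimate how often the (n + 1)th person lands in each of the n + 1 gaps."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        draw = rng.sample(population, n + 1)   # n + 1 distinct people, random order
        counts[gap_index(draw[:n], draw[n])] += 1
    return {gap: counts[gap] / trials for gap in range(1, n + 2)}

# Made-up population of distinct ages in days (an assumption, not the class data).
population = list(range(6000, 9000, 7))
freqs = simulate_gap_frequencies(population)

for gap, freq in freqs.items():
    print(f"gap {gap:2d}: {freq:.3f}")
```

Each printed frequency should sit close to 0.05, whatever the shape of the population, as long as all ages are distinct.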

The chance is 0.05 + 0.05 = 0.10 (or 10%) that
person 20 falls outside the entire range of the
first 19 people. The chance is 0.90 (90%) that
selection 20 falls inside the range. As a result,
the range from the smallest to largest of the
first 19 people is a 90% prediction interval (PI)
for the next (subsequent) observation. That's it!
That's what a prediction interval is; specifically,
a 90% PI. Note that the probabilities are
obtained from the number of gaps, which is 1
greater than the number of observations. Each gap
has probability 1/(n + 1) where n is the sample
size.
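The 1/(n + 1) rule means the confidence level of the smallest-to-largest interval depends only on the sample size. Here is a minimal sketch of that arithmetic; the function names and the small sample are hypothetical, chosen just for illustration.

```python
def range_pi_level(n):
    """Level of the min-to-max PI: n - 1 interior gaps out of n + 1 gaps."""
    if n < 2:
        raise ValueError("need at least 2 observations")
    return (n - 1) / (n + 1)

def range_prediction_interval(sample):
    """Return (lower bound, upper bound, level) for the min-to-max PI."""
    return min(sample), max(sample), range_pi_level(len(sample))

# How the level changes with sample size.
for n in (9, 19, 39, 99):
    print(n, range_pi_level(n))

# A made-up sample of 9 distinct ages in days (an assumption).
lower, upper, level = range_prediction_interval(
    [7010, 6550, 7480, 6900, 8120, 7333, 6720, 7805, 7150]
)
print(lower, upper, level)
```

With n = 19 this gives the 0.90 used above; a larger sample gives a higher level for the same kind of interval.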

The 90% PI is then (6516, 8546). We write
intervals like this with the small value first.
Read it: "Between 6516 and 8546." The
values that define this interval are called the bounds
of the interval. 6516 is the lower bound and 8546
is the upper bound. (Some people use the term endpoint
in place of bound.) The percentage (here
90%) is called the confidence level or procedural
reliability.

Of course, maybe you don't need to be 90%
confident in your result. If we move in one
observation from each end, covering two more gaps
at 5% each, we obtain an 80% PI. For the data
above this 80% PI is (6648, 8064). Below you see
a dotplot that marks off a number of prediction
intervals. Make sure you grasp the relationship
between the confidence (or reliability) level of
the procedure (the %) and the width of the
interval.
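The trade-off between level and width can be sketched in a few lines. The sample below is made up (19 distinct ages in days, not the class data), and the function name is hypothetical. Moving in k observations from each end gives up 2k gaps, so the level drops by 2k/(n + 1).

```python
def trimmed_prediction_interval(sample, k):
    """Move in k observations from each end; return (lower, upper, level)."""
    ordered = sorted(sample)
    n = len(ordered)
    if k < 0 or 2 * k >= n - 1:
        raise ValueError("k must leave at least one gap inside the interval")
    # The interval spans n - 1 - 2k of the n + 1 equally likely gaps.
    level = (n - 1 - 2 * k) / (n + 1)
    return ordered[k], ordered[n - 1 - k], level

# Made-up sample of 19 distinct ages in days (an assumption).
made_up_ages = [6510, 6590, 6601, 6688, 6745, 6920, 7010, 7055, 7130, 7204,
                7290, 7333, 7421, 7500, 7615, 7702, 7850, 8010, 8500]

for k in range(0, 9):
    lower, upper, level = trimmed_prediction_interval(made_up_ages, k)
    print(f"{level:.0%} PI: ({lower}, {upper})  width = {upper - lower}")
```

Reading down the output, the level falls in steps of 10% while the interval narrows: less confidence buys a tighter prediction.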

Interpretation

Like almost all statistical intervals, this one can be
a little tricky to interpret -- it requires some thought.
For example, the 90% PI of (6516, 8546) given above is
intended to predict the age of the next randomly selected
person in the class. It turns out that, of the remaining
students in the class, 55 of 56--that's 0.9821 or
98.21%--have ages between 6516 and 8546. This is the
conundrum: The reasoning used to develop the interval only
works when talking about "random data." Once a sample
is selected it is no longer random. This may be easier to
see by looking at the graph above. Before seeing any data
each gap had a probability of 5%. Now that we have data,
compare the gap between the second and third smallest
values (6648 and 6657--only 9 days apart) to that between
the fifth and sixth smallest values (6752 and
7067--that's 315 days apart). Common sense tells us that
it's far more likely that the twentieth observation will
fall in the larger gap.

The 90% refers to the average predictive success of
the entire procedure. That is, if I repeated the
following steps over and over:

sample 19 people at random,

form the 90% PI,

sample a twentieth person at random

then 90% of the time the twentieth person falls within
the bounds of the interval. Another way of thinking about
it is that if repeated over and over, on average
90% of the remaining students would fall within the
prediction bounds. Not in any one case, but on average.
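The repeated procedure can be sketched as a simulation. Everything about the setup is an assumption made for illustration: a made-up class of 75 people with distinct ages in days, and hypothetical function names. Each trial samples 19 people, forms the smallest-to-largest interval, and then checks both the twentieth person and the fraction of all remaining students inside the bounds.

```python
import random

def simulate_coverage(population, n=19, trials=10000, seed=1):
    """Return (share of trials where the next person is inside the PI,
    average fraction of remaining people inside the PI)."""
    rng = random.Random(seed)
    hits = 0
    inside_total = 0.0
    for _ in range(trials):
        shuffled = rng.sample(population, len(population))  # random order
        sample, rest = shuffled[:n], shuffled[n:]
        lower, upper = min(sample), max(sample)
        hits += lower < rest[0] < upper            # rest[0] plays the 20th person
        inside_total += sum(lower < x < upper for x in rest) / len(rest)
    return hits / trials, inside_total / trials

# Made-up class of 75 distinct ages in days (an assumption, not the class data).
class_ages = random.Random(42).sample(range(6400, 8800), 75)
hit_rate, avg_inside = simulate_coverage(class_ages)
print(f"20th person inside the PI: {hit_rate:.3f}")
print(f"average fraction of remaining students inside: {avg_inside:.3f}")
```

Both numbers should come out near 0.90. In any single trial the fraction of remaining students inside the bounds can be well above or below that, which is the point of the interpretation: the 90% describes the procedure on average, not one particular interval.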