Statistical models

A statistical model is often conceived of as a family of probability distributions. For present purposes we will use examples of statistical models that are families of probability distributions parameterized by one or more real parameters.

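The family of all normal distributions is parameterized by the expected value μ and the variance σ². Writing \(\varphi_{\mu,\sigma^2}\) for the probability density function of the normal distribution with these parameters, the probability that a random variable X with this distribution falls within any interval (a, b) is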

\[\Pr(a < X < b) = \int_a^b \varphi_{\mu,\sigma^2}(x)\,dx.\]
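As a small numerical illustration, the integral above can be evaluated through the error function, assuming that \(\varphi_{\mu,\sigma^2}\) denotes the normal density with expected value μ and variance σ². The function names below are hypothetical and chosen only for this sketch.

```python
import math

def normal_cdf(x, mu, sigma2):
    """CDF of the normal distribution with mean mu and variance sigma2."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * sigma2)))

def normal_interval_prob(a, b, mu, sigma2):
    """Pr(a < X < b), i.e. the integral of the normal density over (a, b)."""
    return normal_cdf(b, mu, sigma2) - normal_cdf(a, mu, sigma2)

# Probability of falling within one standard deviation of the mean
# (roughly 0.683 for any normal distribution).
print(normal_interval_prob(-1.0, 1.0, mu=0.0, sigma2=1.0))
```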

The family of all Poisson distributions is parameterized by the expected value λ, which can be any positive number. The probability that a random variable X with this distribution is in any subset A of the set {0, 1, 2, ...} of all nonnegative integers is

\[\Pr(X \in A) = \sum_{x \in A} \frac{\lambda^x e^{-\lambda}}{x!}.\]
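The sum above is straightforward to evaluate for a finite set A. The following is a minimal sketch; the helper names `poisson_pmf` and `poisson_prob` are hypothetical, introduced only for illustration.

```python
import math

def poisson_pmf(x, lam):
    """Probability that a Poisson variable with expected value lam equals x."""
    return lam**x * math.exp(-lam) / math.factorial(x)

def poisson_prob(A, lam):
    """Pr(X in A) for a finite collection A of nonnegative integers."""
    return sum(poisson_pmf(x, lam) for x in set(A))

# Probability that X is 0, 1 or 2 when the expected value is 2.
print(poisson_prob({0, 1, 2}, lam=2.0))
```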

The family of all continuous uniform distributions on an interval with left endpoint 0 is parameterized by the maximum value θ. The probability that a random variable X with this distribution falls in any interval (0, a) within (0, θ) is just a/θ.

In the discussion below, we will imagine a statistical sample consisting of a sequence X1, ..., Xn of independent identically distributed random variables, distributed according to one of these three distributions.

The essential definitions

"Statistic"

A statistic T = T(X1, ..., Xn) is an "observable" random variable, i.e. a quantity depending on the "data" X1, ..., Xn but not depending on "unobservable" parameters (denoted in our three examples by lower-case Greek letters). For example, if a scale measures weights with an error that is normally distributed with expected value μ and variance 1, and X is the reported weight, then X is a statistic, but the measurement error X − μ is not a statistic, since μ is not observable and hence X − μ is not observable.

Sufficiency of a statistic

A statistic is sufficient for a statistical model (i.e. for a specified family of probability distributions) if the conditional probability distribution of the data X1, ..., Xn given the value of the statistic T does not depend on "unobservable" parameters, i.e. does not depend on which of the probability distributions in the family is the one from which the data X1, ..., Xn were drawn.

Intuitively, a sufficient statistic captures all information in the data that is relevant to guessing the values of the unobservable parameters, or more generally, to guessing the underlying probability distribution from which the data were drawn.

Sufficient statistics for three example models

For the family of normal distributions, the pair \((X_1+\cdots+X_n,\ X_1^2+\cdots+X_n^2)\) is sufficient. This means that the conditional probability distribution of the data X1, ..., Xn given the values of \(X_1+\cdots+X_n\) and \(X_1^2+\cdots+X_n^2\) does not depend on μ or σ².

For the family of Poisson distributions, the sum X1 + ... + Xn is sufficient. The conditional probability distribution of the data X1, ..., Xn given the value of X1 + ... + Xn does not depend on λ.

For the family of uniform distributions described above, the maximum max{X1, ..., Xn} is sufficient. The conditional probability distribution of the data X1, ..., Xn given the value of max{X1, ..., Xn} does not depend on θ.
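As a worked check of the Poisson case (a standard calculation, sketched here rather than taken from the text above), write \(S = X_1+\cdots+X_n\), a notation introduced only for this illustration. The sum S of n independent Poisson variables with expected value λ has a Poisson distribution with expected value nλ, so for nonnegative integers \(x_1,\dots,x_n\) with \(x_1+\cdots+x_n = s\),

\[\Pr(X_1=x_1,\dots,X_n=x_n \mid S=s) = \frac{\prod_{i=1}^n \lambda^{x_i}e^{-\lambda}/x_i!}{(n\lambda)^s e^{-n\lambda}/s!} = \frac{s!}{x_1!\cdots x_n!}\left(\frac{1}{n}\right)^{s}.\]

Every factor involving λ cancels, leaving a multinomial distribution that is the same for all members of the family.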

The Rao-Blackwell theorem

A rough version of the Rao-Blackwell theorem, named after Calyampudi Radhakrishna Rao and David Blackwell, is that the conditional expected value of an estimator given a sufficient statistic is a better estimator. An "estimator" of an unobservable parameter θ is any statistic used to estimate θ ("estimate" in the sense of an educated guess, rather than in the mathematicians' sense of upper or lower bounds). The precise meaning of "better" is for many purposes of less interest than is the method of finding the "better" estimator, and we defer it until after some examples. In fact, the precise version of the theorem says only that the improved estimator is no worse than the estimator that was to be improved. In practical applications, however, the improvement is often enormous.

Example: the uniform distribution on (0, θ)

Suppose \( X_1,\dots,X_n\, \) are uniformly distributed on the interval from 0 to some unobservable positive number θ. It was remarked above that the maximum observed value \(\max\{\,X_1,\dots,X_n\,\}\) is a sufficient statistic for the family of distributions parametrized by θ ∈ (0, ∞).

Clearly the sample mean

\[ \overline{X} = \frac{X_1+\cdots+X_n}{n} \]

has expected value θ/2. Therefore it might appear to make sense to use \(2\overline{X}\) as an estimator of θ. Notice, however, that this estimator has a flaw: in some cases its value is less than the maximum observed value \(\max\{\,X_1,\dots,X_n\,\}\). That means we would be using as an estimator of θ a quantity that we know, based on the data, must be less than θ.
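To make the flaw concrete, consider a hypothetical sample 0.1, 0.1, 1.0: twice its mean is 0.8, which is below the observed maximum 1.0. The sketch below estimates by simulation how often this happens; the function names, the seed and the trial count are illustrative assumptions, not part of the original discussion.

```python
import random

def flawed_estimate(sample):
    """Twice the sample mean, the naive estimator of theta."""
    return 2.0 * sum(sample) / len(sample)

def fraction_below_max(theta, n, trials, seed=0):
    """Fraction of simulated samples for which 2 * mean < observed maximum."""
    rng = random.Random(seed)  # seeded for reproducibility
    count = 0
    for _ in range(trials):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        if flawed_estimate(sample) < max(sample):
            count += 1
    return count / trials

# The hypothetical sample from the text: 2 * mean is below the maximum.
print(flawed_estimate([0.1, 0.1, 1.0]), max([0.1, 0.1, 1.0]))
# Simulated frequency of the flaw for samples of size 3.
print(fraction_below_max(theta=1.0, n=3, trials=20000))
```

For n = 3 the flaw occurs exactly when the two smaller observations sum to less than half the maximum, which has probability 1/8 whatever the value of θ, so the simulated fraction should be close to 0.125.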

The Rao-Blackwell estimator δ is the conditional expected value of the flawed estimator given the sufficient statistic:

\[ \delta = E( 2\overline{X} \mid \max\{\,X_1,\dots,X_n\,\}). \]

In general, the conditional expected value of a function of the data \(X_1,\dots,X_n\) given another function of the data depends on θ. However, the sufficiency of the statistic \( \max\{\,X_1,\dots,X_n\,\} \) means precisely that the conditional probability distribution of the data given \( \max\{\,X_1,\dots,X_n\,\} \) does not depend on θ. Therefore this conditional expected value δ does not depend on θ but only on the data, i.e. δ is a statistic. It is easily shown that

\[\delta = \frac{n+1}{n}\max\{\,X_1,\dots,X_n\,\}.\]
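One way to see this is the following sketch of the standard argument, in which M is shorthand (introduced here for convenience) for \(\max\{\,X_1,\dots,X_n\,\}\). Conditional on M = m, one of the observations equals m and the remaining n − 1 are independent and uniformly distributed on (0, m), so each of them has conditional expected value m/2. Hence

\[E(X_1+\cdots+X_n \mid M=m) = m + (n-1)\frac{m}{2} = \frac{(n+1)\,m}{2},\]

and multiplying by 2/n recovers the expression \(\frac{n+1}{n}\max\{\,X_1,\dots,X_n\,\}\) for δ.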

That statistic, then, is the "improved" estimator of θ yielded by the Rao-Blackwell procedure.
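The size of the improvement can be explored by simulation. The sketch below compares the two estimators by mean squared error, which is used here as one common measure of quality; this is an assumption, since the precise meaning of "better" has been deferred. The function names, seed and parameter values are hypothetical choices for illustration.

```python
import random

def flawed_estimate(sample):
    """Twice the sample mean."""
    return 2.0 * sum(sample) / len(sample)

def rao_blackwell_estimate(sample):
    """(n + 1) / n times the sample maximum."""
    n = len(sample)
    return (n + 1) / n * max(sample)

def mean_squared_errors(theta, n, trials, seed=0):
    """Simulated mean squared errors of both estimators of theta."""
    rng = random.Random(seed)  # seeded for reproducibility
    se_flawed = 0.0
    se_rb = 0.0
    for _ in range(trials):
        sample = [rng.uniform(0.0, theta) for _ in range(n)]
        se_flawed += (flawed_estimate(sample) - theta) ** 2
        se_rb += (rao_blackwell_estimate(sample) - theta) ** 2
    return se_flawed / trials, se_rb / trials

mse_flawed, mse_rb = mean_squared_errors(theta=1.0, n=10, trials=20000)
print(mse_flawed, mse_rb)
```

Both estimators are unbiased, so their mean squared errors equal their variances: θ²/(3n) for twice the sample mean and θ²/(n(n + 2)) for the Rao-Blackwell estimator. With θ = 1 and n = 10 the simulated values should therefore be near 1/30 and 1/120 respectively.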