The Shadow Price of Power

Suppose we want to pick out some sort of signal from a background of noise.
As every schoolchild knows, any procedure for doing this,
or test, divides the data space into two parts, the one where
it says "noise" and the one where it says "signal".* Tests will make two kinds
of mistakes: they can take noise to be signal, a false
alarm, or can ignore a genuine signal as noise,
a miss. Both the signal and the noise are stochastic, or we
can treat them as such anyway. (Any determinism distinguishable from chance is
just insufficiently complicated.) We want tests where
the probabilities of both types of errors are small. The probability
of a false alarm is called the size (or significance
level) of the test; it is the measure of the "say 'signal'" region
under the noise distribution. The probability of a miss, as opposed to a false
alarm, has no short name in the jargon, but one minus the probability of a miss
— the probability of detecting a signal when it's present — is
called power.
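
To make size and power concrete, here is a minimal sketch in Python. The two Gaussians (noise as N(0, 1), signal as N(1, 1)) and the rule "say 'signal' when x exceeds a cutoff" are illustrative assumptions, not anything fixed by the argument; the function name `size_and_power` is likewise made up for this example.

```python
from statistics import NormalDist

# Illustrative assumption: noise ~ N(0, 1), signal ~ N(1, 1).
noise = NormalDist(mu=0.0, sigma=1.0)
signal = NormalDist(mu=1.0, sigma=1.0)

def size_and_power(cutoff):
    """Size and power of the test that says 'signal' when x > cutoff."""
    size = 1.0 - noise.cdf(cutoff)   # false alarm: noise lands in the 'signal' region
    miss = signal.cdf(cutoff)        # miss: signal lands in the 'noise' region
    power = 1.0 - miss               # detection probability
    return size, power

for c in (0.5, 1.0, 2.0):
    size, power = size_and_power(c)
    print(f"cutoff={c:.1f}  size={size:.3f}  power={power:.3f}")
```

Raising the cutoff shrinks the "say 'signal'" region, so it lowers the size and the power together; that is the trade-off the rest of the argument is about.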

Suppose we know the probability density of the noise \( p \) and that of the
signal is \( q \). The Neyman-Pearson lemma, as many though not all
schoolchildren know, says that then, among all tests of a given size \( s \) ,
the one with the smallest miss probability, or highest power, has the form "say
'signal' if \( q(x)/p(x) > t(s) \), otherwise say 'noise'," and that the
threshold \( t \) varies inversely with \( s \) . The quantity \( q(x)/p(x) \)
is the likelihood ratio; the Neyman-Pearson lemma says that to
maximize power, we should say "signal" if it's sufficiently more likely
than noise.
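
A small sketch of the likelihood ratio test, again assuming (purely for illustration) that the noise density \( p \) is N(0, 1) and the signal density \( q \) is N(1, 1). For this pair the ratio is increasing in \( x \), so "ratio above \( t \)" is the same region as "\( x \) above some cutoff", which makes \( t(s) \) easy to compute. The helper names `np_threshold` and `np_power` are hypothetical.

```python
from statistics import NormalDist

noise = NormalDist(0.0, 1.0)    # density p (assumed for illustration)
signal = NormalDist(1.0, 1.0)   # density q (assumed for illustration)

def likelihood_ratio(x):
    """q(x) / p(x) at the observed point x."""
    return signal.pdf(x) / noise.pdf(x)

def np_threshold(s):
    """Threshold t(s) of the size-s likelihood ratio test.

    Relies on the ratio being increasing in x for this pair of Gaussians,
    so the rejection region {ratio > t} is a half-line {x > c}.
    """
    c = noise.inv_cdf(1.0 - s)   # cutoff with false-alarm probability s
    return likelihood_ratio(c)

def np_power(s):
    """Power of the size-s likelihood ratio test."""
    c = noise.inv_cdf(1.0 - s)
    return 1.0 - signal.cdf(c)

for s in (0.01, 0.05, 0.20):
    print(f"s={s:.2f}  t(s)={np_threshold(s):.3f}  power={np_power(s):.3f}")
```

Tolerating more false alarms (larger \( s \)) lowers the threshold and raises the power, which is the sense in which \( t \) varies inversely with \( s \).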

The likelihood ratio indicates how different the two distributions —
the two hypotheses — are at \( x \), the data-point we
observed. It makes sense that the outcome of the hypothesis test should depend
on this sort of discrepancy between the hypotheses. But why
the ratio, rather than, say, the difference \( q(x) - p(x) \), or a
signed squared difference, etc.? Can we make this intuitive?

Start with the fact that we have an optimization problem under a constraint.
Call the region where we proclaim "signal" \( R \) . We want to maximize its
probability when we are seeing a signal, \( Q(R) \), while constraining the
false-alarm probability, \( P(R) = s \). Lagrange
tells us that the way to do this is to optimize \( Q(R) - t[P(R) - s] \) over
\( R \) and \( t \) jointly, maximizing over \( R \) while the multiplier
\( t \) enforces the constraint. So far the usual story; the next turn is usually
"as you remember from the calculus of variations..."

Rather than actually doing math, let's think like economists. Picking the
set \( R \) gives us a certain benefit, in the form of the power \( Q(R) \) ,
and a cost, \( tP(R) \) . (The \( ts \) term is the same for all \( R \) .)
Economists, of course, tell us to equate marginal costs and benefits.
What is the marginal benefit of expanding \( R \) to include a small
neighborhood around the point \( x \) ? Just, by the definition of
"probability density", \( q(x) \) . The marginal cost is likewise \( tp(x) \) .
We should include \( x \) in \( R \) if \( q(x) > tp(x) \), or
\( q(x)/p(x) > t \) . The boundary of \( R \) is where marginal benefit equals marginal
cost, and that is why we need the likelihood ratio and not the
likelihood difference, or anything else. (Except for a monotone
transformation of the ratio, e.g. the log ratio.) The likelihood ratio
threshold \( t \) is, in fact, the shadow price of statistical power.
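
The marginal argument can be checked by brute force on a toy problem. The sketch below uses a made-up five-point data space, with probability mass functions standing in for the densities, and fills \( R \) in order of either the ratio or the difference until the false-alarm budget \( s \) is used up. All the numbers and names are hypothetical, and they were chosen so that the budget is hit exactly; on a discrete space that need not happen, and then a greedy fill can fall short of the best subset.

```python
from itertools import combinations

# Hypothetical five-point data space: p is the noise pmf, q the signal pmf.
p = [0.05, 0.05, 0.10, 0.30, 0.50]
q = [0.18, 0.15, 0.27, 0.30, 0.10]
s = 0.10       # allowed false-alarm probability
EPS = 1e-12    # tolerance for floating-point sums

def size(R):
    """False-alarm probability P(R)."""
    return sum(p[i] for i in R)

def power(R):
    """Detection probability Q(R)."""
    return sum(q[i] for i in R)

def fill_by(score):
    """Add points in decreasing order of score until the next would exceed s."""
    R = []
    for i in sorted(range(len(p)), key=score, reverse=True):
        if size(R) + p[i] > s + EPS:
            break
        R.append(i)
    return R

ratio_R = fill_by(lambda i: q[i] / p[i])   # benefit per unit of cost
diff_R = fill_by(lambda i: q[i] - p[i])    # the tempting alternative

# Exhaustive search over every subset whose size is at most s.
indices = range(len(p))
best = max(
    (R for k in range(len(p) + 1) for R in combinations(indices, k)
     if size(R) <= s + EPS),
    key=power,
)

print("ratio rule:     ", ratio_R, round(power(ratio_R), 2))
print("difference rule:", diff_R, round(power(diff_R), 2))
print("brute force:    ", list(best), round(power(best), 2))
```

In this example the ratio rule selects the same region as the exhaustive search, while the difference rule spends the whole budget on one point with a large difference but a lower ratio, and ends up with less power for the same size.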

I am pretty sure I have not seen or heard the Neyman-Pearson lemma explained
marginally before, but in retrospect it seems too simple to be new, so pointers
would be appreciated.