Long, long ago, I tried to inveigle Larry into writing this, by promising I
would make it a guest post. Larry now has his own blog, but a promise is a
promise. More to the point, while I can't claim any credit for it, I'm happy
to endorse it, and to push it along by reproducing it.
Everything between the horizontal lines is by Jamie and Larry, though I tweaked
the formatting trivially.

Robins and Wasserman Respond to a Nobel Prize Winner

Chris Sims is a Nobel Prize-winning economist who is well known for his work on macroeconomics, Bayesian statistics, and vector autoregressions, among other things. One of us (LW) had the good fortune to meet Chris at a conference and can attest that he is also a very nice guy.

Chris has a paper called On an Example of Larry Wasserman. This post is a response to Chris' paper.

The example in question is actually due to Robins and Ritov (1997). A simplified version appeared in Wasserman (2004) and Robins and Wasserman (2000). The example is related to ideas from the foundations of survey sampling (Basu 1969, Godambe and Thompson 1976) and also to ancillarity paradoxes (Brown 1990, Foster and George 1996).

The function \( \pi \) is known. We construct it. Remember that \( \pi(x) = P(R=1|X=x) \) is the probability that we get to observe \( Y \) given that \( X=x \). Think of \( Y \) as something that is expensive to measure. We don't always want to measure it. So we make a random decision about whether to measure it. And we let the probability of measuring \( Y \) be a function \( \pi(x) \) of \( x \). And we get to construct this function.

Let \( \delta>0 \) be a known, small, positive number. We will assume that
\[
\pi(x)\geq \delta
\]
for all \( x \).

The only thing in the model we don't know is the function \( \theta(x) \). Again, we will assume that
\[
\delta \leq \theta(x) \leq 1-\delta.
\]
Let \( \Theta \) denote all measurable functions on \( [0,1]^d \) that satisfy the above conditions. The parameter space is the set of functions \( \Theta \).

Let \( {\cal P} \) be the set of joint distributions of the form
\[
p(x) \, \pi(x)^r (1-\pi(x))^{1-r}\, \theta(x)^y (1-\theta(x))^{1-y}
\]
where \( p(x)=1 \), and \( \pi(\cdot) \) and \( \theta(\cdot) \) satisfy the conditions above. So far, we are considering the sub-model \( {\cal P}_\pi \) in which \( \pi \) is known.
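
To make the sampling scheme concrete, here is a minimal Python sketch that draws data from a distribution of this form. The names `simulate`, `pi_example` and `theta_example`, the value \( \delta = 0.1 \), and the linear forms chosen for \( \pi \) and \( \theta \) are illustrative assumptions, not part of the original example.

```python
import random

DELTA = 0.1  # assumed value of the known lower bound delta


def pi_example(x):
    """Hypothetical known selection probability; satisfies pi(x) >= DELTA."""
    return 0.1 + 0.8 * x[0]


def theta_example(x):
    """Hypothetical unknown function theta; stays within [DELTA, 1 - DELTA]."""
    return 0.2 + 0.6 * x[0]


def simulate(n, d, pi, theta, seed=0):
    """Draw n triples (x, r, y); y is None whenever r == 0 (Y unobserved)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]           # p(x) = 1: X uniform on [0,1]^d
        r = 1 if rng.random() < pi(x) else 0            # R | X = x ~ Bernoulli(pi(x))
        y_full = 1 if rng.random() < theta(x) else 0    # Y | X = x ~ Bernoulli(theta(x))
        data.append((x, r, y_full if r == 1 else None)) # hide Y when R = 0
    return data


data = simulate(n=1000, d=3, pi=pi_example, theta=theta_example)
observed = sum(r for _, r, _ in data)
print(f"{observed} of {len(data)} outcomes observed")
```

Because \( \pi \) is something we construct, the analyst can evaluate `pi_example` at every \( X_i \); only `theta_example` plays the role of the unknown.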

3. Bayesian Analysis

To do a Bayesian analysis, we put some prior \( W \) on \( \Theta \). Next we compute the likelihood function. The likelihood for one observation takes the form \( p(x) p(r|x) p(y|x)^r \). The reason for having \( r \) in the exponent is that, if \( r=0 \), then \( y \) is not observed so the \( p(y|x) \) gets left out. The likelihood for \( n \) observations is
\[
\prod_{i=1}^n p(X_i) p(R_i|X_i) p(Y_i|X_i)^{R_i} = \prod_{i=1}^n \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i}\, \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i},
\]
where we used the fact that \( p(x)=1 \). But remember, \( \pi(x) \) is known. In other words, \( \pi(X_i)^{R_i} (1-\pi(X_i))^{1-R_i} \) is known. So, the likelihood is
\[
{\cal L} (\theta) \propto \prod_i \theta(X_i)^{Y_i R_i} (1-\theta(X_i))^{(1-Y_i)R_i}.
\]
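
The claim that the \( \pi \) terms drop out can be checked numerically. The sketch below, on a made-up four-point data set with hypothetical choices of \( \pi \) and \( \theta \), evaluates the full log-likelihood under two different known designs and compares the log-likelihood ratio between two candidate \( \theta \)'s. If \( \pi \) only contributes a constant factor, the ratio should not depend on the design.

```python
import math


def full_loglik(data, pi, theta):
    """Log of the full likelihood, including the pi terms."""
    total = 0.0
    for x, r, y in data:
        total += math.log(pi(x)) if r == 1 else math.log(1.0 - pi(x))
        if r == 1:  # theta enters only when Y is observed
            total += math.log(theta(x)) if y == 1 else math.log(1.0 - theta(x))
    return total


# Hypothetical toy data: (x, r, y), with y = None when r == 0.
data = [(0.1, 1, 1), (0.4, 0, None), (0.7, 1, 0), (0.9, 1, 1)]

pi_a = lambda x: 0.5              # two different "known" designs
pi_b = lambda x: 0.1 + 0.8 * x
theta_1 = lambda x: 0.3           # two candidate values of the parameter
theta_2 = lambda x: 0.2 + 0.6 * x

# Log-likelihood ratio of theta_1 versus theta_2 under each design.
ratio_a = full_loglik(data, pi_a, theta_1) - full_loglik(data, pi_a, theta_2)
ratio_b = full_loglik(data, pi_b, theta_1) - full_loglik(data, pi_b, theta_2)
print(ratio_a, ratio_b)  # the two values agree up to floating-point rounding
```

Since the posterior depends on the data only through such ratios, a prior that ignores \( \pi \) yields a posterior that ignores \( \pi \).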

Combining this likelihood with the prior \( W \) creates a posterior distribution on \( \Theta \) which we will denote by \( W_n \). Since the parameter of interest \( \psi \) is a function of \( \theta \), the posterior \( W_n \) for \( \theta \) defines a posterior distribution for \( \psi \).

Now comes the interesting part. The likelihood has essentially no information in it.

To see that the likelihood has no information, consider a simpler case where \( \theta(x) \) is a function on \( [0,1] \). Now discretize the interval into many small bins. Let \( B \) be the number of bins. We can then replace the function \( \theta \) with a high-dimensional vector \( \theta = (\theta_1,\ldots, \theta_B) \). With \( n < B \), most bins are empty. The data contain no information for most of the \( \theta_j \)'s. (You might wonder about the effect of putting a smoothness assumption on \( \theta(\cdot ) \). We'll discuss this in Section 4.)
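
A quick way to see how sparse the data are is to count occupied bins. The sketch below assumes \( n = 100 \), \( B = 1000 \) and a constant \( \pi(x) = 1/2 \); these numbers, and the helper name `bins_with_data`, are illustrative choices only.

```python
import random


def bins_with_data(n, B, pi, seed=0):
    """Count bins of [0,1] that receive at least one observed Y."""
    rng = random.Random(seed)
    hit = set()
    for _ in range(n):
        x = rng.random()
        if rng.random() < pi(x):              # Y is observed only when R = 1
            hit.add(min(int(x * B), B - 1))   # index of the bin containing x
    return len(hit)


n, B = 100, 1000
k = bins_with_data(n, B, pi=lambda x: 0.5)
print(f"{k} of {B} bins contain any observed outcome")
```

At most \( n \) bins can ever be occupied, so with \( n < B \) the likelihood is flat in at least \( B - n \) of the coordinates \( \theta_j \), and the posterior for those coordinates equals the prior unless the prior ties them to the others.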

We should point out that if \( \pi(x) = 1/2 \) for all \( x \), then Ericson (1969) showed that a certain exchangeable prior gives a posterior that, like the Horvitz-Thompson estimator, converges at rate \( O(n^{-1/2}) \). However, we are interested in the case where \( \pi(x) \) is a complex function of \( x \); then the posterior will fail to concentrate around the true value of \( \psi \). On the other hand, a flexible nonparametric prior will have a posterior essentially equal to the prior and, thus, will not concentrate around \( \psi \), whenever the prior \( W \) does not depend on the known function \( \pi(\cdot) \). Indeed, we have the following theorem from Robins and Ritov (1997):

Theorem (Robins and Ritov 1997). Any estimator that is not a function of \( \pi(\cdot) \) cannot be uniformly consistent.

This means that at no finite sample size will an estimator \( \hat\psi \) that is not a function of \( \pi \) be close to \( \psi \) for all distributions in \( {\cal P} \). In fact, the theorem holds for a neighborhood around every pair \( (\pi,\theta) \). Uniformity is important because it links asymptotic behavior to finite sample behavior. But when \( \pi \) is known and is used in the estimator (as in the Horvitz-Thompson estimator and its improved versions) we can have uniform consistency.
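
The contrast can be illustrated with a small simulation. The sketch below assumes \( d = 1 \), takes the parameter of interest to be \( \psi = \int \theta(x)\,dx \) (an assumption, since this excerpt does not restate the definition), and uses hypothetical linear choices of \( \pi \) and \( \theta \) for which \( \psi = 0.5 \). It compares the Horvitz-Thompson estimator \( n^{-1}\sum_i R_i Y_i/\pi(X_i) \) with the mean of the observed \( Y_i \), which is one simple estimator that does not use \( \pi \).

```python
import random


def pi(x):      # hypothetical known design
    return 0.1 + 0.8 * x


def theta(x):   # hypothetical truth, unknown to the analyst
    return 0.2 + 0.6 * x


rng = random.Random(1)
n = 20000
ht_sum = 0.0     # running sum of R_i * Y_i / pi(X_i)
obs_sum = 0      # running sum of the observed Y_i
obs_count = 0    # number of observed Y_i

for _ in range(n):
    x = rng.random()                          # X uniform on [0,1]
    r = 1 if rng.random() < pi(x) else 0      # R | X ~ Bernoulli(pi(X))
    y = 1 if rng.random() < theta(x) else 0   # Y | X ~ Bernoulli(theta(X))
    if r == 1:
        ht_sum += y / pi(x)
        obs_sum += y
        obs_count += 1

ht_estimate = ht_sum / n                 # uses the known pi
naive_estimate = obs_sum / obs_count     # ignores pi
print(ht_estimate, naive_estimate)
```

With these particular choices the mean of the observed outcomes targets \( \int \theta\pi \,/ \int \pi = 0.58 \) rather than \( 0.5 \), because large values of \( \theta \) are over-sampled. This is only one bad estimator under one design, not a demonstration of the theorem, which concerns every estimator that ignores \( \pi \).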

Note that a Bayesian will ignore \( \pi \) since the \( \pi(X_i) \)'s are just constants in the likelihood. There is an exception: the Bayesian can make the posterior be a function of \( \pi \) by choosing a prior \( W \) that makes \( \theta(\cdot) \) depend on \( \pi(\cdot) \). But this seems very forced. Indeed, Robins and Ritov showed that, under certain conditions, any true subjective Bayesian prior \( W \) must be independent of \( \pi(\cdot) \). Specifically, they showed that once a subjective Bayesian queries the randomizer (who selected \( \pi \)) about the randomizer's reasoned opinions concerning \( \theta(\cdot) \) (but not \( \pi(\cdot) \)) the Bayesian will have independent priors. We note that a Bayesian can have independent priors even when he believes with probability 1 that \( \pi(\cdot) \) and \( \theta(\cdot) \) are positively correlated as functions of \( x \), i.e., \( \int \theta(x) \pi(x)\, dx > \int \theta(x)\, dx \int \pi(x)\, dx \). Having independent priors only means that learning \( \pi(\cdot) \) will not change one's beliefs about \( \theta(\cdot) \). So far, so good. As far as we know, Chris agrees with everything up to this point.

4. Some Bayesian Responses

Chris goes on to raise alternative Bayesian approaches.

The first is to define
\[
Z_i = \frac{R_i Y_i}{\pi(X_i)}.
\]
Note that \( Z_i \in \{0\} \cup [1,\infty) \). Now we ignore (throw away) the original data. Chris shows that we can then construct a model for \( Z_i \) which results in a posterior for \( \psi \) that mimics the Horvitz-Thompson estimator. We'll comment on this below, but note two strange things. First, it is odd for a Bayesian to throw away data. Second, the new data are a function of \( \pi(X_i) \) which forces the posterior to be a function of \( \pi \). But as we noted earlier, when \( \theta \) and \( \pi \) are a priori independent, the \( \pi(X_i) \)'s do not appear in the posterior since they are known constants that drop out of the likelihood.
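
A short calculation shows why a model built on the \( Z_i \) can reproduce the Horvitz-Thompson estimator. It assumes, as this excerpt does not state explicitly, that the parameter of interest is \( \psi = E(Y) = \int \theta(x)\,dx \), and it uses the conditional independence of \( R \) and \( Y \) given \( X \) implied by the factorized joint density:
\[
E(Z_i) = E\left[ \frac{E(R_i \mid X_i)\, E(Y_i \mid X_i)}{\pi(X_i)} \right] = E\left[ \frac{\pi(X_i)\,\theta(X_i)}{\pi(X_i)} \right] = \int \theta(x)\, dx = \psi .
\]
So \( \psi \) is simply the mean of the \( Z_i \), and the sample mean \( n^{-1}\sum_i Z_i \) is the Horvitz-Thompson estimator. The range of \( Z_i \) follows from the same definition: \( Z_i = 0 \) when \( R_i Y_i = 0 \), and otherwise \( Z_i = 1/\pi(X_i) \geq 1 \).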

A second approach (not mentioned explicitly by Chris), which is related to the above idea, is to construct a prior \( W \) that depends on the known function \( \pi \). It can be shown that if the prior is chosen just right then again the posterior for \( \psi \) mimics the (improved) Horvitz-Thompson estimator.

Lastly, Chris notes that the posterior contains no information because we have not enforced any smoothness on \( \theta(x) \). Without smoothness, knowing \( \theta(x) \) does not tell you anything about \( \theta(x+\epsilon) \) (assuming the prior \( W \) does not depend on \( \pi \)).

This is true and better inferences would obtain if we used a prior that enforced smoothness. But this argument falls apart when \( d \) is large. (In fairness to Chris, he was referring to the version from Wasserman (2004) which did not invoke high dimensions.) When \( d \) is large, forcing \( \theta(x) \) to be smooth does not help unless you make it very, very, very smooth. The larger \( d \) is, the more smoothness you need to get borrowing of information across different values of \( \theta(x) \). But this introduces a huge bias which precludes uniform consistency.
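
A back-of-the-envelope calculation gives a feel for why borrowing information gets harder as \( d \) grows. The sketch below assumes \( n = 1000 \) points uniform on \( [0,1]^d \) and asks how many are expected to fall in a small cube of side \( h = 0.1 \); the function name and the numbers are illustrative assumptions, and a cube is only a crude stand-in for the neighborhood a smoothness prior would pool over.

```python
def expected_neighbors(n, h, d):
    """Expected number of n uniform points on [0,1]^d landing in a cube of side h."""
    return n * h ** d


for d in (1, 2, 5, 10, 20):
    print(d, expected_neighbors(n=1000, h=0.1, d=d))
```

To keep the expected count fixed as \( d \) increases, the side \( h \) has to approach 1, which amounts to treating \( \theta \) as nearly constant over almost the whole cube.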

5. Response to the Response

We have seen that response 3 (add smoothness conditions in the prior) doesn't work. What about response 1 and response 2? We agree that these work, in the sense that the Bayes answer has good frequentist behavior by mimicking the (improved) Horvitz-Thompson estimator.

But this is a Pyrrhic victory. If we manipulate the data to get a posterior that mimics the frequentist answer, is this really a success for Bayesian inference? Is it really Bayesian inference at all? Similarly, if we choose a carefully constructed prior just to mimic a frequentist answer, is it really Bayesian inference?

We call Bayesian inference that is carefully manipulated to force an answer with good frequentist behavior "frequentist pursuit." There is nothing wrong with it, but why bother?

If you want good frequentist properties just use the frequentist estimator. If you want to be a Bayesian, be a Bayesian but accept the fact that, in this example, your posterior will fail to concentrate around the true value.

6. Summary

In summary, we agree with Chris' analysis. But his fix is just frequentist pursuit; it is Bayesian analysis with unnatural manipulations aimed only at forcing the Bayesian answer to be the frequentist answer. This seems to us to be an admission that Bayes fails in this example.