What are the advantages of using the robust variance estimator over
the standard maximum-likelihood variance estimator in logistic regression?

Title

Advantages of the robust variance estimator

Author

Bill Sribney, StataCorp

I once overheard a famous statistician say the robust variance
estimator for (unclustered) logistic regression is stupid. His reason was
that if the outcome variable is binary then it’s got to be a Bernoulli
distribution. It’s not like linear regression with data that stand a
good chance of being nonnormal and heteroskedastic.

Well, it’s not as simple as that; there are several subtle
nuances. Let me lay them out here. I’m sure the famous
statistician is aware of them, but they don’t necessarily lead to his
conclusion.

It’s true that it has got to be a Bernoulli distribution. That
is, if \( Y_i \) is a random variable for the outcome of the
i-th unit, then

\[ P({Y_i=1}) = p_i \]

or, equivalently,

\[ E(Y_i) = p_i \]

This has to be true. Note how I indexed the right-hand side by \( i \):
the term \( p_i \) depends on \( i \). It’s certainly
not true in general that \( P(Y_i = 1) = p \), where \( p \) is a
constant independent of \( i \).

In logistic regression, we model \( p_i \) with a likelihood that
assumes

\[ \text{logit}(p_i) = x_i b \]

So these are our assumptions:

(1) The \( \text{logit} \) link function is correct.

(2) \( \text{logit}(p_i) = x_i b \); i.e., the relation
is linear and all necessary predictors are in the model; i.e., the
model is correctly specified.

(3) The \( i=1,\dots,N \) observations are independent.

The robust variance estimator is robust to violations of assumptions (1)
and (2). It does require (3), but you can specify clusters and then only
independence of the clusters is assumed.

The MLE is also quite robust to (1) being wrong. If the link function is
really probit and you estimate a logit, everything’s almost always
fine.

Now if (2) is wrong, strictly speaking, you are in trouble with the
interpretation of the point estimates of your model, never mind the variance
estimates. Imagine \( \text{logit}(p_i) \) is truly quadratic in
\( x_i \), but you fit it linear in \( x_i \).
What’s the interpretation of the coefficient? It’s the best fit
of a straight line to something that’s not straight! It’s got
some meaning, but it’s somewhat problematic.

Well, the robust variance estimator will do a good job of giving you
variance estimates and confidence intervals for this problematic case of a
misspecified model. That is, if one imagines resampling the data and each
time fitting the same misspecified model, then you get good coverage
probabilities with respect to the “true” population parameters
of the misspecified model, i.e., the best-fitting straight line in the
population to something that’s not straight.
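Here is a small simulation sketch of that resampling story (Python/numpy, illustrative values): the true logit is quadratic, we fit a linear logit anyway, take the "true" parameter to be the best-fitting linear logit in a large simulated population, and check the coverage of robust confidence intervals.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logit(X, y, iters=25):
    # Newton-Raphson for a logistic regression MLE
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        W = p * (1.0 - p)
        b = b + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return b

# "Population": logit(p) is truly quadratic in x, but we will fit it linear
Npop = 100_000
xpop = rng.uniform(-2.0, 2.0, Npop)
ppop = 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * xpop - 0.8 * xpop**2)))
Ypop = rng.binomial(1, ppop)
Xpop = np.column_stack([np.ones(Npop), xpop])
B = fit_logit(Xpop, Ypop)      # best-fitting *linear* logit in the population

# Repeatedly resample, refit the misspecified model, check robust-CI coverage
cover, reps, n = 0, 300, 400
for _ in range(reps):
    idx = rng.integers(0, Npop, n)
    X, y = Xpop[idx], Ypop[idx]
    b = fit_logit(X, y)
    p = 1.0 / (1.0 + np.exp(-X @ b))
    A = X.T @ ((p * (1.0 - p))[:, None] * X)      # information
    S = X.T @ (((y - p) ** 2)[:, None] * X)       # score outer products
    V_rob = np.linalg.inv(A) @ S @ np.linalg.inv(A)
    se = np.sqrt(V_rob[1, 1])
    cover += (b[1] - 1.96 * se <= B[1] <= b[1] + 1.96 * se)

coverage = cover / reps        # should be close to the nominal 0.95
```

The point of the sketch: even though the fitted model is wrong, the robust intervals cover the population's best-fitting linear-logit slope at close to the nominal rate.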

On the other hand, if you have confidence that your model is not
misspecified, then the ML variance estimator is theoretically more
efficient.

Advice

In summary, my personal advice (and I have respect for conflicting opinions)
is

I never worry about whether (1) is true. I assume the \( \text{logit} \)
link is OK.

If I think the model is reasonably specified, I use the ML
variance estimator for logistic regression.

Only if I have good reason to believe that the model is poorly
specified, say because it fails goodness-of-fit tests, would I use the
robust variance estimator. Sometimes one just has to live
with missing predictors and badly fitting models because data were
collected for only a few predictors. In this case, I’d use the
robust variance estimator.

And, obviously, I’d use the robust variance estimator if I had
clustered data.

This recommendation is in contrast to the advice I’d give for linear
regression for which I’d say always use the robust variance
estimator.

The robust variance estimator is only approximate for ML models

There are also other theoretical reasons to be keener on the robust variance
estimator for linear regression than for general ML models. The robust
variance estimator uses a one-term Taylor series approximation. In linear
regression, the coefficient estimates, \( b \), are a linear function of
\( y \); namely,

\[ b = (X'X)^{-1} \; X'y \]

Thus the one-term Taylor series is exact and not an approximation.
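A quick numerical check of this point (Python/numpy, made-up data): because \( b \) is linear in \( y \), perturbing \( y \) perturbs \( b \) with no higher-order remainder at all; for instance, doubling \( y \) exactly doubles \( b \).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(size=n)      # illustrative data

b = np.linalg.solve(X.T @ X, X.T @ y)             # b = (X'X)^{-1} X'y
b_doubled = np.linalg.solve(X.T @ X, X.T @ (2 * y))
# b is exactly linear in y, so b_doubled == 2 * b to machine precision
```

No such exact relation holds for a logit MLE, where \( b \) is a nonlinear function of \( y \).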

For logistic regression and other MLEs, the ignored higher-order terms in
the Taylor series are nonzero. So it’s truly an approximation in
these cases.

Follow-up question

What are the “true” population parameters for which the robust
variance estimator gives good coverage properties?

Linear regression model

First let me assume a linear regression model, and later I’ll discuss MLEs.

Consider the entire population from which the sample is drawn. Let \( X \)
be the matrix of independent variables and \( Y \) the vector of dependent
variables for the entire population. Then consider

\[ B = (X'X)^{-1} \; X'Y \]

The parameter \( B \) is the coefficient vector for the linear model for
the entire population. \( Y \) may be linear in \( X \) or it may not.
\( B \) is simply the best least-squares coefficients for the entire
population. \( B \) is what I was referring to when I said “the
‘true’ population parameters” in my above explanation.

The parameter \( B \) is what the robust variance estimator considers you
to be estimating. The sample estimate is

\[ b = (x'x)^{-1} \; x'y \]

where \( x \) is the matrix of the sample values of the independent
variables and \( y \) is the vector of sample dependent variables. If
there were sampling weights, the above equation would have weights added in
the appropriate places.

Anyhow, \( b \) is an estimate of \( B \). The robust variance estimator
estimates \( V(b) \) such that nominal \( (1 - \alpha) \) confidence
intervals constructed from it have \( B \) in the interval about \( (1 - \alpha) \)
of the time if one were to repeatedly resample from this population.
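This coverage claim can be sketched in a few lines of Python/numpy (simulated population, illustrative values); here \( Y \) is deliberately nonlinear in \( X \), anticipating the misspecified case discussed next.

```python
import numpy as np

rng = np.random.default_rng(7)

# "Population": Y is *not* linear in X, so the linear model is misspecified;
# B is nonetheless the best least-squares line in the population.
Npop = 100_000
x = rng.uniform(-2.0, 2.0, Npop)
Y = 1.0 + x + 0.5 * x**2 + rng.normal(size=Npop)
X = np.column_stack([np.ones(Npop), x])
B = np.linalg.solve(X.T @ X, X.T @ Y)             # population parameter B

# Repeatedly resample, fit b, and check robust (HC0 sandwich) CI coverage of B
cover, reps, n = 0, 500, 200
for _ in range(reps):
    idx = rng.integers(0, Npop, n)
    Xs, ys = X[idx], Y[idx]
    b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
    e = ys - Xs @ b                               # residuals
    XtXi = np.linalg.inv(Xs.T @ Xs)
    V = XtXi @ (Xs.T @ ((e**2)[:, None] * Xs)) @ XtXi
    se = np.sqrt(V[1, 1])
    cover += (b[1] - 1.96 * se <= B[1] <= b[1] + 1.96 * se)

coverage = cover / reps                           # close to the nominal 0.95
```

The nominal 95% robust intervals cover \( B \)'s slope at close to 95%, even though no straight line is "correct" for this population.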

If \( Y \) is not linear in \( X \) because of incorrect functional form
or missing predictors, then the interpretation of \( B \) is problematic.
\( B \) can be considered to be the best least-squares linear fit for this
set of predictors. \( b \) and \( V(b) \) are “robust to
misspecification” in that \( b \) estimates \( B \) and that
\( V(b) \) is a valid estimate of the variance of \( b \) even though
misspecification is present.

Maximum likelihood models

The same story carries over to MLEs, with the likelihood in place of the
least-squares criterion: the population parameter \( B \) maximizes
\( L(B; Y, X) \), the likelihood evaluated over the entire population, and
the sample estimate \( b \) maximizes \( L(b; y, x) \).

\( L(B; Y, X) \) is not necessarily the true likelihood for the population;
i.e., it is not necessarily the correct distribution of \( Y|X \). The
theory doesn’t require it; it can be any function.

Standard MLE theory, on the other hand, requires \( L(b; y, x) \) to be the
true distribution function for the sample.

When there are sampling weights or clustering, \( L(b; y, x) \) is in no
sense a valid likelihood; it’s clearly not the distribution of the
sample when there are weights or cluster sampling. \( L(b; y, x) \) merely
has to estimate the arbitrary \( L(B; Y, X) \) for our theory to hold.
This is why the survey theorists call \( L(b; y, x) \) a pseudolikelihood,
and it’s also why you can’t do standard likelihood ratio tests
with it.

However, if \( L(B; Y, X) \) is not close to the true distribution, its
interpretation is problematic, just as in the case of a misspecified linear
regression.