Re: st: How much can we trust Stata's non-linear solver(s)?

+------+
|Part 1|
+------+
Joachim Wagner <wagner@uni-lueneburg.de> has read McCullough and Vinod's
American Economic Review (AER) article and is now concerned about the results
he gets from statistical software. There haven't been many followup questions
or responses, so perhaps everyone on the list just knows that Stata is doing
things right, or perhaps there is just not much interest. As someone who does
optimization for a living, we find the issues interesting and have quite a bit
to say, so here is most of Joachim's question and a rather long response,
> at the weekend I read a paper by McCullough and Vinod on "Verifying
> the solution from a nonlinear solver: A case study" in the June 2003
> issue of the American Economic Review. For those of you who are not
> in economics, this is an absolute top-journal in our discipline, and
> the authors are considered to be top-experts in the field. I was
> shocked (really, believe me!) to learn how much can go wrong and
> goes wrong when one tries to find the ML solution of such day-to-day
> problems as the parameters of a probit model using a PC and several
> programs. The authors give examples, and they argue that "the
> researcher's job is not done when the program reports convergence -
> it is only beginning" (p. 876), and ask us to examine the gradient,
> inspect the solution path, evaluate the Hessian, including an
> eigensystem analysis, and profile the likelihood. Uff. Do I really
> have to learn how to do all this? Do I have to use three or more
> different packages to compare the solutions? Shall I demand this
> when I sit down to write my next referee report later this week?
> Must I be aware that the next report I receive will demand me to do
> all this?
>
> Unfortunately, the authors do not name horses and riders (as we say
> in Germany - in German, of course), so I have no idea whether Stata
> is among the programs one can trust or not. [...]
Joachim goes on to ask
> From a slightly different perspective, would it be a good idea to
> implement the steps suggested by the authors for a
> "post-convergence" check in Stata ?
> [...]
----------------------
The VERY short answers
----------------------
There are really two main issues taken up by McCullough and Vinod:
1) You may not be able to trust such simple estimators as probit
because a forthcoming article by Stokes (2003) got different
results from 7 different software packages. Of particular concern,
the model and data were taken from a simple example in G.S.
Madalla's Introduction to Econometrics text where the reported
result was also "wrong". (Note, this is from the data in table 8.4
in the 3rd edition)
2) You should extensively evaluate any maximum likelihood (ML) result
because the authors could not reproduce the results from one
user-written ML estimator of voter participation, the results of
which were published in an earlier AER paper.
They tried replication with several packages (though we infer not
Stata). Four packages were unable to maximize the likelihood, a
5th was able to maximize to a likelihood that agreed to 11
significant digits with the 6th, and author's "gold standard"
result, was obtained using automated analytic derivatives and TSP.
They go on to conclude that while the likelihood could be maximized
using analytic derivatives, the combination of data and model had
insufficient information to draw inferences, mainly because the
condition number of the Hessian was too large and because two of
the profile likelihoods were deemed insufficiently quadratic in
shape.
The authors suggest 4 steps to evaluate an estimate.
There was a larger issue about replication of results in the social sciences
and the difficulty in obtaining the replication data and programs, but that is
beyond this post.
Short answer to 1.
------------------
Stata performs flawlessly on the probit example. We call the "problem" in
this data "perfect prediction". It is treated correctly in Stata and has
been for 12 years, it is discussed over 3 pages in [R] probit, and there
are even options in -predict- after -probit- for handling such cases --
see option -rules-.
Extending this answer to other official estimators
--------------------------------------------------
Great care has been taken to ensure that Stata's official estimators have
good optimization behaviour. A few of the things that are routinely done
include: identify and correct problems in the data, get good starting
values, transform the metric in which parameters are estimated, maximize
concentrated likelihoods, use alternate optimization methods or alternate
model realizations. For almost all official estimators this is enough to
ensure that your estimates are stable and robust to virtually all
datasets.
Even so, optimization can be hard and some likelihoods are particularly
difficult. For official estimators with difficult likelihoods, e.g.
-heckman-, -arch-, -xtprobit-, -boxcox-, the difficulty is discussed in
the manual entry and suggestions are made or tools are provided to help
diagnose when convergence may not have been achieved, and to evaluate the
stability of the estimates (e.g. -quadchk-).
Answer to 2 (OK, not that short)
--------------------------------
Writing ML estimators can be tricky and we encourage all those who write
estimators to look carefully at their results. The authors' four
suggested checks strike me as a fine beginning with two and almost three
of these easily automated by specifying options to -ml-. More
importantly, always simulate some data with known coefficients and run
your estimator on it so you can check all of your computations.
When we wrote the user-programmed ML estimator (estimating the parameters
of a model of voter participation) in Stata (using method lf) and turned
our estimator loose on the data graciously provided by Bruce McCullough,
it marched straight to a solution in 8 iterations. Given the difficulties
encountered by the authors, we expected there to be many more iterations,
some with "not concave" messages and possibly some missing or very large
standard errors in the regression table, the latter indicative of a poorly
conditioned covariance matrix. We saw none of these things.
The log-likelihood of our estimates agreed with those reported as the "gold
standard" by the authors to "only" 8 significant digits. This did not
surprise or bother us -- we benchmark ML estimators on about 15 different
platforms where there are differences in operating systems, math
libraries, and CPUs. We would not be happy with 8 digits, but that is
because we control the algorithms. The theoretically best that can be
hoped for is to agree to 15 digits (sometimes 16), but that is never
achieved. Even changing from an Intel Celeron to an Intel Pentium will
often lead to numbers that agree at "only" the 13th or 14th decimal point.
With the current problem we are talking about different computers,
different operating systems (we ran under Linux), different math libraries,
and different optimization algorithms.
The important question is do the solutions support the same statistical
inferences and that we did agree with the "gold standard"? Both packages
reported from 4 to 8 significant digits for the coefficient and standard
error (SE) estimates. When rounded to the number of digits reported, all
of SEs agreed exactly and all but two of the coefficients agreed exactly,
with the two coefficients still agreeing to 5 significant digits.
McCullough and Vinod do not report SEs for any of the estimators because
their eigenvalue analysis of the hessian could not clearly establish that
the VCE was of full rank. They are particularly concerned with the
condition number of the hessian and what it implies about the accuracy of
the eigenvalues and by inference about the rank of the VCE. While a
smaller condition number would be nice, our results indicate that the
parameter and SE estimates for this model are just fine, which casts some
doubt on the utility of using the condition number when deciding whether
you have sufficient information for inference. Two results are
particularly telling,
B1. You can easily increase or decrease the condition number by
scaling one or more variables in the model. We tested scalings
that decreased the condition number (this should be better) down
to 2.4e5 from the original scalings 6.5e9 and also up to 5.2e10.
In all cases the coefficients remained the same to the number of
digits shown in output, of course they were scaled. More
importantly, the z-statistics and associated p-values remained the
same regardless of the scaling and implied condition number.
B2. We bootstrapped our ML estimator using 1000 replicates and
importantly had no trouble with convergence in any of the 1000
replicates. Bootstrapped standard errors and confidence intervals
were generally in agreement with their asymptotic analogues
reported by the ML estimator. They were different, and in two
cases marginally significant regressors became marginally
insignificant, but on the whole they were similar.
McCullough and Vinod make 4 specific suggestions:
C1. Examine the gradient -- is it zero?
C2. Inspect the solution path (i.e. trace) -- does it exhibit the
expected rate of convergence? <that is to say quadratic>
C3. Evaluate the Hessian, including an eigensystem analysis -- is the
Hessian negative definite? Is it well-conditioned?
C4. Profile the likelihood to assess the adequacy of the quadratic
approximation.
These are all good suggestions, though there is no clear-cut way to
establish a well-conditioned hessian. That was the point of item B1
above.
There is also some overlap. Suggestions C2 and C4 both have something to
say about the shape of the hessian in the neighborhood of the solution,
though C4 is more specific.
We claimed earlier that two and almost three of these suggestions can be
automated by Stata. Specifying the -nrtolerance(#)- option on most
official estimators or on -ml maximize- for user-written estimators
specifies that the hessian scaled gradient
-1
tol = gH g'
be less than #. This implies that the hessian must be positive definite
when convergence is declared, because the full inverse is required to
compute the tolerance. It also implies that the gradient is 0 within our
ability to define 0 using the shape of the likelihood -- the hessian. We
have been considering making -nrtolerance()- the default tolerance for
Stata's official estimators. A good value for # is 1e-6.
Alternately, specifying the -gradient- option on most estimators or on -ml
maximize- will cause -ml- to display the gradient at each iteration.
Suggestion C3 can be obtained for most official estimators and all
user-programmed estimators by typing -ml graph- after estimation. A graph
tracing the last 20 iterations of the log likelihood is show. These
values are also available in the matrix e(ilog).
The profile likelihood, suggestion C4, currently requires coding a loop
over the values of the coefficient to be profiled and using either the
-constraint()- or -offset()- options to lock one of the coefficient
values. Profile likelihoods are just one of many ways to evaluate the
validity of asymptoticly justified statistics, and one of the least
direct. More direct options include bootstrapping when evaluating
performance on specific samples or Monte Carlo simulation when evaluating
an estimator in general.
What do we suggest?
If you are using an official estimator:
The asymptotic properties of the SEs and CIs reported by official
estimators have already been established, otherwise Stata would not
report them. If you are concerned with the small-sample performance
of an official estimator on your specific data, bootstrap and, if you
are worried about asymmetry, look particularly at the percentile CI.
If you are programming your own estimator:
Simulate some data from you statistical model and use that to check
your estimator. This isn't really related to convergence, though you
could use it to establish correct coverage for the null, but you can
be much surer that you have coded you likelihood correctly. Use the
-nrtolerance()- option and look carefully at your iteration log. If
you have many "not concave" messages near the end of you iteration
log, you may have insufficient information to estimate the model. If
you are worried that you data is insufficient to support asymptotic
statistics, bootstrap. If you are failing to converge, look into
rescaling some of the variables.
<See Part 2 that is in another post>
-- Vince -- David
vwiggins@stata.com ddrukker@stata.com
*
* For searches and help try:
* http://www.stata.com/support/faqs/res/findit.html
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/