RE: st: Bootstrapping & clustered standard errors (-xtreg-)

Hi Tobias,
Ok, well your comments below remind me of:
Wang, J., Carpenter, J.R., & Kepler, M.A. (2006). Using SAS to conduct nonparametric residual bootstrap multilevel modeling with a small number of groups. Computer Methods and Programs in Biomedicine, 82(2), 130-143.
I don't know if Stata offers a similar procedure. In conjunction with the above paper, I also strongly recommend taking a look at:
Maas, C.J.M., & Hox, J.J. (2004a). The influence of violations of assumptions on multilevel parameter estimates and their standard errors. Computational Statistics & Data Analysis, 46, 427–440. http://igitur-archive.library.uu.nl/fss/2007-1004-200713/Maas(2004)_influence%20of%20violations.pdf
Maas, C.J.M., & Hox, J.J. (2004b). Robustness issues in multilevel regression analysis. Statistica Neerlandica, 58, 127–137. http://joophox.net/publist/sn04.pdf
Cam
> From: tobias.pfaff@uni-muenster.de
> To: statalist@hsphsun2.harvard.edu
> Subject: RE: st: Bootstrapping & clustered standard errors (-xtreg-)
> Date: Mon, 12 Sep 2011 17:51:48 +0200
>
> Dear Stas, Bryan,
>
> I was maybe not clear why I want to bootstrap at all:
>
> My fixed effects regression with clustered SE works fine.
> [-xtreg depvar indepvars, fe vce(cluster region) nonest dfadj-]
>
> However, my predicted residuals (-predict res_ue, ue-) are not normally
> distributed.
> Am I mistaken that I need normally distributed residuals for the
> t-statistics to be unbiased?
>
> If I'm not mistaken then I would like to do a robustness check with
> bootstrapped standard errors (where the normal distribution of residuals
> doesn't matter for the z-statistics to be unbiased) to see if my results
> change or not.
> And I still get the error message of insufficient observations when trying
> to bootstrap with clustered SE. Using -idcluster()- does not help.
> I have 76,000 obs., 8100 individuals, 108 clusters, and 36 regressors. I
> don't think that the bootstrap would produce a sample with fewer cluster
> id's than regressors.
> So I still don't know why I get the error message after -xtreg depvar
> indepvars, fe vce(bootstrap, reps(3) seed(1)) cluster(region_svyyear) nonest
> dfadj-?
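
On the arithmetic behind "fewer cluster id's than regressors": when G clusters are resampled with replacement, a replicate contains on average G(1 - (1 - 1/G)^G) distinct clusters, which is about 68 for G = 108 and so well above 36 regressors. What follows is a minimal sketch in Python (not Stata, and not a description of what -bootstrap- does internally) that checks this by formula and by simulation. The function names are made up for illustration, and the numbers 108 and 36 are taken from the message above.

```python
import random


def expected_distinct(n_clusters):
    """Expected number of distinct clusters in one resample of
    n_clusters clusters drawn with replacement."""
    return n_clusters * (1.0 - (1.0 - 1.0 / n_clusters) ** n_clusters)


def min_distinct(n_clusters, reps, seed):
    """Smallest number of distinct clusters observed over `reps`
    simulated bootstrap resamples (hypothetical helper)."""
    rng = random.Random(seed)
    smallest = n_clusters
    for _ in range(reps):
        drawn = {rng.randrange(n_clusters) for _ in range(n_clusters)}
        smallest = min(smallest, len(drawn))
    return smallest


G, K = 108, 36  # clusters and regressors, as reported in the thread
print("expected distinct clusters:", round(expected_distinct(G), 1))
print("worst case over 2000 resamples:", min_distinct(G, reps=2000, seed=1))
print("regressors:", K)
```

If the worst case stays well above the number of regressors, a shortage of distinct clusters is an unlikely explanation for "insufficient observations", although this check alone cannot say what the actual cause is.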
>
> WEIGHTS:
> Your arguments regarding the usage of weights were convincing. However,
> -xtreg- only allows for weights that do not change for the individuals over
> the years. Our panel dataset has a variable for the design weight that does
> not change over the years, but this weight does not contain information on
> non-response. Another weight variable in the dataset contains information on
> selection probabilities and non-response, but it obviously changes over the
> years for each individual, and cannot be used with -xtreg-. So I wouldn't
> know how to incorporate information on non-response with -xtreg-?
>
> Earlier in this thread Cameron said that bootstrap only makes sense in my
> case if I would use "custom bootstrap weights computed by a statistical
> agency for a complex sampling frame". It seems that bootstrap cannot be used
> with weights, anyway. I guess that weighted sampling is still not
> implemented in bootstrap, as stated 8 years ago
> (http://www.stata.com/statalist/archive/2003-09/msg00180.html).
>
> Thanks very much for your help,
> Tobias
>
> P.S.: I cited the PNAS paper since it is a rare exception in my field
> (happiness economics) that an empirical paper says something about
> regression diagnostics at all.
>
>
> -----Original Message-----
> > Date: Thu, 08 Sep 2011 17:20:35 -0400
> > Subject: Re: st: Bootstrapping & clustered standard errors (-xtreg-)
> > From: Bryan Sayer <bsayer@chrr.osu.edu>
> > To: statalist@hsphsun2.harvard.edu
>
> ... The
> sampling weights control mostly for unequal probabilities of
> selection, and for well-designed and well-conducted surveys,
> non-response adjustments are not that large, while probabilities of
> selection might differ quite notably.
>
>
> I disagree with the part about non-response adjustments not being that
> large. It really depends on the survey. Surveys in the U.S. may have
> response rates as low as 25 to 30%, meaning that the non-response
> adjustments may be pretty large.
>
> However, it is really the difference in response rates for different groups
> that matters. For example a survey I am working with shows a noticeable
> difference in response rates between the land-line phone and the cell phone
> only group.
>
> The design effects for surveys can be broken into pieces for clustering,
> stratification, and weighting. And weighting can be further classified into
> the design weights and the non-response adjustments, if one really wanted to
> pursue the matter.
>
> But more related to the point Stas is making, often the elements of the
> survey design and weights that are incorporated into the survey will reflect
> information that is not available to the user. Simply put, it may not be
> possible to fully condition on the true sample design. This is because some
> of the elements used in the sample design and weighting process cannot be
> disclosed in public files for confidentiality reasons.
>
> Working in sampling, I am obviously biased toward using the weights. But
> fundamentally, I believe that it is often impossible for the user to know
> whether they have fully conditioned on the sample design or not.
>
> Most likely, lots of smart people worked hard on the sample design and
> everything that goes into producing the data that you are using. Accept that
> they (hopefully) did their job well. So if you have the sample design
> information available to you, I don't see any reason to *not* use it.
>
> My impression is that bootstrapping of complex survey design data, while
> possibly past its infancy, is probably still not very fully developed. I
> know lots of very smart people who work on it, but it just does not seem to
> generalize very well, at least not as well as a Taylor series linearization.
>
> Just my 2 cents worth.
>
> Bryan Sayer
> Monday to Friday, 8:30 to 5:00
> Phone: (614) 442-7369
> FAX: (614) 442-7329
> BSayer@chrr.osu.edu
>
>
> On 9/8/2011 4:28 PM, Stas Kolenikov wrote:
>
> Tobias,
>
> I would say that you are worried about exactly the wrong things. The
> sampling weights control mostly for unequal probabilities of
> selection, and for well-designed and well-conducted surveys,
> non-response adjustments are not that large, while probabilities of
> selection might differ quite notably. While it is true that if you can
> fully condition on the design variables and non-response propensity,
> you can ignore the weights, I am yet to see an example where that
> would happen. Believing that your model is perfect is... uhm... naive,
> let's put it mildly; if anything, econometrics moves away from making
> such strong assumptions as "my model is absolutely right" towards
> robust methods of inference that would allow for some minor deviations
> from the "absolutely right" scenario. There are no assumptions of
> normality made anywhere in the process of calculating the standard
> errors. All arguments are asymptotic, and you see z- rather than
> t-statistics in the output. In fact, the arguments justifying the
> bootstrap are asymptotic, as well. You can still entertain the
> bootstrap idea, but basically the only way to check that you've done
> it right is to compare the bootstrap standard errors with the
> clustered standard errors. If they are about the same, either of them is
> usable; if they are wildly different (say by more than 50%), I would
> not use either of them, but I would first check to see that the bootstrap
> was done right.
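
A toy version of the comparison described above may make the idea concrete. This is a sketch in Python on simulated data, not Stata code and not how -xtreg- or -bootstrap- are implemented. It assumes a single regressor, persons fully nested within clusters (so the within-person demeaning can be done once before resampling), deliberately skewed errors, and a G/(G-1) small-sample factor for the clustered SE. All function names and simulation settings are invented for illustration.

```python
import math
import random
import statistics


def simulate(n_clusters, persons, periods, rng, slope=1.0):
    """Return one list per cluster of (x, y) pairs, already demeaned
    within person. Errors are skewed (shifted exponential) and share a
    cluster-by-period shock, so observations in a cluster are correlated."""
    clusters = []
    for _ in range(n_clusters):
        x_shock = [rng.gauss(0, 1) for _ in range(periods)]
        e_shock = [rng.expovariate(1.0) - 1.0 for _ in range(periods)]
        pairs = []
        for _ in range(persons):
            alpha = rng.gauss(0, 1)  # person fixed effect
            x = [x_shock[t] + rng.gauss(0, 1) for t in range(periods)]
            y = [alpha + slope * x[t] + e_shock[t] + rng.expovariate(1.0) - 1.0
                 for t in range(periods)]
            mx, my = sum(x) / periods, sum(y) / periods
            pairs.extend((xi - mx, yi - my) for xi, yi in zip(x, y))
        clusters.append(pairs)
    return clusters


def fe_slope(clusters):
    """Within (fixed-effects) slope and the sum of squared demeaned x."""
    sxy = sum(x * y for c in clusters for x, y in c)
    sxx = sum(x * x for c in clusters for x, y in c)
    return sxy / sxx, sxx


def cluster_se(clusters):
    """Cluster-robust (sandwich) standard error of the within slope."""
    b, sxx = fe_slope(clusters)
    scores = [sum(x * (y - b * x) for x, y in c) for c in clusters]
    g = len(clusters)
    return math.sqrt(g / (g - 1) * sum(s * s for s in scores)) / sxx


def bootstrap_se(clusters, reps, rng):
    """Pairs cluster bootstrap: resample whole clusters with replacement
    and take the standard deviation of the re-estimated slopes."""
    sums = [(sum(x * y for x, y in c), sum(x * x for x, y in c)) for c in clusters]
    g = len(sums)
    draws = []
    for _ in range(reps):
        picked = [sums[rng.randrange(g)] for _ in range(g)]
        draws.append(sum(p[0] for p in picked) / sum(p[1] for p in picked))
    return statistics.stdev(draws)


rng = random.Random(1)
clusters = simulate(n_clusters=108, persons=10, periods=5, rng=rng)
b, _ = fe_slope(clusters)
se_cl = cluster_se(clusters)
se_bs = bootstrap_se(clusters, reps=500, rng=rng)
print(f"slope={b:.3f} cluster SE={se_cl:.4f} bootstrap SE={se_bs:.4f} "
      f"ratio={se_bs / se_cl:.2f}")
```

Neither standard error relies on normal errors here; the point of the exercise is only to see whether the ratio of the two stays close to 1 or drifts toward the 50% discrepancy mentioned above.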
>
> I know that PNAS is a huge impact factor journal in natural sciences,
> but a statistics journal? or an econometrics journal? I mean, it's
> cool to have a paper there on your resume, but I doubt many statalist
> subscribers look at this journal for methodological insights (some
> data miners or bioinformaticians or other statisticians on the margin
> of computer science do publish in PNAS, though). I would not turn to
> an essentially applied psychology paper for advice on clustered
> standard errors.
>
> The error that you report probably comes from the bootstrap producing
> a sample with fewer cluster identifiers than regressors in your model.
> Normally, this would be rectified by specifying -idcluster()- option;
> however in some odd cases, the bootstrap samples may still be
> underidentified. I don't know whether the fixed effects regression
> should be prone to such empirical underidentification. It might be,
> given that not all of the parameters of an arbitrary model are
> identified (the slopes of the time-invariant variables aren't).
>
> On Thu, Sep 8, 2011 at 3:30 AM, Tobias Pfaff
> <tobias.pfaff@uni-muenster.de> wrote:
>
> Dear Stas, Cam,
>
> Thanks for your input!
>
> I want to bootstrap as a robustness check since the residuals of my FE
> regression are not normally distributed, and bootstrapping does not
> assume normality of the residuals
> (e.g., Headey et al. 2010, appendix p. 3,
> http://www.pnas.org/content/107/42/17922.full.pdf?with-ds=yes).
>
> If I do bootstrapping with clustered standard errors as Jeff has
> explained, I get the following error message:
>
> - insufficient observations
> an error occurred when bootstrap executed xtreg, posting missing
> values -
>
> Cam, you say that I would need custom bootstrap weights. My dataset
> provides individual weights with adjustments for non-response etc. I do
> not use weights for the regression because the possible selection bias is
> mitigated due to the fact that the variables which could cause the bias
> are included as control variables (e.g., income, employment status).
> Thus, I would argue that my model is complete and the unweighted
> analysis leads to unbiased estimators.
>
> 1. Would you still include weights for the bootstrapping?
>
> 2. Does bootstrapping need more degrees of freedom than the normal
> estimation of -xtreg- so that I get the above error message?
>
> 3. If bootstrapping is not a good idea in this case, what can I do to
> counter the violation of the normality assumption of the residuals?
> (I already checked transformations of the variables, but that doesn't
> help.)
>
> Regards,
> Tobias
>
>
> -----Original Message-----
>
> Date: Wed, 7 Sep 2011 10:24:33 -0400
> Subject: RE: st: Bootstrapping & clustered standard errors (-xtreg-)
> From: Cameron McIntosh <cnm100@hotmail.com>
> To: statalist@hsphsun2.harvard.edu
>
> Stas, Tobias
> I agree with Stas that there is not much point in using the bootstrap in
> this case, unless you have custom bootstrap weights computed by a
> statistical agency for a complex sampling frame, which would incorporate
> adjustments for non-response and calibration to known totals, etc. I
> don't think that is the case here, so I would go with the -cluster- SEs
> too.
> My two cents,
> Cam
>
>
> Date: Wed, 7 Sep 2011 09:03:27 -0500
> Subject: Re: st: Bootstrapping & clustered standard errors (-xtreg-)
> From: skolenik@gmail.com
> To: statalist@hsphsun2.harvard.edu
>
> Tobias,
>
> can you please explain why you need the bootstrap at all? The
> bootstrap standard errors are equivalent to the regular -cluster-
> standard errors asymptotically (in this case, with the number of
> clusters going off to infinity), and, if anything, it is easier to get
> the bootstrap wrong than right with difficult problems. If the
> -cluster- option works at all with -xtreg-, I see little reason to use
> the bootstrap. (Very technically speaking, in my simulations, I've seen
> the bootstrap standard errors be more stable than -robust- standard
> errors when the number of bootstrap repetitions is large and in an
> appropriate relation to the sample size; whether that carries over to
> the cluster standard errors, I don't know.)
>
> On Tue, Sep 6, 2011 at 12:25 PM, Tobias Pfaff
> <tobias.pfaff@uni-muenster.de> wrote:
>
> Dear Statalisters,
>
> I do the following fixed effects regression:
>
> xtreg depvar indepvars, fe vce(cluster region) nonest dfadj
>
> Individuals in the panel are identified by the variable "pid". The
> time variable is "svyyear". Data were previously declared as panel
> data with -xtset pid svyyear-.
> Since one of my independent variables is clustered at the regional
> level (not at the individual level), I use the option
> -vce(cluster region)-.
>
> Now, I would like to do the same thing with bootstrapped standard
> errors.
>
> I tried several commands, however, none of them works so far. For
> example:
>
> xtreg depvar indepvars, fe vce(bootstrap, reps(3) seed(1) cluster(region)) nonest dfadj
>
> where I get the error message "option cluster() not allowed".
>
> None of the hints in the manual (e.g., -idcluster()-, -xtset, clear-,
> -i()- in the main command) were helpful so far.
>
> How can I tell the bootstrapping command that the standard errors
> should be clustered at the regional level while using "pid" for panel
> individuals?
>
> Any comments are appreciated!
>
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/