Tools of the Trade: Getting those standard errors correct in small sample cluster studies

Some of the earliest posts on this blog concerned the inferential challenges of cluster randomized trials when clusters are few in number (see here and here for two examples of discussion). Today’s post continues this theme with a focus on better practice in the treatment of standard errors. In the last few years this better practice has become more common, but it is still not universal in the papers we come across. Hence another post in our continuing series, "Tools of the Trade".

For many years, researchers have recognized the need to correct regression standard error estimates for the dependence of observations within clusters. The usual solution is to apply Huber-White heteroskedasticity-consistent estimates of OLS standard errors to the cluster setting (the cluster-robust standard error, or CRSE). Without such a correction, the naïve OLS standard errors are downward biased, which in turn leads to an elevated risk of erroneously rejecting the null hypothesis of no impact. As highlighted in the 2004 Bertrand, Duflo, and Mullainathan paper, the CRSE performs well with a sufficient number of clusters but still results in downward biased standard errors when clusters are few (say 30 or fewer). And an increasing number of impact evaluation (IE) studies look at aggregate interventions and face this challenge of having the number of observed clusters fall far, far short of infinity.
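For readers who want the formula behind the CRSE, here is the usual textbook sandwich form. The notation is my own shorthand rather than anything taken from the papers discussed here: X stacks the regressors for all observations, X_g holds the rows belonging to cluster g, and û_g is the vector of OLS residuals for that cluster.

```latex
\[
\widehat{V}_{\mathrm{CR}}(\hat{\beta})
  = (X'X)^{-1}
    \left( \sum_{g=1}^{G} X_g' \hat{u}_g \hat{u}_g' X_g \right)
    (X'X)^{-1}
\]
```

The middle term averages over the G clusters, and the estimator is justified by letting G grow large. With only a handful of clusters that middle term is estimated from a handful of terms, which is one way to see why the small-sample problem arises.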

Enter the 2008 paper by Colin Cameron, Jonah Gelbach, and Douglas Miller, which explores whether cluster bootstrapping reduces bias even further than the CRSE. The authors investigate the relative performance of a multitude of bootstrap methods, and I refer interested readers to the paper for a taxonomy of approaches (what makes them all cluster bootstraps is that the resampling is done at the cluster level: either whole clusters are drawn with replacement from the original sample or, in the wild variants, the residuals are perturbed cluster by cluster). Most variants of the cluster bootstrap outperform the CRSE in various Monte Carlo analyses, especially a method the authors call wild cluster bootstrap-t, which applies a wild bootstrap to the OLS residuals at the cluster level and then bootstraps the test statistic (the Wald statistic) rather than the standard error itself.
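To make the wild cluster bootstrap-t idea concrete, here is a minimal Python sketch for the simplest case: a regression of an outcome on a constant and a single regressor. It is an illustration under my own assumptions, not the authors' code. In particular, the function names are hypothetical, the weights are Rademacher draws (+1 or -1 with equal probability, one draw per cluster), the null of a zero slope is imposed when forming the residuals, and the CRSE includes a Stata-style finite-sample factor.

```python
import random
from collections import defaultdict


def slope_and_cluster_t(x, y, cluster):
    """OLS of y on a constant and x; returns (slope, cluster-robust t-stat)."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    x_dev = [xi - x_bar for xi in x]
    s_xx = sum(d * d for d in x_dev)
    slope = sum(d * (yi - y_bar) for d, yi in zip(x_dev, y)) / s_xx
    resid = [(yi - y_bar) - slope * d for d, yi in zip(x_dev, y)]
    # Sum the scores (demeaned x times residual) within each cluster.
    score = defaultdict(float)
    for d, e, g in zip(x_dev, resid, cluster):
        score[g] += d * e
    n_clusters = len(score)
    # Finite-sample factor G/(G-1) * (N-1)/(N-K) with K = 2 (an assumption).
    factor = n_clusters / (n_clusters - 1) * (n - 1) / (n - 2)
    variance = factor * sum(s * s for s in score.values()) / s_xx ** 2
    return slope, slope / variance ** 0.5


def wild_cluster_bootstrap_p(x, y, cluster, reps=999, seed=0):
    """Two-sided bootstrap-t p-value for H0: slope = 0, with the null imposed."""
    rng = random.Random(seed)
    _, t_obs = slope_and_cluster_t(x, y, cluster)
    # Restricted model under H0: y is just its mean plus a residual.
    y_bar = sum(y) / len(y)
    resid0 = [yi - y_bar for yi in y]
    groups = sorted(set(cluster))
    exceed = 0
    for _ in range(reps):
        # One Rademacher weight per cluster, shared by all its observations.
        sign = {g: rng.choice((-1.0, 1.0)) for g in groups}
        y_star = [y_bar + sign[g] * e for e, g in zip(resid0, cluster)]
        _, t_star = slope_and_cluster_t(x, y_star, cluster)
        if abs(t_star) >= abs(t_obs):
            exceed += 1
    return t_obs, exceed / reps


if __name__ == "__main__":
    # Hypothetical data: 10 clusters of 20, cluster-level treatment, no effect.
    rng = random.Random(42)
    cluster, x, y = [], [], []
    for g in range(10):
        shock = rng.gauss(0, 1)  # error component shared within the cluster
        for _ in range(20):
            cluster.append(g)
            x.append(float(g % 2))
            y.append(shock + rng.gauss(0, 1))
    t_obs, p = wild_cluster_bootstrap_p(x, y, cluster)
    print(f"t = {t_obs:.2f}, wild cluster bootstrap-t p-value = {p:.3f}")
```

The regressors are held fixed across replications; only the signs of the residuals change, and they change together within a cluster, which is what preserves the within-cluster dependence. The p-value is the share of bootstrap Wald statistics at least as large in absolute value as the one from the original sample.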

What are the gains from such an approach? Well, in Monte Carlo simulations with 10 clusters and various error structures and cluster sizes, the wild cluster bootstrap-t rejects the null hypothesis 4.8% to 6.4% of the time (these figures should be judged against the nominal test size of 5%). Meanwhile, in the same scenarios the CRSE rejects the null 8.2% to 18.3% of the time, and the naïve OLS standard error rejects at a rate of 10.6% to 77.0%. While the CRSE is an improvement over the naïve standard error, it still over-rejects the null at an uncomfortably high rate.

The authors recommend in their conclusion: “At the very least one should provide some small-sample correction of standard errors, such as … using a T-distribution with G or fewer degrees of freedom” (where G is the number of clusters). Beyond that, the various bootstrap methods, especially the wild cluster bootstrap-t procedure, can considerably reduce the remaining downward bias in standard errors. Douglas Miller graciously provides the Stata code for various bootstraps, including the wild bootstrap-t, for interested researchers.
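To get a feel for how much the T-distribution correction matters, here is a rough worked comparison. It assumes G - 1 degrees of freedom, which is a common convention and one admissible choice under the authors' "G or fewer"; the critical values are the standard two-sided 5% values from t tables.

```latex
\[
\text{reject } H_0 \text{ at the 5\% level if } |t| > t_{0.975,\,G-1},
\qquad
t_{0.975,\,9} \approx 2.262 \ \ (G = 10),
\qquad
z_{0.975} \approx 1.960 .
\]
```

So with 10 clusters, a t-statistic of 2.1 that clears the usual normal threshold of 1.96 would no longer count as significant at the 5% level.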

As I said, I was reminded of this good practice when reading several recent papers reporting ongoing work. In one such paper, Julien Labonne investigates how the presence of a cash transfer program in a politician’s constituency affects the local incumbent’s electoral performance. Specifically, he investigates the impacts of a large-scale conditional cash transfer (CCT) program in the Philippines on incumbent vote share in municipal elections. In his data of 19 municipalities, 11 randomly selected municipalities receive the program everywhere, while in the other eight municipalities only half of the villages (again randomly selected) receive the program. He uses this exogenous variation in program coverage to test contrasting models of voter and politician behavior and finds that the incumbent vote share is 26 percentage points higher in the municipalities where every village benefits from the program.

In another paper, Dana Burde and Leigh Linden look at the effects of locating schools in Afghan villages that previously sent children to larger-scale traditional government schools located at some distance. The village school program placed local schools in each village and provided teacher training. The evaluation takes advantage of the randomized phase-in of the program across 12 equally sized village groups and finds a 42 percentage point increase in enrollment and a 0.5 standard deviation increase in test scores among children in beneficiary villages. Equally notable, girls gain the most from this intervention.

Of course, what is relevant for the topic today is the small number of municipalities or village clusters in the two studies. To avoid the pitfalls discussed above, both sets of authors assess the significance of impact parameters against the T-distribution with appropriately few degrees of freedom and also check the CRSEs against the wild cluster bootstrap standard errors (where they find no significant difference between the two). In both papers, despite the necessarily small number of clusters, the authors confidently find strikingly large and significant effects.