The Internal and External Validity of the Regression Discontinuity Design: A Meta-Analysis of 15 Within-Study-Comparisons | JPAM Featured Article

Article Introduction

Chaplin et al. (2017) test the efficacy of the regression discontinuity (RD) design by comparing RD causal estimates at the treatment cutoff to estimates from randomized controlled trials (RCTs) at that same cutoff. The study identifies 15 previously completed within-study comparisons (WSCs) that explicitly examined this issue by assuming the RCT results are unbiased and then comparing them to the RD results.

The differences between these results can be thought of as estimates of the bias introduced by the RD method. The authors address the internal validity of RD by using the average estimated bias across all 15 WSCs. They also address concerns about external validity by using meta-analytic methods to examine variation in estimated bias across studies. Existing theory predicts no difference between RD and RCT estimates at the cutoff on average, but difficulties with the implementation and analysis of RD can produce deviations from this theoretical expectation.
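The basic logic can be sketched in a few lines of Python. This is an illustrative toy, not the authors' actual estimation procedure: each WSC contributes a standardized RD estimate and an RCT benchmark at the same cutoff, their difference estimates bias, and a precision-weighted average (as in a simple fixed-effect meta-analysis) summarizes bias across studies. All function names and numbers below are invented for illustration.

```python
def bias_estimates(rd_effects, rct_effects):
    """Per-study bias: RD estimate minus RCT benchmark (both standardized)."""
    return [rd - rct for rd, rct in zip(rd_effects, rct_effects)]

def precision_weighted_mean(estimates, weights):
    """Precision-weighted average, as in a fixed-effect meta-analysis."""
    total = sum(weights)
    return sum(e * w for e, w in zip(estimates, weights)) / total

# Three hypothetical WSCs, each with an RD estimate and an RCT benchmark.
rd = [0.12, 0.05, -0.03]
rct = [0.10, 0.07, -0.01]
biases = bias_estimates(rd, rct)
# Weight each bias estimate by a precision proxy (e.g. sample size).
avg_bias = precision_weighted_mean(biases, [100, 400, 250])
```

If the RD design is internally valid, the per-study biases should scatter around zero and the weighted average should be close to zero, which is essentially what the paper reports.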

This article preview is from the Spring 2018 issue of the Journal of Policy Analysis and Management (JPAM). APPAM invites authors from each issue to answer a few questions about their research to further promote the quality work in this highly ranked research journal. Check out this and other JPAM articles online.

Author Interview with Duncan Chaplin

What spurred your interest in this research into regression discontinuity efficacy, or about Within-Study Comparisons in general?

I’ve had at least two shots at doing RD in my career. Neither panned out because the treatments did not differ enough at the cut-points on the running variables. Nevertheless, I have remained very interested in the method, so I was excited when this opportunity to evaluate the efficacy of RD presented itself. The opportunity came about because my firm, Mathematica Policy Research, was able to hire Tom Cook, a famous evaluation methods expert. He has a great deal of expertise in within-study comparisons and brought the idea for this paper to us. I jumped on it because I relished the chance to learn about WSCs from Tom, and I am very glad I did: I believe I’ve gained a great deal of valuable knowledge during our time together.

Why did you choose to examine the efficacy of RD for this study over other methods?

RD is generally acknowledged as the most rigorous non-experimental method for obtaining internally valid impact estimates. At the same time, my own challenges with implementing this method suggested to me that bias was possible. I also sensed that the topic would be of great interest to other researchers and thus of great value to the field.

Please discuss the difference between using contrasts versus studies as a unit of analysis. In what situation would a researcher use a less-reliable random contrast?

There are advantages and disadvantages to using contrasts rather than studies as the unit of analysis in a WSC. The main advantage of looking at contrast level variation is that it enables one to make use of variation within studies. In our case we did find substantial independent within-study variation in three contrast characteristics. These were whether the estimated RD impact used parametric or non-parametric methods to control for the running variable, the sample size of the RD, and the sample size of the RCT.

None of these turned out to affect bias on average. The first two factors did help to explain variation in bias at the contrast level. However, those results do not let us distinguish noise from signal, which highlights the major disadvantage of contrast-level analyses that you mention in your question: they are less reliable. For that reason we focused a great deal of our paper on study-level results. We do so in two contexts.

One is where the study-level estimates are not shrunken at all and are therefore relevant to estimate the bias that might be hidden in impact estimates from an individual RD study that was not part of a WSC. This analysis is of interest to all those who want to generalize to future RD results from individual papers. But that analysis fails to take account of the fact that for many topics, and especially those of strong policy interest, one can usually interpret new results in the context of results from many other previous studies that tackled the same issue.

In this case the more appropriate focus for a WSC is on study-specific average shrunken estimates of RD bias, where the shrinkage process expresses each study’s average effect as though it had the attributes of the studies with the largest sample sizes and most desirable other design and analysis features. We tend to emphasize these study-level shrunken estimates most since they use the most information. We estimate bias close to 0 regardless of how we do our analysis.

However, the variation does differ. Indiscriminately selected individual contrasts within studies show considerable dispersion around their zero mean. The study-specific shrunken estimates, which also show no bias on average, have considerably less dispersion around this mean; they are even somewhat less dispersed than the unshrunken study-level estimates, which are in turn much less variable than the contrast-level estimates. Thus the shrunken study-level results, which show the least variation in bias, are most relevant for researchers who can contextualize their results within a body of related literature, a practice we strongly recommend whenever possible.
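The general idea behind "shrunken" study-level estimates is empirical Bayes shrinkage: noisy study means are pulled toward the grand mean in proportion to how much of their dispersion looks like sampling noise rather than true between-study variation. The sketch below is a minimal method-of-moments version of that idea, not the paper's actual model (the paper additionally adjusts for design features); all numbers are illustrative.

```python
def shrink(study_means, sampling_vars):
    """Empirical Bayes shrinkage of noisy study means toward the grand mean.

    Each study mean m_i with sampling variance v_i is replaced by
    grand + (tau2 / (tau2 + v_i)) * (m_i - grand), where tau2 is a
    method-of-moments estimate of the between-study variance.
    """
    n = len(study_means)
    grand = sum(study_means) / n
    # Observed dispersion of the study means around the grand mean.
    obs_var = sum((m - grand) ** 2 for m in study_means) / (n - 1)
    # Subtract average sampling noise to estimate true between-study
    # variance; floor at zero so the shrinkage factor stays in [0, 1].
    tau2 = max(obs_var - sum(sampling_vars) / n, 0.0)
    return [grand + (tau2 / (tau2 + v)) * (m - grand)
            for m, v in zip(study_means, sampling_vars)]

# Four hypothetical study-level bias estimates with equal sampling variance.
shrunken = shrink([0.3, -0.1, 0.1, 0.05], [0.02, 0.02, 0.02, 0.02])
```

Each shrunken estimate lands between the raw study mean and the grand mean, so the shrunken set is necessarily less dispersed than the raw set, which matches the pattern described above.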

What are common pitfalls researchers should consider when constructing a study using RD?

Our results suggest that researchers should be particularly careful to avoid two common pitfalls: giving too much weight to individual RD impact estimates based on parametric assumptions about the relationship between the running variable and the outcome, and giving too much weight to estimates based on relatively small samples (fewer than about 1,100 observations).

That said, our evidence does suggest that there is little bias associated with either of these practices on average, so it would be acceptable to use such evidence in combination with results from other studies. A third possible pitfall is ignoring the possibility of manipulation of the running variable. Interestingly, our data show no clear evidence of additional bias in papers that made this mistake, suggesting that manipulation may not be common among the RD studies covered in our paper. However, we still believe that testing for manipulation is a valuable tool for helping ensure that authors remain vigilant about avoiding RD when manipulation may have occurred.

A fourth possible pitfall is allowing one’s priors about the estimated impacts to affect decisions regarding functional form or bandwidth. We did not try to code this in our data; moreover, we suspect that this would be far less likely to happen in a WSC than when RD is run by itself, with no RCT check. Hence, our results do not address this potential pitfall. A final pitfall is overlooking the need for correctly estimated standard errors. This can be particularly important when the running variable takes on a relatively small number of values, in which case it is important to consider adjusting for clustering by the values of the running variable. We did not attempt to verify the accuracy of reported standard errors in our study, but we believe that doing such a study would be a valuable contribution to the field.

What challenges, if any, did you find when conducting this research? How can further study overcome these challenges?

We faced three key challenges when doing this study. First, we were not sure if we could identify all of the WSCs done on RD. We conducted an exhaustive search by emailing authors of all within-study comparisons we had on hand as well as a number of other evaluation experts. By early 2016 we had identified over 70 WSCs, 15 of which covered RD. As far as we know, that is all that existed on RD at that time, though we are now aware of one more that was conducted later in 2016.

The second key challenge we faced is that many of the studies did not provide us with all of the information we would have liked to have to do our analyses. One key piece that was often missing was the pooled standard deviation of the outcome. This was needed to standardize the impact estimates to make them comparable across studies and across outcomes within studies. We were able to use available information to estimate this pooled standard deviation in all cases, albeit sometimes with moderately strong assumptions.

Another key piece of information often missing was the standard error of the bias estimate. We attempted to use a number of pieces of available information to estimate this but ended up deciding that our estimates varied too much to be credible, so we relied primarily on just the sample sizes, which were always provided, to do a very approximate adjustment for precision across estimates.

A final key piece of information that was generally missing was response rates. In theory RD estimates could be biased by non-response. However, since none of the studies we analyzed provided complete information on this topic and few provided any related information, we were not able to include it in our analysis.
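For readers unfamiliar with the standardization step mentioned above: dividing an impact estimate by the pooled standard deviation of the outcome converts it into a standardized effect size, making estimates comparable across studies and outcomes. The sketch below shows the conventional pooled-SD formula as an illustration; the function names are mine, and the paper's own reconstructions sometimes required additional assumptions when this information was missing.

```python
import math

def pooled_sd(sd_t, n_t, sd_c, n_c):
    """Pooled standard deviation across treatment and control groups,
    weighting each group's variance by its degrees of freedom."""
    return math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                     / (n_t + n_c - 2))

def standardized_effect(impact, sd_t, n_t, sd_c, n_c):
    """Impact estimate expressed in pooled-standard-deviation units."""
    return impact / pooled_sd(sd_t, n_t, sd_c, n_c)

# Hypothetical example: a raw impact of 5 points with group SDs of 10
# corresponds to a standardized effect of 0.5.
effect = standardized_effect(5.0, 10.0, 51, 10.0, 51)
```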

A final challenge was that this was the first time a meta-analysis had been done of WSCs. Consequently we had to develop a code book. We were able to make some use of work done by the What Works Clearinghouse here at Mathematica, but for the most part we developed our own codebook using an iterative process that required us to revisit some studies a few times in order to ensure that the final dataset used consistent definitions across all of the variables included in our analyses.

How does this study impact or add to the existing research that can inform research practices? What would be the ideal next step for your research findings? How would you like to see your findings implemented?

As I mentioned, this is the first meta-analysis of WSCs on quasi-experiments of any type. We focused on RD because of the strong reputation it has for providing valid estimates and because we could code the 15 WSCs on RD relatively quickly. We believe that our work will be valuable for researchers who are considering RD as an evaluation tool. Our evidence should not be interpreted to suggest that an RD is as good as an RCT when both are possible, but when an RCT is not possible, RD may well be worth considering in spite of the fact that it can only be used to estimate impacts at the cut-point on the running variable. That said, our results also suggest that researchers should be particularly cautious when estimating models that rely on relatively small sample sizes and even more so when those are combined with parametric models. We hope that our work will be helpful for researchers who need empirical evidence that RD can be an effective evaluation tool and that it will help guide how and when they make use of this tool.

About the Authors

Duncan Chaplin, a Senior Researcher at Mathematica Policy Research, focuses on education, international development, and evaluation methods. Within education, he has worked on teacher evaluations, dropout rates, education technology, out-of-school time activities, at-risk youth, and child care. Chaplin’s international development work has focused on estimating impacts of improving access to grid electricity in Africa. Methodologically, he has used a variety of experimental and non-experimental evaluation methods and, more recently, he is evaluating the efficacy of non-experimental methods by comparing their results with those from experiments. One of his current projects is providing guidance to Regional Educational Laboratories in the United States. Internationally, Chaplin recently completed an 8-year evaluation of the Millennium Challenge Corporation’s energy sector program in Tanzania, and he started a 10-year evaluation of MCC's energy sector program in Ghana in 2017.

Thomas Cook, from Northwestern University's Institute for Policy Research, is interested in social science research methodology, program evaluation, school reform, and contextual factors that influence adolescent development, particularly for urban minorities.

Cook has written or edited 10 books and published numerous articles and book chapters. He received the Myrdal Prize for Science from the Evaluation Research Society in 1982, the Donald Campbell Prize for Innovative Methodology from the Policy Sciences Organization in 1988, the Distinguished Scientist Award of Division 5 of the American Psychological Association in 1997, the Sells Award for Lifetime Achievement from the Society of Multivariate Experimental Psychology in 2008, and the Rossi Award from the Association for Public Policy Analysis and Management in 2012.

Jelena Zurovac, a Senior Researcher at Mathematica Policy Research, has extensive experience in quantitative evaluation design and analysis. Zurovac has served in key design and analysis roles on several evaluations of hospital quality improvement initiatives, including the evaluation of the Quality Improvement Organizations, Partnership for Patients, and most recently the Community Care Transitions Program. In addition, she has worked on several randomized trials and non-equivalent comparison group studies of care coordination and disease management for people with chronic illnesses. Zurovac also has experience in pharmacoeconomic analyses and policy.

Jared S. Coopersmith is a statistician at Mathematica Policy Research. His work includes sample design for health surveys as part of an evaluation of primary care delivery, quasi-experimental design to evaluate the impact of changes in Medicare reimbursement policy, and systematic reviews of training programs for primary and secondary school teachers. Prior to his work at Mathematica he was a project officer at the National Center for Education Statistics.

Mariel M. Finucane, an Associate Director at Mathematica Policy Research, is an expert in using Bayesian hierarchical modeling and adaptive design to study health care and nutrition issues. Her most recent work has focused on primary care delivery, inpatient harm prevention, and hospital readmission reductions. She is currently the lead Bayesian statistician for an evaluation of CMS’s Comprehensive Primary Care (CPC) Initiative, a multiyear, comprehensive evaluation of CPC and its effects on Medicare costs, quality of care, and patients’ and providers’ experiences.

Lauren N. Vollmer is a Statistician at Mathematica Policy Research specializing in Bayesian methods and comparison group selection for quasi-experimental studies. Her research includes applications of Bayesian meta-regression to conventional meta-analyses as well as to large-scale health care evaluations.

Rebecca Morris is a health policy doctoral student at George Washington University. She currently serves as a Milken Public Health Scholar through the Milken Institute School of Public Health. Previously she worked as a research analyst and programmer at Mathematica Policy Research and served as a research fellow at Stanford Law School. She holds a B.A. in Economics & Mathematics from Emory University.

Submit Your Research to JPAM

JPAM seeks contributions that span a broad range of policy analysis and management topics with an emphasis on research that conveys methodologically sophisticated findings to policy analysts and other experts in the field. Both domestic and international contributions in public management are welcome, as well as a broad range of policies related to social well-being, health, education, science, environment, and public finance. JPAM strives for quality, relevance, and originality. An interdisciplinary perspective is welcome as are articles that employ the tools of a single discipline.