Causal inference from observational data (that is, data that did not come from an experiment) is notoriously difficult: because the probability distribution of the treatment variable Z is unknown, measured or unmeasured variables that are associated with both Z and the outcome Y may confound causal estimates. This report proposes methods for designing and analyzing observational studies of causal effects that combine design-based techniques with regression to account for measured covariates X.

Regression-discontinuity designs (RDDs) arise when treatment assignment is a function of a variable T: treatment is assigned when T exceeds a threshold c. Conventionally, researchers analyze RDDs by regressing Y on both T and Z. We argue instead for modeling RDDs as naturally randomized experiments. Doing so involves two steps: modeling the relationship between Y and T, and then using that model to infer and estimate effects of Z on Y. We illustrate this approach by reanalyzing a dataset used to estimate the effects of academic probation on students' grade point averages.
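
For orientation, the conventional analysis mentioned above (regressing Y on both T and Z) can be sketched as follows. This is a minimal illustration on simulated data, not the approach argued for in this report and not the academic probation dataset; the function names (`solve_linear_system`, `ols`, `rdd_regression`), the linear form of the Y-T relationship, and the simulated effect size are all assumptions made for the example.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def rdd_regression(t, y, c):
    """Regress Y on an intercept, T centered at c, and Z = 1 if T > c else 0."""
    z = [1.0 if ti > c else 0.0 for ti in t]
    rows = [[1.0, ti - c, zi] for ti, zi in zip(t, z)]
    intercept, slope, effect = ols(rows, y)
    return {"intercept": intercept, "slope": slope, "effect": effect}


# Simulated data (assumption): linear in T with a jump of 2.0 at the threshold.
rng = random.Random(0)
c = 0.0
t = [rng.uniform(-1.0, 1.0) for _ in range(500)]
y = [1.0 + 0.5 * ti + 2.0 * (ti > c) + rng.gauss(0.0, 0.1) for ti in t]

fit = rdd_regression(t, y, c)
print(fit["effect"])  # coefficient on Z: the estimated jump at the threshold
```

The coefficient on Z is read as the treatment effect only if the assumed relationship between Y and T is correct, which is why the modeling of that relationship is treated as a separate step.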

The rest of the report focuses on propensity-score stratification with high-dimensional data (p >> n). If treatment assignment is an unknown random function of X, researchers can adjust causal estimates for X by estimating propensity scores: the probabilities of treatment assignment conditional on X. Researchers then stratify subjects on their estimated propensity scores and model the data as if treatment were randomized within strata. However, when the dimension of X is large, propensity-score estimation using all of X is impossible. We propose a method in which a subset of X is used to estimate propensity scores. The entire matrix X is then used to model Y with a high-dimensional regression technique, with the model trained on subjects excluded from the stratification. The model's predictions of Y can then be used to test balance on, and adjust for, the entire set of covariates in X. We illustrate this method by evaluating two high-school educational programs.
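
The stratification step described above can be sketched as follows. This is a minimal illustration that assumes propensity scores have already been estimated; the function names (`assign_strata`, `stratified_difference`), the equal-size strata, the stratum-size weighting, and all data values are assumptions made for the example rather than the report's specification. The same within-stratum comparison is applied both to the outcome Y and to hypothetical predictions of Y from a model fit on held-out subjects, the latter serving as a balance check.

```python
def assign_strata(scores, n_strata=5):
    """Assign subjects to strata of near-equal size by propensity-score rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(scores)
    return strata


def stratified_difference(values, z, strata):
    """Stratum-size-weighted mean of treated-minus-control mean differences."""
    total, weight = 0.0, 0
    for s in sorted(set(strata)):
        treated = [v for v, zi, si in zip(values, z, strata) if si == s and zi == 1]
        control = [v for v, zi, si in zip(values, z, strata) if si == s and zi == 0]
        if not treated or not control:
            continue  # no within-stratum comparison possible; skipped in this sketch
        n_s = len(treated) + len(control)
        diff = sum(treated) / len(treated) - sum(control) / len(control)
        total += n_s * diff
        weight += n_s
    if weight == 0:
        raise ValueError("no stratum contains both treated and control subjects")
    return total / weight


# Toy data (assumption): two strata with different baselines, true effect 1.0.
scores = [0.10, 0.15, 0.20, 0.25, 0.60, 0.65, 0.70, 0.75]  # estimated propensity scores
z = [0, 1, 0, 0, 1, 0, 1, 1]                                # treatment indicators
y = [0.0, 1.0, 0.0, 0.0, 6.0, 5.0, 6.0, 6.0]                # observed outcomes
# Hypothetical predictions of Y from a model trained on held-out subjects.
y_hat = [0.1, 0.0, -0.1, 0.0, 5.0, 5.1, 4.9, 5.1]

strata = assign_strata(scores, n_strata=2)
effect_estimate = stratified_difference(y, z, strata)
balance_gap = stratified_difference(y_hat, z, strata)  # near zero suggests balance

treated_mean = sum(v for v, zi in zip(y, z) if zi == 1) / sum(z)
control_mean = sum(v for v, zi in zip(y, z) if zi == 0) / (len(z) - sum(z))
naive_difference = treated_mean - control_mean  # ignores the strata entirely

print(effect_estimate, naive_difference, balance_gap)
```

In this toy example the unstratified comparison overstates the effect because treated subjects are concentrated in the high-baseline stratum, while the within-stratum comparison recovers the simulated effect.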