Power Calculations for Propensity Score Matching?

I received a question this week from Kristen Himelein, a bank colleague who is working on an impact evaluation that will use propensity score matching. She wanted to know how to do power calculations for this case, saying that “Usually, whenever anyone asks me about sampling for matching, I tell them to do a regular sample size calculation to determine the size of the treatment, adjust for expected take-up rates, and then take 3-4 times more than the treatment for the control to get as good a match as possible. I found one paper in the Journal of Biopharmaceutical Statistics that deals with the calculations, but that's about it. Do you have any suggestions?”

I thought I’d share my thoughts in the blog, in case others are also facing this issue.

First, I looked at the paper she had found. It is probably not that helpful for economists and social scientists planning power calculations, since it assumes you already know the proportions of the treatment and control groups that will fall in different propensity score ranges within the common support. One interesting remark it does make, however, is that for non-randomized trials in medicine, researchers often estimate sample size the same way they would for a randomized trial; as a result, the U.S. FDA usually issues a warning and requests either a justification of the sample size or an increase in it, based on the degree of overlap in the propensity score distributions.

What do I think we should do?

First, it is important to recognize that the extent to which your control group overlaps with the treatment group depends on how comparable the two survey populations are in the first place. For example, in work on estimating the impacts of international emigration from Tonga, John Gibson, Steven Stillman and I use two control samples: a specialty sample we collected ourselves of 183 individuals aged 18-45 who lived in the same villages as the migrants, and a national labor force survey of 3,979 individuals aged 18-45 throughout Tonga. We find that after trimming propensity scores below 0.05 or above 0.95, the large national sample shrinks from 4,043 treatment + control observations to only 354, whereas with the specialty sample taken in the same villages, trimming on the propensity score only reduces the sample from 230 observations to 200. This echoes the conclusions of Smith and Todd, who find that the vast majority of individuals in the CPS and PSID are quite dissimilar to the participants in the supported work program they study. A smaller, better-targeted specialized survey may therefore offer more power than a larger random sample of a broader population.
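To make the trimming step concrete, here is a minimal Python sketch (the scores are made up for illustration; the 0.05/0.95 cutoffs follow the Tonga example above):

```python
def trim_to_common_support(pscores, lo=0.05, hi=0.95):
    """Keep only units whose estimated propensity score falls inside
    the common-support band; everything below lo or above hi is dropped."""
    return [p for p in pscores if lo <= p <= hi]

# Hypothetical scores: a control sample drawn from a broad population
# piles up near zero, so most of it is lost to trimming.
scores = [0.01, 0.02, 0.03, 0.10, 0.40, 0.60, 0.97]
kept = trim_to_common_support(scores)
print(len(kept), "of", len(scores), "observations survive trimming")  # 3 of 7
```

The share surviving this step is exactly the retention proportion that drives the sample-size inflation discussed below in step d).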

Given this, I think the steps for calculating the sample sizes needed to achieve a given power in a propensity score matching design are as follows:

a) Figure out how much you know about the characteristics of the treatment group. For example, are individuals all drawn from particular geographic areas? Do they all have income below a certain level? The more you know, the better you can target your control sample to make it comparable to the treatment group.

b) Next, check whether panel data is possible for at least part of the sample. We know propensity score matching is more convincing when the same survey instrument is used, when multiple pre-period values of the outcome variable are available to match on, and when individuals come from the same local labor markets. So if panel data is possible for some individuals but not others, this should in part determine who makes up your sample.

c) Then use sampsi in Stata, Optimal Design, or your other favorite power calculation program to calculate the sizes of the treatment and control groups required under a balanced experimental design.
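As a rough stand-in for sampsi, the standard normal-approximation formula for a two-sided, two-sample comparison of means can be sketched in a few lines of Python. The inputs here (a 0.2 SD minimum detectable effect, 5% significance, 80% power) are illustrative, not from the post:

```python
from statistics import NormalDist
from math import ceil

def n_per_arm(effect_size, alpha=0.05, power=0.80):
    """Sample size per arm for detecting a difference in means of
    effect_size standard deviations, using the usual normal
    approximation n = 2 * (z_{1-a/2} + z_{power})^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

print(n_per_arm(0.2))  # -> 393 per arm for a 0.2 SD effect
```

This gives the balanced experimental benchmark that steps d) and e) then inflate and reallocate.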

d) Blow up the numbers from c) by dividing by the proportion you expect, from step a), to remain after trimming to the common support in propensity scores. If you don't know much about the treatment group's characteristics in advance, or can't target your control group sample very finely, you may need to allow up to 10 times the experimental treatment sample (as in my example above); if you know a lot about the treatment group's characteristics and can sample accordingly, a control group that is, say, 20-200% larger than in the pure experimental case may suffice. Note that if the control group is very different, you may also have to increase the treatment group size, since some of those treated may not overlap in the common support with any of the controls.
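The blow-up in step d) is just division by the expected retention share. A sketch with illustrative numbers (the 10% and 87% retention rates echo the two Tonga control samples above; the 400-per-arm starting point is hypothetical):

```python
from math import ceil

def inflated_sample(n_experimental, share_retained):
    """Step d): scale an experimental sample size up by the share of
    observations expected to survive common-support trimming."""
    return ceil(n_experimental / share_retained)

# If only ~10% of a broad control sample survives trimming, the
# required sample grows tenfold; a well-targeted sample retaining
# ~87% needs only a modest top-up.
print(inflated_sample(400, 0.10))  # -> 4000
print(inflated_sample(400, 0.87))  # -> 460
```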

e) If you have more data, from a pilot or another survey, that tells you something about the likely propensity score distributions for both the treatment and control groups, you can then do some stratified sampling to try to get balanced treatment and control groups within different propensity score ranges.
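Step e) can be sketched as equal-allocation sampling within propensity score bins. This is an illustrative implementation, not from the post; it assumes pilot-stage score estimates are available as a dict mapping control unit ids to scores:

```python
import random

def stratified_control_sample(pscores, n_per_stratum, n_strata=4, seed=0):
    """Step e): draw an equal number of control units from each
    propensity-score stratum, so the control sample is balanced across
    score ranges rather than piled up at low scores."""
    rng = random.Random(seed)
    strata = [[] for _ in range(n_strata)]
    for unit, p in pscores.items():
        k = min(int(p * n_strata), n_strata - 1)  # bin [0, 1] into n_strata ranges
        strata[k].append(unit)
    sample = []
    for members in strata:
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

# Hypothetical pilot scores for 100 control units, spread over [0, 1):
pilot = {i: i / 100 for i in range(100)}
chosen = stratified_control_sample(pilot, n_per_stratum=5)
print(len(chosen))  # -> 20: five controls from each of the four strata
```

In practice the stratum boundaries and per-stratum allocations would be chosen to mirror the treatment group's score distribution, rather than being equal as in this sketch.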

Anyone else had experiences doing this in practice that they would like to share?

Comments

Hi David, apart from what you have written, when I think about power and PSM I imagine there may also be (moderate) power gains akin to blocking. Since PSM often restricts the analytic sample to a segment of the population, it may also reduce the background variation in the outcome measure. Understanding the selection, as you say in point (a), would also help to assess this...

Hi David, interesting post.
The basic problem is that more parameters go into the process than with experimental design.
Besides effect size and (conditional) variance, I think you need:
1) Correspondence between T and C groups in distribution of covariates X
2) Relationship between prop. score e(X) and outcome Y
3) The number of covariates considered p
Furthermore you have to decide
4) The number of matches for m-to-1 matching
5) The method of estimating ê(X)
You comment that the stratification calculations of Jung et al. are not useful because you have to know the proportion in each ê(X) stratum (sufficient for conditioning on 1 above). It seems to me that your point a) is attempting to get at the same thing through direct examination of some X.
Second, loss of sample size due to trimming (point d) is not the only reason for variance inflation relative to an experimental design. The maximum bias reduction attainable by matching in a particular sample, and points 2-5 above, also play a role. So even after inflating by the proportion dropped, you can still be off in either direction.
Another issue not even considered here is the estimation of the propensity score itself.
Perhaps, considering these complications, it would be better not to produce just one number, but to do the calculations under different conditions, perhaps following your advice to get some idea of the general context, and then to choose the sample size within the budget that is robust to some reasonable range of problems?
Would be interested to hear your thoughts on this.
- Daniel
P.S. Under some assumptions (multivariate normality), Rubin and Thomas (1996) give a variance formula that could be used for power calculations, in eqns. (2.1)-(2.8) on p. 252.

Interesting topic. A couple of points:
1. I touched on one aspect of this problem in a poster presentation a few years back at the Society for Political Methodology annual meeting at Yale; poster is here:
http://cyrussamii.com/wp-content/uploads/2011/11/samii_samplingposter_small.pdf
The key contribution was a method for calculating how many controls one should sample for each treated unit in order to ensure, with reasonable likelihood, good balance on variables that are not available ahead of time but have to be measured once the sample is drawn. The method requires some a priori or auxiliary information on the distribution of treated and control variables. There is a rather pessimistic result showing that this ratio is actually convex in the degree of potential confounding. Note that these results get you an optimal control-treated ratio, but do not get you sample size.
2. Another point I'd make is to suggest that your colleague not use PSM but rather a more direct covariate matching strategy. A recent contribution by Gary King and colleagues demonstrates that PSM often shows little improvement over randomly matching units (in other words, the PS models are not nearly as good as they need to be), and so direct covariate matching via either covariate stratification (aka "coarsened exact matching") or minimum distance matching (via the Mahalanobis metric or variants such as GenMatch) is much better. Here is the link to that paper:
http://gking.harvard.edu/publications/comparative-effectiveness-matching-methods-causal-inference
3. The last point I'd make is that for sample size calculations per se, it of course depends on the basis on which variances are being computed. A principled approach would be to (1) condition on the matching solution, (2) proceed under the hypothesis that matched units were indeed equally likely to be assigned one way or the other, and (3) base inference on this hypothetical variability in possible assignments. Then the usual power calculations under the assumption of unequal variance would be perfectly adequate, and even a bit conservative. (Only if the matching solution uses 1-to-1 matching is the equal variance assumption okay.)

Hi Cyrus,
If I understand your poster correctly, you are essentially using the result that minimal sufficiency of the likelihood ratio for judging whether a unit is treatment vs. control is equivalent to the balancing property (Rosenbaum & Rubin 1983, Theorem 1). In survey methodology terms, you then suggest selecting units proportional to the inverse of the LR (but binned into quantiles). So that should give you approximate unbiasedness (if ignorable) when matching, and then the inflation factor for the control group follows. Could you say it that way?
Also, what about the matching procedure itself and variance arising from that - estimation of the propensity scores, for instance?
Your point (3) seems a good suggestion, but I guess part 1 of it is precisely the problem; i.e., you might not know how to condition on all the relevant factors exactly, hence my suggestion to try a bunch of them (maybe consisting of a factorial).
d

Hi Daniel,
In reply to your reply: the method for determining the inflation factor is based on directly balancing covariates. If you use the propensity score itself as the covariate, then I think the result amounts to what you are saying. With respect to accounting for the matching procedure: my proposal in point (3), combined with the suggestion in point (2) to use direct covariate matching, amounts to essentially ignoring that source of variation, under the claim that we are restricting inference to the sample at hand, conditional on the matching solution. So long as that point is understood by those reading the final product of the analysis, we can all content ourselves that we've appropriately characterized the conditional variability, and the unconditional variability should probably be assessed directly by looking at other studies. That is, rather than guessing how stochasticity in the covariates, for example, might affect the matching solution, and then incorporating those guesses into our estimate of the variance for a given study, I would prefer to look at a collection of studies carried out in a bunch of different settings. My thinking is influenced by Mosteller and Tukey (1977: Ch. 7).
Cheers,
Cyrus

Hi Cyrus,
Thanks for the interesting explanation; I think I get it now. I'm not entirely sure I agree on the point of looking only at the conditional variance, but I will be sure to read the reference for the arguments sometime.
best daniel