Differences-in-Differences estimation in R and Stata

DID estimation uses four data points to deduce the impact of a policy change or some other shock (a.k.a. treatment) on the treated population: the effect of the treatment on the treated. The design assumes that the treatment group and control group have similar characteristics and are trending in the same way over time (the parallel trends assumption). The counterfactual (unobserved scenario) is therefore that, had the treated group not received treatment, its mean value would have been the same distance from the control group in the second period as in the first. See the diagram below; the four data points are the observed means (averages) of each group in each period. These are the only data points necessary to calculate the effect of the treatment on the treated. The dotted lines represent the trend that is not observed by the researcher. Notice that although the group means differ, both groups share the same time trend (i.e. slope).

For a more thorough walk-through of the effect of the Earned Income Tax Credit on female employment, see an earlier post of mine:

Calculate the D-I-D Estimate of the Treatment Effect

We will now use R and Stata to calculate the unconditional difference-in-difference estimates of the effect of the 1993 EITC expansion on employment of single women.

Then you must do the calculation by hand (shown on the last line of the R code): (value 4 - value 3) - (value 2 - value 1)
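The by-hand calculation can be sketched in R roughly as follows. This is a minimal illustration, not the original code: the `eitc` data frame here is a small made-up stand-in for the real dataset, and the numbering of the four values (1 = control before, 2 = treated before, 3 = control after, 4 = treated after) is an assumption. Only the variable names `work`, `post93` and `anykids` come from the post.

```r
# Toy stand-in for the EITC data: made-up values for illustration only.
eitc <- data.frame(
  work    = c(1, 0, 1, 0,  1, 1, 0, 0,  0, 0, 1, 0,  1, 1, 1, 0),
  post93  = c(0, 0, 0, 0,  1, 1, 1, 1,  0, 0, 0, 0,  1, 1, 1, 1),
  anykids = c(0, 0, 0, 0,  0, 0, 0, 0,  1, 1, 1, 1,  1, 1, 1, 1)
)

# Mean of "work" in each of the four group-by-period cells.
control_pre  <- sapply(subset(eitc, post93 == 0 & anykids == 0, select = work), mean)  # value 1
treated_pre  <- sapply(subset(eitc, post93 == 0 & anykids == 1, select = work), mean)  # value 2
control_post <- sapply(subset(eitc, post93 == 1 & anykids == 0, select = work), mean)  # value 3
treated_post <- sapply(subset(eitc, post93 == 1 & anykids == 1, select = work), mean)  # value 4

# (value 4 - value 3) - (value 2 - value 1)
did <- (treated_post - control_post) - (treated_pre - control_pre)
print(did)
```

With these toy numbers the four means are 0.5, 0.25, 0.5 and 0.75, so the estimate is 0.5. Rearranging the expression as (value 4 - value 2) - (value 3 - value 1) gives the same answer.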

Run a simple D-I-D Regression

Now we will run a regression to estimate the difference-in-difference effect of the Earned Income Tax Credit on “work”, using all women with children as the treatment group. With no additional covariates, this gives exactly the same estimate as the manual calculation above, now using ordinary least squares. The regression equation is as follows:

$$\text{work} = \beta_0 + \beta_1 \, \text{anykids} + \beta_2 \, \text{post93} + \beta_3 \, (\text{anykids} \times \text{post93}) + \varepsilon$$

Where $\varepsilon$ is the white noise error term, and $\beta_3$ is the effect of the treatment on the treated: the shift shown in the diagram. To be clear, the coefficient on $(\text{anykids} \times \text{post93})$ is the value we are interested in (i.e., $\beta_3$).
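A minimal sketch of this regression in R might look like the following. The `eitc` data frame is again a small made-up stand-in rather than the real data; the point is only that the coefficient on the interaction term reproduces the by-hand difference of the four group means.

```r
# Toy stand-in for the EITC data: made-up values for illustration only.
eitc <- data.frame(
  work    = c(1, 0, 1, 0,  1, 1, 0, 0,  0, 0, 1, 0,  1, 1, 1, 0),
  post93  = c(0, 0, 0, 0,  1, 1, 1, 1,  0, 0, 0, 0,  1, 1, 1, 1),
  anykids = c(0, 0, 0, 0,  0, 0, 0, 0,  1, 1, 1, 1,  1, 1, 1, 1)
)

# OLS with a group dummy, a period dummy, and their interaction.
fit <- lm(work ~ anykids + post93 + anykids:post93, data = eitc)

# The interaction coefficient is the D-I-D estimate.
did_ols <- coef(fit)["anykids:post93"]
print(did_ols)

# Same number from the four cell means, for comparison.
m <- tapply(eitc$work, list(eitc$anykids, eitc$post93), mean)
did_manual <- (m["1", "1"] - m["0", "1"]) - (m["1", "0"] - m["0", "0"])
print(did_manual)
```

Because the model is saturated in the two dummies, the two numbers agree up to floating-point error. `summary(fit)` adds standard errors to the point estimate.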

I have a question for you that you may or may not know the answer to. I’m trying to run a DID model in Stata, only I’m first differencing my outcome variable, and some of the time-variant control variables, but not all of them. In particular, my program participation variable is not differenced (because I assume it to have effects over multiple years, not just in the first year). The second complication is that not all program participants enter the program in the same year: program participation begins for some in 2003, some in 2004, some in 2005, etc., so it is not possible to simply have a pre and post dummy variable to interact with the treatment / control variable.

How do I carry this out? My initial impression, after reading some math for the past 6 hours, is that when I first difference, I am simply left with the program participation variable that indicates a 1 only after the program is initiated, and that there is no more interaction term. Is this correct? This is as opposed to if I did not difference, in which case I would need the pre/post variable, the treatment/control variable, and the interaction of the two, which represents the effect of the treatment on the treated.

One last question / clarification. Assuming I’m correct that I simply have a variable in the regression that becomes a 1 for the treatment once treatment begins… and I don’t difference it, but I do difference the outcome variable… is there anything wrong with this?

Hi Dan, what is your motivation for first differencing your data? It might be less complicated to not first-difference your data but instead include a set of individual specific dummy variables (i.e. fixed effects) which will accomplish the same thing (control for between individual differences a.k.a. time invariant differences). From there you should be able to follow the model as above.

It is perfectly fine to have the treatment dummy switch on (=1) at different times for different individuals. Interpretation is the same. Hope this helps. -Kevin
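The fixed-effects setup described in this reply could be sketched in R as follows. Everything here is hypothetical: the panel is simulated without noise, the names `id`, `year`, `treat` and `y` are invented, and the treatment effect is assumed to be a constant 2 for every unit and year. That constant-effect assumption matters for reading the single coefficient as "the" effect when start dates differ.

```r
# Simulated panel: 4 units observed 2001-2006 (made-up data, no noise).
panel <- expand.grid(id = 1:4, year = 2001:2006)

# Staggered start dates: units 1-3 enter in 2003, 2004, 2005; unit 4 never does.
start <- c(2003, 2004, 2005, Inf)
panel$treat <- as.numeric(panel$year >= start[panel$id])

# Outcome = unit effect + year effect + assumed constant treatment effect of 2.
panel$y <- 10 * panel$id + 0.5 * (panel$year - 2001) + 2 * panel$treat

# Individual dummies absorb time-invariant differences between units;
# year dummies absorb shocks common to all units.
fit_fe <- lm(y ~ treat + factor(id) + factor(year), data = panel)
print(coef(fit_fe)["treat"])
```

The coefficient on `treat` recovers the assumed effect of 2. With real data, noise and a large number of units, a dedicated panel estimator is usually more practical than a long list of dummies.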

Hi Karen, If the outcome variable is binary, you do not want to use this OLS model because of the inherent heteroskedasticity in a Linear Probability Model (it’s inefficient). Also, interpretation of coefficients becomes difficult when predicted values lie outside [0,1].

If you can clearly state your research question I may be able to point you in the right direction. In the meantime, check out logit and probit models. HTH. -Kevin
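As a rough illustration of the logit route, the sketch below fits the same two-dummy specification with `glm` on a small made-up data frame (not the real data). One caveat built into the sketch: in a logit the interaction coefficient is on the log-odds scale, so the difference-in-differences in probabilities is computed here from the predicted values instead of being read off the coefficient.

```r
# Toy binary-outcome data: made-up values for illustration only.
eitc <- data.frame(
  work    = c(1, 0, 1, 0,  1, 1, 0, 0,  0, 0, 1, 0,  1, 1, 1, 0),
  post93  = c(0, 0, 0, 0,  1, 1, 1, 1,  0, 0, 0, 0,  1, 1, 1, 1),
  anykids = c(0, 0, 0, 0,  0, 0, 0, 0,  1, 1, 1, 1,  1, 1, 1, 1)
)

# Logit with group dummy, period dummy, and their interaction.
fit_logit <- glm(work ~ anykids * post93, family = binomial, data = eitc)

# Predicted probability of working in each of the four cells.
cells <- expand.grid(anykids = 0:1, post93 = 0:1)
cells$p <- unname(predict(fit_logit, newdata = cells, type = "response"))
print(cells)

# D-I-D on the probability scale: rows are (0,0), (1,0), (0,1), (1,1).
did_prob <- (cells$p[4] - cells$p[3]) - (cells$p[2] - cells$p[1])
print(did_prob)
```

Swapping `family = binomial` for `family = binomial(link = "probit")` gives the probit version. Once covariates are added, the predicted probabilities no longer equal raw cell means, and that is when this approach starts to differ from the linear model.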

Hi Kevin, I am writing to ask you about a question I have. I estimated a differences-in-differences model as follows: (y = a + b1*treatment + b2*dt + b3*(treatment * dt) + e) plus state x year fixed effects. I use this specification with different dependent variables such as employment, productivity and sales.

The results I get for b3 are not significant. However, if I include firm fixed effects, the coefficients are all significant. I have doubts about whether this latter specification is picking up an effect different from the original idea of DID.

THANK YOU. It’s 9 months since I turned in my thesis, and I couldn’t for the life of me remember how to do this. I’m pretty sure it’s a stress-induced mental block haha. Anyway, this was super clear and helpful — much appreciated!! 🙂

I would like to run a diff-in-diff regression. Variables: whether a household has a child (=0) or not (=1), which must be predicted from the predictors; two years, 2008 and 2010; education (7 levels); and treatment (received social support = 1, did not receive social support = 0). So I would like to run two regressions. The first: Household(i,1) = b0 + b2*educ(i) + a(i) + m(i,1). The second: Household(i,2) = b0 + b1*treatment(i,2) + b2*educ(i) + d*treatment(i,2)*educ(i) + a(i) + m(i,2). Then I must subtract the first model from the second: DHousehold(i) = a + b1*Dtreatment(i) + d*Dtreatment(i)*educ(i) + Dm(i). My problem is that Household and treatment are 0/1 variables, and educ is ordinal (none to high education). Should I use reg in Stata, or logit? And how can I calculate the effects? Which effects do I have: fixed effects? The treatment group? Can you show me some calculations in Stata so that I can understand my problem better?

What if your dependent variable is continuous? What if all of your variables are continuous? Does that mean in the ‘manual’ / sapply computation (lines 23 – 27) a threshold needs to be chosen? In certain cases that threshold would be obvious, like zero, but in other instances it is unclear. I imagine thresholds could come from the literature, from theory, or in a data driven approach like mean, median, or breaking the data up by quantile…. can you please clarify what to do in these instances?

Dexter — in the example above, the dependent variable is continuous. It was also designed to be simple for illustration. If you are wanting to look at *incremental* impacts of continuous variables on the outcome of interest, you will have to think hard about precisely what you are trying to measure so that you can set up your model properly. I think a good place to start would be to search on “regression and causality” or “modeling incremental impacts”. HTH-

Hi Kevin, I have a question. Could I use the perceived change in the Y variable as a proxy for the DID method? I have two sets of survey data from different years (2000 and 2007), but unfortunately my main variable of interest, the dependent variable, is only asked about in the later survey. The variable comes from these questions: 1) what is the level of Y this year (2007); 2) how has Y changed compared with 2000 (increasing, the same, worse). Could I use the second as a proxy for delta Y in a DID model? What are the statistical implications (bias?) and how could I overcome them? Thanks in advance. -Rumayya-

Rumayya — thanks for reading my blog. I think you have an insurmountable problem with your data set because of the missing outcome in year=2000. This fundamentally leaves a gaping hole in your experimental design. It might be better to simply use a cross sectional approach, although it will be difficult to show causality.

Hi eda, bootstrapping is one method used to correct bias in standard errors and/or an estimator by resampling the original data many times (in your example above, 5000 times). A search on ‘bootstrapped standard errors’ or the R package on CRAN called ‘boot’ is a good place to start investigating this method. HTH
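To make the resampling idea concrete, here is a small sketch that bootstraps the D-I-D estimate using base R only (not the ‘boot’ package). The data frame is made up, and resampling within each group-by-period cell is a choice made for this illustration so that no cell can come up empty in a toy sample of 16 rows.

```r
set.seed(1)

# Toy stand-in for the EITC data: made-up values for illustration only.
eitc <- data.frame(
  work    = c(1, 0, 1, 0,  1, 1, 0, 0,  0, 0, 1, 0,  1, 1, 1, 0),
  post93  = c(0, 0, 0, 0,  1, 1, 1, 1,  0, 0, 0, 0,  1, 1, 1, 1),
  anykids = c(0, 0, 0, 0,  0, 0, 0, 0,  1, 1, 1, 1,  1, 1, 1, 1)
)

# D-I-D estimate from the four cell means.
did <- function(d) {
  m <- tapply(d$work, list(d$anykids, d$post93), mean)
  unname((m["1", "1"] - m["0", "1"]) - (m["1", "0"] - m["0", "0"]))
}

# Row indices grouped by cell, so each resample keeps all four cells populated.
rows_by_cell <- split(seq_len(nrow(eitc)), interaction(eitc$anykids, eitc$post93))

# One bootstrap replicate: resample rows with replacement inside each cell.
boot_once <- function() {
  idx <- unlist(lapply(rows_by_cell, function(i) sample(i, replace = TRUE)))
  did(eitc[idx, ])
}

boot_est <- replicate(5000, boot_once())
boot_se  <- sd(boot_est)   # bootstrap standard error of the D-I-D estimate
print(c(estimate = did(eitc), se = boot_se))
```

The spread of the 5000 re-estimates is used as the standard error. A percentile interval is available from `quantile(boot_est, c(0.025, 0.975))`.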

Hi Eda — propensity score matching is a quasi-experimental approach that attempts to control for selection bias into the treatment group. Typically, this approach is used when the researcher cannot randomly assign treatment herself and therefore relies on observational data. While PSM is out of scope for the blog post above, it is a related topic and perhaps I’ll add it to my list for a future post. In the meantime, check out the CRAN package `Matching`. HTH, Kevin

Hi Kevin,
Great post!
I’m running diff-in-diff with propensity score matching in Stata and I’d like to know if the causal effect in my model will be exactly the interaction coefficient. Do you know anything about it?

So in terms of interpretation, what does it imply if the coefficient of the DiD term is negative? Does it imply that the effect of the program on the treated was negative, or that it was positive but decelerating over time?

Hey Kevin, this is my first time coding in Stata, and I am completely confused. I have to interact a lot of variables and I do not have a clue how to do it. I am examining the influence of smoking on wages, and I want to use the DID method. The problem is that I have to examine females and males separately. Therefore I have to write the code for dummies for wages of females who smoke in 2001 minus wages of females who smoke in 2000, and wages of females who do not smoke in 2001 minus wages of females who do not smoke in 2000. Please HELP .. :)))

This post is old, but hopefully I get a response. In the beginning you did a manual version of the D-D estimator. I am no expert, but I didn’t think that the mean at the beginning and end of a sample was statistically the same as the beginning and end of a regression line. Kevin, if you still receive notifications could you please answer, as this question pertains to a project I am currently working on 🙂

@GJ: “select = work” is used to select the “work” variable from the dataset. Try it out yourself by loading the dataset and trying “subset(eitc, post93 == 0 & anykids == 0, select=nonwhite)”, or any other variable. Kevin does this because he wants to take the average of the “work” variable for the 4 different groups.

@Ica: I don’t use Stata, but how about you use Stata (or Excel) to separate your dataset into two datasets, female and male? Then the code Kevin wrote should apply.

@Nicholas: the independent variables, “post93” and “anykids”, only take the values 0 and 1. So there is no regression “line”. Try writing out the formula for a regression and you’ll see that in the case of only 0 and 1 values, it simplifies to Kevin’s statement. (Also verified by the fact that using a “regression” returns the same values!)

Thank you very much for your blog. I have a question regarding the Stata command “diff” (see “help diff” in Stata) for pooled/repeated cross-section data (specifically, whether the arguments for this command change for this kind of data). I will write my code here; please confirm whether I have it right.

I have data for two countries (let’s say countryA and countryB) for 5 years, from 2001 until 2005. The treatment happens to countryB in the year 2002. So the Stata commands I am using for the difference-in-difference estimator are:

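*code starts here
* post-treatment indicator: 1 from 2002 onward, 0 before
gen period = 1 if year>=2002
replace period = 0 if year<2002

* treatment-group indicator: countryB is treated, countryA is the control
gen treated = 1 if country=="countryB"
replace treated = 0 if country=="countryA"

* diff is a user-written command; covariates go in cov() as a space-separated varlist
diff outcomevar, period(period) treated(treated) cov(a b c d)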

I am aware that this coding is pretty much standard for this command. However, I would like to confirm that it also holds for repeated/pooled cross-sectional data, and that I got everything right.

Dear Kevin,
Thanks for your blog. I have a question for you: how do I set up the data for running the diff-in-diff? Can you give an example? I am running a simple DID estimation with two periods, before 1993 and after 2003. Is this an acceptable estimation?

I wonder how to specify a D-in-D model when you have outcome data collected at three time points: one before implementation of a policy and two after. I have a mind to just lump together the data from the two post-intervention time points, and estimate the model you show above (y = b0 + b1*treatment + b2*time + b3*(treatment*time)). Is there a way of explicitly modelling the different times, for instance using time dummies in this case?

Sure. You can use all the observations pre & post treatment either by averaging multiple time periods (before or after), or by including all the observations in your regression with a flag for all observations post treatment. As I recall, the key issue when including >2 time periods in your regression is that autocorrelation will exist and tend to bias the standard errors downward, overstating the significance of the estimated impact (though you should double check me on this as it’s been a while since I read any academic research on this aspect). Cheers,
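Both options for three time points (one pooled post-treatment flag, or separate time dummies) can be sketched in R as below. The data are simulated without noise and every name is hypothetical. The treatment effect is assumed to be 1 in the first post period and 3 in the second, purely to show what each specification returns.

```r
# Simulated data: 2 groups x 3 periods (0 = before, 1 and 2 = after), 2 copies each.
panel3 <- expand.grid(treat = 0:1, period = 0:2, copy = 1:2)

# Assumed treatment effect in periods 0, 1, 2 (made-up numbers).
effect <- c(0, 1, 3)
panel3$y <- 1 + 0.5 * panel3$treat + panel3$period +
  panel3$treat * effect[panel3$period + 1]

# Option 1: lump both post periods together behind a single flag.
panel3$post <- as.numeric(panel3$period >= 1)
pooled <- lm(y ~ treat * post, data = panel3)
print(coef(pooled)["treat:post"])

# Option 2: time dummies, giving one interaction per post-treatment period.
panel3$p <- factor(panel3$period)
by_period <- lm(y ~ treat * p, data = panel3)
print(coef(by_period)[c("treat:p1", "treat:p2")])
```

In this balanced toy example the pooled coefficient (2) is the average of the two period-specific effects (1 and 3), which the time-dummy version reports separately. Neither sketch addresses the standard-error issue that arises with more than two periods.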