We have all heard the phrase “correlation does not equal causation.” What, then, does equal causation? This course aims to answer that question and more!
Over a period of 5 weeks, you will learn how causal effects are defined, what assumptions about your data and models are necessary, and how to implement and interpret some popular statistical methods. Learners will have the opportunity to apply these methods to example data in R (a free statistical software environment).
At the end of the course, learners should be able to:
1. Define causal effects using potential outcomes
2. Describe the difference between association and causation
3. Express assumptions with causal graphs
4. Implement several types of causal inference methods (e.g. matching, instrumental variables, inverse probability of treatment weighting)
5. Identify which causal assumptions are necessary for each type of statistical method
So join us... and discover for yourself why modern statistical methods for estimating causal effects are indispensable in so many fields of study!

Taught by

Jason A. Roy, Ph.D.

Professor of Biostatistics

Transcript

In this video, we're going to go through a data analysis in R. And in particular, we'll go through an analysis where we fit a marginal structural model using IPTW, or inverse probability of treatment weighting. As a data example, we'll use publicly available data on right heart catheterization; this was data on ICU patients in five hospitals. The treatment was right heart catheterization or not, so RHC versus not. The outcome was death, yes or no. There were many confounding variables, and there were 2 to 3,000 people in each group. And for this analysis, we'll use a few different R packages. So we'll use tableone; ipw, which is a package that is basically used to get inverse probability weights and so on; the sandwich package, which is used to get robust variance estimation; and I also use the survey package, which, while it was designed for surveys, we can use to get weighted estimators. And then you can begin by just reading in the data and viewing it, to get a better understanding of what the variables are. In a previous video we looked at this code in detail, so I'll just very briefly talk about it. But this particular data set has a lot of character variables, and some R packages are sort of less accepting of that than others. So a lot of times I like to just convert everything to numeric, to make things easier. And also then I can rename all the variables to make them a little more meaningful to me, so that when I see the variables in a table I know what they mean. So for example, originally there was a variable called sex, but I'm going to change that to female. That means I know it's an indicator variable, and I know it takes value one if female. So I change these all to indicator variables. For example, COPD will take value one if the person had a COPD diagnosis and value zero otherwise. Here, what I'm calling treatment is the RHC yes-or-no variable. So treatment takes value zero or one.
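The recoding step described here can be sketched as follows. This is a minimal sketch, not the lecture's exact code: the full RHC data set isn't reproduced in this transcript, so the raw data frame below is a small illustrative stand-in, and the raw column names (sex, cat1, swang1, death) are assumptions.

```r
# Stand-in for the raw RHC data; in the lecture this would be read in,
# e.g. rhc <- read.csv("rhc.csv"). Values and column names are illustrative.
rhc <- data.frame(
  sex     = c("Female", "Male", "Female", "Male"),
  cat1    = c("COPD", "MOSF w/Sepsis", "COPD", "ARF"),
  swang1  = c("RHC", "No RHC", "No RHC", "RHC"),
  death   = c("Yes", "No", "Yes", "No"),
  age     = c(62.1, 70.5, 55.0, 48.3),
  meanbp1 = c(68, 74, 80, 59),
  stringsAsFactors = FALSE
)

# Convert character variables to numeric 0/1 indicators, renamed so their
# meaning is obvious in later tables.
female    <- as.numeric(rhc$sex == "Female")
COPD      <- as.numeric(rhc$cat1 == "COPD")
treatment <- as.numeric(rhc$swang1 == "RHC")  # 1 = received RHC
died      <- as.numeric(rhc$death == "Yes")

# Keep a small analysis data set with just the variables we need.
mydata <- data.frame(age = rhc$age, meanbp1 = rhc$meanbp1,
                     female, COPD, treatment, died)
```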
One is the treated group. Our outcome I'm calling died. And then there's a bunch of confounders that we'll look at. So I'm just creating a data set now with just a subset of variables. The reason I'm doing this is just to make things easier to follow along, but in practice you would want to use more variables than this. So I'm calling it mydata, and I'm just putting a handful of these variables together into one data set. First, if we wanted to carry out inverse probability of treatment weighting, a natural first step would be to fit a propensity score model. So here I'm just going to fit a logistic regression model. Because it's a propensity score model, as a reminder, that means the outcome in the model is actually treatment, so yes or no. And then we include as confounders a few variables such as age, sex, blood pressure, and then some medical diagnoses. And I'm going to use logistic regression, so I tell it to use a logit link and I tell it it's binomial data. So this is going to fit a logistic regression model, and I named it psmodel for propensity score model. It fits this model, but what I actually want out of it is the propensity score itself. So I'm going to call that ps, and I can get it using the predict command. Predict, and then I'm saying my propensity score model was psmodel, and type equals response means it's giving me the predicted value. So this is one way to get the propensity score out of your propensity score model. And then this is just some output, if you wanted to look at it. So I fitted a logistic regression model where treatment is the outcome, and these are coefficients and standard errors and so on. You might be interested in what variables are associated with receiving the treatment, in this case right heart catheterization. So as an example, let's just look at age.
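The propensity score step can be sketched like this. Since the real RHC confounders aren't reproduced in the transcript, this sketch simulates a few stand-in confounders (age, female, meanbp1); the model formula is illustrative, but the glm/predict pattern matches what the lecture describes.

```r
set.seed(123)
n <- 1000
# Simulated stand-ins for the confounders used in the lecture.
mydata <- data.frame(
  age     = rnorm(n, 60, 10),
  female  = rbinom(n, 1, 0.4),
  meanbp1 = rnorm(n, 75, 15)
)
# Simulate treatment assignment that depends on the confounders.
lp <- -0.02 * (mydata$age - 60) + 0.3 * mydata$female +
      0.01 * (mydata$meanbp1 - 75)
mydata$treatment <- rbinom(n, 1, plogis(lp))

# Propensity score model: logistic regression with treatment as the outcome.
psmodel <- glm(treatment ~ age + female + meanbp1,
               family = binomial(link = "logit"), data = mydata)
summary(psmodel)  # which variables are associated with receiving treatment

# The propensity score is the fitted probability of treatment.
ps <- predict(psmodel, type = "response")
```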
Age has a negative coefficient, so older people tended to be less likely to get RHC, to get this treatment, although the p-value is 0.08. So the evidence isn't extremely strong that it's related, but that's just an example. You could look up and down this table and just see if things kind of make sense. Or maybe your collaborator, who might be an expert on this particular topic, would want to look at it and make sure that things are generally in the direction they would have expected. You could also then look at a plot of the propensity score. This one, I say, is pre-matching or pre-weighting; this is just the raw propensity score distribution. So this would be for the treated group, and this would be for the control. And you see a lot of overlap, so it doesn't look like there's anything too concerning here. But now I want to move on to the thing that's more specific to IPTW. So far, the previous slides were things that you could do if you wanted to do propensity score matching, for example. But now we're doing something a little different; now we're going to get into weights. The first thing I'll do is create weights. So I'm calling it weight, and I'm using this ifelse command; there are a lot of different ways you could do it. I'm saying if treatment equals 1, then the weight is 1 over the propensity score; else, make the weight 1 over 1 minus the propensity score. Because remember, for treated subjects the weight is 1 over the propensity score, and for control subjects it's 1 over 1 minus the propensity score. And remember ps is my propensity score. So with one additional line, I'm able to get a weight for everybody. Now, there are a few ways that you could actually apply the weight to your data. One thing I'll first show you is this svydesign function. And all this is going to do here, the big picture, is apply this particular weight to my data. That's essentially all it's doing.
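The weight-creation and svydesign steps can be sketched as below. To keep the sketch self-contained, the propensity scores are simulated stand-ins rather than fitted from the real data; the ifelse and svydesign calls follow the pattern described in the lecture.

```r
library(survey)

set.seed(1)
n <- 500
mydata <- data.frame(treatment = rbinom(n, 1, 0.5))
ps <- runif(n, 0.2, 0.8)  # stand-in propensity scores

# IPTW weights: 1/ps for the treated, 1/(1 - ps) for the controls.
weight <- ifelse(mydata$treatment == 1, 1 / ps, 1 / (1 - ps))
mydata$weight <- weight

# Apply the weights to the data: svydesign bundles the data and weights
# together so that downstream commands compute weighted estimates.
weighteddata <- svydesign(ids = ~1, data = mydata, weights = ~weight)
```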
And then what it's outputting is something called weighteddata, and now I can use weighteddata to get a weighted table. Remember, we're interested in a table one type of thing, where we can look at standardized differences. Well, now I have a weighted sort of version of the data set. And now I can use the tableone R package, and so this is a command I could use, and I'm telling it: stratify on treatment, and use the weighted data. And test equals false again just means I'm not interested in significance testing here; I'm really more interested in standardized differences. So then when I print the weighted table, I tell it, yes, print the standardized differences. And then you'll see output that looks something like this. So we have treated versus control, we have standardized mean differences, and the actual table itself will be means and standard deviations. This is just automatic output. I would actually just ignore the standard deviations here, because this is from weighted data, so it's really based on these weighted sample sizes, which aren't your real sample sizes. So I would ignore the sample sizes and what they're calling standard deviations. The means are right, though: these are weighted means, means of the pseudo-population, and this is what I care about. Now what you'll see is that there's very good balance here. For example on age, the age is 61 versus 61, a very tiny standardized difference. If you look at all the standardized differences, they're very small. I think the biggest one is 0.04, and we want them to be less than 0.1. So this weighting seemed to do really well, and it was pretty easy to have it output a table like this for us. But you could also get a weighted mean directly, using more brute force in a sense, where you actually do the calculation yourself. So I'm just kind of reminding you here what a weighted mean formula would look like.
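The weighted-table step can be sketched with tableone's svyCreateTableOne, which accepts a svydesign object. Again the propensity scores are simulated stand-ins, and the variable list is illustrative; the stratify-on-treatment, test = FALSE, and print-with-SMDs pattern is what the lecture describes.

```r
library(survey)
library(tableone)

set.seed(2)
n <- 500
mydata <- data.frame(
  age       = rnorm(n, 61, 10),
  female    = rbinom(n, 1, 0.5),
  treatment = rbinom(n, 1, 0.5)
)
ps <- runif(n, 0.2, 0.8)  # stand-in propensity scores
mydata$weight <- ifelse(mydata$treatment == 1, 1 / ps, 1 / (1 - ps))

weighteddata <- svydesign(ids = ~1, data = mydata, weights = ~weight)

# Weighted "table one", stratified on treatment; no significance tests,
# because what we care about are the standardized mean differences.
weightedtable <- svyCreateTableOne(vars = c("age", "female"),
                                   strata = "treatment",
                                   data = weighteddata, test = FALSE)
print(weightedtable, smd = TRUE)
```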
So let's say I was interested, in fact what I'm going to do is try to get this value here: the weighted value of age in the treated group. I'm going to try to get that with brute force, as opposed to using this survey command. So let's imagine that X is age here, and I want it for the treated group. The indicator function is just saying restrict to the treated group, and then I'm weighting by one over the propensity score. So the numerator is the sum of the weighted version of age, and the denominator is basically the weighted sample size. We need to sort of account for the fact that we might be inflating our sample size, since we're using these 1 over pi weights, but the denominator is going to add all that up and do the proper kind of division. So you could really just think of this as: I've created a pseudo-population that's bigger than my original one, and now I'm going to take the average of age in that pseudo-population. So how do I do that in practice in R? Well, here's the division symbol, and to the right of that is the denominator. So let's focus on that first. All I am saying is take the mean of weight, which is 1 over pi, among those who were treated. You could also say take the sum; really, to make these two things align exactly, you would take the sum. But you can also just take the mean, because I'm also going to take the mean in the numerator, and those 1/n's will just cancel out. So I could have said sum, I could also say mean. So that's all the denominator is: the weight among the treated, because weight is 1 over the propensity score. Now with the numerator, it's the same kind of idea. I also have the weight for the treated group, but I'm also multiplying by age in the treated group.
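The brute-force weighted mean can be sketched as below, with simulated stand-ins for age, treatment, and the propensity scores. The mean/mean ratio is the calculation the lecture walks through (the 1/n factors cancel, so mean() and sum() give the same ratio), and base R's weighted.mean is included purely as a check.

```r
set.seed(3)
n <- 400
treatment <- rbinom(n, 1, 0.5)
age <- rnorm(n, 61, 8)
ps  <- runif(n, 0.2, 0.8)  # stand-in propensity scores
weight <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Weighted mean of age in the treated pseudo-population, by hand:
#   numerator   = mean of (weight * age) among the treated
#   denominator = mean of weight among the treated
# (each mean carries a 1/n that cancels, so sum() would work identically)
wmean_by_hand <- mean(weight[treatment == 1] * age[treatment == 1]) /
                 mean(weight[treatment == 1])

# Same quantity via the built-in weighted.mean, as a sanity check.
wmean_builtin <- weighted.mean(age[treatment == 1], weight[treatment == 1])

all.equal(wmean_by_hand, wmean_builtin)  # TRUE
```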
And so then if you do that and do the division, what we end up with here is 61.4, and hopefully that is what we saw earlier; I'll just circle it in green. So we get the same answer. I used a kind of brute force, calculated it directly myself, and got the same answer. That's all that survey kind of command is doing: the same kind of thing. We're just taking the average of, in this case, age in the treated population, but weighted, so it's a pseudo-population. Now let's get to marginal structural models, and let's imagine that we're interested in fitting a simple marginal structural model, where it's just going to be a linear function of treatment. So here treatment is right heart catheterization, yes or no; the outcome is died, yes or no; and then we have some link function g(). And I'll look at two examples, one where we use a log link. The reason you might want to use a log link is if you're interested in a relative risk. But I'll also look at an identity link, which you would use if you were interested in a risk difference. So I'll look at both of those. One way you could do this is, remember, I already have my weights at this stage. So I could just fit a generalized linear model, but to the weighted data, to the pseudo-population. That's what I'm going to do; I'm going to use a glm command. In fact, let me change to red so it's easier to see: glm command. My outcome is died, yes or no. Treatment is my only variable appearing on the right-hand side; remember I said it's just linear in treatment. The important thing here is the weights argument, to let it know you want a weighted, sort of, generalized linear model. And here I'm going to say, okay, I have outcome data that's binary, but let's use a log link because I'm interested in a relative risk. So I fit that model.
I'm calling it just this generic glm.obj, so it's a GLM object, but from that I can extract the coefficients, so I just say, give me the coefficients. So betaiptw is just the coefficients, those psi 0 and psi 1 from the marginal structural model. But now what I want is an asymptotic sort of sandwich variance, this robust variance estimator. So what I'm going to use is this vcovHC. This is coming from that sandwich package that gives you the sandwich kind of estimators. This is just to account for the fact that I have weighted data. The weights will make it seem like my sample size is bigger than it is, in a sense, and we need to correct for that; that's what this robust covariance matrix will do. So vcovHC will give you, in this case, a two by two covariance matrix, because psi 0 and psi 1, those estimates also covary. But I really just care about the diagonal of it, so I'm just asking for the diagonal. That'll give me the two variance terms, and I take the square root to get the standard error. So the standard error of the intercept and the coefficient of treatment is just SE. Now, if I want a causal relative risk, I just have to exponentiate my parameters. In fact, the causal relative risk is going to be e to the coefficient of treatment. Right? So if my model was log of the expected value of Y^a equals psi 0 plus psi 1 times a, then the exponential of psi 1 is going to be a relative risk. And so that's what I'm doing here. What I'm calling causalrr is the relative risk. I can also get a lower confidence limit and an upper confidence limit. And what I'm doing is, I'm just taking the point estimate plus or minus roughly two times the standard error, and then I exponentiate. So this inner part here is giving me, in this case, the bound for the log relative risk, but then I need to exponentiate to get the bound for the relative risk itself.
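The weighted GLM with a log link plus the sandwich variance can be sketched as below, on simulated data (the propensity scores and outcome model are stand-ins, so the numbers will differ from the lecture's 1.04 to 1.13). Two practical notes that are my additions, not the lecture's: R warns about non-integer successes when a binomial GLM gets non-integer weights, which is harmless here, and a log-link binomial fit sometimes needs explicit starting values.

```r
library(sandwich)

set.seed(4)
n <- 2000
ps <- runif(n, 0.2, 0.8)                 # stand-in propensity scores
treatment <- rbinom(n, 1, ps)
died <- rbinom(n, 1, plogis(-1 + 0.3 * treatment))
weight <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Weighted GLM with a log link: the treatment coefficient is a log relative
# risk. suppressWarnings() hides the expected non-integer-successes warning;
# start values help the log-link fit converge.
glm.obj <- suppressWarnings(
  glm(died ~ treatment, weights = weight,
      family = binomial(link = log), start = c(log(mean(died)), 0))
)
betaiptw <- coef(glm.obj)

# Robust (sandwich) variance to account for the weighting.
SE <- sqrt(diag(vcovHC(glm.obj, type = "HC0")))

# Causal relative risk with an approximate 95% confidence interval.
causalrr <- exp(betaiptw[2])
lcl <- exp(betaiptw[2] - 1.96 * SE[2])
ucl <- exp(betaiptw[2] + 1.96 * SE[2])
c(lcl, causalrr, ucl)
```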
And now I could just print them out for you, all three of them. So this is the point estimate of the causal relative risk, and this is the lower 95 percent limit and upper 95 percent limit. So our 95 percent confidence interval is about 1.04 to 1.13. And a value of the relative risk greater than 1 is indicating higher risk of death for the treated group, for the right heart catheterization group. I can also quickly show you how to get a risk difference. So I just showed you a relative risk, but you might be more interested in a risk difference. We do the same thing as before, except now I say link equals identity. So now the model I'm fitting is: the expected value of Y^a equals psi 0 plus psi 1 times a. There is no log in front; it's just this. That's what an identity link means. Again, I'm telling it it's a weighted situation, do it on the pseudo-population, and I can pull off the coefficient and get the standard error again. Then, using the same kind of method, I can get a confidence interval. And now we see for the risk difference, the point estimate is about 0.05, and the confidence interval goes from about 0.02 to 0.07. And since this is a risk difference, a value greater than 0 is saying that there's a greater risk of death in the treated group. So it's the same message, but as a risk difference as opposed to a ratio. But we can do the same thing now using this ipw package. So I showed you how, if you have a generalized linear model kind of package, you can just do a weighted approach, but then you have to do this extra thing where you're asking it for the robust variance and so on. We can also use the ipw package, so I'm going to show you that quickly. The first thing you would do is use this ipwpoint function. The point part of it refers to a point treatment, which just means we don't have longitudinal kind of data; you don't have a bunch of treatments over time.
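The risk-difference version only changes the link; a sketch on simulated stand-in data is below (so the estimates won't match the lecture's 0.05 and 0.02 to 0.07). As with the log link, the start values and suppressWarnings() are my additions to keep the identity-link binomial fit well-behaved.

```r
library(sandwich)

set.seed(5)
n <- 2000
ps <- runif(n, 0.2, 0.8)                 # stand-in propensity scores
treatment <- rbinom(n, 1, ps)
died <- rbinom(n, 1, plogis(-1 + 0.3 * treatment))
weight <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))

# Identity link: E(Y^a) = psi0 + psi1 * a, so psi1 is a risk difference.
glm.obj <- suppressWarnings(
  glm(died ~ treatment, weights = weight,
      family = binomial(link = "identity"), start = c(mean(died), 0))
)
betaiptw <- coef(glm.obj)
SE <- sqrt(diag(vcovHC(glm.obj, type = "HC0")))

# Causal risk difference with an approximate 95% confidence interval.
causalrd <- betaiptw[2]
c(causalrd - 1.96 * SE[2], causalrd, causalrd + 1.96 * SE[2])
```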
So, it's a one-time kind of treatment situation. We use ipwpoint, and we let it know our exposure is called "treatment". At this point, what we're doing is really specifying the propensity score model. So you could do this instead of the whole propensity score model setup that I showed you; this is an alternative. You tell it link="logit". And when it says denominator, what it means is the denominator of your weights, which is really this propensity score model. Then you tell it the variables that go into your propensity score model, and what this is going to do is get the weights for you. So then I ask it to summarize, and what they call ipw.weights are the actual weights. We see the maximum weight here is 21.6. One nice thing about this package is that you can pretty easily get plots, for example with ipwplot. It's going to just plot the weights for you. You tell it what the weights are, and I told it to restrict to 0 to 22 because I already know the maximum is 21.6. And that's what it will do for you. It's basically a density kind of plot, showing you the distribution of weights. And now you can fit the marginal structural model. Another way you could do that, as opposed to the glm function that I used with weights previously, is to use the svyglm command. The main advantage of this is that it's basically going to automatically give you the right variance estimator; it's going to do the robust sandwich estimator for you. So that's the main advantage of using svyglm. But again, you specify the marginal structural model here. And now we have to use this svydesign command, and this is really where you tell it about your weights. It doesn't end up being very complicated: you say svydesign, and you tell it what your weights are.
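The ipwpoint-plus-svyglm route can be sketched as below, again on simulated stand-in data (so the maximum weight won't be the lecture's 21.6, and the confounder list is illustrative). ipwpoint fits the denominator (propensity score) model and returns the weights; svyglm then fits the marginal structural model with robust standard errors built in.

```r
library(ipw)
library(survey)

set.seed(6)
n <- 1000
mydata <- data.frame(age = rnorm(n, 60, 10), female = rbinom(n, 1, 0.4))
lp <- -0.03 * (mydata$age - 60) + 0.4 * mydata$female
mydata$treatment <- rbinom(n, 1, plogis(lp))
mydata$died <- rbinom(n, 1, plogis(-1 + 0.3 * mydata$treatment))

# ipwpoint specifies the propensity score model (the "denominator" of the
# weights) and returns the inverse probability weights as ipw.weights.
weightmodel <- ipwpoint(exposure = treatment, family = "binomial",
                        link = "logit", denominator = ~ age + female,
                        data = mydata)
summary(weightmodel$ipw.weights)

# Density-style plot of the weight distribution.
ipwplot(weights = weightmodel$ipw.weights, logscale = FALSE,
        main = "weights")

# Fit the marginal structural model with svyglm: the robust sandwich
# variance estimator comes automatically. The default gaussian/identity
# link makes the treatment coefficient a risk difference.
mydata$wt <- weightmodel$ipw.weights
msm <- svyglm(died ~ treatment,
              design = svydesign(ids = ~1, weights = ~wt, data = mydata))
coef(msm)
confint(msm)
```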
So from that, you can extract coefficients. We named this msm, so I can say, okay, tell me what the coefficients are, and those are the two coefficients. And this one is for an identity link kind of situation, so this 0.052 should be the same as what I had on the previous slide, earlier. Let's take a look. Thankfully, it is the same. So these are just two different ways to get the same answer. And if you want a 95% confidence interval for the treatment effect, it's right here. So this is maybe a little easier, in that there are fewer steps. You could use this ipw package along with svyglm and get the same kind of answer. As one additional point, I want to quickly show you how you could truncate the weights if you wanted to. In this particular example, I'm not very worried, because the largest weight was just over 20, so I probably wouldn't be very worried about weights that are too large, but I want to at least show you how to do it. I'll show you two quick ways. One way is, remember earlier, where I created the variable called weight? Well, I'm going to name a new variable truncweight, which means truncated weight. And I'm just going to use a replace function: replace weight where weight is greater than 10 with 10, so I'm going to truncate at 10. I just picked a value. What that's going to do is, any time it sees a weight greater than 10, it replaces it with 10, so my new maximum weight will be 10. And once I have the truncated weights, I can do the same thing as we did before: I can use glm, but this time with the truncated weights. So that's all there is to it; if you want to fit a marginal structural model using truncated weights for your weights, you can do that. Finally, I'll show you how you can do that in the ipw package.
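The manual truncation step can be sketched as below, with simulated stand-in propensity scores; the replace() call and the cutoff of 10 are exactly what the lecture describes.

```r
set.seed(7)
n <- 300
ps <- runif(n, 0.02, 0.98)               # stand-in propensity scores
treatment <- rbinom(n, 1, ps)
weight <- ifelse(treatment == 1, 1 / ps, 1 / (1 - ps))
max(weight)  # can exceed 10 when ps is near 0 or 1

# Truncate at 10: any weight above 10 is replaced with 10.
truncweight <- replace(weight, weight > 10, 10)
max(truncweight)  # now at most 10

# From here, refit the weighted GLM exactly as before,
# but with weights = truncweight.
```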
So, in the ipw package, they have a built-in truncation option. Again, we have ipwpoint, so most of this code should be the same as before. The thing that's new is right here: you can use this trunc option, which means truncate. You don't put in an actual value of the weight; you have to put in a percentile. In this case I'm putting in .01, and what that means is that it's going to truncate at the first and the 99th percentiles. So it's going to truncate the smallest weights, and it's also going to truncate the very largest weights. It doesn't really want any observations to have too tiny of an influence, so it'll truncate at the first percentile; and it also doesn't want any to have a huge influence, so it'll truncate at the 99th percentile. To be extra clear, had I put in .02, it would have truncated at the 2nd and 98th percentiles. So it's going to truncate at a percentile that you pick. If you wanted to truncate at a specific value, you would have to look at your weights and see what percentile corresponds to the value you want to truncate at. But this is a thing that people commonly do, truncate at a percentile. So now we can summarize our weights, but now they're called weights.trunc, the truncated weights. And you'll notice the max is about 6.4. So the 99th percentile was even smaller than the 10 I truncated at a minute ago; this is truncating at about 6.4. And now, if we wanted to, we could fit the same model as before, but using the truncated weights. We can use svyglm again, with the truncated weights this time. And then what should happen is, we should end up with a slightly different answer than we got previously. So we end up with a point estimate of about 0.05, and a 95 percent confidence interval that goes from about 0.03 to 0.082. If we look back, you'll see that it is different, if you just look at the confidence interval.
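The built-in truncation can be sketched as below, again on simulated stand-in data (so the truncated maximum won't be the lecture's 6.4). With trunc specified, ipwpoint returns the truncated weights as weights.trunc alongside the untruncated ipw.weights.

```r
library(ipw)

set.seed(8)
n <- 1000
mydata <- data.frame(age = rnorm(n, 60, 10))
mydata$treatment <- rbinom(n, 1, plogis(-0.05 * (mydata$age - 60)))

# trunc = .01 truncates the weights at their 1st and 99th percentiles.
weightmodel <- ipwpoint(exposure = treatment, family = "binomial",
                        link = "logit", denominator = ~ age,
                        data = mydata, trunc = .01)

summary(weightmodel$ipw.weights)    # untruncated weights
summary(weightmodel$weights.trunc)  # truncated weights

# The truncated weights can then be used with svyglm as before.
```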
For example, this upper limit, when we didn't truncate, was about 0.0797, and now it's 0.0816. So it didn't make a very big difference in this case, which we didn't expect it to, because we didn't have any extremely large weights. But at the same time, you might find it reassuring that we're getting a pretty consistent answer whether we truncate or not. You could also view this as a sensitivity analysis of sorts, where we did things two different ways and we want to make sure we're coming to the same conclusion. But this is how you could go about truncating weights if you were interested in doing so.