Causal Analysis Introduction - Examples in Python and PyMC

"Correlation is not causation" is a common mantra. It's easy to find examples of spurious correlation in the world and in many data sets. It's often wise not to assume a causal effect, but there are times when we want to infer causal relationships from our data.

One model that attempts to address this is the Rubin model of causal effect. We'll explore this idea and recreate some analysis. This model is typically useful for observational data. It's not the only paradigm for estimating causality; another approach revolves around structural equation modeling, and might be more general and useful.

The Core Problem in Measuring Causal Effect

Borrowing from the Wikipedia article on the Rubin causal model, imagine we have a few people partaking in a study where we want to see if some treatment helps or hinders a subject's outcome. We first observe what happens to each subject without the treatment, then we grab a time machine and restart the experiment, this time administering the treatment to all the subjects. From this we would be able to directly observe how the treatment affected each subject.

|       | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|-------|--------|--------|---------------|
| Joe   | $-5$   | $5$    | $-10$         |
| Mary  | $-10$  | $-5$   | $-5$          |
| Sally | $0$    | $10$   | $-10$         |
| Bob   | $-20$  | $-5$   | $-15$         |
| Mean  |        |        | $-10$         |

In this table we use the notation $Y(1)$ for the outcome having been treated (treatment), and $Y(0)$ for the outcome if there was no treatment (control). With full observations we could simply take each subject's difference in outcome $Y(1) - Y(0)$, then average them.

Causal Effect since we don't have a time machine

In general we're typically not going to be able to fully observe the outcomes for each subject. Typically we'll have a dataset with partial observations,

|       | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|-------|--------|--------|---------------|
| Joe   | $?$    | $5$    | $?$           |
| Mary  | $-10$  | $?$    | $?$           |
| Sally | $?$    | $10$   | $?$           |
| Bob   | $-20$  | $?$    | $?$           |
| Mean  |        |        | $?$           |

We could try averaging each treatment and control group, then estimating the differences.
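As a quick sketch of that naive approach, here it is on the observed values from the table. Notice how far this estimate lands from the true $-10$ we computed from the full table:

```python
import numpy as np

# Observed outcomes from the table above: Mary and Bob were treated,
# Joe and Sally were controls.
treated_outcomes = np.array([-10.0, -20.0])  # Y(1) for Mary, Bob
control_outcomes = np.array([5.0, 10.0])     # Y(0) for Joe, Sally

# Naive estimate: difference of the group means.
naive_ate = treated_outcomes.mean() - control_outcomes.mean()
print(naive_ate)  # -22.5
```

With so few, unmatched subjects the naive estimate is badly off, which is exactly why the rest of this post is about better estimators.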

Estimating Causal Effect

In general we cannot directly observe causal effect. We have to look at methods that estimate this, similar to how we averaged the treatment & control group above.

Substitutes for Potential Outcomes

A common and sometimes accidentally applied method is substituting the potential outcome.

Examples:

- Maybe you can repeat the treatment, e.g. drinking tea before bed.
- Dividing up a piece of plastic and exposing each piece to a corrosive chemical.
- Measuring the effect of a diet over time by measuring weight.

These substitutions tend to carry strong assumptions that may be implicit in the choice of substitute. They are regularly found in research, and though the assumptions are sometimes left unstated or belong to a poorly designed study, they are generally acceptable in many situations.

Drinking tea before bed one night probably doesn't have a huge effect on successive nights, unless you're a habitual caffeine drinker, in which case you might have less than useful results.

The Randomized Controlled Trial

The typical gold standard for measuring causal effects is the randomized controlled trial. The goal is to compare apples with apples by controlling for various factors, and we use randomization to avoid introducing bias.

This model is not always perfect, and a few issues can make it impossible or impractical to conduct an experiment in this manner:

- It could be cost prohibitive.
- Participants could self-select into the treatment group, e.g. company benefits programs.
- You may not be involved in the study design, and only receive data post-hoc.
- It might be unethical to control treatment.

Statistical Adjustment

The goal of statistical adjustment is to approximate what a randomized experiment would achieve. The idea is to use statistical ideas to find similar units to compare. This is what we'll explore further with our causal model, and with pairwise matching, though there are many things that could fall into this category.

The Rubin Causal Model

Switching gears for a moment, let's revisit our earlier example:

|       | $Y(1)$ | $Y(0)$ | $Y(1) - Y(0)$ |
|-------|--------|--------|---------------|
| Joe   | $?$    | $5$    | $?$           |
| Mary  | $-10$  | $?$    | $?$           |
| Sally | $?$    | $10$   | $?$           |
| Bob   | $-20$  | $?$    | $?$           |
| Mean  | $?$    | $?$    | $?$           |

What if we can estimate the unknown values which we can't observe in practice? We could look at building a linear regression on whether or not a subject had treatment, we'll use the indicator $D_i$ to represent treatment for some subject $i$.

$$
\hat{Y}_i = \alpha + \tau D_i + \epsilon_i
$$

From this, we can make use of the typical interpretation of linear model coefficients and say that $\tau$ is a value that represents how the outcome differs depending on whether the subject received treatment.

This estimator can be shown to be equal to the difference in average outcomes by treatment status, $\hat{\tau} = \bar{Y}(1) - \bar{Y}(0)$. This is the core idea for estimating causal effect with this framework.
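We can check this equivalence numerically: regressing an outcome on an intercept and a treatment dummy recovers exactly the difference in group means. The data here is synthetic, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: binary treatment indicator D and outcome y.
D = rng.integers(0, 2, size=200)
y = 3.0 - 2.0 * D + rng.normal(size=200)

# Fit y = alpha + tau * D by least squares.
A = np.column_stack([np.ones(len(D)), D])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
tau_hat = coef[1]

# tau_hat coincides with the difference in group means.
diff_in_means = y[D == 1].mean() - y[D == 0].mean()
print(np.isclose(tau_hat, diff_in_means))  # True
```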

We can continue further by adding in features that help improve the model. One might wish to control for various covariates $X_i$, or might believe that the treatment has interaction effects with those covariates. These lead to the corresponding regression models:

$$
\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i
$$

$$
\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \gamma (D_i \times X_i) + \epsilon_i
$$

In both of these models we're still only concerned with interpreting the coefficient $\tau$.

Assumptions in This Model

What are we assuming here?

We won't look in depth at these assumptions here, but it's important to know that the formulations of these models assume that the treatment is Strongly Ignorable. This means that we want the following two traits,

Unconfoundedness: $D$ is independent of $(Y(0), Y(1))$ conditional on $X=x$, i.e. given the covariates, treatment assignment carries no extra information about the potential outcomes.

Overlap: $c < \mathbb{P}(D=1 \mid X=x) < 1 - c$, for some $c > 0$, i.e. every subject has some chance of ending up in either group.

These assumptions are likely where you'll find the most criticism of this type of model, and warrants actual study of the source material which I've referenced below.

Matching

I mentioned that we were going to look at statistical adjustment, in particular matching. Matching is used when you have imbalanced data sets, usually a large number of control samples and a small number of treatment samples. It's an attempt to remove the imbalance in these groups and approximate what you might have had given a randomized or controlled experiment.

We'll look at two methods: propensity matching, and using Mahalanobis distance between samples. Propensity matching is definitely the most common method, though there are criticisms against using it blindly.

Propensity Score Matching

A propensity score is an approximate model of how likely a subject is to have been in the treatment group, $P(D_i = 1 \mid X_i)$. It's solved easily with an intermediate regression, typically something like

$$
\mathbb{P}(D_i = 1 \mid X_i) = \text{logistic}(\alpha + \beta X_i)
$$

Once built, we can then estimate propensities for both of our treatment and control groups. Hopefully we'll find that there are some subjects in the treatment group with low propensity, but the majority have high propensity (they did actually end up in the treatment group), and the opposite for the control group.

We would then find one-to-one (or one-to-many) matches in the control group for each sample in the treatment group, picking matches that minimize the difference in propensity score. Since we're dealing with estimates, it's ok if this isn't an optimal matching, and we can usually use a simple algorithm like a greedy random match.

Distance Matching

Using a measure like Mahalanobis distance is also an effective way to match groups, and has the bonus of not introducing an intermediate model.

We would look at picking a match that minimizes distance between covariates,

$$
m(i) = \arg \min_{j : D_j \ne D_i} \| X_j - X_i \|
$$

Where we define $\|X_j - X_i\|$ as a distance between the covariate vectors $X_j$ and $X_i$ as follows:

$$
\| X_j - X_i \| = \sqrt{\sum_{m=1}^{M} \frac{(X_{j,m} - X_{i,m})^2}{\sigma_m^2}}
$$

where $\sigma_m^2$ is the variance of the $m$-th covariate. This is a special case of Mahalanobis distance (the one with a diagonal covariance matrix) that's called the normalized Euclidean distance.
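SciPy happens to expose the normalized Euclidean distance directly as the `seuclidean` metric, and we can verify that it agrees with Mahalanobis distance under a diagonal inverse covariance:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
# Covariates on very different scales, to show why normalization matters.
X = rng.normal(size=(10, 3)) * np.array([1.0, 5.0, 0.1])

V = X.var(axis=0)  # per-covariate variances

# Normalized Euclidean distance: each covariate scaled by its variance.
d_seuclidean = cdist(X, X, metric='seuclidean', V=V)

# The same thing as Mahalanobis distance with a diagonal inverse covariance.
d_mahalanobis = cdist(X, X, metric='mahalanobis', VI=np.diag(1.0 / V))

print(np.allclose(d_seuclidean, d_mahalanobis))  # True
```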

Examples

Let's start looking at this with a few different examples on the same data set. This dataset can be found here, and is one that was used in particular with a propensity matched study. The data is a population survey including some basic demographics about each subject, and information about their salary. Our treatment group is filled with subjects who partook in an occupational training program. We expect to see that having gone through training has a positive end effect on salary.

After downloading the files, we load them in Python, and add a few useful features.

There are a lot of samples, and a lot of imbalance between the control and treatment groups. We have only 185 treatment observations against 15,992 control observations. If we built a causal model on all of these samples, the large number of control samples would likely wash out the value of the treatment indicator. Thus this is a great dataset to use with matching.

If we look at the compared density plots of the controls and the treatments we see the following:

Using the causalinference library

The first thing we're going to do is look at a well-made library built specifically for this type of causal model, the causalinference package in Python. It makes it very straightforward to run several models quickly.

This library includes facilities to perform matching for us. Methodology details can be found in its documentation. We run three models, and see positive treatment effects in each one, though only one is "significant". These are seen in the output above as the "ATE" lines, where ATE stands for average treatment effect.

While I like this library and recommend it, I've had numerical issues using it on larger and more complex data sets (its implementation tries to invert some matrices, which is typically a recipe for singular-matrix errors). Additionally, I like to use Bayesian frameworks where I can, since I tend to prefer the probabilistic interpretation of results over p-values and hypothesis tests.

Using a simple Bayesian linear regression (PyMC)

Now I'll look at implementing my own matching routines, and using a Bayesian linear regression.

For $y \in \mathbb{R}$, and $\mathbf{x} \in \mathbb{R}^M$ with $N$ data points, we have $(\mathbf{X}, \mathbf{y}) = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, with the following distributions:

$$
\begin{aligned}
\boldsymbol{\beta} &\sim \mathcal{N}(\mathbf{0}, \sigma_\beta^2 I) \\
\sigma &\sim \text{HalfCauchy}(\beta_\sigma) \\
y_n \mid \mathbf{x}_n &\sim \mathcal{N}(\boldsymbol{\beta}^\top \mathbf{x}_n, \sigma^2)
\end{aligned}
$$

This is already implemented in PyMC with the glm functionality. Thus, all we will need to do is pass in the formula for our model.

Propensity Matching

For propensity matching we first need an estimate of the propensity score. In the interest of computational time, we'll use a standard frequentist logistic model as our propensity function. We do this since it's fast, and we don't plan to interpret this intermediate model in this particular analysis. We want to estimate $p(\text{treat} \mid X)$, which we do with the popular statsmodels package.

From this fitted model, we can then make propensity score estimates, and come up with some matches.

```python
from patsy import dmatrices
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

_, X_propensity = dmatrices(propensity_formula, unmatched_raw,
                            return_type='dataframe')
propensities = propensity_model.predict(exog=X_propensity)

split = cps.shape[0]
treatment = np.expand_dims(propensities[split:], 1)
control = np.expand_dims(propensities[:split], 1)

# Calculating the distance in propensity of each treatment sample
# to each control sample
distances = cdist(treatment, control)

# Computing an optimal match between the two groups
treatment_idx, control_idx = linear_sum_assignment(distances)
treatment_data = unmatched_raw[split:].iloc[treatment_idx]
control_data = unmatched_raw[:split].iloc[control_idx]
```

Here, we compute an optimal match, though this is not always efficient (or needed). As mentioned earlier, we can typically compute a random greedy match with or without replacement and achieve similar results; that runs much faster since it's just a simple loop. Since our treatment group is small enough, I wanted to try the optimal matching method.
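For illustration, that greedy alternative really is just a few lines. This sketch matches each treatment sample to its nearest remaining control by propensity score, visiting treatment samples in random order (the scores here are toy values):

```python
import numpy as np

def greedy_match(treatment_scores, control_scores, replace=False, seed=0):
    """Greedily match each treatment sample to its nearest control by
    propensity score, visiting treatment samples in random order."""
    rng = np.random.default_rng(seed)
    available = np.ones(len(control_scores), dtype=bool)
    matches = {}
    for i in rng.permutation(len(treatment_scores)):
        candidates = np.where(available)[0]
        # Nearest still-available control by absolute score difference.
        j = candidates[np.argmin(np.abs(control_scores[candidates]
                                        - treatment_scores[i]))]
        matches[int(i)] = int(j)
        if not replace:
            available[j] = False
    return matches

# Toy propensity scores: 3 treatment samples, 6 controls.
t = np.array([0.8, 0.6, 0.9])
c = np.array([0.1, 0.55, 0.85, 0.3, 0.95, 0.62])
matched = greedy_match(t, c)
print(matched)
```

Without replacement, each control is used at most once, so the matched control group stays the same size as the treatment group.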

Now if we compare our density plots we see much more similarity in our groups,

Now that we've got a treatment group and a matched control group, we can put them together and build a model to estimate the treatment effect. I include the covariates in this model ($\hat{Y}_i = \alpha + \tau D_i + \beta X_i + \epsilon_i$, the second regression noted above), but it would only require updating the formula to regress on the treatment indicator alone, or to add in the interaction terms with the treatment indicator.

We only output the trace plot for the treatment variable, since it's the one we're currently interested in. The other sampled parameters look like they've converged sufficiently; one could check this more thoroughly with an autocorrelation plot of the samples. If we look at what the parameter means are,

Here, we specify the diagonal matrix VI so that we have the normalized Euclidean distance case. The other lines are similar to the propensity matching. Once again we can look at the distribution plot estimation, and we see that it's very similar to the propensity matched grouping.
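The Mahalanobis matching step described above can be sketched as follows (the array names are hypothetical stand-ins mirroring the propensity-matching code):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment

# Hypothetical stand-in covariate matrices; in the post these would be
# the covariate columns of the treatment and control samples.
rng = np.random.default_rng(0)
X_treatment = rng.normal(loc=1.0, size=(20, 4))
X_control = rng.normal(loc=0.0, size=(200, 4))

# Diagonal inverse covariance -> the normalized Euclidean distance case.
pooled = np.vstack([X_treatment, X_control])
VI = np.diag(1.0 / pooled.var(axis=0))

distances = cdist(X_treatment, X_control, metric='mahalanobis', VI=VI)

# Optimal assignment of one distinct control per treatment sample.
treatment_idx, control_idx = linear_sum_assignment(distances)
matched_controls = X_control[control_idx]
```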

Once again we get a very similar treatment effect. The rationale here is that our earlier regression models are largely derived from this idea.

Results

We've seen a few different implementations of the Rubin model of causal effect, and looked at matching with propensity scores and with Mahalanobis distance. All of these models come up with similar positive treatment effects. Note that if we had used only the OLS or matching treatment effects from the causalinference library, we might not have claimed any significance. However, given that each of these models arrives at a similar treatment effect, we can be a bit more confident that there is in fact some positive treatment effect. In fact, with our Bayesian models we're relatively sure that there's a positive effect: $P(\text{treat} > 0) \approx 0.97$.

The difference between the propensity matching and the Mahalanobis distance matching was very small. In other models this might not be the case, and there might be good reasons to use one over the other. In general, propensity matching is a very popular approach, but one should be careful to understand the strong ignorability assumption that goes along with it.

Lastly, the Rubin model of causal effect is relatively simple, and more advanced approaches have been studied with structural equation modeling. The work by Judea Pearl is very good and his book, Causality: Models, Reasoning and Inference, is a good reference on the topic.