Estimating Intervention Effects using Bayesian Models in R

May 2018 · 26 minute read

Measuring the effect of an intervention on some metric is an important problem in many areas of business and academia. Imagine you want to know the effect of a recently launched advertising campaign on product sales. In an ideal setting, you would have a treatment and a control group so that you can measure the effect of the campaign. Unfortunately, it is often not possible or feasible to run a controlled experiment.

This project tries to measure the effect that compulsory wearing of seat belts had on UK driver deaths. We use the ‘Seatbelts’ dataset, which contains monthly totals of car drivers in the UK killed or seriously injured between January 1969 and December 1984. Wearing front seat belts became mandatory on 31 January 1983. We are interested in estimating the effect of this law on driver deaths.

To estimate the causal impact, we will compare several methods:

Vanilla Bayesian regression models

Vanilla model with AR-1 term

Structural time-series models as implemented in package CausalImpact

The vanilla regression model is, as the name suggests, pretty simple: we include an indicator variable equal to 1 from the time wearing seat belts became compulsory onward and check whether the corresponding regression coefficient is statistically different from zero (hopefully negative). Then we extend the model with an AR-1 term and check again. Finally, we use the package CausalImpact to estimate the effect of introducing the seat belt law.

While DriversKilled/kms is not quite normal, the distribution of the log() does not seem to depart much from a normal distribution. While y only needs to be conditionally normal given X, it is quite nice (although not at all necessary!) that it seems to be unconditionally normal too. We will take the log to help with numerical precision. Taking the log also gives a nice interpretation where we have the %-change in number of driver deaths per km driven given our covariates.
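As a quick sketch of the transformation (using the Seatbelts data that ships with base R):

```r
# Seatbelts ships with base R; relevant columns include DriversKilled,
# kms (distance driven) and law (the post-intervention indicator).
df <- as.data.frame(datasets::Seatbelts)
# Deaths per km driven, on the log scale: helps numerical precision and
# lets coefficients be read as approximate %-changes.
df$log_deaths_per_km <- log(df$DriversKilled / df$kms)
summary(df$log_deaths_per_km)
```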

We start off by modeling the dependent variable using a vanilla regression model:

Vanilla regression

For the following hand-coded Gibbs sampler we assume the following model:
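A plausible form of this model, consistent with the conjugacy discussion below (the hyperparameter names \(b_0, B_0, a_0, d_0\) are assumed notation, not necessarily the original's):

\[
y = X\beta + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)
\]
\[
\beta \mid \sigma^2 \sim \mathcal{N}(b_0, \sigma^2 B_0), \qquad \sigma^2 \sim \mathcal{IG}(a_0, d_0)
\]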

Since a normal-inverse-gamma prior is conjugate to a normal likelihood, the posterior is again a normal-inverse-gamma distribution. So we could either calculate the parameters of the posterior distribution and use it directly or we could use Gibbs sampling to approximate the posterior through simulation. We will do the latter here.
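A minimal hand-coded sketch of such a Gibbs sampler (assuming a \(\mathcal{N}(b_0, \sigma^2 B_0)\) prior on \(\beta\) and an \(\mathcal{IG}(a_0, d_0)\) prior on \(\sigma^2\); function and argument names are illustrative, not the post's original code):

```r
# Sketch of a Gibbs sampler for y = X beta + eps, eps ~ N(0, sigma^2 I),
# with conjugate priors beta | sigma^2 ~ N(b0, sigma^2 B0), sigma^2 ~ IG(a0, d0).
gibbs_lm <- function(y, X, n_iter = 2000, b0 = rep(0, ncol(X)),
                     B0 = diag(1000, ncol(X)), a0 = 1, d0 = 1) {
  n <- length(y); k <- ncol(X)
  B0_inv <- solve(B0)
  # The posterior moments of beta | sigma^2 do not depend on the current
  # value of sigma^2 (the prior scales with it), so compute them once.
  Bn <- solve(B0_inv + crossprod(X))
  bn <- Bn %*% (B0_inv %*% b0 + crossprod(X, y))
  chol_Bn <- t(chol(Bn))
  beta_draws <- matrix(NA_real_, n_iter, k)
  sigma2_draws <- numeric(n_iter)
  sigma2 <- 1
  for (i in seq_len(n_iter)) {
    # Draw beta | sigma^2, y ~ N(bn, sigma^2 * Bn).
    beta <- bn + sqrt(sigma2) * (chol_Bn %*% rnorm(k))
    # Draw sigma^2 | beta, y from its inverse-gamma full conditional.
    resid <- y - X %*% beta
    quad <- crossprod(resid) + t(beta - b0) %*% B0_inv %*% (beta - b0)
    sigma2 <- 1 / rgamma(1, a0 + (n + k) / 2, d0 + quad / 2)
    beta_draws[i, ] <- beta
    sigma2_draws[i] <- sigma2
  }
  list(beta = beta_draws, sigma2 = sigma2_draws)
}
```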

It can be shown that the full conditional distributions of \(\beta\) and \(\sigma\) are given by:
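For the conjugate model above, these conditionals take the standard form (again using the assumed hyperparameter notation \(b_0, B_0, a_0, d_0\)):

\[
\beta \mid \sigma^2, y \sim \mathcal{N}\left(b_n,\ \sigma^2 B_n\right), \qquad
B_n = \left(B_0^{-1} + X^\top X\right)^{-1}, \quad
b_n = B_n\left(B_0^{-1} b_0 + X^\top y\right)
\]
\[
\sigma^2 \mid \beta, y \sim \mathcal{IG}\left(a_0 + \tfrac{n+k}{2},\ d_0 + \tfrac{1}{2}\left[(y - X\beta)^\top (y - X\beta) + (\beta - b_0)^\top B_0^{-1} (\beta - b_0)\right]\right)
\]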

Pretty similar, but not quite the same. It is important to note that, due to the small sample size (only 192 observations), the priors have quite a strong say in the posterior! If we were to put stronger prior beliefs into our model, the prior would dominate the posterior and the results could be very different!

So both models suggest that the introduction of the law actually helped reduce the number of driver deaths per km driven.

Convergence tests

Since the results from our sampler are more or less the same as from OLS, we should not have any convergence issues. Nevertheless, let’s take a look at the traceplots of our parameters:
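A base-graphics sketch of such trace plots (the draws here are simulated stand-ins; in the post they would come from the Gibbs sampler):

```r
set.seed(1)
# Simulated stand-in draws: 2000 iterations, one column per parameter
# (column names are illustrative).
draws <- cbind(beta_law = rnorm(2000, -0.3, 0.05),
               sigma2   = 1 / rgamma(2000, 50, 5))
op <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
for (p in colnames(draws)) {
  plot(draws[, p], type = "l", xlab = "iteration", ylab = p,
       main = paste("Trace of", p))
}
par(op)
```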

undecided case: we are not sure what we should expect (this is the scenario we considered above)
\[law \mid \sigma^2 \sim \mathcal{N}(0, \sigma^2 \cdot 1000)\]

easy-to-convince case: we are pretty sure that wearing seatbelts is a good idea
\[law \mid \sigma^2 \sim \mathcal{N}(-1, \sigma^2 \cdot 0.01)\]

We will assume the same prior for \(\sigma^2\), given by \(\sigma^2 \sim \mathcal{IG}(1,1)\), in all cases.

There are again two options to get the results: we could either use Gibbs sampling again or calculate the posterior distribution of the law coefficient directly. We will stick with Gibbs sampling, because deriving the marginals analytically is quite messy.

The graph nicely shows that the choice of prior has quite a large impact. This is mainly because we have a small dataset, so the likelihood does not completely overcome the prior. And while the conclusion in the hard-to-convince case is biased towards no effect, quite a large region of that posterior still lies below zero. For this particular example, a prior with a lot of mass at 0, none on positive numbers (because intuitively, wearing a seat belt should not make you worse off) and a long left tail might be even better for sceptics.

In a next step, we code up the same model, but this time we use Stan to estimate the posterior. First, let's set some general Stan options:
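For reference, the usual boilerplate (these are the standard rstan settings; the post's exact options are not shown):

```r
library(rstan)
# Cache compiled models on disk and sample chains in parallel.
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```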

The printed intervals correspond to the 2.5%–97.5% credible interval and the estimate column represents the posterior mean. Fortunately, the results are again very similar to what we got from our first two methods. (Note: LAW = beta[4], since indexing starts at 1 in this case, adding to the general confusion.)

The rstanarm package also offers a formula interface that allows you to specify simple models with independence priors and a variance with a flat improper uniform prior on \((0, \infty)\) (correct me if I am wrong, but I think that is the default).

Again, the results are pretty similar, despite the fact that we used independent priors this time. The formula interface from package rstanarm offers less choice regarding model specification, but makes up for it by providing a convenient wrapper around Stan that is easy to learn for all R users familiar with formulas in lm() and friends.
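A sketch of what such an rstanarm call might look like (the formula, covariates and priors here are assumptions for illustration, not necessarily the post's exact specification):

```r
library(rstanarm)

df <- as.data.frame(datasets::Seatbelts)
df$y <- log(df$DriversKilled / df$kms)
# stan_glm mirrors the lm() formula syntax; the priors below are
# independent normals, matching the "independence priors" noted above.
model_glm_vanilla <- stan_glm(
  y ~ law + PetrolPrice,
  data = df, family = gaussian(),
  prior = normal(0, 10),
  chains = 4, iter = 2000, seed = 123
)
summary(model_glm_vanilla)
```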

For the rest of the project we will stick to rstanarm for the in-depth analysis.

Model diagnostics

The shinystan package provides an interactive web app for model diagnostics, written with the Shiny framework created by the amazing folks at RStudio. Simply call launch_shinystan(<your_fitted_model_here>) to explore diagnostics and model fit statistics.

The rstanarm package is pretty convenient in that it automatically checks whether the chains converge or not. In this case there was no error and a quick check of the trace plots confirms that the chains look pretty good. One possible way to monitor convergence is the Rhat statistic. From the rstanarm vignette: Rhat “…measures the ratio of the average variance of the draws within each chain to the variance of the pooled draws across chains; if all chains are at equilibrium, these will be the same and Rhat will be one. If the chains have not converged to a common distribution, the Rhat statistic will tend to be greater than one.” In our case, Rhat is always 1, so everything appears to be fine.

We will evaluate model fit programmatically. First, let’s see how well the model fits in-sample:

y_rep = posterior_predict(model_glm_vanilla)
dim(y_rep)

## [1] 2000 192

We got 2000 simulated draws for each of our 192 observations. Let's do some plots:
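One quick visual check is to overlay densities of replicated datasets on the observed data; a base-graphics sketch (with simulated stand-ins for y and y_rep so it runs standalone):

```r
set.seed(42)
# Stand-ins: in the post, y would be the observed log outcome and
# y_rep the 2000 x 192 matrix returned by posterior_predict().
y <- rnorm(192, -9, 0.2)
y_rep <- matrix(rnorm(2000 * 192, -9, 0.22), nrow = 2000)
# Observed density in black, 50 replicated densities in translucent blue.
plot(density(y), lwd = 2, main = "Posterior predictive check")
for (i in sample(nrow(y_rep), 50)) {
  lines(density(y_rep[i, ]), col = rgb(0, 0, 1, alpha = 0.1))
}
```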

The new model has serially uncorrelated errors. Now we are back in the simple linear regression framework. In other words: the conditional distribution of \(\beta\) coefficients and \(\sigma^2\) is as described in the previous section given that we know \(\rho\).

Now suppose we know the \(\beta\) coefficients. In that case, \(u_t = Y_t - X_t\beta\) is known and we can treat \(u_t = \rho u_{t-1} + \epsilon_t\) as a linear regression model in \(u_t\).

This suggests that our Gibbs sampler should look like this:

Draw the \(\beta\) coefficients conditional on knowing \(\sigma^2\) and \(\rho\) after serial correlation has been removed

Afterwards, draw \(\rho\) conditional on \(\beta\) and \(\sigma^2\)

Finally, conditional on \(\beta\) and \(\rho\), draw \(\sigma\)
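The three steps above can be sketched as a single Gibbs iteration. This is a simplified version (flat priors on \(\beta\) and \(\rho\), a Jeffreys-type prior on \(\sigma^2\)) with illustrative names, not the post's original code:

```r
# One iteration of the AR(1) Gibbs sampler described above.
ar1_gibbs_step <- function(y, X, beta, rho, sigma2) {
  n <- length(y)
  # 1. Quasi-difference to remove serial correlation, then draw beta.
  y_star <- y[-1] - rho * y[-n]
  X_star <- X[-1, , drop = FALSE] - rho * X[-n, , drop = FALSE]
  Bn <- solve(crossprod(X_star))
  bn <- Bn %*% crossprod(X_star, y_star)
  beta <- bn + t(chol(sigma2 * Bn)) %*% rnorm(ncol(X))
  # 2. Given beta, the residuals follow u_t = rho * u_{t-1} + eps_t,
  # so rho has a normal conditional. (A full implementation would also
  # truncate rho to the stationary region (-1, 1).)
  u <- as.vector(y - X %*% beta)
  m <- sum(u[-1] * u[-n]) / sum(u[-n]^2)
  v <- sigma2 / sum(u[-n]^2)
  rho <- rnorm(1, m, sqrt(v))
  # 3. Given beta and rho, draw sigma^2 from its inverse-gamma conditional.
  eps <- u[-1] - rho * u[-n]
  sigma2 <- 1 / rgamma(1, (n - 1) / 2, crossprod(eps) / 2)
  list(beta = beta, rho = rho, sigma2 = sigma2)
}
```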

For the following hand-coded Gibbs sampler we assume the following model:
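A plausible reconstruction of the model being assumed here, consistent with the steps above:

\[
Y_t = X_t\beta + u_t, \qquad u_t = \rho u_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2)
\]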

The coefficients look pretty similar to what we had in the no-autocorrelation sampler, which is reassuring, since autocorrelation should only affect the standard errors of the coefficients, not the coefficients themselves. Let's compare the results with an ARIMAX model using the same specification:
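The frequentist comparison can be run with base R's arima(), which fits a regression with ARMA errors through its xreg argument (the covariates here are assumed for illustration):

```r
df <- as.data.frame(datasets::Seatbelts)
y <- log(df$DriversKilled / df$kms)
# order = c(1, 0, 0) gives AR(1) errors around the regression part.
fit <- arima(y, order = c(1, 0, 0),
             xreg = as.matrix(df[, c("law", "PetrolPrice")]))
coef(fit)
```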

Both chains seem to mix well. So convergence should be fine. Almost all sample points drawn for the LAW coefficient are below 0 so that strengthens our belief that the law did have a positive effect. The distribution of the parameter \(\rho\) is rather wide, suggesting quite some uncertainty, but it is clearly not zero.

Bayesian structural time series models using CausalImpact

In this section we will use yet another approach to check whether the requirement to wear seatbelts helped reduce driver deaths.

Let’s begin by splitting our data set into a pre- and posttreatment phase:
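A sketch of the split (assuming the law indicator flips in February 1983, which matches the Seatbelts data shipped with R):

```r
# Monthly series of log driver deaths per km, Jan 1969 - Dec 1984.
y <- log(Seatbelts[, "DriversKilled"] / Seatbelts[, "kms"])
# Pre-treatment: everything up to and including Jan 1983;
# post-treatment: Feb 1983 onward, when law == 1 in the data.
pre  <- window(y, end   = c(1983, 1))
post <- window(y, start = c(1983, 2))
c(length(pre), length(post))
```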