I'm using Google Research's Causal Impact package, and I'd like to understand more fully how the package prevents overfitting and selecting a bad batch of the covariates by chance.

Here's the paragraph from the paper which summarizes how it prevents overfitting:

"Third, we use a regression component that precludes a rigid commitment to a particular set of controls by integrating out our posterior uncertainty about the influence of each predictor as well as our uncertainty about which predictors to include in the first place, which avoids overfitting." (p. 251)

Why does "integrating out our posterior uncertainty about the influence of each predictor as well as our uncertainty about which predictors to include in the first place" avoid overfitting, in as close to layman's terms as possible?

1 Answer

First of all, as a courtesy to other people reading this, I would like to provide the link to the paper we're discussing, which also contains the citation details.

The model in the paper encodes the choice of controls through hyperparameters, i.e., each choice of controls corresponds to a particular setting of these hyperparameters. The formal equivalent of the "uncertainty" about this choice is a probability distribution over the hyperparameters: a "prior distribution" before the data are seen, which the data then update to a posterior distribution. Now instead of choosing one of the possible sets of controls (which would correspond to fixing one particular set of hyperparameters), the authors integrate over the possible hyperparameter choices, weighting each by its probability under that distribution. That is what they mean by "integrating out" the hyperparameters.
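In symbols, the averaging described above can be sketched roughly as follows. The notation is my own and not taken from the paper: \gamma stands for one particular choice of controls, \beta for the regression coefficients of those controls, y for the observed data, and \tilde{y} for the quantity being predicted.

```latex
% Average the predictions of all candidate control sets, weighted by their plausibility
p(\tilde{y} \mid y) = \sum_{\gamma} p(\tilde{y} \mid \gamma, y)\, p(\gamma \mid y)

% Each weight combines how well the set explains the data with its prior probability
p(\gamma \mid y) \propto p(y \mid \gamma)\, p(\gamma)

% Within each set, the coefficients are integrated out as well
p(\tilde{y} \mid \gamma, y) = \int p(\tilde{y} \mid \beta, \gamma)\, p(\beta \mid \gamma, y)\, d\beta
```

No single \gamma is ever selected; every candidate set contributes in proportion to its weight, and the prior p(\gamma) is where a preference for small sets enters.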

The reason why this strategy is less prone to overfitting than the fixed choice of one set of controls has a lot to do with the concrete prior distribution chosen in the paper. The authors choose a prior which gives a certain advantage to small sets of controls, but is otherwise rather "flat".

This means that the prior is constructed to avoid the two main sources of overfitting that exist in this setting:

The advantage for small sets of controls matters most when the pool of candidate controls is large and diverse: use enough of them simultaneously and you can approximate every strange detail (i.e., noise) of your target time series. The prior chosen by the authors therefore penalizes large sets of controls.

The "otherwise rather flat" part means that we don't commit to the "best" choice for a set of controls (where "best" may be so good that we even fit the noise in the data). Instead, we average over a set of "decent" choices for the set of controls. While some of these "decent" choices may in fact be overfitting, this problem is greatly moderated by the averaging.
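To make the two points concrete, here is a minimal Python sketch of averaging over control sets. It is not the CausalImpact implementation: the package fits a Bayesian structural time-series model by sampling, whereas this toy enumerates every subset of six made-up controls, approximates each subset's evidence with BIC, and uses an independent inclusion prior. The data, the `inclusion_prob` value, and all function names are assumptions made for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, entirely made up: only the first two controls matter.
n_train, n_test, n_controls = 40, 40, 6
X = rng.normal(size=(n_train + n_test, n_controls))
y = 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n_train + n_test)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]


def design(X_sub):
    """Prepend an intercept column to the selected controls."""
    return np.column_stack([np.ones(X_sub.shape[0]), X_sub])


def fit_ols(X_sub, y_obs):
    """Return least-squares coefficients and the residual sum of squares."""
    A = design(X_sub)
    beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
    rss = float(np.sum((y_obs - A @ beta) ** 2))
    return beta, rss


# Hypothetical prior: each control is included independently with this
# probability. Values below 0.5 give smaller sets a higher prior weight.
inclusion_prob = 0.2

subsets, log_weights, preds = [], [], []
for k in range(n_controls + 1):
    for subset in itertools.combinations(range(n_controls), k):
        cols = np.array(subset, dtype=int)
        beta, rss = fit_ols(X_tr[:, cols], y_tr)
        # BIC approximation to the log evidence (an assumption of this sketch).
        log_evidence = -0.5 * (n_train * np.log(rss / n_train)
                               + (k + 1) * np.log(n_train))
        log_prior = (k * np.log(inclusion_prob)
                     + (n_controls - k) * np.log(1 - inclusion_prob))
        subsets.append(subset)
        log_weights.append(log_evidence + log_prior)
        preds.append(design(X_te[:, cols]) @ beta)

# Normalise the weights in a numerically stable way.
log_weights = np.array(log_weights)
weights = np.exp(log_weights - log_weights.max())
weights /= weights.sum()

# Averaged prediction: every subset contributes according to its weight.
averaged_pred = weights @ np.array(preds)

# For comparison: commit to the single set with the best in-sample fit,
# which for nested least-squares models is the set using all controls.
full_pred = preds[-1]

# Posterior inclusion probability of each control.
inclusion_probs = np.array([
    sum(w for w, s in zip(weights, subsets) if j in s)
    for j in range(n_controls)
])


def rmse(pred):
    return float(np.sqrt(np.mean((y_te - pred) ** 2)))


print("inclusion probabilities:", np.round(inclusion_probs, 3))
print("held-out RMSE, averaged:", rmse(averaged_pred))
print("held-out RMSE, all controls:", rmse(full_pred))
```

The set using all controls always has the lowest in-sample error, which is exactly why committing to the "best" fit is risky. In the average, sets that include pure-noise controls are not forbidden; they simply tend to receive little weight, so their influence on the prediction is damped. How the two held-out errors compare depends on the data and the seed.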