All Bayesian Models are Generative (in Theory)

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

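Suppose we have predictor matrix $X$ and outcome vector $y$. In regression modeling, we assume a coefficient vector $\beta$ and explicitly model the outcome likelihood $p(y \mid \beta, X)$ and coefficient prior $p(\beta)$. But what about $X$? If we were modeling $X$, we’d have a full likelihood $p(X, y \mid \beta, \psi)$, where $\psi$ are the additional parameters involved in modeling $X$ and the prior is now joint, $p(\beta, \psi)$.

So how do we justify conditional modeling as Bayesians? We simply assume that $\psi$ and $\beta$ have independent priors, so that

$$p(\beta, \psi) = p(\beta) \, p(\psi).$$

Together with the regression setup itself, in which the full likelihood splits as $p(X, y \mid \beta, \psi) = p(X \mid \psi) \, p(y \mid \beta, X)$, the posterior then neatly factors as

$$p(\beta, \psi \mid y, X) = p(\psi \mid X) \, p(\beta \mid y, X).$$

Looking at just the inference for the regression coefficients $\beta$, we have the familiar expression

$$p(\beta \mid y, X) \propto p(\beta) \, p(y \mid \beta, X).$$

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for $\psi$. To quote BDA,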

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given others than a joint distribution on all variables.
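The factorization argument can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not in BDA or the post: a single predictor drawn as Normal(psi, 1), an outcome drawn as Normal(beta * x, 1), normal priors on both parameters, and a grid approximation in place of real inference. It computes the posterior for beta twice, once by marginalizing the full joint posterior over psi and once from the conditional model that never mentions psi, and compares the two.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model (every distribution here is an assumption):
#   x_n ~ Normal(psi, 1)          model for the predictor
#   y_n ~ Normal(beta * x_n, 1)   regression of the outcome on the predictor
psi_true, beta_true = 1.5, -0.7
x = rng.normal(psi_true, 1.0, size=20)
y = rng.normal(beta_true * x, 1.0)

beta_grid = np.linspace(-3.0, 3.0, 301)
psi_grid = np.linspace(-2.0, 5.0, 281)


def log_normal(z, mu, sigma):
    """Log density of Normal(mu, sigma) evaluated at z."""
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)


# Independent priors: p(beta, psi) = p(beta) * p(psi).
log_prior_beta = log_normal(beta_grid, 0.0, 2.0)
log_prior_psi = log_normal(psi_grid, 0.0, 5.0)

# log p(y | beta, X) on the beta grid and log p(X | psi) on the psi grid.
log_lik_y = log_normal(y[None, :], beta_grid[:, None] * x[None, :], 1.0).sum(axis=1)
log_lik_x = log_normal(x[None, :], psi_grid[:, None], 1.0).sum(axis=1)

# Full joint posterior over (beta, psi), then marginalize out psi.
log_joint = (log_prior_beta + log_lik_y)[:, None] + (log_prior_psi + log_lik_x)[None, :]
joint = np.exp(log_joint - log_joint.max())
joint /= joint.sum()
beta_from_joint = joint.sum(axis=1)

# Conditional model only: p(beta | y, X) is proportional to p(beta) * p(y | beta, X).
log_cond = log_prior_beta + log_lik_y
beta_from_conditional = np.exp(log_cond - log_cond.max())
beta_from_conditional /= beta_from_conditional.sum()

max_abs_diff = np.max(np.abs(beta_from_joint - beta_from_conditional))
print(f"largest difference between the two posteriors: {max_abs_diff:.2e}")
```

The two grids agree up to floating-point error. If the prior on (beta, psi) were not independent, the joint posterior would no longer be an outer product and the shortcut would fail.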

A Bayesian model plus data defines a posterior. Full Bayesian inference is uniquely determined by the model.

There are techniques for approximate Bayesian inference. You can use a point estimate based on MAP (equivalent to regularized or penalized MLE) or on the posterior mean (or its L1 equivalent, the posterior median, which minimizes a different point-estimate loss function). Or you can approximate the whole posterior using variational inference, expectation propagation, or a Laplace approximation.
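To make the different summaries concrete, here is a small sketch on a made-up one-parameter posterior. The Beta(8, 4) posterior (for instance, a uniform prior updated with 7 successes in 10 trials) and the grid approach are assumptions chosen for illustration. It computes the MAP estimate, the posterior mean, the posterior median, and the standard deviation of a Laplace approximation centered at the mode.

```python
import numpy as np

# Hypothetical posterior: Beta(a, b) with a = 8, b = 4.
a, b = 8.0, 4.0

# Midpoint grid on (0, 1) with normalized posterior weights.
n_grid = 10_000
theta = (np.arange(n_grid) + 0.5) / n_grid
log_post = (a - 1) * np.log(theta) + (b - 1) * np.log1p(-theta)
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

# Point estimates: mode (MAP), mean (squared-error loss), median (absolute-error loss).
map_est = theta[np.argmax(weights)]
mean_est = np.sum(theta * weights)
median_est = theta[np.searchsorted(np.cumsum(weights), 0.5)]

# Laplace approximation: a normal centered at the mode whose variance is the
# negative inverse of the second derivative of the log posterior at the mode.
curvature = -(a - 1) / map_est**2 - (b - 1) / (1 - map_est) ** 2
laplace_sd = np.sqrt(-1.0 / curvature)

# Posterior standard deviation from the grid, for comparison.
grid_sd = np.sqrt(np.sum((theta - mean_est) ** 2 * weights))

print(f"MAP {map_est:.4f}  mean {mean_est:.4f}  median {median_est:.4f}")
print(f"Laplace sd {laplace_sd:.4f}  grid sd {grid_sd:.4f}")
```

For this skewed posterior the three point estimates differ (the closed-form mode is 0.7 and the mean is 2/3), and the Laplace standard deviation does not match the grid value exactly. None of these summaries alters the model; they only summarize or approximate the same posterior.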

But none of this changes a “generative” model to a “discriminative” one.

Statistics isn’t really about the “why”. You can build causal experiments or buy into the whole Judea Pearl structural graphical modeling thing, but the bottom line is that neither classical frequentist stats nor Bayesian stats really try to answer the “why” question.

With a Bayesian posterior, you do model the posterior covariance. But the “why” is a different question.

You are never going to ensure independence of parameters in models. Even in a simple linear regression, $y = \alpha + \beta x + \epsilon$, you find correlation of the slope $\beta$ and intercept $\alpha$. In a more realistic case, if I use income and education as predictors, you’ll find the coefficients correlated because the predictors are correlated. You could decorrelate the predictors for any given sample using SVD, but it’s a lousy assumption that the resulting transform will decorrelate the entire population.
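The slope and intercept correlation is easy to see numerically. The sketch below assumes a flat prior and known noise scale, so that the posterior covariance of (intercept, slope) is proportional to the inverse of X'X; the predictor distribution, Normal(5, 1), is made up. It also shows that centering the predictor removes the correlation for this particular sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictor whose values sit well away from zero.
x = rng.normal(5.0, 1.0, size=200)


def intercept_slope_corr(x):
    """Correlation of (intercept, slope) implied by inv(X'X), which is
    proportional to the posterior covariance under a flat prior and
    known noise scale (both assumptions of this sketch)."""
    X = np.column_stack([np.ones_like(x), x])
    cov = np.linalg.inv(X.T @ X)
    sd = np.sqrt(np.diag(cov))
    return cov[0, 1] / (sd[0] * sd[1])


corr_raw = intercept_slope_corr(x)
corr_centered = intercept_slope_corr(x - x.mean())

print(f"raw predictor:      corr(intercept, slope) = {corr_raw:.3f}")
print(f"centered predictor: corr(intercept, slope) = {corr_centered:.3f}")
```

With the raw predictor the correlation is strongly negative; after subtracting the sample mean it is zero up to rounding. The centering constant comes from this sample, so it plays the same role as the per-sample SVD transform mentioned above.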

I overstated the above. You can do causal inference in various ways using statistics. I just meant that statistical inference itself isn’t specific to causal inference and doesn’t intrinsically say anything about causation.

In the linear regression you can often break that correlation by adding more variables. For example, in the time series case you can add Y_0 as a regressor for Y_t (ignoring problems of endogeneity for now). In general we want to explain fixed effects.

Also, that two regressors covary is neither necessary nor sufficient for the parameters attached to them to covary as well. It might be an empirical regularity, but it need not hold.

Maybe I am not understanding, but if I play God for a minute and build my own model to generate my own data, I can draw x1 and x2 from a multivariate Normal where they covary, draw independent parameters b1 and b2, and make y = b1 x1 + b2 x2.

I thought we were talking about statistical inference? Both Bayesians and frequentists view the parameters b1 and b2 as being fixed, and the problem as one of inferring what they are.

I can’t speak for what Nature can and cannot do, but if the multivariate distribution It creates is correlated for the predictors X1[n] and X2[n], then I, as a lowly observer of data, will get correlation in my estimates for b1 and b2.
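A quick simulation illustrates the point. The sketch below makes several assumptions that are not in the discussion: fixed true values b1 = 1 and b2 = 2, unit-variance normal noise added to y (without noise the estimates would be exact), 100 observations per data set, and least squares standing in for the estimator. It repeatedly draws data with predictor correlation rho and reports the correlation between the resulting estimates of b1 and b2.

```python
import numpy as np


def estimate_corr(rho, n=100, reps=2000, b=(1.0, 2.0), seed=0):
    """Correlation between least-squares estimates of b1 and b2 across
    repeated data sets whose two predictors have correlation rho."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    b = np.asarray(b)
    estimates = np.empty((reps, 2))
    for r in range(reps):
        # Nature draws correlated predictors; b1 and b2 stay fixed.
        X = rng.multivariate_normal(np.zeros(2), cov, size=n)
        y = X @ b + rng.normal(0.0, 1.0, size=n)
        estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(estimates, rowvar=False)[0, 1]


corr_correlated = estimate_corr(rho=0.8)
corr_independent = estimate_corr(rho=0.0)

print(f"predictor correlation 0.8 -> estimate correlation {corr_correlated:.2f}")
print(f"predictor correlation 0.0 -> estimate correlation {corr_independent:.2f}")
```

The true b1 and b2 never change between replications, yet with positively correlated predictors the estimates come out strongly negatively correlated, while with uncorrelated predictors the correlation is near zero.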

If I knew what the multivariate normal was that they were drawn from, I could decompose it into two different, independent variables Y1[n] and Y2[n] plus the translation, rotation and scaling matrix. So if you can model the generative process of X1 and X2, then you’d have the whole story. But there’d still be correlation in your posterior for all of these parameters (or a non-diagonal Fisher information matrix if you’re a frequentist).

The confusion arises because when I read “generative” models I think of the models Nature uses to generate the data we observe, otherwise known as structural models. Hence I thought you were imposing constraints on Nature.