This post describes two approaches for deriving the Expected Lower Bound (ELBO) used in variational inference. Let us begin with a little bit of motivation.

Consider a probabilistic model where we are interested in maximizing the marginal likelihood $p(X)$ for which direct optimization is difficult, but optimizing complete-data likelihood $p(X, Z)$ is significantly easier.

In a bayesian setting, we condition on the data $X$ and compute the posterior distribution $p(Z | X)$ over the latent variables given our observed data. This may however require approximate inference. There are two general approaches, sampling using MCMC, and optimization using variational inference.

The main idea behind variational inference is to consider a family of densities $\mathcal(Q)$ over the latent variables, and use optimization to find $q(Z)$ that approximates our target posterior $p(Z | X)$. We measure this using the Kullback-Leiber divergence, that is

Now comes the interesting part. Because we are interested in optimizing by changing $q$, the $\log p(X)$ does not change when $q$ changes. And because the KL divergence between $q(Z)$ and $p(Z | X)$ is always positive, then $\mathcal{L}(g)$ must be a lower bound on $\log p(X)$. As a result, because changing the ELBO by manipulating $q$ does not change $\log p(X)$, the expression on the right must be equal to a constant, which means that increasing $\mathcal{L}(g)$ must decrease $KL(q(Z) || p(Z|X))$. But this is what we wanted all along!

If we find a way to maximize the ELBO, we are effectively minimizing the KL divergence between our approximate distribution $q(Z)$, and our target posterior distribution $p(Z | X)$. If we were to choose $q(Z) = p(Z | X)$, the KL divergence would be zero, and $\mathcal{L}(g) = \log p(X)$. This justifies maximizing the ELBO as an objective in variational inference.

ELBO using Jensen’s inequality

The Jensen’s inequality will give us a bit of motivation behind the ELBO.

In simple terms, Jensen’s inequality states that for a convex function $f(x)$ and a random variable $X$ we get

$$
E[g(X)] \geq g(E[X]).
$$

Recall that we’re interested in

$$
\log p(X) = \log \left( \sum_Z p(X, Z) \right).
$$

Introducing a new density $q(Z)$ on the latent variable $Z$ we can re-write the last equation as