9.5 Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a mixed-membership multinomial
clustering model (Blei, Ng, and Jordan 2003) that generalizes naive
Bayes. In the topic and document terminology common in discussions of
LDA, each document is modeled as a mixture of topics, with each word
drawn from a topic selected according to the document's mixing
proportions.

The LDA Model

The basic model assumes each document is generated independently based
on fixed hyperparameters. For document \(m\), the first step is to draw a topic
distribution simplex \(\theta_m\) over the \(K\) topics,

\[
\theta_m \sim \mathsf{Dirichlet}(\alpha).
\]

The prior hyperparameter \(\alpha\) is fixed to a \(K\)-vector of positive
values. Each word in the document is generated independently
conditional on the distribution \(\theta_m\). First, a topic
\(z_{m,n} \in 1{:}K\) is drawn for the word based on the
document-specific topic distribution,
\[
z_{m,n} \sim \mathsf{Categorical}(\theta_m).
\]

Finally, the word \(w_{m,n}\) is drawn according to the word distribution
for topic \(z_{m,n}\),
\[
w_{m,n} \sim \mathsf{Categorical}(\phi_{z[m,n]}).
\]
The distributions \(\phi_k\) over words for topic \(k\) are also given a
Dirichlet prior,
\[
\phi_k \sim \mathsf{Dirichlet}(\beta),
\]

where \(\beta\) is a fixed \(V\)-vector of positive values.

Summing out the Discrete Parameters

Although Stan does not (yet) support discrete sampling, it is possible
to calculate the marginal distribution over the continuous parameters
by summing out the discrete parameters as in other mixture models.
The marginal posterior of the topic and word variables is
\[
\begin{array}{rcl}
p(\theta, \phi \mid w, \alpha, \beta)
& \propto &
p(\theta \mid \alpha) \, p(\phi \mid \beta) \, p(w \mid \theta, \phi)
\\[6pt]
& = &
\displaystyle
\prod_{m=1}^M \mathsf{Dirichlet}(\theta_m \mid \alpha)
\times \prod_{k=1}^K \mathsf{Dirichlet}(\phi_k \mid \beta)
\times \prod_{m=1}^M \prod_{n=1}^{N_m}
\sum_{k=1}^K \theta_{m,k} \, \phi_{k, w[m,n]},
\end{array}
\]
where \(N_m\) is the number of words in document \(m\).

As in the other mixture models, the log-sum-of-exponents function is
used to stabilize the numerical arithmetic.
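A sketch of the marginalized model in Stan follows. The data layout,
with the words of all documents flattened into a single array w
indexed by a parallel document-ID array doc, is one convenient choice
and is assumed here rather than mandated by the model.

```stan
data {
  int<lower=2> K;                      // num topics
  int<lower=2> V;                      // vocabulary size
  int<lower=1> M;                      // num documents
  int<lower=1> N;                      // total word instances
  array[N] int<lower=1, upper=V> w;    // word n
  array[N] int<lower=1, upper=M> doc;  // document ID for word n
  vector<lower=0>[K] alpha;            // topic prior
  vector<lower=0>[V] beta;             // word prior
}
parameters {
  array[M] simplex[K] theta;   // topic distribution for document m
  array[K] simplex[V] phi;     // word distribution for topic k
}
model {
  for (m in 1:M) {
    theta[m] ~ dirichlet(alpha);
  }
  for (k in 1:K) {
    phi[k] ~ dirichlet(beta);
  }
  for (n in 1:N) {
    vector[K] gamma;
    for (k in 1:K) {
      // log joint of topic k and word w[n] in document doc[n]
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, w[n]]);
    }
    // marginalize the discrete topic assignment z[m, n]
    target += log_sum_exp(gamma);
  }
}
```

The inner loop computes the log of each summand in the marginal
posterior; log_sum_exp then adds the word's marginal log likelihood to
the target density.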

Correlated Topic Model

To account for correlations in the distribution of topics for
documents, Blei and Lafferty (2007) introduced a variant of LDA in
which the Dirichlet prior on the per-document topic distribution is
replaced with a multivariate logistic normal distribution.

The authors treat the prior as a fixed hyperparameter. They use an
\(L_1\)-regularized estimate of covariance, which is equivalent to the
maximum a posteriori estimate given a double-exponential prior. Stan
does not (yet) support maximum a posteriori estimation, so the mean and
covariance of the multivariate logistic normal must be specified as
data.

Fixed Hyperparameter Correlated Topic Model

The Stan model in the previous section can be modified to implement
the correlated topic model by replacing the Dirichlet topic prior
alpha in the data declaration with the mean and covariance of
the multivariate logistic normal prior.
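Concretely, the modified declarations and prior might be sketched as
follows; the name eta for the unconstrained per-document topic log
odds is illustrative, not fixed by the model.

```stan
data {
  // ... as in the LDA model, but replacing
  // vector<lower=0>[K] alpha with:
  vector[K] mu;          // topic mean
  cov_matrix[K] Sigma;   // topic covariance
}
parameters {
  array[M] vector[K] eta;    // unconstrained topic log odds for doc m
  array[K] simplex[V] phi;   // word distribution for topic k
}
transformed parameters {
  array[M] simplex[K] theta;
  for (m in 1:M) {
    theta[m] = softmax(eta[m]);   // map log odds to the topic simplex
  }
}
model {
  for (m in 1:M) {
    eta[m] ~ multi_normal(mu, Sigma);   // logistic normal topic prior
  }
  // ... remainder as in the LDA model
}
```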

Full Bayes Correlated Topic Model

By adding a prior for the mean and covariance, Stan supports full
Bayesian inference for the correlated topic model. This requires
moving the declarations of topic mean mu and covariance Sigma
from the data block to the parameters block and providing them with
priors in the model. A relatively efficient and interpretable prior
for the covariance matrix Sigma may be encoded as follows.
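One such encoding decomposes Sigma into a correlation matrix and a
vector of scales; the particular priors on mu and sigma below are
illustrative weakly informative choices, not the only options.

```stan
parameters {
  vector[K] mu;               // topic mean
  corr_matrix[K] Omega;       // topic correlation matrix
  vector<lower=0>[K] sigma;   // topic scales
  // ... remaining parameters as before
}
transformed parameters {
  // Sigma = diag(sigma) * Omega * diag(sigma)
  cov_matrix[K] Sigma = quad_form_diag(Omega, sigma);
}
model {
  mu ~ normal(0, 5);       // weakly informative prior on the mean
  Omega ~ lkj_corr(2);     // weakly favors the unit correlation matrix
  sigma ~ cauchy(0, 5);    // half-Cauchy given the zero lower bound
  // ... remainder as in the correlated topic model
}
```

Separating scale from correlation in this way makes each component of
the prior directly interpretable.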

The \(\mathsf{LkjCorr}\) distribution with shape \(\alpha > 0\) has support
on correlation matrices (i.e., symmetric positive definite with unit
diagonal). Its density is defined by
\[
\mathsf{LkjCorr}(\Omega \mid \alpha) \propto \mbox{det}(\Omega)^{\alpha - 1}.
\]
With shape \(\alpha = 2\), this weakly informative prior favors the
unit correlation matrix. The compound effect of this prior on the
covariance matrix \(\Sigma\) of the multivariate logistic normal is
thus a slight concentration around diagonal covariance matrices whose
scales are determined by the prior on sigma.