“Bag of Words” Models

Mixture of Unigrams

For each document, choose a single topic z, then choose N words by drawing each one independently from a multinomial conditioned on z.

In the Mixture of Unigrams model, we can only have one topic per document!
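The generative process above can be sketched in a few lines of Python. This is only an illustration: the topic prior, the word distributions, and the function name are hypothetical values chosen for the example, and the step that draws z from a prior over topics is an assumption about how the single topic is chosen.

```python
import random

def sample_mixture_of_unigrams(topic_prior, word_dists, n_words, rng):
    """Draw one document: a single topic z, then n_words words given z."""
    topics = list(range(len(topic_prior)))
    # One topic is chosen for the whole document (assumed drawn from a prior p(z)).
    z = rng.choices(topics, weights=topic_prior, k=1)[0]
    vocab = list(range(len(word_dists[z])))
    # Every word is drawn independently from the same multinomial p(w | z).
    words = rng.choices(vocab, weights=word_dists[z], k=n_words)
    return z, words

rng = random.Random(0)
topic_prior = [0.5, 0.5]                 # hypothetical p(z)
word_dists = [[0.5, 0.5, 0.0, 0.0],      # hypothetical p(w | z=0)
              [0.0, 0.0, 0.5, 0.5]]      # hypothetical p(w | z=1)
z, words = sample_mixture_of_unigrams(topic_prior, word_dists, 6, rng)
print(z, words)
```

Because z is drawn once per document, every word in the sample comes from the same topic's vocabulary.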

The pLSI Model

For each word of document d in the training set,

Choose a topic z according to a multinomial conditioned on the index d.

Generate the word by drawing from a multinomial conditioned on z.

In pLSI, documents can have multiple topics.
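A minimal sketch of the pLSI generative steps, for comparison with the Mixture of Unigrams. The tables p(z|d) and p(w|z) and all names below are hypothetical; the point is only that a fresh topic is drawn for every word.

```python
import random

def sample_plsi_document(d, topic_given_doc, word_given_topic, n_words, rng):
    """Generate (topic, word) pairs for training document index d."""
    topics = list(range(len(word_given_topic)))
    pairs = []
    for _ in range(n_words):
        # Choose a topic from the multinomial conditioned on the index d.
        z = rng.choices(topics, weights=topic_given_doc[d], k=1)[0]
        vocab = list(range(len(word_given_topic[z])))
        # Generate the word from the multinomial conditioned on z.
        w = rng.choices(vocab, weights=word_given_topic[z], k=1)[0]
        pairs.append((z, w))
    return pairs

rng = random.Random(1)
topic_given_doc = [[0.5, 0.5],            # hypothetical p(z | d=0)
                   [0.9, 0.1]]            # hypothetical p(z | d=1)
word_given_topic = [[0.5, 0.5, 0.0, 0.0], # hypothetical p(w | z=0)
                    [0.0, 0.0, 0.5, 0.5]] # hypothetical p(w | z=1)
pairs = sample_plsi_document(0, topic_given_doc, word_given_topic, 8, rng)
print(pairs)
```

Note that the function can only be called with a d that indexes a row of topic_given_doc, which is the limitation with unseen documents.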

[Figure: Probabilistic Latent Semantic Indexing (pLSI) model, showing the document index d, the per-word topics zd1 to zd4, and the words wd1 to wd4.]

Motivations for LDA

In pLSI, the observed variable d is an index into some training set. There is no natural way for the model to handle previously unseen documents.

The number of parameters for pLSI grows linearly with M (the number of documents in the training set).
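As a rough illustration of this growth, the sketch below counts parameters under the assumption that pLSI stores k word distributions over a vocabulary of size V plus one k-topic mixture for each of the M training documents. The function name and the numbers are hypothetical.

```python
def plsi_param_count(k, vocab_size, n_docs):
    """Assumed count: k multinomials over the vocabulary plus k mixture weights per document."""
    return k * vocab_size + k * n_docs

small = plsi_param_count(k=50, vocab_size=10000, n_docs=1000)
large = plsi_param_count(k=50, vocab_size=10000, n_docs=2000)
print(small, large)  # 550000 600000
```

Each additional training document adds k parameters, so the count never stops growing with the corpus.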

We would like to be Bayesian about our topic mixture proportions.

Dirichlet Distributions

In the LDA model, we would like to say that the topic mixture proportions for each document are drawn from some distribution.

So, we want to put a distribution on multinomials. That is, k-tuples of non-negative numbers that sum to one.

The space of all of these multinomials has a nice geometric interpretation as a (k-1)-simplex, which is just a generalization of a triangle to (k-1) dimensions.

Criteria for selecting our prior:

It needs to be defined for a (k-1)-simplex.

Algebraically speaking, we would like it to play nice with the multinomial distribution.

Dirichlet Examples

Dirichlet Distributions

Useful Facts:

This distribution is defined over a (k-1)-simplex. That is, it takes k non-negative arguments which sum to one. Consequently it is a natural distribution to use over multinomial distributions.

In fact, the Dirichlet distribution is the conjugate prior to the multinomial distribution. (This means that if our likelihood is multinomial with a Dirichlet prior, then the posterior is also Dirichlet!)

The Dirichlet parameter i can be thought of as a prior count of the ith class.

The LDA Model



For each document,

Choose ~Dirichlet()

For each of the N words wn:

Choose a topic zn» Multinomial()

Choose a word wn from p(wn|zn,), a multinomial probability conditioned on the topic zn.







[Figure: LDA graphical model, showing topic nodes z1 to z4, word nodes w1 to w4, and the parameter β.]

Inference

The inference problem in LDA is to compute the posterior of the hidden variables given a document and corpus parameters α and β. That is, compute p(θ,z|w,α,β).

Unfortunately, exact inference is intractable, so we turn to alternatives…
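To see where the difficulty comes from, it may help to write the posterior out, with θ the topic proportions, z the per-word topics, w the words, and α and β the corpus parameters. The following is a sketch of the standard factorization implied by the generative process; the normalizing term in the denominator is the part that cannot be computed exactly.

```latex
\begin{align*}
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  &= \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}
          {p(\mathbf{w} \mid \alpha, \beta)}, \\
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  &= p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta), \\
p(\mathbf{w} \mid \alpha, \beta)
  &= \int p(\theta \mid \alpha) \prod_{n=1}^{N} \sum_{z_n}
     p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta .
\end{align*}
```

The integral couples θ and β inside the product over words, which is why it has no closed form.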

Variational Inference

In variational inference, we consider a simplified graphical model with variational parameters γ, φ and minimize the KL divergence between the variational and posterior distributions.
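As an illustration, the sketch below runs the usual coordinate-ascent updates for a single document, assuming the fully factorized variational family with a Dirichlet parameter gamma over topic proportions and a multinomial phi for each word: phi[n][i] is proportional to beta[i][w_n] * exp(digamma(gamma[i])), and gamma[i] = alpha[i] + sum over n of phi[n][i]. The digamma helper is a hand-written approximation used to avoid external libraries, and all names and values are hypothetical.

```python
import math

def digamma(x):
    """Approximate digamma for x > 0 using the recurrence and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x   # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def variational_e_step(word_ids, alpha, beta, n_iters=50):
    """Fit gamma and phi for one document by alternating the two updates."""
    k, n = len(alpha), len(word_ids)
    phi = [[1.0 / k] * k for _ in range(n)]
    gamma = [a + n / k for a in alpha]
    for _ in range(n_iters):
        weights = [math.exp(digamma(g)) for g in gamma]
        for pos, w in enumerate(word_ids):
            unnorm = [beta[i][w] * weights[i] for i in range(k)]
            total = sum(unnorm)
            phi[pos] = [u / total for u in unnorm]   # normalize over topics
        gamma = [alpha[i] + sum(row[i] for row in phi) for i in range(k)]
    return gamma, phi

alpha = [0.5, 0.5]                       # hypothetical Dirichlet parameter
beta = [[0.7, 0.1, 0.1, 0.1],            # hypothetical p(w | z=0)
        [0.1, 0.1, 0.1, 0.7]]            # hypothetical p(w | z=1)
gamma, phi = variational_e_step([0, 0, 0, 3], alpha, beta)
print(gamma)
```

In this toy document three of the four words favor topic 0, so gamma ends up weighted toward that topic.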

Parameter Estimation

Given a corpus of documents, we would like to find the parameters α and β which maximize the likelihood of the observed data.

Strategy (Variational EM):

Lower bound log p(w|α,β) by a function L(γ,φ;α,β)

Repeat until convergence:

Maximize L(γ,φ;α,β) with respect to the variational parameters γ,φ.

Maximize the bound with respect to parameters α and β.
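The relationship between the bound and the log likelihood can be written out as follows. This is a sketch of the standard decomposition, with q denoting the variational distribution over the topic proportions θ and topic assignments z, γ and φ its variational parameters, and α and β the model parameters.

```latex
\begin{align*}
\log p(\mathbf{w} \mid \alpha, \beta)
  &= L(\gamma, \phi; \alpha, \beta)
   + \mathrm{KL}\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\,
       p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big), \\
L(\gamma, \phi; \alpha, \beta)
  &= \mathbb{E}_q\big[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)\big]
   - \mathbb{E}_q\big[\log q(\theta, \mathbf{z} \mid \gamma, \phi)\big].
\end{align*}
```

Since the KL term is non-negative, L is a lower bound, and maximizing L over the variational parameters is the same as minimizing the KL divergence to the posterior.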

Some Results

Given a topic, LDA can return the most probable words.
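Reading off the most probable words amounts to sorting one topic's word probabilities. A minimal sketch, with a made-up vocabulary and a made-up row of topic-word probabilities (these are not results from the experiment described here):

```python
def top_words(topic_word_probs, vocab, n=3):
    """Return the n words with the highest probability under one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: topic_word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]

vocab = ["space", "nasa", "hockey", "team", "orbit"]   # hypothetical vocabulary
topic_row = [0.40, 0.25, 0.05, 0.10, 0.20]             # hypothetical p(w | z) for one topic
best = top_words(topic_row, vocab)
print(best)  # ['space', 'nasa', 'orbit']
```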

For the following results, LDA was trained on 10,000 text articles posted to 20 online newsgroups with 40 iterations of EM. The number of topics was set to 50.