Latent Dirichlet Allocation

What is it?

Broadly speaking, LDA is an unsupervised method for clustering documents. It models (cleaned) documents as bags of words. It also assumes that each word (and each document) is a mixture of topics, i.e. each word (and document) may belong to each of the topics with some probability. It takes the number of clusters (topics) in the corpus as input and then simply assigns each word in each document a random topic. Then tries for

That was a very general description of LDA.
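To make the starting point concrete, here is a minimal Python sketch of the two ideas described above: turning documents into bags of words, and giving every word occurrence a random initial topic. The toy corpus, the value of K, and the variable names are assumptions made up for illustration, not part of any LDA library.

```python
import random
from collections import Counter

# Toy corpus (hypothetical example data, already cleaned and tokenized).
documents = [
    "apple banana apple fruit".split(),
    "goal team match goal".split(),
    "fruit team banana match".split(),
]

K = 2  # number of topics, supplied by the user (a guess)

# Bag of words: word order is discarded, only the counts are kept.
bags = [Counter(doc) for doc in documents]

# Random initialization: every word occurrence gets a topic in 0..K-1.
rng = random.Random(0)  # fixed seed, only so the run is repeatable
assignments = [[rng.randrange(K) for _ in doc] for doc in documents]

for doc, bag, topics in zip(documents, bags, assignments):
    print(dict(bag), list(zip(doc, topics)))
```

Note that the same word can receive different topics in different positions, which is what lets a word belong to several topics with different probabilities.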

How does it work?

The LDA process relies on the bag-of-words model of documents. First of all, the number of topics K is an input to LDA (a guess!). We have D documents in total and a vocabulary of V distinct words across the document set. The generative process is:

For k = 1 … K:
    φ(k) ∼ Dirichlet(β)

For each document d in D:
    θd ∼ Dirichlet(α)
    For each word wi in d:
        zi ∼ Discrete(θd)
        wi ∼ Discrete(φ(zi))

This is the whole process. But what does it mean?

Dirichlet

In simple words, the Dirichlet is a probability distribution with K concentration parameters. Each parameter (α) is a real number greater than zero (α > 0), and each sample drawn from it is a vector of K probabilities that sum to 1. As an example, consider a Dirichlet distribution for 20 documents with 4 topics, with parameters α1 = 10, α2 = 5, α3 = 3, and α4 = 20: each of the 20 documents gets its own draw, a vector of 4 topic proportions.
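The 20-document, 4-topic example can be reproduced with a few lines of Python. This is a sketch using only the standard library: it relies on the standard fact that normalizing independent Gamma(αi, 1) draws gives a Dirichlet(α) sample. The helper name sample_dirichlet and the seed are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

alpha = [10, 5, 3, 20]   # concentration parameters of the example
rng = random.Random(42)  # fixed seed, only so the run is repeatable

# One draw per document: each row is that document's mixture over 4 topics.
samples = [sample_dirichlet(alpha, rng) for _ in range(20)]

for row in samples[:3]:
    print([round(p, 3) for p in row], "sum =", round(sum(row), 3))

# Averaged over many draws, component i approaches alpha_i / sum(alpha).
mean = [sum(row[i] for row in samples) / len(samples) for i in range(4)]
print("empirical mean:", [round(m, 3) for m in mean])
print("expected mean: ", [round(a / sum(alpha), 3) for a in alpha])
```

Larger parameters pull more probability mass toward their topic, so with these values most documents lean toward the fourth topic and least toward the third.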

φ(k)

φ(k) is drawn from a Dirichlet distribution for the kth topic. Taken together, φ is a K×V matrix in which each element is the probability of the vth word belonging to the kth topic, so each row sums to 1.

θd

Similarly, θd is drawn from a Dirichlet distribution for document d. It shows how strongly the document belongs to each of the topics.
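To see how the two matrices fit together, the sketch below writes out a tiny topic-word matrix and a tiny document-topic matrix by hand and combines them. All numbers, the vocabulary, and the helper word_probability are hypothetical values chosen for illustration. The combination rule itself follows from the mixture assumption: the probability of a word in a document is the sum over topics of (document's weight on the topic) × (topic's probability for the word).

```python
vocab = ["apple", "banana", "goal", "team"]  # V = 4 (hypothetical)

# phi: K x V, row k is topic k's distribution over the vocabulary.
phi = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0, mostly fruit words
    [0.05, 0.05, 0.50, 0.40],  # topic 1, mostly sports words
]

# theta: D x K, row d is document d's distribution over the topics.
theta = [
    [0.9, 0.1],  # document 0 is mostly about topic 0
    [0.2, 0.8],  # document 1 is mostly about topic 1
]

# Every row of both matrices is a probability distribution.
for row in phi + theta:
    assert abs(sum(row) - 1.0) < 1e-9

def word_probability(d, v):
    """P(word v | document d) = sum over k of theta[d][k] * phi[k][v]."""
    return sum(theta[d][k] * phi[k][v] for k in range(len(phi)))

apple = vocab.index("apple")
print("P(apple | doc 0) =", round(word_probability(0, apple), 3))
print("P(apple | doc 1) =", round(word_probability(1, apple), 3))
```

The fruit-heavy document gives "apple" a much higher probability than the sports-heavy one, even though both documents use the same two topics.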

Finally

The process, in simple words, is as follows:

For each topic:
    Randomly draw the probability of each vocabulary word belonging to that topic (φ(k)).

For each document:
    Randomly draw the probability of the current document belonging to each topic (θd).
    For each word:
        Choose a topic zi from θd.
        Randomly choose a word from φ(k), where k is the topic selected in the previous step.

The last step helps us find words similar to the currently chosen one, so that they end up in the same cluster.
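The steps above can be sketched as a small Python program that runs the generative story forward and produces synthetic documents. This is only an illustration of the generative process, not of how LDA is fitted to real data. The function generate_corpus, the fixed document length, the symmetric α and β values, and the toy vocabulary are all assumptions introduced for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [x / total for x in draws]

def generate_corpus(vocab, K, D, doc_length, alpha, beta, seed=0):
    """Run the generative process with symmetric alpha and beta (an assumption)."""
    rng = random.Random(seed)
    V = len(vocab)

    # For each topic: draw its word distribution phi(k) ~ Dirichlet(beta).
    phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]

    theta, docs = [], []
    for _ in range(D):
        # For each document: draw its topic mixture theta_d ~ Dirichlet(alpha).
        theta_d = sample_dirichlet([alpha] * K, rng)
        words = []
        for _ in range(doc_length):
            # Choose a topic z_i from theta_d ...
            z = rng.choices(range(K), weights=theta_d)[0]
            # ... then choose a word w_i from that topic's distribution.
            w = rng.choices(vocab, weights=phi[z])[0]
            words.append(w)
        theta.append(theta_d)
        docs.append(words)
    return phi, theta, docs

vocab = ["apple", "banana", "fruit", "goal", "team", "match"]  # toy vocabulary
phi, theta, docs = generate_corpus(vocab, K=2, D=3, doc_length=8,
                                   alpha=0.5, beta=0.5)

for theta_d, words in zip(theta, docs):
    print([round(p, 2) for p in theta_d], " ".join(words))
```

Each printed line shows a document's topic mixture followed by the words generated from it. Fitting LDA to real documents goes in the opposite direction: the words are observed, and φ, θ, and the topic assignments have to be inferred.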

In the next post, I will explore the mathematics behind LDA. Any comments?

Mir Saman

I'm currently an IT PhD candidate at Urmia University. My academic interests are Social Network Analysis, Big Data Mining, and NLP; outside of that, I enjoy Guitar, Nature, and Android!