Academic

Recently I’ve interested in byte code structure of Java and Dalvik. I’ve found some useful tools for playing with them.

Destination Byte Code

Java byte codes are simple to reverse engineering because they compile in run time. i.e. JVM will execute the byte codes in run time, thus Java code is cross platform but executes with more delay than direct compiled machine codes (for example using C++ and gcc).

In the previous post, I’ve described the LDA process and how it can be applied on documents.

In this post I will explain how the probabilities can be estimated using collapsed Gibbs sampling.

Lets start with the LDA Probabilistic Graph Model.

Where W is the sampled word from document, Z is the topic assigned by Document (d), θ is the Dirichlet distribution of d, α and β are the input of Dirichlets. More info about hyperparameters can be found here.

So the only known variables are α, β, and w. All others (z, θ, and φ) are unknown. So based on the LDA graph we have:

p(w, z, θ, φ | α, β) = p(φ|β) p(θ|α) p(z|θ) p(w|φz)

The right side of the above conditional probability can be reached by the probabilistic graph model where each variable only depends on its parent nodes.

What is it?

In a general view, LDA is an unsupervised method for clustering documents. It models (purified) documents as bag of words. Also it assumes each word (and document) has a mixture model of topics i.e. each word (and document) may belongs to each of the topics by a probability. It takes number of clusters in the corpus as input then, simply assigns each word in each document a random topic. Then tries for