Structured and Scalable Probabilistic Topic Models

Posted on March 24th, 2017

John Paisley, Assistant Professor of Electrical Engineering, spoke about models that extract topics and their structure from text. He first talked about topic models in terms of global and local variables: the topics (distributions over words) are global variables shared across all documents, while the topic proportions are local variables specific to each document. In this bag-of-words approach, word order within a document is ignored and only word counts matter.
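To illustrate the bag-of-words representation, here is a minimal sketch (the toy documents and vocabulary are made up for illustration):

```python
from collections import Counter

# two toy documents; word order is discarded, only counts matter
docs = [
    "topic models extract topics from text",
    "topics are distributions over words",
]

# shared vocabulary across the corpus
vocab = sorted({w for doc in docs for w in doc.split()})

# each document becomes a vector of word counts over that vocabulary
bows = [[Counter(doc.split())[w] for w in vocab] for doc in docs]
```

Each row of `bows` is one document's count vector over the shared vocabulary; this is the input a topic model sees.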

Latent Dirichlet Allocation (LDA) captures the frequency of each word under each topic. John also noted that #LDA can be used for things other than topic modeling:
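A minimal sketch of fitting LDA, here using scikit-learn's LatentDirichletAllocation (the toy corpus and the choice of two topics are assumptions made for illustration, not from the talk):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "apples oranges bananas fruit fruit",
    "fruit apples bananas",
    "stocks bonds markets trading",
    "markets stocks trading bonds bonds",
]

# bag-of-words count matrix: documents x vocabulary
X = CountVectorizer().fit_transform(docs)

# two topics; fit_transform returns each document's topic proportions
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)
```

After fitting, `lda.components_` holds the global variables (per-topic word weights) and `doc_topics` holds the local variables (per-document topic proportions).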

Capturing assumptions with new distributions – is the new data different from what the model expects?

Embedding LDA into more complex model structures

Next he talked about moving beyond the “flat” LDA model, in which:

There is no structural dependency among the topics – e.g. it is not a tree model

All combinations of topics are a priori equally probable

He moved to a hierarchical topic model, in which topics are placed as nodes in a tree structure, with more general topics at the root and inner branches. He uses #Bayesian nonparametric inference to grow the tree (assuming an infinite number of branches coming out of each node), with each document occupying a subtree within the overall tree. This approach can be further extended with a Markov chain that models the transitions between nodes of the tree.
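One way to sketch the “infinite number of branches at each node” idea is a nested Chinese restaurant process draw, where each node picks an existing branch in proportion to its popularity or opens a new one. This is an illustrative sketch, not John's exact model; the `gamma` concentration parameter and the dictionary tree encoding are assumptions:

```python
import random

def crp_draw(counts, gamma):
    # Chinese restaurant process: pick existing branch i with prob ∝ counts[i],
    # or open a new branch with prob ∝ gamma
    total = sum(counts) + gamma
    r = random.random() * total
    for i, c in enumerate(counts):
        r -= c
        if r < 0:
            return i
    return len(counts)  # new branch

def sample_path(tree, depth, gamma):
    # draw one root-to-leaf path; tree maps a node (tuple of choices)
    # to the visit counts of its children
    path, node = [], ()
    for _ in range(depth):
        counts = tree.setdefault(node, [])
        k = crp_draw(counts, gamma)
        if k == len(counts):
            counts.append(0)  # a previously unseen branch
        counts[k] += 1
        path.append(k)
        node = node + (k,)
    return path

random.seed(0)
tree = {}
paths = [sample_path(tree, depth=3, gamma=1.0) for _ in range(100)]
```

Popular branches get reinforced, so the sampled subtrees concentrate on a few paths while new branches always remain possible.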

He next showed how these linkages can be computed using Bayesian inference to calculate posterior probabilities for both local and global variables: the joint likelihood of the global and local variables can be factored into a product over documents, conditional on the global variables.
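In symbols, writing x_d for the words of document d, z_d for its local variables, and β for the global variables (notation assumed here for illustration), the factorization reads:

```latex
p(x, z, \beta) = p(\beta) \prod_{d=1}^{D} p(x_d, z_d \mid \beta)
```

Conditioned on β, the per-document terms are independent, which is what makes document-by-document (and subset-by-subset) inference possible.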

He next compared the speed-accuracy trade-off of three methods:

Batch inference – ingest all documents at once; it is very slow, but eventually optimal:

Optimize the probability estimates for the local variables across all documents (which can be very many)

Optimize the probability estimates for the global variables

Repeat

Stochastic inference – ingest a small subset of the documents at each step:

Optimize the probability estimates for the local variables in the current subset

Take a step toward improving the probability estimates for the global variables

Repeat using the next subset of the documents

MCMC – this should be more accurate, but #MCMC is incredibly slow, so it can only be run on a subset of the data
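For contrast, here is a minimal Metropolis sampler on a made-up one-parameter posterior (a generic MCMC sketch, not the topic model itself); note that every proposal rescans the whole data subset, which is part of why MCMC is so slow at corpus scale:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=500)   # stand-in for a data subset

def log_post(mu):
    # unnormalized log posterior: N(0, 10) prior, unit-variance likelihood
    return -mu**2 / 20.0 - 0.5 * np.sum((data - mu) ** 2)

mu, samples = 0.0, []
for _ in range(5000):
    prop = mu + rng.normal(0, 0.1)      # symmetric random-walk proposal
    # Metropolis accept/reject: full likelihood evaluated at every step
    if np.log(rng.random()) < log_post(prop) - log_post(mu):
        mu = prop
    samples.append(mu)
```

The chain wanders toward the high-posterior region and then samples around it; accuracy comes from averaging many such correlated samples.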

John showed that the stochastic inference method converges fastest to an accurate out-of-sample model.
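The stochastic scheme described above can be sketched on a toy problem: estimating a global parameter from minibatches, nudging the estimate toward each minibatch's noisy optimum with a decaying step size. The data, step-size schedule, and averaging update are made-up stand-ins for the real variational updates:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=10_000)  # toy "corpus"

lam = 0.0          # global parameter estimate
batch_size = 100
for t, start in enumerate(range(0, len(data), batch_size), start=1):
    batch = data[start:start + batch_size]
    # "local" step: a noisy estimate of the global optimum from this subset
    lam_hat = batch.mean()
    # "global" step: move toward the noisy estimate with a decaying step size
    rho = (t + 1.0) ** -0.7   # Robbins-Monro schedule
    lam = (1 - rho) * lam + rho * lam_hat
```

Because each pass touches only one small subset, the global estimate improves long before all documents have been seen, which is the source of the speed advantage.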