
This tutorial is a basic introduction to topic modelling for web scientists.
Prior knowledge of probabilistic modelling or topic modelling is not required. The idea is to explain the fundamental mechanisms and ideas behind topic modelling, without using distracting formal notation unless necessary.

Outline

In this tutorial, we teach the intuition and the assumptions behind topic models. Topic models explain co-occurrences of words in documents with sets of semantically related words, called topics. These topics are semantically coherent and can be interpreted by humans. Starting with the most popular topic model, Latent Dirichlet Allocation (LDA), we explain the fundamental concepts of probabilistic topic modelling. We organise our tutorial as follows: after a general introduction, we enable participants to develop an intuition for the underlying concepts of probabilistic topic models. Building on this intuition, we cover the technical foundations of topic models, including graphical models and Gibbs sampling. We conclude the tutorial with an overview of the most relevant adaptations and extensions of LDA.

Developing an Intuition

In the first part, we provide the participants with an intuition of the ideas and assumptions behind probabilistic topic models. First, we present easily understandable metaphors (following the Polya urn scheme) to introduce the multinomial and the Dirichlet-multinomial distribution and the role of the parameter of the Dirichlet distribution for probabilistic modelling. Furthermore, we introduce the notion that a corpus of documents can be modelled as a mixture of Dirichlet-multinomial distributions. We then train LDA on text corpora and demonstrate the effects of different parameter settings on the trained topic models. In order to deepen the intuition, we conclude this part with a game with a purpose, enabling a human evaluation of model parameters.
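The Polya urn metaphor mentioned above can be sketched in a few lines of code: draw a ball from an urn, then return it together with one extra ball of the same colour. The initial counts play the role of the Dirichlet parameter; small counts let one colour dominate quickly ("rich get richer"), while large counts keep the draws close to the initial proportions. This is a minimal illustrative simulation, not part of the tutorial material itself:

```python
import random

def polya_urn(colors, steps, seed=0):
    """Simulate a Polya urn: draw a ball proportionally to the current
    counts, then put it back together with one more ball of the same
    colour. The initial counts behave like the Dirichlet parameter."""
    rng = random.Random(seed)
    urn = dict(colors)  # colour -> current count
    for _ in range(steps):
        total = sum(urn.values())
        r = rng.uniform(0, total)
        for color, count in urn.items():
            r -= count
            if r <= 0:
                urn[color] += 1  # reinforce the drawn colour
                break
    return urn

# With small initial counts, one colour typically comes to dominate.
counts = polya_urn({"red": 1, "blue": 1}, steps=1000)
```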

Technical Foundations

After developing the intuition, in the second part of the tutorial we show how the assumptions in the metaphors translate to the single parts of Latent Dirichlet Allocation (LDA), the most cited topic model in the scientific community. We provide a translation of the gained intuition to detailed definitions. In particular, we aim to cover concepts such as closed form inference, approximate inference with a focus on Gibbs sampling, generative storyline and plate notation. For each of the introduced concepts, we provide illustrative implementation examples.
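To make the Gibbs sampling part concrete, the collapsed Gibbs sampler for LDA can be sketched as follows: each token's topic is resampled from its full conditional, which is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta). This is a minimal, unoptimised illustration with our own variable names, not code from the tutorial:

```python
import random

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs is a list of documents,
    each a list of word ids in [0, V). Returns topic-word counts."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]       # topic counts per document
    n_kw = [[0] * V for _ in range(K)]   # word counts per topic
    n_k = [0] * K                        # total tokens per topic
    z = []                               # topic assignment per token
    for d, doc in enumerate(docs):       # random initialisation
        assignments = []
        for w in doc:
            k = rng.randrange(K)
            assignments.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(assignments)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]              # remove the token's current topic
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights)[0]
                z[d][i] = k              # reassign and restore counts
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_kw  # normalise rows to obtain topic-word distributions

# Toy corpus: word ids 0-1 and 2-3 tend to separate into two topics.
topic_word_counts = gibbs_lda(
    [[0, 1, 0, 1], [0, 0, 1], [2, 3, 2], [3, 3, 2, 2]], V=4, K=2, iters=100)
```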

Adaptations and Extensions

LDA has been adapted and extended to a wide range of specific settings. In the final part of the tutorial, we will present adaptations relevant for the social sciences.
Examples include models exploiting context information: L-LDA, a supervised variant of LDA; PL-TM, a topic model for multilingual settings; and the Citation Influence Model, which models the influence of citations in a collection of publications.

Evaluation and Discussion of Pros and Cons

While topic models are a useful tool for the exploratory analysis of unfamiliar data collections, they have been disputed in the recent past. We discuss common error modes and summarise the critique of topic models. We emphasise the importance of evaluating any exploratory tool in the domain of interest before drawing conclusions. To enable participants to make an informed decision, we discuss several avenues for in-domain evaluation.

Latent Dirichlet Allocation (LDA)

Promoss implements LDA with an efficient online stochastic variational inference scheme, meaning that memory consumption is lower than for standard implementations and inference is significantly sped up.

Usage is simple: you create a corpus.txt file in which each line corresponds to a document. Then you execute promoss.jar with the desired parameters, where -T 50 sets the number of topics to 50 and -MIN_DICT_WORDS 100 gives the minimum number of occurrences required to include a word in the analysis (in this case 100). There is also an alternative input format based on a dictionary and documents given in SVMlight format, which is documented in the readme file.
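As a minimal sketch of the expected input format, corpus.txt can also be generated programmatically; the documents below are placeholder text, and only the one-document-per-line convention comes from the description above:

```python
# Write a corpus.txt for Promoss: one document per line.
# The documents themselves are illustrative placeholders.
documents = [
    "topic models discover latent themes in text collections",
    "gibbs sampling approximates the posterior over topic assignments",
    "the dirichlet parameter controls the sparsity of topics",
]

with open("corpus.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        # keep each document on a single line
        f.write(doc.replace("\n", " ") + "\n")
```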

Hierarchical Multi-Dirichlet Process Topic Model (HMDP)

Do you want to include multiple kinds of document metadata in your topic model, such as geographical locations, timestamps or ordinal variables? But you do not want to spend weeks writing your own topic model, and you want efficient inference?

Store the document metadata, separated by semicolons, in a file named meta.txt. The documents have to be put in a file named corpus.txt in which each line corresponds to a document. Documents can be raw text and will be processed by Promoss. You have to specify which metadata are geographical locations, timestamps, ordinal or nominal variables. Timestamps can be used to extract yearly, monthly, weekly or daily cycles.
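To illustrate the two-file layout, the sketch below writes an aligned corpus.txt and meta.txt pair. Only the semicolon separation and the line-per-document alignment come from the description above; the encoding of a location as "lat,lon", the use of UNIX timestamps, and the field order are assumptions for illustration (the exact conventions are in the readme file):

```python
# Line i of meta.txt holds the metadata of line i in corpus.txt,
# with metadata fields separated by semicolons. Field contents here
# (lat,lon pair and UNIX timestamp) are illustrative assumptions.
docs = [
    ("web science conference announcement", (51.0, 7.0), 1496275200),
    ("topic modelling tutorial slides", (48.1, 11.6), 1498867200),
]

with open("corpus.txt", "w", encoding="utf-8") as fc, \
     open("meta.txt", "w", encoding="utf-8") as fm:
    for text, (lat, lon), ts in docs:
        fc.write(text + "\n")
        fm.write(f"{lat},{lon};{ts}\n")  # fields separated by ";"
```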

Then you just have to execute the .jar file with a few parameters; an example command line and the full parameter documentation can be found in the readme file.

If you need any support in using Promoss, feel free to contact us:
topicmodels (ät) c-kling.de

Contact

Do you have suggestions on how we could improve our material? Or do you want to host the topic model tutorial at your institute?
Please feel free to ask us any related questions:
topicmodels (ät) c-kling.de