Topic inference visualization

In this page we visualize the inference of topics in an image dataset and a
text dataset. As in most examples, we use the console applications that are
readily available once you install LDA++. The image dataset is the well-known
Olivetti Faces and the text dataset is the 20 news groups. Besides LDA++, we
will use scikit-learn to fetch the datasets and matplotlib and wordcloud to
plot the inference process. All of these libraries can be installed with pip,
or you can download Anaconda for a full Python distribution.

Fetching the datasets

We will use scikit-learn to fetch and preprocess the datasets in a few lines
of Python. The purpose of this example is to visualize the inference process,
not to produce the best possible topics, so we take shortcuts to save
computation and experimentation time.
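
The preprocessing could look roughly like the sketch below. It rests on several
assumptions that should be checked against the LDA++ documentation: that the
console applications read a file holding the count matrix (one column per
document) followed by the labels, both written with `np.save`; that the pixel
intensities are scaled to integer pseudo-counts; and that `fnames.pickle` stores
the vocabulary. The helper names and the `max_words=1000` shortcut are
hypothetical.

```python
import pickle

import numpy as np


def faces_to_counts(pixels):
    """Turn float pixel intensities in [0, 1] into integer pseudo-counts.

    `pixels` has shape (n_images, n_pixels); the result has shape
    (n_pixels, n_images), i.e. one column per document (an assumption
    about the layout the console applications expect).
    """
    return (np.asarray(pixels) * 255).astype(np.int32).T


def save_dataset(path, X, y):
    """Write the count matrix and the labels back to back in one file."""
    with open(path, "wb") as f:
        np.save(f, np.asarray(X, dtype=np.int32))
        np.save(f, np.asarray(y, dtype=np.int32))


def fetch_and_save(max_words=1000):
    """Download both datasets (needs network access) and save them."""
    from sklearn.datasets import fetch_20newsgroups, fetch_olivetti_faces
    from sklearn.feature_extraction.text import CountVectorizer

    faces = fetch_olivetti_faces()
    save_dataset("faces.npy", faces_to_counts(faces.data), faces.target)

    news = fetch_20newsgroups(subset="train")
    # Shortcut: keep only the most frequent words to save computation
    vectorizer = CountVectorizer(stop_words="english", max_features=max_words)
    counts = vectorizer.fit_transform(news.data)
    save_dataset("news.npy", counts.toarray().T, news.target)

    # Keep the vocabulary so that word indices can be mapped back to words
    with open("fnames.pickle", "wb") as f:
        pickle.dump(list(vectorizer.get_feature_names_out()), f)
```

Calling `fetch_and_save()` once would produce `faces.npy`, `news.npy` and
`fnames.pickle` in the current directory.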

Training LDA

After downloading the data and transforming it into the format readable by the
console applications, we can easily infer topics from these datasets. We will
use the --snapshot_every option to save a model after each epoch so that we can
later visualize the inference process.

The following code trains two LDA models, one for the faces dataset and one for
the 20 news groups. We infer 10 topics for the faces dataset and 20 for the text
dataset. Set the --workers option according to the number of threads your
processor can execute in parallel.
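
As a hedged sketch, the two training runs can be assembled and launched from
Python. Only the --snapshot_every and --workers options and the file names are
taken from this page; the `lda train` subcommand, the --topics and --iterations
options and the argument order are assumptions that should be verified with
`lda --help` before use.

```python
import subprocess


def train_command(data, model, topics, workers=4, iterations=100):
    """Build the argument list for one training run.

    Only --snapshot_every and --workers come from the text; the `train`
    subcommand, --topics and --iterations are assumptions to check
    against `lda --help`.
    """
    return [
        "lda", "train",
        "--topics", str(topics),
        "--iterations", str(iterations),
        "--workers", str(workers),
        "--snapshot_every", "1",
        data, model,
    ]


def train_all(workers=4):
    """Train the faces model (10 topics) and the news model (20 topics)."""
    for data, model, topics in [("faces.npy", "faces_model.npy", 10),
                                ("news.npy", "news_model.npy", 20)]:
        subprocess.run(train_command(data, model, topics, workers), check=True)
```

Building the argument list separately from running it makes it easy to print
and inspect the command before any long training run starts.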

After executing the above code (and the code from the previous section) the
directory should contain the following files:

faces_model.npy

faces_model.npy_001 - faces_model.npy_100

news_model.npy

news_model.npy_001 - news_model.npy_100

faces.npy

news.npy

fnames.pickle

The files (faces | news)_model.npy_(001 - 100) are the models saved after the
corresponding epochs, and we will use them to plot the topic evolution.
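
The snapshot names follow a simple pattern, so they can be generated rather
than typed. The loader below is only a sketch: it assumes that a model file
holds two arrays written back to back with `np.save`, the prior `alpha`
followed by the topics `beta` with one row per topic. That layout is an
assumption and both helper names are hypothetical.

```python
import numpy as np


def snapshot_paths(model_path, epochs=100):
    """Return the snapshot file names for epochs 1..epochs, in order."""
    return ["{}_{:03d}".format(model_path, e) for e in range(1, epochs + 1)]


def load_topics(path):
    """Load the topic-word matrix from a model file.

    Assumption: the file holds two arrays written back to back, the
    prior `alpha` and then the topics `beta` (one row per topic).
    """
    with open(path, "rb") as f:
        np.load(f)         # alpha, not needed for plotting
        return np.load(f)  # beta
```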

Topic visualization

To visualize the evolution of the topics, we first need a way to visualize a
single topic. The faces dataset has been formatted so that each topic can be
displayed as a $64 \times 64$ image, while each text topic is represented by a
word cloud that emphasizes the most probable words.
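
Both visualizations reduce to small array manipulations. In the sketch below a
topic is assumed to be a one-dimensional array of word (or pixel)
probabilities; the function names and the `n_words=50` default are
hypothetical.

```python
import numpy as np


def topic_to_image(topic, side=64):
    """Reshape a distribution over side*side pixels into a 2D image in [0, 1]."""
    image = np.asarray(topic, dtype=float).reshape(side, side)
    peak = image.max()
    return image / peak if peak > 0 else image


def topic_to_frequencies(topic, vocabulary, n_words=50):
    """Map the n_words most probable words of a topic to their probabilities."""
    topic = np.asarray(topic, dtype=float)
    top = np.argsort(topic)[::-1][:n_words]
    return {vocabulary[i]: float(topic[i]) for i in top}
```

The image can be shown with matplotlib's `imshow` using a gray colormap, and
the dictionary can be passed to `WordCloud.generate_from_frequencies` from the
wordcloud package, which draws more probable words larger.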

Topic evolution

In the following figure we have applied the above visualization for all the
topics of the faces dataset for different epochs.

The evolution of all 10 topics in the Olivetti faces dataset

After one epoch all topics look approximately the same, and it is hard to
predict what each topic will eventually become. As training progresses, some
topics clearly focus on particular facial characteristics and not others. For
instance, the second topic generates no mouths (hence the large black blob
where the mouth would be) and the sixth topic generates beards.
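
A figure of this kind can be assembled by tiling the topic images, one row per
epoch and one column per topic. The sketch below assumes each snapshot has
already been loaded as an array with one row per topic; scaling every topic by
its own maximum is a display choice made here, not necessarily the one used for
the figure above.

```python
import numpy as np


def evolution_grid(snapshots, side=64):
    """Tile topic images into one array: one row per epoch, one column per topic.

    `snapshots` is a list of (n_topics, side*side) arrays, one per epoch.
    """
    rows = []
    for beta in snapshots:
        beta = np.asarray(beta, dtype=float)
        # Scale every topic by its own maximum so early, flat topics stay visible
        scaled = beta / np.maximum(beta.max(axis=1, keepdims=True), 1e-12)
        images = scaled.reshape(len(beta), side, side)
        rows.append(np.hstack(list(images)))
    return np.vstack(rows)
```

The resulting array can be displayed in a single `imshow` call, which is
usually simpler than managing one subplot per topic and epoch.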

We can perform the same visualization for the 20 news groups dataset, but since
the word cloud images are larger we visualize the inference of a single topic
only. Here the topic converges much faster, within the first few tens of
epochs.

Evolution of a single topic in the 20 news groups dataset

Another attribute of a topic that we can visualize is its distribution over
words and how that distribution evolves. When the distributions over words stop
changing, the topic model has converged, although in practice it is more common
to check convergence using the likelihood. In the following figure we see the
change in the distribution of the same topic as in the above figure. Indeed,
the topic changes very little from the 30th epoch onwards.
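
One way to put a number on "changes very little" is to measure the distance
between the topic's word distribution at consecutive epochs. The sketch below
uses the total variation distance, which is a choice made here and not
necessarily what the figure plots; the `tolerance` value is likewise a
hypothetical threshold.

```python
import numpy as np


def topic_change(snapshots, topic):
    """Total variation distance of one topic between consecutive epochs.

    `snapshots` is a list of (n_topics, n_words) arrays whose rows sum to 1.
    """
    rows = np.array([np.asarray(beta, dtype=float)[topic] for beta in snapshots])
    return 0.5 * np.abs(np.diff(rows, axis=0)).sum(axis=1)


def first_stable_epoch(changes, tolerance=1e-3):
    """First epoch (1-based) after which every change stays below tolerance."""
    changes = np.asarray(changes)
    above = np.nonzero(changes >= tolerance)[0]
    # changes[i] compares epoch i + 1 with epoch i + 2
    return 1 if len(above) == 0 else int(above[-1]) + 2
```

Plotting the output of `topic_change` against the epoch number gives a curve
that flattens towards zero once the topic has settled.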