MNIST dataset

The following example aims to point out the differences between the inferred
topics of LDA and fsLDA. We decided to use the MNIST
database which is a dataset of 70000
handwritten digits, in order to make the topics visualization more fancy!!! Both
models are trained using the console applications that are thoroughly explained
in the corresponding documentation page.

The console applications that are used in this example expect and save data in
numpy format. We have already transformed the MNIST dataset to numpy format
and you can download it if you copy and paste the following instructions in a
terminal.

We observe that the training set is an array of size (784, 60000). All images
in this dataset are $28\times28$ images, thus the first dimension is 784
(28*28=784), while the second dimension refers to the number of the training
samples, which is 60000.

Both in LDA and fsLDA, each document can be viewed as a mixture of various
topics. However, the MNIST database consists exclusively of images, thus there
is no direct analogy between words and documents. In order to use the MNIST
dataset in combination with LDA and fsLDA we consider each image to be a
document and each pixel in the image to be a word. As a result, our corpus
consists of 70000 documents and the vocabulary size is 784.

Unsupervised LDA

In this section, we will visualize the inferred topics from an unsupervised LDA
model using the lda application. We
train an unsupervised model with 10 topics for 20 Expectation-Maximization
steps, by setting the --topics argument to 10 and the --iterations argument
to 20.

In addition, we initialize the topics as random distributions by using the
--initialize_random option and use the --snapshot_every option to save a
model after every epoch so that we can later inspect the topics evolution

After executing the above code the current directory should the following
files:

mnist_lda_model.npy

mnist_lda_model.npy_001 - mnist_lda_model.npy_020

The first file is the final trained model, namely from the $20^{th}$ epoch,
while the rest of the files are the trained models from the corresponding epochs
and will be used to plot the evolution of the inferred topics during the
epochs.

As we have already mentioned in previous documentation pages, the
lda application trains a model and
saves it as two numpy arrays. The first array is $\alpha$, namely the parameter
of the Dirichlet prior on the per-document topic distributions, while the
second one is $\beta$, which is the per-topic word distribution. Subsequently,
in order to visualize the topics we merely have to visualize the $\beta$
parameter. In the following python session, we visualize the evolution of the
inferred topics, during the training process.

The following image depicts the evolution of 10 topics during the first 20
epochs. We observe that during the first epochs, it is rather hard to recognize
any topic, however after the first 10 epochs, they seem quite converged.

The evolution of all 10 topics in the MNIST dataset for the first 20 epochs

Subsequently, we train a new model with 100 topics, in order to observe the difference
when the number of topics increases. The rest of the arguments remain the same
as in the above example. The inferred topics are depicted in the following
figure. If we observe the image from an overall perspective it is easily noticeable
that some of the topics resemble numbers, while others seems to be part of numbers.

100 topics in the MNIST dataset using vanilla LDA

Fast Supervised LDA

In this section, we will visualize the inferred topics using the Fast
Supervised LDA, using the fslda
application. We retain the arguments from the previous section, so that we will
be able to compare the inferred topics from the two variational inference
methods. The following figure illustrates 100 topics, as they were inferred
using Fast Supervised LDA.

100 topics in the MNIST dataset using fsLDA

We observe that, the majority of the inferred topics using fsLDA is completely
different with those using LDA. In addition, while in case of LDA many topics
look like actual letters, in case of fsLDA the most of them looks like
sections of letters.