
Gensim LDA: Tips and Tricks

Introduction

Gensim is an easy-to-use, fast, and efficient tool for topic modeling. The purpose of this post is to share a few of the things I’ve learned while implementing Latent Dirichlet Allocation (LDA) on corpora of varying sizes. This post is not meant to be a full tutorial on LDA in Gensim, but rather a supplement to help you navigate around any issues you may run into. If you are getting started with Gensim, or just need a refresher, I would suggest taking a look at their excellent documentation and tutorials.

Troubleshooting Gensim

The Gensim Google Group is a great resource. Most of the information in this post was derived from searching through the group discussions. If you are having issues, I’d highly recommend searching the group before doing anything else. Also make sure to check out the FAQ and Recipes GitHub Wiki.

Logging Logging Logging

When training models in Gensim, you will not see anything printed to the screen. Gensim reports its training progress through Python's logging module, and those messages are not shown unless logging is configured. Depending on corpus size, LDA may take a few minutes, hours, or even days, so it is extremely important to have some information about the progress of the procedure. Logging can be set up to write either to an external file or to the terminal.
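A minimal sketch of such a setup, using only the standard library. The helper name `setup_logging` and the format string are my own choices for illustration, not something Gensim requires; the important part is that the level is set to INFO so that progress messages are shown.

```python
import logging

# Timestamp, level, and message on every line (format is an arbitrary choice).
LOG_FORMAT = "%(asctime)s : %(levelname)s : %(message)s"


def setup_logging(log_file=None):
    """Send INFO-level messages to the terminal, or to log_file if given."""
    logging.basicConfig(filename=log_file, format=LOG_FORMAT, level=logging.INFO)


# Log to the terminal. Pass a path such as "lda_training.log"
# (hypothetical filename) to write to a file instead.
setup_logging()
```

Note that `logging.basicConfig` does nothing if the root logger already has handlers, which is often the case in notebooks, so call it before anything else configures logging.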

Building the dictionary and corpus

One of the primary strengths of Gensim is that it doesn’t require the entire corpus to be loaded into memory. You can also build a dictionary without loading all your data into memory. All of this is summarised in the Corpora and Vector Spaces Tutorial. I don’t have much to add here except the following: save and save_as_text are not interchangeable (this also goes for load and load_from_text). A dictionary written with save must be read back with load, and one written with save_as_text must be read back with load_from_text.

The amount of memory Gensim needs while training can be estimated as:

8 bytes * num_terms * num_topics * 3

8 bytes: size of a double precision float

num_terms: number of terms in the dictionary

num_topics: number of topics

The magic number 3: 8 bytes * num_terms * num_topics accounts for the model output, but Gensim will need to make temporary copies while modeling. The scaling factor of 3 gives you an idea of how much memory Gensim will be consuming while running with the temporary copies present.

NOTE: The link above goes to a FAQ about LSI in Gensim, but the same estimate applies to LDA, as per this Google group discussion answered by the Gensim author, Radim Rehurek.
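To make the estimate concrete, here is the formula above as a small function. The function name and the example sizes (100k terms, 100 topics) are mine, chosen only for illustration.

```python
def estimate_lda_memory_bytes(num_terms, num_topics, copies=3, bytes_per_value=8):
    """Rough memory estimate: 8 bytes * num_terms * num_topics * 3."""
    return bytes_per_value * num_terms * num_topics * copies


estimate = estimate_lda_memory_bytes(num_terms=100_000, num_topics=100)
print(f"{estimate / 1e9:.2f} GB")  # prints "0.24 GB"
```

This is only a rule of thumb for the model itself; the documents held in memory for each chunk come on top of it.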

There are multiple filtering methods available in Gensim that can cut down the number of terms in your dictionary. If you are unsure of how many terms your dictionary contains, you can check by printing the dictionary object after it is created or loaded.

Most of the Gensim documentation shows 100k terms as the suggested maximum number of terms; it is also the default value for the keep_n argument of filter_extremes. The other options for decreasing memory usage are limiting the number of topics or getting more RAM.

Filtering a Previously Saved Dictionary and Updating the Corpus

If you need to filter your dictionary and update the corpus after the dictionary and corpus have been saved, take a look at the link below to avoid any issues:

I find it useful to save the complete, unfiltered dictionary and corpus, so that I can use the steps in the previous link to try out several different filtering methods.

Training the LDA Model

If you follow the tutorials, the process of setting up LDA model training is fairly straightforward. The one thing that took me a bit to wrap my head around was the relationship between chunksize, passes, and update_every.

Chunksize, Passes, and Update_every

chunksize: number of documents to load into memory at a time and process in the E step of EM.

update_every: number of chunks to process prior to moving on to the M step of EM.

I’m not going to go into the details of EM/Variational Bayes here, but if you are curious check out this google forum post and the paper it references here.

In general, a chunksize of 100k with update_every set to 1 is equivalent to a chunksize of 50k with update_every set to 2. The primary difference is that you will save some memory using the smaller chunksize, but you will be doing multiple loading/processing steps prior to moving on to the maximization step. passes is not related to chunksize or update_every; it is the number of times you want to go through the entire corpus. Below are a few examples of different combinations of the 3 parameters and the number of online training updates which will occur while training LDA.
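The arithmetic can be sketched with a small helper. The function name `count_updates` and the corpus size of one million documents are my own assumptions, and the calculation assumes single-core training, where any leftover chunks at the end of a pass trigger one final update and update_every set to 0 means a single update per pass.

```python
import math


def count_updates(num_docs, chunksize, update_every, passes):
    """Approximate number of M-step updates over the whole training run."""
    if update_every == 0:
        # Batch mode: one update at the end of each pass.
        return passes
    chunks_per_pass = math.ceil(num_docs / chunksize)
    updates_per_pass = math.ceil(chunks_per_pass / update_every)
    return passes * updates_per_pass


num_docs = 1_000_000  # hypothetical corpus size

print(count_updates(num_docs, chunksize=100_000, update_every=1, passes=1))  # 10
print(count_updates(num_docs, chunksize=50_000, update_every=2, passes=1))   # 10
print(count_updates(num_docs, chunksize=100_000, update_every=1, passes=2))  # 20
print(count_updates(num_docs, chunksize=100_000, update_every=0, passes=2))  # 2
```

The first two lines give the same number of updates, which is the equivalence described above; doubling passes simply doubles the total.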

8 bytes * num_terms * num_topics >= 1GB

The only way to get around this is to limit the number of topics or terms.
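If it helps, the inequality can be turned around to find how many topics fit under the limit for a given dictionary size. This is only a sketch: the function name is made up, and I am assuming 1GB means 2**30 bytes.

```python
def max_topics_under_limit(num_terms, limit_bytes=2**30, bytes_per_value=8):
    """Largest num_topics with 8 bytes * num_terms * num_topics below the limit."""
    bytes_per_topic = bytes_per_value * num_terms
    # Subtract 1 so that a product exactly equal to the limit is excluded.
    return (limit_bytes - 1) // bytes_per_topic


print(max_topics_under_limit(num_terms=100_000))  # 1342
```

With the suggested 100k-term dictionary, this leaves plenty of room for typical topic counts; the limit mostly bites when the dictionary is left unfiltered.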

Summary

Hopefully this post will save you a few minutes if you run into any issues while training your Gensim LDA model. Please make sure to check out the links below for Gensim news, documentation, tutorials, and troubleshooting resources: