Feature selection

Rather than fitting the topic model to the entire text, we fit the model to
just the lemmas of the non-proper nouns. The following code segment filters
the text using the POS-tagged and lemmatized corpus. For each document, we
build a long text string containing all of the selected words, separated by
spaces.
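The step above can be sketched as follows. This assumes the tagged corpus is a data frame with one row per token and columns `doc_id`, `upos`, and `lemma`; these column names follow cleanNLP-style output and are an assumption, so adjust them to your own pipeline.

```r
# Hypothetical tagged corpus; in practice `anno` comes from the
# POS-tagged and lemmatized corpus described above.
anno <- data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  upos   = c("NOUN", "PROPN", "NOUN", "VERB", "NOUN"),
  lemma  = c("city", "Paris", "street", "walk", "river")
)

# Keep only common nouns (proper nouns are tagged PROPN, so they drop out).
nouns <- anno[anno$upos == "NOUN", ]

# Build one long space-separated string of selected lemmas per document.
text <- tapply(nouns$lemma, nouns$doc_id, paste, collapse = " ")
```

The result is a named character vector with one entry per document, which is the form of input the mallet import step expects.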

To filter out stopwords, the mallet package expects the words to be stored
in a file, one word per line. Since we have already used POS tags to filter
out stop words, we only need to worry about initials that the tagger may
have mistaken for non-proper nouns.
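One way to build such a file is sketched below. The exact list of initials is an assumption here (single letters, with and without a trailing period, since "J." in "J. Smith" can be tagged as a common noun); the file name "stopwords.txt" is also hypothetical.

```r
# Single-letter initials, with and without a period, that the tagger
# may have mistaken for common nouns.
stopwords <- c(letters, paste0(letters, "."))

# mallet's stoplist file format is plain text, one word per line.
writeLines(stopwords, "stopwords.txt")
```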

These options specify how often to optimize the hyperparameters (optimize
alpha every 20 iterations, after performing 50 burn-in iterations), how many
training iterations to perform (200), and how many iterations to use to
determine the topic of each token (10). The values used here are the defaults
suggested by the mallet package. Increasing these values may produce more
consistent runs of the procedure.
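These options correspond to the following calls in the mallet package's API. This is a sketch, not the author's exact code: `instances` is assumed to come from an earlier mallet.import() call, and the number of topics (16) is an assumption.

```r
library(mallet)

# Number of topics is an illustrative choice, not taken from the text.
topic_model <- MalletLDA(num.topics = 16)
topic_model$loadDocuments(instances)

# Optimize alpha every 20 iterations, after 50 burn-in iterations.
topic_model$setAlphaOptimization(20, 50)

# Run 200 training iterations.
topic_model$train(200)

# Use 10 iterations to determine the topic of each token.
topic_model$maximize(10)
```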

After fitting the model we can now pull out the topics, the words, and the
vocabulary:
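With the mallet package this extraction can be sketched as below, assuming `topic_model` is the trained model from the previous step; the smoothing and normalization flags are conventional choices, not necessarily the author's.

```r
# Document-by-topic proportions, one row per document.
topics <- mallet.doc.topics(topic_model, smoothed = TRUE, normalized = TRUE)

# Topic-by-word probabilities, one row per topic.
words <- mallet.topic.words(topic_model, smoothed = TRUE, normalized = TRUE)

# Character vector giving the word associated with each column of `words`.
vocab <- topic_model$getVocabulary()
```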

The output of the topic model is sensitive to the random initialization. You
will likely get different results every time you run this code. I do not know
if it's possible to ensure consistent output from mallet. If you want your
analysis to be reproducible, you should save the topic model output using
a command like
saveRDS(list(topics=topics, words=words, vocab=vocab), "tm.rds").

For the remainder of this analysis, we will use the results from Arnold
and Tilton's analysis:
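Loading a saved run back in is the mirror image of the saveRDS() call above; this sketch assumes the output was saved to "tm.rds" with the list structure shown earlier, though the file distributed with Arnold and Tilton's analysis may be named differently.

```r
# Restore the saved topic model output.
tm <- readRDS("tm.rds")
topics <- tm$topics
words  <- tm$words
vocab  <- tm$vocab
```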