"Topic Modeling: What Humanists Actually Do With It." A Guest Post by Teddy Roland, University of California, Berkeley

One of the hardest questions we can pose to a computer is what a human-language text is about. Given an article, what are its keywords or subjects? What are some other texts on the same subjects? For us as human readers, these kinds of tasks may seem inseparable from the very act of reading: we direct our attention over a sequence of words in order to connect them to one another syntactically and interpret their semantic meanings. Reading a text, for us, is a process of unfolding its subject matter.

Computer reading, by contrast, seems hopelessly naive. The computer is well able to recognize unique strings of characters like words and can perform tasks like locating or counting these strings throughout a document. For instance, by pressing Control-F in my word processor, I can tell it to search for the string of letters “reading,” which reveals that so far I have used the word three times and highlights each instance. But that's about it. The computer doesn't know that the word is part of the English language, much less that I am referring to its practice as a central method in the humanities.
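To make the Control-F example concrete, here is a minimal sketch in Python of that kind of string counting. The sample sentence and the helper name count_occurrences are invented for illustration; a real word processor may match strings differently.

```python
import re

def count_occurrences(text: str, term: str) -> int:
    """Count case-insensitive matches of `term` as a raw string of characters."""
    # re.escape makes the term literal: the computer matches letters, not meanings.
    return len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

# Hypothetical sample text, not taken from any real document.
sample = (
    "Reading a text is a process of unfolding its subject matter. "
    "Computer reading, by contrast, seems naive."
)

print(count_occurrences(sample, "reading"))  # prints 2
```

The function finds every run of matching letters and nothing more, which is exactly the limitation described above.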

To their credit, however, computers make excellent statisticians, and this can be leveraged toward the kind of textual synthesis that initiates higher-order inquiry. If a computer were shown many academic articles, it might discover that articles containing the word reading frequently include others like interpretation, criticism, and discourse. Without foreknowledge of these words' meanings, it could statistically learn that there is a useful relationship between them. In turn, the computer would be able to identify articles in which this cluster of words seems to be prominent, corresponding to humanist methods.
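The statistical intuition here can be sketched in a few lines of Python. This is only a toy co-occurrence count, not a topic model: the four "articles" are invented, and the function names cooccurrence_counts and companions are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy "articles", invented for illustration; each is just a bag of words.
articles = [
    "reading interpretation criticism discourse",
    "reading criticism discourse theory",
    "protein enzyme reaction cell",
    "reading interpretation enzyme",
]

def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of words."""
    pair_counts = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))  # sorted so each pair has one canonical order
        pair_counts.update(combinations(vocab, 2))
    return pair_counts

def companions(word, pair_counts, top_n=3):
    """Return the words that most often share a document with `word`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == word:
            scores[b] += n
        elif b == word:
            scores[a] += n
    return scores.most_common(top_n)

pairs = cooccurrence_counts(articles)
print(companions("reading", pairs))
```

In this toy corpus, interpretation, criticism, and discourse each appear alongside reading in two articles, so they surface as its closest companions even though the program knows nothing about what any of the words mean.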

This process is popularly referred to as topic modeling, since it attempts to capture a list of many topics (that is, statistical word clusters) that would describe a given set of texts. The most commonly used implementation of a topic modeling algorithm is MALLET, which is written and maintained by Andrew McCallum. It is also distributed as an easy-to-use R package, 'mallet', by David Mimno.
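For readers who would like to see the moving parts, the following is a minimal sketch of one common way to fit a topic model: collapsed Gibbs sampling for latent Dirichlet allocation (LDA). It is written in plain Python for illustration and is not MALLET's code. The function name fit_lda, the toy documents, and the settings (alpha, beta, the number of iterations, the random seed) are all assumptions chosen for the example.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over lists of word tokens."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens per topic in each doc
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # total tokens per topic
    assignments = []

    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                # Weight = (how much this doc uses topic k) * (how much topic k uses this word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    # Summarize: most frequent words per topic, and smoothed topic shares per document.
    top_words = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda w: topic_word[k][w], reverse=True)
        top_words.append([vocab[w] for w in ranked[:4]])
    doc_shares = [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return top_words, doc_shares

# Invented toy documents, far too small for a meaningful model.
docs = [
    "reading interpretation criticism discourse reading".split(),
    "criticism discourse interpretation reading".split(),
    "silk cotton cloth fine silk".split(),
    "cloth cotton fine silk".split(),
]
top_words, doc_shares = fit_lda(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
```

The output is a list of word clusters plus, for each document, the share of its words assigned to each cluster. Because the sampler is random, a different seed or number of topics can yield different clusters, which is one concrete place where human choices enter the process.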

Since there are already several excellent introductions to topic modeling for humanists, I won't go further into the mathematical details here. For those looking for an intuitive introduction to topic modeling, I would point out Matt Jockers' fable of the “LDA Buffet.” LDA is the most popular algorithm for topic modeling. For those who are curious about the math behind it but aren't interested in deriving any equations, I highly recommend Ted Underwood's “Topic Modeling Made Just Simple Enough” and David Blei's “Probabilistic Topic Models.”

Despite topic modeling's algorithmic nature, it would be a gross mischaracterization to claim that it is somehow objective or free of interpretation. I will simply emphasize that human evaluative decisions and textual assumptions are encoded in each step of the process, including text selection and topic scope. In light of this, I will focus on how topic modeling has been used critically to work on humanistic research questions.

Topic modeling's use in humanistic research might be thought of in terms of three broad approaches: as a tool to guide our close readings, as a technique for capturing the social conditions of texts, and as a literary method that defamiliarizes texts and language.

Topic Modeling as Exploratory Archival Tool

Early examples of topic modeling in the humanities emphasize its ability to help scholars navigate large archives, in order to find useful texts for close reading.

Describing her work on the Pennsylvania Gazette, an American colonial newspaper spanning nearly a century, Sharon Block frames topic modeling as a “promising way to move beyond keyword searching.” Instead of relying on individual words to identify articles relevant to our research questions, we can watch how the “entire contents of an eighteenth-century newspaper change over time.”

To make this concrete, Block reports some of the most common topics that appeared across Gazette articles, including the words that were found to cluster and a label reflecting her own after-the-fact interpretation of those words and articles in which they appear.

If we were searching through an archive for articles on colonial textiles by keyword alone, we might think to look for articles including words like silk, cotton, and cloth, but a word like fine would be trickier to use since it has multiple, common meanings, not to mention the multivalence of gendered words like women and men.

Beyond simply guiding us to articles of interest, Block suggests that we can use topic modeling to inform our close readings by tracking topic prevalence over time and especially the relationships among topics. For example, she notes that articles relating to Cloth peak in the 1750s at the very moment the Religion topic is at its lowest, and wonders aloud whether we can see “colonists (or at least Gazette editors) choosing consumption over spirituality during those years.” This observation compels further close readings of articles from that decade in order to understand better why and how consumption and spirituality competed on the eve of the American Revolution.
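Tracking topic prevalence over time amounts to averaging each topic's share of the articles published in a given period. The sketch below assumes a topic model has already assigned topic proportions to each article. The records, the numbers, and the function name prevalence_by_decade are invented for illustration and are not drawn from Block's data.

```python
from collections import defaultdict

# Hypothetical per-article output of a topic model: a year plus topic proportions.
# These values are made up for the example; they are NOT Block's results.
articles = [
    {"year": 1731, "topics": {"Cloth": 0.10, "Religion": 0.40}},
    {"year": 1738, "topics": {"Cloth": 0.20, "Religion": 0.30}},
    {"year": 1752, "topics": {"Cloth": 0.50, "Religion": 0.10}},
    {"year": 1757, "topics": {"Cloth": 0.30, "Religion": 0.10}},
]

def prevalence_by_decade(records):
    """Average each topic's proportion over all articles in the same decade."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for rec in records:
        decade = rec["year"] // 10 * 10  # 1757 -> 1750
        counts[decade] += 1
        for topic, share in rec["topics"].items():
            sums[decade][topic] += share
    return {
        decade: {topic: total / counts[decade] for topic, total in topics.items()}
        for decade, topics in sorted(sums.items())
    }

trend = prevalence_by_decade(articles)
for decade, shares in trend.items():
    print(decade, {topic: round(share, 2) for topic, share in shares.items()})
```

A table like this only points to decades worth a closer look. Deciding what a rise or fall means still depends on reading the articles themselves.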

Following Block's suggestion, several humanists since have tracked topics over time in different corpora in order to interpret underlying social conditions.

Robert K. Nelson's project Mining the Dispatch topic models articles from the Richmond Daily Dispatch, the paper of record of the Confederacy, over the course of the American Civil War. In a series of short pieces on the project website and that of the New York Times, Nelson does precisely the kind of guided close reading that Block indicates.

Following two topics that seem to rise and fall in tandem, Anti-Northern Diatribes and Poetry and Patriotism, Nelson identifies them as two sides of the same coin in the war effort. Taken together, they reveal how the Confederacy understood itself in relation to the war, and their simultaneous spikes and drops offer what he refers to as “a cardiogram of the Confederate nation.”

Andrew Goldstone and Ted Underwood similarly use readings of individual articles to ground and illustrate the trends they discover in their topic model of 30,000 articles in literary studies spanning the twentieth century. Their initial goal is to test the conventional wisdom of literary studies – for example, the mid-century rise of New Criticism that is supplanted by theory during the 1970s-80s – which their study confirms in broad strokes.

However, they also find that there are other kinds of changes that occur at a longer scale regarding an “underlying shift in the justification for literary study.” Whereas the early part of the century had tended to emphasize “literature's aesthetically uplifting character,” contemporary scholars have refocused attention on “topics that are ethically provocative,” such as violence and power. Questions of how and why to study literature appear deeply intertwined with broader changes in the academy and society.

Matt Jockers has used topic modeling to study the social conditions of novelistic production; however, he has placed greater emphasis on the relationship between authorial identity – especially gender and nationality – and subject matter. For example, in an article co-written with David Mimno, he looks not only at whether topics are used more frequently by women than men, but also at how the same topic may be used differently based on authorial gender. (See also Macroanalysis, Ch. 8, “Theme”)

Topic Modeling as Literary Theoretical Springboard

The above-mentioned projects are primarily historical in nature. Recently, literary scholars have used topic modeling to ask more aesthetically oriented questions regarding poetics and theory of the novel.

Studying poetry, Lisa Rhody uses topic modeling as an entry point on figurative language. Looking at the topics generated from a set of more than 4000 poems, Rhody notes that many are semantically opaque. It would be difficult to assign labels to them in the way that Block had for the Pennsylvania Gazette topics, but she does not treat this as a failure on the computer's part.

In Rhody's words, “Determining a pithy label for a topic with the keywords death, life, heart, dead, long, world, blood, earth… is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them.”

So she does just that. As might be expected from the keywords she names, many of the poems in which the topic is most prominent are elegies. However, she admits that a “pithy label” like “Death, Loss, and Inner Turmoil” fails to account for the range of attitudes and problems these poems consider, since this kind of figurative language necessarily broadens a poem's scope. Rhody closes by noting that several of these prominently elegiac poems are by African-American poets meditating on race and identity. Figurative language serves not only as an abstraction but as a dialogue among poets and traditions.

Most recently, Rachel Sagner Buurma has framed topic modeling as a tool that can productively defamiliarize a text and uses this to explore novelistic genre. Taking Anthony Trollope's six Barsetshire novels as her object of study, Buurma suggests that we should read the series not as a formal totality – as we might do for a novel with a single, omniscient narrator – but in terms of its partial and uneven nature. The prominence of particular topics across disparate chapters offers alternate traversals through the books and across the series.

As Buurma finds, the topic model reveals the “layered histories of the novel's many attempts to capture social relations and social worlds through testing out different genres.” In particular, the periodic trickle of a topic letter, write, read, written, letters, note, wrote, writing... captures the subject matter of correspondence, and reading those chapters also reveals “the ghost of the epistolary novel” haunting Trollope long after its demise. Genres and genealogies that only show themselves partially may be recovered through this kind of method.

Closing Thought

What exactly topic modeling captures about a set of texts is an open debate. Among humanists, words like theme and discourse have been used to describe the statistically derived topics. Buurma frames them as fictions we construct to explain the production of texts. For their part, computer scientists don't really claim to know what they are either. But as it turns out, this kind of interpretive fuzziness is entirely useful.

Humanists are using topic modeling to reimagine relationships among texts and keywords. This allows us to chart new paths through familiar terrain by drawing ideas together in unexpected or challenging ways. Yet the findings produced by topic modeling consistently call us back to close reading. The hardest work, as always, is making sense of what we've found.