Monthly Archives: September 2011

I thought I would share the description of a graduate course I’ll be teaching in Spring 2012. It’s targeted specifically at students in English literature. So instead of teaching an “introduction to digital humanities” as a whole, I’ve decided to focus on the parts of this research program that seem to integrate most easily into literary study. I want to help students take risks — but I also want to focus, candidly, on risks that seem likely to produce useful credentials within the time frame of graduate study.

I think the perception among professors of literature may be that TEI-based editing is the digital tool that integrates most easily into what we do. But where grad students are concerned, I think new modes of collection-mapping are actually more widely useful, because they generate leads that can energize projects not otherwise centrally “digital.” This approach is technically a bit more demanding than TEI would be, but if students are handed a few simple modules (LSA-based topic modeling, Dunning’s log likelihood, collocation analysis, entity extraction, time series graphing) I think it’s fairly easy to reveal discourses, trends, and perhaps genres that no one has discussed. I’ll be sharing my own tools built in R, and an 18-19c collection I have developed in collaboration with E. Jordan Sellers. But I’ll also ask students to learn some basic elements of R themselves, so that they can adapt or connect modules and generate their own visualizations. As we get into problems that exceed the power of the average Mac, I’ll introduce students to the modular resources of SEASR. Wish us luck — it’s an experiment!

ENGL 581. Digital Tools and Critical Theory. Spring 2012.

Critical practice is already shaped by technology. Contemporary historicism emerged around the same time as full-text search, for instance, and would be hard to envision without it. Our goal in this course will be to make that relationship more reciprocal by using critical theory to shape technology in turn. For example, the prevailing system of “keyword search” requires scholars to begin by guessing how another era categorized the world. But much critical theory suggests that we cannot predict those categories in advance, and there are ways of mapping an archive that don’t require us to.

I’ve found that it does make a difference: when critics build their own tools, they can uncover trends and discourses that standard search technology does not reveal. The course will not assume any technical background, although it does assume willingness to learn a few basic elements of programming and statistics. Many of the tools/collections we need are already available on the web; others I can give you, or show you how to cobble together. We will often take time out from building things to read theory — like Moretti’s Maps, Graphs, Trees (2005), corpus linguistics, and influential critiques of or definitions of the digital humanities. But we will not mostly be writing about digital humanities. Instead I’ll recommend writing an ordinary critical essay about literary/cultural history, subtly informed by new tools or new models of discourse. (Underline “subtly.”) Projects on any period are possible, although the resources I can provide are admittedly richest between 1700 and 1900.

*****
By the way, it would be churlish of me not to acknowledge that I’ve learned much of what I know about this topic from grad students, and especially (where methodology is concerned) from Benjamin Schmidt, whose blog posts are an education in themselves and will certainly be on the syllabus. “Graduate education” in this field is a very circular process.

I wrote a long post last Friday arguing that topic-modeling an 18c collection is a reliable way of discovering eighteenth- and nineteenth-century trends, even in a different collection.

But when I woke up on Saturday I realized that this result probably didn’t depend on anything special about “topic modeling.” After all, the topic-modeling process I was using was merely a form of clustering. And all the clustering process does is locate the hypothetical centers of more-or-less-evenly-sized clouds of words in vector space. It seemed likely that the specific locations of these centers didn’t matter. The key was the underlying measure of association between words — “cosine similarity in vector space,” which is a fancy way of saying “tendency to be common in the same 18c volumes.” Any group of words that were common (and uncommon) in the same 18c volumes would probably tend to track each other over time, even into the next century.

Six words that tend to occur in the same 18c volumes as 'gratify' (in TCP-ECCO), plotted over time in a different collection (a corrected version of Google ngrams).

To test this I wrote a script that chose 200 words at random from the top 5000 in a collection of 2,193 18c volumes (drawn from TCP-ECCO with help from Laura Mandell), and then created a cluster around each word by choosing the 25 words most likely to appear in the same volumes (cosine similarity). Would pairs of words drawn from these randomly distributed clusters also show a tendency to correlate with each other over time?

They absolutely do. The Fisher weighted mean pairwise r for all possible pairs drawn from the same cluster is .267 in the 18c and .284 in the 19c (the 19c results are probably better because Google’s dataset is better in the 19c even after my efforts to clean the 18c up*). At n = 100 (measured over a century), both correlations have rock-solid statistical significance, p < .01. And in case you're wondering … yes, I wrote another script to test randomly selected words using the same statistical procedure, and the mean pairwise r for randomly selected pairs (factoring out, as usual, partial correlation with the larger 5000-word group they’re selected from) is .0008. So I feel confident that the error level here is low.**

What does this mean, concretely? It means that the universe of word-frequency data is not chaotic. Words that appear in the same discursive contexts tend, strongly, to track each other over time, and (although I haven’t tested this rigorously yet), it’s not hard to see that the converse proposition is also going to hold true: words that track each other over time are going to tend to have contextual associations as well.

To put it even more bluntly: pick a word, any word! We are likely to be able to define not just a topic associated with that word, but a discourse — a group of words that are contextually related and that also tend to wax and wane together over time. I don’t imagine that in doing so we prove anything of humanistic significance, but I do think it means that we can raise several thousand significant questions. To start with: what was the deal with eagerness, gratification, and disappointment in the second half of the eighteenth century?

* A better version of Google’s 18c dataset may be forthcoming from the NCSA.

** For people who care about the statistical reliability of data-mining, here’s the real kicker: if you run a Benjamini-Hochberg procedure on these 200 randomly-generated clusters, 134 of them have significance at p < .05 in the 19c even after controlling for the false discovery rate. To put that more intelligibly, these are guaranteed not to be xkcd’s green jelly beans. The coherence of these clusters is even greater than the ones produced by topic-modeling, but that’s probably because they are on average slightly smaller (25 words); I have yet to test the relative importance of different generation procedures while holding cluster size rigorously constant.

However, Davidson says such kind things about the digital humanities that someone needs to pour in a few grains of salt. And since I’m a digital humanist, it might as well be me.

To reimagine a global humanism with relevance to the contemporary world means understanding, using, and contributing to new computational tools and methods. … Even a few examples show how being open to digital possibilities changes paradigms and brings new ways of reimagining the humanities into the world.

Reading this, I find myself blushing and stammering. And what I’m stammering is: “slow down a sec, because I’m not sure how central any of this is really going to be to our pedagogical mission.”

I’m going to teach a graduate course on digital humanities next semester, because I’m confident that information technology will change (actually, already has changed) the research end of our discipline. But I’m not yet sure about the implications at the undergraduate level. Maybe ten years from now I’ll be teaching text mining to undergrads … but then again, maybe the things undergraduates need most from an English course will still be historical perspective, close reading, a willingness to revise, and a habit of considering objections to their own thesis.

I’m sure that text mining belongs in undergraduate education somewhere. It raises fascinating social and linguistic puzzles. But I’m not sure whether we’ll be able to fit all the puzzles raised by technological change into the undergrad English major. It’s possible that English departments will want to stay focused on an older mission, leaving these new challenges to be scooped up by Linguistics or Computer Science. If that happens, it’s okay with me. It’s not particularly crucial that all the projects I care about be combined in a single department.

I’m dwelling on this because I feel humanists spend way too much time these days arguing about “what we need to do in order to keep the discipline from shrinking.” Sometimes the answer offered is a) return to our core competence, and sometimes the answer is b) boldly take on some new mission. But really I want to answer c) it is not our job to keep the discipline from shrinking, and we shouldn’t do anything purely for that reason. Our job is to make sure that we keep passing on the critical skills that the humanities develop best, at the same time as we explore new intellectual challenges.

Maybe those new challenges require us to expand. Or maybe it turns out that new challenges are relevant mostly at the graduate level, whereas at the undergraduate level we already have our hands full teaching students social history, close reading, and revision. And maybe that means that departments of English do end up shrinking relative to Communications or CompSci. If so, I hope it doesn’t happen rapidly, because I care about the fortunes of particular graduate students. But in the long term, it would not be a tragedy. Ideas matter. Departmental boundaries don’t. Intellectual history is not a contest to see who can retain the most faculty.

UPDATE Dec. 30 2011: I have to admit that my mind is in the process of being changed about this. After participating in a NITLE-sponsored seminar about teaching digital humanities at the undergraduate level, I’m much less hesitant than I was in September. Ryan Cordell, Brian Croxall, and Jeff McClurken presented really impressive digital-humanities courses that were also deeply grounded in the context of a specific discipline. Recording available at the link above.

While I’m fascinated by cases where the frequencies of two, or ten, or twenty words closely parallel each other, my conscience has also been haunted by a problem with trend-mining — which is that it always works. There are so many words in the English language that you’re guaranteed to find groups of them that correlate, just as you’re guaranteed to find constellations in the night sky. Statisticians call this the problem of “multiple comparisons”; it rests on a fallacy that’s nicely elucidated in this classic xkcd comic about jelly beans.
Simply put: it feels great to find two conceptually related words that correlate over time. But we don’t know whether this is a significant find, unless we also know how many potentially related words don’t correlate.

One way to address this problem is to separate the process of forming hypotheses from the process of testing them. For instance, we could use topic modeling to divide the lexicon up into groups of terms that occur in the same contexts, and then predict that those terms will also correlate with each other over time. In making that prediction, we turn an undefined universe of possible comparisons into a finite set.

Once you create a set of topics, plotting their frequencies is simple enough. But plotting the aggregate frequency of a group of words isn’t the same thing as “discovering a trend,” unless the individual words in the group actually correlate with each other over time. And it’s not self-evident that they will.

The top 15 words in topic #91, "Silence/Listened," and their cosine similarity to the centroid.

So I decided to test the hypothesis that they would. I used semi-fuzzy clustering to divide one 18c collection (TCP-ECCO) into 200 groups of words that tend to appear in the same volumes, and then tested the coherence of those topics over time in a different 18c collection (a much-cleaned-up version of the Google ngrams dataset I produced in collaboration with Loretta Auvil and Boris Capitanu at the NCSA). Testing hypotheses in a different dataset than the one that generated them is a way of ensuring that we aren’t simply rediscovering the same statistical accidents a second time.

To make a long story short, it turns out that topics have a statistically significant tendency to be trends (at least when you’re working with a century-sized domain). Pairs of words selected from the same topic correlated significantly with each other even after factoring out other sources of correlation*; the Fisher weighted mean r for all possible pairs was 0.223, which measured over a century (n = 100) is significant at p < .05.

In practice, the coherence of different topics varied widely. And of course, any time you test a bunch of hypotheses in a row you're going to get some false positives. So the better way to assess significance is to control for the "false discovery rate." When I did that (using the Benjamini-Hochberg method) I found that 77 out of the 200 topics cohered significantly as trends.

There are a lot of technical details, but I'll defer them to a footnote at the end of this post. What I want to emphasize first is the practical significance of the result for two different kinds of researchers. If you're interested in mining diachronic trends, then it may be useful to know that topic-modeling is a reliable way of discovering trends that have real statistical significance and aren’t just xkcd’s “green jelly beans.”

The top 15 terms in topic #89, "Enemy/Attacked," and their cosine similarity to the centroid.

Conversely, if you're interested in topic modeling, it may be useful to know that the topics you generate will often be bound together by correlation over time as well. (In fact, as I’ll suggest in a moment, topics are likely to cohere as trends beyond the temporal boundaries of your collection!)

Finally, I think this result may help explain a phenomenon that Ryan Heuser, Long Le-Khac, and I have all independently noticed: which is that groups of words that correlate over time in a given collection also tend to be semantically related. I've shown above that topic modeling tends to produce diachronically coherent trends. I suspect that the converse proposition is also true: clusters of words linked by correlation over time will turn out to have a statistically significant tendency to appear in the same contexts.

Why are topics and trends so closely related? Well, of course, when you’re topic-modeling a century-long collection, co-occurrence has a diachronic dimension to start with. So the boundaries between topics may already be shaped by change over time. It would be interesting to factor time out of the topic-modeling process, in order to see whether rigorously synchronic topics would still generate diachronic trends.

I haven’t tested that yet, but I have tried another kind of test, to rule out the possibility that we’re simply rediscovering the same trends that generated the topics in the first place. Since the Google dataset is very large, you can also test whether 18c topics continue to cohere as trends in the nineteenth century. As it turns out, they do — and in fact, they cohere slightly more strongly! (In the 19c, 88 out of 200 18c topics cohered significantly as trends.) The improvement is probably a clue that Google’s dataset gets better in the nineteenth century (which god knows, it does) — but even if that’s true, the 19c result would be significant enough on its own to show that topic modeling has considerable predictive power.

Practically, it’s also important to remember that “trends” can play out on a whole range of different temporal scales.
For instance, here’s the trend curve for topic #91, “Silence / Listened,” which is linked to the literature of suspense, and increases rather gradually and steadily from 1700 to the middle of the nineteenth century.
By contrast, here’s the trend curve for topic #89, “Enemy/Attacked,” which is largely used in describing warfare. It doesn’t change frequency markedly from beginning to end; instead it bounces around from decade to decade with a lot of wild outliers. But it is in practice a very tightly-knit trend: a pair of words selected from this topic will have on average 31% of their variance in common. The peaks and outliers are not random noise: they’re echoes of specific armed conflicts.

* Technical details: Instead of using Latent Dirichlet Allocation for topic modeling, I used semi-fuzzy c-means clustering on term vectors, where term vectors are defined in the way I describe in this technical note. I know LDA is the standard technique, and it seems possible that it would perform even better than my clustering algorithm does. But in a sufficiently large collection of documents, I find that a clustering algorithm produces, in practice, very coherent topics, and it has some other advantages that appeal to me. The “semi-fuzzy” character of the algorithm allows terms to belong to more than one cluster, and I use cosine similarity to the centroid to define each term’s “degree of membership” in a topic.

I only topic-modeled the top 5000 words in the TCP-ECCO collection. So in measuring pairwise correlations of terms drawn from the same topic, I had to calculate it as a partial correlation, controlling for the fact that terms drawn from the top 5k of the lexicon are all going to have, on average, a slight correlation with each other simply by virtue of being drawn from that larger group.