Visualizing 27 years, 12 million words of the Humanist list

Back in September, I spent some time playing around with a little project called Textplot, which converts a document into a network of terms by computing similarities between the distribution patterns of individual pairs of words. When you pass the network through a force-directed layout algorithm, it folds out into a kind of conceptual atlas of the document, a two-dimensional diagram that teases out the underlying topic structure of the text. War and Peace, for example, turns into a big triangle – war on the left, peace on the right, Tolstoy’s essays about history on top.

Under the hood, each word is converted into a probability density function across the width of the document – this makes it possible to compute a really fine-grained similarity score between any two words, which can then be used as an edge weight in the network:
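The mechanics can be sketched in a few lines of Python (a simplified illustration, not Textplot’s actual code – the word offsets are made up, and the overlap score here stands in for whichever similarity metric you prefer):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up offsets (fractions of the document's width) where three
# hypothetical words appear.
offsets = {
    'war':   [0.05, 0.10, 0.12, 0.20, 0.25, 0.30],
    'army':  [0.08, 0.15, 0.18, 0.22, 0.28, 0.33],
    'peace': [0.70, 0.75, 0.80, 0.82, 0.90, 0.95],
}

xs = np.linspace(0, 1, 500)

def density(word):
    """Kernel density estimate of a word's distribution across the text."""
    return gaussian_kde(offsets[word])(xs)

def similarity(w1, w2):
    """Overlap between two densities. Since each integrates to ~1
    regardless of raw frequency, any two words are comparable, and
    the score can serve as an edge weight in the term network."""
    return np.minimum(density(w1), density(w2)).sum() * (xs[1] - xs[0])

print(similarity('war', 'army'))   # high - similar distribution patterns
print(similarity('war', 'peace'))  # low - the words occupy different regions
```

Because the densities are normalized, a word that appears six times and a word that appears six thousand times trace out the same total area, which is what makes the pairwise comparison meaningful.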

In a lot of ways, the density functions look like the time-series plots that crop up in visualizations of how the frequency of a word changes over time – most notably, of course, the Google Ngram viewer, but also projects like the New York Times’ “Chronicle” tool, which does something similar for the NYT corpus going back to the middle of the 19th century. They’re not exactly the same thing, though. The biggest difference is that the density functions are normalized so that they always trace out the same amount of area over the X-axis, regardless of how many times the word shows up in the document – this is what makes it possible to compare the distributions of any two words, even if one shows up 1,000 times and the other just 20. The raw word counts in the n-gram viewers, by contrast, show the absolute difference in frequency between words. But the gist is similar – both capture information about how something fluctuates over time.

With the novels, though, the “time” axis isn’t really time at all, but instead what Matt Jockers calls the “novel time” of the text – the interval between the beginning and the end. This got me thinking – what would happen if the X-axis actually were, in fact, time in the literal sense of the word? What if the “text” were actually a huge corpus of documents, daisy-chained together in chronological order into a single mega-document, so that the “novel time” of the text corresponds to the “historical time” of the corpus? Would Textplot surface some kind of broad, diachronic shift in semantic focus over time, in the way that it captures the linear progressions in texts like Walden and the Divine Comedy?

Data cleaning

I decided to try this out with the Humanist list, the venerable, 27-year-old email listserv started by Willard McCarty at the University of Toronto in 1987. This seemed like a good place to start for a couple of reasons. The fulltext archive can be downloaded directly from dhhumanist.org – and, as a built-in freebie, email is inherently chronological, which meant that I didn’t have to do any kind of prep work to make sure that the documents were in the right order. I was also kind of interested to try this with a corpus that I don’t know very much about. I’ve subscribed to the Humanist for the last couple of years, but I’ve only read it on and off (for a long time, gmail flagged it as spam!), and I certainly don’t know anything about what the list was like for the 25 years before 2012. This is an interesting opportunity, though, to test the usefulness of this kind of approach – without reading the entire thing, what could I learn about it?

I downloaded all of the 27 year-long archive files, and started by writing a little Python script to scrub out the large quantity of non-human-readable “header” information that gets prepended to most of the emails, leaving just the text that had actually been typed out by people (for the most part). Then, I just concatenated all of the files together into a single, 80 megabyte humanist.txt file, which Textplot parses out into a cool 11.5 million words, consisting of 138,476 unique types. Last, I made a couple of tweaks to the logic that determines which words get added to the final network – I wanted to pick words that are the most characteristic of a particular period in the history of the corpus, in an effort to get the most coherent portrait of the diachronic shift over time (more on this later). Once all the pieces were in place, I built out the graph, fired up Gephi, flipped on Force Atlas 2, and watched the network open up into a big, spindly line – 1987 on the left, 2014 on the right:
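The scrubbing-and-concatenating step looks roughly like this (a hypothetical sketch – the header fields matched here are stand-ins, and the real archive files carry a lot more prepended metadata than this handful):

```python
import re

# Hypothetical header fields, for illustration only - the actual
# Humanist archives have messier, more varied metadata lines.
HEADER = re.compile(r'^(From|Date|Subject|X-Humanist|Message-ID):.*\n',
                    re.MULTILINE)

def scrub(email):
    """Strip header lines, leaving the text people actually typed."""
    return HEADER.sub('', email)

emails = [
    'From: mccarty@example.org\nDate: 12 May 1987\nThis is test number 1.\n',
    'Subject: Humanist discussion\nPlease acknowledge.\n',
]

# Concatenate the scrubbed emails, in order, into one mega-document.
corpus = ''.join(scrub(e) for e in emails)
print(corpus)
```

In the real pipeline the list comprehension would iterate over the 27 year-long archive files, sorted chronologically, and write the result out to humanist.txt.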

Almost immediately, I found myself tabbing back and forth between Gephi and the iPython terminal, pulling up the density functions for different terms to see how they do (or don’t) map onto the positions of the nodes in the network. This got annoying pretty quickly, so I decided to write some code that would take the raw GML that comes out of Gephi and turn it into an interactive, d3-powered viewer that makes it easier to compare the positions of the nodes in the network with the distribution profiles of the words across the history of the corpus:

As the network is panned and zoomed, the time axis at the bottom of the screen will automatically refocus so that it always shows the (approximate) temporal range of the current viewport:

And, to see the temporal pattern of an individual word, click on the label to open up a little chart that shows the kernel density estimate for that word, aligned below the “minimap” at the top right, which makes it easy to see how the final placement of the word compares to the density profile:

The correspondence is pretty tight, though not exact. Most of the nodes line up pretty closely with the approximate “center of mass” of their density functions, although some words get pulled off into weird positions that don’t really make sense. This is usually because the distribution of the word is really distinctly multimodal (it clusters in more than one place), which causes the layout algorithm to drag the node into a kind of median position, a no-man’s-land between the different regions of the network that the word is bound to.

The “spatial turn” seems to register around the same time (spatial, gis), and the network then drifts into more or less the present moment. PhD, studentship, and postdoctoral, all of which have been gradually gaining ground since about the turn of the century, all surge around 2012, perhaps tracking the formalization of DH as a discrete field of study, instead of just a methodology that gets mixed into existing disciplines? At the far right is a cluster of terms related to social media and modern web products, which provides a tidy counterweight to the 80s-era technologies to the left – gmail, wordpress, ipad, blogspot. And, of course, twitter and facebook, both of which peak in unison before rapidly falling off in the spring of 2012, which seems to have been the moment of peak-DH-social-media? It’s fascinating to see just how late in the game digitalhumanities (and then, a bit later, the abbreviated dh) come into view – neither existed before about 2005.

This is especially fun for me because the history of the list maps almost exactly onto my own life! I was born on June 25, 1987, just 44 days after Willard McCarty sent the first message on May 12:

“This is test number 1. Please acknowledge.”

Next steps

This seems to work well for the Humanist, but I’m curious to see how the same technique generalizes to other corpora – especially at really large scales, and over much longer temporal intervals. I’m thinking about trying to do something similar with the newly released feature-count data set from HathiTrust, which provides page-level term counts for 250,000 volumes published between 1431 and 2010. Would you get the same kind of broad, unified, coherent diachronic progression that surfaced out of the Humanist? Or is that the exception, not the rule?

Domenico

Dear David, congratulations on this fascinating work. However, I wonder what exactly you mean by “leaving just the text that had actually been typed out by people (for the most part)”. I’ve searched for important terms like “philology” (373 occurrences, according to a Google search within the mailing list), and it’s not there. Besides, some names of people are there, while others are not. What would the scientific and cultural value of such a representation be if the data entry is not accurate? Thanks!

Marjorie

Dear David, very interesting post, and also very interesting piece of software! I would certainly like to play with it on my own data. Of course I share Domenico’s concern regarding the absence / presence of some terms. Is there a threshold under which relatively rare words are not plotted? Please tell us more about Textplot 🙂

dclure

Hey Domenico and Marjorie,

Thanks for taking a look at this! There definitely are words that are missing from the network – in fact, almost _all_ of them are missing. The full Humanist corpus contains about 140,000 unique word types, and just 1,000 – well under 1% – are included in the final visualization. As is always the case, most of the words appear just once or a small handful of times, which doesn’t provide enough information to generalize about the placement of a word in a meaningful way. This is especially true with Textplot, where the links between words are determined by measuring the similarity between the overall patterns of distribution of the words. You can estimate a density function for a word that appears 3-4 times out of 12 million, but, since there’s so little statistical power, it’ll probably just be noise – you’d end up with links between words that don’t capture any kind of meaningful semantic / conceptual / chronological connection. And, from a design standpoint – even if there were a way to bind the infrequent words into the network, you probably wouldn’t actually want to, since you’d end up with a really dense tangle of words that would be hard to make sense of.
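To make that point concrete, here is a little simulation of my own (not from the original exchange): draw two independent samples of offsets from the same underlying distribution and compare the resulting density estimates. With four occurrences the two estimates barely agree; with thousands, they are nearly identical.

```python
import numpy as np
from scipy.stats import gaussian_kde

xs = np.linspace(0, 1, 200)

def estimated_density(n, seed):
    """KDE built from n offsets drawn from the same (uniform) source."""
    rng = np.random.default_rng(seed)
    return gaussian_kde(rng.uniform(0, 1, n))(xs)

def disagreement(n):
    """Mean absolute gap between two independent estimates of the
    same distribution, each based on n occurrences."""
    return np.abs(estimated_density(n, seed=1)
                  - estimated_density(n, seed=2)).mean()

print(disagreement(4))     # large - four occurrences are mostly noise
print(disagreement(2000))  # small - the estimate has stabilized
```

The uniform distribution and the seeds are arbitrary choices for the demo; the qualitative result is the same for any underlying shape.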

The question of how to pick out the subset of words that get added to the network (and how many to pick) is really interesting and complicated, and something that I’m still experimenting with. The simplest approach is just to take the top X most common words – this is what I did for the visualizations of the novels that I did a few weeks ago. As I mentioned in the post, though, in this case I wanted to try to pick the words that are the most temporally “focused” – that are very representative of one particular period in the history of the list, with the goal of getting the cleanest progression from 1987 -> 2014. You want words that are the most “unimodal,” that cluster together in one place and nowhere else. I took this approach:

1. Take the 3000 most frequent words.

2. For each, compute the standard deviation of the offsets in the corpus where the word appears.

3. Take the 1000 words with the _lowest_ standard deviations, the intuition being that a low value means that most of the word’s occurrences cluster pretty tightly around the mean. This rewards “unimodality,” and penalizes words that are really distinctly multimodal, that cluster in different and distant parts of the corpus.
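In code, the three steps look something like this (a sketch of the logic, not Textplot’s actual implementation – the toy token stream and thresholds are invented for the demo):

```python
import numpy as np
from collections import Counter

def select_terms(tokens, candidates=3000, keep=1000):
    """Of the `candidates` most frequent types, keep the `keep` words
    whose offsets have the lowest standard deviation - i.e. the most
    temporally 'focused', unimodal words."""
    offsets = {}
    for i, token in enumerate(tokens):
        offsets.setdefault(token, []).append(i / len(tokens))
    frequent = [w for w, _ in Counter(tokens).most_common(candidates)]
    return sorted(frequent, key=lambda w: np.std(offsets[w]))[:keep]

# Toy stream: 'unimodal' clusters tightly in the middle, 'multimodal'
# splits between the two ends, 'filler' is spread throughout.
tokens = (['multimodal'] * 3 + ['filler'] * 10 + ['unimodal'] * 5 +
          ['filler'] * 10 + ['multimodal'] * 3)

print(select_terms(tokens, candidates=3, keep=2))
# -> ['unimodal', 'filler']
```

The multimodal word lands last in the ranking because its occurrences sit far from their shared mean, exactly the penalty described in step 3.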

This is the reason that there will be words that occur frequently but don’t show up in the final network – “philology,” for example, has pretty distinct spikes at the beginning, middle, and end of the corpus (see image), which apparently pushed it out of the top 1000 words. There’s probably a more sophisticated way to do this, but this seemed to work pretty well (I would love to brainstorm about other approaches!). For more about Textplot, check out this blog post: