Looking at a Dataset for Distant Reading: Some Anticlimaxes

August 10, 2015

I’ve been trying to think intelligently about the place of quantitative data in literary studies, especially in light of two excellent posts, one by Andrew Goldstone, the other by Tressie McMillan Cottom, both responding to this review by Ben Merriman.

But before I could even try to say something interesting in response, Ted Underwood announced that he was making available “a dataset for distant-reading literature in English, 1700-1922” (here is a link to the data). This post is a look at that data, mostly using R. I have, essentially, nothing thoughtful to offer in this post; instead, this is an exploration of this dataset (many, many thanks to Ted Underwood and HathiTrust for this fascinating bounty), studded with some anticlimaxes in the form of graphs that do little beyond give a sense of how one could begin to think about this dataset.

With the exception of a bash script (which may, though, be the most repurposable bit of code), everything here is done in R. I don’t like R, and I’m not very good with it. I think R’s datatypes are what make it a challenge; lists in particular seem to materialize out of nowhere and are frustrating to use. But it is great for making pretty graphs and getting an initial handle on a bunch of data.
I try to comment on, and explain, the code below (often in comments)—though if you’ve never looked at R, this may seem really weird. I also may have made some horrible mistakes; if so, please let me know.

The New HathiTrust Data Set

Underwood calls this dataset “an easier place to start with English-language literature” within the HathiTrust dataset. I had poked around the HathiTrust data before, and it really is a very complicated undertaking. The dataset that Underwood has provided makes this much, much easier.

The data can be downloaded here. In this post I’ll look at the fiction metadata, and take a peek at the fiction word counts for the years 1915–1919. Those files look something like this:

fiction_metadata.csv: 17 megabytes, containing author, title, date, and place for each work of fiction. It also includes subjects, an id for HathiTrust (htid), and other fields.

In a directory I uncompressed fiction_1915-1919.tar.gz. The result is 8656 files, each representing a single work, and totalling 827 megabytes. (827 megabytes of text is not “big data,” but it is enough to make toying with it on your laptop a little tricky at times.)

Examining the Metadata: Volumes of Fiction Per Year

library(ggplot2)

# Load the metadata from the CSV file
fiction.data <- read.csv('fiction_metadata.csv', header=T)

# Let's look at how many items we have for each date.
ggplot(fiction.data) +
  geom_histogram(aes(x=date), binwidth=1) +
  ggtitle('Books per Year in Fiction Dataset') +
  xlab('Year') +
  ylab('Number of Books Per Year in Fiction Data')

This gives a sense of just how few books from before 1800 are in this dataset.

That is, 101948 volumes total, 1129 of which were published prior to 1800, or about 1%. The number of volumes appearing in the dataset per year tends to increase steadily, with a few exceptions. That dip around 1861-1864 may be a result of particularly American factors influencing the dataset; and perhaps it is war again that accounts for some of the dip at this period’s end, though that dip seems to begin prior to 1914.
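Counts like these can be pulled straight from the metadata's date column. Here is a minimal sketch, assuming a toy data frame (toy.metadata, with invented years) standing in for fiction_metadata.csv so that it runs without the download:

```r
# Toy stand-in for the 'date' column of fiction_metadata.csv.
# These six years are invented purely for illustration.
toy.metadata <- data.frame(date = c(1750, 1799, 1800, 1862, 1899, 1915))

# Total number of volumes, and how many predate 1800
total.volumes <- nrow(toy.metadata)
early.volumes <- sum(toy.metadata$date < 1800)

# Share of early volumes, as a percentage
early.share <- 100 * early.volumes / total.volumes
cat(early.volumes, 'of', total.volumes, 'volumes predate 1800\n')

# The same arithmetic applied to the figures quoted in the post
post.share <- 100 * 1129 / 101948

# Volumes per year, handy for inspecting dips such as 1861-1864
per.year <- table(toy.metadata$date)
```

Swapping toy.metadata for the real fiction.data should give the totals quoted above, provided the date column has no missing values.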

Examining the Metadata: Change in Length of Volumes Over Time

The length of each volume is contained in the totalpages field in the metadata file. Let’s plot the length of works of fiction over time (so, plot date by totalpages).

ggplot(fiction.data, aes(x=date, y=totalpages)) +
  geom_point(pch='.', alpha=0.1, color='blue') +
  ggtitle('Length of Books by Year') +
  xlab('Year') +
  ylab('Length of Book, in Pages')

Interesting. It seems that, in the mid-eighteenth century near the dawn of the novel, works of fiction were around 300 pages long. Their length diversified over the course of the novel’s history, as novels grew both longer and shorter as the possibilities for fiction widened, perhaps as a function of increased readership stemming from both the decreasing cost of books and the increasing rate of literacy.

Well, not really. Matthew Lincoln has a very nice post about the dangers of constructing a “just-so” story (often to insist that this graph tells us “nothing new”). But there are at least two problems with the interpretation offered above, one broad and one more specific. Broadly, it is worth reiterating the danger of mistaking this data for an unproblematic representation of any particular historical phenomenon (say, especially, readership of novels). Underwood describes the dataset carefully as representing works held by “American university and public libraries, insofar as they were digitized in the year 2012 (when the project began).” And, of course, lots of other things which would be relevant to an investigation of fiction (think of pulp paperbacks and similar forms) will not be in that sample, because they were often not collected by libraries. (Likewise, as Underwood notes, pre-1800 books are more likely to be held in Special Collections, and therefore not digitized.)

The second point is specific to the graph above. That scatter plot is sparse in the early half of this period and very dense in the latter half. The translucency of each point (set by alpha=0.1) captures some of this, but nevertheless the graph as a whole overemphasizes the increased spread of data, when really what is happening is an increase in the amount of data. If we plot things differently, I think this becomes evident. Let’s break down our data by decade, and then do a box plot per decade of fiction length:

# This helper function will convert a year into a "decade"
# through some simple division and then return the decade
# as a "factor" (an R data-type).
as.Decade <- function(year) {
  decade <- (as.numeric(year) %/% 10) * 10
  return(as.factor(decade))
}

# Add a "decade" column by applying our as.Decade function
# to the data. (The arithmetic inside as.Decade works on a whole
# column at once, so no lapply/unlist is needed, and the decades
# come out as factor levels in sorted order.)
fiction.data$decade <- as.Decade(fiction.data$date)

# Box plot of our length data, grouped by decade
ggplot(fiction.data, aes(x=decade, y=totalpages)) +
  geom_boxplot() +
  ggtitle('Length of Books, Grouped by Decades') +
  xlab('Decade') +
  ylab('Length of Books, in Pages')

This plot confirms that, indeed, we see a greater range in the lengths of works of fiction (so my inference from the previous graph is not completely wrong). But a box plot clarifies what is, to me, a surprising constancy in the length of the works collected in this dataset. The apparent increase in variability in length is real—but it is not the most, or the only, salient feature of this data; this fact is better captured in the second graph (the box plot).
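The same contrast can be put into numbers rather than read off a plot. A small sketch, assuming a toy data frame (toy.lengths, with invented page counts) in place of the real metadata: the median per decade stands for the box plot's centre line, and the range for the spread that the scatter plot exaggerates.

```r
# Toy stand-in for the date and totalpages columns of the metadata.
# All values are invented for illustration.
toy.lengths <- data.frame(
  date       = c(1851, 1855, 1858, 1902, 1904, 1907, 1909),
  totalpages = c(280, 310, 330, 120, 290, 320, 700)
)

# Same integer-division trick as as.Decade, kept numeric here
toy.lengths$decade <- (toy.lengths$date %/% 10) * 10

# Typical length per decade (the centre line of each box)
decade.median <- tapply(toy.lengths$totalpages, toy.lengths$decade, median)

# Spread per decade: interquartile range and full range
decade.iqr <- tapply(toy.lengths$totalpages, toy.lengths$decade, IQR)
decade.range <- tapply(toy.lengths$totalpages, toy.lengths$decade,
                       function(pages) max(pages) - min(pages))

print(rbind(median = decade.median, iqr = decade.iqr, range = decade.range))
```

In this toy case the median barely moves between decades while the range grows a great deal, which is the pattern described above; whether the real data behaves as neatly is something to check, not assume.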

Summary: Frequently Occurring Terms

The file fiction_yearly_summary.csv contains the per-year frequencies of the top 10,000 most frequently occurring tokens in the fiction dataset. We can chart the fluctuations of a term’s use, for instance, across the period.
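Pulling one term's raw yearly counts out of the summary might look something like the sketch below. It assumes the summary has year, word, and termfreq columns (the names used later in this post), and it uses a toy data frame (toy.summary, with invented counts) instead of the real file.

```r
# Toy stand-in for fiction_yearly_summary.csv; all counts are invented.
toy.summary <- data.frame(
  year     = c(1900, 1900, 1900, 1901, 1901, 1901),
  word     = c('love', 'america', '#DICTIONARYWORD',
               'love', 'america', '#DICTIONARYWORD'),
  termfreq = c(50, 5, 10000, 120, 9, 30000),
  stringsAsFactors = FALSE
)

# Keep only the rows for the term we want to chart
love.toy <- subset(toy.summary, word == 'love')

# Sort by year so a line plot connects the points in order
love.toy <- love.toy[order(love.toy$year), ]
print(love.toy[, c('year', 'termfreq')])
```

Handing the resulting data frame to ggplot with geom_line(), year on the x axis and termfreq on the y axis, would chart the raw counts.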

Yet, of course, looking at that sharp rise, we quickly realize (yet again) the importance of normalization. We are not witnessing the explosion of love at the dawn of the twentieth century (and its nearly as rapid declension). We could normalize by adding all the words together, but we only have counts for the top 10,000 words. Thankfully, the dataset offers “three special tokens for each year: #ALLTOKENS counts all the tokens in each year, including numbers and punctuation; #ALPHABETIC only counts alphabetic tokens; and #DICTIONARYWORD counts all the tokens that were found in an English dictionary.”

So, let’s normalize by using #DICTIONARYWORD.

# Load the yearly summary and pull out the rows for "love"
# (skip these two lines if you have already done so).
yearly.summary <- read.csv('fiction_yearly_summary.csv')
love <- subset(yearly.summary, word == 'love')

# Let's extract the #DICTIONARYWORD tokens into a data frame
yearly.total <- subset(yearly.summary, word == '#DICTIONARYWORD')

# Let's simplify this dataframe to just what we're interested in.
yearly.total <- yearly.total[c('year','termfreq')]

# And rename the termfreq column to "total"
colnames(yearly.total) <- c('year','total')

# Now we can use merge to combine this data, giving each row
# a column that contains the total number of (dictionary words)
# for that year.
love.normalized <- merge(love, yearly.total, by=c('year'))

# This method profligately repeats data; but it makes things
# easier. The result looks like this:
head(love.normalized)
#>   year word termfreq correctionapplied  total
#> 1 1701 love      222                 0  37234
#> 2 1702 love        1                 0   7036
#> 3 1703 love      524                 0 416126
#> 4 1706 love       12                 0  36501
#> 5 1708 love      578                 0 482779
#> 6 1709 love      361                 0 133847

# Now, graph the data
ggplot(love.normalized, aes(x=year, y=termfreq/total)) +
  geom_line() +
  xlab('Year') +
  ylab('Normalized Frequency of "love"') +
  ggtitle('The Fate of Love')

Well, that looks about right. Just for fun, let’s try a different term, one that is something less of an ever-fixed mark, but which perhaps alters its relative frequency when it historical alteration finds.

# We subset the term we're interested in.
america <- subset(yearly.summary, word == 'america')

# And normalize using our already-constructed yearly.total
# data frame.
america.normalized <- merge(america, yearly.total, by=c('year'))

# Plot as before, though this time we'll use geom_smooth()
# as well to add a quick "smooth" fit line to get a sense of
# the trend. Minor digression: things like geom_smooth() are one
# of the things that make R great (if very dangerous) for an
# utter amateur.
ggplot(america.normalized, aes(x=year, y=termfreq/total)) +
  geom_line() +
  geom_smooth() +
  xlab('Year') +
  ylab('Normalized Frequency of "america"') +
  ggtitle("Occurrences of 'america' in the Dataset")

Not sure there’s much surprising here, but okay, seems reasonablish.

Extracting Counts from Individual Volume Files

Now, what if you want to look at terms that don’t occur in the top 10,000? Then, you need to dig into the files for individual volumes. For simplicity’s sake, I’ll look only at one set of those files, representing volumes of fiction between 1915 and 1919, which I’ve uncompressed in a subdirectory called fiction_1915-1919.

I’ve been using R for everything so far, and I imagine you could use R to loop over the files in the directory, open them up and look for a specified term. As someone who finds R idiosyncratic to the point of excruciation, this doesn’t sound particularly fun. R is great when you’re manipulating/plotting data frames, less so when doing more complicated tasks on the filesystem. So, to extract the information we want, I’ll use a simple bash script.
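For comparison, here is a rough sketch of what the R route might look like. It assumes each volume file holds two tab-separated columns (token, then count) with no header row; count.token and the toy files are hypothetical names invented for this illustration. Unlike grep -w, this sketch only matches a token exactly.

```r
# Hypothetical helper: look up one token in every .tsv file in a directory.
# Assumes two tab-separated columns per file (token, count), no header.
count.token <- function(dir, needle) {
  files <- list.files(dir, pattern = '\\.tsv$', full.names = TRUE)
  counts <- vapply(files, function(f) {
    # quote = '' and comment.char = '' stop R from treating tokens
    # such as " or # as special characters.
    volume <- read.delim(f, header = FALSE, quote = '', comment.char = '',
                         col.names = c('token', 'count'),
                         stringsAsFactors = FALSE)
    # which() ignores any token that R happened to read in as NA
    hits <- volume$count[which(volume$token == needle)]
    if (length(hits) == 0) 0 else as.numeric(hits[1])
  }, numeric(1))
  data.frame(htid = sub('\\.tsv$', '', basename(files)),
             count = unname(counts),
             stringsAsFactors = FALSE)
}

# Two toy volumes in a temporary directory (contents invented)
toy.dir <- file.path(tempdir(), 'toy_volumes')
dir.create(toy.dir, showWarnings = FALSE)
writeLines(c('the\t120', 'film\t3'), file.path(toy.dir, 'vol1.tsv'))
writeLines(c('the\t95', 'love\t7'), file.path(toy.dir, 'vol2.tsv'))

film.counts <- count.token(toy.dir, 'film')
print(film.counts)
```

This version writes 0 for absent tokens instead of a blank, at the cost of parsing every file in full; I have not timed it against the bash approach.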

#!/bin/bash

# Our input directory
INPUTDIRECTORY=./fiction_1915-1919

# Let's take a single command line argument ($1) and store it
# as the value we're looking for (the proverbial needle in our
# data haystack).
NEEDLE=$1

# We use this convention, with find and while read,
# because a simple for loop, or ls, might have a problem
# with ~10000 files.
find "$INPUTDIRECTORY" | while read -r file
do
    # For each file, we use grep to search for our term,
    # storing just the number of occurrences in result.
    result=$(grep -w -m 1 "$NEEDLE" "$file" | awk '{ print $2 }')
    # Get the htid of the file we're looking at from the filename
    id=$(basename "$file" .tsv)
    # And then print the result to the screen
    echo "$id,$result"
done

I’m assuming some familiarity with bash scripts. Save this script to a file (say, wordcounter.bash) and make it executable; it’s enough to type chmod +x wordcounter.bash. Then run it with an argument: ./wordcounter.bash positivism will output to the screen; redirect that to a csv (type ./wordcounter.bash positivism > positivism.csv) and you can use it in R. Here is what the results look like when they start appearing on the screen:

Those gibberish-looking strings (bc.ark+=13950=tk19k4r10s) are HathiTrust IDs. Then you get a comma, and after the comma the number of times the term appeared in the file… unless it didn’t appear, in which case you just get a blank.

Some Notes

This will only work on unixy systems: Linux, OS X, or (I assume) Cygwin on Windows.

When a token does not appear in a file, this script outputs the htid, a comma, and then nothing. That’s fine: it’s easier to handle this after we’ve imported the resulting csv (to, say, R) than it would have been to write some logic in this script to output 0. Also, this crude method is probably faster than doing it within R or Python. It could be sped up by doing something fancy, like parallelization. To search through the 8656 files of fiction_1915-1919 for one term took 1 minute and 12 seconds, a totally manageable timeframe. Assuming that rate (about 120 files/second) is roughly constant across the dataset of roughly 180,000 volumes, it should be possible to use this method to search for a term across all the volumes in the dataset in roughly 25 minutes, give or take. That is, of course, based on doing this on my laptop (with a 1.8 GHz Core i5 CPU), with no parallelization (though this should be an eminently parallelizable task, like really). Not fast, but totally manageable.
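The back-of-the-envelope estimate above can be spelled out as a few lines of arithmetic. The figures are the ones quoted in this post; the four-core line at the end is a hypothetical best case, not something I measured.

```r
# Figures quoted in the post
files.searched <- 8656     # volumes in fiction_1915-1919
seconds.taken  <- 72       # 1 minute and 12 seconds
total.volumes  <- 180000   # rough size of the whole dataset

# Observed throughput, and the time to scan everything at that rate
files.per.second  <- files.searched / seconds.taken
estimated.minutes <- total.volumes / files.per.second / 60

# Hypothetical what-if: a perfect speed-up across 4 cores.
# Real gains would be smaller, since disk reads are shared.
cores <- 4
parallel.minutes <- estimated.minutes / cores

cat(round(files.per.second), 'files per second\n')
cat(round(estimated.minutes), 'minutes for the whole dataset\n')
```

The estimate rests on the assumption that volumes elsewhere in the dataset are about the same size as those from 1915–1919.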

Right now, though, all we have is HathiTrust IDs and frequencies of our term (or terms). We have no information about date, or title. So let’s get that information from the metadata files we’ve worked with earlier.

# From our custom culled data. The script writes no header row,
# so tell read.csv not to treat the first line as column names.
gramophone <- read.csv('gramophone.csv', header=FALSE)
film <- read.csv('film.csv', header=FALSE)
typewriter <- read.csv('typewriter.csv', header=FALSE)

# All those spots where a token doesn't occur, which produce blank
# values, are read in as NA; set them to 0.
gramophone[is.na(gramophone)] <- 0
film[is.na(film)] <- 0
typewriter[is.na(typewriter)] <- 0

colnames(gramophone) <- c('htid','gramophone')
colnames(film) <- c('htid','film')
colnames(typewriter) <- c('htid','typewriter')

# Put it all together with our main metadata data frame
gft <- merge(gramophone, film, by=c('htid'))
gft <- merge(gft, typewriter, by=c('htid'))

# Now get the metadata from fiction_metadata.csv and
# merge based on htid. (Any row whose htid is not in the
# metadata drops out here.)
fiction.data <- read.csv('fiction_metadata.csv', header=T)
gft <- merge(gft, fiction.data, by=c('htid'))

# To normalize let's load our annual totals as well. We can
# merge those with our dataframe based on date.
# Get Yearly Totals
yearly.summary <- read.csv('fiction_yearly_summary.csv')
yearly.total <- subset(yearly.summary, word == '#DICTIONARYWORD')
yearly.total <- yearly.total[c('year','termfreq')]
colnames(yearly.total) <- c('date','total')

# Merge yearly totals with our main dataframe based on date.
gft <- merge(gft, yearly.total, by=c('date'))

# Our dataframe is now 23 columns:
colnames(gft)
#>  [1] "date"        "htid"        "gramophone"    "film"
#>  [5] "typewriter"  "recordid"    "oclc"          "locnum"
#>  [9] "author"      "imprint"     "place"         "enumcron"
#> [13] "subjects"    "title"       "prob80precise" "genrepages"
#> [17] "totalpages"  "englishpct"  "datetype"      "startdate"
#> [21] "enddate"     "imprintdate" "total"

# That's not crazy, but to make things easier to understand,
# let's subset just the data we're interested in right now: say,
# the occurrence of our terms and their date.
gft.simple <- gft[, c('date','gramophone','film','typewriter','total')]
head(gft.simple)
#>   date gramophone film typewriter     total
#> 1 1915          0    0          0 106553905
#> 2 1915          0    1          0 106553905
#> 3 1915          0    0          0 106553905
#> 4 1915          0    0          0 106553905
#> 5 1915          0    0          0 106553905
#> 6 1915          0    0          0 106553905
nrow(gft.simple)
#> [1] 8655

Okay, looks good: there are our 8655 volumes, each with date of publication, the occurrences of our three search terms (gramophone, film, and typewriter), and the total number of #DICTIONARYWORD tokens for that year. Note that each row still represents a single volume, but we’ve discarded author, title, htid, etc. We’ve also added the total dictionary words for a volume’s year to each row (note the repeated totals in those first 1915 volumes), which is grossly inefficient. All this, however, is in the interest of simplicity, so that we can easily plot the relative occurrences of our selected terms (here, gramophone, film, and typewriter).

In order to make this data easily plottable, we need some additional R tricks: we need to reformat our data from a “data frame” to a long “data matrix” (using the melt function). Then we can create a stacked bar graph of terms per year. Let’s start by plotting our raw counts.
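The reshaping step might look roughly like the sketch below. It assumes the reshape2 package supplies melt, and it uses a toy data frame (gft.toy, with invented counts) shaped like gft.simple; this is not necessarily the code behind the graphs in this post.

```r
library(reshape2)
library(ggplot2)

# Toy stand-in for gft.simple; all values are invented for illustration.
gft.toy <- data.frame(
  date       = c(1915, 1915, 1916, 1916),
  gramophone = c(0, 1, 2, 0),
  film       = c(1, 0, 3, 1),
  typewriter = c(0, 0, 1, 4),
  total      = c(1000, 1000, 2000, 2000)
)

# Add up each term's occurrences per year. 'total' is constant within
# a year, so grouping on it as well simply carries it along.
per.year <- aggregate(cbind(gramophone, film, typewriter) ~ date + total,
                      data = gft.toy, FUN = sum)

# Wide to long: one row per (year, term) pair
gft.long <- melt(per.year, id.vars = c('date', 'total'),
                 variable.name = 'term', value.name = 'count')

# Normalized share of that year's dictionary words
gft.long$share <- gft.long$count / gft.long$total

# Stacked bar graph of raw counts; swap y = count for y = share
# to get the normalized version.
raw.plot <- ggplot(gft.long, aes(x = factor(date), y = count, fill = term)) +
  geom_bar(stat = 'identity') +
  xlab('Year') +
  ylab('Occurrences') +
  ggtitle('Toy Counts of Three Terms per Year')
```

Printing raw.plot (or just typing its name at the R prompt) draws the graph.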

Normalization makes some minor adjustments, but the result is pretty similar. Not sure I would want to make any claims as to the importance or meaning of these graphs. They’re over a short historical span, and so far lack any richer contextualization. Like I said, for now, anticlimaxes.