Belles lettres Meets Big Data

The literary scholar needs a quiet room, a reading lamp, a notebook, a receptive mind—and algorithms for n-gram analysis, part-of-speech tagging, word-sense disambiguation, and sentence parsing. “Digital humanities” is all the rage these days in English departments. Recent meetings of the Modern Language Association have had dozens of sessions on the theme. Franco Moretti, a professor of English at Stanford University, insists that the tradition of “close reading”—giving careful attention to every word of a few canonical texts—must give way to “distant reading,” where whole genres are subjected to quantitative analysis in bulk. According to the publisher of his books, “Moretti argues that literature scholars should stop reading books and start counting, graphing, and mapping them instead.”

The idea of applying mathematical and computational tools to literature is hardly new. The first conference on “literary data processing” was held in 1964; it attracted 150 participants. The topics discussed included “computational stylistics” and a computer-aided assessment of John Milton’s influence on Percy Bysshe Shelley. These were not the first such projects. Frederick Mosteller and David L. Wallace had already applied statistical methods to a case of disputed authorship in American history. They tabulated the frequencies of common words (also, an, by, of, etc.) in the Federalist Papers, seeking to determine which of those essays were written by Alexander Hamilton and which by James Madison. Earlier still—and without computer assistance—the British statisticians G. Udny Yule and C. B. Williams had studied variations in sentence length as a way of characterizing literary style and identifying authors.
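The two measurements mentioned above, the frequencies of common words and the lengths of sentences, are simple enough to sketch in a few lines of Python. What follows is only an illustration of the general idea, not a reconstruction of the procedures used by Mosteller and Wallace or by Yule and Williams. The function names, the four-word list (taken from the examples just cited), the crude tokenizing rules, and the per-1,000-words scaling are all assumptions made for the sketch.

```python
import re
from collections import Counter

# Illustrative word list only: the four examples named in the text.
FUNCTION_WORDS = ("also", "an", "by", "of")


def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each word in `words`.

    Tokenizing is deliberately crude (runs of letters and apostrophes,
    lowercased); a real study would need more careful rules.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / total for w in words}


def mean_sentence_length(text):
    """Return the average number of words per sentence.

    Sentences are split naively on '.', '!' and '?', so abbreviations
    such as 'etc.' would be miscounted as sentence endings.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    lengths = [len(re.findall(r"[A-Za-z']+", s)) for s in sentences]
    return sum(lengths) / len(lengths)
```

Comparing such rates across essays of known authorship with those of a disputed essay is the rough shape of the approach. Deciding whether a difference is meaningful calls for a statistical model, which this sketch does not attempt.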

I have lately become intrigued by two more pioneers of the digital humanities, who worked in an even earlier era, when digital could only refer to fingers, not computer chips. Both of these scholars were Americans born in the middle of the 19th century. One was a man of science who made a few brief forays into statistical language studies. The other was a professor of English literature who yearned to import scientific methods into his field.