The first point to mention is that the things computers are good at are very different from the things humans are good at. The worthwhile work in digital humanities (“DH” for short, a synonym for computationally assisted humanities research) keeps this fact in mind. Computers are useful for doing quickly certain basic (that is, boring) tasks that humans do slowly. They’re really good at counting, for instance. But sometimes, happily, these kinds of quantitative improvements in speed produce qualitative changes in the kinds of questions we can pose about the objects that interest us. So we literary scholars don’t want to ask computers to do our close reading for us. We want them to help us work differently by expanding what we can read (or at least interpret) and how we can read it. And we want to keep in mind that reading itself is just one (extraordinarily useful) analytical technique when it comes to understanding literary or social-aesthetic objects.

There are two main classes of literary problems that might immediately benefit from computational help. In the first, you’re looking for fresh insights into texts you already know (presumably because you’ve read them closely). In the second, you’d like to be able to say something about a large collection of texts you haven’t read (and probably can’t read, even in principle, because there are too many of them; think of the set of all novels written in English). In both cases, it would almost certainly be useful to classify or group the texts together according to various criteria, a process that is in fact at the heart of much computationally assisted literary work.

In the first case, what you’re looking for are new ways to connect or distinguish known texts. Cluster analysis is one way to do this. You take a group of texts (Shakespeare’s plays, for instance), feed them through an algorithm that assesses their similarity or difference according to a set of known features or metrics (sentence length, character or lemma n-gram frequency, part of speech frequency, keyword frequency, etc.—the specific metrics need to be worked out by a combination of so-called “domain knowledge” and trial and error), and produce a set of clusters that rank the relative similarity of each work to the others. Typical output looks something like this figure from Matthew Jockers’ blog:

Read this diagram from the top down; the lower the branch point between two items or groups, the more closely related they are.
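The clustering workflow just described can be sketched in a few lines. This is a toy illustration only: the “plays” below are hypothetical word lists standing in for real feature vectors (lemma frequencies, POS counts, and so on), and the single-linkage merging here is one of several linkage rules an actual study would need to choose among. Each recorded merge corresponds to a branch point in a dendrogram like Jockers’.

```python
from collections import Counter
from math import sqrt

# Hypothetical "plays": tiny word lists stand in for real feature
# vectors (lemma frequencies, POS frequencies, etc.).
texts = {
    "PlayA": "love jest wedding love laugh",
    "PlayB": "love laugh wedding jest jest",
    "PlayC": "blood death crown blood war",
    "PlayD": "war crown death blood grief",
}

def profile(text):
    """Normalized word-frequency vector for one text."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def distance(p, q):
    """Euclidean distance over the union vocabulary."""
    vocab = set(p) | set(q)
    return sqrt(sum((p.get(w, 0) - q.get(w, 0)) ** 2 for w in vocab))

profiles = {name: profile(t) for name, t in texts.items()}

# Single-linkage agglomerative clustering: repeatedly merge the two
# closest clusters, recording each merge (the dendrogram's branch points).
clusters = [[name] for name in texts]
merges = []
while len(clusters) > 1:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(distance(profiles[a], profiles[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    d, i, j = best
    merges.append((clusters[i][:], clusters[j][:], round(d, 3)))
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for left, right, d in merges:
    print(f"merge {left} + {right} at height {d}")
```

On this toy data the two “comedies” (PlayA, PlayB) merge first at a low height, then the two “tragedies,” and only at the top do the two groups join—exactly the structure you read off a dendrogram from the top down.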

This may or may not be interesting. Note in particular that the cluster labels are supplied by the user, outside the computational process. In other words, the algorithm doesn’t know what the clusters mean, nor what the clustered works have in common. Still, why does Othello cluster with the comedies rather than the tragedies (or the histories, to which the tragedies are more closely related than the comedies)? The clustering process doesn’t answer that question, but I might never have thought to ask it if I hadn’t seen these results. Maybe I won’t have anything insightful to say in answer to it, but then that’s true of any other question I might ask, and at least now I have a new potential line of inquiry (which is perhaps no mean thing when it comes to Shakespeare).

(As an aside, the extent to which I’m likely to explain the categorization of Othello as a simple error instead of as something that requires further thought and attention will depend on how well I think the clustering process works overall, which in turn will depend to at least some extent on how well it reproduces my existing expectations about generic groupings in Shakespeare. The most interesting case, probably, is the one in which almost all of my expectations are met and confirmed—thereby giving me faith in the accuracy of the overall clustering—but a small number of unexpected results remain, particularly if the anomalous results square in some way with my previously undeveloped intuitions.)

Even more compelling to me, however, is the application of these and related techniques to bodies of text that would otherwise go simply unread and unanalyzed. If you’re working on any kind of large-scale literary-historical problems, you come up very quickly against the limits of your own reading capacity; you just can’t read most of the books written in any given period, much less over the course of centuries. And the problem only gets worse as you move forward in time, both because there’s more history to master and because authors keep churning out new material at ever-increasing rates. But if you can’t read it all, and if (as I said above) you can’t expect a computer to read it for you, what can you possibly do with all this stuff that currently, for your research purposes, may as well not exist?

Well, you can try to extract data of some kind from it, then group and sort and classify it. This might do a few different things for you:

It might allow you to test, support, or refine your large-scale claims about developments in literary and social history. If you think that allegory has changed in important and specific ways over the last three centuries, you might be able to test that hypothesis across a large portion of the period’s literary output. You’d do that by training an algorithm on a smallish set of known allegorical and non-allegorical works, then setting it loose on a large collection of novels. (This process is known as supervised classification or supervised learning, in contrast to the un- or semi-supervised clustering described briefly above. For more details, see the Jain article linked at the end of this post.) The algorithm will classify each work in the large collection according to its degree of “allegoricalness” based on the generally low-level differences gleaned from the training set. At that point, it’s up to you, the researcher, to make sense of the results. Are the fluctuations in allegorical occurrence important? How does the genre vary by date, national origin, gender, etc.? Why does it do so? In any case, what’s most exciting to me is the fact that you’re now in a position to say something about these works, even if you won’t have particular insight into any one of them. Collectively, at least, you’ve retrieved them from irrelevance and opened up a new avenue for research.
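A minimal sketch of the train-then-classify step, with everything hypothetical: the hand-labeled “training” snippets, the “unread” novels, and the nearest-centroid rule (one of the simplest supervised classifiers; real projects would use richer features and models). The “allegoricalness” score here is just how much closer a text sits to the allegorical centroid than to the literal one.

```python
from collections import Counter
from math import sqrt

# Hypothetical training set: a few snippets hand-labeled as allegorical
# or not.  Real work would use full novels and far richer features.
training = {
    "allegorical": ["pilgrim journey burden gate celestial journey",
                    "beast garden serpent fall garden"],
    "literal":     ["carriage dinner estate letter dinner",
                    "ship cargo port merchant ledger"],
}
unread = {
    "NovelX": "pilgrim gate burden celestial serpent",
    "NovelY": "estate carriage letter merchant port",
}

def vec(text):
    """Normalized word-frequency vector."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def centroid(texts):
    """Average the word-frequency vectors of one labeled group."""
    vecs = [vec(t) for t in texts]
    vocab = set().union(*vecs)
    return {w: sum(v.get(w, 0) for v in vecs) / len(vecs) for w in vocab}

def dist(p, q):
    vocab = set(p) | set(q)
    return sqrt(sum((p.get(w, 0) - q.get(w, 0)) ** 2 for w in vocab))

centroids = {label: centroid(texts) for label, texts in training.items()}

# "Allegoricalness" score: positive means closer to the allegorical
# centroid than to the literal one.
results = {}
for name, text in unread.items():
    v = vec(text)
    score = dist(v, centroids["literal"]) - dist(v, centroids["allegorical"])
    results[name] = score
    print(name, "allegorical" if score > 0 else "literal", round(score, 3))
```

The point of the sketch is the division of labor: the machine supplies a score for every unread work; deciding whether the resulting fluctuations mean anything remains the researcher’s job.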

The same process might also draw your attention to a particular work or set of works that you’d otherwise not have known about or thought to study. If books by a single author or those written during a few years in the early nineteenth century score off the charts in allegoricalness, it might be worth your while to read them closely and to make them the objects of more conventional literary scholarship. Again, the idea is that this is something you’d have missed completely in the absence of computational methods.

Finally, you might end up doing something like the Shakespearean clustering case above; maybe a book you do know and have always considered non-allegorical is ranked highly allegorical by the computer. Now, you’re probably right and the computer’s probably wrong about that specific book, but it might be interesting to try to figure out what it is about the book that produces the error, and to consider whether or not that fact is relevant to your interpretation of the text.

One note of particular interest to those who care deeply about bibliography. In an earlier post about Google Book Search (a service tellingly renamed from the original Google Print), there was some debate about whether GBS is a catalog or a finding aid, and whether or not full-text search takes the place of human-supplied metadata. I think it’s obvious that both search and metadata are immensely useful and that neither can replace the other. One thing that text mining and classification might help with, though, is supplying metadata where none currently exists. Computationally derived subject headings almost certainly wouldn’t be as good as human-supplied ones, but they might be better than nothing if you have a mess of older records or very lightly curated holdings (as is true of much of the Internet Archive and GBS alike, for instance).
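One crude way to derive such machine-supplied metadata, sketched under entirely hypothetical assumptions (the records, the two-term labels, and the plain TF-IDF weighting are all illustrative): take each uncatalogued record’s most distinctive terms, by TF-IDF, as a stand-in subject heading. The results would be far worse than a cataloguer’s, as noted above, but better than nothing.

```python
from collections import Counter
from math import log

# Hypothetical uncatalogued records; real input would be full OCR'd texts.
records = {
    "rec1": "sermon sermon faith scripture faith congregation",
    "rec2": "voyage island ship voyage cargo island",
    "rec3": "faith ship scripture voyage sermon congregation",
}

def tfidf_labels(docs, k=2):
    """Return the k highest-TF-IDF terms of each doc as crude subject labels."""
    n = len(docs)
    df = Counter()          # document frequency of each term
    tfs = {}
    for name, text in docs.items():
        counts = Counter(text.split())
        tfs[name] = counts
        df.update(counts.keys())
    labels = {}
    for name, counts in tfs.items():
        total = sum(counts.values())
        scores = {w: (c / total) * log(n / df[w]) for w, c in counts.items()}
        labels[name] = sorted(scores, key=scores.get, reverse=True)[:k]
    return labels

labels = tfidf_labels(records)
print(labels)
```

Terms that appear in every record (and so discriminate nothing) get an IDF of zero, which is why “island” and “cargo” surface for the voyage record rather than the shared vocabulary.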

Finally, some links to useful and/or interesting material:

The MONK Project. MONK is an attempt to bring corpus-oriented text analysis to the English-department mainstream; a discussion of it set all this in motion here on EMOB.

23 Responses to “Reading with Machines”


Wow. There’s a lot to process in this post, and I haven’t had the chance to do it thoroughly yet (and won’t have the chance to work all the way through all the links and so forth in the immediate future), so I’m not going to be saying anything scintillating in this comment, but your analysis is getting me to look in a very helpful direction. Thanks!

Thanks, Matthew. What a helpful overview! Jockers’ clustering and classification of Shakespeare’s plays still mystifies me: you mention the strangeness of listing Othello as a comedy, but why is A Midsummer Night’s Dream listed as a tragedy? One can perhaps guess why Romances like Cymbeline, The Winter’s Tale, and The Tempest are pushed into the tragedy cluster, though that, too, merits analysis. If the classification program works, then perhaps these aberrations will tell us something interesting about the particular use of language in these plays or, depending on what gets analyzed, something about the plays’ structure.

Thanks also for the nice bibliography. I will spend some time looking over the links you provide. Again, thanks!

Thanks so much, Matthew, for this very rich post. I especially liked your reminder about what computers can do well that humans can’t and the need to play to the strengths of the machine when devising computational projects. Your point that reading, despite being a prime method for textual analysis, is not the only “analytical technique when it comes to understanding literary or social-aesthetic objects” dovetails well with the capability these tools afford for asking new questions and discovering new ways of working with texts.

As for your remarks about GBS, I wonder if you know of anyone who is working on devising computational subject headings for Google texts–Google or others (if that would be a possibility)? Such a project could be fascinating.

As you note, it is very telling that Google changed the name of its “library” from Google Print to Google Book Search. One of the reasons Google gave for the switch was that users came to “Google Print” expecting to be able to print everything they found. I would add that people were probably expecting full-text access and the like (some new to Google still approach it with this expectation)–much like one would find on Project Gutenberg or accessing articles (if one belonged to a subscribing institution) found in JStor or Project Muse. GBS enables you to search for books, of course, but its real strength in its present incarnation is, in my mind, its capability of searching across texts.

Also, many thanks for the list of selected links–they are welcome and well-chosen. I have added three of the readings from the Blackwell Companion volumes–Warwick’s “Print Scholarship and Digital Resources” and Deegan and Tanner’s “Conversion of Primary Sources” (Companion to Digital Humanities) and Damian-Grint’s “Eighteenth-Century Literature in English and Other Languages: Image, Text, and Hypertext” (Companion to Digital Literary Studies)–to the bibliography (see Post 2) because they seemed quite relevant to our upcoming roundtable discussions (Anna had already included Steggle’s “Knowledge Will be Multiplied” from the Blackwell Companion to DLS).


I’m not aware of any such projects beyond some vague hints about “collaboration” from people at Stanford, but I’d be surprised if Google didn’t have something like this in development, since it’s so closely related to their search business. They’re probably the only ones who could develop such a thing at the moment, since no one else has programmatic access to their holdings. If and when the settlement is approved and academic researchers get “non-consumptive” access to the data, I suspect this would be an attractive early project.

One of the problems is that the settlement prohibits creation of a service on the “research corpus” that competes with Google’s (see page 82 of the settlement).

So if you do a better job at text mining / entity extraction etc. than Google (or other Rightsholders) and you’d liked to make your results usable so other researchers can build upon them (“Scholarship as a Service”?) then you’d be in breach of the terms of use of the research corpus.

No one really knows how the GBS research corpus will work as a practical matter. But I have to say it doesn’t surprise or concern me that you can’t use it to compete with Google. They’re giving you a huge, valuable resource and saying “have at it for your scholarly work.” That’s a pretty good deal, in my book. If you want to turn that access into a business, well, then you’re going to need to pay. Sure, I wish the whole thing were just being donated to the public domain, but that’s not what’s on offer.

Matt, I read your take on the GBS and you make a good argument for supporting it. However, I can’t agree with it without some key changes. In my mind the dangers of entrenching Google as a monopoly in this space far outweigh the benefits offered by the settlement.

There are other important objections with regard to the privacy issues and user data capture that will be required under the access and use restrictions. Remember this is a company that already monitors a tremendous amount of user data (some 88% of all web traffic! http://knowprivacy.org/), and is moving toward “behavioral advertising”.

What’s bad about this for scholars? I think there can be a “chilling effect” with the privacy issues. Google does not have the same values found in your university library, and will exploit data about your use of their corpus. They can also remove works with no notice or recourse, again, not like a university library.

With regard to the research use of the corpus, it’s true nobody will know how they will play out. I think for researchers on the computational side, it’ll be a huge boon, since they’ll have a big data set to use to test new algorithms.

However, humanities scholars are on the more “applied” side of this. They’re more likely to want to use text-mining techniques to better understand a collection. Where I see a problem is that they will not have clear permission to share their understandings, especially as a new service (say, one with enhanced, discipline-specific metadata over a portion of the corpus), because that service may “compete with Google” or other “Rightsholders.” I really think that restriction matters.

The settlement also places restrictions on data extracted (through mining and other means) from copyrighted works. This is also a problem, because it weakens the public domain status of facts/ideas. If Google launches a Wolfram|Alpha-like service on this corpus, they will also likely act like Wolfram|Alpha and claim ownership of mined “facts”.

None of this is good for researchers in the long term. Now, I’m not saying this has to be a totally “open” resource (it can’t because of the copyright status of many of the books). All I’m saying is that we should be REALLY concerned. We should push for some additional protections.

Thomas Rommel’s 2004 chapter called “Literary Studies” in the Blackwell Companion to Digital Humanities cites Jerome McGann’s explanation for why computational analysis has not yet taken hold in literary studies. According to McGann, the

general field of humanities education and scholarship will not take the use of digital technology seriously until one demonstrates how its tools improve the ways we explore and explain aesthetic works – until, that is, they expand our interpretational procedures. (McGann, Radiant Textuality, 2001: xiii).

This makes sense to me. We need a clearly outlined map of how interpretation is enriched or expanded through computational analysis. Perhaps this exists somewhere?

I took McGann to mean that digital work will be taken seriously when and only when we start reading articles and books with traditional literary concerns that happen to use digital techniques as part of their working method. In other words, it’s all well and good to talk about the potential of digital research, but what we really need are more examples of existing, good, interesting, and specific digitally assisted literary critical results.

I agree with McGann completely. There’s not much out there yet that meets those criteria, but it’s coming, and soon (I hope). Of course, non-digital folks could be forgiven if “coming soon” sounded familiar to them.

Anna’s and Matthew’s comments made me think of another variation of how computational tools/programs can assist in handling large textual corpora in ways not otherwise possible: Peter Robinson’s (et al.) Canterbury Tales Project. Among other initiatives, this project created software (Collate–available for a long time only for use on Macs) to build on the magisterial, pre-digital-age work of John Manly and Edith Rickert’s The Text of the Canterbury Tales Studied on the Basis of All Known Manuscripts (8 vols) and afford better, more effective ways of presenting the information about variants found across the 84 manuscripts of Chaucer’s CT that had appeared by 1500. One interested in learning more about the fruits of this project might consult Robinson’s “The History, Discoveries, and Aims of the Canterbury Tales Project,” Chaucer Review 38.2 (2003): 126-139. Although some might worry that the ability to manipulate the choice of “base” text and witnesses by users of electronic editions decenters the authority of the scholarly edition, I see this capability as a plus and one that could lead to greater discoveries. (My exposure to this project when I attended the Center for Electronic Texts in the Humanities [CETH] in 1996 was what first stimulated my interest in dh–my focus there was scholarly electronic editions; Robinson was the convener for these workshops.)
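The core task Collate automates—aligning each witness against a base text and reporting variants—can be sketched with the standard library’s sequence aligner. Everything here is hypothetical: these are invented lines, not actual Chaucer manuscripts, and word-level diffing is only the simplest form of the alignment a real collation tool performs.

```python
import difflib

# Toy collation in the spirit of Collate: align each witness against a
# chosen base text and report word-level variants.  (Hypothetical
# spellings, not actual manuscript readings.)
base = "whan that aprill with his shoures soote".split()
witnesses = {
    "MS-A": "whan that aprille with his shoures sote".split(),
    "MS-B": "whan that aueril with hise shoures soote".split(),
}

def variants(base, witness):
    """Word-level (base reading, witness reading) pairs where they differ."""
    out = []
    sm = difflib.SequenceMatcher(a=base, b=witness)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            out.append((" ".join(base[i1:i2]), " ".join(witness[j1:j2])))
    return out

for name, wit in witnesses.items():
    print(name, variants(base, wit))
```

Because the base is just a parameter, re-collating against a different “base” text or subset of witnesses is a one-line change—which is exactly the user-controlled flexibility of electronic editions discussed above.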

It’s a hefty report from CLIR on the state of various DH projects and issues. Of particular interest re: this post is Douglas Oard’s essay “A Whirlwind Tour of Automated Language Processing for the Humanities and Social Sciences,” which begins on page 34. It’s a whirlwind indeed and covers more ground (necessarily in less depth) than I did, but it’s worth a look if you’d like to know with whom to make useful friends on campus. And it mostly agrees with me on a few lightly polemical points, so I approve heartily 🙂

I liked Oard’s “Whirlwind Tour,” which successfully calls attention to the complexity of machine reading–and to the difficulty of getting machine reading to work (or of knowing what it is we want it to do).


This is a really great, really informative post. I feel like this is a great starting-point for those exploring this topic. (and I agree with those who think that what machine-reading really needs is a really good book and new critical voice that demonstrate the virtues of the approach particularly well) Thanks, Matthew. DM

Dave is right. This morning I looked at Lisa Spiro’s marvelous site and was again impressed by something we have not yet mentioned here: the way these computerized tools facilitate collaborative authorship. She cites John Unsworth on a theme Dave and others have discussed frequently and usefully in the Long Eighteenth:

In the cooperative model, the individual produces scholarship that refers to and draws on the work of other individuals. In the collaborative model, one works in conjunction with others, jointly producing scholarship that cannot be attributed to a single author. This will happen, and is already happening, because of computers and computer networks. Many of us already cooperate, on networked discussion groups and in private email, in the research of others: we answer questions, provide references for citations, engage in discussion. From here, it’s a small step to collaboration, using those same channels as a way to overcome geographical dispersion, the difference in time zones, and the limitations of our own knowledge.

Collaborative work facilitated by scholars working together to analyze machine-read statistics, or sharing bibliographies on Zotero, or simply mapping information as Matthew and others have on various blogs, seems so helpful, and yet I think the humanities have not, until recently, caught on to the merits of collaborative work. The sciences are far ahead of us in this respect.

This is a bit of a digression from machine-reading, but these thoughts were provoked by Matthew’s generosity in putting together a very useful overview and by the scholarly riches shared on the sites he lists.

John Unsworth’s “How Not to Read a Million Books” provides the most efficient overview of MONK I have seen. It includes examples of several projects now being undertaken with the help of text-mining tools such as MONK.