dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

New release of Shakespeare His Contemporaries

I have put a new version of Shakespeare His Contemporaries on Google Drive, where you may view or download the plays. In this version I have grouped the plays by decade and put them in directories with names like 155, 156 …165. The plays have been encoded in TEI Simple. The texts are in a machine-actionable rather than readable format, but it would not be especially difficult to mount them on the Web in a format that lets you toggle between original and standardized spelling–rather in the manner of the text of Purchas His Pilgrimage on Wolfgang Meier’s eXist showcase of other texts encoded in TEI Simple. The standardized spellings have benefited much from the advice of Richard Proudfoot, but they are the products of algorithmic curation, with a residue of error that requires manual attention. Names pose particular problems.

As reported in my earlier blog of October 25, the new version benefits from the curation of ~10,000 textual defects by Hannah Bredar, Kate Needham, and Lydia Zoells, who between April and July separately or together visited the Bodleian, Folger, and Newberry Libraries as well as the Rare Book Libraries of Northwestern and the University of Chicago. I am delighted to report that their example has moved others to do likewise. In January 2016 five undergraduates from Amherst and Smith will over the course of three weeks tackle the remaining 9,000 defects in the TCP transcriptions. If past experience is a guide, they will reduce that number by two thirds or more and come pretty close to wrapping up the initial “data janitoring” of a large corpus of Early Modern Drama. The phrase comes from a terrific New York Times article from August 2014: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0

You learn from this article that science and scholarship are just like house painting: the prep is slow and boring, the painting fast and fun. Or, as the Times has it:

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

There is more interesting work ahead, but removing obvious blemishes is an important first step in the curation of a corpus that will be the first, the most convenient, and often the only point of access for future students of Early Modern drama.

I close by repeating the final paragraphs from the October 25 blog, which is a fuller account of the editorial work that has been done:

What if one thought of 2016, the 400th anniversary of Shakespeare’s death and of Ben Jonson’s chutzpah in offering his plays as “works,” as the occasion for an Eranos (Greek for potluck) where libraries in a loosely coordinated fashion contribute new images of some of their holdings of SHC texts? By the end of that year many of those plays could be available as “digital combos” or digital surrogates in which facsimile images are aligned with versions of the TCP transcriptions that have gone through initial rounds of curation by undergraduates. Students at Northwestern and Washington University have clearly demonstrated–if that ever needed demonstrating–that students do this well and learn much from it. Individual contributions from libraries to such a project would not have to be substantial for their aggregate to make a big difference.

The “digital combos” resulting from such a project would require more work to be certifiable documentary editions of a particular text, and they would co-exist for years in varying states of (im)perfection. But that is OK as long as there is an environment, both social and technical, that encourages collaborative, iterative, and incremental curation.