dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

Freebo, Free Lunch, and Crowdfunding New EEBO Images

Here is a prefixed postscript (April 18, 2016) to my December 2015 blog post about creating new EEBO images: in a recent conversation with Thomas Stäcker, the deputy director of the Herzog-August-Bibliothek in Wolfenbüttel (HAB), I learned that their average cost for creating a digital image good enough for most scholarly purposes is about a dollar a page. The HAB holds 80% of 16th century German imprints, and the library has over the years digitized 3.5 million images. From a technical perspective, early modern books from Germany do not differ much from early modern books elsewhere. Thus, given a proper workflow and equipment in the lower five-figure range, it would appear that a coordinated and distributed campaign for creating good enough new images of old books is a possible thing. If it has been done in Germany, it can be done in North America. The funding patterns are different, but the social and technical challenges are very similar, and they are superable.

=====

The recent Twitterstorm over ProQuest’s first canceling, and then canceling the canceling, of access to EEBO images by members of the Renaissance Society of America should make us take a larger and longer-term view of the “re-mediation” of “old books” in a digital world. Is it time to say goodbye to EEBO images and think of new, much better, and public domain facsimiles? What would it take to finance them? Could forms of crowdfunding get us there?

“Old books” is the charming term that a Computer Science colleague of mine uses to describe my work with EEBO-TCP texts. The plain term has the virtue of highlighting quite a few problems involved in the creation and maintenance of digital surrogates of “Early Modern” books or “ESTC holdings” to use more professional-sounding terms.

The ideal 21st century version of an old book combines three complementary perspectives:

A facsimile of each page tells the reader much about the “materiality” of the text

A fully proofread and corrected transcription with an option to see the original or standardized spelling of the text makes it a pleasure to read

The critical point to stress is the complementarity of facsimile and transcription: a good facsimile does not obviate the need for a transcription and vice versa. Equally important is the point that a good facsimile never supersedes its original. But for many purposes of data curation or analysis, digital surrogates are almost as good as the original, often more suitable, and always more accessible. Not infrequently digital surrogates spur an interest in the original: I remember a conversation with a Newberry librarian who told me that foot traffic increases rather than decreases if you make images available over the Web.

I have spent much of my scholarly life as a student of Homer and have a deep commitment to the idea that a text is an “allographic” object, to use a term from Nelson Goodman’s Languages of Art. The history of texts is an allographic journey in which the text stays the same. Here is the opening line of the Iliad in a familiar modern format, followed by its representation in unaccented beta code, used widely in digital texts before the arrival of Unicode:

μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος

MHNIN AEIDE QEA PHLHIADEW AXILHOS

The second rendering would strike many Hellenists as pure barbarism, but the first would have been “Greek” to Thucydides or Plato although, to judge from inscriptions on early vases, they could have made some sense of the second. The two renderings are allographic variants of the same verse. Nobody in 2,500 years or more has seen a copy of what that line looked like when it was first written down, but I have little doubt that it always named its hero and his all-consuming anger in exactly this manner, just as I have little doubt that the Odyssey always began with a line that studiously avoids naming its protagonist, whom it characterizes as a man “of many turns” and who in the poem’s central episode saves his skin by saying that he is “nobody.” A little later he can’t keep his mouth shut and blurts out his name, which causes him no end of trouble.
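The allographic relation between the two renderings of the Iliad line can be made concrete with a few lines of code. What follows is a minimal sketch, not a full beta code converter: it assumes the standard letter correspondences of beta code (θ as Q, χ as X, ω as W, and so on), drops all accents and breathings, and ignores everything else the full scheme encodes. The function name is mine.

```python
import unicodedata

# Greek letter -> unaccented beta code letter (standard correspondences).
BETA = {
    "α": "A", "β": "B", "γ": "G", "δ": "D", "ε": "E", "ζ": "Z",
    "η": "H", "θ": "Q", "ι": "I", "κ": "K", "λ": "L", "μ": "M",
    "ν": "N", "ξ": "C", "ο": "O", "π": "P", "ρ": "R", "σ": "S",
    "ς": "S", "τ": "T", "υ": "U", "φ": "F", "χ": "X", "ψ": "Y",
    "ω": "W",
}

def to_unaccented_beta(text: str) -> str:
    """Strip diacritics and transliterate Greek letters to beta code."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (accents, breathings, diaeresis).
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Map each Greek letter; leave spaces and anything else untouched.
    return "".join(BETA.get(ch.lower(), ch) for ch in bare)

print(to_unaccented_beta("μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος"))
# MHNIN AEIDE QEA PHLHIADEW AXILHOS
```

The conversion loses the accents and breathings, which is why it runs in one direction only, but the verse is recognizably the same.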

That said, there is much to be learned from the ways in which a printed old book presented itself to its first readers. The EEBO images, digital scans of microfilms whose quality is rarely better than so-so and often atrocious, don’t make for easy reading, but even the bad ones tell you quite a bit about their milieu. Carl Stahmer reminds us in a recent blog-post about EEBO-gate: “Images of texts taken with just a smartphone surpass the images in EEBO in both quality and readability.” What would it take to make new and much better images of all or most of the ~60,000 books in the TCP transcriptions or, more ambitiously, of as many books in the ESTC catalogue as possible?

The title of James Gleick’s Fstr points to a defining feature of digital technology. But however fast computers are now or will be in the future, large-scale data projects will still take a long time. Theodor Mommsen’s Corpus Inscriptionum Latinarum, one of the great monuments of 19th-century scholarship, took over fifty years to reach a provisional state of completion by 1914. The microfilming of EEBO images also went on for half a century. The TCP project took almost two decades. Making a digital combo with fresh and high-quality images of every book in the TCP will take just as long. On the other hand, the project will be useful from the beginning: one digital combo is better than none, a thousand is a lot better, and if the imaging is driven by user demand there may come a point midway where the collection is good enough for most purposes.

How much would this cost, and since there is no free lunch, who would pay for it? Imagine a subscription model, in which the cost for producing the facsimile of a TCP book is shared between three parties: the library that holds the original, a third party source (foundation, special fundraising), and a user who has a need for this particular book right now. The third is critical. If a user wants a book badly enough to have some “skin in the game,” the library and the third-party source know that the facsimile will have at least one active user. The user in this case buys a facsimile for herself, but the facsimile moves instantly into the public domain. This funding model bears some resemblance to the commercial-academic partnership that produced the TCP texts, but it adds individual demand as the critical driver.

I think we live in a world where “one at a time” or “digitize on demand” modes of production are cost effective. In any event, an “old book” will always have to be looked at by somebody with the expertise to judge whether the book is suitable for digitization or what special care needs to be taken. A social rather than technical point is the need for quick turnaround. Readers who want a facsimile badly enough to put up their own (or their institution’s) cash want it “now.” From what I know about my own university, it is an entirely plausible scenario that an undergraduate working on her honors thesis or a doctoral student working on her dissertation could apply for internal funds in the hundreds or even low thousands to get facsimiles that are needed for her project. It may be a lot cheaper (if less fun) than a trip to the Folger or Huntington. But she needs it “now,” where “now” means “no longer than a month and preferably less than a fortnight.” That means that participating libraries would need to let such requests jump the queue of their internal projects. That may have its own difficulties, but the principle of paying for express delivery is well established.

There are various ways of implementing a project of user-driven digitization where the user’s “skin in the game”

triggers the release of resources from a Library and the third-party source

produces a facsimile available at short notice not only to the user who requested it but to everybody on the planet

generates a link in the ESTC catalogue, whose beta release is promised for the spring of 2016.

Since I am a member of a Midwestern university, I’ll sketch a Midwestern or more specifically CIC scenario that takes off from a clever repurposing of the RBML acronym. Andrew Keener, a Northwestern doctoral student, and some colleagues turned RBML, which typically stands for Rare Book and Manuscript Library, into “Renaissance Books in Midwestern Libraries” as an “effort begun at Northwestern … to boost awareness of books printed between 1473 and 1800 in the English language (or in England and its territories) that now reside much closer than libraries in Europe.”

The Midwestern holdings of “old books” are very impressive indeed, and what with the holdings of the Newberry Library and the Rare Book and Manuscript Library at the University of Illinois at Urbana-Champaign, they cluster especially in Illinois. From the perspective of Early Modern Studies in the CIC, there is no doubt that the RBML at UIUC is the CIC’s crown jewel of Old Books, largely because of the extraordinarily canny buying of a generation of scholars like T. W. Baldwin of William Shakspere’s Small Latine & Lesse Greeke fame.

I could imagine a meeting of CIC provosts who think about the future documentary infrastructure in disciplines where the CIC as a whole has for generations made a significant difference on an international scale and continues to do so. They would (or should) be impressed by a bibliography of important editions, books and articles written by scholars who spent a significant part of their careers at CIC institutions. And they might decide to provide matching funds over the course of a decade that would support the user-driven creation of digital facsimiles of the primary sources for which CIC libraries, and the UIUC Library in particular, hold the original print editions.

I do not know very much about what it actually costs to produce a digital image of a page of an old book, and I have been told quite different stories by different people. I suspect that if the end user’s “skin in the game” were a third of the actual cost, it would be a bearable cost. From the perspective of libraries, they would get two dollars for every dollar they spend. From the provosts’ perspective, the annual cost would be real money, but certainly not big money. If you follow the money for each facsimile, you will sometimes end up at different pockets of the same institution: the provost, the library, a department or school, research accounts or private pockets of individuals. But just as often the requests would come from scholars whose institutions cannot afford a subscription to EEBO-Online but would be happy to support this or that request. This and that add up over time. Some people might prefer a more systematic approach to the problem, but if you start from current needs you have some certainty that you’re paying for something that somebody needs right now. Over time, the aggregate of individual requests will reflect current trends in the profession. Since the number of old books is fixed and they are very likely to be preserved, the order in which they are digitized does not really matter, as long as teachers and students can get at the stuff they need when they need it.
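The arithmetic of the three-way split can be sketched in a few lines. The figures are assumptions for illustration only: the cost of a dollar a page comes from the HAB estimate in the postscript above, the 300-page book is hypothetical, and the equal thirds are simply the split I suggest here. The function name and the choice to let leftover cents fall to the third-party source are mine.

```python
def split_cost(pages, cents_per_page=100,
               parties=("user", "library", "third_party")):
    """Split the imaging cost of one book evenly, in whole cents."""
    total = pages * cents_per_page
    base, remainder = divmod(total, len(parties))
    shares = {name: base for name in parties}
    # Leftover cents go to the last party listed (an arbitrary choice).
    shares[parties[-1]] += remainder
    return shares

# Hypothetical 300-page book at the assumed dollar a page.
shares = split_cost(pages=300)
outside_money = sum(shares.values()) - shares["library"]

print(shares)                             # each party pays 10000 cents
print(outside_money / shares["library"])  # 2.0 outside dollars per library dollar
```

On these assumptions a user's stake of a hundred dollars releases a three-hundred-dollar facsimile, and the library sees two outside dollars for each dollar of its own.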

If you created a digital facsimile of each “old book” title that is held by a CIC library and for which no public domain image is available you would not meet all needs of CIC scholars. But you would meet a lot of needs within and beyond the CIC, and the effort might well spur others. Beyond the scholarly and pedagogical needs of CIC institutions there are huge benefits to students and faculty at small liberal arts colleges in the Midwest and beyond. For undergraduates at UIUC who love old books the access to them lies within minutes of their daily footpaths. That is not the case at Knox or Wheaton College. For their students access to good digital surrogates may not be quite as thrilling as touching the real thing, but it is good enough for most purposes. The same can be said of thousands of high school students who read a lot and like old books.

It is worth stating at this point that the Text Creation Partnership, an Anglo-American project that has taken almost two decades to complete, had its origins in the CIC, was led by a CIC institution, and got the majority of its support from CIC institutions, especially in the early years. If one half of the “digital combo” was a CIC product, it would be altogether fitting if the CIC also took the lead in filling the other half with images that take full advantage of 21st century technologies.

The case I make here is very similar to the arguments in the 1907 prospectus to the Tudor facsimiles, a project that took great pride in using the latest advances in “photo, photo-litho, collotype, and photogravure” work to make “absolutely faithful” replications of rare old books available to many users:

Scholars, in common with professors, teachers, students, and lovers of English–the language or its literature…have had hitherto to deplore the fact that…. so many of the rarities of early printing and the priceless treasures of early English literature are, comparatively speaking, sealed to general scholarship and research. To remove that reproach is the object now in view.

The technologies and financing strategies have changed. The goals remain the same. If Early Modern scholars want good facsimiles of the primary sources on which their teaching and research depend, nothing will be as effective in getting there as real “skin in the game” from individuals and the departmental parishes in which they live and work.