Abstract

In October 2008, Google announced a settlement that will provide access to seven million scanned books while the number of books freely available under an open license from the Internet Archive exceeded one million. The collections and services that classicists have created over the past generation place them in a strategic position to exploit the potential of these collections. This paper concludes with research topics relevant to all humanists on converting page images to text, one language to another, and raw text into machine actionable data.

Introduction

In a long span of time it is possible to see many things that you do not want to, and to suffer them, too. I set the limit of a man's life at seventy years; these seventy years have twenty-five thousand, two hundred days, leaving out the intercalary month. But if you make every other year longer by one month, so that the seasons agree opportunely, then there are thirty-five intercalary months during the seventy years, and from these months there are one thousand fifty days. Out of all these days in the seventy years, all twenty-six thousand, two hundred and fifty of them, not one brings anything at all like another.
(Herodotus, Histories 1.32, tr. Godley)

In the first book of Herodotus’ Histories, the Athenian statesman Solon calculates that an average human life of seventy years contains roughly 25,000 days. If we could read a book every day of our lives, it would take forty lifetimes, almost 3,000 years, to work our way through one million books. It would take some four hundred lifetimes, or 28,000 years, to work through the ten million or so unique books that the original Google library partners contained in their collections.[1] On October 28, 2008, Google announced an agreement with publishers that would allow libraries to provide, largely on a subscription basis, access through Google Book Search to some seven million books, including copyrighted materials.[2] Google is providing immense scale, but the scholarly significance is not so great as it might be: there is at present no way to understand what subset of the world’s knowledge those seven million volumes represent. Even if there were, scholars have no way of understanding in more than the most general way how the services that extract information from that collection work: what is missed? What biases are embedded in the system? Scholarship depends upon transparency, and we must be careful that we do not, in pursuing our immediate research projects, compromise our fundamental commitment to transparency.

A day before the momentous Google announcement, another and arguably even more important milestone was crossed. On October 27, 2008, the number of books available from the Internet Archive exceeded 1,000,000. While a million books is only a fraction of the seven million that Google boasts, the million books available from the Internet Archive are freely downloadable: anyone can analyze them and publish the results. The collection available from the Internet Archive provides the foundation for transparent services and, even more important, transparent discourse. Open source services, carefully evaluated and publicly documented, applied to open content, freely downloadable by anyone without restrictions, embody the goals of scholarly and scientific practice.

A million books alone would support a book-of-the-day club for almost 3000 years. Thus, even if we restrict ourselves to digitized printed books available for public download in a single location, the scale of content available has already passed that which any single human mind could comprehend. As a physical collection, a million books is hardly remarkable. As a store of knowledge for human analysis, the scale of 1,000,000 books has already passed human scale and is as abstract as the distance between galaxies or the number of insects in the world. Only machines can process the collections to which we already in late 2008 have access. What can we do with a million books with the tools now at our disposal and which we could build? What are the research questions that emergent huge collections raise for the historians, literary critics, and other humanists who study their contents and for the computer and information scientists who develop methods with which to process digital information in general?
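The arithmetic behind these figures is easy to check. The following Python sketch assumes a pace of one book per day and Solon's round figure of 25,000 days for a seventy-year life; the function name reading_time is purely illustrative.

```python
# Back-of-the-envelope check of the reading-time figures.
DAYS_PER_LIFETIME = 25_000   # Solon's round figure for a seventy-year life
YEARS_PER_LIFETIME = 70

def reading_time(books, books_per_day=1):
    """Return (lifetimes, years) needed to read `books` at a steady pace."""
    days = books / books_per_day
    lifetimes = days / DAYS_PER_LIFETIME
    return lifetimes, lifetimes * YEARS_PER_LIFETIME

for n in (1_000_000, 7_000_000, 10_000_000):
    lifetimes, years = reading_time(n)
    print(f"{n:>10,} books: {lifetimes:,.0f} lifetimes, about {years:,.0f} years")
```

At this pace one million books occupy forty lifetimes, or 2,800 years, which is the "almost 3000 years" cited above.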

This paper summarizes research, supported by a grant from the Mellon Foundation, into the challenges and opportunities confronting the humanities in general and classical studies in particular as we shift from small, carefully edited and curated digital collections to very large, industrially produced collections that, in their fullest instantiation, aim to subsume whole libraries. We view classical studies as a special case within the more general question that we have termed "what do you do with a million books?" The authors of this paper come from Europe as well as North America, from classical philology, computer science, corpus linguistics and library and information science, but many others contributed substantively to the work reported here during workshops that we conducted between November 2006 and March 2008.[3] Previous publications have addressed some of the general issues that the humanities as a whole face.[4] This paper explores the particular case of classics within the million-book library.

When we began this study in 2006, we planned to focus upon materials related to the Greco-Roman world, early modern Europe, and the 19th century United States so that we could examine the varying problems and opportunities associated with print materials relating to each. As our work progressed, the advantages of focusing on classical studies became increasingly clear. Classics includes not only the Greco-Roman world but the subsequent scholarship about the Greco-Roman world and a vast body of material written in Latin on virtually every subject. Beginning with early printed editions in the 1470s and continuing through the present, Classical scholarship brings us to every corner of Europe, North America, and the Middle East.

Classical studies do not, of course, touch the same audiences as Shakespeare or the American Civil War, and there are not nearly so many Classicists as there are experts in English Literature and American History, but Classics has produced the largest coherent community of scholars engaged with the digital infrastructure for their field. Classical studies became a logical focus for our work: if we could understand how to build a comprehensive collection of classical scholarship from the beginning of print culture to the present, we would know how to work with centuries of print publications on every aspect of human society and in every discipline and from every corner of Europe and North America.

This paper begins by stressing that we have moved beyond islands of digital content in a vast sea of print. Where our first generation collections were autonomous, carefully curated, discipline-specific islands, we now see emerging a world where we dynamically generate collections of heterogeneous materials from vast and constantly expanding digital libraries over which no individual discipline or project exercises control. We cannot thus rely upon a centralized editorial structure to guarantee for us the consistency of what we find. We need tools that can help us assess how representative our automatically extracted corpus is (e.g., what biases are there in the distribution of Latin texts available for searching?) and the accuracy of our analytical tools (e.g., the precision and recall of named entity systems that search for Salamis in Cyprus vs. the Salamis near Athens, the error rates in Latin text that Optical Character Recognition (OCR) extracts from various editions printed in different places and times).

Our discussion then moves to the services that humanists need to exploit very large collections. These include not only advanced services for information extraction, multilingual technologies, and visualization but simple access to the scanned page images with which to support domain-optimized document analysis. These services require the rise of a new, fourth generation of digital corpora. Our first digital corpora included accurate transcriptions with markup of surface features (e.g., we simply indicate that a word is in italics). A second generation began to add semantic markup (e.g., a phrase is in italics because it is the title of a work or a Latin quotation). The third generation created much larger collections by shifting the focus of manual labor from carefully edited typing to industrial scanning of page images. We need fourth-generation collections that can seamlessly integrate image-books, accurate transcriptions, and machine actionable knowledge in various formats.

These fourth generation collections are a qualitatively new phenomenon. They allow us to design collections that are not only more comprehensive but more diverse than we could ever produce in print culture. These collections are unbounded and can include not only texts but every category of data about their subjects — high resolution images, three-dimensional models, geographic data sets, and anything that we can represent in digital form. Even if we restrict ourselves to linguistic data, fourth generation collections are a qualitative advance over print: we can include not only images of neatly printed modern books but non-print representations of language such as three dimensional models of words engraved on stone and digital sound recordings.

For classics, the most important such project is what we have termed the apographeme of classical Greek and Latin: an analogy to the genome, representing the complete record of all Greek and Latin textual knowledge preserved from antiquity, ultimately including every inscription, papyrus, graffito, manuscript, printed edition and any other writing-bearing medium. This apographeme constitutes a superset of the capabilities and data that we inherit from print culture but it is a qualitatively different intellectual space. In the mature apographeme, every canonical text is a multitext, with dynamic editions linked to visual representations of the manuscripts, inscriptions, papyri and other sources. In the mature apographeme, each source is linked to the background data that we need to understand it: a transcription, information about the particular type of Greek or Latin script and its abbreviations, about the monastery, print shop or Egyptian village that produced it, etc.

From Curated Collections to Dynamic Corpora

The methods whereby we assemble digital content are very different now from those that were available when, a generation ago, the first pioneers began designing digital tools for the humanities. In the 1970s and 1980s, most scholars considered digital resources (insofar as they considered them at all) as instruments with which to navigate a paper sea of information. The Thesaurus Linguae Graecae (TLG) and the Database of Classical Bibliography (DCB), two of the pioneering efforts within classical studies, were, in effect, indices and depended upon the ability to pay human beings to read and to type. The Thesaurus Linguae Graecae began work in 1972 and can boast in 2008 almost 100,000,000 words of cleanly transcribed Greek text.[5]

By the opening of the twenty-first century, of course, the technologies available began to open up very different approaches. Between 2001 and the end of 2005, one of the authors of this paper, Gregory Crane, developed a 55,000,000 word collection of 19th century American English.[6] He personally scanned 400 volumes, applied OCR to the scanned page images, applied automated post-processing and shipped the results to a data entry firm. A handful of reference works with complicated formatting required traditional manual data entry but for the vast majority of this corpus the OCR-generated text provided a solid foundation and avoided the need for typing. The contractor checked for errors and added basic structural markup, tagging such elements as footnotes, quotations, and figures, at a cost of under $100,000 for 500,000,000 characters or about $200/book. The corpus was not an end in itself but rather an instrument with which to study problems of automatically analyzing large collections. Most of Crane’s effort went to the production of a system that could automatically identify people, places, organizations, and other named entities in unstructured text.[7] That research became the foundation for a project entitled "Scalable Named Entity Services for Classical Studies" that would, with support from the National Endowment for the Humanities (NEH) and the Institute of Museum and Library Services, adapt this system for use with documents from classical studies.

The situation has changed even further in the past several years. The primary medium for human intellectual life is now irrevocably digital. The most heavily funded academic disciplines use paper to print digital resources on demand. Print-only resources are now archival materials. Consider the following developments that have intensified since Crane began work on the Perseus American Collection in 2001:

Massive scanning. In December 2004, Google seized the initiative to create a vast library of scanned books, with text generated by OCR, but the library community has the resources to convert its print holdings into digital form: the 123 North American libraries who belong to the Association of Research Libraries spent more than a billion dollars on their collections in the 2005-06 academic year.[8] Of course, most libraries will claim that they are under-funded and cannot maintain their existing collections, much less consider a major new initiative. Some of us are old enough to remember hearing that the costs of print collections would never allow for libraries to make digital materials accessible.

Scanning on demand. The Open Content Alliance (OCA) has created an infrastructure whereby individuals could, by 2007, select particular books for scanning and then inclusion in the larger OCA collection. The quality is high and the cost is low: $.10US per page plus handling costs of $5US per book, or about $35 for a standard book with about 300 pages. It costs less to create a high resolution scan that anyone attached to the internet can use than it does to buy a single printed book ($52 in 2005-06). Support from the Mellon Foundation has allowed the Cybereditions Project at Tufts University to begin creating within the OCA an open source library of Greek and Latin that will contain at least one text or fragment of every major surviving Greek and Latin author and a range of reference materials, commentaries and core publications.
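The scanning cost can be worked out for books of any length. This is a minimal sketch using only the per-page and per-book rates quoted above; the function name and the sample page counts are illustrative assumptions.

```python
def scan_cost_usd(pages, cents_per_page=10, handling_cents=500):
    """Cost of scanning one book, computed in integer cents to avoid float drift."""
    return (pages * cents_per_page + handling_cents) / 100

PRINT_BOOK_USD = 52  # price of a single printed book in 2005-06, as cited above

for pages in (200, 300, 450):
    cost = scan_cost_usd(pages)
    share = cost / PRINT_BOOK_USD
    print(f"{pages} pages: ${cost:.2f} to scan ({share:.0%} of a print purchase)")
```

At these rates a 300-page book costs $35.00 to scan, and a scan stays cheaper than the $52 print purchase until a book approaches 470 pages.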

The growth of open access and open source licensing. In 2008, open access publication moved into the mainstream of academic publication. The US National Institutes of Health (NIH) have established a public access policy. As of April 2008, this policy "requires scientists to submit final peer-reviewed journal manuscripts that arise from NIH funds to the digital archive PubMed Central upon acceptance for publication."[9] The policy further requires scientists funded by the NIH to include in their papers citations to the open access copies of previous publications in the PubMed Central web site. The NIH provides more than 22 billion dollars in funding for medical research.[10] Publishers in the most heavily funded research area in all of academia must now develop business models that assume open access that precedes publication. The US NEH, by contrast, requested just under $145 million in funding for 2009, less than 1% as much as the NIH invests.

Improved OCR. Traditionally, classical Greek has been a huge barrier for classicists: there was no useful OCR. All classical Greek required manual data entry and such specialized work usually cost much more than data entry for English. By early 2008, Google had begun to generate initial OCR text from page images of classical Greek. The software is evidently based on OCR designed for modern Greek and contains errors, but clever search software could ameliorate this problem. If classicists have access to the scanned page images and can optimize OCR software for classical Greek, we can achieve character level accuracy (99.94%) comparable to the standard quality for manual data entry (99.95%). In a preliminary analysis of printed scholarly editions, we found that 13% of the unique Greek words on a page, on the average, only appeared in the textual notes. Restricting our analysis to older volumes from the Loeb Classical Library (which traditionally provided a minimal number of textual variants), we found that only 97% of the unique Greek words on a given page appeared in the main text. Thus, collections that contain perfect transcriptions of the reconstructed text but no textual notes offer at most 97% of the relevant data and, if we use fuller editions, only 87%. The lowest OCR accuracy that we measured (98%) is comparable to the overall recall rate of perfect transcriptions of text alone.[11] Once we enter multiple editions of the same text, we can begin using each scanned edition to identify OCR errors and intentional textual variants.
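The coverage measure used in this comparison can be sketched in a few lines of Python. The tokens below are only a toy illustration built around Iliad 1.5 and the ancient variant δαῖτα for πᾶσι; they are not data from our analysis, and the function name text_only_coverage is a hypothetical label.

```python
def text_only_coverage(main_text_words, note_words):
    """Fraction of a page's unique words still findable when textual notes are dropped."""
    main = set(main_text_words)
    everything = main | set(note_words)
    return len(main) / len(everything) if everything else 1.0

# Toy page: one line of reconstructed text and one textual note.
main_text = ["οἰωνοῖσί", "τε", "πᾶσι", "Διὸς", "δ'", "ἐτελείετο", "βουλή"]
textual_notes = ["πᾶσι", "δαῖτα"]  # the note repeats the lemma and adds a variant

coverage = text_only_coverage(main_text, textual_notes)
print(f"text-only coverage: {coverage:.1%}")
```

Here seven of the eight unique words appear in the main text, so a transcription without notes recovers 87.5% of the page's vocabulary however accurately it was typed.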

A new generation of text mining and quantitative analysis. The DCB contains 600,000 bibliographic entries from 1949 to 2005 and adds 12,500 new items each year.[12] By contrast, the CiteSeer system, upon which computer scientists depend, was developed more than a decade ago in 1997 at the NEC Research Institute in Princeton, NJ, and offers an automatically generated index of 767,000 publications, including automatically extracted bibliographic citations.[13] Research continues and new generations of automated bibliographic systems, based on the automated analysis of on-line publications, have begun to appear. The Rexa System, developed at the University of Massachusetts, had assembled a collection of almost 1,000,000 publications in 2005 [Mimno 2007]. David Mimno, one of the authors of this paper, is a member of that research group and has support from the Cybereditions project at Tufts University to begin in 2008-09 applying that research to publications from classics.

We might summarize the current situation as follows: Google has begun creating on-line a digital collection that would be more comprehensive than the greatest university libraries ever produced — and the university libraries themselves control the resources needed to do the job were Google to falter: our retrospective collections are being digitized. The OCA has created a public, scalable infrastructure whereby we can, in fact, build high quality collections within the existing library infrastructure: if massive projects miss anything, smaller efforts can fill in the gaps and create curated collections. The US government, under a conservative, pro-business administration, has made the most profitable monopolies on which publishers had depended illegal and declared open access a condition of its most generous funding agency: the richest publishers must learn to make money under open access. Advances in OCR technology have made it possible for scholars in fields such as Classics to generate very serviceable searchable text for non-standard character sets such as Greek: once we scan editions, we can more comprehensively search primary sources and, for the first time, secondary sources that quote Greek. A new generation of text mining can provide new methods with which to trace ideas and research topics that appear in millions of publications: we can design bibliographic databases that incorporate features of particular interest to classics (e.g., the ability to determine whether "Th. 1.38" designates line 38 of Theocritus’ first Idyll, or chapter 38 of book 1 of Thucydides) with the common features of academic publication (e.g., footnotes and bibliographic citations).

Multitexts: Scholars have grown accustomed to finding whatever single edition a particular collection has chosen to collect. In large digital collections, we can begin to collate and analyze generations of scholarly editions, generating dynamically produced diagrams to illustrate the relationships between editions over time. We can begin to see immediately how and where each edition varies from every other published edition.
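A word-level collation of two editions can be approximated with Python's standard difflib module. This is a minimal sketch, not a full collation tool: it assumes the two editions are already transcribed and aligned to the same passage, and the function name collate is illustrative.

```python
from difflib import SequenceMatcher

def collate(edition_a, edition_b):
    """List the points where two transcriptions of the same passage diverge."""
    a, b = edition_a.split(), edition_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replacements, insertions and deletions
            variants.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants

first = "arma virumque cano Troiae qui primus ab oris"
second = "arma uirumque cano Troiae qui primus ab oris"
print(collate(first, second))
# [('replace', 'virumque', 'uirumque')]
```

Run pairwise over every scanned edition of a work, comparisons of this kind supply the raw material for diagrams of how editions relate, although a real system would also need to separate OCR errors from deliberate editorial choices.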

Chronologically deeper corpora: We can locate Greek and Latin passages that appear anywhere in the library, not just in those publications classicists are accustomed to reading. We can identify and analyze quotations of earlier authors as these appear embedded in texts of various genres.

New forms of textual bibliographic research: We can automatically identify key words and phrases within scholarship, cluster and classify existing publications, and generate indices of particular people and places (e.g., Antonius the triumvir vs. one of the many other figures of that name, Salamis on Cyprus vs. the Salamis near Athens). Such searches can go beyond the traditional disciplinary boundaries, allowing students of Thucydides, for example, to analyze publications from international relations and political philosophy as well as classics.

In this world, we need to recognize that we are — as indeed classicists have always in large measure been — corpus linguists. All classicists can articulate, in some measure, the relationship between the texts that survive and the subject that we are studying. If we work with Sophocles, we know that only seven plays survive and we have only fragments and even titles for the rest. If we study Alexander the Great, we must first understand the fact that our most comprehensive surviving Greek sources were composed centuries later and depend upon earlier histories that are now lost.

Consider three topics that we might pursue in a very large collection: the usage of a Latin word over two thousand years (e.g., oratio, which can, for example, in different contexts designate a speech or a prayer), the reception of Euclid’s Elements, and the reputation of Alexander the Great. In each case, we can assemble far more information than we could ever collect in print culture. The next section will touch upon some of the services with which we can make the sprawling corpora relevant to each of these topics intellectually accessible. But even before we begin our analysis, we need to understand the limitations of the corpus that we have assembled:

How representative is the corpus? Is all of a given corpus available on-line? (e.g., have all the published volumes of a series been scanned?) Can we estimate the percentage of the corpus that survives? (e.g., what percentage of Sophocles’ work do the seven plays and other fragments constitute?) What biases are inherent in our data? (e.g., do we have any accounts produced by women or by members of every national/ethnic group involved in a topic? If we find 100,000 instances of the Latin word oratio, what are the periods, genres, locations, and (in the case of later Latin) original languages of the authors?)[14] And are there correlations between these parameters? These may in fact be automatically discernible from the data, even if the human eye doesn’t notice them in the forest of data.

How accurate are the digital surrogates for each object? We may have a satisfactory corpus of print materials but these materials may yield very different results to automated services such as OCR, named entity identification, cross-language information retrieval, etc. Readers of Jeff Rydberg-Cox’s contribution to this collection will realize that OCR software will, at least in the immediate future, extract much less usable text from early modern printed editions than from editions printed in the early twentieth century. We need automatically generated metrics for the precision and accuracy of each automated process on which we depend.
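Precision and recall, the measures mentioned earlier for named entity systems, can be computed whenever a hand-checked sample is available. The following sketch assumes each decision is recorded as a (position, name, referent) tuple; that record layout and the sample values are invented for illustration.

```python
def precision_recall(predicted, gold):
    """Compare a system's decisions with a hand-checked sample."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Invented sample: (token offset, surface form, place the form refers to).
gold = {(12, "Salamis", "Cyprus"), (47, "Salamis", "Attica"), (90, "Salamis", "Attica")}
system = {(12, "Salamis", "Attica"), (47, "Salamis", "Attica"),
          (90, "Salamis", "Attica"), (130, "Salamis", "Attica")}

precision, recall = precision_recall(system, gold)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```

A system that always guesses the Salamis near Athens gets two of its four decisions right (precision 0.50) and finds two of the three true references (recall 0.67); reporting such figures per collection and per period would expose where an automated service is least reliable.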

Services for the humanities in very large collections

If humanists are to exploit large new collections to their fullest, they need, as a minimum, the following services:

Access to images of the physical sources: This includes access to particular copies of a document, any pagination or naming scheme with which to address the individual pages, and a coordinate system with which to describe regions of interest on a given page. Many born-digital publications do not provide such access — logical "page 12" of a report (as printed as a page number) may physically be page 21 of the PDF document (after adjusting for front matter, a table of contents etc.). Coordinate systems must have sufficient abstraction so that they can address relationships of the printed page even if the paper has been cropped or varies from one printing to another: coordinates for one First Folio should be useful with others.[15]

Access to transcriptional data: At the least, we need to be able to analyze the words and symbols that are encoded on the physical page.[16] The rough "bag-of-words" approach, where systems ignore the location of words on the page and even their word order, has proven remarkably useful. This level of service is fundamental to everything that follows. Conventional OCR software has traditionally provided no useful data from historical writing systems such as classical Greek. Latin is much more tractable but OCR software expecting English will introduce errors (e.g., converting t-u-m, Latin "then," into English t-u-r-n). Even earlier books with clear print will contain features that confuse contemporary OCR (e.g., the long ‘s’ which looks like an ‘f,’ such that words such as l-e-s-t become l-e-f-t).[17]

Access to basic areas of a page such as header, main text, notes, marginalia: Even transcription depends upon basic page layout if it is to achieve high accuracy: we cannot transcribe individual words unless we can automatically resolve hyphenation, and this in turn implies that we can distinguish multi-column from single-column text and footnotes, headers, marginalia, etc. from the main text.[18] We need, however, to recognize basic scholarly document layouts: thus, we should be able to search for either the reconstructed text or the textual notes at the bottom of the page. This stage corresponds roughly to WYSIWYG markup. At this stage, the system can distinguish the main text from the notes in Figure 5 and Figure 10 but it does not recognize that one set of notes is a commentary and the other constitutes an apparatus criticus.
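The dependence of transcription on layout can be seen in even a simple routine for rejoining words split at line ends. This sketch assumes that the lines arrive in reading order for a single column (which is exactly what layout analysis must supply) and that a word list is available; the function name and the tiny lexicon are hypothetical.

```python
def join_hyphenated(lines, lexicon):
    """Rejoin words split across line ends when the joined form is a known word."""
    words = []
    pending = None  # line-final fragment ending in "-" that awaits its second half
    for line in lines:
        for token in line.split():
            if pending is not None:
                joined = pending[:-1] + token
                if joined.strip(".,;:!?").lower() in lexicon:
                    token = joined          # a word broken by the typesetter
                else:
                    words.append(pending)   # unknown form: leave the hyphen alone
                pending = None
            words.append(token)
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            pending = words.pop()
    if pending is not None:
        words.append(pending)
    return words

page = ["Gallia est omnis di-", "visa in partes tres"]
print(join_hyphenated(page, {"divisa"}))
# ['Gallia', 'est', 'omnis', 'divisa', 'in', 'partes', 'tres']
```

If the same lines were interleaved with a second column or a running header, the fragment di- would be joined to the wrong token, which is why column and zone detection has to come first.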

Access to visually labeled structures within the text: Explicit labeling in this case includes headwords of dictionaries and encyclopedias and canonical citations such as book/chapter/verse/line. These structures draw upon typographical conventions: e.g., bold and indenting to show headwords, numbers in the margins or embedded in the text with brackets to illustrate citations. This stage would recognize where index entries begin along with their headwords and easily recognized citations. This stage corresponds to semantically meaningful structural markup, e.g., descriptive structures about the text.[19] At this stage, the system recognizes that the notes in Figure 5 are a commentary and contain comments on agro vectigali, cum et maxima … ageretur, and tibique … indicium within the text above.

Access to knowledge dynamically generated from analysis of explicitly labeled knowledge: This process can begin with very coarse analysis: if we recognize when various encyclopedia articles describing several dozen figures named Antonius or Alexander begin and end, then we can analyze the vocabulary of each article to begin deciding which Alexander is meant in running text. This stage includes the lemmatization and morphological analysis to support the lookups and searches familiar to classicists for more than a decade (e.g., query fecisset and learn that it is the pluperfect subjunctive of facio, "to do, make"; query facio and retrieve inflected forms such as fecisset).[20] We also need at this stage translation services (e.g., a service that determines whether a given instance of the Latin word oratio more likely corresponds to "oration," "prayer" or some other usage). At this point, knowledge based services augment general text mining (e.g., being able to cluster usages of the dictionary entry, facio, as a whole, or of facio as it is used in the subjunctive etc., rather than treating each form of facio as a separate entity).[21]

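The two lookups described above, from an inflected form to its dictionary entry and back, amount to a pair of indices over morphological analyses. The table below is a hand-made toy with four forms, not the output of any real morphological analyzer, and all names in the sketch are illustrative.

```python
from collections import Counter, defaultdict

# Toy analyses: inflected form -> (dictionary entry, morphological description).
ANALYSES = {
    "fecisset": ("facio", "pluperfect subjunctive active, 3rd sg."),
    "fecit": ("facio", "perfect indicative active, 3rd sg."),
    "faciunt": ("facio", "present indicative active, 3rd pl."),
    "orationem": ("oratio", "accusative sg."),
}

# Inverse index: dictionary entry -> inflected forms.
FORMS_BY_LEMMA = defaultdict(list)
for form, (lemma, _) in sorted(ANALYSES.items()):
    FORMS_BY_LEMMA[lemma].append(form)

def analyze(form):
    """Return (lemma, description) for a form, or None if it is unknown."""
    return ANALYSES.get(form.lower())

def forms_of(lemma):
    """Return the known inflected forms of a dictionary entry."""
    return FORMS_BY_LEMMA.get(lemma, [])

def lemma_counts(tokens):
    """Count usages per dictionary entry rather than per surface form."""
    return Counter(ANALYSES[t.lower()][0] for t in tokens if t.lower() in ANALYSES)

print(analyze("fecisset"))
print(forms_of("facio"))
print(lemma_counts(["fecit", "orationem", "fecisset"]))
```

Counting by dictionary entry is what allows text mining to treat fecit and fecisset as two usages of facio instead of two unrelated strings.
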
Access to linguistically labeled, machine actionable knowledge: This overlaps with the analysis of visual structures but implies a greater emphasis on the analysis of natural language, e.g., "Y, son of Z"; "perf. feci" → the perfect stem is fec-; "b. July 2, 1887" → the subject of this encyclopedia entry was born in 1887 and any references to people by the same name that predate 1887 cannot describe this person; etc. This stage corresponds to encoding information for particular ontologies, i.e., prescriptive structures separate from the text.[22] At this point, we should be able to pose queries such as "encyclopedia entries for Thucydides who is son_of Olorus or has_occupation historian, etc.", "dictionary word senses is_cited_in Homer or has_voice passive", and "Book 1, lines 11-21 from all translations_of Homer_Iliad that have_language German."

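Queries of this kind presuppose that statements extracted from reference works are stored as subject, predicate and object. The following sketch uses a plain Python list in place of a real ontology store; the identifiers thucydides_1 and thucydides_2 and the function name are hypothetical.

```python
# Statements as (subject, predicate, object); identifiers are invented.
TRIPLES = [
    ("thucydides_1", "son_of", "Olorus"),
    ("thucydides_1", "has_occupation", "historian"),
    ("thucydides_2", "son_of", "Melesias"),
    ("thucydides_2", "has_occupation", "politician"),
]

def subjects_matching(triples, conditions, require_all=False):
    """Return subjects satisfying any (or all) of the (predicate, object) conditions."""
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, set()).add((predicate, obj))
    test = all if require_all else any
    return sorted(s for s, pairs in facts.items()
                  if test(condition in pairs for condition in conditions))

query = [("son_of", "Olorus"), ("has_occupation", "historian")]
print(subjects_matching(TRIPLES, query))
# ['thucydides_1']
```

The same structure extends to the other examples: word senses carrying is_cited_in or has_voice statements, or translations carrying translations_of and have_language statements.
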
Techniques exist to address all of the services outlined above. Computer scientists strive for completely general approaches and are willing to accept error rates as a cost to achieve the benefit of scalability. Traditional humanists by contrast manually analyze and, where they feel it necessary, justify the results (i.e., results that may be controversial but for which experts can make reasonable arguments) and are willing to accept labor as a cost to achieve a level of transparency. The grand challenge lies in integrating these two sources of energy: scholars need to be able to build on the results of automated processes but automated processes need to be able to build on scholarly data as well.

Fourth-Generation Collections

We need collections that can support a core set of interlocking services. Core services such as morphological and syntactic analysis, citation identification, word sense disambiguation, word sense discovery, cross-language information retrieval, and named entity identification are, however, data-driven and, for optimal performance, require substantial amounts of carefully encoded knowledge and the largest possible bodies of unstructured data. To support these services within the humanities, we need a new, fourth generation of digital collections.

While classicists have digitized texts for a generation and accurate transcriptions exist for selected editions of almost every author, we do not have the developed, scalable, sustainable knowledge base with which to represent the core primary sources that have survived to us in textual form from Greco-Roman antiquity.

We have already touched upon the first generation of digital primary sources. Classicists still depend primarily upon the Thesaurus Linguae Graecae and Packard Humanities Institute collections of Greek and Latin texts, which follow designs from the 1970s. These first generation digital collections concentrated on accurate transcription of the reconstructed text with structural markup showing where works begin and end. They capture general page layout and approximate citation information: if a number in the margin of the original print edition indicates a line or section begins somewhere in the adjacent line, the human reader is left to determine where the break occurs. They do not contain any of the introductory materials, back matter such as indices and appendices or any textual notes.

Second generation collections (such as those available within the Perseus Digital Library) also emphasize carefully produced transcriptions but include explicit semantic markup that follows the Text Encoding Initiative (TEI) Guidelines. These collections reflect the conditions of the late 1980s, when image capture and storage remained expensive. They thus do not include page images of the original source texts and only occasionally include textual notes. Second generation collections may apply more sophisticated techniques to automate transcription and tagging but their design still assumes an expensive initial, centralized editing process with small fixes for residual errors after the initial production phase.

Third generation collections, popularized by projects such as the Making of America and JSTOR, emerged in the 1990s when storage costs had declined to the point where page images for large collections of books could be kept on-line.[23] Third generation systems minimize manual labor and emphasize automatic analysis of page images, especially the use of OCR software to generate searchable text. As OCR software increases in accuracy, texts can be rescanned and the searchable text can improve. First and second generation collections worked from the inside of the book outwards, focusing on subsets of printed books for digital conversion. Third generation collections, by contrast, work from the outside of the book, starting with book-level library metadata that may be extended with analytical cataloguing for articles within books.

All of the features that characterize the fourth generation have existed in one form or another; our group at the Perseus Project has been developing some aspects of this plan for more than twenty years. What distinguishes fourth generation collections is the integration of a small body of data, carefully curated and laboriously structured by semi-automated or even wholly manual methods, with an arbitrarily large collection for which automated analysis alone is feasible. The semantically encoded data of second generation digital collections becomes the machine actionable reference rooms from which automated systems learn how to structure the vast third generation collections of page images:

Fourth-generation collections contain images of all source writings, whether these are on paper, stone or any other medium: Like third generation collections, the Cybereditions project sets out to incorporate page images of all print originals. Our goal is to help classical scholars shift the center of gravity for textual scholarship to a networked, digital environment. Scholars should not have to consult paper originals of scanned print editions to see what was on the original page.

Fourth-generation collections manage legacy structures derived from physical books and pages but focus primarily upon logical structures that exist within and across pages and books: Even when fourth-generation collections depend upon page images and exploit legacy book-page citations, they are fundamentally oriented towards the underlying logical structures within the documents. A great deal of emphasis is placed upon page layout analysis so that we can isolate not only tables of contents, bibliographic references and indices but dictionary and encyclopedia entries and critical scholarly document types such as commentary notes and textual apparatus. Cross language information retrieval hunts for translations of primary sources. Alignment services align OCR generated text to XML editions of the same works with established structural metadata. Quotation identification services spot commentaries by recognizing sequences of quotations from the same text at the start of paragraphs.

Fourth-generation collections integrate XML transcriptions of original print data as these become available: All digital editions are, at the least, re-born digital: The best work published so far cannot convert the elliptical and abbreviated conventions by which scholars represent textual data in print into machine actionable data — we cannot even reliably link the textual notes to the chunks of text which they cover, much less convert these notes into machine actionable formats so that we could automatically compare the readings from one MS against those of another. Fourth generation collections naturally integrate page images with XML representations of varying sophistication. XML representations may, like first generation collections, capture basic page layout and they may have advanced structural and basic semantic markup (e.g., careful tagging for each speaker in a play). They may encode no textual notes, textual notes as simple footnotes (free text associated with a point in the reconstructed text) or as fully machine actionable variants (e.g., variants associated with spans of source text, such that we can, among other things, compare the text in various editions or witnesses).

Fourth-generation collections contain machine actionable reference materials: Our digital collections should be tightly and automatically embedded in a growing web of machine actionable reference materials. If a new prosopography or lexicon appears, links should appear between its articles and references to the people or words in the primary sources. Commentaries should align themselves automatically to multiple editions of their subject work. To the extent possible, these links should bear human readable and machine actionable information: humans should be able to see from a link what the destination is about (e.g., "Thucydides the Historian" rather than "Thucydides-3," ἀρχή-"empire" rather than "ἀρχή-sense2"). Equally important, these links should point to machine actionable information: a named entity system should be able to mine the entries in the biographical encyclopedia to distinguish Thucydides the Historian, Thucydides the mid-fifth century Athenian politician and various other people by that name; a word sense disambiguation system should be able to use the lexicon entries to find untagged instances where ἀρχή corresponds to "empire" or "beginning." Editions should be self-collating — when a new edition of a text comes on-line, we should see immediately how it differs from its predecessors.
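The idea of a self-collating edition can be illustrated with a word-level comparison of two transcriptions. The following is only a minimal sketch, not a description of any existing Perseus service: it relies on Python's standard difflib module, and the second "edition" contains a variant invented purely for illustration.

```python
import difflib

def collate(base_words, new_words):
    """Return (operation, base_span, new_span) for each place two editions differ."""
    matcher = difflib.SequenceMatcher(a=base_words, b=new_words, autojunk=False)
    return [
        (tag, base_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

# Opening words of the Aeneid; "primum" is an invented variant, not a real reading.
edition_a = "arma virumque cano Troiae qui primus ab oris".split()
edition_b = "arma virumque cano Troiae qui primum ab oris".split()

differences = collate(edition_a, edition_b)
print(differences)  # [('replace', ['primus'], ['primum'])]
```

A production collation would also have to normalize orthography and punctuation before comparing, which this sketch leaves out.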

Fourth generation collections learn from themselves: Even the simplest digital collection depends upon automated processes to generate text from page images or indices from text. Clustering and other text mining techniques discover meaning in unstructured textual data. Fourth-generation collections, however, can also learn from the machine actionable reference materials that they contain so that they apply increasingly sophisticated analytical and visualization services to their content. In effect, they use a small body of structured data (training sets, machine actionable dictionaries, linguistic databases, encyclopedias and gazetteers), together with heuristics for classification, to find structure within the much larger body of content for which only OCR-generated text and catalogue level metadata are available. In a fourth generation collection, structured documents are programs that services compile into machine actionable code: Aeneid, book 2, line 48 in a dozen different editions already on-line as image books with OCR generated text.

Fourth generation collections learn from their users: Even third generation systems depend upon the ability of OCR software to classify markings into distinct letters and words. Fourth generation systems include an increasing number of classification systems such as named entity analysis, word sense disambiguation, syntactic analysis, morphological analysis, citation and quotation identification. Where there are simple decidable answers (e.g., to which Alexandria does a particular text refer?) we want users to be able to submit corrections. Where the answers are less well-defined (e.g., expert annotators do not agree on word sense assignment and some passages are simply open to multiple interpretations), we need to be able to manage multiple annotations. Human annotators need to be able to own their contributions and readers should be able to form conclusions about their confidence in individual contributors. Automated systems need to be able to make intelligent use of human annotation, determining how much weight to apply to various contributions, especially where these conflict. We therefore need a multi-layer system that can track contributions, by both humans and automated systems, through different versions of the same texts.
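One simple way to weigh conflicting contributions is a trust-weighted vote that keeps every annotation on record. The sketch below is an assumption made for illustration: the contributor names, trust values and default weight are all hypothetical, and a real system would have to estimate trust from evidence.

```python
from collections import defaultdict

def resolve(annotations, trust, default_weight=0.1):
    """Return the label with the highest total trust, plus all label totals.

    annotations: list of (contributor, label) pairs
    trust: mapping contributor -> weight; unknown contributors get default_weight
    """
    totals = defaultdict(float)
    for contributor, label in annotations:
        totals[label] += trust.get(contributor, default_weight)
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Hypothetical contributors: one automated system, one reader, one expert editor.
trust = {"ner_system": 0.6, "reader_17": 0.3, "editor_a": 1.0}
votes = [
    ("ner_system", "Alexandria (Egypt)"),
    ("reader_17", "Alexandria (Egypt)"),
    ("editor_a", "Alexandria Troas"),
]
label, totals = resolve(votes, trust)
print(label)  # Alexandria Troas
```

Because the totals for every label are returned alongside the winner, a later layer can still show readers that the question was contested.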

Fourth generation collections adapt themselves to their readers, both according to specific recommendations (customization) and by making inferences from observed user behavior (personalization): Fourth-generation collections can process knowledge profiles that model the backgrounds of particular users: e.g., one user may be a professor, expert in early Modern Italian, who has read extensively in Machiavelli but has had only a few semesters of classical Greek with which to read Thucydides and Plato. The fourth-generation collection can determine with tolerable accuracy what words in a new Italian or Greek text will be new and/or of interest, given the differing backgrounds but consistent research interests of the professor. At the same time, the system can infer from the reader’s behavior what other resources may be of interest.

Fourth-generation collections enable deep computation, with as many services applied to their content as possible: No monolithic system can provide the best version of every advanced service upon which scholarship depends. Google, for example, has a growing number of publications about ancient Greece but currently produces only limited searchable text from classical Greek. Different groups should be able to apply various systems for morphological and syntactic analysis, named entity identification, and various text mining and visualization techniques with minimal, if any, restrictions. These groups should include commercial service providers as well as individual scholars and scholarly teams.

The Classical Apographeme

Fourth-generation collections allow us to design corpora that go far beyond limitations that we internalized in print culture. To describe comprehensive fourth-generation collections we use the term apographeme, derived from the Greek word for copy (apographê). The apographeme echoes the term genome because an apographeme contains, in its mature form, a complete record of every surviving linguistic source for a particular corpus. For classicists, an apographeme of Greek and Latin would contain representations of every written version of every piece of writing from Greco-Roman antiquity. This includes images of every page of every inscription, papyrus, graffito, manuscript, and printed edition — the entire surviving record of the linguistic output for classical Greece and Rome — and the knowledge base whereby machines can intelligently process and humans productively decipher, insofar as existing knowledge and probing intellect can, every written word in every witness. In a library grounded on images of writing, there is no fundamental reason not to integrate, at the base level, images of writing from all surfaces. Inscriptions, papyri, and manuscripts may not be suitable fodder with which OCR software can generate useful text, but neither Google nor OCA can produce much useable output for even the best printed classical Greek and little, if any, useable text from early modern books.

The Cybereditions project at Tufts has begun preliminary work for this massive task, focusing on the texts that have survived from Greco-Roman antiquity through manuscript tradition. These literary texts are, however, designed from the start to become part of a larger collection that will include documentary materials that survive on stone and papyrus (see Hugh Cayless's article in this collection) as well as manuscripts (see Casey Dué and Mary Ebbott's article in this collection). While developing the underlying bibliography is a major and on-going task of the Cybereditions project, we currently estimate that this apographeme would contain the following (because page images would be the first stage of collection, we use "books" as a rough initial unit of measure). Major work for the Cybereditions project will be (1) to complete a first cut of the bibliography below, (2) to begin creating the apographeme, with particular attention to the published editions, and (3) to make progress on the services that will convert these image pages into machine actionable data, with particular attention to the problem of high accuracy OCR for Greek and Latin.

We will not be able to create a comprehensive apographeme for classical Greek and Latin for many years but we can establish a solid foundation from that portion of the apographeme represented by texts that have survived in manuscript tradition. The figures associated with each element reflect very preliminary estimates for broad, illustrative coverage sufficient to model a more mature system that can evolve over time.

c. 500 "book-length" authors/collections. Hundreds and thousands of ancient Greek and Latin authors survive as names or with a small number of fragments preserved in quotations of later authors or on papyrus. F. W. Hall’s Handbook to Classical Texts lists 133 entries in its survey of the "chief classical writers," including portmanteau works with many authors (e.g., the Greek Anthology) and authors with very large corpora (e.g., Aristotle and Cicero).[24] The Loeb Classical Library does not contain comprehensive editions for massive authors such as Galen or the early Church Fathers but its 500 volumes contain Greek and Latin texts as well as English translations for most surviving authors and works. Because roughly half of each Loeb volume is given over to the translation, the series holds about 250 volumes' worth of source text; if we assume that Galen and the early Church Fathers would double the size of the Loeb, then we would have c. 500 volumes' worth of Greek and Latin source text. Measured by word count, the corpora of classical Greek and Latin are closer to 100 and 20 million words respectively.[25]

c. 1,000 manuscripts (MS) and an undetermined number of papyri, many very small fragments of literary works. Based on a survey of summary data from Richard and Olivier's Repertoire des bibliothèques et des catalogues de manuscrits grecs (1995), we possess more than 30,000 medieval manuscripts that contain at least parts of Greek classical texts (there are nearly 1,200 manuscripts for Aristotle alone). Since the number of extant Latin manuscripts is conventionally assumed to be 5 to 10 times that of Greek manuscripts, there might be as many as 150,000 to 300,000 manuscripts for Classical Latin. Nevertheless, a small subset of these provide most of the textual information relevant to the authors and editions of the most commonly studied authors. Hall’s early twentieth-century Handbook to Classical Texts summarized the major MS sources for major classical authors and contains c. 650 readily identifiable MS sigla (e.g., patterns of the form "A = Parisinus 7794"). While editors have since added additional manuscripts of importance for most authors, Hall provides a reasonable estimate for the number of the manuscripts on which our editions primarily depend. Some authors do not have a few very authoritative MSS and editors must examine large numbers of MSS of roughly equal authority, and these will inflate the total. Assuming that this list underestimates the whole by 50-100%, we are still left with the evidence that a database of 1,000 MSS would represent the majority of textual knowledge preserved for us by MS transmission.

c. 5,000 major editions over the five centuries extending from the editiones principes of the early modern period to the start of the twenty-first century. This figure assumes, at the high end, that each of the c. 500 authors has c. 10 volumes' worth of major editions. Multi-work canonical authors will have many editions of individual and selected works. At the very high end, the New Variorum Shakespeare series chooses c. 50 editions of each play as worth collation and this may represent an upper bound for canonical texts outside the Bible.

c. 5,000 translations in European languages such as English, French, German and Italian. These are important because we can use parallel text analysis to infer translation equivalents and word senses, apply advanced language services (e.g., syntactic analysis, named entity analysis) to the translations, and then project the results back onto the original. Such a technique can, for example, add roughly 15 percentage points to our current ability to analyze Latin syntax (e.g., from 54% to 70%).

c. 5,000 modern commentaries, author lexica etc. These are useful for human readers and may lend machine actionable data as well.

c. 1,000 general reference works such as lexica, grammars, encyclopedias, indices and other entry/labeled paragraph reference works with high concentrations of citations and, in some cases, elaborate knowledge bearing hierarchical structures.

c. 1,000 specialized studies of Greek and Latin language in a sufficiently structured format for high precision information extraction.
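The arithmetic behind two of these estimates can be checked in a few lines. The figures below are the ones given in the list above; the variable names are ours, and the final total treats every element as a book-like unit, which the text does only as a rough initial measure.

```python
# Figures from the estimates above; names are illustrative only.
authors = 500
editions_per_author = 10          # high-end assumption for major editions
hall_sigla = 650                  # MS sigla readily identifiable in Hall's Handbook
underestimate = (0.5, 1.0)        # Hall assumed to miss 50-100% of relevant MSS

editions = authors * editions_per_author
ms_low, ms_high = (round(hall_sigla * (1 + u)) for u in underestimate)

estimates = {
    "authors/collections": 500,
    "manuscripts": 1000,
    "editions": editions,
    "translations": 5000,
    "commentaries and author lexica": 5000,
    "general reference works": 1000,
    "specialized studies": 1000,
}
total = sum(estimates.values())

print(editions)          # 5000
print(ms_low, ms_high)   # 975 1300
print(total)             # 18500
```

The manuscript range of 975 to 1,300 is what lies behind the rounded figure of c. 1,000 MSS.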

Three Technical Challenges

The implications of very large collections for the humanities are profound. We can transform existing research agendas, render content physically and intellectually accessible to new audiences and make human inquiry possible over barriers of language, culture and sheer volume. An immense amount must be — and is being — done. Within this context, we offer three strategic areas of development that are both essential for the humanities and are not, to our knowledge, currently covered by industrially driven research. These areas of interest include the need to transform page images into machine-readable text, machine readable text into machine actionable knowledge, and text from one language into another. Each of these areas of development reflects the particular needs of humanities scholarship and would benefit from targeted support.

Leverage the fact that many historical texts quote documents for which excellent transcriptions exist in machine-readable form

Thus, the tenth century Venetus A manuscript (Figure 1 and Figure 2) and Jensen’s 1475 incunabulum (Figure 3 and Figure 4) contain texts of Homer and Augustine. We need systems that can use their knowledge that a given document represents texts for which transcriptions exist to decode the writing system of the document, to separate text from headings, notes, and other annotations, to recognize and expand idiosyncratic abbreviations of words within the text, to distinguish variants from errors, and to provide alignments between the transcribed text and their probable equivalents on the written page. Even if we only succeed in general alignments between a canonical text and sources such as early modern printed books and manuscripts, the results will be significant.[26] If we can improve our ability to collate manuscripts or extract useful text from otherwise intractable sources, the results will be powerful.

This task requires very different OCR technology from that currently in use. In this case, we assume that our texts contain many passages for which we possess good transcriptions. The problem becomes (1) finding those quotations, (2) learning what written symbols correspond to various components of transcription, and (3) comparing multiple versions of the same passage to distinguish variants and errors. The OCR system uses a library of known texts to learn new fonts, idiosyncratic abbreviations and even handwriting.
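Steps (1) and (2) can be sketched with approximate string matching: locate the passage of a known transcription that best matches a noisy OCR line, then read off which characters were confused. This is a toy illustration under strong assumptions (lowercased text, no punctuation, an exhaustive sliding window) and uses only Python's standard difflib; it is not the OCR technology that the task would actually require. The OCR line is invented.

```python
import difflib
from collections import Counter

def best_window(ocr, known):
    """Slide a window of len(ocr) over the known text; return (ratio, offset) of the best match."""
    n = len(ocr)
    scored = (
        (difflib.SequenceMatcher(None, ocr, known[i:i + n], autojunk=False).ratio(), i)
        for i in range(max(1, len(known) - n + 1))
    )
    return max(scored)

def confusions(ocr, reference):
    """Count equal-length character substitutions (ocr_char, reference_char)."""
    counts = Counter()
    matcher = difflib.SequenceMatcher(None, ocr, reference, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 == j2 - j1:
            counts.update(zip(ocr[i1:i2], reference[j1:j2]))
    return counts

known = "arma virumque cano troiae qui primus ab oris italiam fato profugus"
ocr = "qui primuf ab oris"  # invented OCR line: final "s" misread as "f"

ratio, offset = best_window(ocr, known)
passage = known[offset:offset + len(ocr)]
print(passage)                   # qui primus ab oris
print(confusions(ocr, passage))  # Counter({('f', 's'): 1})
```

Confusion counts gathered over many aligned quotations are the kind of evidence from which a system could learn an unfamiliar font or hand; step (3), separating variants from errors, needs several witnesses and is not attempted here.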

There are two measures for this category of OCR. First, there is the overall character accuracy of transcriptional output from documents that the OCR software produces by training itself with recognized quotations. Second, the ability to locate quotations of existing texts is an important scholarly task in and of itself.[27] Two of the prime tasks in the German eAqua Classics Text Mining Project focus on identifying undiscovered quotations of Plato and of Greek Fragmentary Historians.[28] The apparatus criticus for the Ahlberg Sallust (Figure 9 and Figure 10), for example, includes not only textual variants but testimonia — places where later authors have quoted Sallust. Such manually constructed lists of testimonia provide us with instruments with which to measure precision and recall for automated methods.
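Given a manually constructed list of testimonia as the gold standard, precision and recall for an automated quotation finder follow directly from set overlap. The identifiers below are hypothetical placeholders, not entries from any apparatus.

```python
def precision_recall(found, gold):
    """Precision and recall of automatically found quotations against a gold list."""
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical identifiers for quotation pairs (source passage, quoting passage).
gold = {"testimonium-1", "testimonium-2", "testimonium-3", "testimonium-4"}
found = {"testimonium-1", "testimonium-2", "candidate-9"}

precision, recall = precision_recall(found, gold)
print(round(precision, 3), recall)  # 0.667 0.5
```

A candidate outside the gold list may be a genuine, previously unnoticed quotation rather than an error, so low precision against a printed apparatus deserves manual review before it is counted against the system.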

Use propositional data already available to decode the formats in which unrecognized knowledge has been stored.

Printed reference works contain an immense body of information that can be converted into machine actionable knowledge. The Perseus Digital Library, to take one example, has tagged hundreds of thousands of propositional statements within reference works originally published on paper. Thus, the Liddell-Scott-Jones Greek-English (Figure 13 and Figure 14) and Lewis and Short Latin-English lexica contain 422,000 tagged citations to Greek authors and 303,000 to Latin authors (i.e., citations tagged with author numbers from the TLG and PHI canons of Greek and Latin authors). Since the structure of the dictionary articles has also been tagged, many of these citations represent propositional statements of the form SENSE-M of DICTIONARY-WORD-N appears in CITATION-P of AUTHOR-Q.
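One possible machine actionable form for such a statement is a small record type whose fields mirror the four slots of the template. The sketch below is an assumption for illustration only: the class name, the grouping function and the placeholder citations are ours and are not drawn from either lexicon.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SenseCitation:
    word: str      # DICTIONARY-WORD-N
    sense: str     # SENSE-M
    author: str    # AUTHOR-Q
    citation: str  # CITATION-P

def training_index(propositions):
    """Group cited passages by (word, sense) so each sense has labeled examples."""
    index = defaultdict(list)
    for p in propositions:
        index[(p.word, p.sense)].append((p.author, p.citation))
    return dict(index)

# Placeholder authors and citations, not actual lexicon data.
props = [
    SenseCitation("ἀρχή", "beginning", "AUTHOR-1", "1.1"),
    SenseCitation("ἀρχή", "empire", "AUTHOR-2", "2.5"),
    SenseCitation("ἀρχή", "empire", "AUTHOR-1", "3.7"),
]
index = training_index(props)
print(index[("ἀρχή", "empire")])  # [('AUTHOR-2', '2.5'), ('AUTHOR-1', '3.7')]
```

Grouped this way, the citations under each sense become labeled examples that a word sense disambiguation system could train on.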

The works of many Greek and Roman authors survive only insofar as other authors have quoted or described them. These fragmentary texts are published as lists of excerpts (Figure 12). Thus, fragment 116 of the historian Ephorus in Mueller’s edition contains an excerpt from chapter 12 of Plutarch’s Life of Cimon. Each such fragment represents the propositional statement "EXCERPT-A from CITATION-C of AUTHOR-D refers to (fragmentary) AUTHOR-X." Note that not all citations refer to the fragmentary author: thus, fragment 113 of Ephorus includes a cross-reference for background information on a historical event in Herodotus, who wrote before Ephorus.

Grammars also contain well-structured information: citations within a section on contrary to fact conditionals, for example (Figure 15 and Figure 16 through Figure 18), can be converted into propositional form: e.g., GRAMMATICAL-STRUCTURE CONTRARY-TO-FACT occurs at Xenophon’s Cyropaedia, book 1, chapter 2, section 16. Fine-grained analysis of the print content can also extract quotations and their English translations that appear throughout reference grammars and lexica. Smyth’s Greek Grammar, the German Kühner-Gerth reference Greek Grammar, and the Allen and Greenough Latin Grammar contain 5,300, 21,000 and 2,000 tagged citations, respectively, within labeled sections.

Citations in indices of proper names and in encyclopedias about people and places provide similar propositional data to disambiguate references to ambiguous names: thus, the print index to Rawlinson’s Herodotus (Figure 20) distinguishes passages where Herodotus cites Alexander, a king of Macedon, from Alexander, the son of Priam who appears in the Trojan War. Encyclopedias (Figure 22) contain citations from many different sources and many different people and places with the same name. By converting the citations to links and then extracting the contexts in which different Alexanders appear, machine learning algorithms can be used to find patterns with which to distinguish one Alexander from another elsewhere. Smith’s biographical and geographical dictionaries contain 37,000 tagged citations for 20,000 entries on people and 26,000 tagged citations for 10,000 entries on places. The Perseus Encyclopedia, integrating entries from originally separate print indices, contains 69,000 citations for 13,000 entries.
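The step from labeled contexts to a classifier can be shown with a deliberately crude word-overlap model. This is a stand-in for the machine learning algorithms mentioned above, not a recommendation of method, and the context sentences are invented English examples rather than text extracted from any index or encyclopedia.

```python
from collections import Counter

def build_profiles(labeled_contexts):
    """Build a bag-of-words profile for each labeled person."""
    profiles = {}
    for label, text in labeled_contexts:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles

def classify(context, profiles):
    """Pick the label whose profile shares the most word occurrences with the context."""
    words = context.lower().split()
    scores = {label: sum(counts[w] for w in words) for label, counts in profiles.items()}
    return max(scores, key=scores.get)

# Invented contexts standing in for passages reached through tagged citations.
labeled = [
    ("Alexander of Macedon", "Alexander son of Amyntas king of the Macedonians"),
    ("Alexander of Macedon", "the Macedonians under Alexander sent envoys"),
    ("Alexander son of Priam", "Alexander son of Priam carried Helen to Troy"),
    ("Alexander son of Priam", "Helen and Alexander sailed for Troy"),
]
profiles = build_profiles(labeled)

print(classify("Alexander brought Helen home to Troy", profiles))
# Alexander son of Priam
print(classify("the king of the Macedonians Alexander", profiles))
# Alexander of Macedon
```

With tens of thousands of tagged citations, the same labeled contexts could feed a properly trained statistical classifier instead of raw overlap counts.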

A great deal of information remains to be mined from the print record and we need to be able to leverage the information already extracted to extract even more from the much larger body of reference materials available only as page images.

Extraction contains at least two dimensions. In each case, we need more scalable methods.

Parsing the structure of individual documents: Even if we can recognize that "Th. 1.33" represents a citation to a text, we need to determine whether this cites book 1, chapter 33 of Thucydides’ Peloponnesian War or Idyll 1, line 33 of Theocritus. The indices shown at Figure 20, Figure 21, Figure 15, etc. illustrate some of the varying formats with which different works encode similar information.

Aligning information from different documents: Author indices distinguish different people and places with the same name in the same document, but aligning information from multiple author indices is not easy. Is Alexander the son of Amyntas in Herodotus the same person as Alexander the father of Perdiccas in Thucydides?
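The "Th. 1.33" problem from the first dimension can be made concrete. The sketch below parses an abbreviated citation and lists every reading that is structurally possible. The abbreviation table and the per-level upper bounds are hypothetical, coarse values chosen for illustration and have not been checked against any edition.

```python
import re

# Hypothetical table: abbreviation -> candidate (author, citation levels, coarse upper bounds).
CANDIDATES = {
    "Th.": [
        ("Thucydides", ("book", "chapter"), (8, 146)),
        ("Theocritus", ("idyll", "line"), (30, 223)),
    ],
}

CITATION = re.compile(r"^(?P<abbr>[A-Z][a-z]*\.)\s*(?P<nums>\d+(?:\.\d+)*)$")

def resolve(citation):
    """Return every (author, {level: number}) reading consistent with the bounds."""
    match = CITATION.match(citation.strip())
    if not match:
        return []
    numbers = [int(n) for n in match.group("nums").split(".")]
    readings = []
    for author, levels, limits in CANDIDATES.get(match.group("abbr"), []):
        in_range = all(1 <= n <= hi for n, hi in zip(numbers, limits))
        if len(numbers) == len(levels) and in_range:
            readings.append((author, dict(zip(levels, numbers))))
    return readings

print(resolve("Th. 1.33"))
# [('Thucydides', {'book': 1, 'chapter': 33}), ('Theocritus', {'idyll': 1, 'line': 33})]
print(resolve("Th. 12.5"))
# [('Theocritus', {'idyll': 12, 'line': 5})]
```

Structural checks remove some readings but leave "Th. 1.33" ambiguous, which is why the conventions of the citing document itself (its abbreviation list, its subject matter) have to be parsed as well.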

Use existing translations of source texts to generate multi-lingual services such as cross language information retrieval, word sense disambiguation and other searching/translation services.

There are already English translations aligned by canonical citation to more than 5,000,000 words of Greek and Latin available in the Perseus Digital Library. These provide enough parallel text to support basic multi-lingual services such as contextualized word glossing (e.g., recognizing in a given context whether oratio is more likely to correspond to "prayer," "oration" or some other word sense), cross language information retrieval (e.g., being able to generate "prayer" and "oration" as possible English equivalents of Latin oratio), and semantic searching (e.g., find all Latin and Greek words that probably correspond to the English word "prayer" in particular passages).

The larger our collections of parallel text and translation, the more powerful the services can become. We need methods to locate more translations of Greek and Latin and then to align these with their sources. In some cases, library metadata will allow us to identify translations of particular Greek and Latin works. In other cases, however, we will need to depend upon cross-language information retrieval to find translations where no machine actionable cataloguing exists (e.g., anthologies, quotations of excerpts or smaller works).
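Candidate translation equivalents of the kind described above can be extracted from aligned passages by counting co-occurrence. The following sketch scores English words against a Latin word with the Dice coefficient. It is a toy under stated assumptions: the four aligned pairs are invented, the stopword list is ad hoc, and real systems would use statistical word alignment over far more text.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "is", "was"}  # ad hoc list for this toy example

def dice_scores(pairs, source_word):
    """Score each target word by its Dice coefficient of co-occurrence with source_word."""
    source_count = 0
    target_counts = Counter()
    joint = Counter()
    for source, target in pairs:
        target_words = set(target.lower().split()) - STOPWORDS
        target_counts.update(target_words)
        if source_word in source.lower().split():
            source_count += 1
            joint.update(target_words)
    return {w: 2 * joint[w] / (source_count + target_counts[w]) for w in joint}

# Invented aligned passages, not taken from any edition or translation.
pairs = [
    ("oratio consulis longa fuit", "the speech of the consul was long"),
    ("oratio ad deos", "a prayer to the gods"),
    ("consul ad urbem venit", "the consul came to the city"),
    ("oratio brevis est", "the speech is short"),
]
scores = dice_scores(pairs, "oratio")
print(max(scores, key=scores.get))  # speech
```

On so small a sample, "prayer" scores no higher than unrelated words such as "long" or "short", which illustrates why the size of the parallel collection matters.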

Once we have identified a translation, we need automated methods to align translation and text. Figure 11 shows a best case scenario: a book where the modern translation and classical source text are printed side by side. In this case, the modern translation shares the chapter number of the Latin source text (both have "LXIV" to indicate that they include chapter 64), but the English translation does not include the finer grained section numbers in the Latin text. We need automated methods to align the many translations now appearing in large image book collections.
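For the best case described here, where source and translation share chapter numbers, a coarse alignment can be sketched by reading the Roman numerals that open each chapter and pairing chapters with the same number. The input lines are placeholders, and the heading pattern is an assumption about how the OCR output might look.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
HEADING = re.compile(r"^([IVXLCDM]+)\.?\s+(.*)$")

def roman_to_int(numeral):
    """Convert a Roman numeral, subtracting a value that precedes a larger one."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total

def chapters(lines):
    """Map chapter number -> text for lines that open with a Roman numeral."""
    result = {}
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            result[roman_to_int(match.group(1))] = match.group(2)
    return result

def align(source_lines, translation_lines):
    """Pair source and translation chapters that carry the same number."""
    src, tr = chapters(source_lines), chapters(translation_lines)
    return {n: (src[n], tr[n]) for n in sorted(src.keys() & tr.keys())}

# Placeholder text standing in for OCR output from facing pages.
latin = ["LXIII. <Latin text of chapter 63>", "LXIV. <Latin text of chapter 64>"]
english = ["LXIV. <English text of chapter 64>", "LXV. <English text of chapter 65>"]

aligned = align(latin, english)
print(aligned)
# {64: ('<Latin text of chapter 64>', '<English text of chapter 64>')}
```

This pattern would misread an English line beginning with the pronoun "I" as a chapter heading, and it cannot recover the finer section numbers that the translation omits; both need evidence from the text itself, such as the translation equivalents discussed earlier.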

Conclusion

Comprehensive collections of industrially scanned written materials provide historic new instruments with which to better understand and to make intellectually accessible the record of human existence. These comprehensive collections of scanned print materials are, however, not an end in themselves but instead provide the foundation on which new collections, integrating images of writing with machine actionable data, will support a new generation of services for a new generation of intellectual projects.

Appendix: Sample Page Images

Primary Sources

The 10th Century Venetus A MS of Homer

The 10th Century Venetus A MS of Homer: U4 (Allen): Marcianus Graecus Z. 458 (= 841) - the back (verso) of folio 15 (available under a Creative Commons license from Harvard’s Center for Hellenic Studies: http://chs.harvard.edu/chs/manuscript_images). The knowledge based OCR project recommended in this report would allow us to work with manuscripts as well as printed materials.

Editions of Fragmentary Authors and Works

Typical page from Mueller's Fragmenta Graecorum Historicorum. Above we see an edition of a fragmentary Greek author — quotations of and allusions to the Greek historian Ephorus, whose works have been lost. Each fragment contains one or more citations to works that provide information about a particular passage in Ephorus. The format is Fragment number — Citation — Excerpt. Latin translations of the Greek excerpts appear at the bottom of the page.

Rutherford’s First Greek Syntax

The index to Rutherford’s First Greek Grammar: note that citations point to the numbered paragraphs rather than the page numbers. The index appears at the end of the book and an automated system could infer that pages were not the citation scheme because almost all of the numbers in the text above are greater than the current page (174).

Information about People, Places, Organizations and other Named Entities

Section from the index to the Loeb Edition of Thucydides. In this case, the index uses the canonical book/chapter/section citation scheme, using upper case Roman numerals for books, lower case Roman numerals for chapters and Arabic numbers for sections.

Figure 20.

Index to Rawlinson’s Herodotus. In this case, the citations point to the particular volumes and page numbers of this translation rather than to the conventional book and chapter references. These references are, however, in the original pages and we could convert the idiosyncratic citations above to a more standard format by checking vol. 3, page 187, for example, to determine that Alexander appears in Herodotus, book 5, chapter 17.

Figure 21.

Page from vol. 1 of Smith’s Dictionary of Greek and Roman Biography (1848).

Figure 22.

Detail from Smith’s

Figure 23.

Detail from the article on Alexander I from Smith’s Dictionary above.

Notes

[3] Workshops took place at the University of Chicago (November 2006), Tufts University (May 2007), the Council for Library and Information Resources (Washington, DC, November 2007), Imperial College London (March 2008) and Humboldt University in Berlin (March 2008).

[25] Most surviving classical Latin was composed after antiquity. Johannes Ramminger had, as of 2008, assembled more than 200 million words of Latin in digital form (http://www.neulatein.de/, accessed October 19, 2008). The Thesaurus Linguae Latinae (TLL) is based on an archive of 10 million slips, which contain, for the older texts, a slip for each occurrence of a word (http://www.thesaurus.badw.de/english/index.htm, accessed October 19, 2008). The Packard Humanities Institute CD ROM of Latin, which is fairly comprehensive through 200 CE and contains some later materials, contains c. 7.5 million words.

Crane 2008 Crane, G. and A. Friedlander. Many More than a Million: Building the Digital Environment for the Age of Abundance. Report of a One-Day Seminar on Promoting Digital Scholarship Sponsored by the Council on Library and Information Resources. November 28, 2007 Final Report. March 1, 2008. http://www.clir.org/activities/digitalscholar/Nov28final.pdf.

Kyrillidou, M., and M. Young. ARL Statistics 2005-06: a compilation of statistics from the one hundred and twenty-three members of the Association of Research Libraries. http://www.arl.org/bm~doc/arlstats06.pdf.

Smith, David A., and Gregory Crane. “Disambiguating Geographic Names in a Historical Digital Library”. Presented at ECDL 2001. Proceedings of the Fifth European Conference on Research and Advanced Technology for Digital Libraries (2001), pp. 127-136. http://perseus.mpiwg-berlin.mpg.de/Articles/geodl01.pdf.