dedicated to DATA: digitally assisted text analysis

...the broad circumference
Hung on his shoulders like the Moon, whose Orb
Through Optic Glass the Tuscan Artist views
At Ev’ning from the top of Fesole,
Or in Valdarno, to descry new Lands,
Rivers or Mountains in her spotty Globe.
(Paradise Lost, 1. 286-91)

This is a report about the current state of the collaborative curation of TCP texts. While I have written about this topic many times on this blog, this report is written for newcomers who have an interest in what was printed before 1800 but may or may not know anything about TCP texts. TCP stands...

This is a report on a “mixed initiative”–a term of art in computer science–that combines old-fashioned philological elbow grease with new-fangled long short-term memory neural network processing (LSTM). The goal is to fix as many as possible of the approximately five million incompletely transcribed words in the 1.7 billion word TCP corpus of English printed...

I have put a new version of Shakespeare His Contemporaries on Google Drive, where you may view or download the plays. In this version I have grouped the plays by decades and put them in directories with names like 155, 156 …165. The plays have been encoded in TEI Simple. The texts are in...

While reviewing the work of Hannah, Kate, and Lydia, I enjoyed the precision and concision of their annotations. A sample of them appears below. While a full documentation would require snippets of the image and the transcription as well as the annotation, the annotations themselves clearly show their minds at work, combining clear description with...

Below are the reflections of Hannah Bredar, Kate Needham, and Lydia Zoells about their adventures in the mundane world of Lower Criticism, about which I wrote in an earlier blog and of which the digital surrogates of our cultural heritage will need a lot in the decades to come. Racine observes in his preface...

This is a progress report on the basic clean-up of the 504 plays in my current Shakespeare his Contemporaries corpus (SHC). I hope to release an updated corpus by the end of November. It will replace the current corpus at https://github.com/martinmueller39/shc. The SHC texts are partially curated versions of the TCP texts, which have “known...

There are somewhere in the neighbourhood of five million incompletely transcribed words in the roughly two billion words of English books before 1700 transcribed by the Text Creation Partnership. Depending on how you look at it, that is either a lot or not very much at all. Less than half a percent of words are...

In my earlier post “From Shakespeare His Contemporaries to the Book of English” I promised to release all SHC plays “later this spring.” I have now done so, and you may download all 504 of them from https://github.com/martinmueller39/shc. Most of the texts come from Phase I of the TCP project and have been in the...

The following is a reposting of excerpts from a 2009 report by two undergraduate students of mine, Emily Anderson and Sasha Puchalla. As part of a course assignment, they checked the TCP EEBO transcription of Marlowe’s Tamburlaine. They worked from a spreadsheet with a ‘verticalized’ representation of the text in which every word was a...

Today’s New York Times carried a touching obituary of Claude Anne Lopez, author of Mon Cher Papa: Franklin and the Ladies of Paris and other biographical studies of Franklin. A Jewish refugee from Nazi-occupied Belgium, she arrived in America in 1941. She married an historian who moved to Yale, where the only employment available to...

The title of this blog entry is the title of a keynote address I gave at the Chicago Digital Humanities and Computer Science Colloquium, held November 18-19, 2012 at the University of Chicago. There is a PDF of the talk at http://panini.northwestern.edu/mmueller/backtothefuture.pdf. The talk was about the challenges and opportunities posed by the TCP...

“Revolutionizing Early Modern Studies?” was the question that governed the recent EEBO-TCP 2012 conference sponsored by the Bodleian Library. I gave a talk there about “Towards a Book of English: A linguistically annotated corpus of the EEBO-TCP texts.” In another blog I will write about the ways in which this project will keep Phil Burns and...