Help wanted by Perseus with metadata for Patrologia Graeca

The Perseus project are working on the Patrologia Graeca and Patrologia Latina. I’m not entirely certain what they are hoping to produce as output, but it looks as if they are OCRing the volumes, as best they can, and producing lists of which texts each volume contains, at which page and column numbers, together with the footnotes, introductions, etc. They also need help with proofreading.

It might be a fun thing to get involved in, if you have some time (which I don’t myself), although how you contact them I don’t know; curiously, they do not say.

http://tinyurl.com/p39fx3f [draft — January 19, 2015]
Gregory Crane (Perseus Project and the Open Philology Project, The University of Leipzig and Tufts University)

We are looking for help in preparing metadata for the Patrologia Graeca (PG) component of what we are calling the Open Migne Project, an attempt to make the most useful possible transcripts of the full Patrologia Graeca and Patrologia Latina freely available.

Help can consist of proofreading, additional tagging, and checking the volume/column references to the actual PG.

In particular, we would welcome seeing this data converted into a dynamic index to online copies of the PG in Archive.org, the HathiTrust, Google Books, or Europeana.

For now, we make the working XML metadata document available on an as-is basis.

They’ve been attacking the OCR in an interesting way:

Nick White … trained and ran the Tesseract OCR engine and Bruce Robertson [ran] … the OCRopus OCR engine on scans of multiple copies of each volume of the Patrologia Graeca.

The resulting OCR [outputs] contain … a very very high percentage of the correct readings [allowing] very useful searching, as well as text mining…

This is all very well; but of course you need to be able to label each text, so that you can find things. This means indexing the texts and tagging them. There is already an index, created by Cavallera in 1912. So…

To support this larger effort, we are working on Metadata for the collection.

We have OCRd and begun editing the core index at columns 13-114 of Cavallera’s 1912 index to the PG ([link] here).

A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection here.
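
For anyone tempted to help with checking the volume/column references, a transcription of this kind could in principle be loaded into a small lookup table. The following is only a rough sketch in Python, not the project's own tooling. The element and attribute names (`item`, `name`, `title`, and a `ref` carrying `n`, `from` and `to`) and the sample entries are invented for illustration; the layout of the real file may well be quite different.

```python
import xml.etree.ElementTree as ET

# TEI documents normally live in this XML namespace.
NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample data: placeholder authors, works and columns,
# NOT taken from Cavallera's index or from the Perseus file.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><list>
    <item><name>Author A</name><title>Work One</title>
          <ref n="1" from="10" to="20"/></item>
    <item><name>Author A</name><title>Work Two</title>
          <ref n="1" from="21" to="40"/></item>
    <item><name>Author B</name><title>Work Three</title>
          <ref n="2" from="5" to="30"/></item>
  </list></body></text>
</TEI>"""


def load_entries(xml_text):
    """Read each index entry into a dict with author, title, volume, columns."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iterfind(".//tei:item", NS):
        ref = item.find("tei:ref", NS)
        if ref is None:
            continue  # skip entries with no volume/column reference
        entries.append({
            "author": item.findtext("tei:name", default="", namespaces=NS),
            "title": item.findtext("tei:title", default="", namespaces=NS),
            "volume": int(ref.get("n")),
            "col_from": int(ref.get("from")),
            "col_to": int(ref.get("to")),
        })
    return entries


def works_at(entries, volume, column):
    """Return the titles whose column range in a volume covers the column."""
    return [e["title"] for e in entries
            if e["volume"] == volume and e["col_from"] <= column <= e["col_to"]]


entries = load_entries(SAMPLE)
print(works_at(entries, 1, 25))
```

A list like `entries` is also the sort of thing from which links into the online scans might be generated, although how columns map to page images in any particular scan would still have to be worked out by hand.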

I must confess a small bit of pride here: for I had long forgotten that I uploaded that PDF of Cavallera to the web. But this is the beauty of the web – each contribution makes another contribution possible.