For this to be maximally effective, we need to produce ground truth for a random sample of many different books. So over the weekend, I'll do a new run of all the books in Internet Archive and then post a random sample. (Ideally, we'd also include ones from other sources, such as HathiTrust, but there are intellectual property claims there that I don't want to get enmeshed in.) In the meantime, I'll provide a model of what is needed.

cwconrad wrote:
What's ephemeral and what's permanent is more subjective than we'd like to think. I'd love to see new fragments of Sappho as others might love to look at a scrap of Origen. B-Greek archives of the email list of olden days and of the forum, contain both trinkets and treasures. Sifting them might be a matter of taste: whose?

The more of the language we have from ancient times available, the better. This is true even of sub-literary texts, as we well know from the papyri. Even "bad" stuff from ancient times helps us form a better total picture of the language and its use.

This is a comment mainly directed towards Bruce Robertson, and partially towards Jonathan.

I realise that for improving the quality of the OCR the ideal is a variety of sources, but I am wondering if we might collaborate towards some mutually beneficial goals.

I've been discussing with Geoffrey Steadman, who produces wonderful Creative-Commons reader/commentary editions of classical texts, the process of putting those together, with the intention of beginning to create some of those resources for Patristic texts, which desperately need aids for the average reader who approaches them for the first time. I've made a start on some of this, but my great need is for access to digitised versions of Migne texts that aren't bound up in excessive and probably non-legally-binding terms of use. In any case, I am starting with some letters of Basil of Caesarea, since these are short and manageable.

So, my proposal is this: you do some OCR scanning of specific texts, starting off with letters of a page or two in the Migne, and I will take on the proofing of these texts in particular. You get improved OCR feedback; I get public-domain digitised base texts for my project.

Let me know your thoughts, and if agreeable I will direct some specific pages for OCRing.

Geoffrey Steadman's Creative-Commons reader/commentary editions really are nice, aren't they? I hadn't seen them before. I would love to see similar editions for Migne, or for the GNT or Septuagint, for that matter.

Bruce's Migne output looks rather clean for the most part; I don't think it would take that much editing. He's the guy who can tell you what would be helpful for improving OCR. And Bruce's licenses seem to fit what you need. I don't think anyone is double-keying Migne, and it's valuable text, so this is a great area to have someone focus on.

I take it these letters would be in PG 29-32? I can process them, to the best that our scans and ability allow, in the next couple of weeks. However, for this specific project you also might want to take a look at the Loebs we've already processed, available in this list if you search on 'Basil': http://heml.mta.ca/lace/catalog

We're missing Loeb Basil volume 3. Perhaps we can find someone to scan it, which will produce a result equal to or better than what is there.

I'm not in a country with any Loebs, but I will see if I can convince someone to make a clear scan of book 3.

On the reader project: I'm running letter 361 as a test case, to see what sort of workload is involved in making the Steadman-style text. Then I plan to expand to a selection of 10 letters from Basil, and from there, if the process and labour seem reasonable, my plan is to go on to do a range of significant Patristic texts. One step at a time though! I'll let you know when the sample text of letter 361 is done so you can see what I'm aiming at.