I received a question about transcribing and proofreading texts. Personally, I am currently proofreading the Pericla Navarchi Magonis by Arcadius Avellanus. But it made me think about the subject as such.

First of all, why create a transcription in the first place? Isn't it enough to simply scan the books and make them available for download? My answer to that is a resounding No! PDF versions do not lend themselves to reading on anything but big-screen devices (mostly desktop PCs or notebooks). Even screens as big as that of Amazon's Kindle are far too small, and mobile phones can be ignored in this context due to their minute size. I use an e-reader with a 9.7" screen, but even that is not always enough (partly due to resolution, partly due to multi-layer PDFs causing trouble). A transcribed version, on the contrary, can be read on all devices without needing any special programme or application.

Another reason in favour of transcription is that a well-done e-book can be searched. I use a small JavaScript to do so, but one can use Google as well. For example, head over to Google.com and type in the following search expression:

This will give you a list of the pages which include either "ostium" or "ianuam" (the "/" acts as a logical OR). Of course, this search is limited because Latin is a highly inflected language (which is why I use a JavaScript that allows me at least the *-wildcard and a NOT operator). But it is still useful. In my opinion, being able to search through an e-text is a valuable tool in language acquisition because you can look, very easily, at how a word is actually used. Some PDFs at Archive.org have a text version embedded, but that is only an OCR'd version which in most cases borders on the horrible.

Transcribing (my past projects I did from scratch; the current one is based on an OCR'd text with quite a few errors left) and proofreading is a rather time-consuming business when done completely on one's own. It also requires quite a bit of concentration, because our brain is very good at ignoring errors and supplying the correct word without us being conscious of it.

Should you decide to start a transcribing and proofreading project on your own, I would suggest that you begin with a smaller text and "upgrade" to a larger challenge once you are comfortable with that.

Ideally, however, the text to be proofread would already be hosted by Distributed Proofreaders, which allows proofreading by many people and creates e-texts for Project Gutenberg. That way, everyone could invest exactly the amount of time and effort he/she is comfortable with (even if it is only proofreading a single page).

This is not my way to do it (I favour the full-blown lone-wolf frontal-assault kind of approach), but it certainly would allow many people to contribute their time and effort at exactly the rate they want to. In addition, the proofreading project would not depend upon specific proofreaders (the manager would be essential, at least in the first stages). Since the DP site is well established, one need not fear that the project suddenly goes down the drain because someone loses interest in maintaining the site, etc.

Should any of you want to contribute by transcribing and proofreading, there is more than one way to do so. Important: the text needs to be in the public domain, and it is my personal belief that Project Gutenberg is the ideal haven for the resulting texts.

When deciding upon the text to be transcribed and proofread, I recommend thinking about it several times. If you are doing this on your own, you will spend quite a bit of time with it, so it ought to motivate you. And if you are doing it in a distributed fashion, one should still be aware of the limited proofreading capacity of people able to read Latin: even Distributed Proofreaders will not allow many Latin e-books to be created. Careful consideration is therefore required. I am not going into more detail here because tastes differ. I do, however, say that the Latin community does not need more Latin grammars. Personally, I believe that the choice of grammar is of minor consequence; far more important is to read as much as possible. What we need is e-texts of Latin books. We do not particularly need texts from classical authors either, as these can be found at the Latin Library (although proofread versions for Gutenberg.org wouldn't hurt).

That is why I chose to transcribe the Mysterium Arcae Boulé (and am now transcribing the Pericla Navarchi Magonis). That is Latin which is alive and kicking (although some may criticize that the Latin used is not classical, but to each his/her own).

Thanks, Carolus, for sharing your experience. Question on fidelity to the source text: if I were to transcribe, say, a Neo-Latin student text, I would want to change the spelling of some words to what students are already used to, as well as perhaps punctuation and formatting (e.g. paragraph breaks). But for the Distributed Proofreaders, does all this have to be exactly as it is in the source?

Also, I'm curious about your search Javascript. What kind of a search does it perform?

I will answer your question about the JavaScript I use first. You can download it from my homepage (NOTE: TEMPORARY FILE only. It will be removed in about a month).

Its abilities are rather restricted, I am afraid, because I am not a programmer; I am glad that I was able to hack this together. A short description:

It is an HTML page using JavaScript to search the text of the Mysterium Arcae Boulé, which is stored as an array within the HTML page itself. So this JavaScript is NOT a web spider which can search a whole collection of pages (and therefore texts). That ought to be possible, but would require someone who actually knows how to write "proper" programmes. This is the first constraint: it only works on the text in the array. But changing that should be straightforward enough.
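The basic idea can be sketched in a few lines. This is my own illustration, not the actual script; the array contents and the function name are made up:

```javascript
// Sketch: the text is stored as an array of paragraph strings,
// and a search simply walks every paragraph in that array.
const paragraphs = [
  "Ianua clausa erat.",
  "Ostium aperuit et ianuam clausit.",
  "Nihil hic de foribus."
];

// Return every paragraph containing the exact word (case-insensitive).
// "ianua" will therefore not find "ianuam", just as described above.
function searchExact(word) {
  const target = word.toLowerCase();
  return paragraphs.filter(p =>
    p.toLowerCase().split(/[^a-z]+/).includes(target)
  );
}
```

Replacing the hard-coded array with one generated from another text is the "straightforward" change mentioned above.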

The search expressions allowed:

ianua: looks for the exact word "ianua". This does not find "ianuam", etc.

ianu*: looks for any word beginning with "ianu" (including "ianu").

ianu*/osti*: the "/" operator strings several words together (including ones using the "*" operator). This example lists all instances of EITHER "ianu*" OR "osti*", so it is a simple logical OR (not an XOR!).

labor*/!labora: This lists all instances of "labor*" (to find occurrences of "labor, -oris m.") but excludes all instances which contain "labora*" (to avoid listing occurrences of the verb "laborare"). So the "!" operator is a logical NOT. This is far from perfect, however, because it would exclude a paragraph in which both a "labor, -oris" instance and a "laborare" instance occurred. Which leads me to another restriction of the current version (which dates from 23 August 2014):
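The operator semantics just listed might be sketched like this (an illustration of my own, not the actual script's code; the function name is made up):

```javascript
// "/" separates alternative terms (OR), "!" negates a term (NOT),
// a trailing "*" matches any word beginning with the given stem.
function matchesExpression(paragraph, expression) {
  const words = paragraph.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  const terms = expression.toLowerCase().split("/");

  const matchTerm = t =>
    t.endsWith("*")
      ? words.some(w => w.startsWith(t.slice(0, -1)))
      : words.includes(t);

  const positives = terms.filter(t => !t.startsWith("!"));
  const negatives = terms.filter(t => t.startsWith("!")).map(t => t.slice(1));

  // A paragraph matches if any positive term matches
  // and no negative term does -- which reproduces the flaw
  // described above: one "laborare" hides a wanted "labor".
  return positives.some(matchTerm) && !negatives.some(matchTerm);
}
```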

This script works on units of paragraphs instead of single sentences, which can be annoying when the paragraphs are long, because each paragraph is checked as one unit. This may prove problematic especially in connection with the NOT operator: the longer the paragraph, the greater the likelihood that both a wanted expression and an excluded expression ("!...") are contained in it. Automatically splitting a paragraph into single sentences is not quite as straightforward as it might seem, especially when direct speech is mixed into a normal sentence so that a period, question mark or exclamation mark appears in the middle of the sentence. Should anyone know how, please tell me!
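For illustration, here is a naive splitter (a sketch of my own, not part of the script) that breaks on sentence-final punctuation followed by a capital letter. It handles simple prose, but fails in exactly the direct-speech case just described, e.g. '"Veni. Vidi," inquit.' would be torn apart after "Veni.":

```javascript
// Naive rule: split wherever ".", "?" or "!" is followed by
// whitespace and then a capital letter or an opening quote.
function naiveSplit(paragraph) {
  return paragraph.split(/(?<=[.?!])\s+(?=[A-Z"“])/);
}
```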

So, my script is far from perfect. It is probably not much more than a rough demonstration of what a good script might actually be capable of. That being said, I find it VERY useful for word-hunting and for learning how words are actually used (something not always obvious from dictionaries).

As to your first question, Project Gutenberg recommends sticking to the text as closely as possible (for more, see their Volunteer's FAQ). However, it is possible to create altered versions; they even want a 7-bit text version, which precludes ligatures, for example. My personal opinion is that one should transcribe the text verbatim but include information to enable automatic conversion later on. The original edition of Pericla Navarchi Magonis, for example, uses ligatures (both oe and ae). I don't like these at all, but I mark them as "<ae>", "<Ae>", "<oe>". That way, I can later on create specific versions with and without ligatures. But that is only one way to do it. Some older texts have so many ligatures that reading them is a rather confusing experience. A more readable version (where the ligatures have been "expanded") would certainly be good, but there probably should be the "true" version as the basis. For specific questions I recommend asking the folks at the Distributed Proofreaders forum.
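The marker approach lends itself to exactly this kind of automatic conversion. A sketch (only the "<ae>"-style markers come from my transcription; the function names are made up):

```javascript
// Readable version: expand each marker to plain letters.
function expandLigatures(text) {
  return text.replace(/<(ae|oe|Ae|Oe)>/g, "$1");
}

// "True" version: replace each marker with the actual ligature glyph.
function restoreLigatures(text) {
  const map = { ae: "æ", Ae: "Æ", oe: "œ", Oe: "Œ" };
  return text.replace(/<(ae|oe|Ae|Oe)>/g, (_, g) => map[g]);
}
```

From one verbatim master text, both versions can then be generated at any time.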

Have you heard of regular expressions? See this demo site. I found out about it recently, and it's been useful for certain search functions on texts. I don't think it can match "not" strings like your script can, but I thought I'd mention it in case you haven't heard of it and it's of any use to you.

Best of luck as you finish up Pericla Navarchi Magonis. I'm looking forward to it!

Yes, I have heard of those, mostly in connection with cruel and unusual punishment and the Human Rights Convention.
Some of them are quite useful, though. A list of useful ones is available at the site of the EditPlus editor, but they can be used with any text editor that understands regex.
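For what it's worth, here are two patterns of the kind such lists contain, shown in JavaScript (they behave the same in most regex-capable editors). The negative lookahead "(?!...)" can, in many engines, even approximate a NOT operator at the single-word level:

```javascript
const text = "labor laborum laborare laboris";

// Any word beginning with "labor".
const all = text.match(/\blabor\w*\b/g);

// The same, but the negative lookahead (?!a) excludes "labora...",
// i.e. forms of the verb "laborare".
const noun = text.match(/\blabor(?!a)\w*\b/g);
```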
Bye,

I was under the impression that there is now available software which will automatically scan and digitize printed texts. I believe this is being done with Greek texts, and I imagine it is even simpler for Latin texts. A text that is widely read can then be proofed by readers.

Some people, though, I guess, might learn things from transcribing their own texts.

If the text is not recent and in good condition, OCR produces many errors, and in these cases I have sometimes found transcribing equally productive, and even when slower, more pleasant. But Carolus has probably put OCR to better use than I have.

Markos wrote:I was under the impression that there is now available software which will automatically scan and digitize printed texts. I believe this is being done with Greek texts, and I imagine it is even simpler for Latin texts. A text that is widely read can then be proofed by readers.

Alas, OCR'd versions often contain many errors, even in the case of English texts. As far as I know, these programmes use statistics to better "understand" what kinds of scanning errors are likely and how to deal with them. But these statistics need to be compiled for each language, and there are no relevant modules for Latin, as far as I know. And many scans of older texts are simply too bad (with smudges, etc.) to allow OCR to work at all.

I did use an OCR'd version to automatically transcribe my latest project (Pericla Navarchi Magonis), but it contained quite a few errors. I am not entirely sure whether this approach was better than just typing in the text from scratch (which I did for the Mysterium Arcae Boulé). In this case I OCR'd the text using Adobe Acrobat (after putting together a somewhat decent version, because the one at Archive.org I checked at the time had double pages, wrong pagination, etc.), then I (proof)read the text and corrected those mistakes I noticed (where possible doing a "replace" on the entire text). I am just about finishing this stage. The next stage will be to print out both the original and my (first-stage proofread) transcription and do a second proofreading pass. My experience has taught me that quite a few errors will be found during this second run. Ideally there would be a third proofreading run, but this is the real world, not Fantasia:

Tai T'ung wrote:"Were I to await perfection,
my book would never be finished."

bedwere on the Composition board (thread on catasterisms) provided a link to an OCR'd text: http://heml.mta.ca/lace/sidebysideview2/5112879
The Greek text itself comes out with few errors (I spotted two on the page I looked at; it has apparently been spellchecked), but the apparatus is a total disaster. In the master it is in smaller type and combines Latin and Greek, but it is clearly legible. Surely OCR can do better than this?

Meanwhile, I’d opt for a scan every time. It seems a far simpler way of preserving obscure old publications and providing them with on-line availability. Cheaper, less labor-intensive, and more reliable. With good quality scans the visual clarity doesn’t suffer much. And importable to that thoroughly admirable enterprise the Project Gutenberg.

With an antiquated edition like the one bedwere linked to, the ps.-Apollodoran/Eratosthenean Catasterismi, the time and effort involved in tidying up the OCR’d apparatus would be better spent producing a new edition. Of course that would not be a mechanical process, but who wants to spend all their time proofreading—or transcribing?

As to phil’s question about fidelity to the source text: we can either aim at faithful reproduction, thereby preserving the contemporary orthography, punctuation, etc., or we can modernize or otherwise tamper with the original. Myself, I’d prefer Neo-Latin texts in their proper form.

mwh wrote:bedwere on the Composition board (thread on catasterisms) provided a link to an OCR’d text: http://heml.mta.ca/lace/sidebysideview2/5112879
Meanwhile, I’d opt for a scan every time. It seems a far simpler way of preserving obscure old publications and providing them with on-line availability. Cheaper, less labor-intensive, and more reliable. With good quality scans the visual clarity doesn’t suffer much. And importable to that thoroughly admirable enterprise the Project Gutenberg. [...] but who wants to spend all their time proofreading--or transcribing?

You seem to contradict yourself, mwh. On the one hand, you say that you would opt for a scan (a scanned image without the text embedded) every time. On the other hand, you seem to think that the Project Gutenberg versions are "thoroughly admirable." But the latter are exactly the product of people who spend at least part of their time proofreading--or transcribing. In the case of texts contributed by Distributed Proofreaders (DP), the transcription is mostly done automatically by OCR. But I would not be surprised to learn that quite a few texts do not make it to DP because it is difficult to create a reasonably good OCR'd version.

As for a scan being better, I would disagree, at least as far as "using" a text is concerned (for preservation purposes a scan is certainly the easiest way). It depends on circumstances, so ideally both a scan and a transcribed version should be available. Scans, reliable copies of the original though they may be, are ill suited for reading on devices with small (or even medium-sized) displays. And above all, there is no full-text search capability, which may be fine for "normal" reading but certainly not when dealing with a text on a deeper level or simply "mining" it. To be blunt: a text available as an image is almost "dead", whereas a transcribed version is "alive".

When reading a written text - a physical book - it presents itself at first as an image, but our brain immediately transcribes the image and even does some proofreading on it. What an admirable ability, and something that our computers are still far from achieving reliably. Until then, we humans will have to do the work if we want to "unleash" these texts.
Vale,

Salve Carole,
You are quite right. To me the original book, or an image of it, is more “alive” than a transcription of it, at least when it comes to critical editions, but certainly I take your point about the value of search capability.

I would add that a proper transcription makes it possible to produce new print editions instead of mere facsimiles, for those of us who prefer paper. It makes a huge difference in the quality of the type.