Monday, April 2, 2007

Feature: Regularization

One of the many editorial decisions that must be made while transcribing a manuscript is whether or not to preserve the document's original spelling and punctuation. Happily, TEI has a mechanism for preserving both versions while typing the transcript, so the choice of which one to display is delegated to the reader/printer. Unhappily, the eierlegende Wollmilchsau approach of TEI means their mechanism is pretty hokey:

<reg> (regularization) contains a reading which has been regularized or normalized in some sense.

<orig> (original form) contains a reading which is marked as following the original, rather than being normalized or corrected.

<choice> groups a number of alternative encodings for the same point in a text.

The reason they've made <reg> and <orig> freestanding elements is that they want to be able to show a word as having been corrected without providing an alternative, the same way that one uses sic. This is perfectly reasonable, though I do not think it applies to my application. Less defensible is their choice of <choice> to enclose the <orig>/<reg> pair. <choice> is also used elsewhere, for instance to group variant readings marked with the <unclear> tag. As a result, any XSL transform attempting to normalize (or originalize) a TEI-encoded document is stuck peeking within every <choice> element it encounters to search for the <reg>/<orig> pair.
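To make the "peeking" concrete, here is a rough sketch in Python rather than XSL, using only the standard library. The sample sentence, the function name resolve_choices, and the un-namespaced tags are all my own assumptions for illustration (a real TEI file may well put its elements in a namespace), and nested <choice> elements are not handled.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample: one orig/reg pair, plus a <choice> that only
# groups two <unclear> readings and must be left alone.
SAMPLE = (
    "<p>He <choice><orig>recd</orig><reg>received</reg></choice> the "
    "<choice><unclear>letter</unclear><unclear>ledger</unclear></choice> today.</p>"
)

def resolve_choices(root, keep="reg"):
    """Return a copy of root in which each <choice> holding an orig/reg
    pair is replaced by the text of its `keep` child ("reg" or "orig")."""
    root = copy.deepcopy(root)  # leave the caller's tree untouched
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag != "choice":
                continue
            # The peek: only a <choice> with both children is a regularization.
            orig, reg = child.find("orig"), child.find("reg")
            if orig is None or reg is None:
                continue
            chosen = reg if keep == "reg" else orig
            text = "".join(chosen.itertext()) + (child.tail or "")
            # Splice the chosen text into whatever precedes the <choice>.
            index = list(parent).index(child)
            if index == 0:
                parent.text = (parent.text or "") + text
            else:
                previous = parent[index - 1]
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root

root = ET.fromstring(SAMPLE)
print(ET.tostring(resolve_choices(root, "reg"), encoding="unicode"))
print(ET.tostring(resolve_choices(root, "orig"), encoding="unicode"))
```

Note that the test on the children is the whole problem: the tag name <choice> alone tells the transform nothing, so every one of them has to be opened up and inspected.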

Since my transcription source will have to use a different, per-page DTD, I'll probably create an <irreg> tag to use instead of <choice> here.