3. "Don't Proliferate; Transliterate!"

1. Spoilsports

Begotten as it is of compromise and negotiation, Unicode has difficulty establishing rules for which characters are platonic enough ideals to warrant inclusion, and which are not. We've seen, for example, that Coptic was conflated with Greek, and now is not. The more general issue is what should be counted as a distinct character, and what should be counted only as a variant of a character, to be differentiated through markup alone. (For instance, the difference between a single-storey and a double-storey a.) This issue is clear-cut in well-established modern scripts, which have a history of standardisation and a well-defined repertoire.

In historical scripts, and particularly in poorly understood, partly deciphered or undeciphered scripts, this issue is much murkier. It is so murky, in fact, that specialists have concluded it is pointless to even attempt to derive a repertoire of codepoints for inclusion in Unicode:

For poorly understood scripts, we don't know which characters are distinct, and which are merely variants of each other.

Since Unicode is a standard For The Ages, we cannot decide on one repertoire of codepoints today, and shift to a smaller or bigger repertoire when we know more: Unicode needs to get things right from the get-go, or we will be stuck with errors in perpetuity.

But there's more. When we know enough of a script to publish texts in it, we form a standardised repertoire of graphemes to put such texts in. In doing so, we generalise over whatever variation the script may have exhibited through time and space. Fraktur a and Celtic a are shaped quite differently, as are the a written in Roman script in 400 AD and in 900 AD; but we represent all of these as U+0061 Latin Small Letter A.
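To put that concretely: the Fraktur a of running text is not a separate character at all, merely U+0061 in a Fraktur font. The one Fraktur a that did get its own codepoint, the mathematical symbol U+1D51E, is a compatibility character, and even it folds back to plain a under NFKC normalisation. A minimal Python sketch:

```python
import unicodedata

# The mathematical Fraktur a is a compatibility character for use in
# formulas; Fraktur running text is still encoded as plain U+0061.
fraktur_a = "\U0001D51E"
print(unicodedata.name(fraktur_a))               # MATHEMATICAL FRAKTUR SMALL A
print(unicodedata.normalize("NFKC", fraktur_a))  # a
print(f"U+{ord('a'):04X}")                       # U+0061
```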

Now, scholars understand enough about Egyptian Hieroglyphics,
Sumerian Cuneiform, and Ogham to be able to publish
texts in them. A photograph of the stone or tablet
obviously does not count as publication: the scholar
needs to reduce the glyphs in the original to a standardised
set of graphemes, fix any errors, fill in gaps and
so forth. So in most instances, you won't care whether the Ogham glyph for "b" was at a 65° or 70° angle; you just want to know that it was a "b". (And if you really want to know, you can always refer to the photo.)

The catch is, the standardised repertoire of glyphs scholars use to publish ancient texts is typically not a form of the script itself: it's Roman transliteration. (But see proviso below.) Scholars working on cuneiform don't take the Assyrian variant and the Babylonian variant of a sign, and decide which of the two to use consistently in their publications. [Glyphs courtesy of John Heise's ultracool online course on Akkadian.] That's because scholars working on Akkadian don't publish anything in cuneiform: they publish their texts in Roman transliteration, and both variants are normalised to the single Roman sequence ni. If you want to know what the wedges look like, you look at the clay; if you want to know what the wedges say, you read the transliteration.

This means that the scholars who work on historical
scripts don't actually use those scripts in anything
but the initial deciphering. Since they don't publish
in Akkadian cuneiform, they have no earthly need
for a standardised Akkadian cuneiform script: their
standardised script is Latin. And since Unicode is
concerned with the platonic ideal and not the minute
variant, it will not find a repertoire of cuneiform
platonic ideals, but only of Latin transliterated
ideals—so ideal, one might say, they've even been
stripped of their cuneiformicity.

And if the scholars don't need a standardised Akkadian
cuneiform or Egyptian hieroglyphic, the only people
who do are those who think the scripts are K00l,
and want to use the scripts for recreational purposes.
One might argue they are also used in teaching people
about the scripts themselves, as distinct from the
Akkadian or Old Egyptian languages; but when a textbook
on Akkadian says "the Babylonian equivalent of is ", it's not clear that we're actually dealing with
text as distinct from illustrations. If the reading
exercise in such a text features a normalised sequence
of cuneiform glyphs, rather than a photo of the stone,
then you might argue this really does constitute
cuneiform used as text; but that kind of use is fairly
marginal.

The preceding does not do justice to Carl-Martin Bunz's
patient exposition Encoding Scripts from the Past,
which constituted Unicode Technical Note #3. Nor
does it really convey the tension that must have
given rise to Bunz writing his note: reading between
the lines (and the standard disclaimers apply), it
is apparent that Unicode implementers, who would
dearly love to implement hieroglyphics and cuneiform,
think of the academics as spoilsports who are getting
in the way of their scriptal delights—and have let
them know it. (And I do not trivialise that kind of delight: if Daniels & Bright's The World's Writing Systems does not make your heart go aflutter, you have little business working on Unicode.) But the bottom line is, standardisation for such scripts is hard, and the people who would do the standardisation don't need it. Without the support of the people who work with the scripts for a living, any attempt at such standardisation is ill-starred, and likely to wait for feedback indefinitely.

2. Epichorica

What has this to do with Greek? The Greek
repertoire of characters is well-defined and understood,
and it's only very rarely that Greek is published
in anything but Greek. Yet Greek script, like any
other, went through a period of flux, particularly
in its first few centuries; the repertoire wasn't
always as established as it is now, and there was
much greater variation in the shape of characters.
In later times, that variation becomes the province
of palaeography: scribes from different times and places write letters
slightly differently (or occasionally quite differently),
and the editor's job is to reduce those characters to their modern standard forms, and to disentangle their ligatured combinations back into separate letters.

Ligatures, as we have seen, are not something Unicode is concerned with; published texts rarely indicate that clusters of letters used to be joined and jumbled together, and if they do, that properly becomes an issue of markup rather than of the underlying platonic forms.

But for the first few centuries of Greek script, there
was huge variation from place to place as to the
shape of letters, and their phonetic value. Each
city had its own epichoric alphabet. Epichoric is Greek for 'local' (ἐπιχώριος), and the fact that
epigraphers call local alphabets epichoric instead
of local is the kind of turf practice you might expect
from the industry.

If you didn't know what epichoric meant, you're not
alone: a scribe of the mediaeval Greek poem that
I've recently coauthored a book on managed to mangle the phrase "epichoric proverb", referring to a pig, into ἐπιχοίριος—pig-ological.

At any rate, the variation was prodigious—some epsilons looking like betas, Η being either /h/ or /ɛː/, at least two different letters for /s/—and epigraphers wouldn't be doing anyone any favours by perpetuating this confusion in print. The normalised inventory of Greek letters is the standard Greek script as we know it; and that's what epigraphers reduce epichoric scripts down to. It is this reduction which is within the scope of Unicode: if epigraphers do anything with normalised Greek script that Unicode cannot cope with, then there is a case for expanding it. But there is no case for Unicode incorporating every epichoric variant of a letter, because that's not what anyone publishes or expects a font to contain.

So for instance the glyph that was to become Ψ stood
for /ps/ in Ionia, /kh/ in Euboea, and /ks/ in Crete.
But that is the worry of the epigrapher, not the
computer programmer; because any normalised text,
which would be published in Unicode, would have Ψ
standing only for /ps/. Likewise, many scripts had
a crooked rather than a straight line for their iota;
but this does not mean a normalised text will represent
iota as anything but Ι. The platonic forms Unicode
bases itself on are graphemes, not phonemes: Unicode
is not intended to do your phonological analysis
for you, and if your script doesn't map to the phonology
as elegantly as it might have, that's not for Unicode
to fix. But the convention in Greek epigraphy is
to take care of that mapping, and only let transliterations
which do follow Greek phonology as we know it see
the light of day. The only time you will ever see
funny uses of psi, or crooked iotas, or epsilons
that look like betas, is in histories of the Greek
script (and even there as illustrations rather than
text), and in depictions of inscriptions.

3. Epigraphers vs. Linguists

Almost. Epigraphers do normalise the shape of their
letters to the modern standard; and they do accent
and punctuate their texts as would be normal—including
lower case and diacritics, both of which were unknown
at the time of the inscriptions. However, they do
not disrupt the graphemic system of the inscription,
adapting it to the modern norm:

If the inscription distinguishes between two letters,
presumably as different phones, but normal Greek
script does not, because they were the same phoneme
(and the writers of the inscription just hadn't
worked that out yet), the epigrapher preserves
the distinction. So epigraphers keep koppa in their
texts, though within a couple of centuries the
Greeks had worked out that koppa [q] and kappa
[k] are the same phoneme, and dropped the former.

If the inscription conflates two phonemes which the
normal script distinguishes, the conflation is
preserved in the publication, with the distinction
made by diacritics. So epigraphers follow their
sources in conflating eta and epsilon as epsilon;
they distinguish them by adding a macron to epsilon
where normal Greek would write eta.

If the inscription has a separate letter for what the
normal script treats as a diacritic, the epigraphers
also use a separate letter. This is the curious
fate of heta.

If the inscription uses a digraph where the normal
script uses a single letter, the epigraphers preserve
the digraph. So texts from Crete will have kappa
heta instead of the single grapheme chi for /kh/.
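These conventions can now be spelled out in Unicode itself; a minimal Python sketch, on the assumption that archaic koppa (U+03D9), heta (U+0371) and the combining macron (U+0304) are acceptable stand-ins for the epigraphers' letters:

```python
import unicodedata

# One sample per convention, using characters Unicode now provides:
# archaic koppa (U+03D9), heta (U+0371), combining macron (U+0304).
samples = {
    "koppa kept":       "ϙορινθος",  # koppa before /o/, as in Corinth's own name
    "epsilon for eta":  "ε\u0304",   # epsilon + macron where normal Greek has η
    "heta as a letter": "ͱιαρος",    # Jeffery's ⊢ιαρος, with encoded heta
    "digraph for chi":  "κͱ",        # Cretan /kh/ as kappa + heta
}
for convention, text in samples.items():
    print(f"{convention:18} {text:10} starts with {unicodedata.name(text[0])}")
```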

So the text an epigrapher publishes does not look the
same as it would if the text had been preserved in
manuscript, and published in normal orthography—even
when the dialect of the inscription in question is Attic,
whose phonology is what underlies the standard script.
To illustrate, here is an inscription from the Acropolis
in Athens, probably dating to 566 BC, as published
by an epigrapher, and as it would appear in normal
script:

This race-track was made by ... Crates, Thrasycles, Aristodicus, Bryson, Antenor ..., Supervisors of Religious Rites [hieropoioi], who first established the race [in the Panathenaea Games] in honour of the Gleaming-Eyed Maiden [Athena].

By contrast, someone writing a linguistic or historical
account using those texts will be likelier to use
a normalised rendering, and a standard repertoire
of symbols. There is a revealing contrast between
the index of Jeffery (1990), a text on the early
history of the Greek alphabet, and Buck (1955), a
dialectology manual. Jeffery is studying the history of Greek letters in detail, so it is natural for her to follow her sources closely in transliteration, and to sort different Ancient graphemes as distinct letters. Buck, on the other hand, while a much more comprehensive and linguistically informed treatment of its subject matter than his predecessors, still takes Attic Greek as the departure point in his index, and is only concerned with phonology: he eliminates archaic letters where they were not emic, and in his index ignores even distinct dialectal phonemes which Attic dropped (digamma), sorting everything as if it were Attic. For his purposes, this makes sense: if you want to know about the old form of "king", ϝάναξ (wánax), you expect to find it listed under its classical form, ἄναξ (ánax). We join the respective indices for ech- through to thar-.

Buck:
ε: ἐχθός, ἐψαφίττατο, ἕωκα, ϝέχω
ζ: ζά, ζᾶ, ζαμιοργία, ζαν, ζέλλω, ζέρεθρον, Ζῆνα, ζίκαια, ζίφυιος, Ζόνυσσος, ζτε̄ραῖον, ζώω
η: ἠ, ἐ̄, ἦ, ἤγραμμαι, ϝῆμα, ἦμεν, ἤμην, ἠμί, hε̄μίδιμμνον, ἠμίνα, hε̄μιρρήνιον, ἥμισσον, ἡμίτεια, ἠμιτυέκτο̄, ἥμυσυ, ἤν, ἦν, ἦναι, ἤνατος, ἦνεικα, ἦνται, ϝηρόντων, ἦς, ᾗς, ἥσσαντο, ἤστω, ἦται, ἤτω, ηὑτῶν, ἥχοι, ἠώς
θ: θάλαθθα, θάλαττα, Θαρῆς, θαρρέω

Jeffery:
ε: εχε̣[ε], εχινος
ϝ: ϝαναξ, ϝεϝρε̄μενα, ϝειδο̄ς, ϝεκαβολο̄ι, ϝεξ, ϝεργα, ϝετεα, ϝ⊢εδιεστας, ϝικατι
ζ: ζο̄ος
η: ημεας, ε̄νικε, ε̄ριον
⊢: ⊢αγεν, [⊢αιρ]ε̄σει, ⊢αλιι̯ος, ⊢εζατο, ⊢ενατον, ⊢ε̄μιτριτον, ⊢ε̄ρο̄ος, ⊢ιαρος, ⊢ιατρο, ⊢ικατι, ⊢ιμερ[ος], ⊢ιπ(π)ι[ϙο], ⊢ιπ(π)οδρομο, ⊢ιροποιοι, ⊢οδο̄ι, ⊢οπλα, ⊢ορος, ⊢ορϙος, ⊢υιος, ⊢[υπαρχοις], ⊢υποδ[εξαι?], ⊢υπυ
θ: θακος, [θαν]α̣τ̣οιο, θανο̄ν

The alphabetical order Jeffery uses is:

αβγδεϝζη⊢θικλμνξοπ[Ϻ]ϙρστυφχψω[ϠᛇИE]
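An index in that order cannot be sorted with any off-the-shelf Greek collation; the sort key has to be built by hand. A small Python sketch, substituting encoded letters for Jeffery's signs (heta U+0371 for ⊢, san U+03FB for the bracketed Ϻ) and ignoring the bracketed letters at the end of the row:

```python
# Jeffery's letter order, with encoded stand-ins: ͱ for ⊢, ϻ for san.
ORDER = "αβγδεϝζηͱθικλμνξοπϻϙρστυφχψω"
RANK = {c: i for i, c in enumerate(ORDER)}

def jeffery_key(entry: str) -> list[int]:
    # Skip anything not in the alphabet: diacritics, brackets, dots.
    return [RANK[c] for c in entry.lower() if c in RANK]

entries = ["θακος", "ͱιαρος", "ζο̄ος", "ϝαναξ", "εχινος"]
print(sorted(entries, key=jeffery_key))
# ['εχινος', 'ϝαναξ', 'ζο̄ος', 'ͱιαρος', 'θακος']
```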

4. Corinthian EI

That said, epigraphers do not leave the alphabet as
open-ended as they might. Epigraphers will use digamma,
heta, and koppa, but are reluctant to use san, and
tend not to use the extra locally coined letters
at all; if they feel they must represent the distinction
between the normal and the innovated letter, they
tend to use devices like diacritics instead.

To illustrate, consider the case of the Corinthian epsilons. Classical Greek had three e-like sounds. Short /e/ is what is represented in the standard script by epsilon. Long /ɛː/ is what is represented by eta, but was usually written just as epsilon in epichoric alphabets; eta was an Ionic innovation. A second long vowel, /eː/, arose latterly from two sources: monophthongisation of /ei/ ([hiéreia] > [hiéreːa] "priestess"—cf. [hiereús] "priest"), and lengthening of /e/ ([ksénwos] > Ionic [kséːnos] "stranger").

The missing digammas of both xenwos and wanax (see above) feature prominently in a fanfic piece by Kevin Wald on Xenwa, er, Xena, Warrior Wprincess.

Because this sound was at least sometimes
originally a diphthong, it is written in standard
script as the diphthong ει: ἱέρεια, ξεῖνος. (The former kind of ει is called a "genuine" diphthong,
while the latter—which was never a diphthong at all—is
termed "spurious" or "false".)

In Corinth, the monophthongisation took place early;
but something strange then happened to the epichoric
alphabet (Jeffery 1990:114-115):

Phonetic value   Standard Greek   Corinth
/b/              Β                ⑀
/e/              Ε                Β
/ɛː/             Η                Β
/eː/             ΕΙ               Ε

Corinth conflated /e/ and /ɛː/, as was pretty standard
in that part of Greece. But it used a completely
new glyph for beta, and the beta glyph for the epsilon.
As for the epsilon glyph, it was put to use for the
new monophthong. (Corcyra [Corfu], which was a Corinthian
colony, still used ΒΙ = ΕΙ for the monophthong /eː/.)
So Corinth divided up vowel space differently to
standard Greek. (Tiryns, down the road, divided it
differently again, using Β for /ɛː/ but Ε for /e/,
and ΕΙ for /eː/—pretty much anticipating the Milesian
alphabet division.)

And of course, this makes an unholy mess of transliteration.
No one wants to transliterate Corinthian /b/ and /e/
~ /ɛː/ as anything but β and ε. If you are intrigued
by the Corinthian innovation, you might want to add
a new letter for /eː/ in your transliteration; the
problem is, ε is already taken for /e/. So how do
you transliterate Corinthian?

Buck's device is to use a capital E for the new letter. Obviously there are going to be problems with using that in the general case. Buck needs to differentiate the two e's because he's making a point about Corinthian phonology. But you couldn't keep using a capital E to distinguish it from a lowercase e; the minute you talk about someone whose name starts with an epsilon, you're done for.

Ironically Jeffery (1990:404), so meticulous with her
koppas and hetas, takes the opposite approach in
her transliteration—and the one most epigraphers
would take. The Corinthian Ε glyph stands for what
was subsequently written as ει—and that's exactly
how she transcribes it:

Given that the Corinthians didn't have an /ei/ distinct from /eː/, the only thing ει can mean in Corinth is /eː/. (If we ignore the fact that, as Jeffery 1990:115 reports, the Corinthians did occasionally get confused, and wrote ΒΙ for /ei/ and Ε for /eː/.) And Jeffery doesn't have to deal with the problem of how to write one epsilon as distinct from another. (If someone does feel the urge to use diacritics, the traditional closed-vowel underdot will not do, since it's already being used in epigraphy for damaged letters. I would propose the IPA raising diacritic: ε̝.)
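That proposal needs no new codepoint: the raising diacritic is already encoded as U+031D COMBINING UP TACK BELOW, so ε̝ is just a two-character combining sequence. A quick check in Python:

```python
import unicodedata

raised_e = "ε\u031d"  # epsilon + IPA raising diacritic
print(raised_e, [unicodedata.name(c) for c in raised_e])
# ε̝ ['GREEK SMALL LETTER EPSILON', 'COMBINING UP TACK BELOW']
```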

And if even epigraphers are comfortable ditching the
extra letter for expediency, you can bet your bottom
dollar Greek Letter Corinthian Ei is not going to be proposed for inclusion in Unicode
any time soon.

Or so I believed in 2003. EI is now included as Raised E in a proposal I have submitted to the UTC, L2/05-003 Proposal to add Greek epigraphical letters (see also L2/04-389). Raised E conflates the Corinthian glyph with the Boeotian use of ⊢ to represent a short raised /e/ before a vowel (Thespiae, ca. 424 BC: Buck 1955:22, Jeffery 1990:89). The problems with the glyphs are still there; but enough epigraphers told me in feedback that this deserves a distinct codepoint after all that the proposal was worth making, even if the glyphs are not ready for prime time.

5. What to transliterate into

A little excursus I owe to discussion with John Cowan
is the issue of target transliteration script.
It's all very well to say that scholars choose
not to proliferate glyphs in their source material,
but instead choose a normalised script to transliterate
into. But what determines scholars' choice of this
script? Because it's not always Roman (let alone
IPA).

The choice of script to transliterate-not-proliferate
into for
Western scholarship was dictated by two principles:
patrimony and
accessibility. If you were a Slavonicist writing
for other Slavonicists,
or an Arabist writing for other Arabists, you
would be expected to leave
your Cyrillic and Arabic (or Syriac or Hebrew)
untransliterated: that
was the patrimony you were discussing, after all.
Your target audience
would be sure to already know Cyrillic and Arabic.
Furthermore, if your script had a significant contemporary
constituency
-- significant enough that a non-trivial typographical
tradition could
develop -- then the script was deemed accessible
to your peers, even if
that use was limited to the liturgy. A theologian
or a linguist could
quote Syriac or Coptic in those scripts, because
those scripts continued
in liturgical use -- so they were known to printers,
and to specialists
outside the field of palaeography, who could learn
their Coptic and
Syriac from printed books rather than manuscripts.

If on the other hand you were discussing material
in a script which did
not make it to print, but was present only in the
original sources
(accessible to the scholarly republic only with
difficulty), then it
was your business to transliterate it out of the
original script,
into a script you deemed accessible --- and which
corresponded to
your notion of the script's patrimony. Gothic was
deemed part of the Germanic patrimony; so it was
transliterated
out of the
long extinct and unfamiliar, Greek-like Gothic
script, into the same
alphabet used for Old English and Old Norse (with
an addition or two).
Slavicists rejected Glagolitic in favour of Cyrillic,
as Glagolitic was
not regarded as accessible enough, being restricted
in printed use to
a corner of Dalmatia.

In the same way, Semiticists treated Phoenician
and its ilk as part
of the Hebrew patrimony, and so transcribed
it into Hebrew (as a furious thread on Unicode
List through much of 2004 brought forth; Semiticists
do not see any point in encoding Phoenician separately
from Hebrew because they are isomorphic). It certainly
helped that Hebrew persisted in liturgical
use and had a print
tradition, so it was accessible. But the choice
of Hebrew rather than
Latin transliteration reflected an ideological
choice, as well as a
practicality: Semiticists approached Semitic via
texts published in
modern Hebrew script, so modern Hebrew was the
natural target script, which the other variants
of Phoenician were fully isomorphic
to. There was of course no surviving Moabite constituency
to protest
transliteration into an enemy script.

If you were writing for an audience of general
linguists, however,
your choice of script was constrained to what scripts
you could expect
a generalist to be familiar with. In the 19th century,
that was really
only Greek and Latin --- and maybe Hebrew. (Behaghel's
Historical
Syntax of German, for example, written in the 1920s,
numbers its
sections hierarchically with Roman numerals, Arabic
numerals, Greek
letters, and Hebrew letters --- so he expected
his Germanist readers
could tell a Bet from a Gimel. And at that font
size, they'd be doing
better than I would.) Handbooks of Indo-European
never present Sanskrit
or Armenian in their original scripts. You would,
however, expect that a
generalist could deal with the different traditions
of Latin script use,
and would leave Gothic, Irish, French, etc. in
their conventional Latin
transliteration or orthography. And to this day,
no Indo-Europeanist
will cite Greek in anything but Greek script.

In the late 20th century, the abandonment of Classical
education means
that you cannot expect a general linguist to have
any fluency in reading
Greek, and Greek is universally transliterated
in generalist contexts
(outside of traditional historical linguistics).
Similarly, there will
be some supplementing of Latin orthography with
IPA, although it does
not appear to have supplanted orthographic quotation
of Latin text.

Even in specialist usage, the use of Greek script
has disappeared in
non-Greek contexts: those languages are no longer
deemed part of the
Greek patrimony -- possibly as a measure of political
correctness,
but more likely as an acknowledgement of the trickiness
in pinning
down the phonetics of the letters (a primary concern
of decipherment,
which the use of Greek script, with its reconstructed
phonetic values,
would not really address). 19th century editions
of Phrygian or Lycian
would think nothing of publishing their texts in
the normalised Greek
orthography. In the 20th century, the "transliterate don't proliferate" approach used Latin, not Greek transliteration,
as its target --- though
this was a Latin script used by Hellenists, so
that Greek letters were
readily called on (e.g. tau or beta, as opposed
to any number of IPA
renderings: recall that philologists don't do IPA).

Outside the quite distinct Mycenaean script tradition,
this has not
happened with texts in Greek. The Pamphylian
script is pretty close to
Carian and Lycian, but because Pamphylian itself
is a dialect of Greek
(however deviant), there has never been any hesitation
in publishing it
in normalised Greek script (with a couple of
additions).