Comments on Public Review Issues

The sections below contain comments received on the open Public Review Issues
as of June 7, 2004, since the previous cumulative document was issued prior to UTC #98 (Feb 5, 2004). Also included are other commentaries received via the
public Reporting Form during this period, especially on Phoenician.

This message concerns Public Review Issue No. 13, Unicode 4.0.1 Beta. It was
supposed to close on 2004-01-27, so this message is late; but I discovered
your site only a few weeks ago, and the file in question only yesterday.
That file is PropertyAliases.txt (or, to be more precise,
PropertyAliases-4.0.1d3b.txt).
You say in beta.html that "The property aliases have also been rearranged into
somewhat more meaningful categories."
I do not want to discuss their meaningfulness, but rather the fact that they
mix and confuse several concepts.

"Numeric", "String" and "Binary" form a classification based on the data type
of the values of these properties.
"Miscellaneous" and "Catalog" describe the type of the properties, not the type
of data of their values.
"Enumerated" simply indicates that the list of possible values is limited and
closed.

In fact:

- all enumerated properties are also either numeric (ccc) or string (all the
others);
- binary properties are also enumerated, with the list of values limited to two
members (Yes/No or True/False);
- catalog and miscellaneous properties are string properties, though
Unicode_Radical_Stroke could also be considered the combination of two
numeric properties.
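The commenter's distinction between the data type of a property's values and the kind of property can be made concrete with a small sketch. This is an illustrative classification in Python, not the actual layout of PropertyAliases.txt; the property names are real, but the two-axis grouping scheme is the one the comment proposes:

```python
# Two orthogonal classifications for Unicode character properties:
# (1) the data type of a property's values, and
# (2) whether the set of possible values is limited and closed (enumerated).
# The groupings below follow the commenter's proposed scheme, not the file.

VALUE_TYPE = {
    "Canonical_Combining_Class": "numeric",  # ccc: enumerated *and* numeric
    "Name": "string",
    "Age": "string",                         # a catalog property holds strings
    "Alphabetic": "binary",                  # binary = enumerated over {Yes, No}
    "General_Category": "string",            # enumerated over string values
}

CLOSED_VALUE_SET = {
    "Canonical_Combining_Class": True,
    "Name": False,                           # open-ended: one name per character
    "Age": True,
    "Alphabetic": True,                      # exactly two members
    "General_Category": True,
}

def describe(prop: str) -> str:
    """Combine both axes into a single description of one property."""
    kind = VALUE_TYPE[prop]
    closed = "enumerated" if CLOSED_VALUE_SET[prop] else "open-ended"
    return f"{prop}: {kind}, {closed}"

for p in VALUE_TYPE:
    print(describe(p))
```

On this view, "Binary" is not a sibling of "Enumerated" but a special case of it, which is exactly the conceptual mixing the comment objects to.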

So I think that the new set of categories is more confused than meaningful. You
already have a better set of categories at your disposal, namely the convenient
grouping of properties according to their usage in UCD.html.
Any kind of group or category can be used, even with two or more levels, but it
should be conceptually accurate and should not mix different concepts at the
same level.

With my best regards.
Francis Boxho

P.S. Sorry for the typos and mistakes, but my mother tongue is French, not
English.

The section on programming language identifiers should not only be moved to this report (from UAX #15), but should also be changed so as not to create the (unrealistic) expectation that tools such as compilers should normalize identifiers or eliminate formatting characters.

We have unfortunately not yet had the time for a full review of this document, but we plan to do such a full review. Here are the comments we have so far:
It would be very helpful if the terminology were more closely aligned with that of other standards bodies. In particular, the term "Character Map" seems to need reconsideration, because nobody actually uses it.
The reference to the W3C Character Model should be updated.
You may be interested in looking at the Last Call comments we received on the W3C Character Model, because some of them may apply to your document, too.

I am disappointed that Public Review Issue #27 has been only partially resolved,
in that "The interpretation of joiner/nonjoiner between two combining marks is
not yet defined." I strongly supported the original proposal, according to which
ZWJ or ZWNJ between two combining marks would affect the rendering of those two
marks. I did not formally express this support because I understood the public
review issue as concerned only with the choice between options A and B (on which
I had no particular opinion), and that the main principle was not being
reviewed.

There are specific cases where ligatures may be made between combining marks
associated with the same base character, analogous to ligatures between base
characters, and there is a need for a mechanism to control ligation. One such
example is that (in some typesetting traditions) the Hebrew mark meteg generally
combines with certain Hebrew vowel marks, but is also sometimes written
separately. Another possible example is with IPA contour tones written above the
character; to avoid a possible proliferation of tone contours it might be
sensible to define these contours as ligatures of acute, grave and macron.

I would like to encourage the UTC to reconsider the issue which was left "not
yet defined" and to accept the principle that ZWJ and ZWNJ may be used to
control ligation between combining marks in specific defined instances (and
should be ignored when used between other combining marks). I intend to present
to the UTC a proposal for at least one such specific instance.

We fully support this. It's a pity there was a bug in the text, but fixing it is much better than leaving things inconsistent.
The exact effect on the various versions of the Unicode Standard should be clarified, in a way that does not adversely affect third-party specifications that automatically refer to previous versions of Unicode.

One consideration that the review document failed to address was the ease of
converting Bengali data to Unicode from ISCII and vice versa. This clearly must
be considered a con for both Models B and D.

In the case of Model B, it introduces a fourth hasant variant beyond the normal,
explicit, and soft variants present in ISCII. Unless ISCII were prepared to
introduce a fourth hasant form (perhaps as hasant, hasant, nukta), this would
introduce a complication in conversion to ISCII.

In the case of Model D, the introduction of an extra character also presents
difficulties for representation in ISCII, as there is no khanda-ta character in
ISCII. Even worse, unless the existing conventions for converting data from
ISCII to Unicode are changed for this one exception, data converted to Unicode
will not use the khanda-ta at all.

On the other hand both Models A and C have the advantage of ease of conversion
to and from ISCII, as both use the three usual hasant variants.
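For background, the hasant conventions at issue can be sketched in code. The halant (0xE8) and nukta (0xE9) byte values follow IS 13194, and the sequence conventions assumed here (double halant for the explicit form, halant plus nukta for the soft form) are the usual ISCII-to-Unicode mapping; the one-entry consonant table is a hypothetical placeholder for illustration only:

```python
# Sketch: the three usual ISCII-91 hasant variants mapped to Unicode Bengali.
# Assumed conventions: halant alone -> VIRAMA (normal form),
# halant halant    -> VIRAMA + ZWNJ (explicit form),
# halant nukta     -> VIRAMA + ZWJ  (soft form).

HALANT, NUKTA = 0xE8, 0xE9
VIRAMA, ZWNJ, ZWJ = "\u09CD", "\u200C", "\u200D"

# Hypothetical one-entry consonant table, for illustration only.
CONSONANTS = {0xC2: "\u09A4"}  # placeholder byte standing in for Bengali TA

def iscii_to_unicode(data: bytes) -> str:
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b == HALANT:
            if i + 1 < len(data) and data[i + 1] == HALANT:
                out.append(VIRAMA + ZWNJ)   # explicit hasant
                i += 2
            elif i + 1 < len(data) and data[i + 1] == NUKTA:
                out.append(VIRAMA + ZWJ)    # soft hasant
                i += 2
            else:
                out.append(VIRAMA)          # normal hasant
                i += 1
        else:
            out.append(CONSONANTS.get(b, "?"))
            i += 1
    return "".join(out)
```

A Model B "fourth hasant variant" would need yet another byte sequence on the ISCII side of such a converter, which is exactly the complication the comment describes.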

The issues raised in the review document were enough to convince me that Models
C and D were not desirable, but were not enough to cause me to favor either
Model A or B over the other. This additional point that I am raising is
sufficient to cause me to favor Model A.

From: Manoj Jain
Date: 2004-05-19 23:33:00 -0700

Dear All,

Most of the Bengali experts recommend that "Bengali Khand Ta" should be
encoded separately in the Unicode Standard.

regards,

Manoj Jain

Scientist
Government of India
Ministry of Communications & IT
Department of Information Technology
New Delhi 110003
Phone +91-11-24301240 Fax +91-11-24363076

This is about your enquiry on Bengali khanda-ta coding in Unicode.
My colleagues and I think that khanda-ta should be encoded as a
separate character, because it will help in both scientific (e.g.
NLP, computational linguistics) and commercial applications. The typist
will find it convenient to type in a single keystroke. Moreover, by the
alphabetic conventions of the Bengali script, it is treated as a separate
character.

You may kindly forward our views to the appropriate authority. Regards and

(formatted by editor in Arial Unicode MS due
to need for various Indic letters & transliterations)

I agree with the conclusion of Peter Constable’s PRI paper “Encoding of Bengali Khanda Ta in Unicode”. The current encoding model should be kept, but the descriptive wording of the Standard improved (model A). Model A correctly captures the fact that khaṇḍa ta is equivalent to an overt‐virāma form, both in historical origin and in modern clustering behaviour.

Khaṇḍa Ta should not be represented with obligatory ZWJ (model C) because the choice between khaṇḍa ta glyph representation and conjunct glyph representation depends on the capabilities of the font. Khaṇḍa Ta is a matter of glyph presentation, not of encoding. In addition, this would result in very heavy use of ZWJ in the encoding of regular Bengali text, and it is my impression that the ZWJ character was meant for requesting special behaviour in exceptional situations, not for constant use in the regular course of character coding.

Khaṇḍa ta should also not be represented as a separate character, for the reasons given above: it is the equivalent of an overt‐virāma form. Based on what native speakers have said on the Indic mailing list, it would seem to be the case that khaṇḍa ta is taught as a separate letter in Bengali primary schools. While this could be used as one argument among others in building a case for encoding as separate character, it is not on its own decisive. We know that in ancient India and right up into the period of the modern scripts, the consonant clusters क्ष (kṣa) and ज्ञ (jña) were regarded as separate letters, but nobody has ever suggested encoding them separately in Unicode. The intuitions of native users can be misleading.

Another important argument against encoding khaṇḍa ta as a separate character is overall consistency in the Unicode representation of Indian writing systems. If khaṇḍa ta were encoded as a separate character, then it would be the only consonant character in the Bengali script that does not have a short a inherent, and indeed the only such consonant character in any Brāhmī‐derived script. Breaking with such a fundamental property of the Indic encoding model should not be done lightly, even if there were good overriding arguments in the khaṇḍa ta case, which in my opinion there are not.

Concerning the Background section of the PRI document, I have two suggestions:

1. There is a rather vague reference to Chatterji 1926 that claims that according to him “ta‐hasanta was preferred for indigenous Bengali words ... in contexts in which conjunct forms would occur for loans from Sanskrit, Persian or other languages.” That may well be so, but I was not able to locate this claim in Chatterji’s orthography chapter. One should add a page reference to this sentence of the PRI paper or any derivative thereof.

2. The sentence at the bottom of page 1 (“... khanda ta is not used in older texts, and would not normally be expected in Sanskrit‐language documents”) is not only wrong, but directly contradicts the sentence quoted from Chatterji 1926 as well as the immediately preceding paragraph. It is my understanding that khaṇḍa ta is used in Sanskrit texts wherever a conjunct was not available, i.e., the usage sphere of khaṇḍa ta would be the same as that of Nagari’s half‐form and overt‐virāma form of ta taken together. As a matter of fact, I would expect khaṇḍa ta to occur rather more frequently in Sanskrit words than in real (tadbhava) Bengali words. This is because at the Middle‐Indo‐Aryan stage of language development, wide‐ranging consonant cluster assimilation removed all instances of t + another consonant in favour of homorganic clusters.

Best regards,
Stefan Baums
Asian Languages and Literature
University of Washington

I do not believe that adopting the jyutping romanization is in the interests of
the largest number of users of the unihan data. The Yale romanization
is widely used in the teaching of Cantonese, and
shifting to a different romanization for the unihan
data set will make it difficult for both teachers and
students. It would be much clearer for the unihan
data set to remain as close to the Yale romanization
as possible.

You note that Cantonese linguists prefer jyutping; while this may be true, it
is also a young romanization, without the decades of use that the Yale
romanization has seen. Today's preference may fade as the warts of the new
system begin to show, and may simply reflect the fact that its elements are
not yet as well known. Jyutping does have some interesting aspects, and as a
specialist tool for linguists it may be a good choice, but the romanization
chosen for unihan should strive for wide utility, not for interest. That wide
utility clearly lies with the Yale romanization.

best regards,
Ted Hardie

Date/Time: Fri Mar 26 04:18:48 EST 2004
Contact: John Clews

Dear Rick

You wrote via the JTC1/SC2 list:

> The Unicode Technical Committee has posted a new issue for public
> review and comment. Details are on the following web page:
> http://www.unicode.org/review/
> ... Briefly ... we plan to adopt a single, standard Cantonese
> romanization for use throughout the Unihan database.

I strongly agree with the recommendation of the Unicode Consortium that it would
be better to adopt the new jyutping romanization developed by the Linguistic
Society of Hong Kong <http://cpct92.cityu.edu.hk/lshk/>.

This would result in much more consistency at the expense of very few changes.
I have also recorded these comments via
http://www.unicode.org/reporting.html

John Clews,
Former Chair of ISO/TC46/SC2 (Conversion of Written Languages)
which deals with transliteration and transcription issues in ISO.

Date/Time: Mon Mar 29 06:23:08 EST 2004
Contact: Kent Karlsson

re review issue 31 (Cantonese romanisation)
I know nothing about the issue... But the line matching aap to aan does look odd.
I guess it's a typo.

I just wanted to add my support for adopting Jyutping instead of Yale. I run what I believe is the most popular Cantonese learning website on the Internet (www.cantonese.sheik.co.uk, ranked #1 in Google for "cantonese" and "cantonese learning") and I have had a lot of feedback from all over the world regarding which romanisation scheme to use.
It seems Yale is slightly easier for English speakers to use, notably from the UK and USA, but jyutping is easier for most Europeans. In my experience, Jyutping generally only takes English speakers about 10-15 minutes to learn.
Jyutping also offers a few more tangible advantages, as it distinguishes between certain sounds where Yale does not.

Finally, it is far easier to use tone numbers instead of diacritics on the Internet, so whether you choose Jyutping or Yale, please consider using tone numbers.

By the way, you may like to know that my site was fully converted to Unicode this weekend. I had a few difficulties but everything is now working well. You can read about it here:

1.) The decision to adopt a standard romanisation for Cantonese is a good one.

2.) Jyutping and Yale are both good romanisation systems, in fact the two best available, and so whichever of the two you choose, it will be a good decision.

3.) If you are looking for the most widespread romanisation, Yale is the choice. Especially in textbooks, Yale is widely used and Jyutping is not. Jyutping has, though, gained some popularity on the internet.

4.) If you are looking simply for the best romanisation, Jyutping is the choice. Advantages are:
* Yale has some inconsistencies: final -a instead of -aa, and dropping of initial y- when followed by the vowel -yu-.
* Jyutping distinguishes the vowels -eo- and -oe-, which are very different.
* Yale's choice of -eu- for these vowels is unfortunate, because it prevents that letter combination from being used for the rare diphthong produced by speaking -e- followed by -u. While Jyutping makes the obvious choice of writing -eu for this diphthong, there is no standard Yale way of writing it. I have seen -el and -ehu, neither of which is at all straightforward.
* Rare syllables are generally defined only in the Jyutping standard.
* Jyutping's choices for representing Cantonese sounds follow international standards. Yale retains some specifically English usages, especially j- and ch-.
* Standard Yale is defined as using the letter -h- and diacritics to indicate tones. Jyutping uses numbers, which are more computer-friendly.
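As a rough illustration of the initial-consonant point in the last two items, here is a sketch that converts the initial of a tone-numbered Yale syllable to its Jyutping counterpart. It covers only the correspondences mentioned in these comments (Yale j-, ch-, y- versus Jyutping z-, c-, j-); vowel differences such as -eu- versus -eo-/-oe- and tone marking are deliberately omitted:

```python
# Illustrative Yale -> Jyutping initial-consonant mapping, limited to the
# correspondences discussed above. A real converter would also handle
# vowels and tone representation, omitted here.

YALE_TO_JYUTPING_INITIALS = [
    ("ch", "c"),   # Yale ch- : Jyutping c-  (checked before "c" would match)
    ("j", "z"),    # Yale j-  : Jyutping z-
    ("y", "j"),    # Yale y-  : Jyutping j-
]

def convert_initial(syllable: str) -> str:
    """Rewrite the initial of one tone-numbered Yale syllable."""
    for yale, jyut in YALE_TO_JYUTPING_INITIALS:
        if syllable.startswith(yale):
            return jyut + syllable[len(yale):]
    return syllable  # initials identical in both systems pass through
```

For example, Yale "jau2" becomes Jyutping "zau2", and Yale "yat1" becomes Jyutping "jat1", while syllables whose initials agree in both systems are unchanged.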

We did not yet have time for a full review, but we plan to do a full review soon.
The definition of String as a sequence of code units (rather than characters) is rather strange. Either the report should change the definition, or use a different, more precise, term.
Also, PD10 contains an extra comma after "that".

In addition to a LATIN SMALL LETTER C WITH STROKE, there is also a LATIN CAPITAL
LETTER C WITH STROKE, found, for example, in the Bureau of American Ethnology
reports. So the cent sign and the letter must be disunified so that proper
casing can be done.

Date/Time: Thu Apr 29 13:26:17 EDT 2004
Contact: John Cowan

The proposed C WITH STROKE should be encoded separately, despite the similar
appearance to CENT SIGN. The use of a cent sign in place of a c-with-stroke is a
simple font approximation, analogous to 7 for TIRONIAN SIGN ET, ? for GLOTTAL
STOP, and my own use of CAPITAL LETTER OPEN E with COMBINING LONG SOLIDUS
OVERLAY for handwritten AMPERSAND. Abusus non tollit usum.

Date/Time: Thu Apr 29 19:55:32 EDT 2004
Contact: Philippe Verdy

The document fails to consider other possible legacy encodings of this
character, notably where it is already used with case mappings.

The table of possible legacy encodings should include the possibility that it is
already encoded with Unicode using Latin letter c or C, followed by a combining
solidus overlay, which would not have the problem of the CENT sign.

However, as it is not clear whether the legacy encodings may have chosen a
combining slanted solidus overlay or a combining vertical bar overlay as
presentation forms of the same character, the proposal to encode the character
separately may be useful to allow freedom in its presentation, without depending
too much on the slanted or vertical presentation of the combining overlays.

The document raises the point that phonetic characters used to write languages
without an accepted orthography will sooner or later evolve toward normal use
as plain Latin letters, including an uppercase version. Casing is a standard
feature of the Latin script and is very often used as a matter of style for the
presentation of book and chapter titles, or as an emphasis style (including the
small-caps style), and is sometimes required for the presentation of certain
documents (notably postal addresses on envelopes, and administrative forms).

So encoding the proposed character with gc=Ll will make it suitable for the
later addition of an uppercase version. Still, not proposing the uppercase
version of the character will not make it a true Latin letter for languages
with an accepted orthography, as it would cause problems if one wanted to use
it properly for toponyms, trademarks, people's names, etc., where uppercase
would be needed. If this creates a problem immediately, people will start by
rejecting the proposed Unicode character, as it will complicate the case
mappings (the lowercase would be encoded separately, but the uppercase would
have to be emulated with C plus a combining solidus overlay, or even worse
with just a capital C).

For semantic preservation under case-folding operations, it seems reasonable
then to include both the lowercase and uppercase versions (and to add a note so
that the uppercase version will not be unified with the CEDI currency sign also
proposed recently).

The same remark applies to the African R-barred, U-barred, and W-barred, which
are used in Niger, Cameroon, and Congo (Kinshasa, the former Zaire): some of
them exist only in a lowercase version, and the lack of an uppercase version is
already a problem. (Note the resemblance of W-barred to the Won currency sign:
another hack that has been used to approximate the missing character, simply
because there is no other workable solution for printing it.)

Date/Time: Wed May 26 14:55:21 CDT 2004
Contact: John Koontz

I do not know if I understand all of the principles governing Unicode
encoding well enough to offer an appropriate argument on the slashed c
encoding issue (http://www.unicode.org/review/pr-35.pdf) either way.
However, I can add that the US Bureau of American Ethnology, a predecessor to the US National Anthropological Archives, used slashed c
to represent the edh character, and that this usage is embedded in BAE
orthography in, e.g., the BAE and Contributions to North American
Ethnography series, in the work of James O. Dorsey on the Siouan languages.
For example, the Dhegiha (or Omaha-Ponca) language is referred to as
C/egiha. Capitalized and lower case versions are used. I can provide more
precise citations of examples if this is desired. BAE orthography is a
dead issue at present, but an interesting and useful body of Americanist
literature on American languages is encoded in it. Slashed c is not the
only special character employed there, though I think that most can be
represented in Unicode with floating diacritics. The exceptions may be the
"turned" or "inverted" letters, e.g., p t k s c, cent-sign, c-cedilla, h, and
perhaps a few others.

Date/Time: Wed May 26 15:58:27 CDT 2004
Contact: Julian Bradfield

I am a computer scientist with an interest in character encoding issues,
and I also maintain a strong interest in phonetics and phonology, and will
be working in the area shortly.

I wish to support the decision to make latin small letter c with stroke
a separate character from cent sign.

The arguments in favour are valid; and moreover, the two characters are
conceptually quite different. As a non-American, it would not even have
occurred to me that slashed-c might be the same as the cent sign, although
I am of course used to seeing slashed-c as one variant of the cent sign.

The legacy encoding argument seems weak to me. The use of this character is
limited, as far as I know, and while there is almost certainly some data that
codes c-slash as cent in some legacy encoding, I find it hard to believe that
the amount of such data is sufficient to outweigh the future inconvenience
caused by unifying c-slash and cent.

The arguments for unification of LATIN SMALL LETTER C WITH STROKE with CENT SIGN
are mostly concerned with data conversion issues. The users affected are
primarily going to be a relatively small group of specialists (linguists using
the symbol for phonetic transcription according to the Americanist tradition)
who are familiar with the need for data conversion, and will have access to
means of converting their data. Data conversion will not be a big problem for
most of them, and so the arguments that there is a need to maintain
compatibility with past encodings are not very strong.

As one of those users, I would prefer having a separate character which reliably
had the correct glyph and character properties. I would not want to be forever
hamstrung in use of this character by an attempt to maintain compatibility with
what would clearly be regarded as our past makeshift representation in legacy
encodings (even if those encodings were standard ones).

So, I would argue against unification, and for encoding this as a separate
character. Let's do it right and discard the past in this case.

Date/Time: Thu May 27 12:52:06 CDT 2004
Contact: James L. Fidelholtz

Peter Constable (http://www.unicode.org/review/pr-35.pdf) gives arguments for
and against using this symbol as distinct from the 'cent sign' and other
possibilities already incorporated within Unicode. In this case, I consider the
argument in favor of the existence of capital letter variants to be crucial for
the adoption of the proposal.

More generally, I find it somewhat disturbing that the issue even arises, since
in my understanding of Unicode as a sort of universal encoding for *all* letters
and symbols for *all* writing systems, it seems to me that it should be
generally inclusive, excluding symbols *only* if there are VERY strong arguments
against them (which I cannot conceive for any case, but am prepared to admit
could possibly exist). If the intention is truly to have a *universal* and
*standard* coding system, which I strongly support, then it *must* be
*inclusive*.

If the 'same' symbol is encoded in different sets in different ways, this can
only make it easier to use in different ways for different people. There is no
reason to arbitrarily exclude symbols, as far as I can see, for, really, *any*
reason, and much less if there are even moderately strong arguments in their
favor, as there are in the present case.

James L. Fidelholtz

Date/Time: Fri May 28 09:05:16 CDT 2004
Contact: Rory Larson

I work with the Omaha-Ponka language. A great deal of OP material was recorded
in the 19th century by the missionary James Owen Dorsey. Dorsey used the c with
slash character for a special phoneme in OP which I call "ledh". This is a
non-continuous sound which starts as [l] with the tongue curled up to the
alveolar ridge. The point of articulation then slides down the back of the upper
front teeth and drops off the bottom as an edh sound. Thus, it is somewhere
between an [l], an edh, and an apically rolled [r]. The modern written form of
the language uses the "th" digraph, but it would be nice to have a single
character to represent this. In that case, a capital form as well as a lower
case form would be needed. I don't know if there is any other recognized
phonetic symbol for ledh; perhaps the Japanese [l/r] sound is similar? In any
case, a conversion from the Dorsey corpus would probably require manual retyping
anyway, so legacy issues should not be a problem.

Thanks,
Rory

Date/Time: Mon May 31 14:58:47 CDT 2004
Contact: Doug Ewell

I support the separate encoding of LATIN SMALL LETTER C WITH STROKE (as well
as its uppercase counterpart) and its implicit disunification from U+00A2 CENT
SIGN.

The question of encoding this letter seems comparable to the questions years
ago of encoding U+01BC LATIN CAPITAL LETTER TONE FIVE as distinct from '5',
U+01C0 LATIN LETTER DENTAL CLICK as distinct from '|', and U+0222 and U+0223
LATIN * LETTER OU as distinct from '*'. In each case, the identity of the
character as a letter, with letter properties, outweighed the potential for
legacy transcoding problems.

It seems unlikely that there are large amounts of legacy data for these
Americanist transcriptions that use both CENT SIGN and LATIN SMALL LETTER C WITH
STROKE such that disambiguation would become a problem.

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

Date/Time: Tue Jun 1 02:21:39 CDT 2004
Contact: Martin Duerst

This is a comment on behalf of the W3C I18N WG.

Any consideration for encoding this with U+0338 (COMBINING LONG SOLIDUS OVERLAY) seems to be missing. Given the standing policy that no more precomposed letters are being encoded, there may be no need at all to encode this letter. The samples in the pdf document all show slanted strokes (rather than vertical), even in the Roman font example, but there is no discussion of the possibility of using a combining character.
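The combining-sequence alternative is easy to check: the sequence <c, U+0338 COMBINING LONG SOLIDUS OVERLAY> has no precomposed canonical equivalent, so it survives NFC unchanged, and ordinary case mapping of the base letter yields an uppercase form for free. A quick check with Python's unicodedata module, using a modern copy of the UCD:

```python
import unicodedata

# The sequence <c, COMBINING LONG SOLIDUS OVERLAY> has no precomposed
# canonical equivalent, so normalization leaves it alone...
seq = "c\u0338"
assert unicodedata.normalize("NFC", seq) == seq

# ...and the base letter cases normally, giving an uppercase form for free
# (U+0338 has no case mapping of its own).
assert seq.upper() == "C\u0338"
```

This is the property the WG's comment appeals to: a combining-overlay representation needs no new character, and casing falls out of the base letter.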

Michael Everson has made a proposal, N2746, for encoding the Phoenician script
in the UCS. The principle of encoding Phoenician separately from Hebrew has been
discussed at length e.g. on the Unicode Hebrew list, and remains highly
controversial. Indeed it seems to have won little support in these discussions
apart from that of the current proposer. The general scholarly practice is to
encode Phoenician, paleo-Hebrew etc as Hebrew script with variant glyphs. A
change to using a separate Phoenician script will be disruptive and will
compromise existing encoded texts. The user community is apparently far from
convinced that the negative effects of this change will be outweighed by any
claimed benefits.

In section C point 2a of the proposal the proposer states that no contact has
been made with the user community. In fact there has been some contact, at least
on the Unicode Hebrew list, but the users contacted have not been in favour of
the principle of the proposal.

Date/Time: Thu Apr 29 15:30:26 EDT 2004
Contact: John Cowan

I believe that it is inappropriate to encode Phoenician script at this time. The
Roadmap provides for no less than 8 copies of the same 22-character West Semitic
abjad (viz. Hebrew, Mandaic, Samaritan, North Arabic, Palmyrene, Nabataean,
Phoenician, Aramaic). Before any of these other than Hebrew are encoded, we need
to have a systematic justification for making precisely these cuts in the
complex Semitic family tree and no others. Saying simply "Adherence to the
Roadmap" does not cut it. (Greek, Arabic, Syriac, and Indic, though also
descendants of Phoenician, are not relevant because they are no longer
22-character abjads).

In particular, if all of these are encoded using the Hebrew block, they will
"just work" without any further implementation effort, since none of them
require any treatment different from that applied to the subset of Hebrew
characters represented by the base characters excluding final forms. This is a
real advantage to users. An affirmative defense is needed for disunifying these
scripts from Hebrew.

Date/Time: Mon May 10 12:24:59 CDT 2004
Contact: John Cowan

I wish to withdraw my remarks opposing the encoding of Phoenician as a separate
script.

I also urge the UTC to collate Hebrew and Phoenician scripts jointly in the
default collation, so that aleph and alaph are given the same primary weight,
beth and beth, etc. etc.

I appreciate Jony Rosenne's comments on these Hebrew-related items.

My position on the Phoenician proposal is already clear from L2/04-206. If the proposal for a new script is accepted despite the position against it of scholars of north-west Semitic script, then Rosenne's second paragraph becomes an important observation.

I agree with Rosenne's comments on Meteg. On Qamats Qatan, I agree that this is a glyph variant of Qamats and should be treated as such. The most appropriate mechanism would appear to be a Variation Selector, but this depends on an extension of the currently defined mechanism to support variant glyphs of combining characters. An acceptable alternative might be a new character with a compatibility decomposition to the existing Qamats.

On Holam, Rosenne rightly points out that this is an important plain text issue which must be addressed by the UTC. His support for Option B1 in my proposal (http://qaya.org/academic/hebrew/Holam.html) seems theoretically neat, but the UTC should not entirely avoid considerations of implementation feasibility. The difficulty is that implementation of this option requires the rendering engine to position a glyph according to the phonetic environment of the sound represented by the character, and not only the graphical environment of the glyph. This is well outside the intended scope of rendering engines, although it may just be feasible for some engines to distinguish the environments commonly encountered in practice. The encoding with ZWNJ, which is my Option B2, avoids this requirement for the rendering engine to determine the phonetic environment, by distinguishing the two positions of the glyph by the presence or absence of ZWNJ. This also ensures that the glyph is positioned correctly even in some rare cases, e.g. the divine name as discussed below, where the method of determining it from the phonetic environment breaks down.

My most significant comments here are on the section of Rosenne's submission entitled "Qere and Ketiv". It seems to me that there is a basic misunderstanding in this section. Rosenne seems to hold that Unicode should not seek to represent the actual form of the pointed text of the Hebrew Bible, as it has been presented in manuscripts and printed editions for more than 1000 years, but only either the unpointed text (Ketiv) or the form which is to be pronounced, as reconstructed from marginal notes (Qere). But Unicode is supposed to represent written texts, not their pronunciation. So the forms which should take precedence are those which are actually found on paper.

Fortunately, there is in fact much less of a problem in representing these forms than Rosenne seems to suggest. There are some cases in which Hebrew points appear as spacing diacritics, either at the beginning of a word or in words which have no base characters, but the Unicode representation for this is well known: the combining marks are combined with NBSP (or SPACE, but the latter is inappropriate here as the word must not be broken). There are also cases of two vowel points combined with one base character, but the issues here have already been considered by the UTC, in August 2003, and the principle was accepted that the vowel points can be separated by CGJ to avoid inappropriate canonical reordering. There are some challenges here for rendering, but none for Unicode representation.
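The CGJ mechanism referred to here is straightforward to demonstrate. HIRIQ (U+05B4) has a lower canonical combining class (14) than PATAH (U+05B7, class 17), so normalization reorders a patah-hiriq sequence unless CGJ (U+034F, class 0) stands between the points. The combining classes are real; the two-vowel sequence itself is artificial, chosen only to exhibit the reordering:

```python
import unicodedata

ALEF, PATAH, HIRIQ, CGJ = "\u05D0", "\u05B7", "\u05B4", "\u034F"

# Without CGJ, canonical ordering sorts the points by combining class
# (hiriq = 14, patah = 17), destroying the encoded mark order.
plain = ALEF + PATAH + HIRIQ
assert unicodedata.normalize("NFC", plain) == ALEF + HIRIQ + PATAH

# With CGJ (combining class 0) between them, reordering is blocked and
# the original order survives normalization.
guarded = ALEF + PATAH + CGJ + HIRIQ
assert unicodedata.normalize("NFC", guarded) == guarded
```

This is exactly the behavior the UTC's August 2003 decision relies on: CGJ is normalization-stable, so a visually and textually significant point order can be preserved in normalized text.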

The rare forms of the divine name whose correct pointing causes a problem, as in the right hand image in Figure 4 of my Holam proposal, are probably technically cases of "perpetual Qere" and so not pronounced as written - although some in fact hold that the pronunciation as written (YEHOVAH) is correct. Nevertheless, the form as written, complete with anomalous position of Holam, is printed in a standard scholarly text (Biblia Hebraica Stuttgartensia) and is of special religious significance to some. It should therefore be supported in plain text. This is further evidence that the simple form of Option B1 of my Holam proposal is inadequate.