L2/12-160

Comments on Public Review Issues
(February 6, 2012 - May 4, 2012)

The sections below contain comments received on the open
Public Review Issues and other feedback as of May 4, 2012,
since the previous cumulative document was issued prior to
UTC #130 (February 2012).
This document does not include feedback on moderated Public Review
Issues from the forum that have been digested by the forum moderators; those
are in separate documents for each of the PRIs. Gray items in the Table of
Contents do not have feedback here.

The proposed update for UTS 18, Unicode Regular Expressions, section 1.5,
Simple Loose Matches, includes an example showing the expansion of /Dåb/
into /(?:d|D)(?:å|Å|Å)(?:b|B)/ . There's no need to repeat Å in the
expansion; I assume that the second occurrence is instead meant to be the
ANGSTROM SIGN, or more clearly \u212B, since it also has å as its
lowercase mapping.
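
The set of characters that belong in such an expansion can be checked mechanically. The following is a minimal Python sketch and is not taken from UTS 18: the helper names loose_variants and expand are hypothetical, the search is limited to the BMP for speed, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import re

def loose_variants(ch):
    """Return every BMP character whose case folding equals that of ch."""
    target = ch.casefold()
    return [chr(cp) for cp in range(0x10000) if chr(cp).casefold() == target]

def expand(literal):
    """Build a case-expanded pattern, one non-capturing group per character."""
    groups = []
    for ch in literal:
        alternatives = "|".join(re.escape(v) for v in loose_variants(ch))
        groups.append("(?:" + alternatives + ")")
    return "".join(groups)

# "D\u00e5b" is the literal Dåb with a precomposed a-ring.
for ch in "D\u00e5b":
    print(ch, [f"U+{ord(v):04X}" for v in loose_variants(ch)])
```

For the middle letter, the list contains three distinct code points, the third being U+212B ANGSTROM SIGN, so nothing needs to be listed twice.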

The proposed update for UTS 18, Unicode Regular Expressions,
section 2.4, Default Case Conversion, is not very clear on how
full caseless matches are supposed to be handled in different situations.
The guidance provided seems to cover only the case of literals
within patterns. It's not clear how, say, a class such as /[äöüß]/i
should be handled. Full mapping of "ß" results in "SS", but a
two-letter string cannot be a member of a set of characters. So,
should the "SS" be quietly dropped in this case (as the ICU implementation
does)? Or should the class be rewritten as /(?:ä|ö|ü|ss)/i ? Going further,
should /[a-ß]/i result in an error, or what does it mean?
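
The mismatch between a full case mapping and a character class can be seen in a short Python sketch. Python's re module is used here only as a convenient illustration; this is not a claim about how ICU or a UTS 18 conformant engine behaves.

```python
import re

# The full uppercase mapping and the full case folding of sharp s
# are both two-letter strings.
upper = "ß".upper()
folded = "ß".casefold()
print(upper, folded)

# A character class matches exactly one character, so it can never
# match the two-letter string "SS", whatever the case-folding rules.
class_match = re.fullmatch("[äöüß]", "SS", re.IGNORECASE)
print(class_match)

# Rewriting the class as an alternation leaves room for a two-letter member.
alt_match = re.fullmatch("(?:ä|ö|ü|ß|ss)", "SS", re.IGNORECASE)
print(alt_match is not None)
```

The alternation form works, but it changes the pattern from a set of characters into a set of strings, which is exactly the question the comment raises for ranges such as /[a-ß]/i.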

I was re-reading the draft, and noticed this minor problem
that I had overlooked:
In section 2.5, it has these:
\p{HANGUL SYLLABLE GAG}
\p{BEL}
\p{BELL}
Did you mean to suggest that all character names should be
considered properties? I had never noticed anything like this
before, and I worry about the possibility of collisions.
Perl uses e.g., \N{BELL} to specify character names.
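
For comparison, Python also keeps character names in a separate namespace from properties, through \N{...} escapes and unicodedata.lookup. The sketch below only illustrates that convention; the result depends on the Unicode version bundled with the interpreter (BELL as a character name requires Unicode 6.0 or later).

```python
import unicodedata

# Names are resolved through a dedicated name lookup, not a property test.
gag = unicodedata.lookup("HANGUL SYLLABLE GAG")
bell = "\N{BELL}"

print(f"U+{ord(gag):04X}", f"U+{ord(bell):04X}")
```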

Two issues:
The document uses the term "stacked" for horizontal cursive scripts
(Arabic, Syriac, etc.) written vertically so as to be read top-to-bottom.
This style is different from default vertical positioning, but conflating
it with the use of unrotated glyphs in horizontal non-cursive scripts
(Latin, Greek, etc.) is IMHO more confusing than helpful.
Something also needs to be said about Ogham in section 4. The tables
correctly give it an orientation property of Rotatable-only, but don't
mention that it is written bottom-to-top, and therefore Ogham embedded
in vertical scripts requires bidi handling in all cases.

I notice in the latest meeting minutes:
A.5.2 Action item review.
[130-A1] Action Item for Lisa Moore: Follow up with Andhra Pradesh
on action 125-A17.
[130-A2] Action Item for Eric Muller: Take info for Indic TR and turn
into a document for the doc register.
Where 125-A17 is:
South Asian Subcommittee — TELUGU LENGTH MARK (D.3.1)
[125-A17] Action Item for Manoj Jain: Work with Andhra Pradesh Gov't to
determine what additional clarifications and annotations may be required
for the Telugu script. L2/10-339
[125-A18] Action Item for Eric Muller, Julie Allen, Editorial Committee:
Look for cases to be added to the confusable vowel representation tables
in the Indic chapter(s) for Unicode 6.0. Look at document L2/10-339 Telugu,
and other cases where documentation could be improved.
Since I was the one who submitted the document L2/10-339 requesting
deprecation of Telugu Length Mark, let me just give the list of confusables
I had in mind.
VS-II ీ = VS-I ి + LM ౕ
VS-EE ే = VS-E ె + LM ౕ
VS-OO ో = VS-O ొ + LM ౕ
HA హ VS-AA ా -> HAA హా = HA హ LM ౕ
(VS = vowel sign; LM = length mark)
The people with the Action Item can incorporate this into what they write.
[Submitted via the form as per offlist suggestion of Markus Scherer to
ensure it doesn't get forgotten.]
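
The confusable pairs listed above can be checked against normalization with a small Python sketch. The code points are those of the characters shown in the list; the helper name canonically_equivalent is hypothetical, and the result depends on the Unicode data bundled with the interpreter.

```python
import unicodedata

LM = "\u0c55"  # TELUGU LENGTH MARK

PAIRS = [
    ("\u0c40", "\u0c3f" + LM),        # VS-II  vs  VS-I + LM
    ("\u0c47", "\u0c46" + LM),        # VS-EE  vs  VS-E + LM
    ("\u0c4b", "\u0c4a" + LM),        # VS-OO  vs  VS-O + LM
    ("\u0c39\u0c3e", "\u0c39" + LM),  # HA + VS-AA  vs  HA + LM
]

def canonically_equivalent(a, b):
    """True if the two strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

results = [canonically_equivalent(a, b) for a, b in PAIRS]
print(results)
```

If none of the pairs is canonically equivalent, normalization alone will not fold the look-alike sequences together, which is why they would need to be documented as confusables.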

Since this sign is the same in form and function as its Tibetan look-alike
U+0FD3, I think the two should be unified, provided that the Tibetan sign
actually means the same thing (I can't find information about this). It's a
little strange to incorporate a Tibetan character into Devanagari fonts, but
it does not seem to require any special Tibetan support. U+0FD3 is Po rather
than So, but as we know that is not a hard and fast distinction.

Since the proposed SHARADA SIGN NUKTA has exactly the same form, function, and
properties as the Devanagari version, I think unification should be strongly
considered. In the words of the proposal, "these signs were used by Kashmiri
scribes in both Sharada and Devanagari", which implies that the Sharada sign
is borrowed from Devanagari. In general, when a character is borrowed from a
related script, we don't double-encode it unless its range of forms in the
borrowing script are outside the bounds of the lending script, as with the
Kurdish Q.
The other two marks also have Devanagari look-alikes, but clearly don't share
function with them, so they should be encoded.

The same issue I raised about DEVANAGARI SIGN SIDDHAM applies here also: this
should be unified with Tibetan U+0FD3, provided the semantics is the same.
Failing that it should at least be unified with the Devanagari sign, since
there is plenty of precedent for sharing Devanagari punctuation/symbols with
other Indic scripts.

I'd like to suggest the following clarifications in the Unicode Names List:
1) To avoid confusion between the Latin-American Peso currencies and the
Filipino currency, add an alias "Filipino Peso Sign" to U+20B1.
2) Modify the comment for the U+20B1 code point to state something like
"Extant and discontinued Latin-American Peso currencies (Mexican, Chilean,
Colombian, Dominican, etc.) use the dollar sign.".
3) Change the spelling from "milreis" to "milréis" in the informative
aliases for U+0024.
4) Add a comment to U+0024 along the lines of "The dollar symbol is used for
many peso currencies in Latin America and elsewhere, except U+20B1, which is
used for the Philippine peso.".
For rationale and background for this request, please see the Unicode Forum
Discussion at http://www.unicode.org/forum/viewtopic.php?f=21&t=261 .
Please also use the provided background information to add to the description in
Chapter 15, Currency Symbols, where neither the Dollar nor the (Philippine) Peso
is currently discussed explicitly, while Yen/Yuan is.
Thank you.

I was looking at the charts and just discovered U+0342 COMBINING GREEK
PERISPOMENI. It really confused me; I thought a glyph error had found
its way into the charts.
I think it would be a good idea if some minor explanation is added to
the NamesList, together with a reference to U+0303 COMBINING TILDE.

When new Hangul characters were added in Unicode 5.2, it appears that they
were all given an EastAsianWidth property value of W. This is the case
regardless of the type of jamo. But that is not consistent with properties
that were assigned to jamo that predate TUS 5.2: choseong characters
(1100..1159) were given a width value W, but jungseong (1160..11A2) and
jongseong (11A8..11F9) were given a width value N. Thus, all of the newer
jungseong and jongseong characters have different width values than the
older jungseong and jongseong characters.
Unless there was a specific reason for setting these characters to W,
I suggest that the following have their East Asian Width values set to
N: 11A3..11A7, 11FA..11FF, D7B0..D7FB.
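
The values in question can be tabulated with a short Python sketch. The range labels follow the comment above; the helper name widths is hypothetical. The output reflects whatever Unicode version the interpreter bundles, so it may differ from the data the comment was written against, and unassigned code points inside a range are reported with their default value.

```python
import unicodedata

RANGES = {
    "choseong 1100..1159": (0x1100, 0x1159),
    "jungseong 1160..11A2": (0x1160, 0x11A2),
    "jongseong 11A8..11F9": (0x11A8, 0x11F9),
    "added 11A3..11A7": (0x11A3, 0x11A7),
    "added 11FA..11FF": (0x11FA, 0x11FF),
    "added D7B0..D7FB": (0xD7B0, 0xD7FB),
}

def widths(first, last):
    """Return the distinct East_Asian_Width values found in first..last."""
    found = {unicodedata.east_asian_width(chr(cp)) for cp in range(first, last + 1)}
    return sorted(found)

report = {label: widths(*bounds) for label, bounds in RANGES.items()}
for label, values in report.items():
    print(label, values)
```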

There is no declared policy on the storage sequence of decimal digits,
i.e. characters with general category Nd. What is currently done could
be summed up as:
'The Bidi class of decimal digits shall be such that a sequence of digits
from the same set of 10 contiguous character points shall be stored in
order of decreasing significance when representing a number'.
This could be included in the stability guarantee at
http://www.unicode.org/policies/property_value_stability_table.html .
At present, all decimal digits have the Bidi class EN, AN or L except for
the N'ko decimal digits, which have the Bidi class R. If this principle
were violated, a 'simplistic parser' could misinterpret values of digit
sequences. (Not that it would be likely to get the prime number 25₁₆ right either!)
The guarantee, converted to a statement of practice, could reasonably be
included in the TUS section on 'Numeric Value', currently Section 4.6.
It would be good to say there that this principle is and will generally be
followed for characters that primarily function similarly to 'decimal digits',
e.g. for other radices or for derived characters such as superscript numerals.
(The word 'primarily' allows the principle to be ignored for letters also
used as digits.)
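
The kind of 'simplistic parser' referred to above can be sketched in Python as follows. The function name parse_digits is hypothetical; it assumes only that the first stored digit is the most significant one, and it ignores Bidi class entirely.

```python
import unicodedata

def parse_digits(text):
    """Simplistic parser: the first stored digit is the most significant."""
    value = 0
    for ch in text:
        value = value * 10 + unicodedata.decimal(ch)
    return value

# The digits ONE and TWO from four different sets of decimal digits.
samples = {
    "ASCII": "12",
    "Arabic-Indic": "\u0661\u0662",
    "Devanagari": "\u0967\u0968",
    "N'Ko": "\u07c1\u07c2",
}

for label, text in samples.items():
    print(label, unicodedata.bidirectional(text[0]), parse_digits(text))
```

Because every set is stored in order of decreasing significance, the same loop yields the same value for all four samples, even though the N'Ko digits have Bidi class R and are displayed right-to-left.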

Section 5.1, Parametric Tailoring, of UTS 10 describes caseLevel as "If set to
on, a level consisting only of case characteristics will be inserted in front
of tertiary level. To ignore accents but take cases into account, set strength
to primary and case level to on."
I think "in front of tertiary level" should really be "between primary and
secondary level". "In front of tertiary level" is normally interpreted as
"between secondary and tertiary level", but then it would still distinguish
based on accents.
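
The distinction drawn here can be illustrated with a toy model in Python. This is not the UCA algorithm: the three "levels" below are crude stand-ins (base letters, combining marks, and uppercase flags), and all function names are hypothetical.

```python
import unicodedata

def levels(text):
    """Toy weights: base letters, accents, and case, one tuple per level."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    primary = tuple(c.lower() for c in base)
    secondary = tuple(c for c in decomposed if unicodedata.combining(c))
    case = tuple(c.isupper() for c in base)
    return primary, secondary, case

def key_primary_then_case(text):
    """Case level directly after primary: accents ignored, case kept."""
    primary, _, case = levels(text)
    return primary, case

def key_secondary_then_case(text):
    """Case level after secondary: accents are still distinguished."""
    return levels(text)

plain, accented, capital = "role", "r\u00f4le", "Role"
print(key_primary_then_case(plain) == key_primary_then_case(accented))
print(key_primary_then_case(plain) == key_primary_then_case(capital))
print(key_secondary_then_case(plain) == key_secondary_then_case(accented))
```

In this model, only the key that puts case directly after the primary level treats "role" and "rôle" as equal while still separating "Role", which is the behavior the quoted sentence about ignoring accents describes.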

Dear Sir/Madam,
Would you please inform me about the latest position on the Assamese writing
system in Unicode? Earlier, Unicode said that the Bengali script is used for
writing Assamese. We disagree, since we have our own script, which has a
history of 1500 years and from which Bengali and Maithili developed.
Moreover, at least 15 characters of Assamese are different from modern
Bengali. With full documentary evidence and our state government's approval,
we have been requesting Unicode to provide a separate slot for Assamese.
Sincerely yours,
A. Haque

Sir/Madam,
With due respect, I again inform you that the Assamese script is not the
Bengali script. Historically, too, the typeset prepared by the British
was sampled from an Assamese manuscript. Again, the oldest written form
of the Assamese script was found in the "Charyapad". The language of the
"Charyapad" is Kamrupi. Even the book on Origin of Bangla Script
was written by collecting inscriptions and manuscripts of Assamese
writings. Then why has your consortium repeated the same mistake,
hurting the self-esteem of the Assamese people? If you need scientific
proof in support of the distinct identity of the Assamese script, please
let us know the way to establish the truth. I respect your
consortium and I also understand its importance.
But as an Assamese person, I never want any
wrong information in your version that undervalues any Assamese
script or language.
I eagerly look forward to your valuable suggestions for
not hurting the sentiments of the Assamese people further.