L2/13-202

Comments on Public Review Issues
(July 30 - October 31, 2013)

The sections below contain links to permanent feedback documents for the open
Public Review Issues as well as other
public feedback as of
May 1, 2013, since the previous cumulative document was issued prior to UTC #135 (May
2013).
This document does not include feedback on moderated Public Review
Issues from the forum that have been digested by the forum moderators; those
are in separate documents for each of the PRIs.
Grayed-out items in the Table of
Contents do not have feedback here.

Contents:

The links below go directly to open PRIs and to feedback documents for
them, as of
November 1, 2013.

Currently, the common Indic dandas are listed in ScriptExtensions.txt as:
0964..0965 ; Beng Deva Guru Orya Takr # Po [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA
But there are also pointers to the dandas from the following blocks in
NamesList.txt: Gujarati, Tamil, Telugu, Kannada, Malayalam.
The text of the Core Specification says in section 9.1, under "Punctuation":
"the intent is that they be used as common punctuation for all the major
scripts of India covered by this chapter. Danda and double danda punctuation
marks are not separately encoded for Bengali, Gujarati, and so on."
The Gujarati, Tamil, Telugu, Kannada, and Malayalam sections in the core
spec also clearly refer to dandas used from the Devanagari block.
Apart from that, Limbu (section 10.5) also seems to use the double danda,
while Syloti Nagri (10.6) uses both dandas.
This means the line in ScriptExtensions.txt needs to change to:
0964 ; Beng Deva Gujr Guru Knda Mlym Orya Sylo Takr Taml Telu # Po DEVANAGARI DANDA
0965 ; Beng Deva Gujr Guru Knda Limb Mlym Orya Sylo Takr Taml Telu # Po DEVANAGARI DOUBLE DANDA
We could also probably use pointers in the NamesList from the Limbu and
Syloti Nagri blocks to the dandas they use.
[The situation of Sinhala is not clear, but we can update that if we
find more information.]
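
As a rough illustration of the proposed change, the following Python sketch parses the current and proposed ScriptExtensions.txt lines quoted above and lists the script codes each danda would gain. The helper name parse_scx_line is hypothetical, and the parser only handles the simple "code point(s) ; scripts # comment" shape seen in this report.

```python
# Current and proposed lines, copied from the report above.
CURRENT = "0964..0965 ; Beng Deva Guru Orya Takr # Po [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA"
PROPOSED = [
    "0964 ; Beng Deva Gujr Guru Knda Mlym Orya Sylo Takr Taml Telu # Po DEVANAGARI DANDA",
    "0965 ; Beng Deva Gujr Guru Knda Limb Mlym Orya Sylo Takr Taml Telu # Po DEVANAGARI DOUBLE DANDA",
]


def parse_scx_line(line):
    """Hypothetical helper: return {code_point: set_of_script_codes} for one line."""
    data = line.split("#", 1)[0]                      # drop the trailing comment
    cp_field, scripts_field = (part.strip() for part in data.split(";"))
    if ".." in cp_field:                              # a range such as 0964..0965
        first, last = (int(h, 16) for h in cp_field.split(".."))
    else:                                             # a single code point
        first = last = int(cp_field, 16)
    scripts = scripts_field.split()
    return {cp: set(scripts) for cp in range(first, last + 1)}


current = parse_scx_line(CURRENT)
proposed = {}
for line in PROPOSED:
    proposed.update(parse_scx_line(line))

for cp in sorted(proposed):
    added = sorted(proposed[cp] - current[cp])
    print(f"U+{cp:04X} adds: {' '.join(added)}")
# U+0964 adds: Gujr Knda Mlym Sylo Taml Telu
# U+0965 adds: Gujr Knda Limb Mlym Sylo Taml Telu
```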

[:Line_Break=Close_Punctuation:] has
U+FE50 (small comma), U+FE11 (presentation form for vertical ideographic comma),
U+FF0C (full width comma) and U+FF64 (half width ideographic comma), but
U+FE10 (presentation form for vertical comma) and U+FE51 (small ideographic
comma) are NOT included.
U+FE10 (presentation form for vertical comma) is LB=Infix_Numeric and
U+FE51 (small ideographic comma) is LB=Ideographic.
It might make sense for U+FE10 (presentation form for vertical comma)
to have LB=Infix_Numeric because the corresponding ASCII comma
(non-presentation form) has that, too.
However, treating U+FE51 (small ideographic comma) and U+FE11 (presentation
form for vertical ideographic comma) differently (the former in LB=Ideographic
and the latter in LB=CP) seems not very consistent.
This issue was initially reported against CLDR ( http://unicode.org/cldr/trac/ticket/6557 ).

Hello,
This is not a bug report per se, but is just to bring an issue we
came across about the uppercase of U+0587 at Google to the UTC's attention.
U+0587 (Armenian Small letter Ligature ECH YIWN) is currently
case-mapped to a sequence of U+0535 (Armenian Capital Letter ECH)
and U+0552 (Capital letter YIWN).
There's a report from Armenian speakers in Armenia that the latest
Armenian orthography as used in Rep. of Armenia uppercases it to
a sequence of U+0535 and U+054E (Armenian Capital Letter VEW).
OTOH, Armenian diaspora and "Western Armenian" speakers follow
the current Unicode standard.
A comment from Google's Armenian speaker:
"That form was used in Armenia before "spelling reform of the
Armenian language" at the beginning of the 20th century (1922–1924 -
according to Wikipedia). There is a variation of Armenian language
currently used by Armenian diaspora, who still use the old version.
But everyone in Armenia (including official documents and media)
are using the new form."
Another comment from a linguistics professor in Yerevan:
<quote>
So I asked this guy http://www.ysu.am/science/en/4Kg4l3vuxYoueJU5nAWSsH9JAT/type/1/page/1
who is friend of mine. His comment was "Ev is a ligature, same as &, and
as such it is not a full first class citizen letter and it cannot have a capital.
In Eastern Armenian it is usually "ԵՎ" although it is logically wrong, as the
ligature is ligature of "եւ".
To cut things short - it is illogical and historically incorrect to write ԵՎ
in his opinion, but that is the way it is done, so we shall write ԵՎ in
Eastern Armenian and ԵՒ in Western.
</quote>
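
The difference between the two conventions can be sketched in Python, whose str.upper() applies the full case mappings of the Unicode data bundled with the interpreter. The function upper_armenian and its eastern flag are hypothetical names for this illustration only; they do not correspond to any existing locale tailoring.

```python
ECH_YIWN = "\u0587"              # ARMENIAN SMALL LIGATURE ECH YIWN
WESTERN_UPPER = "\u0535\u0552"   # ECH + YIWN, the current Unicode mapping
EASTERN_UPPER = "\u0535\u054E"   # ECH + VEW, as reported for the Republic of Armenia


def upper_armenian(text, eastern=False):
    """Hypothetical tailoring: uppercase text, optionally mapping U+0587 to ECH + VEW."""
    if eastern:
        # Replace the ligature before the default mapping can expand it.
        text = text.replace(ECH_YIWN, EASTERN_UPPER)
    return text.upper()


def code_points(s):
    return " ".join(f"U+{ord(c):04X}" for c in s)


print(code_points(upper_armenian(ECH_YIWN)))                # U+0535 U+0552
print(code_points(upper_armenian(ECH_YIWN, eastern=True)))  # U+0535 U+054E
```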

I refer to the current UAX #44.
The annex lacks a syntax for the property types. For example, does
Enumeration (E) resemble a conventional identifier and how about the
underscore and case-sensitiveness? What's the syntax for Numeric (N), etc.?
Also, fields 6, 7, and 8 of the UnicodeData.txt are composed of a
Numeric_Type (E) and a Numeric_Value (N). It is left unspecified
how the two are separated, whether the former is optional and so on.
The Numeric_Type never appears in the file, so I'm wondering if the
provision for it is obsolete or is there for future extensions.
Fields 12, 13, 14 provide simple mappings to a single character.
It is unspecified that the field shall be in the form of a hex code point.
Best regards
MO
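
One possible reading of the fields in question is sketched below in Python. It assumes that Numeric_Type is implied by which of fields 6, 7, and 8 are populated, and that fields 12 to 14 hold a single hex code point when non-empty. The function names are hypothetical, and the sample lines are in UnicodeData.txt format.

```python
from fractions import Fraction

SAMPLES = [
    "0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;",
    "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;",
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]


def numeric_info(line):
    """Assumed interpretation: derive (Numeric_Type, Numeric_Value) from fields 6-8."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        numeric_type = "Decimal"
    elif digit:
        numeric_type = "Digit"
    elif numeric:
        numeric_type = "Numeric"
    else:
        numeric_type = "None"
    # Field 8 is an integer or a rational such as 1/2; Fraction parses both.
    value = Fraction(numeric) if numeric else None
    return numeric_type, value


def simple_uppercase(line):
    """Assumed interpretation: field 12 is empty or one hex code point."""
    field = line.split(";")[12]
    return int(field, 16) if field else None


for sample in SAMPLES:
    print(sample.split(";")[0], numeric_info(sample), simple_uppercase(sample))
```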

According to the Core Spec, section 8.5, page 275, under Numerals,
"Arabic numeric punctuation is used with digits [in Thaana], whether
Arabic or European."
It's not very clear what that text means, but I take "Arabic numeric
punctuation" to mean:
U+066A ARABIC PERCENT SIGN
U+066B ARABIC DECIMAL SEPARATOR
U+066C ARABIC THOUSANDS SEPARATOR
If that is the case and those are indeed used in Thaana, we need to
add these three to ScriptExtensions.txt as:
066A..066C ; Arab Thaa
If not, we need to clarify what the text means.

Please see this email thread for reference: http://www.unicode.org/mail-arch/unicode-ml/y2013-m10/0028.html
The confusables data leaves out certain characters based on the
assumption that they would have been removed by way of NFKC
normalization. However, I argue that may be a dangerous assumption.
Could there be cases where implementations want to detect
confusability but cannot guarantee NFKC normalization?
In another case, implementations may wish to generate confusable
data for testing or other purposes. For example:
http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None
With certain data missing from the equivalence sets, people who rely
on the expertise of the Unicode Consortium may expose their implementations to vulnerability.
My ask with this report is that the confusables data be updated to
include all characters which have a confusable potential even though
they may not fit the profile described in
http://www.unicode.org/reports/tr39/#Identifier_Characters.
Best regards,
Chris Weber
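
To make the concern concrete, the Python sketch below flags characters that NFKC normalization would alter, which are the kind of characters the confusables data assumes have already been removed. The function name nfkc_sensitive and the sample label are made up for illustration, and the per-character check is a simplification that ignores context-dependent normalization.

```python
import unicodedata


def nfkc_sensitive(text):
    """Return the characters in text that NFKC normalization would alter."""
    return [c for c in text if unicodedata.normalize("NFKC", c) != c]


# Second letter is U+FF41 FULLWIDTH LATIN SMALL LETTER A.
label = "p\uFF41ypal"

for c in nfkc_sensitive(label):
    folded = unicodedata.normalize("NFKC", c)
    print(f"U+{ord(c):04X} {unicodedata.name(c)} -> {folded!r}")
```

An implementation that cannot guarantee NFKC input could run a check like this before consulting the confusables data, so that such characters are not silently treated as having no confusable counterpart.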

The NormalizationTest file provided on the website
(http://www.unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt)
seems to be missing one specific kind of pattern for Hangul.
There are no tests that start with a "halfway-composed" Hangul
syllable, i.e. one that uses a LV Hangul syllable followed by a T Hangul Jamo.
In NFD, this LV + T normalizes to L + V + T, which should be
covered by the existing test for LV -> L + V. However, in NFC,
this should normalize to LVT. There is no test that actually
checks this, and there is a potential for errors when working on
non-straightforward implementations (i.e. not going to NFC via NFD).
This actually happened in an implementation I was working on,
and I only discovered the problem through a code walkthrough.
An example entry in the test file to cover this case (without
the comment) would be:
AC00 11A8;AC01;1100 1161 11A8;AC01;1100 1161 11A8
There may not be a need to provide tests for all such cases
(around 10'000), but even having just a single one will catch
some errors that haven't been caught up to now.
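
The proposed test line can be reproduced with Python's unicodedata module, alongside a direct LV + T composition step of the kind a non-NFD-based implementation would need. This is only a sketch: compose_lv_t is a hypothetical helper, and the constants are the standard Hangul syllable arithmetic values.

```python
import unicodedata

S_BASE, T_BASE = 0xAC00, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def compose_lv_t(lv, t):
    """Compose an LV syllable with a trailing consonant jamo, or return None."""
    s_index = ord(lv) - S_BASE
    t_index = ord(t) - T_BASE
    is_lv = 0 <= s_index < L_COUNT * V_COUNT * T_COUNT and s_index % T_COUNT == 0
    if is_lv and 0 < t_index < T_COUNT:
        return chr(ord(lv) + t_index)
    return None


def hexes(s):
    return " ".join(f"{ord(c):04X}" for c in s)


source = "\uAC00\u11A8"   # LV syllable followed by a T jamo
columns = [source] + [
    unicodedata.normalize(form, source) for form in ("NFC", "NFD", "NFKC", "NFKD")
]
print(";".join(hexes(c) for c in columns))
# AC00 11A8;AC01;1100 1161 11A8;AC01;1100 1161 11A8

print(hexes(compose_lv_t("\uAC00", "\u11A8")))   # AC01

# Number of distinct LV + T pairs: 19 * 21 LV syllables times 27 trailing consonants.
print(L_COUNT * V_COUNT * (T_COUNT - 1))         # 10773
```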

In implementing UAX #9 Bidi Algorithm (6.3.0) I encountered a few
issues, some of which may be clarified by tweaked wording in the spec.
1. Section 5.2, X9 modifier, "assign the embedding level to each
formatting character" and "turn it into BN".
Turning it into BN makes sense, but to what "embedding level" is
this referring? They are already at the embedding level that they
are at. As these BNs are ignored in subsequent steps, theoretically
it doesn't matter what embedding level is assigned, so perhaps
this could be removed.
2. BD16: this algorithm makes no mention of a maximum stack depth,
which could lead to implementations diverging. I'd love to see it
capped at max_depth to keep things simple.
2.5. (Also, I completely skipped over the word "canonical" in BD16
originally — mentioning that would be helpful, and even just including
the 2(?) legacy cases would have saved me a bit of time).
Thanks,
Loren

http://www.unicode.org/Public/security/latest/xidmodifications.txt
is still the 6.2 version and has not been updated to include
changes in 6.3.
There are at least two such changes that will affect xidmod: Firstly,
U+180E MONGOLIAN VOWEL SEPARATOR should change from "restricted ;
not-xid" to "restricted ; default-ignorable". This may not make much
practical difference, but more seriously the new U+061C ARABIC
LETTER MARK needs to be added to "restricted ; default-ignorable".
(The other new Bidi control characters are already there as "reserved".)

I noticed an inconsistency between the Code Chart glyph of U+1F12E
and its decomposition. Its decomposition is <0057 005A> ("WZ"), but
its Code Chart glyph suggests <0057 007A> ("Wz").

RESPONSE FROM KEN Whistler, 2013/10/29:
I just ran an extensive back search, and this may have been an error that I made
on May 4, 2009, which was never caught during beta review of the data files.
The Amd 6 post Dublin chart (L2/09-172) had the correct decomposition to
<circle> W z, but there are various anomalies in the process here. The U.S.
ballot comments on FPDAM6, which asked for this, L2/09-082, claimed that
the decomposition was listed in L2/09-034, Karl Pentzlin's proposal document,
but in fact it wasn't. Nor was a decomposition explicitly listed in Germany's
ballot comments. That means that Michel put the decomposition in himself
in the Amd 6 data files. But there seems to be a handoff glitch for Amd 6
data for addition to the draft Unicode 5.2 data I already had lying around
containing Amd 5 data. I can't find my copy of the FDAM 6 names list file,
which ordinarily I would have archived. Instead I see a UnicodeData delta
only, with a manual addition of the decomposition for U+1F12E that I did
on May 4, 2009. I would ordinarily get the decompositions from a combination
of examination of proposals and examination of the FDAM 6 names list annotation entries.
But 4-1/2 years later, I can't recover the exact details of what happened here.
My own handwritten UTC notes from February, 2009 are ambiguous about
whether the "z" was supposed to be uppercase or lowercase, so that might
have been the source of my original error.
At any rate, this error was totally missed in the beta review for Unicode 5.2,
and it has taken 4 years for somebody to report it as a problem. Not sure
whether that deserves a :-) or a :-(
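
For anyone wanting to check which decomposition their own environment carries, a small Python sketch follows. It reads the mapping from the Unicode data bundled with the interpreter, so the result depends on that data's version; no particular output is assumed here.

```python
import unicodedata

CIRCLED_WZ = "\U0001F12E"

# decomposition() returns a string such as "<tag> XXXX XXXX".
raw = unicodedata.decomposition(CIRCLED_WZ)
tag, *hex_cps = raw.split()
mapped = "".join(chr(int(h, 16)) for h in hex_cps)

print(unicodedata.name(CIRCLED_WZ), tag, hex_cps, repr(mapped))
print("matches the chart glyph 'Wz':", mapped == "Wz")
```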

I am a native Malayalam speaker and wish to point out two errors in
the treatment of Malayalam in the Unicode Standard 6.3.
The standard directs
1) the sequence <0D7B, 0D4D, 0D31> to be rendered as "NTA" ‍ന്‍റ
2) the sequence <0D31, 0D4D, 0D31> to be rendered as "TTA" റ്റ
While on the face of it this scheme gives the desired visual result, it is
only as correct or wrong as using <0D7B, 0D4D, 0C67> or <0D7B, 0D4D,
0CE7> for "NTA" ന്‍റ or <0C67, 0D4D, 0C67> or <0CE7, 0D4D, 0CE7>
to represent "TTA" റ്റ.
The "NTA" ‍ ‍ന്‍റ is actually a combination of MALAYALAM LETTER CHILLU N,
0D7B and MALAYALAM LETTER TTTA, 0D3A, though it is written as chillu n
combined with rra, 0D31. It is pronounced similar to the nt of ant.
Similarly, the "TTA" റ്റ is a duplication of MALAYALAM LETTER TTTA,
though it is shown as one rra below the other. It is pronounced similar
to the t of bat, but with more stress.
The reason for this apparent digraph, where the rra represents its
original sound as well as "ttt", is that MALAYALAM LETTER TTTA is never
used singly. It occurs only in these two conjuncts "NTA" ‍ന്‍റ and "TTA" റ്റ.
In native Malayalam words, RRA is not duplicated either. So, the same
curved symbol has been used to represent the "TTTA" occurring in these
conjuncts. This fact is described in the book "Samboorna Malayala Vyakaranam"
by V Ramkumar, published by SISO books, in which the author quotes KeralaPanini.
My suggestion is
1) "NTA" ‍ന്‍റ be defined as a precomposed character that is
decomposable to <0D7B, 0D4D, 0D3A> instead of the current
suggestion of rendering the sequence <0D7B, 0D4D, 0D31> as
"NTA"
2) "TTA" റ്റ be defined as a precomposed character that is decomposable
to <0D3A, 0D4D, 0D3A> instead of the current suggestion of
rendering the sequence <0D31, 0D4D, 0D31> as "TTA"
ajith

The character name for U+2B81, to be added in Unicode 7.0, has a typo.
The actual name in the ISO/IEC 10646:2012 Amd.1 text and the Unicode
7.0 beta files http://www.unicode.org/Public/7.0.0/ucd/UnicodeData-7.0.0d12.txt
is:
UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW
This should be:
UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW
(cf. U+2B83 "DOWNWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF UPWARDS TRIANGLE-HEADED ARROW")
As the actual name is confusing/misleading and makes it difficult for
users to find the character in code charts etc. when searching for e.g.
"ARROW LEFTWARDS OF", I suggest adding a named alias for U+2B81 when
Unicode 7.0 is released.
NamesList.txt:
2B81 UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW
% UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW
NameAliases.txt:
2B81;UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW;correction
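
The effect of the suggested alias on name searches can be sketched in Python. The code parses the NameAliases.txt line proposed above and shows that a substring search for "ARROW LEFTWARDS OF" misses the beta name but finds the alias. The helpers parse_alias and find_by_name are hypothetical, and the tiny index stands in for a real name lookup.

```python
BETA_NAME = "UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW"
ALIAS_LINE = "2B81;UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW;correction"


def parse_alias(line):
    """Split a 'code point;alias;type' line into (int, str, str)."""
    cp_hex, alias, alias_type = line.strip().split(";")
    return int(cp_hex, 16), alias, alias_type


def find_by_name(index, query):
    """Return the code points whose indexed names contain the query string."""
    return sorted({cp for name, cp in index.items() if query in name})


# Index built from the beta name only, then with the proposed alias added.
names_only = {BETA_NAME: 0x2B81}
cp, alias, alias_type = parse_alias(ALIAS_LINE)
with_alias = dict(names_only)
with_alias[alias] = cp

query = "ARROW LEFTWARDS OF"
print([f"U+{c:04X}" for c in find_by_name(names_only, query)])   # []
print([f"U+{c:04X}" for c in find_by_name(with_alias, query)])   # ['U+2B81']
```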