L2/11-163
Date: Sat Apr 30 13:28:54 CDT 2011
Name: Tom Christiansen (tchrist@perl.com)
Subject: PRI179 feedback regarding case-insensitivity
I would like to voice my dissent over the proposal to withdraw the
recommendation for doing full case-insensitive matching in RL2.4
Default Loose Matches.
I believe that this recommendation first became official in tr18-6 released
on 2002-04-21. Because of that recommendation, the 5.8.0 release of Perl
on 2002-06-01 implemented full case folding for case-insensitive matches.
In the nearly nine years since then, users have come to expect
this behavior, and it would be a severe hardship to withdraw it from them
now. If the Unicode Standard makes no provision for permitting the
recommended behavior that is almost nine years old now, you will put us in
a hard place.
It is true that we have had bugs in the handling of full case mappings, but
we have worked hard to eliminate those. One particular place that these
bugs had been a problem for us was in square-bracketed character classes,
although these are now fixed. I will discuss those momentarily, but first
I wish to draw attention to the places where it is very important that full
case folding on case-insensitive matches be supported.
I believe that your example using "ß" U+00DF is insufficient to motivate people
to implement full case folding. This is both because the Latin script has
comparatively few code points with a full 1:Many case mapping, and also
because most of those are there so that round-trip conversion to and from
legacy repertoires will preserve ligatures like FF and FFI. There are
only 16 code points in the Latin script with 1:Many case mappings, and
there are 6 such code points in the Armenian script.
In contrast, the Greek script has 81 such code points, and it is therefore
in Greek that the issue arises most frequently. Please consider these strings:
lowercase: "ᾲ στο διάολο"
titlecase: "Ὰͅ Στο Διάολο"
uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ"
lowercase: "\x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}
\x{3BF}\x{3BB}\x{3BF}"
titlecase: "\x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}
\x{3AC}\x{3BF}\x{3BB}\x{3BF}"
uppercase: "\x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}
\x{386}\x{39F}\x{39B}\x{39F}"
lowercase: "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}
\N{GREEK SMALL LETTER SIGMA}
\N{GREEK SMALL LETTER TAU}
\N{GREEK SMALL LETTER OMICRON}
\N{GREEK SMALL LETTER DELTA}
\N{GREEK SMALL LETTER IOTA}
\N{GREEK SMALL LETTER ALPHA WITH TONOS}
\N{GREEK SMALL LETTER OMICRON}
\N{GREEK SMALL LETTER LAMDA}
\N{GREEK SMALL LETTER OMICRON}"
titlecase: "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}
\N{COMBINING GREEK YPOGEGRAMMENI}
\N{GREEK CAPITAL LETTER SIGMA}
\N{GREEK SMALL LETTER TAU}
\N{GREEK SMALL LETTER OMICRON}
\N{GREEK CAPITAL LETTER DELTA}
\N{GREEK SMALL LETTER IOTA}
\N{GREEK SMALL LETTER ALPHA WITH TONOS}
\N{GREEK SMALL LETTER OMICRON}
\N{GREEK SMALL LETTER LAMDA}
\N{GREEK SMALL LETTER OMICRON}"
uppercase: "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}
\N{GREEK CAPITAL LETTER IOTA}
\N{GREEK CAPITAL LETTER SIGMA}
\N{GREEK CAPITAL LETTER TAU}
\N{GREEK CAPITAL LETTER OMICRON}
\N{GREEK CAPITAL LETTER DELTA}
\N{GREEK CAPITAL LETTER IOTA}
\N{GREEK CAPITAL LETTER ALPHA WITH TONOS}
\N{GREEK CAPITAL LETTER OMICRON}
\N{GREEK CAPITAL LETTER LAMDA}
\N{GREEK CAPITAL LETTER OMICRON}"
A user making a case-insensitive match of /^ᾲ/i, which I will here indicate with
a trailing /i to mean an embedded (?i), will certainly expect all three of those
versions to be matched. This remains true no matter what case the string or the
pattern.
lowercase w/lowercase: "ᾲ στο διάολο" =~ /^ᾲ/i
lowercase w/titlecase: "ᾲ στο διάολο" =~ /^Ὰͅ/i
lowercase w/uppercase: "ᾲ στο διάολο" =~ /^ᾺΙ/i
titlecase w/lowercase: "Ὰͅ Στο Διάολο" =~ /^ᾲ/i
titlecase w/titlecase: "Ὰͅ Στο Διάολο" =~ /^Ὰͅ/i
titlecase w/uppercase: "Ὰͅ Στο Διάολο" =~ /^ᾺΙ/i
uppercase w/lowercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^ᾲ/i
uppercase w/titlecase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^Ὰͅ/i
uppercase w/uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" =~ /^ᾺΙ/i
And indeed, in Perl all 9 of those match. Furthermore, the charclass
negations also all correctly fail to match (where !~ is the negation of =~):
lowercase: "ᾲ στο διάολο" !~ /^[^ᾲ]/i
titlecase: "Ὰͅ Στο Διάολο" !~ /^[^ᾲ]/i
uppercase: "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ" !~ /^[^ᾲ]/i
Those are all true because those strings indeed all begin with
a case-mapping of "ᾲ", so to say it doesn't start with something
that is not that code point is a true assertion.
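The nine positive matches listed earlier can be checked with a short script.
This is only a sketch: it assumes Perl 5.16 or later (for fc), the hash and
variable names are illustrative, and the negated character classes are left
out because their handling of multichar folds may differ between releases.

```perl
use strict;
use warnings;
use feature 'fc';    # assumption: Perl 5.16 or later

# The three casings of the leading letter, written as escapes.
my %form = (
    lowercase => "\x{1FB2}",           # alpha with varia and ypogegrammeni
    titlecase => "\x{1FBA}\x{345}",    # capital alpha with varia + combining ypogegrammeni
    uppercase => "\x{1FBA}\x{399}",    # capital alpha with varia + capital iota
);
my $rest = " \x{3C3}\x{3C4}\x{3BF}";   # the same tail for every test string

# Try every string casing against every pattern casing.
my $matches = 0;
for my $s (sort keys %form) {
    for my $p (sort keys %form) {
        my $ok = ("$form{$s}$rest" =~ /^$form{$p}/i) ? 1 : 0;
        $matches += $ok;
        printf "%s w/%s: %s\n", $s, $p, $ok ? "match" : "no match";
    }
}

# Count the distinct full case folds among the three forms.
my %folds = map { fc($_) => 1 } values %form;
my $distinct_folds = keys %folds;
print "matches: $matches, distinct folds: $distinct_folds\n";
```

The three forms can only match one another if they share a single full case
fold, which is what the final count checks.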
Based upon conversations I have had with people who actually handle Greek
text, I believe that this functionality is just as important to them as
matching both "Apple" and "apple" with /^a/i is to us. It seems culturally
unfair to deny users of the Greek script the same convenience in matching
that users of the Latin script enjoy.
That said, I would like to draw your attention to two different problems
that arise due to full case mappings, or multichar folds as they are
sometimes called. Both relate to user expectations that a square-bracketed
character class specifying single code points will always *match* a single
code point. Under full case mapping, it may not. And this can cause problems.
The first problem is that unlike lookaheads, lookbehinds in regexes are
often implemented such that a fixed-size string is specified. Therefore,
while (?<=[abc]) is permitted, (?<=[abc]+) is not. When matching case
insensitively, character classes that seem to specify only single code
points *can* become variable in length, and thus forbidden from lookbehinds
under most albeit not all implementations.
This is not limited to bracketed character classes, although that is usually
where it shows up, because a user has unknowingly included something in the
charclass that has a multichar fold. To demo without charclasses, this:
% perl -cwe '/(?<=\xDF)/'
compiles fine, but making it case-insensitive causes a compilation error:
% perl -cwe '/(?<=\xDF)/i'
Variable length lookbehind not implemented in regex m/(?<=\xDF)/ at -e line 1.
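To see how a given perl treats these two patterns without aborting the whole
program, the compilation can be wrapped in eval. This is a sketch only:
try_compile is a hypothetical helper, and the report depends on the Perl
version, since newer releases experimentally accept limited variable-length
lookbehind.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a pattern at run time and report the outcome.
sub try_compile {
    my ($source, $flags) = @_;
    my $pattern = $flags ? "(?$flags)$source" : $source;
    local $SIG{__WARN__} = sub { };    # silence any "experimental" warnings
    my $re = eval { qr/$pattern/ };
    return "compiles" if defined $re;
    (my $error = $@) =~ s/\s+\z//;     # trim the trailing newline
    return "fails: $error";
}

print "case-sensitive:   ", try_compile('(?<=\xDF)', ''),  "\n";
print "case-insensitive: ", try_compile('(?<=\xDF)', 'i'), "\n";
```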
The second problem with multichar folds in case-insensitive matches is that
many patterns which were written with an 8-bit mindset get transferred to
Unicode unmodified. So a pattern like [^\x80-\xFF], which was equivalent to
[\x00-\x7F], is now equivalent to [\x00-\x7F\x{100}-\x{10FFFF}].
In each pair of patterns below, the first is matched case sensitively and the
second case insensitively. I believe that users will be confused by the results
of the last two of the three case-insensitive matches.
No: "dress" =~ /[^\x00-\x7F]/
No: "dress" =~ /[^\x00-\x7F]/i
No: "dress" =~ /[\x80-\xFF]/
!! Yes: "dress" =~ /[\x80-\xFF]/i
Yes: "dress" =~ /^[^\x80-\xFF]+$/
!! No: "dress" =~ /^[^\x80-\xFF]+$/i
This happens because of the following:
No: "dress" =~ /\xDF/
Yes: "dress" =~ /\xDF/i
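The root cause can be reproduced directly. The sketch below assumes Perl 5.16
or later, so that "use v5.16" turns on fc and Unicode string semantics. The
results for the bracketed classes are printed rather than assumed, because how
an engine treats multichar folds inside ranges and negated classes may differ
between implementations and versions.

```perl
use v5.16;           # assumption: enables 'fc', 'say' and Unicode string semantics
use warnings;

my $sharp_s_fold  = fc("\xDF");                      # full case fold of U+00DF
my $literal_match = ("dress" =~ /\xDF/i) ? 1 : 0;    # U+00DF written outside a class

say "fold of U+00DF is 'ss': ", ($sharp_s_fold eq "ss" ? "yes" : "no");
say "'dress' =~ /\\xDF/i: ",     ($literal_match ? "yes" : "no");

# Bracketed classes: report whatever this perl does with them.
for my $re (qr/[\x80-\xFF]/i, qr/^[^\x80-\xFF]+$/i) {
    say "'dress' =~ ", $re, " : ", ("dress" =~ $re ? "yes" : "no");
}
```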
Although it might be argued that one should not allow multichar folds to
happen in character classes, my earlier Greek example shows that they *must*.
I do not believe you can appease both groups at once. It is possible that a
regex flag regarding simple-vs-full case mapping might help, but then you have
to decide on reasonable defaults. Perhaps a reasonable default is simple only,
so that users needing multichar folds can specify that. But they may not know
to do so.
It is a difficult task to educate users about the pitfalls of ASCII-minded
patterns applied to Unicode. They do not understand the two "!!" matches
given above. Even when they are brought to understand these, they invariably
consider them "wrong". That's because they are thinking in terms of sets
and set-complements when they see bracketed character classes. To them, since
there is clearly no \xDF in "dress", it is unreasonable that a pattern which
asserts the absence of anything in \x80-\xFF, as /^[^\x80-\xFF]+$/i does,
should turn around and claim that it found something that isn't there.
The question also arises about what you do with back references. Can
/(ᾲ)/i match two (or more) code points for group 1, even though only
one was specified? Apparently it must.
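A small sketch of that capture question, using the uppercase string from the
earlier example. The variable names are illustrative.

```perl
use strict;
use warnings;

# First five characters of the uppercase string, written as escapes.
my $subject = "\x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F}";

my $captured_length = -1;
if ($subject =~ /(\x{1FB2})/i) {      # the group holds a single code point
    $captured_length = length $1;      # code points actually captured
}
print "group 1 captured $captured_length code point(s)\n";
```

Any later use of $1 or \1 then refers to however many code points were
actually captured, not to the one written in the pattern.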
Finally, I would like to suggest that case-insensitive matching as it is
currently defined is considerably less useful in practice than it should
be. The purpose of case-insensitive matching is to allow a shorthand form
to spare the user from having to enumerate all possible variations of the
same letter. However, apart from a *VERY* few rules such as for ANGSTROM
SIGN, MICRO SIGN, and KELVIN SIGN, and the familiar but painful LATIN SMALL
LETTER SHARP S, you really cannot do that.
For example, even though these are all considered the same letter at the
primary collation strength, they do not match one another case insensitively:
d U+0064 LATIN SMALL LETTER D
ð U+00F0 LATIN SMALL LETTER ETH
ꝺ U+A77A LATIN SMALL LETTER INSULAR D
ｄ U+FF44 FULLWIDTH LATIN SMALL LETTER D
Given that *they are all the same letter according to the UCA*, I submit that
they *should* (be able to) match each other, case-insensitively.
Furthermore, while the last of those has a K decomposition to the first
of them, the middle two do not.
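The contrast between case-insensitive matching and primary-strength
comparison can be sketched with the Unicode::Collate module that ships with
Perl. The results for ETH and INSULAR D are printed rather than assumed, since
they depend on the version of the default collation table bundled with the
module. The labels and variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Level 1 compares primary weights only, ignoring accents and case.
my $primary = Unicode::Collate->new(level => 1);

my %d_variants = (
    "U+00F0 ETH"         => "\x{F0}",
    "U+A77A INSULAR D"   => "\x{A77A}",
    "U+FF44 FULLWIDTH D" => "\x{FF44}",
);

for my $name (sort keys %d_variants) {
    my $char = $d_variants{$name};
    printf "%-20s /d/i: %-3s  primary-equal to d: %s\n", $name,
        ($char =~ /d/i              ? "yes" : "no"),
        ($primary->eq($char, "d")   ? "yes" : "no");
}
```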
The same thing happens with an "s":
s U+0073 LATIN SMALL LETTER S
ſ U+017F LATIN SMALL LETTER LONG S
ꞅ U+A785 LATIN SMALL LETTER INSULAR S
ｓ U+FF53 FULLWIDTH LATIN SMALL LETTER S
The third of those does not count as an "s" when matched case insensitively,
but the second does. Again, there is no decomposition that will get
you to something that tests as "s".
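A sketch of those two tests, case-insensitive matching and compatibility
decomposition, applied to the three non-ASCII variants. It uses the core
Unicode::Normalize module. The labels and hash names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

my %s_variants = (
    "U+017F LONG S"      => "\x{17F}",
    "U+A785 INSULAR S"   => "\x{A785}",
    "U+FF53 FULLWIDTH S" => "\x{FF53}",
);

my %result;
for my $name (sort keys %s_variants) {
    my $char = $s_variants{$name};
    $result{$name} = {
        folds_to_s      => ($char =~ /^s$/i    ? 1 : 0),   # case-insensitive test
        decomposes_to_s => (NFKD($char) eq "s" ? 1 : 0),   # compatibility decomposition
    };
    printf "%-20s /s/i: %-3s  NFKD gives s: %s\n", $name,
        ($result{$name}{folds_to_s}      ? "yes" : "no"),
        ($result{$name}{decomposes_to_s} ? "yes" : "no");
}
```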
This is a problem with many letters. Imagine that you want to match LATIN
SMALL LETTER O no matter what sort of combining marks follow it. Old code
in ISO 8859-1 may have used [óòôöõø] for that, but with all the precomposed
characters like ō, not to mention the possibility of arbitrary
combining marks, that won't work. So you would think that it would be
enough to write
NFD($string) =~ /(?=o)\pM*/
but it is not. That's because not all code points that are considered the
same letter as "o" according to the UCA have a decomposition
that actually starts with "o"!
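The gap can be seen with three of the characters mentioned above. This is a
sketch using the core Unicode::Normalize module and a plain /o\pM*/ pattern.
The labels and hash names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my %o_like = (
    "U+00F6 O WITH DIAERESIS" => "\x{F6}",
    "U+014D O WITH MACRON"    => "\x{14D}",
    "U+00F8 O WITH STROKE"    => "\x{F8}",
);

my %found;
for my $name (sort keys %o_like) {
    # Decompose first, then look for a base "o" followed by any marks.
    $found{$name} = (NFD($o_like{$name}) =~ /o\pM*/) ? 1 : 0;
    printf "%-24s found by /o\\pM*/ after NFD: %s\n", $name,
        $found{$name} ? "yes" : "no";
}
```

U+00F8 has no canonical decomposition, so NFD leaves it unchanged and the
pattern never sees a base "o".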
You have the same problem with many other letters. I can easily produce
a comprehensive list of these.
I am aware of
RL3.4 Tailored Loose Matches
To meet this requirement, an implementation shall provide for loose
matches based on a locale's collation order, with at least 3 levels.
and that would *appear* to resolve the issue. However, I don't believe it
does. First, one should not have to go to the highest possible level of
Unicode regex support merely to achieve this basic functionality that is in
practice so very needed by so many (and it is!). Furthermore, RL3.4
mentions only locales. One should be able to apply the default UCA without
dragging messy locales into it.
I therefore propose that UCA matching without locales be made a Level 2
requirement, and that Level-3 be reserved for locales, since it necessarily
requires tailoring support and plain UCA support should not require such.
Note that UCA matching, at least at the primary strength, solves your
vexing problem of canonical equivalence. This is another reason that UCA
primary strength comparison should be moved into Level 2. What users of
case-insensitive matching truly want is to be able to compare whether things
are the same letter IN THE UCA SENSE, without respect to casing or accent
marks. They should be able to get at that easily, and under the current
requirements, they cannot do so.
Tom Christiansen
tchrist@perl.com