On 01/14/2011 01:52 PM, Daniel Gibson wrote:
> Am 14.01.2011 07:26, schrieb Nick Sabalausky:
>> "Andrei Alexandrescu"<SeeWebsiteForEmail@erdani.org> wrote in message
>> news:igoj6s$17r6$1@digitalmars.com...
>>>
>>> I'm not so sure about that. What do you base this assessment on? Denis
>>> wrote a library that according to him does grapheme-related stuff nobody
>>> else does. So apparently graphemes is not what people care about
>>> (although
>>> it might be what they should care about).
>>>
>>
>> It's what they want, they just don't know it.
>>
>> Graphemes are what many people *think* code points are.
>>
>
> Agreed. Up until spir mentioned graphemes in this newsgroup I always
> thought that one Unicode code point == one character on the screen.
>
> I guess in the majority of use cases you want to operate on user
> perceived characters.
That's what makes sense for the user in 99.9% of cases, thus that's what
makes sense for the programmer, and thus that's what makes sense for the
language/type/library designer.
denis
_________________
vita es estrany
spir.wikidot.com

On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>
> * I don't even know how to make a grapheme that is more than one
> code-unit, let alone more than one code-point :) Every time I try, I
> get 'invalid utf sequence'.
>
> I feel significantly ignorant on this issue, and I'm slowly getting
> enough knowledge to join the discussion, but being a dumb American who
> only speaks English, I have a hard time grasping how this shit all works.
1. See my text at
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
2.
writeln ("A\u0308\u0330");
<A + diaeresis above + tilde below> (U+0308 is the combining diaeresis,
U+0330 the combining tilde below)
If it does not display properly, either set your terminal to UTF-8 or use
a more unicode-aware font (eg the DejaVu series).
The point is not to play around with Unicode's flexibility like that.
Rather, composite characters are just normal thingies in most languages of
the world. Actually, on this point, English is a rare exception (discarding
letters imported from foreign languages like the French 'à'); to the point
of being, I guess, the only Western language without any diacritic.
Denis
_________________
vita es estrany
spir.wikidot.com
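[Editor's aside: what that writeln example produces can be inspected with Python's unicodedata module; this is a quick sketch to illustrate the Unicode data itself, nothing D-specific. The string is three code points but a single user-perceived character.]

```python
import unicodedata

s = "A\u0308\u0330"  # base letter + two combining marks

print(len(s))  # 3 code points in the string

# Only the first code point is a base character; the other two are
# combining marks (non-zero canonical combining class).
for cp in s:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)} "
          f"(combining class {unicodedata.combining(cp)})")
```

Run it and the marks come out as COMBINING DIAERESIS (class 230, drawn above) and COMBINING TILDE BELOW (class 220, drawn below), which is why both stack onto the single 'A'.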

On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
> On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>>
>> * I don't even know how to make a grapheme that is more than one
>> code-unit, let alone more than one code-point :) Every time I try, I
>> get 'invalid utf sequence'.
>>
>> I feel significantly ignorant on this issue, and I'm slowly getting
>> enough knowledge to join the discussion, but being a dumb American who
>> only speaks English, I have a hard time grasping how this shit all
>> works.
>
> 1. See my text at
> https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
I can't read that document, it's black background with super-dark-grey
text.
> 2.
> writeln ("A\u0308\u0330");
> <A + diaeresis above + tilde below>
> If it does not display properly, either set your terminal to UTF-8 or use
> a more unicode-aware font (eg the DejaVu series).
OK, I'll have to remember this so I can use it to test my string type ;)
> The point is not to play around with Unicode's flexibility like that.
> Rather, composite characters are just normal thingies in most languages of
> the world. Actually, on this point, English is a rare exception (discarding
> letters imported from foreign languages like the French 'à'); to the point
> of being, I guess, the only Western language without any diacritic.
Is it common to have multiple modifiers on a single character? The
problem I see with using decomposed canonical form for strings is that we
would have to return a dchar[] for each 'element', which severely
complicates code that, for instance, only expects to handle English.
I was hoping to lazily transform a string into its composed canonical
form, allowing the (hopefully rare) exception when a composed character
does not exist. My thinking was that this at least gives a useful string
representation for 90% of usages, leaving the remaining 10% of usages to
find a more complex representation (like your Text type). If we only get
like 20% or 30% there by making dchar the element type, then we haven't
made it useful enough.
Either way, we need a string type that can be compared canonically for
things like searches or opEquals.
-Steve
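[Editor's aside: Steve's lazy-composition idea can be sketched with Python's unicodedata, illustrating the Unicode normalization algorithm rather than any existing D API. NFC folds a decomposed sequence into a single code point whenever a precomposed character exists, and leaves the sequence alone when none does.]

```python
import unicodedata

# Decomposed 'é': 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
assert composed == "\u00e9"      # the precomposed é...
assert len(composed) == 1        # ...is a single code point

# The (hopefully rare) exception: no precomposed 'q with diaeresis'
# exists, so NFC has to leave the two code points as they are.
odd = "q\u0308"
assert unicodedata.normalize("NFC", odd) == odd
```

So an NFC-normalized view would indeed cover the common case, with multi-code-point graphemes only surviving where Unicode never assigned a precomposed character.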

On 2011-01-13 23:23:10 -0500, Andrei Alexandrescu
<SeeWebsiteForEmail@erdani.org> said:
> On 1/13/11 7:09 PM, Michel Fortin wrote:
>> That's forgetting that most of the time people care about graphemes
>> (user-perceived characters), not code points.
>
> I'm not so sure about that. What do you base this assessment on? Denis
> wrote a library that according to him does grapheme-related stuff
> nobody else does. So apparently graphemes is not what people care about
> (although it might be what they should care about).
Apple implemented all these things in the NSString class in Cocoa. They
did all this work on Unicode at the beginning of Mac OS X, at a time
when making such changes wouldn't break anything.
It's a hard thing to change later, when you have code that depends on
the old behaviour. It's a complicated matter, and not so many people
understand the issues, so it's no wonder many languages just deal with
code points.
> This might be a good time to see whether we need to address graphemes
> systematically. Could you please post a few links that would educate me
> and others in the mysteries of combining characters?
As usual, Wikipedia offers a good summary and a couple of references.
Here's the part about combining characters:
<http://en.wikipedia.org/wiki/Combining_character>.
There are basically four ranges of combining code points:
- Combining Diacritical Marks (0300–036F)
- Combining Diacritical Marks Supplement (1DC0–1DFF)
- Combining Diacritical Marks for Symbols (20D0–20FF)
- Combining Half Marks (FE20–FE2F)
A code point followed by one or more code points in these ranges is
conceptually a single character (a grapheme).
But for comparing strings correctly, you need to determine canonical
equivalence. Wikipedia describes it in its Unicode normalization article
<http://en.wikipedia.org/wiki/Unicode_normalization>. The full
algorithm specification can be found here:
<http://unicode.org/reports/tr15/>. The canonical form
has both a composed and a decomposed variant, the first using
pre-combined characters when possible, the second not using any
pre-combined character. Combining marks are not the only concern: there
are also a few single-code-point characters which have a duplicate
somewhere else in the code point table.
Also, there are two normalizations: the canonical one (described above)
and the compatibility one, which is more lax (making the ligature "ﬂ"
equivalent to "fl", for instance). If a user searches for some
text in a document, it's probably better to search using the
compatibility normalization, so that "flower" (without ligature) and
"ﬂower" (with ligature) can match each other. If you want to search
case-insensitively, then you'll need to implement the collation
algorithm, but that's going further.
If you're wondering which direction to take, this official FAQ seems
like a good resource (especially the first few questions):
<http://www.unicode.org/faq/normalization.html>
One important thing to note is that most of the time, strings already
come in the normalized pre-composed form, so the normalization
algorithm should be optimized for the case where it has nothing to do.
That's what section 1.3, Description of the Normalization Algorithm,
says in the specification:
<http://www.unicode.org/reports/tr15/#Description_Norm>.
--
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
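[Editor's aside: the canonical-vs-compatibility distinction is easy to see with Python's unicodedata; this just illustrates the TR15 algorithms, not a D design.]

```python
import unicodedata

lig = "\ufb02ower"  # "ﬂower" written with the U+FB02 'fl' ligature

# Canonical normalization treats the ligature as a distinct character...
assert unicodedata.normalize("NFC", lig) == lig
# ...while compatibility normalization folds it to plain letters,
# which is what user-facing search wants.
assert unicodedata.normalize("NFKC", lig) == "flower"

# Canonical equivalence: composed and decomposed spellings of the same
# character normalize to the same thing.
assert unicodedata.normalize("NFD", "\u00e9") == "e\u0301"
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"
```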

On 2011-01-14 01:44:19 -0500, "Nick Sabalausky" <a@a.a> said:
> "Andrei Alexandrescu" <SeeWebsiteForEmail@erdani.org> wrote in message
> news:igoqrm$1n5r$1@digitalmars.com...
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
>
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.
Correct, there are a lot of combinations with no pre-combined form. This
should be no surprise, given that you can apply any number of combining
marks to any character.
mythical 7 with an umlaut: 7̈
mythical 7 with umlaut, ring above, and acute accent: 7̈̊́
I can't guarantee your news reader will display the above correctly, but
it works as described in mine (Unison on Mac OS X). In fact, it should
work in all Cocoa-based applications. This probably includes iOS-based
devices too, but I haven't tested there.
--
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
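[Editor's aside: the "7 with umlaut" makes a nice test case precisely because Unicode never assigned it a precomposed code point. A Python check of that claim, purely illustrative:]

```python
import unicodedata

seven = "7\u0308"  # digit seven + combining diaeresis

# No precomposed form exists, so even NFC keeps it at two code points.
assert unicodedata.normalize("NFC", seven) == seven

# Piling on more marks still yields one grapheme: every code point after
# the base is a combining mark (non-zero combining class).
fancy = "7\u0308\u030a\u0301"  # diaeresis, ring above, acute accent
assert all(unicodedata.combining(c) > 0 for c in fancy[1:])
```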

On 2011-01-14 09:34:55 -0500, "Steven Schveighoffer"
<schveiguy@yahoo.com> said:
> On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir@gmail.com> wrote:
>
>> The point is not playing like that with Unicode flexibility. Rather
>> that composite characters are just normal thingies in most languages
>> of the world. Actually, on this point, English is a rare exception
>> (discarding letters imported from foreign languages like the French 'à');
>> to the point of being, I guess, the only Western language without any
>> diacritic.
>
> Is it common to have multiple modifiers on a single character?
Not to my knowledge. But I rarely deal with non-Latin texts; there are
probably some scripts out there that take advantage of this.
> The problem I see with using decomposed canonical form for strings is
> that we would have to return a dchar[] for each 'element', which
> severely complicates code that, for instance, only expects to handle
> English.
Actually, returning a sliced char[] or wchar[] could also be valid.
User-perceived characters are basically a substring of one or more code
points. I'm not sure it complicates the semantics of the language that
much -- what's complicated about writing str.front == "a" instead
of str.front == 'a'? -- although it probably would complicate the
generated code and make it a little slower.
In the case of NSString in Cocoa, you can only access the 'characters'
in their UTF-16 form. But everything from comparison to searching for a
substring is done using graphemes. It's as if they implemented
specialized Unicode-aware algorithms for these functions; there's
nothing generic about how it handles graphemes.
I'm not sure yet about what would be the right approach for D.
> I was hoping to lazily transform a string into its composed canonical
> form, allowing the (hopefully rare) exception when a composed character
> does not exist. My thinking was that this at least gives a useful
> string representation for 90% of usages, leaving the remaining 10% of
> usages to find a more complex representation (like your Text type).
> If we only get like 20% or 30% there by making dchar the element type,
> then we haven't made it useful enough.
>
> Either way, we need a string type that can be compared canonically for
> things like searches or opEquals.
I wonder if normalized string comparison shouldn't be built directly into
the char[], wchar[] and dchar[] types instead. Also bring in the idea above
that iterating on a string would yield graphemes as char[], and this
code would work perfectly irrespective of whether you used combining
characters:

foreach (grapheme; "exposé") {
    if (grapheme == "é")
        break;
}

I think a good standard for evaluating our handling of Unicode is to see
how easy it is to do things the right way. In the above, foreach would
slice the string grapheme by grapheme, and the == operator would
perform a normalized comparison. While it works correctly, it's
probably not the most efficient way to do things, however.
--
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/
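[Editor's aside: the proposed foreach loop can be approximated with a naive segmenter. Here is a Python sketch; real grapheme clustering follows Unicode's UAX #29 rules and handles more cases (Hangul jamo, prepended marks), so treat this as an illustration only.]

```python
import unicodedata

def graphemes(s):
    """Naive splitter: a base code point plus any immediately
    following combining marks forms one cluster."""
    cluster = ""
    for cp in s:
        if cluster and unicodedata.combining(cp) == 0:
            yield cluster       # a new base character starts a new cluster
            cluster = ""
        cluster += cp
    if cluster:
        yield cluster

def grapheme_eq(a, b):
    """Normalized comparison, as the == in the foreach example would do."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Works whether the é is precomposed or spelled 'e' + combining acute:
assert any(grapheme_eq(g, "\u00e9") for g in graphemes("expose\u0301"))
assert list(graphemes("expose\u0301"))[-1] == "e\u0301"  # two-code-point grapheme
```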

"spir" <denis.spir@gmail.com> wrote in message
news:mailman.619.1295012086.4748.digitalmars-d@puremagic.com...
>
> If anyone finds a pointer to such an explanation, bravo, and thank you.
> (You will certainly not find it in Unicode literature, for instance.)
> Nick's explanation below is good and concise. (Just 2 notes added.)
Yea, most Unicode explanations seem to talk all about "code-units vs
code-points" and then they'll just have a brief note like "There's also
other things like digraphs and combining codes." And that'll be all they
mention.
You're right about the Unicode literature. It's the usual standards-body
documentation, same as the W3C: "Instead of only some people understanding
how this works, let's encode the documentation in legalese (and have twenty
only-slightly-different versions) to make sure that nobody understands how
it works."
> You can also say there are 2 kinds of characters: simple like "u" &
> composite "ü" or "ü??". The former are coded with a single (base) code,
> the latter with one (rarely more) base code and an arbitrary number of
> combining codes.
A couple of questions about the "more than one base code" case:
- Do you know an example offhand?
- Does that mean a ligature where the base codes form a single glyph,
or does it mean that the combining code either spans or operates over
multiple glyphs? Or can it go either way?
> For a majority of _common_ characters made of 2 or 3 codes (Western
> language letters, Korean Hangul syllables,...), precombined codes have
> been added to the set. Thus, they can be coded with a single code like
> simple characters.
>
Out of curiosity, how do decomposed Hangul characters work? (Or do you
know?) Not actually knowing any Korean, my understanding is that they're a
set of 1 to 4 phonetic glyphs that are then combined into one glyph. So, is
it like a series of base codes that automatically combine, or are there
combining characters involved?
> [Also note, to avoid things be too simple ;-), some (few) combining codes
> called "prepend" come _before_ the base in raw code sequence...]
>
Fun!
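[Editor's aside on the Hangul question above: the precomposed syllables decompose into conjoining jamo, which are a series of base codes that combine on their own, with no combining characters involved; NFC reassembles them. A Python check of one syllable:]

```python
import unicodedata

syllable = "\ud55c"  # 한 (HAN), a single precomposed code point

# NFD splits it into its phonetic jamo: HIEUH + A + final NIEUN.
jamo = unicodedata.normalize("NFD", syllable)
assert jamo == "\u1112\u1161\u11ab"  # three conjoining jamo code points

# NFC reassembles the single syllable code point; for Hangul the mapping
# is computed arithmetically rather than looked up in a table.
assert unicodedata.normalize("NFC", jamo) == syllable
```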

"spir" <denis.spir@gmail.com> wrote in message
news:mailman.624.1295013588.4748.digitalmars-d@puremagic.com...
>
> If it does not display properly, either set your terminal to UTF* or use a
> more unicode-aware font (eg DejaVu series).
>
How to do that on the Windows (XP) command prompt, for anyone who doesn't
know:
Step 1:
Right-click the title bar, "Properties", "Font" tab, set the font to "Lucida
Console". (It'll look weird at first, but you get used to it.)
Step 2 (I had to google this step):
For just the current terminal session: Run "chcp 65001" (ie "CHange Code
Page"). Also, you can run "chcp" on its own to see what codepage you're
already set to.
To make it work permanently: Put "chcp 65001" into the registry key
"HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"

"Nick Sabalausky" <a@a.a> wrote in message
news:igq9u6$1bqu$1@digitalmars.com...
>
> Step 2 (I had to google this step):
>
> For just the current terminal session: Run "chcp 65001" (ie "CHange Code
> Page"). Also, you can run "chcp" on its own to see what codepage you're
> already set to.
>
> To make it work permanently: Put "chcp 65001" into the registry key
> "HKEY_LOCAL_MACHINE\Software\Microsoft\Command Processor\Autorun"
>
Forget that step 2, that causes "Active code page: 65001" to be sent to
stdout *every* time system() is invoked. We shouldn't be relying on that.
*This* is what should be done (and this really should be done in all D
command line apps - or better yet, put into the runtime):
import std.stdio;

version(Windows)
{
    import std.c.windows.windows;
    extern(Windows) export BOOL SetConsoleOutputCP(UINT);
}

void main()
{
    version(Windows) SetConsoleOutputCP(65001);
    writeln("HuG says: Fukken Über Death Terminal");
}
See also: http://d.puremagic.com/issues/show_bug.cgi?id=1448