I have made my feelings on the matter of "what is a character"
clear in several different discussion threads on Unicode and
characters that have taken place in the SRFI lists. Now I see
it coming up again.
For the sake of posterity and the standardization process, I'll
reiterate the outline once more; but I won't go into exhaustive
detail about this again.
I feel that Unicode "default grapheme clusters" (DCG's) map more
closely to what users call "characters" than codepoints do. In
the interests of keeping the abstractions used by the programmer
as close as possible to the abstractions used by ordinary users,
I therefore support defining Scheme characters as DCG's.
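A minimal sketch of the distinction, written in Python rather than Scheme purely for convenience. The `clusters` helper is hypothetical and only approximates default grapheme clusters: it attaches combining marks to the preceding base codepoint and ignores the other cases (Hangul jamo, CR LF, joiners) that the real segmentation rules cover.

```python
import unicodedata

def clusters(text):
    """Simplified stand-in for default grapheme clusters:
    a base codepoint plus any combining marks that follow it."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch      # combining mark: extend the current cluster
        else:
            units.append(ch)     # anything else starts a new cluster
    return units

# "resume" with both accents written as combining acute accents (U+0301)
decomposed = "re\u0301sume\u0301"

print(len(decomposed))            # codepoints: 8
print(len(clusters(decomposed)))  # clusters, i.e. what a reader counts: 6
```

A reader asked to count the characters in that word would say six, which is the cluster count, not the codepoint count.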
Consideration of other points only reinforces this opinion,
because the definition has several other advantages besides
letting users and programmers communicate clearly and without
mistakes about what the other means.
The first technical advantage is that if the units are DCG's,
then ordinary string operations that treat characters as atomic
leave DCG's unseparated. That is, when I take substrings
at arbitrary indexes of characters and append them to create
a new string, I am in no danger of having a substring that
begins with a combining codepoint which, when appended to
another substring, may create a DCG that did not exist in
either string. Nor is there danger of separating a combining
codepoint from the end of the substring, resulting in a
"substring" that ends with a DCG that did not exist in the
original string. Considering DCG's as characters naturally
gives string operations such as "substring" and "append" the
Unicode-independent semantics I consider appropriate.
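The substring hazard can be sketched as follows. This is an illustration in Python under stated assumptions, not a Scheme implementation: `clusters` is a hypothetical helper that groups a base codepoint with its trailing combining marks, a simplification of the real default grapheme cluster rules.

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clustering: base codepoint + following marks."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

first = "xa"
second = "e\u0301y"            # e + combining acute, then y

# Codepoint-indexed substring: the tail begins with the combining accent.
tail = second[1:]              # "\u0301y"
joined = first + tail          # the accent now lands on the "a"
print(clusters(joined))        # contains an accented "a" found in neither input

# Cluster-indexed substring: the accent stays with its base letter.
safe_joined = clusters(first) + clusters(second)[1:]
print(safe_joined)             # ['x', 'a', 'y']
```

In the codepoint version the result contains an accented "a" that neither input contained; in the cluster version every unit of the result already existed in one of the inputs.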
Another technical advantage is that adding an accent or other
combining codepoint to a character is semantically different
from creating a string of two characters - as it should be.
A third technical advantage is that with the sole exceptions
of eszett and the deprecated ligature characters, changes in
case do not change string length. Furthermore, by use of the
"ligating joiner" character to form ligatures in the other case,
even the deprecated ligature characters can be converted in case
with preservation of string length. This means that over 99% of
the world never has to deal with the possibility that a string
will change length on casing operations, which makes this
particular source of errors much rarer.
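A small sketch of how case conversion affects length at the two levels. It relies on Python's built-in `str.upper`, which applies Unicode's full case mappings, and on a hypothetical `clusters` helper that only approximates default grapheme clusters (base codepoint plus trailing combining marks).

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clustering: base codepoint + following marks."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

samples = [
    "e\u0301",       # e + combining acute accent
    "\u01f0",        # j with caron (no precomposed uppercase form exists)
    "stra\u00dfe",   # contains eszett
    "\ufb01n",       # starts with the "fi" ligature character
]

for word in samples:
    upper = word.upper()
    # columns: uppercased text, codepoints before/after, clusters before/after
    print(repr(upper), len(word), len(upper),
          len(clusters(word)), len(clusters(upper)))
```

Under these assumptions, the j-with-caron case grows from one codepoint to two when uppercased but remains a single cluster, while the eszett and ligature samples each gain one unit at both levels, which matches the exceptions named above.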
A fourth technical advantage is that it's "future proof."
There is still dispute about Unicode's appropriateness,
particularly for Asian scripts, and it is reasonable to presume
that Unicode is no more the Last Encoding Ever than ASCII was.
Unicode has several disadvantages such as the use of elephantine
tables for simple operations and the interspersal of dissimilar
character types throughout the codespaces. Indeed, it appears
to be accumulated rather than designed - the mark of a "second
system" standard that eventually gets overturned by something
more deeply consistent.
There is still good reason to use encoding systems that are
not Unicode in many places, there are still millions of Asian
characters (mostly proper names for places and things) that
Unicode cannot and will not represent, and the use of other
encodings besides Unicode is inevitable. I do not want the
semantics of the programming language tied to the idiosyncrasies
of Unicode's particular encoding and representation. The
character-as-grapheme-cluster is an abstraction of "character"
as the concept people actually use, rather than an abstraction
of the means we use to represent characters. In other words, it
supports a concept of "character" that is vastly more portable
among different encodings and vastly more amenable to the kind
of string handling that people in languages not well served by
Unicode will inevitably do anyway.
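As a rough illustration of the portability point, the same word stored in two different encodings and two different representations yields the same sequence of characters once it is viewed as clusters. The sketch is in Python; `clusters` and `characters` are hypothetical helpers, the clustering is a simplification, and the comparison goes through Python's own Unicode strings and NFC normalization only because that is what the standard library offers.

```python
import unicodedata

def clusters(text):
    """Simplified grapheme clustering: base codepoint + following marks."""
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def characters(raw, encoding):
    """Decode bytes, split into clusters, and normalize each cluster
    so that canonically equivalent spellings compare equal."""
    return [unicodedata.normalize("NFC", unit)
            for unit in clusters(raw.decode(encoding))]

latin1_bytes = b"caf\xe9"                     # Latin-1: accented e is one byte
utf8_bytes = "cafe\u0301".encode("utf-8")     # UTF-8: e + combining acute

# The codepoint counts differ between the two sources ...
print(len(latin1_bytes.decode("latin-1")), len(utf8_bytes.decode("utf-8")))
# ... but the cluster-level view of the text is the same.
print(characters(latin1_bytes, "latin-1") == characters(utf8_bytes, "utf-8"))
```

Code written against the cluster-level view does not need to know which of the two sources the text came from.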
The fifth technical advantage is where the burden of
implementation lies. If all the grapheme-cluster handling
is part of the language, the implementor has to do it once.
If all the language supports is codepoints, then application
programmers have to do it dozens of times or hundreds of
times. And every line of code is first an opportunity to
make a mistake, second a duplication of effort, and third
a source of code-level incompatibilities when some routines
use codepoint strings and other routines assume DCG strings.
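To make the "do it once" argument concrete, here is a minimal sketch of a string type whose elements are clusters. It is written in Python for illustration; `GraphemeString` is a hypothetical name, not an existing library or a proposed Scheme API, and its clustering rule (base codepoint plus trailing combining marks) is a deliberate simplification.

```python
import unicodedata

class GraphemeString:
    """Hypothetical string type indexed by (simplified) grapheme clusters."""

    def __init__(self, text=""):
        self._units = []
        for ch in text:
            if self._units and unicodedata.category(ch).startswith("M"):
                self._units[-1] += ch     # attach mark to current cluster
            else:
                self._units.append(ch)    # start a new cluster

    @classmethod
    def _from_units(cls, units):
        obj = cls()
        obj._units = list(units)
        return obj

    def __len__(self):
        return len(self._units)

    def __getitem__(self, index):
        if isinstance(index, slice):      # substring, in cluster units
            return GraphemeString._from_units(self._units[index])
        return self._units[index]         # a single "character"

    def __add__(self, other):             # append
        return GraphemeString._from_units(self._units + other._units)

    def __str__(self):
        return "".join(self._units)

word = GraphemeString("re\u0301sume\u0301")
print(len(word))                  # 6
print(str(word[:2]) == "re\u0301")  # True: the accent stays with its letter
```

Once the segmentation lives inside the type, application code calls length, substring, and append without repeating the cluster logic at every call site.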
Scheme already has a history of abstracting objects of
non-uniform lengths; Scheme code does not, for example,
have to care about whether a particular integer is a bignum
or not. I cannot think of a good reason to back away from
this approach when dealing with characters.
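The bignum analogy can be shown with a short sketch. Python is used here as a stand-in for Scheme because its integers behave the same way in this respect; the `factorial` function is just an illustrative example.

```python
def factorial(n):
    """Plain integer arithmetic; nothing here mentions operand size."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

small = factorial(10)   # fits comfortably in a machine word
large = factorial(30)   # needs more than 64 bits

print(small, small.bit_length())
print(large, large.bit_length())
```

The same function handles both results, and the caller never has to ask which representation is in use. The argument above is that characters deserve the same treatment.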
Anyway, if you want to look at other opinions from me about
Unicode, just check the SRFI archives; whatever objections
you want to raise, I've probably answered them already several
times and I'm just not going to go there again. This message
is a summary and also a notice that the topic has already
been thrashed out in other threads.
It may come down to the simple fact that we disagree about
what is valuable in character handling routines. That's okay.
We can disagree.
Bear