Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Tom Lord
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 13:54:58 -0700 (PDT)

> From: Andrew Suffield <address@hidden>
> On Sat, Aug 28, 2004 at 12:20:05PM -0700, Tom Lord wrote:
> > Is it worth it? Is it worth a brisk but gentle revolution away from
> > UTF-8-everywhere in order to lay down a foundation of software that is
> > good for all humans?
> Probably not, if only because character encoding is not a solved
> problem, and at this point, it's more or less inevitable that unicode
> is going to have to change lots more before it's a realistic option.
Ultimately, "ya plunks down yur nickle and yas takes yur chances".
Unicode has *not* solved (and *does not claim to* have solved) the
global character set problem --- this seems to be a widespread
misunderstanding in spite of a series of clarifications and Unicode
spec revisions.
What Unicode *has* done, and it's plain for any hacker willing to read
up on languages to see, is to create:
a) a set of very sane guiding design principles
b) a very good (far from perfect) current set of characters
c) a very good (far from perfect) political process for managing the mess
In other words, while they haven't solved the problem, they've
correctly said what people should do to have the problem solved (and
have even done *much* of the work themselves).
You can't transcribe Chaucer perfectly into Unicode. My
understanding is that many classics in other languages suffer a
similar fate.
In some sense, this analogy holds:
English : ASCII :: human languages in general : Unicode
except that I don't mean to invoke, with that analogy, a *political*
comparison of Unicode to ASCII. (There may or may not be such an
analogy.... I wouldn't know.)
> Trying to tackle that one right now can only result in a solution that
> will be hopelessly broken in a few years, simply because the problem
> wasn't even properly understood at the time.
Stop talking through your hat about whose favorite text, name, or
whatever can or can not be encoded.
For hackerlab and other software: those questions are immaterial.
For the software, the questions are more along the lines of:
* what are the encoding rules for codepoints in strings?
* what are the semantics of "composite characters"?
etc. (a very long list)
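To make those two questions concrete, here is a small sketch. It is in
Python rather than anything hackerlab actually uses, and the helper names
(code_points, encoded_sizes, canonically_equal) are made up for
illustration. It only shows that one visible character can be one or two
codepoints, that the byte count depends on the encoding form, and that
comparing such strings needs a normalization step.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one visible character.
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE: the same character as a single codepoint.
precomposed = "\u00e9"


def code_points(s):
    """List the codepoints of a string in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]


def encoded_sizes(s):
    """Byte length of the same codepoints under three encoding forms."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(code_points(decomposed), encoded_sizes(decomposed))
print(code_points(precomposed), encoded_sizes(precomposed))
print(decomposed == precomposed, canonically_equal(decomposed, precomposed))
```

Neither question depends on whether any particular character has been
assigned; they are properties of the model.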
Unicode critics are often focussed on just one single question:
* is my favorite character there?
and the answer to that question is infinitely malleable, in Unicode,
via an orderly political process with very well designed interfaces
to the impact of that question on the software itself.
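As a rough illustration of why the "is my character there?" question has
so little impact on code: the encoding rules are defined over codepoint
numbers, not over the list of assigned characters. The sketch below is
Python, not hackerlab code, and utf8_encode is a hypothetical name. It
hand-rolls the UTF-8 rules and never consults a character table.

```python
def utf8_encode(cp):
    """Encode one codepoint number as UTF-8 bytes.

    Illustrative only: the rules depend on the numeric range of cp,
    never on whether a character has been assigned to it.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate codepoints are not encodable in UTF-8")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Works the same for a long-assigned letter and a private-use codepoint.
print(utf8_encode(0x41), utf8_encode(0x10FFFD))
```

A character added to the standard next year would go through the same
function unchanged.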
> There are plenty of languages for which character sets don't even
> exist yet. It's almost certain that at least one of them will break
> some fundamental assumption.
Because you, asuffield, know more than the various linguists and
software hackers who have plotted out the Unicode design rules?
I agree that their work deserves scrutiny but all of *my* probes and
all of those I've heard reported (even from you) reaffirm that they
are doing a good job and that we hackers can, at the very least,
focus on just the abstract model (treading lightly on the particular
set of assigned characters).
If the political process behind assigning codepoints in Unicode fails:
so what? Meanwhile, the basic design (independent of most particular
assignments) is just great --- so embrace that part, at least, and if
we need something other than Unicode later, odds are very, very good
that very little code will need to change.
-t