On Fri, Sep 17, 2010 at 5:43 AM, Martin (gzlist) <gzlist at googlemail.com> wrote:
> In the example I gave, 十 encodes in CP932 as '\x8f\\', and the
> function gets confused by the second byte. Obviously the right answer
> there is just to use unicode, rather than write a function that works
> with weird multibyte codecs.
That does make it clear that "ASCII superset" is an inaccurate term -
a better phrase is "ASCII compatible", since that correctly includes
multibyte codecs like UTF-8 which explicitly ensure that the byte
values in multibyte characters are all outside the 0x00 to 0x7F range
of ASCII.
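To make the distinction concrete, here is a small Python 3 sketch. The helper name ascii_range_bytes is made up for illustration and is not part of any proposed API. It lists which bytes of an encoded character land in the ASCII range, using the character from Martin's example.

```python
def ascii_range_bytes(data):
    """Return the byte values in data that fall in the ASCII range 0x00-0x7F."""
    # Iterating over a bytes object in Python 3 yields integers.
    return [b for b in data if b <= 0x7F]

ten = "\u5341"  # the character 十 from Martin's example

cp932 = ten.encode("cp932")
utf8 = ten.encode("utf-8")

# CP932: the trail byte is 0x5C, which is also ASCII backslash.
print(cp932, ascii_range_bytes(cp932))  # b'\x8f\\' [92]

# UTF-8: every byte of a multibyte character is >= 0x80.
print(utf8, ascii_range_bytes(utf8))    # b'\xe5\x8d\x81' []
```

An empty list for every non-ASCII character is what "ASCII compatible" means here: a byte-level search for an ASCII character can never match in the middle of a multibyte sequence.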
So the domain of any polymorphic text manipulation functions we define would be:
- Unicode strings
- byte sequences where the encoding is either:
  - a single byte ASCII superset (e.g. iso-8859-*, cp1252, koi8*, mac*)
  - an ASCII compatible multibyte encoding (e.g. UTF-8, EUC-JP)
Passing in byte sequences that are encoded using an ASCII incompatible
multibyte encoding (e.g. CP932, UTF-7, UTF-16, UTF-32, Shift-JIS,
Big5, iso-2022-*, EUC-CN/KR/TW) or a single byte encoding that is not
an ASCII superset (e.g. EBCDIC) will have undefined results.
I think that's still a big enough win to be worth doing, particularly
as more and more of the other variable width multibyte encodings are
phased out in favour of UTF-8.
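As a rough illustration of that domain, here is a minimal Python 3 sketch of a polymorphic function. The name split_on_backslash is hypothetical and only meant to show the idea, not a proposed API. It picks a str or bytes separator based on the argument type, which works for Unicode strings and UTF-8 bytes but silently goes wrong for CP932 bytes.

```python
def split_on_backslash(text):
    """Split a str or bytes value on backslash (hypothetical helper)."""
    # Choose a separator of the same type as the input.
    sep = "\\" if isinstance(text, str) else b"\\"
    return text.split(sep)

name = "dir\\\u5341"  # "dir", a backslash, then the character 十

# Unicode string: two parts, as intended.
as_str = split_on_backslash(name)

# UTF-8 bytes: same two parts, each still decodable.
as_utf8 = split_on_backslash(name.encode("utf-8"))

# CP932 bytes: the trail byte 0x5C of 十 is mistaken for a separator,
# so we get three parts and a dangling lead byte.
as_cp932 = split_on_backslash(name.encode("cp932"))

print(as_str)
print([part.decode("utf-8") for part in as_utf8])
print(as_cp932)
```

The CP932 result contains the lone lead byte b'\x8f', which can no longer be decoded on its own. That is the kind of "undefined results" meant above.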
Cheers,
Nick.
P.S. Hey Barry, is there anyone at Canonical you can poke about
https://bugs.launchpad.net/xorg-server/+bug/531208? Tinkering with
this stuff on Kubuntu would be significantly less annoying if I could
easily type arbitrary Unicode characters into Konsole ;)
--
Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia