Re: Choice of fonts displaying etc/HELLO

From: Stephen J. Turnbull
Subject: Re: Choice of fonts displaying etc/HELLO
Date: Fri, 08 Aug 2008 04:30:59 +0900

Eli Zaretskii writes:
> > From: Miles Bader <address@hidden>
> > Eli Zaretskii <address@hidden> writes:
> > > I meant would it break something if "\\cj" matched only the Katakana
> > > and Hiragana characters instead of what it matches today?
> >
> > I don't know what it would break, but that doesn't seem like
> > particularly intuitive behavior.
>
> ??? Why not?
Because although Katakana and Hiragana are the only uniquely Japanese
word constituents, the written form of the Japanese language also uses
a set of ideographs (Kanji) borrowed from Chinese, as well as an
idiosyncratic set of symbols (e.g., precomposed Roman numerals and
precomposed multiletter units such as "mm" and "kg"). Since the
admissible set of ideographs is defined by Ministry of Education
standards, the Japanese *set* of Kanji is not the same as the Chinese
*set*, and therefore needs a category of its own. So the Japanese
category should include, at least, Hiragana, Katakana, (Japanese)
Kanji, and the idiosyncratic symbol set.
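As an illustration, such a category could be approximated with Unicode blocks; the Python sketch below is a rough mapping of my own (not Emacs's actual \cj category table), covering the four groups just listed:

```python
def japanese_ish(ch: str) -> bool:
    """Rough Unicode-block approximation of a 'Japanese' category:
    Hiragana, Katakana, Kanji, and squared unit symbols like ㎜ and ㎏.
    (Precomposed Roman numerals and other symbols are omitted here.)"""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F     # Hiragana
        or 0x30A0 <= cp <= 0x30FF  # Katakana
        or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs (Kanji)
        or 0x3300 <= cp <= 0x33FF  # CJK Compatibility (units: ㎜, ㎏, ...)
    )

print([japanese_ish(c) for c in "あカ漢㎏Я"])  # → [True, True, True, True, False]
```

Note that Cyrillic Я falls outside this sketch, which is exactly the behavior being argued for.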
> > I think emacs' concept of characters belonging to multiple language
> > categories is pretty neat actually.
>
> Maybe I'm missing something, but I don't see how the fact that, say,
> Cyrillic characters are claimed to belong to Japanese category could
> be considered ``neat''.
It's not considered "neat" that Cyrillic is (in old Mule) considered
to be Japanese, at least not by me. However, I do think it's useful,
at least, that the Hanzi (several varieties of Chinese) overlap the
Kanji (Japanese versions of same) and Hanja (Korean version).
Similarly for the accented characters that are used by Spanish and
French alike (although they don't use the same set, there is some
overlap), etc, etc. I suppose that's what Miles meant?
Now, that inclusion of Cyrillic in Japanese is due to the fact that
with a character set size of nearly 10,000 and an official list of
about 6000 characters needed for daily use, the Japanese decided that
a more or less universal character set would be a good idea, so they
added Cyrillic, Greek, and a number of math symbols, as well as a
bunch of other scripts and "stuff". In the old Mule encoding I
suppose the \cX categories were implemented basically by looking at
the leading byte, and so if Cyrillic were encoded according to the JIS
standard it would get included in \cj; if it were encoded according to
ISO 8859/5, it would not be included in \cj. (That's true for XEmacs;
Handa-san is of course authoritative for Emacs.)
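Python's iso2022_jp codec makes the point concrete: because Cyrillic and Greek letters sit inside JIS X 0208, they round-trip through the Japanese encoding alongside Hiragana and Kanji. (A quick illustration, nothing Emacs-specific.)

```python
# Cyrillic А and Greek Ω are JIS X 0208 characters, so the Japanese
# ISO-2022-JP encoding accepts them just like あ and 漢.
for ch in "あ漢АΩ":
    data = ch.encode("iso2022_jp")
    assert data.startswith(b"\x1b$B")       # escape into JIS X 0208
    assert data.decode("iso2022_jp") == ch  # lossless round-trip
```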
While I think it is worth the pain to clean up this inelegant
inclusion of Greek, Cyrillic, etc in Japanese (among other things,
"native" fonts can be used instead of typically ugly fonts designed by
foreigners), it probably will break user applications. E.g., I can
imagine an MUA that does things like check for \([[:ascii:]]\|\cj\)*
to see if a message could be encoded in MIME charset ISO-2022-JP. (I
don't know if any of the mainstream MUAs do that, though.)
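The kind of MUA heuristic imagined above can be sketched without the regexp at all, by simply asking the codec whether the body fits. This is a hypothetical helper of my own, not taken from any actual mail client:

```python
def pick_mime_charset(body: str) -> str:
    # Hypothetical heuristic: use ISO-2022-JP when the whole body
    # fits in it (ASCII + JIS X 0201/0208), otherwise fall back to UTF-8.
    try:
        body.encode("iso2022_jp")
        return "iso-2022-jp"
    except UnicodeEncodeError:
        return "utf-8"

print(pick_mime_charset("Hello, こんにちは"))  # → iso-2022-jp
print(pick_mime_charset("naïve ☕"))           # → utf-8
```

An Emacs regexp like \([[:ascii:]]\|\cj\)* is effectively a category-table version of the same test, which is why narrowing \cj could change which messages such an MUA tags as ISO-2022-JP.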