Rod Chamberlin <rod@querix.co.uk> writes:
>Does anybody know of any work that has been done on lexical analysis of
>multibyte character streams.

Quintus Prolog supported multibyte characters a little over 10 years
ago. The systems where it did so used a coding where 0xxxxxxx was an
ASCII code and any multibyte sequence was made up of 1xxxxxxx bytes,
so we just "cheated" and said that those codes were all letters.
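To make the "cheat" concrete, here is a minimal sketch of an identifier
scanner that works on raw bytes and treats every byte with the high bit
set as a letter. The names is_letter_byte and scan_identifier are made up
for illustration; this is not Quintus code, and the treatment of "_" and
digits is an assumption.

```python
def is_letter_byte(b: int) -> bool:
    """True for ASCII letters, '_', and any byte with the high bit set."""
    if b >= 0x80:
        # Assumption: every 1xxxxxxx byte belongs to a multibyte character,
        # and all such characters are simply classified as letters.
        return True
    return chr(b).isalpha() or b == ord("_")


def scan_identifier(src: bytes, pos: int = 0) -> bytes:
    """Return the longest identifier starting at pos (empty if there is none)."""
    end = pos
    while end < len(src):
        b = src[end]
        is_digit = 0x30 <= b <= 0x39
        # Digits may continue an identifier but may not start one.
        if is_letter_byte(b) or (end > pos and is_digit):
            end += 1
        else:
            break
    return src[pos:end]
```

The scanner never decodes anything, so it works with any coding that has
the stated property (ASCII in 0xxxxxxx, everything else in 1xxxxxxx bytes).
For example, scan_identifier("naïve_1 = 2".encode("utf-8")) stops at the
space and returns the bytes of "naïve_1".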

The Plan 9 system has a C compiler that works directly with the 8-bit
code stream.

The free Gambit-C Scheme system supports 16-bit characters and accepts
a range of codings for them.

C++ and C9x follow Java's lead in saying that the input is notionally
wide characters: \uXXXX is a 16-bit character and \UXXXXXXXX is a 32-bit
character, and either may appear in identifiers and strings. If nothing
else, this is a multibyte coding.
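One way to picture this is as a translation pass that replaces each
escape with the character it names before the lexer proper sees the
text. The sketch below does that; expand_ucns is a made-up name, and
real compilers need not work this way. It deliberately ignores details
such as an escaped backslash in front of the "u".

```python
import re

# \u followed by exactly 4 hex digits, or \U followed by exactly 8.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def expand_ucns(text: str) -> str:
    """Replace \\uXXXX and \\UXXXXXXXX escapes with the characters they name."""
    def replace(match: re.Match) -> str:
        digits = match.group(1) or match.group(2)
        # chr() raises ValueError for values above 0x10FFFF; a real
        # front end would report a diagnostic instead.
        return chr(int(digits, 16))

    return _UCN.sub(replace, text)
```

After this pass, an identifier written as caf\u00E9 and one written with
the character itself look the same to the rest of the lexer.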

IBM's mainframe compilers for PL/I and Fortran accepted 16-bit characters
in comments and strings years ago, and still do.

The Unicode book gives the rules for what numbers and identifiers should
look like if you support Unicode.
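
As a rough illustration of what a rule of that kind looks like, the
sketch below classifies characters by Unicode general category: letters
and letter-like numbers may start an identifier, and combining marks,
decimal digits and connector punctuation may continue one. The exact
category sets are my assumption, not a quotation from the book, so check
the current Unicode text before relying on them.

```python
import unicodedata

# Assumed category sets; the names START and EXTEND are illustrative.
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
EXTEND = START | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier(s: str) -> bool:
    """True if s starts with a START character followed only by EXTEND ones."""
    if not s:
        return False
    if unicodedata.category(s[0]) not in START:
        return False
    return all(unicodedata.category(c) in EXTEND for c in s[1:])
```

Under these sets "_" is connector punctuation, so it may continue an
identifier but not begin one; a language that wants leading underscores
has to add that as its own exception.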
--
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.