>Are there any versions of lex and/or yacc that are capable of
>accepting double-byte character streams as input?
>
>Michael O'Leary
>moleary@primus.com
>[Yacc doesn't process text, it processes tokens so the character set isn't
>much of an issue. Lex is much harder, since all the versions of lex that
>I know use 256 byte dispatch tables indexed by character code. This came
>up a year ago, suggestions included the plan 9 versions and re2c. -John]

I was asking about this around a year ago. I found that, depending on
the language, you may be able to perform a kludge. I haven't seen any
description of this kludge elsewhere, so here it is.

The language I was working with allowed Unicode characters to appear
in strings, identifiers and comments. Significantly, none of the
keywords required anything beyond the ASCII character set, and the lex
specification didn't care *which* non-ASCII character it was.
(I believe Java also satisfies these criteria.)

In the end, I modified lex's prototype file to generate a lexer
which:

* Stored the incoming Unicode characters in "yyUnicodeText", a
parallel (but wider) data structure to "yytext".
* Stored the 7-bit equivalents of the Unicode characters in the
standard yytext. Where no equivalent code existed, it substituted
CTRL-A (an otherwise unused code).

The lexer specification could then treat all of the non-ASCII
characters identically, catching them all by catching the CTRL-A
character.

If the actual token is required by a parser action, it can be
extracted from yyUnicodeText, rather than yytext.
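
A minimal sketch of the two parallel buffers, assuming 16-bit
characters. Only yytext and yyUnicodeText come from the prototype
described above; fold_to_ascii and NON_ASCII_MARK are names invented
here for illustration, and the real prototype may well have been
organised differently.

```c
#include <assert.h>
#include <stddef.h>

#define NON_ASCII_MARK '\001'  /* CTRL-A stands in for any non-ASCII character */

/* Hypothetical helper: copy n characters from src into wide (the role
 * of yyUnicodeText) and write a 7-bit shadow into narrow (the role of
 * yytext). Both output buffers must hold n + 1 elements. */
static void fold_to_ascii(const unsigned short *src, size_t n,
                          unsigned short *wide, char *narrow)
{
    assert(src != NULL && wide != NULL && narrow != NULL);
    for (size_t i = 0; i < n; i++) {
        wide[i] = src[i];   /* keep the original character */
        narrow[i] = (src[i] < 0x80) ? (char)src[i] : NON_ASCII_MARK;
    }
    wide[n] = 0;
    narrow[n] = '\0';
}
```

Because the two buffers share indices, an action that matched
characters i..j of the narrow buffer can read the same range of the
wide buffer to recover the real text.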

Other details to consider:
* Unicode has many "space" characters - should it be legal to separate
tokens with them? I converted them all to CTRL-B, instead of CTRL-A,
so I could detect them.
* How should a legitimate CTRL-A (or CTRL-B) in the text be handled?
* Use #defines in the prototype, so Unicode support can be turned off
and on?
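
The first and third points might look something like the sketch
below. The names UNICODE_SUPPORT, WIDE_SPACE_MARK, is_wide_space and
fold_char are hypothetical, and the list of space characters is
deliberately incomplete. The second point is left open: a literal
CTRL-A or CTRL-B simply passes through here, so it is
indistinguishable from a folded character unless the action checks
the wide buffer.

```c
#include <assert.h>

#define UNICODE_SUPPORT 1       /* set to 0 to compile the folding out */
#define NON_ASCII_MARK  '\001'  /* CTRL-A: any other non-ASCII character */
#define WIDE_SPACE_MARK '\002'  /* CTRL-B: a non-ASCII space character */

/* Illustrative, incomplete list of Unicode space characters. */
static int is_wide_space(unsigned int c)
{
    return c == 0x00A0                    /* NO-BREAK SPACE */
        || (c >= 0x2000 && c <= 0x200A)   /* EN QUAD .. HAIR SPACE */
        || c == 0x3000;                   /* IDEOGRAPHIC SPACE */
}

/* Hypothetical helper: the 7-bit code the lexer sees for character c. */
static char fold_char(unsigned int c)
{
#if UNICODE_SUPPORT
    if (c < 0x80)
        return (char)c;   /* ASCII passes through, literal CTRL-A/B included */
    return is_wide_space(c) ? WIDE_SPACE_MARK : NON_ASCII_MARK;
#else
    assert(c < 0x100);    /* without Unicode support, assume 8-bit input */
    return (char)c;
#endif
}
```

The lexer specification can then list CTRL-B alongside blank and tab
in its whitespace rule, or leave it out if such spaces should be an
error.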

I wrote this prototype for the project, and gave it some perfunctory
testing, but then the requirement for Unicode support was postponed,
and finally the project folded. These ideas have not been tested in
the wild.

Julian Orbach
Australian Centre for Unisys Software
[Someone else pointed out that if you use multibyte encodings rather than
wide characters, more often than not a regular 8 bit lexer will do the
right things. -John]
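
To illustrate the moderator's point with one particular encoding (my
choice, not his): in UTF-8 every byte of a non-ASCII character is
0x80 or above, so an 8-bit scanner that accepts all such bytes never
mistakes part of a character for an ASCII delimiter. A rough
hand-written equivalent of a lex class like [A-Za-z_\200-\377],
where ident_length is a hypothetical helper:

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the identifier at the start
 * of s. Any byte >= 0x80 is accepted, which in UTF-8 covers every byte
 * of every non-ASCII character. Digits are allowed after the first byte. */
static size_t ident_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (;;) {
        unsigned char b = (unsigned char)s[n];
        int ok = b >= 0x80 || b == '_' || isalpha(b)
              || (n > 0 && isdigit(b));
        if (!ok)
            break;
        n++;
    }
    return n;
}
```

This does not validate the encoding, and it does not carry over to
multibyte encodings whose trailing bytes can fall in the ASCII range.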