Basically, how does the ^^^... notation work in LuaTeX and XeTeX, exactly?

In 8-bit TeX engines (recent TeX, eTeX, pdfTeX, at least), two consecutive identical catcode 7 characters (typically ^), followed by two lowercase hexadecimal digits, are converted before the tokenization step to the corresponding byte. Namely, ^^6f is exactly equivalent to o: for instance, \sh^^6fw ^^6f will cause TeX to show the letter o.
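
As a minimal plain TeX sketch of the two-digit form (the choice of \message to display the result is only an illustration, not something from the question), the following can be run with an 8-bit engine such as pdfTeX:

```tex
% Plain TeX sketch; the ^^xx pairs are replaced before tokenization.
\message{^^6f^^6b}   % TeX reads this line as \message{ok}
\sh^^6fw ^^6f        % read as \show o, so TeX reports: the letter o
\bye
```

The replacement also happens inside control sequence names, which is why \sh^^6fw is read as \show.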

There is also the notation with two ^ (identical catcode 7 characters) followed by any ASCII character (when that character is not the first of two lowercase hexadecimal digits). It is replaced by the character whose code is obtained by adding 64 to the character code (if it is below 64) or subtracting 64 from it (otherwise), so that the result stays among the ASCII characters (range 0 to 127).
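
A small plain TeX sketch of the shift by 64 (the three characters chosen here are only examples, and \message is used just to display the codes):

```tex
% Plain TeX sketch of the ^^c form: the code of c is shifted by 64.
\message{\number`\^^M}  % M is 77, so ^^M is 77 - 64 = 13 (carriage return)
\message{\number`\^^@}  % @ is 64, so ^^@ is 64 - 64 = 0
\message{\number`\^^?}  % ? is 63, so ^^? is 63 + 64 = 127 (delete)
\bye
```

Writing \^^M rather than a bare ^^M keeps the character from acting with its usual category code (end of line, or invalid for ^^?), so only its character code is queried.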

Unicode-aware engines (I'm thinking of LuaTeX and XeTeX; there are perhaps other, less well-known ones around) also provide ^^^^xxxx and ^^^^^xxxxx for characters whose hexadecimal representation has 4 or 5 digits. But this does not seem to be done in the same way across engines.
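
A minimal sketch of the four-caret form, assuming the plain format and a Unicode engine (xetex or luatex); only the four-digit case is shown, since that is the one on which both engines agree:

```tex
% Sketch for Unicode engines only (xetex or luatex, plain format).
\message{^^^^0041}  % four carets plus four hex digits: U+0041, the letter A
\message{^^^^00e9}  % U+00E9
\bye
```

As with the two-digit form, the hexadecimal digits have to be lowercase.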

For instance, both LuaTeX and XeTeX appear to accept the notation with 4, 5, or 6 carets followed by the same number of hexadecimal digits, but XeTeX also accepts it for 3, while LuaTeX doesn't. Compiling the following with pdfTeX, LuaTeX, and XeTeX gives different results.

My goal (there may be a better way to do this) is to provide a way to test whether passing a given list of tokens through \scantokens is safe. For that, my plan is to go through the \detokenized token list one character at a time, applying TeX's rule for tokenizing (but no need to fully tokenize), and detecting begin-group and end-group tokens, as well as invalid characters.

The \show ^^^^^10101 is obviously a bug. It seems that XeTeX is accessing hash memory: if I say \show ^^^^^11015, the answer is the character XeTeXintercharclass. The last code that shows this bizarre behavior (on my installation) is 11017. But if I say \def\foo{}, then \show ^^^^^11018 produces the character foo.
– egreg, Jul 9 '12 at 17:24

@egreg: I think XeTeX is in fact accessing the string memory (not that I know enough about the internals to be sure), since \show ^^^^^10001 and nearby values show various parts of TeX error messages. Sadly, I don't think this can be used for any useful purpose (if the bug affected more than just the \meaning, we could have hoped to be able to run the code for primitives: I tried making characters active, to no avail).
– Bruno Le Floch, Jul 10 '12 at 11:13

Yes, string memory. Probably an error in the implementation of the "five ^" notation.
– egreg, Jul 10 '12 at 11:15

@egreg: not in the "five ^ notation" directly, since \lccode42="10101\lowercase{\show *} also shows the same \meaning.
– Bruno Le Floch, Jul 10 '12 at 11:18

In the case of TeX/e-TeX/pdfTeX and LuaTeX, the two superscript characters 44 are
followed by the two hexadecimal digits 44; the result is the letter D (character code 0x44, decimal 68), and the {a} that follows gives the variable "a" in math mode. LuaTeX does not
see four superscript characters, because they are not followed by four hexadecimal
digits.

XeTeX first sees four superscript characters. But they are not followed by
four hexadecimal digits. It switches to the case ^^c, where two superscript
characters are followed by a non-hexadecimal character. The result is "t" (0x74
= 0x34 ('4') + 64). The fourth "4" is then treated as a superscript that
raises the following {a}. But c is "4", a hexadecimal digit, so XeTeX should have
applied the case ^^xx. Therefore I consider this behaviour a bug.
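
The test input itself is not shown here, so the following is only a guess at a reproduction: it assumes that the digit 4 was given catcode 7, so that 4444 plays the role of the superscript characters and the hexadecimal digits at the same time.

```tex
% Hypothetical reproduction; the catcode assignment is an assumption.
\begingroup
\catcode`\4=7  % the digit 4 now acts as a superscript character
$4444{a}$      % read as D{a} by 8-bit TeX, per the ^^xx rule
\endgroup
\bye
```

With TeX, e-TeX, pdfTeX, and LuaTeX this should typeset "Da" as described above, while the XeTeX version discussed here would typeset "t" with a raised "a".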

(Edit: Correction for next paragraph, 65536 is correct and 256 was wrong — I had
looked at the wrong section in the web change file.)

Also, the problem with \show is indeed a bug. The character is printed by calling the procedure print with its character code as argument. If this code is less than biggest_char, the character is printed; otherwise the code is interpreted
as a string number, and the string with that number is printed instead.
The definition of biggest_char:

@d biggest_char=65536 {the largest allowed character number;
must be |<=max_quarterword|}

Characters ≤ U+FFFF are shown correctly; the characters affected are
those beyond. This can be used for debugging
the string pool ⌣:

This is XeTeX, Version 3.1415926-2.4-0.9998 (MiKTeX 2.9) (INITEX)
(C:\Users\one\test\test-xetex-strings.tex
[0: .4]
[1: .9998]
[2: buffer size]
[3: pool size]
[4: number of strings]
[5: ???]
[6: m2d5c2l5x2v5i]
[7: End of file on the terminal!]
[8: ! ]
[9: (That makes 100 errors; please try again.)]
[10: ? ]
[11: Type <return> to proceed, S to scroll future error messages,]
[12: R to run without stopping, Q to run quietly,]
[13: I to insert something, ]
[14: E to edit your file,]
[15: 1 or ... or 9 to ignore the next 1 to 9 tokens of input,]
[16: H for help, X to quit.]
[17: OK, entering ]
[18: batchmode]
[19: nonstopmode]
[20: scrollmode]
[21: ...]
[22: insert>]
[23: I have just deleted some text, as you asked.]
[24: You can now delete more, or insert, or whatever.]
[25: Sorry, I don't know how to help in this situation.]
[26: Maybe you should try asking a human?]
[27: Sorry, I already gave what help I could...]
[28: An error might have occurred before I noticed any problems.]
[29: ``If all else fails, read the instructions.'']
[30: (]
[31: Emergency stop]
[32: TeX capacity exceeded, sorry []
[33: If you really absolutely need more capacity,]
[34: you can ask a wizard to enlarge me.]
[35: This can't happen (]
[36: I'm broken. Please show this to someone who can fix can fix]
[37: I can't go on meeting you like this]
[38: One of your faux pas seems to have wounded me deeply...]
[39: in fact, I'm barely conscious. Please fix it and try again.]
==> 1381 strings available.
)
No pages of output.
Transcript written on test-xetex-strings.log.

@PatrickGundlach You can retroactively add a bounty.
– lockstep, Jul 27 '12 at 14:30

Thanks for the analysis, now xetex prints the letter number "10001 and the character number "10101. for the aforementioned examples.
– Khaled Hosny, Jul 31 '12 at 12:06

@HeikoOberdiek @KhaledHosny I ran the second code with various xetex versions. With TL12 I get Heiko's output (with a different number of available strings). But with TL13 and with a current MiKTeX I get a list which looks like this: ... [38: í €í°¦] [39: í €í°§] ==> 1048576 strings available. I'm wondering if the change in the MiKTeX output is related to my problem with the ^^-notation described in the second edit in tex.stackexchange.com/questions/122923/….
– Ulrike Fischer, Aug 16 '13 at 7:49