I have a weird encoding problem with just a small number of Unicode characters. If I use chr($cp) to produce the Unicode character with codepoint $cp, this works fine in principle. It seems to work for all codepoints >= 0x100 without limitation, and it also works for the ASCII range. Why shouldn't it?

For the range in between (0x80-0xFF), producing the character and printing it directly works. But as soon as I try to put the char into some string, the replacement character U+FFFD is printed instead, no matter whether I use the '.' operator, s///, or anything else.

I inserted the string "01" so that we can more easily recognize the byte order. We see that 0x0061 maps to 61 (as expected), 0x00DC becomes efbdbf (I don't know whether this is correct, but it is certainly different from your result), and 0x2184 turns into e28684. This is with Perl 5.10.1 running on Cygwin.

Codepoint 0xFFFD would correspond to efbfbd in UTF8 encoding, right?
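To check this without relying on od, here is a small sketch using the core Encode module. It prints the UTF-8 bytes for U+FFFD and, for comparison, for U+00DC, which is what a correctly encoded chr(0xDC) should look like. The variable names are mine.

```perl
use strict;
use warnings;
use Encode qw(encode);

# UTF-8 bytes of the replacement character U+FFFD, shown as hex.
my $hex_fffd = unpack('H*', encode('UTF-8', chr(0xFFFD)));

# UTF-8 bytes of U+00DC, i.e. what a correctly encoded chr(0xDC) looks like.
my $hex_dc = unpack('H*', encode('UTF-8', chr(0x00DC)));

print "U+FFFD -> $hex_fffd\n";   # efbfbd
print "U+00DC -> $hex_dc\n";     # c39c
```

So ef bf bd in the output really is the replacement character, and c3 9c would be the expected result for 0xDC.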

If you use my example, do you get the same result as I do, or do you still get the encoding for 0xFFFD for char (0xdc)?

Thanks for your answer, rovf. Your example does exactly the same thing on my machine, and I think it reproduces my problem. I find od -cx somewhat confusing - if you use od -ctx1 (print char and hex for each single byte) you see that, again, the replacement char ef bf bd is printed:

Indeed, you are right! Silly that I couldn't see it in the first place.

However, I found a hint why these characters are treated differently. From perldoc -f chr :

Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
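What the perldoc note describes can be made visible with utf8::is_utf8, which reports whether a string is stored internally as UTF-8. This is only a sketch to inspect the internal flag; the hash name %flag is mine, and the flag by itself does not explain the output problem.

```perl
use strict;
use warnings;

# Record the internal UTF-8 flag for one codepoint from each range.
my %flag;
for my $cp (0x61, 0xDC, 0x100) {
    my $char = chr($cp);
    $flag{$cp} = utf8::is_utf8($char) ? 'on' : 'off';
    printf "U+%04X utf8 flag: %s\n", $cp, $flag{$cp};
}
```

On my understanding this reports "off" for 0x61 and 0xDC and "on" for 0x100, which matches the 128 to 255 exception quoted above.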

Of course, this does not yet explain why the problem occurs only with concatenation.

I thought at first that it might be related to the fact that concatenation puts chr() into scalar context, but this is not the reason. Even if I call it in list context, take the first element of the list, and concatenate that, the bug appears:

Code

print(([chr(0x00DC)]->[0])."\n")

BTW, it is not only concatenation. Interpolation doesn't work either:

Code

print "@{ [chr(0x00DC)] }\n"

Since

Code

print(chr(0x00DC),"\n")

works, I feel that it is not just a bug in chr, but something deeper in Perl when it comes to manipulating Unicode strings.
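To compare the three variants without the terminal or od getting in the way, one could capture what print emits into an in-memory buffer and dump it as hex. This is only a diagnostic sketch: hex_of_print is a helper name I made up, and the in-memory handle does not necessarily carry the same I/O layers as your STDOUT (for example those set by -C, PERL_UNICODE or binmode), so it separates string construction from output encoding rather than reproducing the exact setup.

```perl
use strict;
use warnings;

# Hypothetical helper: print the arguments to an in-memory handle
# and return the emitted bytes as a hex string.
sub hex_of_print {
    my @args = @_;
    my $buf  = '';
    open(my $fh, '>', \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} @args;
    close($fh);
    return unpack('H*', $buf);
}

my %variants = (
    list   => hex_of_print(chr(0x00DC), "\n"),        # the variant that works
    concat => hex_of_print(chr(0x00DC) . "\n"),       # concatenation
    interp => hex_of_print("@{ [chr(0x00DC)] }\n"),   # interpolation
);
printf "%-6s %s\n", $_, $variants{$_} for sort keys %variants;
```

If all three lines show the same bytes here, the difference is more likely to come from how STDOUT is configured than from the string operations themselves.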

If you can't find a good explanation in this forum, I suggest that you explain the issue at http://perlmonks.org/, and if they also can't explain it, I would file a bug report....

Note that "\x.." (no "{}" and only two hexadecimal digits), "\x{...}", and "chr(...)" for arguments less than 0x100 (decimal 256) generate an eight-bit character for backward compatibility with older Perls. For arguments of 0x100 or more, Unicode characters are always produced. If you want to force the production of Unicode characters regardless of the numeric value, use "pack("U", ...)" instead of "\x..", "\x{...}", or "chr()".
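A short sketch of what this paragraph says: the same character U+00DC built in four ways, with the internal UTF-8 flag reported for each. The hash name %made is mine, and the labels are just strings for display.

```perl
use strict;
use warnings;

# The same character, U+00DC, produced in four different ways.
my %made = (
    'chr(0xDC)'       => chr(0xDC),
    '"\xDC"'          => "\xDC",
    '"\x{DC}"'        => "\x{DC}",
    'pack("U", 0xDC)' => pack("U", 0xDC),
);

# Report for each whether Perl stores it internally as UTF-8.
for my $how (sort keys %made) {
    printf "%-16s utf8 flag %s\n", $how, utf8::is_utf8($made{$how}) ? 'on' : 'off';
}
```

All four strings compare equal with eq; as far as I can tell only the pack("U", ...) result carries the flag.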

Perl Unicode has problems with codepoints 128-255 (0x80-0xff). This means that e.g. chr(228) or related things like \x{..} will give you trouble. As far as I can see this is known as the "Unicode Bug", or at least an important aspect thereof. Often it is not called a bug but a backwards compatibility issue. Whatever ... (see links below)

In my examples above, things begin to go wrong as soon as I concatenate these characters with other strings. The links listed below also include indications of why this could be so. But after years and years of Perl programming with utf8, I still find it very hard to fully understand.

The sources below often suggest utf8::upgrade or utf8::encode but neither worked for me. What did work was what rovf found: pack( "U",0x00DC)
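For reference, here is a sketch of what the two suggested functions actually do to chr(0xDC), since they are easy to mix up. It does not show why they failed in my case; it only makes the difference visible. The variable names are mine.

```perl
use strict;
use warnings;

# utf8::upgrade keeps the same single character and only switches
# the internal representation to UTF-8.
my $upgraded = chr(0xDC);
utf8::upgrade($upgraded);

# utf8::encode replaces the character in place by its UTF-8 octets,
# so the result is a two-byte string.
my $encoded = chr(0xDC);
utf8::encode($encoded);

printf "upgrade: length %d, utf8 flag %s\n",
    length($upgraded), utf8::is_utf8($upgraded) ? 'on' : 'off';
printf "encode:  length %d, bytes %s\n",
    length($encoded), unpack('H*', $encoded);
```

After upgrade the string is still equal to chr(0xDC) and to pack("U", 0x00DC); after encode it is the byte sequence c3 9c, which is a different string.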