Are you suggesting that Perl should test every SV whose char* it passes to some system call, and check whether it contains a null byte at a position other than the last byte? And then what? Die? Issue a warning? Convert the embedded null to a space?

Just because it is possible to do something--like embedding nulls in filenames--doesn't mean that it isn't an obviously bad idea; and hamstringing the performance of Perl and every other utility program in order to cater for idiots who ignore the obviously bad ideas would result in today's systems running with roughly the same performance as the 40MHz CPUs that became available in the late 1980s.

Give me pragmatism over perfection every time.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Well, especially for builtins perl should know if the char* it's handing off to the system call will be interpreted as a classic C string or as a wide character.
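For anyone following along who hasn't seen the problem first-hand, here is a minimal sketch of what "interpreted as a classic C string" means. It is in Python rather than Perl, purely for illustration, and the helper name as_c_string is made up for this example. It uses ctypes to mimic how C code reading a null-terminated char* sees a buffer that contains an embedded null.

```python
import ctypes


def as_c_string(data: bytes) -> bytes:
    """Return what C code reading a null-terminated char* would see.

    as_c_string is a hypothetical helper for this illustration only.
    """
    # create_string_buffer copies the bytes and appends a terminating null.
    buf = ctypes.create_string_buffer(data)
    # .value reads up to the first null byte, much as strlen()/strcpy() would.
    return buf.value


name = b"report.txt\x00.jpg"
print(len(name))          # the scalar itself still holds all 15 bytes
print(as_c_string(name))  # the C side sees only the part before the null
```

The point is only that the two views of the same bytes disagree; what Perl ought to do about it is the question under debate.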

I question strongly that this is a serious performance issue - among other things, null-containing strings could easily be flagged as such, the way tainted strings are when running under taint mode.

As for what perl should do, I certainly think that a warning when running under -w is appropriate - this is at least as big a problem as interpolating an undef variable into a string. I might even be convinced that perl in taint mode should treat null-containing strings as tainted when passing them to C APIs - that is, die.

The taint flag is set at a few, very specific points of input. And it remains set until it is modified.

Think of all the different ways a string can be read in, constructed or modified. Interpolation of other strings, concatenation, join, pack, unpack, qq//, s///, tr///, substr, chomp, chop, sprintf, read, sysread, vec, promotion of IVs & NVs to PVs etc. etc. Every time a scalar is modified it would be necessary to recheck whether it now (or still) contains one or more null characters--and if it does, whether they are a legitimate part of a multibyte character or not.

Still doubt the performance impact?


Except of course where the embedded null is a legitimate part of a multi-byte character--which means a full unicode verification of every string passed to every system API.

Um, sorry, but if you're talking about the perl-internal representation of unicode, which is utf-8, the only thing that involves a null byte is the unicode code-point U+0000 (i.e. "NULL") -- and this, BTW, is simply the single-byte-null itself in utf-8. For every other utf-8 character, every byte is always non-null. And I don't know of any pre-unicode encodings that use nulls as parts of multi-byte characters.
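The claim about utf-8 is easy to check by brute force. The following is a small sketch in Python (chosen only for illustration; the function name is made up) that encodes every Unicode code point as UTF-8 and collects the ones whose encoding contains a null byte. Surrogates are skipped because they are not encodable as well-formed UTF-8.

```python
def code_points_with_null_in_utf8():
    """Return every code point whose UTF-8 encoding contains a 0x00 byte."""
    offenders = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as well-formed UTF-8
        if 0 in chr(cp).encode("utf-8"):
            offenders.append(cp)
    return offenders


print(code_points_with_null_in_utf8())  # [0]
```

Only U+0000 itself shows up, which follows from the design: lead and continuation bytes of multi-byte sequences always have the high bit set.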

If a string of octets is supposed to represent utf-16, then sure, we would expect some of those octets to be null -- each octet is supposed to be treated as half of a 16-bit binary "word"; but this is a very different situation. Here we are talking about something more akin to plain old raw binary data, not a string of characters that can be transmuted directly to a char* and treated as a string in C.

You obviously know more about things unicode than I--I've had barely any reason to use them--so I'll ask you:

Is there no possibility that when encoding a string to one of the many forms of unicode for output to an external system that there might legitimately be null bytes embedded within the string?

If there isn't, then detecting and warning of embedded nulls would only require a single pass of every scalar passed to a system api looking for nulls.
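As a rough picture of what that single pass would amount to, here is a minimal sketch in Python. The function name and the choice to raise an exception (rather than warn) are assumptions made for this example, not anything perl does today.

```python
def check_for_embedded_null(data: bytes, what: str = "argument") -> bytes:
    """Scan once for a null byte before handing data to a C-string API.

    Hypothetical helper: raises instead of warning, just to keep it short.
    """
    pos = data.find(b"\x00")  # one linear pass over the bytes
    if pos != -1:
        raise ValueError(f"{what} contains a null byte at offset {pos}")
    return data


print(check_for_embedded_null(b"/tmp/data.txt", "filename"))

try:
    check_for_embedded_null(b"/tmp/data.txt\x00.bak", "filename")
except ValueError as err:
    print("rejected:", err)
```

The scan is linear in the length of the string, so the cost is paid on every call that crosses into C, whether or not a null is ever found.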

If there is--and I feel sure that some of the MS wide character sets contain some characters where one half of the 16-bit values can be null, but I don't have proof yet--then it would require two passes in order to guard against false positives causing spurious warnings/dies.


Is there no possibility that when encoding a string to one of the many forms of unicode for output to an external system that there might legitimately be null bytes embedded within the string?

The only forms of unicode that can involve a null byte as part of a non-null character are the encodings built on 16-bit or 32-bit units: UTF-16LE and UTF-16BE, and likewise UTF-32LE and UTF-32BE.
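A quick way to see this is to encode a plain ASCII string both ways and count the null bytes. This is a small Python sketch for illustration only; the function name is made up.

```python
def null_counts(text):
    """Count 0x00 bytes in a few encodings of the same text."""
    return {
        enc: text.encode(enc).count(0)
        for enc in ("utf-8", "utf-16-le", "utf-16-be")
    }


print(null_counts("Az"))
# {'utf-8': 0, 'utf-16-le': 2, 'utf-16-be': 2}

# What C code would take to be "the string" if it stopped at the first null:
print("Az".encode("utf-16-le").split(b"\x00", 1)[0])  # b'A'
print("Az".encode("utf-16-be").split(b"\x00", 1)[0])  # b''
```

So a UTF-16 buffer handed to something expecting a null-terminated char* is cut off after at most one character, and in the big-endian case before the first one.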

"UTF-16" without the byte-order spec refers either to "whatever the native byte-order is on the current cpu" or to a data file encoded as 16-bit unicode characters and having a byte-order-mark (BOM, U+FEFF) as the very first character, so that unicode-aware readers know whether they need to swap bytes in order for their current cpu to see the intended 16-bit character values. Think of UTF-16* as if it were 16-bit PCM audio data: you need to handle it in two-byte chunks, and you need to know which of the two is the "least significant byte"; if you treat it as just bytes, anything can happen.

So it's a pretty nice feature that Perl uses utf-8 as its internal string representation, and not utf-16. This encoding is loosely analogous to uuencoding or base64 encoding, though it's actually a bit more clever: the intention is to convey a code-point value (up to 21 bits, since Unicode tops out at U+10FFFF) using only a limited range of possible byte values, but the number of bytes needed to convey that value is smaller for the "simpler" characters (those with low code points) than for the "heavier" characters (those with high code points).

Because of the design, ASCII characters (00-7F) remain single-byte characters in utf-8; code points U+0080 through U+07FF need two bytes, U+0800 through U+FFFF need three bytes, and anything above U+FFFF (up to U+10FFFF) needs four. In the multi-byte "wide" characters, every byte has its high bit set, so as not to be confusable with ASCII. (The "Unicode Encodings" section of the perlunicode man page provides all the details quite nicely.)
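The byte counts at the range boundaries can be checked directly. Here is a short Python sketch (for illustration only; the helper names are made up) that reports the UTF-8 length of a code point and confirms that every byte of a multi-byte sequence has its high bit set.

```python
def utf8_length(cp):
    """Number of bytes in the UTF-8 encoding of code point cp."""
    return len(chr(cp).encode("utf-8"))


def all_high_bits_set(cp):
    """True if every byte of cp's UTF-8 encoding is >= 0x80."""
    return all(b >= 0x80 for b in chr(cp).encode("utf-8"))


for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

print(all_high_bits_set(0x20AC))  # the euro sign, a three-byte character
```

Because no byte of a multi-byte sequence falls in the 00-7F range, none of them can be mistaken for a null (or for any other ASCII character).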

Of course, the whole notion of "wide characters" in C has the same status as the notion of "strings" -- i.e. it's a convenient fiction; that's why all the pre-unicode wide-character encodings (for Chinese, Japanese and Korean) never used a null byte as a component of a multi-byte character.

(<update> Regarding this question: I feel sure that some of the MS wide character sets contain some characters where one half of the 16-bit values can be null... -- Well, now that you mention it, I've looked at hex dumps of Word files containing unicode characters, and they actually alternate at block boundaries (2KB blocks, I think, but I forget) between single-byte character encoding for blocks that don't contain wide characters, vs. UTF-16LE encoding for blocks with wide characters in them. Pretty scary stuff -- I would call it brain-damaged. But none of the 2-byte "legacy" MS/DOS code pages (e.g. CP936) ever used null bytes as part of a code point. </update>)

I guess if people wanted to pursue the notion of granting special status to character strings in order to enable some sort of trap or check for "embedded-null-byte", there would have to be a flag on the SV that says "this is a character string (so if you see an embedded null byte, that would mean something is wrong)."

Since SV's are used to store all kinds of stuff, some of which is expected to include null bytes by nature, there would have to be something similar to the utf8 flag, that says "this is really character data, and I'd be worried if there were a null byte in it". Then, every SV-to-char* operation would need to know whether the char* is going to be used as a character string in C, and if so, check that flag.
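To make the shape of that idea concrete, here is a toy sketch in Python. Nothing in it corresponds to real perl internals: the Scalar class, the is_text flag and the to_char_ptr method are all hypothetical names invented for this illustration.

```python
class Scalar:
    """Toy stand-in for an SV carrying a hypothetical 'character data' flag."""

    def __init__(self, data: bytes, is_text: bool):
        self.data = data
        self.is_text = is_text  # "I'd be worried if there were a null in this"

    def to_char_ptr(self, used_as_c_string: bool) -> bytes:
        """Hand the bytes to C, checking only when both conditions hold."""
        if used_as_c_string and self.is_text and b"\x00" in self.data:
            raise ValueError("character string contains an embedded null byte")
        return self.data


blob = Scalar(b"\x00\x01\x02", is_text=False)
print(blob.to_char_ptr(used_as_c_string=False))  # binary data passes untouched

name = Scalar(b"file\x00name", is_text=True)
try:
    name.to_char_ptr(used_as_c_string=True)
except ValueError as err:
    print("trapped:", err)
```

Note that the check needs two pieces of information at the call site: the flag on the value, and knowledge of how the receiving API will treat the pointer. The second is the part every SV-to-char* operation would have to supply.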