Raw and Unicode String Literals; Unified Proposal (Rev. 2)

Introduction

Two papers,
N2209, UTF-8 String Literals, by Lawrence Crowl, and
N2146, Raw String Literals (Revision 1), by Beman Dawes,
propose additional forms of string literals for C++. Both have
been approved by the Evolution Working Group and are ready for
processing by the Core Working Group. Both papers make changes to
the same text in the Working Paper. This proposal unifies the
changed wording to avoid race conditions in editing the text.

The motivation, discussion, and other details from the
original proposals remain unchanged.

"The terminating d-char-sequence of a raw-string shall be the same
sequence of characters as the initial d-char-sequence,": "shall be" implies
that an implementation has to diagnose a violation, which isn't the case.
Consensus: replace "shall be" with "is".

Added footnote: "For a specification of Unicode and UTF-8, see ISO
10646."

Removed the note that read "[Note:
Implementations are encouraged to accept as physical source file
characters all the permissible characters whose character short name in
ISO/IEC 10646 is 0000NNNN. --end note]" The core working group
considers that market forces are adequate to motivate implementors.

In the grammar, the sequence
"d-char-sequence_opt [ r-char-sequence_opt ] d-char-sequence_opt"
was factored into a new non-terminal named raw-string, at the request of
Jens Maurer. Clark Nelson suggested the name.

Wording for r-char was clarified in response to comments from
Jens Maurer, Clark Nelson, and Lawrence Crowl.

The space character was excluded from d-char in response to a
request by Lawrence Crowl.

The uR example was corrected, fixing an error noticed by James Widman.

d-char has been limited to the basic source character set, as
suggested by Clark Nelson and Beman Dawes.

The proposed text is the same as in the original papers (N2209,
N2146),
except:

The original raw string literal syntax allowed the 'R' that
denotes a raw string literal either before or after other
prefixes. Thus both LR and RL were valid. To reduce the
combinatorial explosion caused by the addition of the u, U, and
u8 prefixes, the R is now valid only when it follows the other
portion of a prefix. This is the same as in Python.

The original UTF-8 string literal wording made any source
character set extensions to the basic source character set
implementation-defined, but only for literals. It seemed
awkward to make the source character set
implementation-defined only in literals, so that was
changed to apply to the entire source file. Non-normative
encouragement to support all 16-bit ISO/IEC 10646 characters
was added to encourage physical source file character set
uniformity. That's existing practice for compilers such as
VC++.

Proposed Text

Change 1.7 The C++ memory model [intro.memory] as indicated:

The fundamental storage unit in the C++ memory model is the
byte. A byte is at least large enough to contain any member of the
basic execution character set
and the eight-bit code units of
the Unicode UTF-8 encoding form and is composed of a contiguous
sequence of bits, the number of which is implementation-defined. The least
significant bit is called the low-order bit; the most significant
bit is called the high-order bit. The memory available to a C++
program consists of one or more sequences of contiguous bytes. Every byte has
a unique address.
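
The practical consequence of the added wording is that a byte can never be narrower than eight bits. A minimal sketch of that consequence, assuming a hosted implementation that provides <climits>; the helper name bits_per_byte is hypothetical and not part of the proposal:

```cpp
#include <climits>

// A byte must be able to hold every eight-bit UTF-8 code unit,
// so it needs at least eight bits.
static_assert(CHAR_BIT >= 8, "a byte must hold an eight-bit UTF-8 code unit");

// The largest value of an eight-bit code unit fits in unsigned char.
static_assert(UCHAR_MAX >= 0xFF, "unsigned char must represent 0x00..0xFF");

// Hypothetical helper (not from the proposal): the implementation-defined
// number of bits in a byte.
constexpr int bits_per_byte() { return CHAR_BIT; }
```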

Change 2.1 [lex.phases], paragraph 1 as indicated. (Note to
reviewers: the ISO/IEC short name wording is the same as used in
2.2 Character sets [lex.charset] paragraph two.)

1. Physical source file characters are mapped, in an
implementation-defined manner, to the basic source character set
(introducing new-line characters for end-of-line indicators) if
necessary. The set of physical source
file characters accepted is implementation-defined. Trigraph
sequences (2.3) are replaced by corresponding single-character
internal representations. Any source file character not in the
basic source character set (2.2) is replaced by the
universal-character-name that designates that character. (An
implementation may use any internal encoding, so long as an
actual extended character encountered in the source file, and the
same extended character expressed in the source file as a
universal-character-name (i.e. using the \uXXXX notation), are
handled equivalently.)

Change 2.1 [lex.phases], paragraph 1 as indicated:

5. Each source character set member or universal-character-name in
a character literal or a string literal, or escape sequence in
a character literal or a non-raw string literal, is
converted to the corresponding member of the execution character
set (2.13.2, 2.13.4); if there is no corresponding member, it is
converted to an implementation-defined member other than the null
(wide) character.17)

r-char:
    any member of the source character set, except, (1), a
    backslash \ followed by a u or U, or, (2), a right square
    bracket ] followed by the initial d-char-sequence (which may
    be empty) followed by a double quote ".
    universal-character-name

d-char-sequence:
    d-char
    d-char-sequence d-char

d-char:
    any member of the basic source character set, except space,
    the left square bracket [, the right square bracket ], or the
    control characters representing horizontal tab, vertical tab,
    form feed, or new-line.

A string literal is a sequence of characters (as defined in
2.13.2) surrounded by double quotes, optionally prefixed by
R, u8, u8R, u, uR, U, UR, L, or LR, as in
"...", R"[...]", u8"...", u8R"**[...]**", u"...",
uR"*@[...]*@", U"...", UR"zzz[...]zzz", L"...", or LR"[...]", respectively.

A
string literal that has an R in the prefix is a raw
string literal. The terminating d-char-sequence of a raw-string
is the same sequence of characters as the
initial d-char-sequence.
A d-char-sequence shall consist of at most 16 characters.

[Note: A source-file new-line
in a raw string-literal results in a new-line in the resulting
execution string-literal, unless preceded by a backslash.
Assuming no whitespace at the beginning of lines in the following
example, the assert will succeed:

const char * p = R"[a\
b
c]";
assert(std::strcmp(p, "ab\nc") == 0);

--end note]
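
The following is a minimal sketch of how raw string literals and their delimiters behave. It assumes a compiler that implements raw strings as they were eventually standardized in C++11, where the delimiters enclose parentheses rather than the square brackets used in this proposal's wording. The names raw_path, escaped_path, with_delim, two_lines, and same are hypothetical.

```cpp
// NOTE: C++11 syntax R"delim( ... )delim" is assumed here; the wording in
// this proposal uses [ ... ] in place of ( ... ).

// In a raw literal, backslash and double quote are ordinary characters.
constexpr char raw_path[] = R"(C:\temp\"x")";
constexpr char escaped_path[] = "C:\\temp\\\"x\"";

// Hypothetical helper: compile-time comparison of two null-terminated strings.
constexpr bool same(const char* a, const char* b) {
    return *a == *b && (*a == '\0' || same(a + 1, b + 1));
}

static_assert(same(raw_path, escaped_path), "raw and escaped forms agree");

// A non-empty delimiter (here "xy") lets the body contain )" itself.
constexpr char with_delim[] = R"xy(a)"b)xy";
static_assert(same(with_delim, "a)\"b"), "delimiter protects the body");

// A source-file new-line inside the literal becomes a new-line character.
constexpr char two_lines[] = R"(a
b)";
static_assert(same(two_lines, "a\nb"), "new-line is preserved");
```

The delimiter only needs to be chosen so that the closing sequence does not occur inside the body; an empty delimiter suffices for most text.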

A string literal that does not begin with u8, u,
U, or L is an ordinary string literal,
and is initialized with the given
characters.

A string literal that begins with
u8, such as u8"asdf", is a
UTF-8 string literal and is initialized with the given characters
as encoded in UTF-8. (See footnote.)

Footnote: For a specification of Unicode and
UTF-8, see ISO 10646.
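
As an illustration of "as encoded in UTF-8", the sketch below inspects the code units of a one-character UTF-8 literal. It uses a universal-character-name so that the result does not depend on the source file encoding. The helper name utf8_unit is hypothetical. The cast to unsigned char is there because the element type of a u8 literal is char under this proposal, whereas later revisions of C++ (C++20) give it a distinct type.

```cpp
#include <cstddef>

// U+00E9 (LATIN SMALL LETTER E WITH ACUTE) encodes in UTF-8 as the two
// code units 0xC3 0xA9; sizeof also counts the terminating zero.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus terminator");

// Hypothetical helper: the i-th code unit of the literal, as an unsigned value.
constexpr unsigned char utf8_unit(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}
```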

Ordinary string literals and UTF-8
string literals are also referred to as
narrow string literals.
A narrow string literal has
type “array of n const char”,
where n is the size of the string as defined below,
and has static storage duration (3.7).

A string literal that begins with u, such as
u"asdf", is a char16_t string literal.
A char16_t string literal has type “array of
n const char16_t”, where n is
the size of the string as defined below; it has static storage
duration and is initialized with the given characters. A single
c-char may produce more than one char16_t
character in the form of surrogate pairs.

A string literal that begins with U, such as
U"asdf", is a char32_t string literal.
A char32_t string literal has type “array of
n const char32_t”, where n is
the size of the string as defined below; it has static storage
duration and is initialized with the given characters.

A string literal that begins with L, such as
L"asdf", is a wide string literal. A wide string
literal has type “array of n const
wchar_t”, where n is the size of the string
as defined below; it has static storage duration and is
initialized with the given characters.
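
The types described above can be checked at compile time. This is a sketch under the assumption of a C++11 (or later) compiler; the alias element_t is hypothetical and simply strips the reference, array extent, and const from the type of a literal.

```cpp
#include <type_traits>

// Each literal "asdf" has four characters plus the terminator, so n is 5.
static_assert(std::is_same<decltype("asdf"),  const char     (&)[5]>::value, "ordinary");
static_assert(std::is_same<decltype(u"asdf"), const char16_t (&)[5]>::value, "char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t (&)[5]>::value, "char32_t");
static_assert(std::is_same<decltype(L"asdf"), const wchar_t  (&)[5]>::value, "wide");

// Hypothetical alias: the element type of a string literal's array type.
template <typename Lit>
using element_t = typename std::remove_const<
    typename std::remove_extent<
        typename std::remove_reference<Lit>::type>::type>::type;
```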

Whether all string literals are distinct (that is, are stored
in nonoverlapping objects) is implementation-defined. The effect
of attempting to modify a string literal is undefined.

In translation phase 6 (2.1), adjacent string literals are
concatenated. If both string literals have the same prefix, the
resulting concatenated string literal has that prefix. If one
string literal has no prefix, it is treated as a string literal
of the same prefix as the other operand. If a UTF-8 string literal token is adjacent to a
wide string literal token, the program is
ill-formed. Any other concatenations are
conditionally supported with implementation-defined behavior.
[ Note: This concatenation is an interpretation, not a
conversion. —end note ] [ Example: Here are some
examples of valid concatenations:

"\xA" "B" contains the two characters ’\xA’ and
’B’ after concatenation (and not the
single hexadecimal character ’\xAB’).
—end example ]
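
The concatenation rules can be exercised directly. The cases below are a sketch derived from the rules stated in this paragraph (same prefix, or one operand without a prefix); they are not a reproduction of the paper's own example list. The helper same is hypothetical.

```cpp
#include <type_traits>

// Hypothetical helper: compile-time comparison of two null-terminated
// strings of any character type.
template <typename C>
constexpr bool same(const C* a, const C* b) {
    return *a == *b && (*a == C() || same(a + 1, b + 1));
}

// Same prefix on both operands: the result keeps that prefix.
static_assert(same(u"a" u"b", u"ab"), "u + u");

// One operand has no prefix: it is treated as having the other's prefix.
static_assert(same(u"a" "b", u"ab"), "u + none");
static_assert(same("a" U"b", U"ab"), "none + U");
static_assert(std::is_same<decltype("a" L"b"), const wchar_t (&)[3]>::value, "none + L");

// Characters in concatenated strings are kept distinct: \xA followed by B,
// not the single character \xAB.
static_assert(sizeof("\xA" "B") == 3, "two characters plus terminator");
static_assert(("\xA" "B")[0] == '\xA' && ("\xA" "B")[1] == 'B', "kept distinct");
```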

After any necessary concatenation, in translation phase 7
(2.1), ’\0’ is appended to every string
literal so that programs that scan a string can find its end.

Escape sequences in
non-raw string
literals and universal-character-names in string
literals have the same meaning as in character literals (2.13.2),
except that the single quote ’ is
representable either by itself or by the escape sequence
\’, and the double quote " shall be preceded
by a \. In a narrow string literal, a
universal-character-name may map to more than one char element
due to multibyte encoding. The size of a char32_t or
wide string literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for the
terminating U’\0’ or
L’\0’. The size of a
char16_t string literal is the total number of
escape sequences, universal-character-names, and other
characters, plus one for each character requiring a surrogate
pair, plus one for the terminating
u’\0’. [ Note: The size of a
char16_t string literal is the number of code units,
not the number of characters. —end note ] Within
char32_t and char16_t literals, any
universal-character-names must be within the range 0x0 to
0x10FFFF. The size of a narrow string literal is the total number
of escape sequences and other characters, plus at least one for
the multibyte encoding of each universal-character-name, plus one
for the terminating ’\0’.
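
The size rules can be made concrete with a character outside the Basic Multilingual Plane. The sketch below uses U+1F600 as an arbitrary example and a hypothetical helper, code_units, that returns the array bound n of a literal. Wide literals are left out because the width of wchar_t varies between implementations.

```cpp
#include <cstddef>

// Hypothetical helper: the array bound n of a string literal,
// including the terminating null.
template <typename T, std::size_t N>
constexpr std::size_t code_units(const T (&)[N]) { return N; }

// char32_t: one element per character, plus the terminator.
static_assert(code_units(U"a\U0001F600") == 3, "a, U+1F600, terminator");

// char16_t: U+1F600 needs a surrogate pair, so it adds one extra element.
static_assert(code_units(u"a\U0001F600") == 4, "a, two surrogates, terminator");
static_assert(u"\U0001F600"[0] == 0xD83D && u"\U0001F600"[1] == 0xDE00,
              "high and low surrogate");

// Narrow (UTF-8): U+1F600 takes four code units in the multibyte encoding.
static_assert(code_units(u8"a\U0001F600") == 6, "1 + 4 + terminator");
```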