Adding digit separators introduces the potential for ambiguous C++ programs.
We would prefer to avoid ambiguity,
and failing that would prefer to have usable rules
for disambiguating the source.
In particular,
the interaction with user-defined literals
[N2747][N2765]
should be carefully considered.

The lexical structure of C++
is shared with C, Objective C/C++, and other tools through the preprocessor.
Any introduction of digit separators
should carefully consider compatibility
with the existing lexical structure of these languages.

Richard Smith questions the value of compatibility here.

This problem only arises if:

Someone is attempting to write a file
which is to be shared between C++14 and other languages,
and

They include text in that header
which simply does not work in those other languages.

I find it hard to believe that this will be a real problem,
and it seems like a clear case of user error.
(If you're writing a header which works in C and C++,
the burden is on you to make sure it works in C).

This is not a new issue.
The same problem already exists with C++11's raw string literals,
and to a lesser extent with user-defined-literals
and with C's hex floats (which allow 'p+' within pp-numbers).

C++ is often used as the basis for extended languages,
notably Objective C/C++,
but also many languages that are smaller and less widely used.
Invalidating those extension languages
has costs that are hard to predict.

The basic source character set consists of 96 characters:
the space character,
the control characters representing
horizontal tab, vertical tab, form feed, and new-line,
plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

[Footnote:
The glyphs for the members of the basic source character set
are intended to identify characters from the subset of ISO/IEC 10646
which corresponds to the ASCII character set.
However, because the mapping
from source file characters to the source character set
(described in translation phase 1)
is specified as implementation-defined,
an implementation is required to document
how the basic source characters are represented in source files.
—end footnote]

Of particular note,
the only printable ASCII characters
not used in the C++ basic character set
are
$ (dollar),
@ (commercial at sign), and
` (grave accent, back tick).
All of these characters have been used for extension characters.
Dollar has also been used as an identifier character,
e.g. in VAX/VMS system function names.
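
The character counts above can be checked mechanically. The following is a minimal sketch, assuming an ASCII execution character set; the function name is hypothetical, and the character list is transcribed from the standard's 91 graphical characters.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns, in ASCII order, the graphical ASCII
// characters that are NOT among the 91 graphical characters of the
// basic source character set. Assumes an ASCII execution character set.
std::string unused_graphical_ascii() {
    const std::string graphical =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'";
    assert(graphical.size() == 91);

    std::string unused;
    for (char c = '!'; c <= '~'; ++c) {   // graphical ASCII is 0x21..0x7E
        if (graphical.find(c) == std::string::npos) unused += c;
    }
    return unused;
}
```

Under those assumptions the function returns the three characters named above: dollar, commercial at, and grave accent.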

A preprocessing token is the minimal lexical element of the language
in translation phases 3 through 6.
The categories of preprocessing token are:
header names, identifiers, preprocessing numbers,
character literals (including user-defined character literals),
string literals (including user-defined string literals),
preprocessing operators and punctuators,
and single non-white-space characters
that do not lexically match the other preprocessing token categories.
If a ' or a " character matches the last category,
the behavior is undefined.
Preprocessing tokens can be separated by white space;
this consists of comments (2.8),
or white-space characters
(space, horizontal tab, new-line, vertical tab, and form-feed),
or both.
As described in Clause 16,
in certain circumstances during translation phase 4,
white space (or the absence thereof)
serves as more than preprocessing token separation.
White space can appear within a preprocessing token
only as part of a header name
or between the quotation characters in a character literal or string literal.

The implication here is that
no valid C++ program should have an isolated single or double quote character.
Unfortunately, that information is less useful than it might appear
because an isolated single quote could be in use
to signal an extension language interpretation.

There are three primary typographic conventions for digit separators:
a comma, a base-line dot, and a (thin) space.

C++ already uses the comma for an operator,
and using it for a digit separator would introduce ambiguities
in expressions such as ++a-3,4-b++,
or even more simply, f(12,345).
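
A small sketch of the simpler case, using hypothetical overloads that only report how many arguments they received: existing code already gives 12,345 in an argument list the meaning "two arguments".

```cpp
#include <cassert>

// Hypothetical overloads that report how many arguments were passed.
int arity(int)      { return 1; }
int arity(int, int) { return 2; }

// In current C++ the comma separates two arguments, 12 and 345;
// it can never be read as the single number 12345.
int arity_of_12_comma_345() { return arity(12,345); }
```

A comma digit separator would have to change the meaning of this call or be unusable inside it.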

C++ already uses the base-line dot as a radix point,
and so it is essentially not usable as a digit separator.

Bjarne Stroustrup has suggested using a space as a separator.

Pronounce 7 237 498 123.

Compare 237 498 123
with 237 499 123 for equality.

Decide whether 237 499 123
or 20 249 472 is larger.

While this approach is consistent with one common typographic style,
it suffers from some compatibility problems.

It does not match the syntax for a pp-number,
and would minimally require extending that syntax.

More importantly,
there would be some syntactic ambiguity when a
hexadecimal digit in the range [a-f] follows a space.
The preprocessor would not know
whether to perform symbol substitution starting after the space.
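
The hazard can be observed with today's preprocessor. In this sketch the macro name cafe and the helper names are hypothetical; the stringizing macros only make the preprocessor's output visible.

```cpp
#include <cassert>
#include <string>

#define STRINGIZE_(x) #x
#define STRINGIZE(x) STRINGIZE_(x)   // expands x before stringizing it

#define cafe espresso                // hypothetical user macro

// One pp-number: "cafe" is buried inside the token and is not expanded.
std::string joined() { return STRINGIZE(0x12cafe); }

// Two tokens: "cafe" stands alone after the space and is macro-replaced.
std::string spaced() { return STRINGIZE(0x12 cafe); }
```

With a space separator, a preprocessor could not tell from the text alone which of these two treatments the programmer intended.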

Ville Voutilainen, among others,
suggests using a grave accent (`) (back tick) as a digit separator.

Pronounce 7`237`498`123.

Compare 237`498`123
with 237`499`123 for equality.

Decide whether 237`499`123
or 20`249`472 is larger.

This character is not part of the C++ basic source character set.
The proposal has the advantage that introducing it for this purpose
cannot yield any ambiguity with existing C++ code.
There are two disadvantages.
First, using this character in the language
invalidates any meta-languages using this character to distinguish
between the C++ base layer and any meta information.
Second, existing preprocessors
would not recognize the grave accent as part of a preprocessor number,
and may thus yield incorrect results.

Daveed Vandevoorde suggests using a single quote
[N2747].
The single quote can be thought of as an "upper comma".

Pronounce 7'237'498'123.

Compare 237'498'123
with 237'499'123 for equality.

Decide whether 237'499'123
or 20'249'472 is larger.

There are two problems with this approach.
First, an odd number of single quotes would result in a line of text
that does not meet the preprocessor syntax for a token.
While most preprocessors do not tokenize lines
that are ignored in #if/#else,
some preprocessors are known to emit errors for such cases.
Second, existing preprocessors
would not recognize the single quote as part of a preprocessor number,
and may thus yield incorrect results.

Daveed Vandevoorde explains the incompatibility in more detail.

For example:

#if defined(__cplusplus)
double pie = 3.141'593;
#endif

In C, the preprocessor-tokens that are #if'ed out
are (not including the double quotes)
"double", "pie", "=",
"3.141", "'", "593",
and ";".

However, single and double quotes
that aren't part of a larger preprocessor-token
are deemed undefined behavior (C99, 6.4/3).

Typical C compilers (GCC, clang, EDG, and MSVC for example)
have no problem with it
(presumably they don't try to tokenize #if'ed-out lines),
but James Dennett mentioned at least one older C compiler didn't like it.

Pete Becker points out that many tools,
such as syntax highlighting in editors,
rely on quotes being paired.
The adaptability of the tools to new expressions is an open issue.

N.M. Maclaren suggests that single quote
will lead to very bad error messages with some macro-based libraries.

The Ada programming language
uses an underscore (technically, a low line)
for the digit separator
[AdaLRMnumlit][AdaRDnumlit].
This approach seems to be used in VHDL and Verilog,
also possibly in Algol68.
(VHDL also appears to have literal suffixes.)
This approach has been proposed more than once for C++,
going at least as far back as 1993
[N0259].

Pronounce 7_237_498_123.

Compare 237_498_123
with 237_499_123 for equality.

Decide whether 237_499_123
or 20_249_472 is larger.

In all known cases,
the primary proposal has been to permit only
a single underscore between digits
[N0259][N2281][N3342].
However,
[N0259]
presents an option to permit underscores
between the digit sequence and any prefix or suffix.

Underscores work well as a digit separator for C++03
[N0259][N2281].
But with C++11, there exists a potential ambiguity with user-defined literals
[N2747].
While the likely resolution will be some form of "max munch" rule,
some mechanism must be present to disambiguate
when max munch is too much.
We use the term suffix separator to indicate this mechanism.
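
The sketch below shows the reading that C++11 already gives to an underscore after the digits, which is what any underscore digit separator would collide with. The suffix _de and the type Tagged are hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical result type, used only to make the operator call visible.
struct Tagged { unsigned long long value; };

// Hypothetical user-defined literal operator for the suffix _de.
constexpr Tagged operator""_de(unsigned long long v) { return Tagged{v}; }

// In C++11 the hexadecimal digits end at the underscore, so the token is
// the literal 0xabc with ud-suffix _de, not the integer 0xabcde.
static_assert(std::is_same<decltype(0xabc_de), Tagged>::value,
              "0xabc_de is a user-defined literal");
static_assert((0xabc_de).value == 0xabc, "the operator receives 0xabc");
```

If the underscore also served as a digit separator, the same token could be read as the plain integer 0xabcde, hence the need for a disambiguation rule.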

... one possibility that occurs to me
would be to allow a trailing underscore in an integer literal.
The ambiguity with user-defined literals
would be resolved in favor of the plain integer literal;
a user could disambiguate a user-defined literal
by ending the integer part with a trailing underscore.
(Double underscores would not be permitted in an integer literal.)
Thus:

The ambiguity with this approach
arises when the suffix begins with one or more underscores.

John Spicer suggests something slightly different.

At some point I had suggested using underscore
and having a special lookup rule
so that something like 0xabc_de
would look for the "de" user-defined literal operator,
and if not found,
would treat the "de" as part of the hex literal.
If you wanted to force the use of the operator,
you could write 0xabc__de.
If you wanted to force the use of a _de operator,
you would have to write 0xabc___de.

Another alternative would be to look for the "de" form
and then the "_de" form if the first was not found.
That way would only require the use of three underscores
in cases where you had both a "de" and "_de" operator
and wanted to force use of the second.
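
One possible reading of the first lookup rule can be sketched as follows. This is an illustration, not proposed wording: the names Split and resolve_suffix are hypothetical, the set of declared operators stands in for real name lookup, and the choice to examine only the last run of underscores is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical result: the digits (separators removed) and the chosen suffix.
struct Split { std::string digits; std::string suffix; };

// Removes every underscore from s.
std::string strip_underscores(const std::string& s) {
    std::string out;
    for (char c : s) if (c != '_') out += c;
    return out;
}

// Applies the rule to the last run of underscores in the literal:
//   one underscore    -> use the tail as a suffix if such an operator is
//                        declared, otherwise treat the tail as more digits
//   two underscores   -> force the tail to be the suffix
//   three underscores -> force "_" + tail to be the suffix
Split resolve_suffix(const std::string& literal,
                     const std::set<std::string>& operators) {
    std::size_t end = literal.find_last_of('_');
    if (end == std::string::npos) return {literal, ""};
    std::size_t begin = end;
    while (begin > 0 && literal[begin - 1] == '_') --begin;
    const std::size_t run = end - begin + 1;
    const std::string head = strip_underscores(literal.substr(0, begin));
    const std::string tail = literal.substr(end + 1);
    if (run == 1) {
        if (operators.count(tail)) return {head, tail};
        return {head + tail, ""};
    }
    if (run == 2) return {head, tail};
    return {head, "_" + tail};
}
```

Note that under this reading the meaning of 0xabc_de changes when a "de" operator is later declared, which illustrates the cost of a lookup-dependent rule.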

[N2747] suggests the scope operator (::)
as a potential suffix separator.
The scope operator would be a pure syntactic extension,
as it could not otherwise follow a literal.
However, it would make substrings of a literal
separately subject to preprocessor symbol substitution.

[N3342] suggests disallowing
a leading underscore followed by a digit as a user-defined literal suffix.
The intent was to make a suffix separator unnecessary.
However, [N3448] points out
that [N3342] fails to disambiguate hexadecimal digits,
particularly in the example 0xdead_beef_db,
where db could be either decibel
or the hexadecimal digits d and b.

One could simply not allow user-defined literals with hexadecimal literals.
However, this restriction is not desirable.

Discussions in the October 2012 standards meeting
settled on using whitespace as the suffix separator.
Unfortunately,
that approach causes parsing problems for Objective C/C++.

Richard Smith explains.

An Objective-C message send works like this:

message-expression:
    [expression message-selector]

message-selector:
    identifier
    keyword-arguments

keyword-arguments:
    identifier(opt) : expression keyword-arguments(opt)

In particular, this is a valid Objective-C message send:

[self setValue: 0xff units: "cm"]

Hence any proposal which
folds a pp-number followed by an identifier into a single literal
will break a significant quantity of Objective-C code.

Doug Gregor elaborates.

There are two issues
with allowing spaces between a literal and its suffix for Objective-C.
One is a true ambiguity and one is a problem for error recovery.

The true ambiguity occurs because
one can omit a parameter name from the method declaration,
in which case there is no identifier before the ':' in the call.
For example, one could have a message send that looks like this:

[a method:10 :11]

which calls the method "method::". Now, consider

[a method:10 _suffix:11]

Currently, this parses (unambiguously)
as a message send to "method:_suffix:",
i.e., it's parsed as

[a method:(10) _suffix:11]
// _suffix is the name of the second argument; calls method:_suffix:

However, if we allow a space between a literal and its suffix,
there is a second potential parse:

[a method:(10 _suffix) :11]
// 10 _suffix is a user-defined literal; calls method::

The error-recovery issue is that
Objective-C(++) parsers tend to rely heavily on the fact that
an expression in C/C++ cannot be immediately followed by an identifier.
If we see an expression followed by an identifier in an expression context,
it's fairly likely that this is a message send
for which the '[' has been dropped.
For example, Clang detects these cases
and automatically inserts the '[' for the user;
this was one of the top error-recovery requests,
and a regression here would be considered a major problem for our users.

Jeremiah Willcock suggests using ".." as the suffix separator.
This notation is already permitted by the pp-number syntax.
It is also not presently permitted by any numeric literal.
Its primary disadvantage seems to be that it is unfamiliar.
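
The claim about pp-number syntax can be checked with a small scanner. This is a sketch of the C++11 pp-number grammar under simplifying assumptions: identifier-nondigit is restricted to ASCII letters and underscore (no universal-character-names), and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest pp-number at the start of s,
// or 0 if s does not begin with one. Grammar followed (C++11):
//   pp-number: digit | . digit | pp-number digit
//            | pp-number identifier-nondigit
//            | pp-number e sign | pp-number E sign | pp-number .
std::size_t pp_number_length(const std::string& s) {
    auto is_digit = [](char c) { return c >= '0' && c <= '9'; };
    auto is_nondigit = [](char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    };

    std::size_t i = 0;
    if (!s.empty() && is_digit(s[0])) i = 1;
    else if (s.size() >= 2 && s[0] == '.' && is_digit(s[1])) i = 2;
    else return 0;

    while (i < s.size()) {
        const char c = s[i];
        if ((c == 'e' || c == 'E') && i + 1 < s.size() &&
            (s[i + 1] == '+' || s[i + 1] == '-')) {
            i += 2;                       // exponent letter followed by sign
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            ++i;
        } else {
            break;                        // space, ', ` and others end the token
        }
    }
    return i;
}
```

Under these assumptions 123..km and 0xdead_beef_db are each consumed whole, whereas a space, grave accent, or single quote ends the pp-number at the first separator.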

Edit paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s).

An integer literal
is a sequence of digits
that has no period or exponent part,
with optional digit separators.
These separators are ignored when determining its value.
....
[Example: The number twelve can be written
12, 014, or 0XC.
The literals
1048576,
1?048?576,
0X100000,
0x10?0000, and
0?004?000?000
all have the same value.
—end example]
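
The rule that separators are ignored when determining the value can be sketched as below. The function name is hypothetical, the separator character is passed as a parameter because it is not yet chosen, and the sketch does not check that separators appear only between digits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: value of an unsuffixed integer literal spelled in
// text, skipping every occurrence of the separator character sep.
std::uint64_t integer_literal_value(const std::string& text, char sep) {
    std::size_t i = 0;
    unsigned base = 10;
    if (text.size() > 1 && text[0] == '0' &&
        (text[1] == 'x' || text[1] == 'X')) {
        base = 16;                         // hexadecimal prefix 0x or 0X
        i = 2;
    } else if (text.size() > 1 && text[0] == '0') {
        base = 8;                          // leading 0 means octal
        i = 1;
    }

    std::uint64_t value = 0;
    for (; i < text.size(); ++i) {
        const char c = text[i];
        if (c == sep) continue;            // separators do not affect the value
        unsigned digit;
        if (c >= '0' && c <= '9') digit = c - '0';
        else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
        else throw std::invalid_argument("not a digit");
        if (digit >= base) throw std::invalid_argument("digit out of range");
        value = value * base + digit;
    }
    return value;
}
```

With a single quote standing in for the separator, all five spellings from the example yield 1048576.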

Edit within paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s).

....
The integer and fraction parts
both consist of a sequence of decimal (base ten) digits,
with optional digit separators.
These separators are ignored when determining its value.
[Example:
The literals 1.602?176?565e-19
and 1.602176565e-19
have the same value.
—end example]
....

Edit paragraph 1 as follows.
Note that each ?
will be replaced by the actual chosen digit separator character(s)
and each ??
will be replaced by the actual chosen literal separator character(s).

If a token matches both user-defined-literal
and another literal kind,
it is treated as the latter.
[Example: 123_km and 123??km are user-defined-literals,
but 123?456 and 12LL
are integer-literals
—end example]
....
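
The part of this example that is already valid C++11 can be checked directly. In the sketch below the type Kilometres and the suffix _km are hypothetical; the forms using the placeholder separators are omitted because their spelling is not yet fixed.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical unit type and literal operator for the suffix _km.
struct Kilometres { unsigned long long value; };
constexpr Kilometres operator""_km(unsigned long long v) {
    return Kilometres{v};
}

// 123_km matches only user-defined-literal.
static_assert(std::is_same<decltype(123_km), Kilometres>::value,
              "123_km is a user-defined-literal");

// 12LL also matches the integer-literal grammar, so it is an integer-literal.
static_assert(std::is_same<decltype(12LL), long long>::value,
              "12LL is an integer-literal");
```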