C7 Fix

There is a serious security problem when people delete noncharacters. In
response, we are making one addition to C7 in 5.2 (the last, bolded,
bullet below), but we need to be stronger; we've seen more instances of
this pop up, and it has become clear that the clause in C7 permitting
deletion of noncharacters is very problematic.

Here is the current text:

C7
When a process purports not to modify the interpretation of a valid
coded character sequence, it shall make no change to that coded
character sequence other than the possible replacement of character
sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text.

Replacement or deletion of a character sequence that the process cannot or does not
interpret does modify the interpretation of the text.

Note that security problems can result if noncharacter code
points are removed from text received from external sources. For more
information, see Section 16.7, Noncharacters, and Unicode Technical
Report #36, “Unicode Security Considerations.”

...

Fundamentally, C7 is about meaning. The principle is that
"abc<cedilla>d" means the same as "ab<c-cedilla>d". C7 just
gives a more practical statement to that principle.
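
As a rough illustration of that principle, the sketch below uses Python's standard unicodedata module to compare two strings under canonical equivalence. The helper name canonically_equivalent is hypothetical, and U+FFFE is only an assumed stand-in for <nonchar1>.

```python
import unicodedata

# "abc<cedilla>d": 'c' followed by U+0327 COMBINING CEDILLA
decomposed = "abc\u0327d"
# "ab<c-cedilla>d": U+00E7 LATIN SMALL LETTER C WITH CEDILLA
precomposed = "ab\u00e7d"


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Different code point sequences, same interpretation.
same_meaning = canonically_equivalent(decomposed, precomposed)

# U+FFFE is an assumed stand-in for <nonchar1>; normalization does not
# remove it, so these two strings are not canonically equivalent.
nonchar_case = canonically_equivalent("abc/.\ufffe./d", "abc/../d")
```

Under these assumptions same_meaning is true and nonchar_case is false, which matches the distinction drawn in the text that follows.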

So far, so good. But the last clause fails that test - nobody says that:

"abc/.<nonchar1>./d" means the same as "abc/../d".

Fixing C7 doesn't mean that you can't remove <nonchar1> -- it just means
that when you remove it, you are changing the meaning, because the strings don't mean the same thing. If they did mean the same thing, then the logical implication would be that one could arbitrarily insert a <nonchar1> into an arbitrary string.
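
A minimal sketch of why deletion is a security problem, assuming a process that validates a path before a later stage strips noncharacters. All function names here are hypothetical, U+FFFE stands in for <nonchar1>, and the traversal check is deliberately simplistic.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacters: U+FDD0..U+FDEF and the last two
    code points of every plane (U+xxFFFE, U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def delete_noncharacters(text: str) -> str:
    """The behavior under discussion: silently drop noncharacters."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))


def has_parent_traversal(path: str) -> bool:
    """Simplistic check (an assumption for this sketch): reject any
    path that contains a '..' segment."""
    return ".." in path.split("/")


received = "abc/.\ufffe./d"          # "abc/.<nonchar1>./d"

# The validation stage sees no '..' segment and accepts the path.
passed_validation = not has_parent_traversal(received)

# A later stage deletes the noncharacter, producing "abc/../d".
cleaned = delete_noncharacters(received)
traversal_after_deletion = has_parent_traversal(cleaned)
```

Here the input passes validation, yet the string that results from deletion contains the very segment the check was meant to exclude.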

Note that we don't say that "abc/.<surrogate1>./d" means the
same as "abc/../d" either, nor should we: deletion there is just
as problematic (and of course insertion is awful). C7 says what makes a
difference in the interpretation of text, and the
presence or absence of noncharacters does make a difference. As with
private use characters, you may not be able to know what the
noncharacter means, but its presence or absence does make a difference,
especially for security.

We briefly discussed allowing replacement by FFFD in C7, but we can't
say that "abc/.<nonchar1>./d" means the same as
"abc/.<FFFD>./d", because that would imply the reverse as well.

The cleanest approach is to fix C7, and verify that we have
sufficient warnings against the use of noncharacters or surrogates in open
interchange, and warnings that it is a real problem to delete them on
input; a good alternative technique is to map to FFFD on input.
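
The mapping alternative could be sketched as follows. This is one possible input filter under stated assumptions, not a prescribed implementation: the names sanitize_input, is_noncharacter, and is_surrogate are hypothetical, and U+FFFE and U+D800 stand in for <nonchar1> and <surrogate1>.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """Surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def sanitize_input(text: str) -> str:
    """Map each noncharacter or lone surrogate to U+FFFD instead of
    deleting it, so neighboring characters never become adjacent."""
    return "".join(
        REPLACEMENT if is_noncharacter(ord(ch)) or is_surrogate(ord(ch)) else ch
        for ch in text
    )


mapped_nonchar = sanitize_input("abc/.\ufffe./d")
mapped_surrogate = sanitize_input("abc/.\ud800./d")

# No '..' segment appears, because the length of the string is preserved.
still_safe = ".." not in mapped_nonchar.split("/")
```

Mapping still changes the interpretation of the text, as noted above for FFFD; the point of the sketch is only that it does not splice the surrounding characters together.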

This would also require changes in the 3rd and 4th paragraphs in Section 16.7 for consistency.