L2/07-152
From: Doug Ewell
Date: 2007-05-09
Subject: Re: Unicode Security Exploit
In L2/07-116, Mark Davis proposes a deterministic mechanism for
processes that interpret Unicode code unit sequences to handle
ill-formed sequences. While I question whether implementers will
uniformly adopt this mechanism, which requires the decoder to "push
back" the first code point that identifies the sequence as invalid, it
is a well-defined mechanism that resolves an ambiguity in the Standard.
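
As an illustration only, the "push back" behavior can be sketched for UTF-8 in Python. This is my own minimal reading of the mechanism, not text from L2/07-116: the function name decode_utf8_pushback and the helper _second_byte_range are hypothetical, and I assume each ill-formed sequence is replaced by a single U+FFFD. The byte that reveals the sequence as invalid is not consumed; it is examined again as the possible start of a new sequence.

```python
REPLACEMENT = "\ufffd"


def _second_byte_range(lead):
    """Allowed range for the byte that follows a given lead byte.

    The narrowed ranges rule out overlong forms, surrogates, and
    values above U+10FFFF.
    """
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def decode_utf8_pushback(data: bytes) -> str:
    """Decode UTF-8, replacing each ill-formed sequence with U+FFFD.

    Assumption: the byte that exposes the error is "pushed back",
    i.e. excluded from the invalid sequence and decoded afresh.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII
            out.append(chr(lead))
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            need = 1
        elif 0xE0 <= lead <= 0xEF:
            need = 2
        elif 0xF0 <= lead <= 0xF4:
            need = 3
        else:                                # stray trail byte or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        cp = lead & (0x3F >> need)           # payload bits of the lead byte
        j = i + 1
        lo, hi = _second_byte_range(lead)
        ok = True
        for _ in range(need):
            if j >= n or not (lo <= data[j] <= hi):
                ok = False                   # data[j], if present, is pushed back
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
            lo, hi = 0x80, 0xBF
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j                                # resume at the pushed-back byte
    return "".join(out)
```

For example, the input bytes C2 41 decode to U+FFFD followed by "A": the 41 ends the invalid sequence but is itself decoded normally rather than swallowed.
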
L2/07-134, written by Kent Karlsson, proposes some changes to L2/07-116:
1. It reverses the proposal to exclude the first "good" code point from
the invalid sequences, instead leaving the interpretation up to the
implementation. Making the interpretation consistent is the main point
of L2/07-116. Either of the possible interpretations (include or
exclude), applied consistently, would be better than formally
establishing this as an implementation dependency.
2. It requires, as an alternative to aborting the interpretation
process, the replacement of invalid sequences by sequences of U+001A
rather than U+FFFD. The character U+001A, originally defined in ASCII
as SUBSTITUTE, has led a double life for three decades as an end-of-file
character in CP/M and MS-DOS systems, and there are still text processes
today that stop reading a text stream upon encountering this character.
The use of U+FFFD is preferable in this situation and is specifically
mentioned in the conformance requirements.
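
The practical risk of U+001A can be shown with a short Python sketch. The function legacy_read is hypothetical; it merely models a text process that follows the CP/M and MS-DOS end-of-file convention, and the U+001A replacement policy is simulated here by substituting for the U+FFFD that Python's own decoder produces.

```python
SUB = "\u001a"          # ASCII SUBSTITUTE, also the legacy end-of-file marker
REPLACEMENT = "\ufffd"  # REPLACEMENT CHARACTER


def legacy_read(text: str) -> str:
    """Hypothetical reader that stops at the first U+001A, as some
    CP/M- and MS-DOS-derived text processes do."""
    head, _, _ = text.partition(SUB)
    return head


raw = b"abc\xffdef"     # 0xFF can never appear in well-formed UTF-8

# Python's built-in decoder substitutes U+FFFD when errors="replace".
with_fffd = raw.decode("utf-8", errors="replace")

# Simulate the alternative policy of substituting U+001A instead.
with_sub = with_fffd.replace(REPLACEMENT, SUB)

survives_fffd = legacy_read(with_fffd)  # all of the text is still read
survives_sub = legacy_read(with_sub)    # everything after the error is lost
```

With U+FFFD the reader still sees the text on both sides of the error; with U+001A it silently drops "def" and anything that follows.
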
I recommend that these two provisions of L2/07-134 be rejected by the
UTC and the corresponding provisions of L2/07-116 be approved.
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages