I noticed the specification usually treats null characters (U+0000) by
replacing them with the replacement character (U+FFFD). In the other
cases they are ignored by the tree construction stage when the insertion
mode is 'in body', 'in table text', or 'in select'.
Would it not be simpler and more consistent to just have the Input
Stream Preprocessor replace all null characters with the replacement
character? I don't see the point in filling the specification with
null-character handling just so that the character can sometimes be
ignored instead of emitted as a replacement character. Character
references are already handled this way: &#00 is replaced with U+FFFD
(i.e. it becomes the replacement character).
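To illustrate, the proposed change amounts to one substitution applied
before the tokenizer ever sees the stream. This is only a sketch; the
function name is mine and not from the specification:

```python
REPLACEMENT_CHAR = "\uFFFD"

def preprocess_input_stream(text: str) -> str:
    """Hypothetical preprocessor step: replace every U+0000 with U+FFFD
    so the tokenizer and tree construction never see null characters."""
    return text.replace("\u0000", REPLACEMENT_CHAR)
```

With this in place, all the per-state null-character rules in the
tokenizer and tree construction stages would collapse to the ordinary
handling of U+FFFD.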
If the Input Stream Preprocessor converted them, it would result in
minimal changes to output, as I believe most HTML documents in the
wild do not contain null characters.
On a similar note, why are the other invalid Unicode characters,
U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to
U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF,
U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE,
U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF,
U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE,
U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF,
U+10FFFE, and U+10FFFF,
allowed as part of the input stream to the tokenizer and tree
construction at all, rather than being dealt with by the preprocessor
as well?
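That whole list compresses to a few range checks, since the code points
ending in FFFE or FFFF are just the last two code points of each Unicode
plane. A sketch of such a test (my own helper, not spec text):

```python
def is_invalid_char(cp: int) -> bool:
    """Check whether a code point is in the list of invalid characters
    above: C0 controls other than tab/LF/CR/FF, DEL and C1 controls,
    U+FDD0..U+FDEF, and the plane-final noncharacters."""
    if 0x0001 <= cp <= 0x0008 or cp == 0x000B:
        return True
    if 0x000E <= cp <= 0x001F or 0x007F <= cp <= 0x009F:
        return True
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Noncharacters: the last two code points of each of the 17 planes.
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return True
    return False
```

If the preprocessor filtered (or replaced) these as well, the later
stages could assume a clean stream instead of each carrying its own
error handling.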