On Tue, 7 Aug 2007, Cristina Fiorentini wrote:
> OK, excuse me,
> the address of one of my documents is
> http://ww4.comune.fe.it/scuole/index.phtml?id=259 and the current validator,
> for some days now, has not reported an error for the apostrophe character " ' ".
Cristina,
thank you for the information. I'm taking the liberty of Cc'ing the
validator list, since you seem to have encountered a problem in the
current version of the validator. It's probably not a bug but might need
some clarification in the documentation.
> I declare my pages XHTML 1.0 Strict - iso-8859-1
I can reproduce the problem in a trivial test document
http://www.cs.tut.fi/~jkorpela/test/test.htmlx
that contains octet 146 (decimal), which is not reported by the
W3C validator but is reported by the WDG validator. If I test with HTML
4.01, such an octet is reported as an error, as before.
The problem appears both for XHTML 1.0 documents served as text/html and
for those served as application/xhtml+xml.
I'm afraid this takes us deep into character problems. And I'm not sure I
understand the issue well enough (even though I _should_; I've devoted
several pages to the discussion of characters in markup languages in my
book "Unicode Explained"...). But this is how things seem to be:
When you have octet 146 in a document declared to be iso-8859-1 encoded,
it is interpreted as denoting a control code in the C1 Controls area. The
meanings of those control codes have not been defined in the ISO 8859-1
standard, but they correspond to the C1 Controls area of Unicode, so that
e.g. 146 decimal (92 hexadecimal) maps to the Unicode character U+0092.
Such characters (code positions) are forbidden in HTML 4.01 (and in any
other pre-XHTML version of HTML), so the validator correctly reports them
as erroneous ("non SGML character"). However, in XML, and hence in XHTML,
C1 Controls like U+0092 are allowed, though discouraged. Formally,
therefore, they cannot be reported as errors.
> Can I declare my pages encoding as windows-1252?
Yes. (This changes the picture, since now e.g. octet 146 is interpreted
according to the windows-1252 encoding, where it denotes a printable
character.)
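To see which octets in a page are affected by such a change of declaration, one could scan the file for octets in the range 128 to 159 and look at what each denotes in windows-1252. The following Python sketch does that; the function name c1_range_octets and the sample bytes are made up for illustration, and "cp1252" is the name of Python's windows-1252 codec.

```python
# Sketch: list octets in the range 0x80-0x9F and show what they would
# denote if the document were declared as windows-1252.
# c1_range_octets is a hypothetical helper, and `sample` is made-up data.

def c1_range_octets(data: bytes):
    """Yield (offset, octet, windows-1252 character or None)."""
    for offset, octet in enumerate(data):
        if 0x80 <= octet <= 0x9F:
            try:
                ch = bytes([octet]).decode("cp1252")
            except UnicodeDecodeError:
                ch = None  # octet is unassigned in windows-1252 (e.g. 0x81)
            yield offset, octet, ch

sample = b"<p>It\x92s a test</p>"
for offset, octet, ch in c1_range_octets(sample):
    meaning = f"U+{ord(ch):04X}" if ch is not None else "unassigned"
    print(offset, octet, meaning)
# prints: 5 146 U+2019
```

Here octet 146 comes out as U+2019, the right single quotation mark, which is the "apostrophe" that word processors typically produce.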
> It's not a problem for accessibility?
Hardly. Windows-1252 is widely supported by browsers, even on platforms
other than Windows, simply because it is widely used on web pages
(though often with declarations that claim that the encoding is
iso-8859-1).
--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/