Below are my comments on draft-iab-idn-encoding-00.txt
(http://tools.ietf.org/html/draft-iab-idn-encoding-00).
I have cc'ed the IAB list (because this is an IAB document), the list of
the idnabis WG as the most directly affected WG, and the public-iri
list, where discussions about the IRI specification are held. Everybody,
please reduce cross-posting for contributions that are not of interest
to all three lists.
In general, I think the document is easy to read and understand.
Mentioning ISO-2022-JP for encoding Japanese domain names raises some
suspicion. ISO-2022-JP may well be (or have been) used in the DNS or a
similar system, but such use would be atypical, and should be documented
by a reference. Based on the general "division of labor" of the three
classical Japanese encodings (ISO-2022-JP, EUC-JP, Shift_JIS), one would
expect EUC-JP or Shift_JIS rather than ISO-2022-JP in such a case.
[Among the three, ISO-2022-JP makes it easiest to explain the "heuristic
encoding detection" scenario described at the end of Section 1.1. But
without a reference, it may look to some as if ISO-2022-JP were a made-up
example.]
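To illustrate why ISO-2022-JP is the easy case for heuristic detection, here is a minimal Python sketch of my own (not taken from the draft). It assumes that "looks like ISO-2022-JP" means "7-bit only, and contains an escape sequence that switches to JIS X 0208"; the function name is hypothetical.

```python
ESC = b"\x1b"

def looks_like_iso2022jp(data: bytes) -> bool:
    """Heuristic only: 7-bit bytes plus an ISO-2022-JP escape into JIS X 0208."""
    seven_bit = all(b < 0x80 for b in data)
    # ESC $ B (JIS X 0208-1983) and ESC $ @ (JIS C 6226-1978) announce kanji.
    has_kanji_escape = (ESC + b"$B") in data or (ESC + b"$@") in data
    return seven_bit and has_kanji_escape

sample = "日本語".encode("iso2022_jp")
print(looks_like_iso2022jp(sample))             # ISO-2022-JP bytes
print(looks_like_iso2022jp(b"www.example.jp"))  # plain ASCII name
```

Because ISO-2022-JP never sets the 8th bit, the "8th bit set" test from Section 1.1 says nothing about it; only the escape sequences give it away.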
For the bulleted list at the end of Section 1.1, it should be pointed
out that UTF-8 can be detected, and distinguished from other 8-bit
encodings, with much higher precision than just "a byte in the string
has the 8th bit set". For details, please see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
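As a rough illustration of the difference, the following Python sketch (my own, with hypothetical function names) compares the "8th bit set" test with a strict UTF-8 validity check. It relies on Python's built-in strict UTF-8 decoder rather than on the exact method described in the paper above.

```python
def has_high_bit(data: bytes) -> bool:
    """The coarse test: does any byte have the 8th bit set?"""
    return any(b >= 0x80 for b in data)

def is_valid_utf8(data: bytes) -> bool:
    """The stricter test: is the whole byte string well-formed UTF-8?"""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True

name = "日本語"
for codec in ("utf-8", "euc_jp", "shift_jis"):
    raw = name.encode(codec)
    print(codec, has_high_bit(raw), is_valid_utf8(raw))
```

All three byte strings have the 8th bit set, but only the UTF-8 one passes the validity check. Very short strings in other encodings can still be valid UTF-8 by coincidence, so this remains a heuristic, just a much more precise one.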
The heuristic for punycode that is given in Section 1.1 is "starts with
xn--". However, at the level of getaddrinfo, we are dealing with domain
names, not single labels, and something like www.xn--foo.jp should
definitely be treated as punycode even though the name as a whole doesn't
start with xn--; the check has to be applied to each label.
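A per-label version of the check could be sketched in Python as follows. This is only my illustration; the function name is hypothetical, and the prefix comparison is assumed to be case-insensitive.

```python
def has_ace_label(domain: str) -> bool:
    """True if any dot-separated label carries the xn-- prefix."""
    return any(label.lower().startswith("xn--") for label in domain.split("."))

name = "www.xn--foo.jp"
print(name.startswith("xn--"))  # whole-name test from Section 1.1
print(has_ace_label(name))      # per-label test
```

The whole-name test misses this example, while the per-label test catches it.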
The solution that the document seems to be pushing most is heuristic
detection, i.e. an API where strings in different encodings are fed in
and the API sorts things out heuristically, converting if necessary. To
some extent, this may be an unavoidable evil, but it would be good if
the document were pushing more for clear encoding identification (for
which I think GetAddrInfoW() (UTF-16) would be an example).
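To make the contrast concrete, here is a small Python sketch of an interface where the caller declares the encoding instead of leaving it to be guessed. The function name and signature are hypothetical, and the conversion uses Python's built-in "idna" codec (which implements IDNA2003) purely as a stand-in.

```python
def to_ascii_name(raw: bytes, encoding: str) -> str:
    """Decode with the caller-declared encoding, then convert to ASCII form."""
    text = raw.decode(encoding)  # a wrong declaration typically fails here
    return text.encode("idna").decode("ascii")

name = "日本語.jp"
for enc in ("utf-8", "utf-16-le", "euc_jp"):
    print(enc, to_ascii_name(name.encode(enc), enc))
```

With the encoding stated explicitly, all three inputs lead to the same ASCII name and no heuristics are involved.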
It may be a good idea to also look into the issue of escaped forms of
domain names being fed into resolver APIs. One form of escaping is
(UTF-8-based) %-encoding in URIs (and IRIs), which is allowed in URIs
according to RFC 3986, is the only way to encode non-ASCII in the host
part of a URI where punycode isn't appropriate, and may be the result
of a conversion from an IRI to a URI. For further background and
discussion, please see
http://lists.w3.org/Archives/Public/public-iri/2009Aug/0012.html
and http://lists.w3.org/Archives/Public/public-iri/2009Aug/0024.html and
the followup discussion.
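As an illustration of what handling this form would involve, the following Python sketch (mine, with a hypothetical function name) undoes UTF-8-based %-encoding in a host name and then converts the result to ASCII form. It assumes the escapes really are UTF-8, and it uses Python's built-in "idna" codec (IDNA2003) only as a stand-in for the conversion step.

```python
from urllib.parse import unquote

def host_from_percent_encoded(host: str) -> str:
    """Undo UTF-8-based %-encoding, then convert the host to ASCII form."""
    # errors="strict" makes escapes that are not valid UTF-8 raise an error
    # instead of being silently replaced.
    decoded = unquote(host, encoding="utf-8", errors="strict")
    return decoded.encode("idna").decode("ascii")

escaped = "www.%E6%97%A5%E6%9C%AC%E8%AA%9E.jp"
print(unquote(escaped))                    # the unescaped host name
print(host_from_percent_encoded(escaped))  # its ASCII form
```

A resolver API that receives the escaped string without such a step would treat it as a plain ASCII name containing "%" characters.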
Another potential kind of escaping is HTML/XML numeric character
references (of the form &#xABCD;), although I expect them to be less of
a problem because they are used higher up in the application and usually
removed early on.
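For completeness, a short Python sketch of how an application would typically remove such references before the name gets anywhere near a resolver. It uses the standard library's html.unescape; the sample string is my own.

```python
import html

escaped = "&#x65E5;&#x672C;&#x8A9E;.jp"
unescaped = html.unescape(escaped)  # resolves numeric character references
print(unescaped)
```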
Regards, Martin.
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp