IDN in Google Chrome

Background

Back in the day, hostnames could only consist of the letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called punycode.
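As a rough illustration of this transformation, the sketch below uses the "idna" codec from Python's standard library. That codec follows the older IDNA 2003 rules and is not what Google Chrome itself runs; the variable names are chosen here only for illustration.

```python
# Round-trip a hostname through Python's built-in "idna" codec (IDNA 2003).
unicode_host = "\u00f6bb.at"  # "öbb.at", written with an escape to avoid encoding issues

# Each dot-separated component is converted on its own;
# components with non-ASCII characters receive the "xn--" prefix.
ascii_host = unicode_host.encode("idna").decode("ascii")
print(ascii_host)  # xn--bb-eka.at

# Decoding reverses the transformation and yields the Unicode form again.
round_tripped = ascii_host.encode("ascii").decode("idna")
print(round_tripped == unicode_host)  # True
```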

Ideally, user agents could always display the Unicode version of a hostname. However, different characters from different languages can look very similar, and this can make phishing attacks possible. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (http://xn--eby-7cd.com/), which would easily be mistaken for http://ebay.com. This is called a homograph attack.
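The following sketch shows why the two hostnames are different to a machine even though they look alike on screen. It relies only on Python's standard library; the names genuine and spoofed are illustrative.

```python
import unicodedata

genuine = "ebay.com"
spoofed = "eb\u0430y.com"  # U+0430 is the Cyrillic letter that resembles Latin "a"

# The strings render almost identically but compare as different.
print(genuine == spoofed)  # False

# The third character is where they differ.
for ch in (genuine[2], spoofed[2]):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Only the spoofed name needs a punycode form, which exposes the trick.
spoofed_ascii = spoofed.encode("idna").decode("ascii")
print(spoofed_ascii)  # xn--eby-7cd.com
```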

In a perfect world, domain registrars would not allow such nefarious domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. For some TLDs that are meant to be international, this would be nontrivial to do (e.g., .com).

As a result, all browsers try to protect against homograph attacks by displaying punycode instead of the original IDN if the hostname does not fulfill certain properties. They try to do this in a way that allows IDN to be shown for valid hostnames, but protects against phishing.

Google Chrome's IDN policy

Google Chrome decides separately for each component of a hostname whether to show IDN or punycode. To decide if a component should be shown in IDN form, Google Chrome uses an algorithm that depends on the languages that the user claims to understand. On Windows and Linux, these languages can be configured in Google Chrome's Fonts and Languages dialog. On Mac OS X, they are currently derived from the system language. The algorithm is:

The characters belonging to a given language are specified by the Unicode Consortium's CLDR dataset (the exemplar characters of every language can be viewed online, for example the set for Japanese). In Google Chrome, this is implemented by the ICU function ulocdata_getExemplarSet(), with the characters a-z added for whitelisted languages whose glyphs can't be confused with a-z. The whitelisted languages currently are Chinese (zh), Japanese (ja), and Korean (ko).

Consequences / Examples

Google Chrome will display IDN for components of a hostname consisting solely of characters that belong to one of the languages selected in the language settings—even on .com and .net domains, not only in domains native to that language. For example, http://россия.net will be displayed in IDN form if you claim to speak Russian or another language written in Cyrillic, and as punycode otherwise. Likewise, http://私の団体も.jp/ will be shown in IDN form only if you claim to speak Japanese in Google Chrome's options.

Google Chrome will always display punycode for components of a hostname that contain characters not in the main exemplar character set of any language. For example, http://?.net/ will always be displayed as punycode in Google Chrome.

Google Chrome will always display punycode for components that mix letters from multiple languages. For example, there is not a single language that contains all characters found in http://???????ē???????m????????t??.de, so this will be shown as punycode. Likewise, http://ebаy.com (with a Cyrillic "а") will always be shown as punycode, even if both English and Russian are in the accepted languages. This is true even if the domain is below a TLD whose registry takes care to protect against homograph attacks.
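The consequences above can be approximated with a minimal sketch. It is not Google Chrome's implementation: the EXEMPLARS table is a tiny hand-written stand-in for the CLDR exemplar sets, the function names are hypothetical, and the a-z addition for Chinese, Japanese, and Korean is left out for brevity.

```python
import string

# Hypothetical stand-ins for CLDR exemplar sets; real data would come from CLDR/ICU.
EXEMPLARS = {
    "en": set(string.ascii_lowercase),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}

def show_component_as_idn(component, user_languages):
    """Return True if a single selected language covers every character."""
    chars = set(component)
    return any(chars <= EXEMPLARS.get(lang, set()) for lang in user_languages)

def display_host(host, user_languages):
    """Decide for each component separately between Unicode and punycode."""
    shown = []
    for component in host.split("."):
        if show_component_as_idn(component, user_languages):
            shown.append(component)
        else:
            # Pure ASCII components pass through the codec unchanged.
            shown.append(component.encode("idna").decode("ascii"))
    return ".".join(shown)

print(display_host("россия.net", ["ru"]))              # stays in Unicode form
print(display_host("россия.net", ["en"]))              # falls back to punycode
print(display_host("eb\u0430y.com", ["en", "ru"]))     # mixed letters: punycode
```

Because the check asks for one language that covers the whole component, selecting both English and Russian still does not rescue the mixed-letter hostname.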

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to what Google Chrome does.

Firefox

Firefox has a whitelist of TLDs whose registrars take care that no homographically confusable domains can be registered. URLs under such top-level domains are shown as Unicode unless they contain one of several blacklisted characters. For TLDs that are not whitelisted (e.g., .com), Firefox always displays punycode.

Opera

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.