IDN in Google Chrome

Background

Many years ago, domains could only consist of the Latin letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very similar. We’ve seen reports of proof-of-concept attacks. These are called homograph attacks. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (using Cyrillic "а"), which could be confused for http://ebay.com. This is a limitation of how URLs are displayed in browsers in general, not a specific bug in Chrome.
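To make the lookalike problem concrete, the following is a small illustrative sketch (not taken from Chrome) that uses only the Python standard library. The variable names are chosen for this example. It shows that the two hostnames above compare as different strings even though they render almost identically, and that the spoofed one becomes an "xn--" name once encoded.

```python
import unicodedata

latin_host = "ebay.com"
# U+0430 CYRILLIC SMALL LETTER A stands in for the Latin "a".
spoof_host = "eb\u0430y.com"

# The strings look alike on screen but are different hostnames.
print(latin_host == spoof_host)  # False

# Listing the code point names of the first label exposes the substitution.
for ch in spoof_host.split(".")[0]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Python's built-in "idna" codec produces the ASCII form sent to DNS;
# the spoofed host turns into an "xn--" name rather than "ebay.com".
spoof_ace = spoof_host.encode("idna")
print(spoof_ace)
```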

In a perfect world, domain registrars would not allow these confusable domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. As a result, all browsers try to protect against homograph attacks by displaying punycode (which looks like "xn--...") instead of the original IDN, according to an IDN policy.

This is a challenging problem space. Chrome has a global user base of billions of people, many of whom are not viewing URLs with Latin letters. We want to prevent confusion, while ensuring that users across languages have a great experience in Chrome. Displaying either punycode or a visible security warning for too broad a set of URLs would hurt web usability for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN policies in a way that allows IDN to be shown for valid domains, but protects against confusable homograph attacks.

Google Safe Browsing continues to help protect over two billion devices every day by showing warnings to users when they attempt to navigate to dangerous or deceptive sites or download dangerous files. Password managers (like Google Smart Lock) continue to remember which domain each saved login belongs to, and won’t automatically fill a password into any domain other than the exact one it was saved for.

How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters beyond ASCII to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called ASCII Compatible Encoding (ACE). It is made up of the four-character prefix (xn--) followed by the punycode representation of the Unicode characters.
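The transformation can be reproduced with Python's built-in "idna" and "punycode" codecs. This is only a sketch of the encoding step: Python's "idna" codec implements the older IDNA 2003 rules, which may differ from what a given browser applies, and the variable names are chosen for this example.

```python
# "\u00f6" is the letter ö, written as an escape to avoid normalization surprises.
unicode_host = "\u00f6bb.at"

# Encode each label to its ASCII Compatible Encoding (ACE) form.
ace_host = unicode_host.encode("idna").decode("ascii")
print(ace_host)  # xn--bb-eka.at

# The part after the "xn--" prefix is the punycode of the label itself.
label_punycode = "\u00f6bb".encode("punycode")
print(label_punycode)  # b'bb-eka'

# Decoding the ACE form recovers the original Unicode hostname.
round_trip = ace_host.encode("ascii").decode("idna")
print(round_trip == unicode_host)  # True
```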

April 2017 update

Specific instances of IDN homograph attacks have been reported to Chrome, and we continually update our IDN policy to protect against these attacks. One specific instance of this general issue was reported to Chrome security on Jan 20. It was marked a medium-severity bug. A fix landed on March 23. The researcher whose report led to the fix was awarded $2,000 under Chrome's Vulnerability Reward Program. That fix will be released in Chrome 58, which has a stable release around the end of April.

This fix is an attempt to balance the needs of our international userbase while protecting against confusable homograph attacks. The fix uses punycode for domain names that are made entirely of Latin lookalike Cyrillic letters when the top-level domain is not an internationalized domain name, meaning that the check only applies to top-level domains like "com", "net", and "uk". We’re working on additional fixes, for example for confusables within a single script, such as “l” (lowercase L), which could be confused with “ı” (U+0131, small dotless i). We will keep this article updated with our current IDN policy below.
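The check described for this fix can be sketched roughly as follows. This is an illustration under stated assumptions, not Chrome's implementation: LATIN_LOOKALIKE_CYRILLIC is a hypothetical, partial set of letters picked for this example, and both function names are made up here.

```python
# Hypothetical, partial set of Cyrillic letters that resemble Latin letters:
# а с е о р х у ѕ і ј (an assumption for illustration, not Chrome's list).
LATIN_LOOKALIKE_CYRILLIC = set(
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443\u0455\u0456\u0458"
)

def is_whole_label_lookalike(label: str) -> bool:
    """True if every character of the label is a Latin-lookalike Cyrillic letter."""
    return bool(label) and all(ch in LATIN_LOOKALIKE_CYRILLIC for ch in label)

def should_show_punycode(hostname: str) -> bool:
    """Apply the check only when the top-level domain is plain ASCII."""
    *labels, tld = hostname.lower().split(".")
    if not tld.isascii():
        return False  # internationalized TLD: this particular check is skipped
    return any(is_whole_label_lookalike(label) for label in labels)

# "сосо" written entirely in Cyrillic under an ASCII TLD is flagged.
print(should_show_punycode("\u0441\u043e\u0441\u043e.com"))        # True
# The same label under the Cyrillic TLD "рф" is not flagged by this check.
print(should_show_punycode("\u0441\u043e\u0441\u043e.\u0440\u0444"))  # False
# A genuinely Latin label is not flagged.
print(should_show_punycode("coco.com"))                             # False
```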

Google Chrome decides if it should show Unicode or punycode for each domain label (component) of a hostname separately. To decide if a component should be shown in Unicode, Google Chrome uses the following algorithm:

If the component contains either U+0338 or U+2027, punycode is displayed.

If the component uses characters drawn from multiple scripts, it is subject to a script mixing check based on the "Moderately Restrictive" profile of UTS 39 with an additional restriction on Latin. A component that fails the check is shown in punycode. The check enforces the following rules:

Latin, Cyrillic or Greek characters cannot be mixed with each other.

Latin characters in the ASCII range can be mixed with characters in another script as long as that script is neither Greek nor Cyrillic.

Han (CJK Ideographs) can be mixed with Bopomofo.

Han can be mixed with Hiragana and Katakana.

Han can be mixed with Korean Hangul.

If two or more numbering systems (e.g. European digits + Bengali digits) are mixed, punycode is shown.

If there are any invisible characters (e.g. a sequence of the same combining mark or a sequence of Kana combining marks), punycode is shown.
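The rules above can be approximated in a short sketch. It is not Chrome's code. The Python standard library has no script property, so script_of guesses the script from the first word of each character's Unicode name, which is an assumption that works for common letters but not for every code point. All function and constant names are invented for this example.

```python
import unicodedata

FORBIDDEN = {"\u0338", "\u2027"}
KANA_MARKS = {"\u3099", "\u309a"}  # combining (semi-)voiced sound marks
HAN_COMBOS = [                     # script groups that may share a label
    {"HAN", "BOPOMOFO"},
    {"HAN", "HIRAGANA", "KATAKANA"},
    {"HAN", "HANGUL"},
]

def script_of(ch):
    """Rough script guess from the Unicode name; None for non-letters."""
    if not unicodedata.category(ch).startswith("L"):
        return None
    first = unicodedata.name(ch, "").split(" ")[0]
    return "HAN" if first == "CJK" else (first or None)

def mixing_allowed(label):
    scripts = {s for s in map(script_of, label) if s}
    if len(scripts) <= 1:
        return True
    if len(scripts & {"LATIN", "CYRILLIC", "GREEK"}) > 1:
        return False
    if "LATIN" in scripts:
        # Only ASCII Latin letters may be mixed with another script.
        if any(script_of(c) == "LATIN" and not c.isascii() for c in label):
            return False
        scripts.discard("LATIN")
    return len(scripts) == 1 or any(scripts <= combo for combo in HAN_COMBOS)

def show_unicode(label):
    """Return True to display the label in Unicode, False for punycode."""
    if FORBIDDEN & set(label) or not mixing_allowed(label):
        return False
    # Each decimal digit minus its value gives the zero of its numbering system.
    zeros = {ord(c) - unicodedata.decimal(c)
             for c in label if unicodedata.category(c) == "Nd"}
    if len(zeros) > 1:
        return False
    for a, b in zip(label, label[1:]):
        if a == b and unicodedata.category(a).startswith("M"):
            return False  # the same combining mark repeated
        if a in KANA_MARKS and b in KANA_MARKS:
            return False  # a sequence of Kana combining marks
    return True

print(show_unicode("\u00f6bb"))           # True: a single script (Latin)
print(show_unicode("eb\u0430y"))          # False: Latin mixed with Cyrillic
print(show_unicode("abc\u4e2d\u6587"))    # True: ASCII Latin mixed with Han
print(show_unicode("12\u09e7"))           # False: European and Bengali digits
print(show_unicode("a\u0301\u0301"))      # False: repeated combining mark
```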

Consequences / Examples

[The old content here was completely inaccurate and has been removed. TODO: add examples of the above]

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to what Google Chrome does.

Opera

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.