IPv6 dual-stack client loss in Norway

With the kind assistance of two of my customers,
A-Pressen Digitale Medier and
VG Multimedia, I've been able to measure end
user behaviour towards dual-stacked web sites. APDM and VG are both
interested in making their content available over IPv6, but wanted first to
make sure that this did not cause any unwanted consequences. Both APDM and
VG are Norwegian-language online news publishers - their users are completely
ordinary and non-technical.

The primary purpose of the measurement is therefore to determine to what
extent we lose expected HTTP accesses in the case where the end user's
web browser is given the choice to access the content via either IPv4 or
IPv6 (from a dual-stack hostname, that is), compared to the situation
where the web browser has only one choice - IPv4. The assumption is
that, with all else equal, a larger loss of accesses to the dual-stack
hostname indicates that end-users/clients are having some
kind of difficulty accessing content available over both IPv4 and IPv6.
I use the term client loss when referring to this unexpected
additional loss of accesses from the dual-stack hostname compared to
that of the IPv4-only hostname. (I also describe the full setup
and calculations used further down.)

The secondary purpose of the measurement is to determine why I observe
client loss - what are the underlying causes? The answer to that appears to be
old versions of the Opera web browser and
Mac OS X. This is because they prefer
transitional IPv6 connectivity (6to4, Teredo) above more reliable IPv4 in
certain cases, and this makes them less likely to succeed in contacting a
dual-stack web server than a single-stack IPv4 one.

UPDATE 2010-12-21: Today, we deployed IPv6 on both APDM's and
VG's sites. This means that from now on, the brokenness percentage will
likely be significantly understated, as broken users will be predisposed
not to reach the measurement rig in the first place. The brokenness
percentage in the period 2010-12-14 to 2010-12-20 was 0.024%. (I've saved
a snapshot of how this page looked on that day.)

UPDATE 2011-05-09: Last week (the 2nd to the 8th of May) we turned off
IPv6 in order to get a new (and final) brokenness measurement. The result
was a brokenness percentage of 0.015%. This result was presented at the
IPv6-Kongress; the slide deck is available
here.

Current status

The first graph shows the current overall client loss, while the second
one shows a breakdown of the IPv6 traffic I see to the dualstack host.

The Mac OS X problem

Mac OS X has a problem in that, in versions older than 10.6.5, it will
prefer 6to4-based IPv6 over IPv4. That is very unfortunate, as 6to4 is
much less reliable than IPv4. Most of the 6to4 traffic I see from OS X
hosts uses EUI-64 derived IPv6 addresses, indicating that some other
device in the end user's network is performing the 6to4 tunneling. I've
made an ASCII art illustration of such
a network. The following numbers and graphs are intended to show how the
situation will look when all Mac OS X users have upgraded to version
10.6.5 or above. Unfortunately, the patch is only installable for users
that are already running 10.6 "Snow Leopard"; for the users of 10.5 "Leopard"
and 10.4 "Tiger", no patch is available.
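
Both patterns mentioned above - the 6to4 prefix and the EUI-64 marker in the interface identifier - are mechanical enough to check programmatically. The following is a small Python sketch (an illustration only, not the scripts actually used for these graphs) that classifies an IPv6 source address on both counts:

```python
import ipaddress

def classify(addr: str) -> dict:
    """Classify an IPv6 source address: is it 6to4 (the 2002::/16
    prefix), and does its interface identifier look EUI-64 derived
    (the 0xfffe marker embedded between the two MAC address halves)?"""
    ip = ipaddress.IPv6Address(addr)
    iid = int(ip) & ((1 << 64) - 1)  # the low 64 bits (interface ID)
    return {
        "6to4": ip in ipaddress.IPv6Network("2002::/16"),
        "eui64": ((iid >> 24) & 0xFFFF) == 0xFFFE,
    }
```

A 6to4 address with an EUI-64 interface identifier - for example the made-up 2002:c000:204:0:211:22ff:fe33:4455 - would be flagged on both counts, while the dualstack server's own address 2a02:c0:1010:2::2 is flagged on neither.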

The graphs are generated by simply removing all log lines that contain "Mac
OS X" in the User-Agent field prior to running the calculations.
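
As a sketch, that filtering step could look like the following (assuming combined-format log lines; the real processing scripts are not shown on this page):

```python
def drop_macosx(log_lines):
    """Drop every access-log line whose User-Agent field mentions
    "Mac OS X", leaving the rest for the usual client-loss
    calculation. A plain substring test suffices, since "Mac OS X"
    only ever appears in the User-Agent part of such a line."""
    return [line for line in log_lines if "Mac OS X" not in line]

log = [
    '192.0.2.1 - - "GET /1x1.png" 200 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8)"',
    '198.51.100.2 - - "GET /1x1.png" 200 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]
filtered = drop_macosx(log)  # only the Windows hit remains
```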

With Mac OS X out of the picture, the amount of 6to4 traffic drops
significantly (and with it the amount of IPv6 traffic in total).

To further emphasise Mac OS X's IPv6 problems, I've made the following
graphs and numbers showing the client loss amongst OS X-based clients
only. Client loss is much, much higher than on the internet in general,
and so is the use of 6to4.

The following graph shows the distribution of Mac OS X versions I see in my logs. Regarding the different versions shown:

10.7 (Lion): Includes all known brokenness fixes as well as an implementation of «Happy Eyeballs» (details here).

10.6.8+ (Snow Leopard): Includes fixes for two possible causes for brokenness (details here and here).

10.6.5+ (Snow Leopard): First version to prefer the use of IPv4 over 6to4 (details here).

10.6.0+ (Snow Leopard): Has none of the fixes mentioned above, but may be upgraded to 10.6.8+ for free, or to 10.7 Lion for a price.

10.6.? (Snow Leopard): Hits from hosts running 10.6 that do not disclose the Mac OS X patch level in the User-Agent string.

10.4/10.5 (Tiger/Leopard): Have no free upgrade to improved versions available, but users may purchase an upgrade to 10.6 or 10.7 if the hardware is x86-based.

In a perfect world...

These numbers and graphs show my idea of an ideal situation, where all
Opera users have upgraded to 10.50 or later, and all Mac OS X users have
upgraded to 10.6.5 or later. The client loss number is very close to 0% at this
point - I believe that in this situation, dualstack client loss would
no longer be a concern for most content providers.

The measurement setup

dualstack.cust.no. 5 IN A 87.238.40.2
dualstack.cust.no. 5 IN AAAA 2a02:c0:1010:2::2
ipv4-only.cust.no. 5 IN A 87.238.40.3
dualstack-exp.cust.no. 5 IN A 87.238.40.4

Everything is hosted on the same web server. The MTU is set to 1280 and
the TCP MSS to 1220 for IPv6 and 1240 for IPv4. The PNG is small enough that
the entire HTTP response fits comfortably inside a single packet anyway - the
only things I've seen require fragmentation are HTTP requests with
very long headers.

When determining the client loss, I simply parse the HTTP access logs on the
test server, and count the number of hits to the linkgen.php script (N)
and to the 1x1 PNG via the ipv4-only hostname (Ns) and the dualstack
hostname (Nd). Hits that re-use an already seen ID string (on the same
hostname) are discarded. With all else equal, the assumption is that we should
see an identical number of 1x1 PNG hits on the dualstack and the ipv4-only
hostnames. If there is a difference, it is considered client loss,
calculated as (Ns - Nd) / Ns. So for instance, if Ns is 10,000 and Nd is
9,990, the client loss is 0.1%.
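
As an illustration, the deduplication and the loss calculation could be sketched like this (the input representation is an assumption - the actual log-parsing scripts are not shown here):

```python
def client_loss(hits):
    """Compute dual-stack client loss from (hostname, id_string)
    pairs, one per 1x1 PNG hit. Repeated ID strings on the same
    hostname are discarded; the loss is then (Ns - Nd) / Ns."""
    seen = set()
    counts = {"ipv4-only.cust.no": 0, "dualstack.cust.no": 0}
    for host, id_string in hits:
        if host in counts and (host, id_string) not in seen:
            seen.add((host, id_string))
            counts[host] += 1
    ns = counts["ipv4-only.cust.no"]
    nd = counts["dualstack.cust.no"]
    return (ns - nd) / ns

# 10,000 unique hits to the ipv4-only PNG, but only 9,990 dualstack ones:
hits = [("ipv4-only.cust.no", i) for i in range(10000)]
hits += [("dualstack.cust.no", i) for i in range(9990)]
print(client_loss(hits))  # 0.001, i.e. a client loss of 0.1%
```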

The 10 second timeout

Since the IFRAME and PNGs are loaded in the background, it is likely that
some of the apparently successful hits to the 1x1 PNG only occur after an
initial attempt via IPv6 timed out - the user will generally not notice
this. However, if such a timeout were to happen on the main site itself,
it would cause an unacceptable service degradation. The 10 second timeout
variant is an attempt to compensate for this effect. What I do is simply
to discard all 1x1 PNG requests that occur more than 10 seconds after the
linkgen.php script that generated the IMG links has run (including ones
to the ipv4-only hostname); the remaining log file is then processed in the
same way as in the no-timeout variant. The graphs included on this page all
apply the 10s timeout; you can find graphs without the timeout
being applied here.
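
A sketch of that filtering step (the event representation is again an assumption, not the actual log format):

```python
def apply_timeout(events, timeout=10.0):
    """Keep only the 1x1 PNG hits that arrive within `timeout` seconds
    of the linkgen.php hit that generated their ID string; everything
    else is discarded before the usual client-loss calculation.

    `events` holds (timestamp, kind, id_string) tuples, where kind is
    either "linkgen" or the hostname a PNG was requested from."""
    generated = {}  # id_string -> timestamp of the linkgen.php hit
    kept = []
    for ts, kind, id_string in sorted(events):
        if kind == "linkgen":
            generated[id_string] = ts
        elif id_string in generated and ts - generated[id_string] <= timeout:
            kept.append((ts, kind, id_string))
    return kept

events = [
    (0.0, "linkgen", "abc"),
    (1.2, "dualstack.cust.no", "abc"),   # kept: within 10 s
    (27.5, "ipv4-only.cust.no", "abc"),  # dropped: delayed fallback hit
]
```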

The assumption is that the 10 seconds will not be sufficient for any
application to fall back from a failed initial IPv6 attempt to IPv4. The
quickest systemic fallback time we've been able to identify is about 21
seconds (non-Opera browsers on Windows). I've made
a graph that compares the 10s
timeout to 5s and 20s ones - it shows very little difference between the
three timeouts, which I think means that the assumption holds.

Acknowledgements

I'd like to thank APDM and VG for allowing me to perform experiments on
their readers, my own employer Redpill Linpro for encouraging me to use
time on this, and Steinar H. Gunderson from Google for helping out
tremendously all along.

Also I'd like to thank Opera Software for working with me and fixing the
problem in their browser, Apple for fixing Mac OS X, and Fedora,
Canonical, Gentoo, Novell, Mandriva, and Debian for applying my patches
to glibc in their respective Linux distributions.