
The problem was so bizarre that for a moment I suspected I was witnessing
a man-in-the-middle attack using valid certificates. Many popular HTTPS websites, including
Google and DuckDuckGo, but not all HTTPS websites, were taking up to 20 seconds to load.
The delay occurred in all browsers, and according to Chromium's developer tools, it was
occurring in the SSL (aka TLS) handshake. I was perplexed
to see Google taking several seconds to complete the TLS handshake.
Google employs TLS experts
to squeeze every last drop of performance out of TLS,
and uses the highly efficient elliptic curve Diffie-Hellman key exchange.
It was comical to compare that to my own HTTPS server, which was
handshaking in a fraction of a second, despite using stock OpenSSL and the
more expensive discrete log Diffie-Hellman key exchange.

Not yet willing to conclude that it was a targeted man-in-the-middle
attack that was affecting performance, I looked for alternative
explanations. Instinctively, I thought this had the whiff of a DNS
problem. After a slow handshake,
there was always a brief period during which all handshakes were fast, even if I restarted
the browser. This suggested to me that once a DNS record was cached, everything was
fast until the cache entry expired. Since I run my own recursive DNS server locally,
this hypothesis was easy to test by flushing my DNS cache.
I found that flushing the DNS cache would consistently cause the next TLS handshake to be slow.

This didn't make much sense: using tools like host and dig, I could find no DNS problems
with the affected domains, and besides, Chromium said the delay was in the TLS handshake.
It finally dawned on me that the delay could be in the OCSP check. OCSP, or Online Certificate
Status Protocol, is a mechanism for TLS clients to check if a certificate has been revoked. During
the handshake, the client makes a request to the OCSP URI specified in the certificate to check
its status. Since the URI would typically contain a hostname, a DNS problem could manifest
here.
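
To see which OCSP hostnames a given site depends on, something like the following sketch might help. It assumes Python's standard ssl module, whose getpeercert() dictionary includes an "OCSP" entry when the certificate carries OCSP URIs. The function names fetch_peer_cert and ocsp_hostnames are my own, purely for illustration.

```python
import socket
import ssl
from urllib.parse import urlsplit


def fetch_peer_cert(hostname, port=443, timeout=10.0):
    """Connect over TLS and return the server certificate as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=hostname) as tls:
            return tls.getpeercert()


def ocsp_hostnames(cert):
    """Return the distinct hostnames found in the certificate's OCSP URIs.

    `cert` is a dict shaped like the return value of getpeercert();
    the "OCSP" key is absent when the certificate lists no OCSP URI.
    """
    hosts = []
    for uri in cert.get("OCSP", ()):
        host = urlsplit(uri).hostname
        if host and host not in hosts:
            hosts.append(host)
    return hosts

# Example (needs network access):
#   print(ocsp_hostnames(fetch_peer_cert("www.google.com")))
```

Each hostname this returns has to be resolved before the revocation check can proceed, which is where a DNS problem would show up.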

I checked the certificates of the affected sites, and all of them specified OCSP URIs
that ultimately resolved to ocsp.verisign.net. Upon investigation, I found
that of the seven name servers listed for ocsp.verisign.net (ns100.nstld.net
through ns106.nstld.net), only two of them (ns100.nstld.net and ns102.nstld.net) were returning a
response to AAAA queries. The other five servers returned no response at all, not even a response to say
that an AAAA record does not exist. This was very bad, since it meant any attempt to resolve
an AAAA record for this host could leave the client waiting for a timeout and then retrying,
leading to unsavory delays.
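
A check like this can be reproduced without dig. The sketch below builds a bare DNS question by hand and sends it over UDP to one name server, reporting only whether anything came back before the timeout. The helper names build_query and probe, the 3 second timeout, and the fixed transaction ID are assumptions of mine; you would supply each name server's IP address yourself.

```python
import socket
import struct

A = 1      # DNS record type number for A
AAAA = 28  # DNS record type number for AAAA
TXID = 0x1234  # arbitrary transaction ID used to match the reply


def build_query(name, qtype, txid=TXID):
    """Build a minimal DNS query packet (one question, class IN)."""
    # Header: ID, flags (0 = standard query, no recursion), QDCOUNT=1,
    # then ANCOUNT, NSCOUNT, ARCOUNT all zero.
    header = struct.pack("!HHHHHH", txid, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server_ip, name, qtype=AAAA, timeout=3.0):
    """Return True if the server sent a matching reply, False on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server_ip, 53))
        try:
            data, _ = sock.recvfrom(512)
        except socket.timeout:
            return False
    return data[:2] == struct.pack("!H", TXID)

# Example (needs network access; 192.0.2.53 is a placeholder address):
#   print(probe("192.0.2.53", "ocsp.verisign.net"))
```

Note that any reply counts as success here, including one that says no AAAA record exists. A well-behaved server answers either way; only silence is the failure.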

If you're curious what an AAAA record is and why this matters, an AAAA record is the type of DNS
record that maps a hostname to its IPv6 address. It's the IPv6 equivalent of the A record,
which maps a hostname to its IPv4 address. While the Internet is transitioning from IPv4 to IPv6,
hosts are expected to be dual-homed, meaning they have both an IPv4 and an IPv6 address. When one system
talks to another, it prefers IPv6, and falls back to IPv4
only if the peer doesn't support IPv6. To figure this out, the system first attempts an AAAA lookup, and if
no AAAA record exists, it tries an A record lookup. So, when a name server does not respond to AAAA queries,
not even with a response to say no AAAA record exists, the client has to wait
until it times out before trying the A record lookup, causing the delays I was experiencing here.
Cisco has a great article
that goes into more depth about broken name servers and AAAA records.

(Note: the exact mechanics vary between operating systems. The Linux resolver tries
AAAA lookups even if the system doesn't have IPv6 connectivity, meaning that even IPv4-only users experience these
delays. Other operating systems might only attempt AAAA lookups if the system has IPv6 connectivity, which
would mitigate the scope of this issue.)
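
If you want to see whether a hostname suffers from this on your own system, one rough approach is to time the IPv6 and IPv4 lookups separately. This is only a sketch: time_lookup is a name I made up, it goes through the system resolver via socket.getaddrinfo, and cached answers will make a slow lookup appear fast.

```python
import socket
import time


def time_lookup(host, family, resolver=socket.getaddrinfo):
    """Return (seconds elapsed, sorted addresses) for one address family.

    A failed lookup yields an empty address list; the elapsed time still
    shows how long the resolver took to give up.
    """
    start = time.monotonic()
    try:
        results = resolver(host, 80, family=family, type=socket.SOCK_STREAM)
        addresses = sorted({entry[4][0] for entry in results})
    except socket.gaierror:
        addresses = []
    return time.monotonic() - start, addresses

# Example (needs network access):
#   print(time_lookup("ocsp.verisign.net", socket.AF_INET6))
#   print(time_lookup("ocsp.verisign.net", socket.AF_INET))
```

An IPv6 lookup that takes many seconds to come back empty, next to an IPv4 lookup that returns almost instantly, would be consistent with name servers that ignore AAAA queries.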

A History of Brokenness

The unofficial response from Verisign was that the queries are being handled by a GSLB, which apparently means that we should not expect it to behave correctly.

"GSLB" means "Global Server Load Balancing" and I interpret that statement to mean Verisign
is using an expensive DNS appliance to answer queries instead of software running on a conventional
server. The snarky comment about such appliances rings true for me. Last year,
I noticed that my alma mater's website
was taking 30 seconds to load. I tracked the problem down to the exact same issue: the DNS servers for
brown.edu were not returning any response to AAAA queries. In the process
of reporting this to Brown's IT department, I learned that they were using
buggy and overpriced-looking DNS appliances from F5 Networks, which, by default,
do not properly respond to AAAA queries under circumstances that appear to be
common enough to cause real problems.
To fix the problem, the IT people had to manually configure every single
DNS record individually to properly reply to AAAA queries.

I find it totally unconscionable for a DNS appliance vendor to be shipping a product
with such broken behavior, which causes serious delays for users and gives IPv6 a bad
reputation. It is similarly outrageous for Verisign to be operating broken DNS servers
that are in the critical path for an untold number of TLS handshakes. That gives
HTTPS a bad reputation, and lends fuel to the people who say that HTTPS is too slow.
It's truly unfortunate that even if you're Google and do everything right with IPv6,
DNS, and TLS, your handshake speeds are still at the mercy of incompetent certificate authorities
like Verisign.

Disabling OCSP

I worked around this issue by disabling OCSP
(in Firefox, set security.OCSP.enabled to 0 in about:config). While OCSP may
theoretically be good for security, since it enables browsers to reject
certificates that have been compromised and revoked, in practice it's a total
mess. Since OCSP servers are often unreliable or are blocked by restrictive
firewalls, browsers don't treat OCSP errors as fatal by default. Thus, an
active attacker who is using a revoked certificate to man-in-the-middle
HTTPS connections can simply block access to the OCSP server and the browser
will accept the revoked certificate. Frankly, OCSP is better at protecting
certificate authorities' business model than protecting users' security, since it
allows certificate authorities to revoke certificates for things like credit card
chargebacks. As if this wasn't bad enough already, OCSP introduces a minor privacy leak because it reports
every HTTPS site you visit to the certificate authority.
Google Chrome doesn't even use OCSP anymore because it is so dysfunctional.
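
The weakness of treating OCSP errors as non-fatal can be shown with a toy model. This is not how any particular browser is implemented; OcspResult and accept_certificate are names invented for illustration, and hard_fail stands in for a hypothetical strict mode.

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNREACHABLE = "unreachable"  # timeout, DNS failure, blocked by firewall


def accept_certificate(result, hard_fail=False):
    """Decide whether to accept a certificate given the OCSP outcome."""
    if result is OcspResult.REVOKED:
        return False
    if result is OcspResult.UNREACHABLE:
        # Soft fail: an unreachable responder is treated like a good answer.
        return not hard_fail
    return True
```

Under the soft-fail default, an attacker who can block the OCSP request turns a REVOKED outcome into UNREACHABLE, and the certificate is accepted.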

Finally Resolved

While I was writing this blog post, Verisign fixed their DNS servers and now every
single one is returning a proper response to AAAA queries. I know for
sure their servers were broken for at least two days. I suspect
it was longer, considering the slowness had been happening for quite some time before
I finally investigated.