Domain-Sharding and SSL

Some time ago I noticed some weird behaviour on our website http://www.alice-dsl.de so I thought it would be a good idea to revisit this issue. Part of the above mentioned website is a selfcare area, where our customers are able to view their bill, to change the product options etc. etc. This area requires a login and so the login itself, as well as the subsequent pages are secured by SSL.

SSL and Performance Optimization was for quite some time a very difficult beast to debug, but fortunately with Pat Meenans WebPageTest this was solved at least for IE. In the old days you could use HTTPWatch for example, to see the HTTP interaction within the SSL tunnel, but you were unable to view the TCP Connection flows. With Microsofts Visual Roundtrip Analyzer you were able to see the TCP Connection flows, but you were unable to see the HTTP Conversation within the SSL Tunnel. Pat’s tool was the first (maybe still is the only one) that allowed to see both, the TCP Connection flow as well as what happens within the SSL Tunnel.

So, back to our page. In one of our former optimization iterations we decided to do some domain sharding. With this we wanted to improve performance especially for IE6 and IE7 users, as these Browsers only open up 2 TCP Connections per domain. Resulting in a performance limit, that is often lower than what your bandwidth is able to provide.

We implemented that in a way, that we could follow another performance rule, which is “Make static content cookie-free”. So we set up another Domain, static.alice.de, which we used for the above mentioned domain sharding, as well as keeping most of the content cookie-free.

There are some reasonable concerns regarding domain sharding in conjunction with SSL, as you not only have an additional DNS lookup and TCP handshake, but also an additional SSL handshake, which can be quite time-consuming. But we were pretty confident, as our customers are solely based in small Germany with a RTT of ~50 ms, that the benefit of 4 connections would outweigh the impact of that additional DNS lookup and TCP/SSL Handshake.

So, when I did a run with Webpagetest.org, the result in IE7 looked like this:

Hmm… that was weird… As you can see, you have some 2 connection waterfall behaviour, but for the first 7 objects, it seems like the TCP / SSL handshake is closed after each object, and re-established for the next one. Now THAT is much more painful, than what we would have expected…

So I was almost picking up the phone to call the Webserver Admin, that probably the Connection: Keep-alive Headers were not set correctly in Apache, but I decided first to check the HTTP Headers. And they looked like this:

As you can see, BOTH ends have a “Connection: Keep-Alive Header” using HTTP1.1. So why in the hell was the connection closed???

So I fired up Wireshark to look at the raw packet flow. Wireshark normally does not allow me to see the content within the SSL Tunnel, but in this case it didn’t matter, as I only wanted to see, WHO is closing the connection.

So it was ME (Or my machine) which was closing the connection, not the Server. As you can see in the trace, I am the one sending the TCP FIN.

So far, so bad.

Thinking about a solution / workaround I was wondering: Only the CSS files are affected. The Headers are correct and the same as the other assets, which are downloaded later on from the same domain. And with these assets I see a perfect persistent connection… Maybe it simply is due to being hosted on another domain?

Setting up another test page I dropped the idea of domain sharding, and used just one domain. The domain, where the base page resides. And the result was:

And the issue is gone… Meanwhile the Time-to-Render has improved from 2.2 to 1.6.

As it seems, the buggy TCP Connection behaviour seems to show up, when you are delivering CSS files VIA SSL from a different domain, than the base page.

I did a further test (waterfall coming soon), where I have shifted only the CSS files back to the domain, where the basepage resides. Result is

As you can see, the connection was kept-alive during the CSS downloads. BUT, later on, when the CSS-Images were downloaded from the sharded domain, suddenly these CSS-Images showed the Connection: Close behaviour!

Now, which IE Versions are affected? IE 7, as you can see above. IE 6 as well, IE 8 and IE 9 are not also affected. (Though I have to check with IE 8 and IE 9, if the issue would only be visible, if I add a seventh CSS file in the HEAD. But then you have another issue anyway). With IE 8 and IE 9, though, the issue is a lot less impacting, as you will be hit ONLY if you have more than 6 Stylesheets referenced in HEAD. Something which you should avoid anyway.

Firefox is not affected.

Finally: Who should care (If my observations/assumptions are true. Feel free to comment)?

– If it is not SSL, you have no problem.

– If you have inlined CSS, you have no problem.

– If your CSS files and CSS Images are on the same domain as the base page, you have no problem.

– If you have just a single CSS file via SSL in the HEAD, it is cacheable and your customers have a rather low latency, you shouldn’t worry too much.

– BUT, if you have multiple CSS files, or different CSS files on each page, which are on a different domain, then it might be an issue. More so, if they are not cacheable, and even more so, when your customers DO have high latency. Remember, NO RENDERING, until all CSS files from the HEAD are received! Blank Page!

Finally a big thank you to Thomas G., Daniel G. and Michael S. who supported me in this analysis!

-Markus

UPDATE:

I got in contact with Microsofts IE Team via their blog and indeed, they confirmed the above mentioned behaviour. So this is not a unique fail case with our domain, but actually a bug in IE, present in versions 6 to 9. In the same response they said they are looking to fix this in an upcoming IE version.

Like this:

Maybe the solution is to have only a single CSS file per site. Additional styles per page could be inlined. The assembly process (a CSS file per site and inlined style sheets) could take place in a content management system.

Yes, that would probably a good path to follow. Actually a couple of month ago, after our last performance sprint, we were “manually” down to 2 CSS files. So the issue moved below my radar again.
But now, after a couple of new releases, we are again back to 6. So indeed, we need to find some way of automatic assembly, because otherwise all the efforts are useless (meaning performance improvements are gone) after a couple of new releases.