Upgrading our stack for web performance: early flushing, http/2, and more

What’s fast, jank-free, and has what you need for your home? Wayfair’s mobile web site, that’s what!
Behind this spiffy Android experience, and behind all of the ways into Wayfair that we make available to our customers, are some fresh technical upgrades and an evolution of our programming and product culture, with increased attention to web performance.

For some time we have used RUM, synthetic and server-side execution metrics to stay on top of how fast or slow we are. We recently set out to adopt a group of the best practices that have emerged over the last few years, hoping to observe big changes in our numbers. We were a little surprised by the trickiness of configuration and verification. These techniques have been possible for a while now, but many proxies, monitoring and measurement tools, web servers, and supporting libraries either don’t support them at all, support them but in versions that have only very recently been packaged by popular Linux distributions, or must be carefully massaged into non-default states for them to work. Our guides were the blog posts, books, and Velocity talks of web performance industry leaders Ilya Grigorik, Patrick Meenan, and Steve Souders. We figure a practically minded how-to from a big site like ours might be useful to people. We’ll focus on how to make sure these things are working, rather than on the mechanics of configuration beyond the basic idea, because the details of different web and CDN platforms will differ widely from site to site.

Our goal for the web performance program is simple: a better user experience through faster perceived performance. As proxies for that somewhat vague goal, we focused on measurements of the experience-ready type, using events in the performance-timing RUM data, and various metrics that are available in synthetic tools like WebPageTest and Phantomas. We have found Google’s RAIL model to be an excellent frame of reference. We have tinkered with the metrics we use, and TTFB (time to first byte), first paint, above-the-fold render time, DOM-ready, ‘interaction-ready,’ document-complete, and speed index have figured in our thinking at various points. We have a lot of tactics and techniques on our roadmap, but we decided to focus at first on early flushing and, to a lesser extent, http/2 or ‘h2’ as it appears in some tools. Our analysis told us they would have a big impact. We also felt that if we didn’t do them first, or at least early, we would be messing around with smaller optimizations that would probably not work exactly the same way without those two things in place, so we would just have to redo them.
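To make the RUM side concrete, here is a minimal sketch of deriving a few of these metrics from a single page view. It assumes the record uses the attribute names of the browser's Navigation Timing `performance.timing` object; the `summarize_timing` helper and the sample values are hypothetical, for illustration only.

```python
def summarize_timing(timing):
    """Derive a few page-speed metrics (in ms) from a performance.timing-style dict."""
    start = timing["navigationStart"]
    return {
        "ttfb": timing["responseStart"] - start,
        "dom_ready": timing["domContentLoadedEventEnd"] - start,
        "document_complete": timing["loadEventEnd"] - start,
    }


# Made-up sample record; real values are epoch milliseconds reported by the browser.
sample = {
    "navigationStart": 1_000_000,
    "responseStart": 1_000_180,
    "domContentLoadedEventEnd": 1_001_250,
    "loadEventEnd": 1_002_900,
}
print(summarize_timing(sample))
```

Metrics like first paint, above-the-fold render time, and speed index are not in this record and need other sources, such as synthetic tools.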

Before we get into the details, here’s a quick thought on internal justifications for the effort involved: we don’t really buy the web performance industry’s simple sales math for e-commerce sites. The standard argument goes something like this: visitors who experience faster page views convert at higher rates. Move the slower crowd to faster, speed everyone up and hammer down the long tail of slow experiences, and they will convert at the same rate as the fast people. I think a lot of speed tools have been sold based on that argument. If only! We’re not buyers. What if the people who are converting at a high rate for some combination of reasons just tend to have fast computers? Unfortunately the ROI on speed is not just a simple calculus problem, where you raise up the juiciest part of the speed curve, and sit back and count the money. This is a deep question, and one worth pondering before you put too much of your life energy into a performance effort, especially if divorced from other product concerns. But that doesn’t mean there isn’t real money at stake, and it’s certainly not worth dwelling on when you’re working hard, as we have been doing, to give your customers a fast and excellent experience.

Let’s break down the techniques of early flushing and h2.

If you’re unfamiliar with early flushing, it’s an easy idea to understand: as early as reasonably possible in the process of building up a stream of HTML, send or flush the first part, including instructions for downloading critical assets like CSS and JavaScript. There are some fine choices to make. Can you inline the most critical CSS, JS or even images? Can you get something meaningful out in 14K or less, so it will fit in the first TCP congestion window? Also, there are typically some obstacles. If, after flushing the first batch of bytes, including the HTTP status code 200 OK, you encounter an error that previously would have triggered a code other than 200, what do you do? Is *anything* in your stack buffering the stream? In our case this meant PHP (easy enough to change the code and the configuration file), Nginx (the proxy_buffering directive), two inline appliances we had doing various things in our data center, and our CDNs. All things being equal, and as a general rule, buffering is your friend, and we probably could have gotten all this to work by reducing buffer sizes. However, in the spirit of keeping things simple and cutting to the chase, we just removed the appliances and made other plans for what they were doing, turned buffering off in all the other places, observed only a very slight increase in server load, and went with that.
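The basic shape of early flushing can be sketched in a few lines. Our stack is PHP, so treat this as an illustration in Python's WSGI rather than our actual code: the asset paths, the `build_rest_of_page` stand-in, and the 14 KB budget constant are all assumptions made for the example.

```python
import time

EARLY_CHUNK_BUDGET = 14 * 1024  # rough size of the first TCP congestion window

# The part we can send before any slow work: asset references plus the top nav.
HEAD = (
    b"<!doctype html><html><head>"
    b'<link rel="stylesheet" href="/static/site.css">'
    b'<script src="/static/site.js" async></script>'
    b"</head><body><nav>Top nav</nav>"
)


def build_rest_of_page():
    """Hypothetical stand-in for the slow part: queries, templating, and so on."""
    time.sleep(0.01)
    return b"<main>Product grid</main></body></html>"


def app(environ, start_response):
    """WSGI app that yields the head before the slow work starts."""
    # The 200 status is committed here, before we know whether the rest succeeds.
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    yield HEAD  # the server may transmit this while we keep working
    yield build_rest_of_page()


def fits_first_window(chunk, budget=EARLY_CHUNK_BUDGET):
    """True if the early chunk is small enough for the assumed first-window budget."""
    return len(chunk) <= budget
```

Yielding early only helps if nothing downstream holds the bytes back, which is why every buffering layer between the application and the browser has to be checked.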

H2 is a big topic. In short it’s a new protocol, based on Google’s SPDY, that aims to reduce network overhead, round trips, and payloads. It’s available, for practical purposes, for TLS/HTTPS only, and it allows you to multiplex many concurrent downloads over a single connection. It’s supported by more than half of the browsers we see, which we think is fairly typical for North American and European sites. You need graceful degradation to http/1.1 for the other 40+%, but there are good ways to do that. To see it in action in Google Chrome, add this extension:

Click on the thunderbolt, and you’ll see the screen below. It shows all the h2 sessions your browser has going. I’ve opened up Google, Facebook and Twitter, which are three sites that populate this tab for many people. Interspersed among those, you can see the domains wayfair.com, wayfair.co.uk, and the separate domains we use for our CDNs to serve static assets.

The two domains for assets are an http/1.1 graceful degradation hack, where you resolve two different domains to the same IP address. H2 is smart enough to make one connection to the IP (provided the TLS certificate covers both hostnames), but http/1.1 treats it as two, so the browser’s connection limit is doubled. In this way, you get the benefits of domain sharding for http/1.1 clients, and the benefits of connection re-use for h2. Domain sharding is the best of the optimizations from the older set of best practices that we’re replacing, so even as we abandon some of the other techniques, we want to continue to take advantage of this one for the legacy-client population, which is still quite large.
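On the page-generation side, sharding across two hostnames comes down to picking a host for each asset URL. The sketch below is one way to do that, not a description of our templates: the hostnames are placeholders, and the CRC-based choice is an assumption made so that a given asset always maps to the same host and stays cacheable.

```python
import zlib

# Hypothetical shard hostnames; both would resolve to the same IP address.
SHARD_HOSTS = ["assets1.example.com", "assets2.example.com"]


def shard_url(path, hosts=SHARD_HOSTS):
    """Build an asset URL, choosing the host deterministically from the path."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    index = zlib.crc32(path.encode("utf-8")) % len(hosts)
    return f"https://{hosts[index]}{path}"


for asset in ["/css/site.css", "/js/site.js", "/img/logo.png"]:
    print(shard_url(asset))
```

An http/1.1 browser sees two hosts and opens connections to each; an h2 browser that coalesces connections would send all of these requests over one.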

How do you know early flush and h2 are working? Open up Chrome developer tools and look at the timeline view. Let’s look at early flush first.

Notice how the purple CSS and yellow JavaScript downloads start happening halfway through the blue HTML. Also, see how we get a first ‘paint’ of just the top nav, before the full page shows up two frames later. If you’re on a slow connection and you type in the home page, but really want to go somewhere else on the site, you can actually use that to navigate. So it’s useful to some users, and kicking off all the downloads early speeds everything up across the board. Here is a WebPageTest video of early flushing vs control:

Notice that you see something of interest sooner, and the overall time is 2.6 seconds vs 3.5. These are under synthetic test conditions: the discrepancy is more dramatic than in real life, on average, and the raw numbers are also different. But it illustrates perfectly the idea of what you’re looking for, and what people are actually experiencing on the site now.

For h2, back in Chrome developer tools, go to the network tab, right-click on the column headings and enable ‘Protocol’. A new column will appear between ‘Status’ and ‘Domain’. You’re looking for the value ‘h2’ instead of ‘http/1.1’. Then fire up webpagetest.org, run a test, and look at the details tab. The waterfall view shows a large number of requests to one of our CDN URLs.

But then the connection view shows that many requests are multiplexed over the connections to the static asset URLs.

WebPageTest is telling this story pretty well, but we’re still seeing more connections than we should, according to the theory. So was the theory not implemented correctly by Akamai or Fastly and Google Chrome? Or is WebPageTest misleading us? Let’s fire up Wireshark and find out. And… everything’s encrypted, so we can’t see it. Let’s go see Chris the protocol wizard (wearing his wizard cap today!), who’s savvier with Wireshark than me.

A quick, friendly man-in-the-middle attack on our own TLS infrastructure later, and Wireshark shows this (click to enlarge, and note that HTTP2 is magic!):

And then we see one TCP stream to the IP address behind our assets URLs.

Victory! WPT is showing more ‘connections’ than there actually are TCP streams, but there can be only one stream to that address if it’s working right, and that’s what we see.
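For a quick scripted check that a server negotiates h2 at all, without a browser or Wireshark, you can ask for it via ALPN during the TLS handshake. This is a minimal sketch using Python's standard `ssl` module; the helper names are made up for the example, and it only tells you which protocol was negotiated, not how many TCP connections a browser ends up opening.

```python
import socket
import ssl


def make_alpn_context():
    """TLS client context that offers h2 first, then http/1.1, via ALPN."""
    ctx = ssl.create_default_context()
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_protocol(host, port=443, timeout=5.0):
    """Connect to host and return the ALPN protocol the server selected, or None."""
    ctx = make_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling `negotiated_protocol` with one of your own hostnames should return `h2` when the endpoint has h2 enabled, and `http/1.1` or `None` when it falls back.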

Obviously, we’re missing an opportunity here to serve everything off one domain, in order to reduce the number of connection handshakes even further and get the most out of server push. That domain would have to be www.wayfair.co.uk for our UK site, from the example. We may be more aggressive about that in future, but that would require more refactoring, and we’re happy for now, with the browsers being able to grab all their assets from one place, without domain sharding or other tricks.

Bottom line, we got a 15-20% decrease in the metrics we care about from early flush, and 5-10% from h2. The examples above are for desktop, because it’s easier to see what’s really going on than with mobile. But these techniques work for mobile too, and of course the reduction in network round trips is much more of a benefit for phones than it is for desktops on fast networks. Early flush works the same way. Like a lot of e-commerce sites, we use adaptive delivery more than responsive design, although we’re doing more of the latter than we used to. We are delivering a smaller payload to phones. Whatever number of bytes you can flush, if it’s constant across platforms, will be a bigger percentage of the total, so more of a meaningful part of the page, on mobile than on desktop.
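The point about a constant flush being a bigger share of a smaller page is simple arithmetic. The payload sizes below are invented for the example and are not our real page weights.

```python
def flushed_share(flushed_bytes, total_bytes):
    """Fraction of the document delivered by the first flush."""
    return flushed_bytes / total_bytes


FLUSHED = 14 * 1024  # same early chunk on every platform

# Made-up HTML payload sizes, for illustration only.
pages = {"desktop": 120 * 1024, "mobile": 60 * 1024}

for name, total in pages.items():
    share = flushed_share(FLUSHED, total)
    print(f"{name}: {share:.0%} of the HTML arrives in the first flush")
```

With these assumed sizes, halving the payload doubles the share of the page that the first flush covers.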

A couple of configuration notes. Your experience of CDNs may differ from ours. Just as with nginx, turning proxy buffering off is probably not the default configuration. We terminate TLS at the CDN, so their support for http/2 was important to us. If you’re doing this yourself, a pretty advanced Ubuntu, as of this writing, is the only mainstream server operating system that has a default openssl package advanced enough for http/2 to work. We set that up during testing: Ubuntu + openssl + nginx, with proxy_buffering off and h2 on, and everything worked. If you are using a CDN, even if you are encrypting your origin endpoints, you don’t actually have to enable h2 at the origin. As long as the edge node does the h2 thing properly, the long-haul traffic from edge to origin can use http/1.1 without a huge opportunity cost. This may be of help to you, if you have existing proxies, test scripts, etc., that benefit from an easy-to-use text-based protocol (http/1.1) in some way, or that do not yet support h2.

For metrics and monitoring, there are a lot of available tools and platforms. There are great cloud resources you can use, but they’re not always very practical to run internally, and the emulations of real devices that they provide don’t have the right chips to give you a realistic idea of what’s happening. We use them to some extent, and if we didn’t have a more accurate option we would be using them more. To get a more accurate read, we set up a lab in the office, which has a very “Pat Meenan’s basement from 10 years ago” vibe to it. Here’s Sam, from our Performance Engineering team, among the Intel NUCs, phones and tablets that we set up:

This way, we get chips like in the real world, and we use traffic shaping software to control for the fact that we’re on a fast network in the office.

A word on opportunistic roadmapping (not to be confused with its evil twin, scope creep). When we began this work, our main focus was early flushing, and our metrics improvements have now demonstrated that this was indeed the most important of the things we have tried so far. However, halfway through the project, by a pure happy coincidence, the SEO community started to notice that Google had finally cleaned up its 301 redirect penalty problem, which had been foiling its own attempts to get everyone to go to all TLS all the time. We had wanted to do that for our biggest sites for a while, but the 15% in lost natural search traffic was holding us back. It was incredible good fortune that this barrier was removed at the time it was, while we were in the middle of a concerted effort to improve the stack, with all the associated rounds of thinking, coding, probing and testing, all lined up. When we saw that, we decided to really push on the h2 roadmap items, to get them done before the holidays this year. The reward, if the technique worked out, was going to apply not just to our smaller sites that were already all TLS, but to our largest site, wayfair.com. We got all that done, and we’re very happy with the results.

With these pieces in place, we are psyched about our roadmap going forward. We can start really coding for http/2, which will be great for desktop browsers, single-page apps for mobile web, and everything in between. Other things are just out, imminent or planned for pretty soon: resource hints, pre-rendering, dns-prefetch, HSTS and OCSP stapling. We’ll share more later on all that, if there’s something interesting to share. In retrospect, I would describe the web performance optimizations we had been doing before we made all these changes as strenuous efforts with one arm tied behind our backs. On this modern stack, we’re looking forward to being able to use all the latest clever tricks to give our customers a really awesome experience. In the meantime, internally, we have put in place well-socialized KPIs, training for our developers on all these techniques, and better tooling than we had before, with the goal of building sound principles of web performance into our product and engineering culture, and into everything we build. To give one example (and we’re not the first to do this), we made a bookmarklet based on Pat Meenan’s JavaScript function for calculating Speed Index, which he open-sourced here, and it’s on all our developers’ Chrome toolbars now.
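For readers who haven't met Speed Index: it is the area above the visual-progress curve, so a page that shows most of its content early scores lower (better) than one that shows everything at the end. The sketch below computes it from sampled visual completeness; it illustrates the general definition, not the browser-side estimation the bookmarklet performs, and the sample curves are invented.

```python
def speed_index(samples):
    """Speed Index in ms from (time_ms, completeness) samples.

    samples must be sorted by time, start at time 0, and use completeness
    in the range 0..1. Completeness is treated as a step function that
    holds its value until the next sample.
    """
    total = 0.0
    for (t0, done0), (t1, _) in zip(samples, samples[1:]):
        total += (1.0 - done0) * (t1 - t0)
    return total


# Invented curves: one page paints its top part early, the other paints late.
early_paint = [(0, 0.0), (500, 0.4), (1500, 0.9), (2000, 1.0)]
late_paint = [(0, 0.0), (2000, 0.5), (3000, 1.0)]

print(speed_index(early_paint), speed_index(late_paint))
```

This is why early flushing can move Speed Index even when the total load time changes less: the early paint shrinks the area above the curve.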