Earlier this week we began experimenting with using Amazon CloudFront as
a CDN for serving static assets. We've also rolled out some general asset
delivery optimizations. Depending on how far away you are from our main
Washington D.C. datacenters, you should see a nice decrease in overall page
load times.

The rest of this post goes into detail on how we implemented this stuff and also
a bit on how we're measuring performance around the world.

Some background: over the years we've spent a lot of time optimizing asset
delivery. This includes things like js/css asset bundling and using
multiple asset hosts. Recently, we started in on another round of optimizations
driven by a general goal of decreasing page load times outside of the US, and
also by changes in the page load performance profile due to the move to SSL for
all asset delivery.

Measuring Page Load Performance

We're using BrowserMob to monitor full page load performance on a few key
pages. BrowserMob is interesting for this kind of profiling for a couple of
reasons. First, it measures full page load time in a real browser (including all
assets and Ajax requests), as opposed to monitoring the response time of an
individual request. A report like the following is available for each run:

The green portion of each bar represents connect time (more than half of which
is usually attributed to "SSL Handshaking"). Purple is waiting for the response
to begin. Grey is the actual receiving of the response data.

If this looks familiar, it's probably because these reports are very similar to
the Network/Resource graphing tools built into most modern browsers'
development tools. What's great about BrowserMob, though, is that these run at
regular intervals and from multiple locations around the world. The results are
then graphed on a nice timeline.

Here are the results for the past week's worth of changes for a public
repository page:

Each point on the graph is the overall page load time for a run at a specific
location. The big red circle areas are timeouts or other errors.

Using a Single Asset Host

The first thing we wanted to test was moving to a single asset host, i.e.,
assets.github.com instead of assets0.github.com, assets1.github.com, ...

Since github.com went 100% SSL, we've found that the cost of performing SSL
handshakes against multiple asset hosts slightly outweighed the benefits
provided by the browser's ability to do more request parallelization. This gets
worse as you move further away and incur more latency. Most modern browsers
support between four and eight simultaneous connections now, too, so
distributing requests between asset hosts has less of a payoff in general.

The BrowserMob report shows that this didn't have a massive impact on average
good page load times, but it seemed to stabilize things quite a bit. With
multiple asset hosts, timeouts and drastically different load times were
frequent. This leveled out after moving to a single asset host.

Problems with Multiple Points of Origin

This is something we surfaced during our research that hasn't been addressed
yet. It's worth mentioning for anyone hosting assets from multiple servers in a
load balancing setup.

At GitHub, we currently have six frontend servers. They run nginx and also the
GitHub application code, background jobs, etc. Each asset request is routed
round robin to one of these hosts. This results in assets having multiple points
of origin, which, depending on your server and deployment configuration, can
lead to a couple of subtle performance issues:

The last modified times on files may vary between machines based on when
the assets were deployed. (This is especially true if you use git for deployment
as timestamps are set to the time of last checkout.) When the same asset has
different timestamps on different origin machines, conditional HTTP GET requests
using If-Modified-Since can lead to full 200 OK responses instead of nice,
contentless 304 Not Modified responses.

Using long-lived expiration headers avoids many of these requests altogether
but browsers love to validate content in a number of circumstances, including
manual refresh, hitting <ENTER> in your URL bar, and also randomly on
Tuesdays.

An SSL handshake needs to be performed on each of the origin servers. A browser
opening six connections may land on six different hosts and need to perform
six different handshakes. Stated succinctly:

We're not doing this yet, on account of it requiring a fairly major redesign of
our frontend architecture. Luckily, most CDNs like CloudFront have tuned their
SSL negotiations fairly well and we're able to take advantage of that for asset
requests.
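To make the first issue above concrete, here is a minimal sketch of how mismatched last modified times can defeat conditional GETs. The host names (fe1, fe2), the timestamps, and the conditional_get_status helper are all hypothetical; the "exact" mode mirrors an exact-match comparison such as the default of nginx's if_modified_since directive, and "before" models the more lenient check.

```python
from email.utils import formatdate, parsedate_to_datetime


def conditional_get_status(file_mtime: int, if_modified_since: str,
                           mode: str = "exact") -> int:
    """Return 304 if the client's cached copy validates, otherwise 200.

    Hypothetical helper: it only models the timestamp comparison an origin
    server performs, nothing else about the request.
    """
    client_ts = int(parsedate_to_datetime(if_modified_since).timestamp())
    if mode == "exact":
        # The file's mtime must match the client's timestamp exactly.
        fresh = file_mtime == client_ts
    else:
        # "before": the file must not be newer than the client's copy.
        fresh = file_mtime <= client_ts
    return 304 if fresh else 200


# Assumed deploy times (Unix seconds) for the SAME file on two origins.
# The file contents are identical; only the checkout time differs.
origin_mtimes = {"fe1": 1_300_000_000, "fe2": 1_300_000_045}

# The browser first fetched the asset from fe1 and kept its Last-Modified.
last_modified = formatdate(origin_mtimes["fe1"], usegmt=True)

# Later validation requests are routed round robin to either host.
statuses = {
    host: conditional_get_status(mtime, last_modified)
    for host, mtime in origin_mtimes.items()
}
print(statuses)  # {'fe1': 304, 'fe2': 200}
```

Even though nothing changed, the request that lands on fe2 pays for a full response. Deploying assets so that every origin reports the same timestamp for the same content would avoid this.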

Implementing CloudFront

We used CloudFront's support for Custom Origins. This means we don't have
to deal with shipping assets to an S3 bucket on deploy. Instead, you point the
CloudFront distribution to your existing asset host (assets.github.com in our
case) and then change the asset URLs referenced in page responses to the
CloudFront distribution. If an asset isn't available at a CloudFront server, it
will be fetched and stored for subsequent requests.
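The fetch-and-store behavior can be pictured as a pull-through cache. The sketch below is only an illustration of that idea, not CloudFront's implementation: the PullThroughCache class and fake_origin function are hypothetical, and it ignores expiration headers and eviction entirely.

```python
from typing import Callable, Dict


class PullThroughCache:
    """Toy model of an edge node that fills itself from an origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_hits = 0  # how many requests had to go back to the origin

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Miss: fetch from the origin once and keep the response.
            self.origin_hits += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


def fake_origin(path: str) -> bytes:
    """Stand-in for the existing asset host (assets.github.com in our case)."""
    return f"contents of {path}".encode()


edge = PullThroughCache(fake_origin)
first = edge.get("/stylesheets/bundle_common.css")   # miss, goes to origin
second = edge.get("/stylesheets/bundle_common.css")  # served from the edge
print(edge.origin_hits)  # 1
```

The appeal of this model is that the deploy process doesn't change at all: the origin keeps serving files as before and the edge populates itself on demand.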

This was fairly easy to get working using Rails's built-in support for
configurable asset hosts. One issue we did run into is that CloudFront
ignores the query string portion of the URL, which is used to force the browser
to reload cached assets when changed. We got around this by moving the asset
id into the path portion of the URL. So instead of assets being referenced as
/stylesheets/bundle_common.css?85e47ae, they are now referenced as
/85e47ae/stylesheets/bundle_common.css. A simple Nginx rewrite handles
locating the file on disk when these URLs are requested.
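As a rough sketch of both halves of that change, the Python below builds a URL with the asset id in the path and then maps such a URL back to a file on disk. It is not our Rails or Nginx configuration: the function names, the /var/www/public root, and the assumption that the asset id is always seven hex characters (taken from the example above) are illustrative only.

```python
import posixpath
import re

ASSET_ID = "85e47ae"  # example id from this post
ASSET_HOST = "https://d3nwyuy0nl342s.cloudfront.net"

# Assumption: the asset id is a 7-character hex prefix, as in the example.
_ID_PREFIX = re.compile(r"^/[0-9a-f]{7}(/.*)$")


def asset_url(path: str, asset_id: str = ASSET_ID, host: str = ASSET_HOST) -> str:
    """Put the asset id in the path so a changed id is a brand new URL."""
    return f"{host}/{asset_id}/{path.lstrip('/')}"


def path_on_disk(request_path: str, public_root: str = "/var/www/public") -> str:
    """Strip the id prefix (if present) to find the real file.

    This plays the role of the rewrite rule; it does no path sanitizing.
    """
    match = _ID_PREFIX.match(request_path)
    relative = match.group(1) if match else request_path
    return posixpath.join(public_root, relative.lstrip("/"))


print(asset_url("/stylesheets/bundle_common.css"))
print(path_on_disk("/85e47ae/stylesheets/bundle_common.css"))
```

Because the id is part of the path, a cache that ignores query strings still sees a different object every time the id changes.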

One other thing worth mentioning is that, while CloudFront supports SSL, you
won't be able to use a custom domain name. All of our assets are currently
referenced with these ugly https://d3nwyuy0nl342s.cloudfront.net/... URLs.

Oh well.

CloudFront Performance

NOTE: BrowserMob runs on Amazon EC2 but all CloudFront access is performed
over external interfaces. Still, the relationship between the EC2 and CloudFront
networks should be taken into account when interpreting these results.

Our goal in moving assets to a CDN is mostly to decrease load times around the
world by serving them from hosts that are geographically nearer to the client
making the request. We expected to see decent gains on the US West Coast and in
Europe, and large gains as you moved closer to the other side of the world. So
what actually happened?

We were disappointed in the BrowserMob results for the US West Coast and Europe.
They stayed relatively flat or got worse. This may be something strange with
BrowserMob's server locations, however. Running similar comparisons from a
browser on my desktop in San Francisco shows good gains. We'll be in touch with
BrowserMob to determine whether their DNS might be failing to resolve hosts to
nearby servers.

The results for Singapore were more in line with what we were hoping for. Here's
a single run from before we turned CloudFront on:

There's one other bit of weirdness we're noticing with CloudFront that we left
off of the timeline graph above. For some reason, page load time from Dallas
became extremely erratic when assets were switched over to CloudFront:

We don't want to draw any conclusions from such a small sample but this would
seem to indicate that individual CloudFront nodes are susceptible to some kind
of temporary overload or other source of instability.

That's where we're at. This is still very much an experiment and we'd like to
compare performance of other CDNs and collect a little more data before making a
final decision.