Redeploying Debian-Administration.org ...

For the past nine years this site has been hosted upon a single dedicated server, graciously donated by my employer Bytemark. Over time it has been upgraded, but it has become apparent that a single server isn't sufficient unless it is a huge one - so with that in mind I've recently redeployed this site as a mini-cluster.

On the face of it this site should be fast: it hosts content which is admirably suited to caching, and it doesn't change too often. The entire contents of the MySQL database hosting the site can easily fit in RAM, and yet the performance has been degrading for many months.

I run this site as a hobby, but it was becoming frustrating having to wait for the slowly loading pages, and the all too frequent "Server On Fire" error message.

When it came to scaling there were a few options. To keep things manageable I went with what I thought was the simplest solution, splitting the site into logical components, each of which would be handled differently:

The database, which stores the articles, comments, and similar data.

The application server which drives the site, interfacing between the clients and the database.

The "Planet" aggregation server, which essentially builds a static HTML file every few minutes, and serves that static file.

The planet-site was the most basic to move. I created two virtual machines, added a floating IP address, and configured them. These machines have a simple cronjob to rebuild the planet, and it is then served by nginx. There is no caching in place because nginx is fast.

The database was similarly straightforward. I installed the MySQL database on two new hosts, and configured them in master-master mode such that either of them could accept writes - which would then be replicated to the other. In the normal course of events only one of them is used - but when that dies the other is ready to take over.

FACT: All servers die, it is just a question of how often it will happen, and how painful the recovery will be.
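The "use one, keep the other ready" arrangement for the database pair can be sketched from the client's side. This is a minimal illustration in Python (the site itself is Perl), not the site's actual code: the host names `db1.example` and `db2.example`, and the `fake_connect` stand-in for a real database driver, are assumptions made so the example runs on its own.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_with_failover(hosts: Sequence[str], connect: Callable[[str], T]) -> T:
    """Try each database host in order; return the first working connection."""
    errors = []
    for host in hosts:
        try:
            return connect(host)
        except OSError as exc:  # refused, timed out, unreachable, ...
            errors.append(f"{host}: {exc}")
    raise ConnectionError("no database host reachable: " + "; ".join(errors))


# Hypothetical stand-in for a real driver's connect call.
def fake_connect(host: str) -> str:
    if host == "db1.example":
        raise ConnectionRefusedError("db1 is down")
    return f"connection to {host}"


# The preferred host is listed first; the second is only used when it fails.
print(connect_with_failover(["db1.example", "db2.example"], fake_connect))
# prints: connection to db2.example
```

Because both hosts accept writes, the client does not need to know which one is "the" master; it only needs an ordered list of hosts to try.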

Finally the application servers needed to be designed. Here I went overboard. This site was originally created to document the things I was struggling to remember, or to discuss and share knowledge of how I thought things should work. With that in mind I was absolutely happy to use this site itself as another experiment - downtime would be embarrassing, and annoying, but this site is one I know well and has a decent level of traffic which makes it a great playground.

(Despite the near year-long hiatus search-engine spiders are relentless, and contributed hugely to the site-slowdown.)

So rather than taking the cheap way out, using a hardware load-balancer, I came up with an interesting layout for the application servers:

Create four application-servers, each of which will run Apache to serve the content.

Have a single floating IP address which any one of four machines can claim.

On port 80 of the floating IP we have varnish listening. Varnish is the well-known caching reverse proxy server.

Varnish can talk directly to all four of the back-end machines, using its built in load-balancing facilities.

This scheme is a little more complex than just using a load-balancer, but it avoids a single point of failure in a way that a more traditional load-balancer wouldn't. If all traffic had to pass through one dedicated load-balancer on its way to the Apache servers, then the loss of that load-balancer would cause the site to immediately become unavailable.

On this basis my initial temptation to use three machines for Apache and one for Load-Balancing purposes was ruled out early on.

In the current scheme each of the four hosts can claim the floating IP, and that means that each of them can become the load-balancer. Regardless of which host is the "master" the load-balancer will be able to proxy content between visitors and whichever of the back-end servers are online.
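The idea of proxying to "whichever of the back-end servers are online" can be sketched as a round-robin picker that skips unhealthy hosts. This is a simplified Python model for illustration only, not Varnish's actual implementation; the class name `RoundRobinDirector` and the host names `web1` to `web4` are assumptions.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class RoundRobinDirector:
    """Hand out back-ends in turn, skipping any that are marked down."""

    def __init__(self, backends: List[str]) -> None:
        self.backends = list(backends)
        self.healthy: Dict[str, bool] = {b: True for b in self.backends}
        self._ring: Iterator[str] = cycle(self.backends)

    def mark(self, backend: str, is_up: bool) -> None:
        # In a real proxy this would be driven by periodic health probes.
        self.healthy[backend] = is_up

    def pick(self) -> str:
        # Look at each back-end at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy back-ends")


director = RoundRobinDirector(["web1", "web2", "web3", "web4"])
director.mark("web2", False)  # pretend web2 failed its health check
print([director.pick() for _ in range(6)])
# prints: ['web1', 'web3', 'web4', 'web1', 'web3', 'web4']
```

The same picker can run on any of the four hosts, which is why it doesn't matter which one currently holds the floating IP.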

A simplified diagram of the new setup looks like this:

Of course life is hard so there were some other challenges. In the past the content of the site was dynamic, but featured layers of caching. Because the caching was local-only that had to be removed to avoid getting into a situation where different hosts were serving different content:

web4 sees a new article - it invalidates its local cache - this means it is now current and up-to-date.

There were a couple of different solutions here. I could have promoted the caching to an external, shared, memcached instance. That would allow all the cache hits, purges, and reads to come from a separate source - but given the changes I had to make I decided I would remove the local caching entirely. If four servers can't keep up I'm doing something wrong!
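The stale-content problem with local-only caches can be reproduced in a few lines. The following is a toy Python model, assuming four hosts named `web1` to `web4` that share one database but each keep a private cache; the `AppServer` class and its methods are invented for the illustration.

```python
class AppServer:
    """An application server with a private, local-only page cache."""

    def __init__(self, name, database):
        self.name = name
        self.database = database  # shared by every server
        self.cache = {}           # private to this server

    def front_page(self):
        if "front_page" not in self.cache:
            # Copy the list so the cached page is a snapshot.
            self.cache["front_page"] = list(self.database["articles"])
        return self.cache["front_page"]

    def publish(self, title):
        self.database["articles"].append(title)
        # Only the server handling the request invalidates its cache.
        self.cache.pop("front_page", None)


database = {"articles": ["Old article"]}
servers = [AppServer(f"web{i}", database) for i in range(1, 5)]

for server in servers:
    server.front_page()  # warm every local cache

servers[3].publish("New article")  # the request happens to land on web4

for server in servers:
    print(server.name, server.front_page())
# web1, web2 and web3 still print ['Old article'];
# only web4 prints ['Old article', 'New article'].
```

Either a shared cache or no application-level cache at all avoids this split view.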

So, what stops the new site from melting? Two things:

We have four instances of Apache running, so each one will receive 1/4 of the prior peak-load.

We cache at the proxy layer.

The varnish installation not only works as a load-balancer it also caches as much content as it can.

The site code-base has been reworked to avoid serving cookies for anonymous visitors - this means that the 90% of our content which is viewed by search-engines, and anonymous viewers, can come from the cache.
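Why avoiding cookies matters can be shown with a simplified cacheability check. This is a sketch in Python of the general rule a shared cache tends to follow, not the site's real Varnish rules; the function name `is_cacheable` is an assumption.

```python
from typing import Mapping


def is_cacheable(method: str, headers: Mapping[str, str]) -> bool:
    """Return True if a shared cache could safely answer this request."""
    # Only read-only requests are candidates for caching.
    if method.upper() not in ("GET", "HEAD"):
        return False
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    # A cookie suggests per-user content, so treat the request as uncacheable.
    return not normalized.get("cookie", "").strip()


print(is_cacheable("GET", {"User-Agent": "SomeSpider/1.0"}))  # prints: True
print(is_cacheable("GET", {"Cookie": "session=abc123"}))      # prints: False
print(is_cacheable("POST", {}))                               # prints: False
```

If every anonymous visitor were handed a cookie, the first check would pass but the second would fail for almost all traffic, and the cache would do very little.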

The final advantage of running Varnish on the shared IP, instead of running four copies behind a load-balancer is that there is only ever one instance of Varnish running - and it runs on the well-known shared IP address - so when we need to send a flush command we only do it once to a known address. The cache invalidation doesn't need to care about how many back-end hosts there are.

There were some minor loose ends involved in the migration; in a similar way to the (old) cache invalidation we assumed that when a new article was published we could regenerate the RSS feeds. In the new deployment that caused issues, as the hosts could each have out of sync RSS feeds.

My solution to this problem was to add a cronjob which regenerates the RSS feeds every five minutes; if the new feed differs from the old feed the cache is flushed. This means that on the publication of a new article the cache is potentially flushed four times, once by each host - but that's a small price to pay.

(There are also rules in place to always cache the RSS feeds, stripping cookies, etc. These rules are pretty site-specific and will almost certainly evolve.)
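The "regenerate, compare, flush only on change" logic of the cron job can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `refresh_feed`, the file name `index.rss`, and the `flush_cache` callback are all hypothetical, and a real job would send a purge to the cache on the shared IP rather than append to a list.

```python
import tempfile
from pathlib import Path
from typing import Callable


def refresh_feed(path: Path, new_feed: bytes, flush_cache: Callable[[], None]) -> bool:
    """Store new_feed at path and flush the cache, but only if the feed changed."""
    old_feed = path.read_bytes() if path.exists() else None
    if old_feed == new_feed:
        return False  # nothing new, leave the cache alone
    path.write_bytes(new_feed)
    flush_cache()
    return True


flushes = []  # records each flush, standing in for a real purge request

with tempfile.TemporaryDirectory() as tmp:
    feed = Path(tmp) / "index.rss"
    record_flush = lambda: flushes.append("flush")
    print(refresh_feed(feed, b"<rss>one</rss>", record_flush))  # prints: True
    print(refresh_feed(feed, b"<rss>one</rss>", record_flush))  # prints: False
    print(refresh_feed(feed, b"<rss>two</rss>", record_flush))  # prints: True

print(len(flushes))  # prints: 2
```

Run independently on each host, a check like this flushes at most once per host per new article, which matches the "potentially four times" noted above.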

The dashboard receives events when different things happen across all the machines in the cluster:

A host is rebooted.

RSS feeds are updated.

The cache is flushed.

A new article is published.

A user logs in / logs out / creates a new account.

A poll vote occurs.

A new comment is submitted.

The dashboard allows me a near-realtime update on the status of the cluster.

In conclusion, this site was previously hosted upon a single machine with 8GB RAM and two 500GB drives. It struggled, but now it should no longer do so. The cluster comprises nine new hosts:

2 x DB servers.

2GB RAM & 50GB disk

2 x Planet servers.

2GB RAM & 20GB disk

4 x Application / Cache hosts.

4GB RAM & 20GB disk.

1 x Misc host / status panel.

4GB RAM & 50GB disk.

The new deployment should scale pretty much indefinitely now. If the site is slow then I'll tune the databases. If there is too much load I'll add more application servers. If the planet gets popular I'll add varnish there too.

Finally because I've been annoyed too much by the narrow layout, on a site where the content should be king, I've removed the left-panel, and made the right-panel collapsible. That allows more room for the delicious crunchy text.

I hope this was an interesting entry; it is probably the one that has taken the most effort to engineer, plan, and execute, in the history of the site.

What about just round robin DNS with a low TTL across the 4 servers? You could even automate removal of a dead server via cron, assuming bytemark has a DNS API. I've run into a couple bugs before with VIPs, plus it's harder to get right than DNS in my opinion.

Even the lowest (sane) TTL you can set is too long for me: people unlucky enough to hit, and cache, the address of a dead server would be stuck with it for that whole period.

There is a complication of having floating IPs, and even in the current deployment varnish won't poll back-ends more than every half-minute, but I think using DNS for load-balancing isn't suited to dynamic servers. There's just too large a window where folk will suffer from, and see, downtime.

This is a great document of your thinking and your infrastructure, Steve. Thanks for sharing it.

I just noticed that the redeployed site now relies on javascript from http://ajax.googleapis.com. For those of us who use RequestPolicy to restrict cross-domain requests (or https users with browsers that reject "mixed content"), that means that some parts of the site don't work. In particular, i'm now unable to add tags to my weblog entries using the "add new tag" dynamic thing.

The only thing fetched from ajax.googleapis.com appears to be jquery, which you're also offering locally. Any chance you could pull from the local jquery instead of making the cross-domain request? The local version isn't minified, but i'm sure you could minify it if size of transfer is an issue :)

No problem. I was chatting to a colleague who didn't know you could restrict keys by IP so it seemed like an obvious thing to document & "expose". - (Wrong article!)

Re: jQuery - absolutely happy to serve it locally instead of the google-version. It was a change I made for testing that slipped live. I'm in the middle of optimizing the site for speed and I'll be reworking the js/css over the weekend so it'll be done then.

Great, thanks. You might look into fixing the flattr js inclusion as well. if you can't avoid a cross-domain request, you should at least use a scheme-less URL (e.g. href="//api.flattr.com/js/0.5.0/load.js?mode=auto") so that https users don't get a mixed-content warning (i just tested and they do serve that js over https). But avoiding the cross-domain request is better IMHO if you can do it.

Good, clear, well-written article Steve, thanks very much for sharing.

I've done this sort of thing before using haproxy [0] and a farm of appservers and db servers behind it. Did you consider that solution and if so any reason to discard it?

I really like the idea that any of the app servers can be the reverse proxy. You've then taken care of the SPOF at that level (and this might answer my question above). I'm interested in what software does the virtual IP failover between the app servers - I take it something from linux-heartbeat/pacemaker project?

I take it the mysqld's are using InnoDB? AIUI tuning at that level involves getting the "working set" into memory to avoid expensive disc i/o. That said MyISAM is probably fine if you're doing mostly reads and inserts.

Seems like it's Perl/CGI behind the scenes? I'm starting to write my webapps using Perl/CGI running in a mod_perl environment and the speed-up is there but at the cost of some complexity and quirks. Did you deliberately keep it simple with plain CGI because the speed was fine with nginx?

I've used haproxy before, and I didn't choose it this time round for two main reasons:

I knew I'd want to use the varnish cache anyway, to reduce the load on the back-end servers.

I wanted to decouple the IP-sharing from the load-balancing.

The avoidance of the single point of failure was a nice bonus, and I can say that it worked - a few weeks ago there was an outage affecting one database server, and two of the application servers. I tested the site and was pleased it all worked as designed!

The IP sharing / failover is handled using ucarp, which is nice and lightweight.

As for the other comments; yes minor tuning of MySQL to make sure all data fits into RAM, and due to historical reasons the whole code-base is written as a modular series of Perl modules & CGI-scripts. Typically the perl + CGI overhead isn't the bottleneck, it is the SELECT queries and the per-user lookups which are slower than I'd like - those have been tuned a little, but there is room for more work in that direction.

Very interesting article; been chewing on a similar upgrade to another site. Can you expand a little on how the floating IP works? The linked article describes floating an IP between two servers for failover, with an active and a standby server. Doesn't putting 4 servers there mean one is active, and 3 are idle? How does the IP move around to spread the load? Thanks for all your work!

You're almost there. As you say there are four hosts each of which can have the floating IP - which is the IP visitors reach.

That feels like it should mean one server is running all the load, and the other three should be idle.

But the thing you've missed is that the floating IP runs varnish; and varnish is configured to fetch pages from each of the four backend machines.

Varnish will remove a host from the list if it is down/unreachable.

So the end result is:

All four machines run Apache non-stop.

One host will have the floating IP moved to it, and will have varnish started.

The master host will thus be able to make requests from any machine which is alive and running apache.

This means if a single machine fails we're in a simple state: If the machine that fails is not the master then apache will stop, and varnish will not use it to serve visitors any more.

If the failed machine is the master then the IP-failover will work, and promote a new host to be the master-varnish node. That will then try to poll each of the four, find one dead, and serve from the other three (i.e. this includes itself.)
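The two failure cases can be modelled in a few lines of Python. This is only a sketch: the host names are assumptions, and the rule that the first surviving host in the list claims the floating IP is a simplification chosen for the example, not a description of how ucarp actually elects a master.

```python
from typing import List, Optional, Tuple

HOSTS = ["web1", "web2", "web3", "web4"]  # hypothetical host names


def after_failure(
    master: Optional[str], alive: List[str], failed: str
) -> Tuple[Optional[str], List[str]]:
    """Return (host holding the floating IP, hosts still serving Apache)."""
    survivors = [host for host in alive if host != failed]
    if master == failed:
        # Simplifying assumption: the first survivor claims the floating IP.
        master = survivors[0] if survivors else None
    return master, survivors


# Case 1: a non-master host dies; the master simply stops using it.
print(after_failure("web1", HOSTS, "web3"))
# prints: ('web1', ['web1', 'web2', 'web4'])

# Case 2: the master dies; another host takes the IP and serves from the rest.
print(after_failure("web1", HOSTS, "web1"))
# prints: ('web2', ['web2', 'web3', 'web4'])
```

In both cases three back-ends remain, and the new master in the second case is itself one of them.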