Lanyrd’s Big Move

Last week we pulled off two major changes to the infrastructure which runs
Lanyrd, and both with no downtime - but what exactly did we do, and why?

The first major change was perhaps the tougher of the two - we moved
Lanyrd's main database from MySQL to PostgreSQL.

For those unfamiliar with the intricacies of databases, this is tough as it
means we need to convert the entire site's data format in one go - and if it
goes wrong, we have to roll back or, if things aren't monitored closely, there's
a small risk of losing some data.

We made this move mostly for database features - MySQL lacks a full
transaction model and fast column addition, and it's quite bad at using
multiple CPU cores, whereas PostgreSQL handles all of these well - which
will let us make changes to the site in the future with no downtime or
read-only mode at all.

The second major change was moving from Amazon Web Services - EC2 and RDS in
particular - to running on dedicated hardware, which we rent through SoftLayer.

There's nothing wrong with AWS - indeed, we still run a staging environment there -
but our database benefits greatly from the low latencies of physical disks, and
there aren't very many hosted PostgreSQL services on EC2 that fit our needs.

As part of the move, we also rearranged our services so that we have no
single point of failure - everything is either running on multiple servers
(like our Django code) or has a warm standby (the databases and load-balancer).

Read-only mode

Both changes required us to stop saving new data to Lanyrd during the move, so
we opted to do them at the same time to minimise the amount of time we spent
in read-only mode. There's some risk involved here, of course - doing two major
changes at once requires more careful planning and rehearsal - but we prefer
to keep read-only periods as rare and short as possible (in fact, this was
only the second one since Lanyrd's launch).
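
To give a feel for what a read-only mode can look like, here is a minimal sketch in Python. It is not Lanyrd's actual implementation, which isn't described here; it assumes a plain WSGI application, and the names ReadOnlyMiddleware, demo_app and call are made up for illustration. The idea is simply to keep serving reads while refusing any request that could write data.

```python
from wsgiref.util import setup_testing_defaults

# HTTP methods that should never modify data.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class ReadOnlyMiddleware:
    """Hypothetical WSGI wrapper: refuse writes while read_only is True."""

    def __init__(self, app, read_only=False):
        self.app = app
        self.read_only = read_only

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if self.read_only and method not in SAFE_METHODS:
            body = b"The site is temporarily read-only; please try again later.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Reads (and everything else when not read-only) pass straight through.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for the real site (assumption, for the demo only)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def call(app, method):
    """Invoke a WSGI app with a fake request and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["REQUEST_METHOD"] = method
    seen = []
    body = b"".join(app(environ, lambda status, headers: seen.append(status)))
    return seen[0], body


if __name__ == "__main__":
    app = ReadOnlyMiddleware(demo_app, read_only=True)
    print(call(app, "GET")[0])
    print(call(app, "POST")[0])
```

A real site would also want to hide or disable its forms and explain what is going on to signed-in users, rather than only rejecting the requests.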

During the week before the move, I scripted as much of the move process as
possible, giving us one command that would sync our main database, our Redis
data and our search data. I then did several dry runs against a test
environment, with Tom and Simon helping out with checking and ideas over the
week.

We caught a few bugs, mostly with the database conversion. The conversion was
performed by a dump converter I'd written myself, and we had a few problems
with escaping and missing indexes, but those were both spotted by the
eagle-eyed Lanyrd team during the testing phase.
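
As an illustration of the kind of escaping problem that can come up (this is a sketch, not the converter itself), MySQL dumps typically write string literals with backslash escapes such as \' and \n, whereas PostgreSQL's standard string literals treat backslashes as ordinary characters and escape a quote by doubling it. The function name mysql_literal_to_postgres is made up for this example, and it assumes the target server has standard_conforming_strings turned on.

```python
# Backslash escapes MySQL recognises inside a quoted string literal.
MYSQL_ESCAPES = {
    "0": "\0", "'": "'", '"': '"', "b": "\b", "n": "\n",
    "r": "\r", "t": "\t", "Z": "\x1a", "\\": "\\",
}


def mysql_literal_to_postgres(literal):
    """Convert one MySQL-escaped 'quoted' literal to a PostgreSQL one.

    Hypothetical helper for illustration; assumes PostgreSQL is running
    with standard_conforming_strings = on.
    """
    if len(literal) < 2 or literal[0] != "'" or literal[-1] != "'":
        raise ValueError("expected a single-quoted literal")
    inner = literal[1:-1]
    out = []
    i = 0
    while i < len(inner):
        ch = inner[i]
        if ch == "\\" and i + 1 < len(inner):
            nxt = inner[i + 1]
            if nxt in "%_":
                # MySQL keeps the backslash for \% and \_ outside LIKE patterns.
                out.append("\\" + nxt)
            else:
                # Unknown escapes simply drop the backslash, as MySQL does.
                out.append(MYSQL_ESCAPES.get(nxt, nxt))
            i += 2
        elif ch == "'" and inner[i + 1:i + 2] == "'":
            # A doubled quote is also a valid way to write one quote in MySQL.
            out.append("'")
            i += 2
        else:
            out.append(ch)
            i += 1
    value = "".join(out)
    if "\0" in value:
        raise ValueError("PostgreSQL text values cannot contain NUL characters")
    # In a standard PostgreSQL literal only the quote needs escaping.
    return "'" + value.replace("'", "''") + "'"


if __name__ == "__main__":
    print(mysql_literal_to_postgres(r"'It\'s a \"test\"'"))
```

Getting this wrong tends to show up as stray backslashes in the migrated text, or as a dump that fails to load at the first apostrophe.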

If you're interested in the database conversion script we used, you can find it in our GitHub repository - there are some more technical details about how it works in my blog post.

Running the gauntlet

We'd analysed our traffic and picked Tuesday morning as the time that would
affect the fewest people. One of the advantages of being a UK startup is that
the time difference means the US and Canada are asleep during our morning,
giving a nice low-traffic window that's still in working hours.

With that set, Monday was spent on a final dry run and a quick load test of the
new site using a traffic-replaying system we have, and then the move took place
on the Tuesday, at 10am.

Apart from one minor hiccup with getting read-only mode turned on, the move went
quite smoothly, and we were back up and out of read-only mode before midday.
Lanyrd stayed available throughout the move, and read-only mode did its job
admirably.

With only one more minor problem during the last week - which we were able to deal
with swiftly - things do seem to have gone rather well, and we're eager to start
putting our more powerful servers and new database features to good use!

James: almost all of our database queries are handled by Django's ORM, which abstracts away most of the differences. We had to fix a couple of places that were using GROUP BY and attempting to ORDER BY a column that wasn't grouped (MySQL is OK with this, PostgreSQL isn't) - they were .annotate() queries that had a default ordering, and the fix was to add .order_by() to the end. We also had a few hand-written queries which needed fixing. We tested using our existing unit tests and by running PostgreSQL on our staging server for a few weeks before the move.

We still had the existing site running on MySQL on EC2, and since it was in read-only mode there were no changes being made to that database - if the migration to PostgreSQL/SoftLayer had failed, we would have switched the old site out of read-only mode and continued as we had before (then figured out what went wrong and scheduled another migration window).