Joining the WireHive

We moved to Rackspace a few years ago when we started using MODX as our CMS platform. I designed an infrastructure using Puppet and the new “auto scaling” product that they had to offer.

The concept was simple - design a solution that would have nearly no single point of failure (SPOF) but be cost effective as well. For a few years this worked well, but we grew… grew so much that the total disk space for all clients' applications was just over 20GB, and over the past six months we started to see issues with auto-scale times.

There are two approaches to auto scaling:

Start with a “fresh” server image which is the base flavour at Rackspace and then on boot run our whole puppet script which would checkout every site in a synchronous manner, and start nginx when it’s complete.

Or create our own server image nightly with everything ready to go and then let it play catch-up with that day's deployments.

Both of these approaches work fine with a small amount of changes - but with bigger disk sizes they both take way too long to scale.

With the first approach, we saw the Puppet run take nearly 25-30 minutes from scratch (we even tried hosting the git repos on our own GitLab instance to reduce network latency for the git checkout).

The second option was what we had always done, but because of the way OpenStack handles server images we were getting penalised on network transfer. When you create a server image it is moved off to Cloud Files for storage, so when you create a new server from the image it has to be copied back first. This meant our auto-scale took anywhere from 5 minutes to 25 minutes.

This caused a lot of intermittent “slow downs”, which was frustrating because we could not scale quickly enough to cater for our clients' traffic needs.

We also saw some disk latency issues with our Percona Galera Cluster where the cluster would have to wait for disk IO to free up before it could continue processing SQL statements.

I’ve been an advocate for Rackspace for many years - but I think we finally outgrew their cloud offering. So the time has come to migrate our hosting to another provider to allow us to continue controlling our infrastructure.

Welcome WireHive & Amazon Web Services (AWS). Moving to AWS was an interesting choice because they have a lot of offerings and we needed to be sure they could cater for our use case.

The Objective

To move all client sites from Rackspace London to AWS Ireland with minimal downtime for all clients and the least possible impact across the service.

Initial Challenges

The two biggest issues we had surrounding the migration were:

We only control DNS for about 9% of client websites, so we needed to ensure clients were prepared to update DNS.

All MODX websites upload files to Rackspace Cloud Files - we needed to migrate this to S3 and update all code/links to reference the new S3 buckets.

Where to begin

To start with I created a Google doc highlighting the details of every client that was impacted. This displayed as much information as possible to help maintain visibility across the whole migration. Some of the columns were:

Domain Name

Any/all aliases also hosted on our infrastructure

New DNS records

Whether we control the DNS

Any integrations that need to be notified of new IPs (Sagepay / firewalls / etc)

The current state of Client Comms

Notified of move?

Notified of new DNS records?

Responded?

Cloud files moved?

Container Name

S3 Bucket Name

Server Moved?

Risks

This sheet was shared with the other developers and the account managers at Adido. Thankfully our account managers dealt with all client communications amazingly well and everyone was prepared for when the time came to update DNS records.

Once we had our AWS account set-up and ready to go the time came to start the move.

First up: Cloud Files

Originally I planned to move Cloud Files after the servers, but the dev team had a little downtime at the beginning of the week, so we used it to get ahead on this. Surprisingly, getting files out of Cloud Files was quite a challenge. We had time-out issues, concurrency issues, and things just taking ages.

To give an idea of how much content we had to move - 250,000 files totalling 30GB in about 60 different containers. The key to this content migration was to tackle each client one at a time and to move the files as quickly as possible - to minimize the risk of a client uploading a file to the old container before the settings were updated to look at S3.

We tried Cyberduck to download the containers locally and then upload them directly to an S3 bucket, but the total number of files in folders/containers would cause Cyberduck to crash.

There was also Turbolift which I have used in the past to bulk upload vast quantities of files very quickly. Unfortunately I could not get the download functionality to work at all.

I wrote a very basic script that would download a container through the API, but this was very slow as it was only doing one file at a time. However, at the eleventh hour, whilst reading Hacker News, a tool called rclone appeared on the front page. This would let us copy files from a container directly to S3 from a server, and it is also concurrent.
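To illustrate why concurrency mattered here, the following is a minimal sketch of the general idea rather than what rclone or our script actually does. It assumes a hypothetical `copy_one(name)` callable that moves a single object from the old container to the new bucket, and it collects failures instead of stopping at the first one.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def copy_all(names, copy_one, workers=8):
    """Copy every object in `names` using a pool of worker threads.

    `copy_one(name)` is a hypothetical callable that downloads one object
    from the old container and uploads it to the new bucket.
    Returns (copied, failed), where `failed` maps name -> exception.
    """
    copied, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit every copy up front; the pool runs `workers` of them at a time.
        futures = {pool.submit(copy_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()
                copied.append(name)
            except Exception as exc:
                # Keep going and report the failures at the end.
                failed[name] = exc
    return sorted(copied), failed
```

Because each transfer spends most of its time waiting on the network, threads are usually enough here; the names in `failed` can simply be fed back in for another pass.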

So we now had a process to migrate clients from Cloud Files to S3:

Start migration of files

Add our bespoke URL replace script to the site and test in development that all Cloud Files links update (links are possible within content areas, template variables, product information and content block details)

When migration finishes - update Production Media source information to be S3

Run url replace script in production

Testing
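The URL replace step in the process above can be sketched roughly as follows. This is not our actual script, just an illustration of the idea: the hostnames are made-up examples, and it assumes each old container maps to exactly one new base URL.

```python
import re


def build_replacer(host_map):
    """Return a function that rewrites old CDN links to their new base URLs.

    `host_map` maps an old hostname to the new base URL. Both the hostnames
    and the URLs used with it here are hypothetical examples.
    """
    if not host_map:
        raise ValueError("host_map must not be empty")
    normalised = {host.lower(): base.rstrip("/") for host, base in host_map.items()}
    hosts = "|".join(re.escape(host) for host in normalised)
    # Match http:// or https:// followed by one of the old hostnames, and
    # make sure the hostname is not just a prefix of a longer one.
    pattern = re.compile(r"https?://(" + hosts + r")(?![\w.-])", re.IGNORECASE)

    def replace(text):
        return pattern.sub(lambda m: normalised[m.group(1).lower()], text)

    return replace
```

The same function can then be run over each place a link might live (content areas, template variables and so on), first in development and then in production.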

This process worked flawlessly for most of our clients. For some, the file migration took a long time (the longest was nearly 6 hours), so we ran these overnight and updated the production details in the morning.

We immediately saw some issues with the MODX S3 media source to do with generating image thumbnails and the number of images within folders. So we made some changes to the media sources we use internally and are hoping to push these back to the community in the coming weeks.

On to the next task

Server Provisioning

Because we had issues with the boot time of servers at Rackspace, I was aware that this was something we would have to get right from the start and using AWS allowed us to create our own AMIs.

Previously we were using Puppet to orchestrate our servers and ensure they were up to date, however over the years our Puppet codebase got quite messy and hacked apart so it was time to start afresh. Welcome Ansible.

Ansible looked a lot cleaner, and the WireHive team are all advocates of it, so they can give us support on these scripts when we need it. Our playbook had two main objectives:

Install and setup all required software on a clean server

Checkout all our git repos and allow fast deployment for our team

The first objective was fairly simple: the YAML structure of roles allowed us to create a few different folders and tasks to cover our whole infrastructure relatively easily.

Once this objective was complete it was time for the tricky one: how do you manage many client sites and only deploy one if you need to? The solution was using tags and variables. Writing our own role and creating a group variable which included all client sites and their respective branch + commit hash allowed me to use the “with_items” option and loop our build task for every client. It also gave us the ability to run a deployment for a single client. Couple this with the EC2 inventory script and we had a fully dynamic deployment system.
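As a rough illustration of how the group variable and tags fit together, here is a sketch of a small wrapper that builds the `ansible-playbook` command for either every site or a single one. The variable name `sites`, the tag `deploy`, the file names and the example entries are all assumptions for this sketch, not our real configuration.

```python
import json

# Hypothetical shape of the group variable: one entry per client site.
SITES = [
    {"name": "client-a", "branch": "master", "commit": "3f2a9c1"},
    {"name": "client-b", "branch": "release", "commit": "9b7e440"},
]


def deploy_command(sites, only=None, playbook="site.yml", inventory="ec2.py"):
    """Build an ansible-playbook command that deploys all sites, or just one.

    The selected sites are passed as extra vars, which take precedence over
    the group variable, so the loop in the role only sees those entries.
    """
    selected = [site for site in sites if only is None or site["name"] == only]
    if not selected:
        raise ValueError(f"unknown site: {only}")
    return [
        "ansible-playbook",
        "-i", inventory,          # dynamic EC2 inventory script (assumed name)
        playbook,
        "--tags", "deploy",       # only run tasks tagged for deployment
        "--extra-vars", json.dumps({"sites": selected}),
    ]
```

The resulting list can be handed to `subprocess.run`, for example `deploy_command(SITES, only="client-b")` to deploy just one client.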

So now that we had Ansible doing deployments and managing the server infrastructure, we needed a way to invoke Ansible and manage everything on a human level. Previously we were using HipChat for our ChatOps but have moved to Slack in the hope that we can add more tools for the rest of the company and roll it out to everyone, increasing visibility.

Welcome Marvin

Marvin is our new robot. Based on GitHub’s Hubot, he handles all our deployments and AWS interactions, from creating nightly AMIs and updating our auto-scale groups to deploying new versions of websites through Ansible. In the coming weeks and months we are going to extend him further to integrate more with our internal tooling.

In our main Ops room we also have New Relic sending us performance alerts if client websites are performing poorly, and I’ve set up an AWS Lambda function to send notifications into Slack about scaling activity as well as alarms like high CPU and unhealthy nodes.
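A Lambda function along these lines might look like the sketch below. It assumes the scaling notifications and alarms arrive via SNS and that Slack is reached through an incoming webhook whose URL is stored in an environment variable; the variable name `SLACK_WEBHOOK_URL` and the message wording are made up for the example.

```python
import json
import os
import urllib.request


def format_message(message):
    """Turn an SNS message body (already parsed from JSON) into Slack text."""
    if "AlarmName" in message:
        # CloudWatch alarm notification, e.g. high CPU or unhealthy nodes.
        return "Alarm {} is {}: {}".format(
            message["AlarmName"],
            message.get("NewStateValue", "UNKNOWN"),
            message.get("NewStateReason", ""),
        )
    if "AutoScalingGroupName" in message:
        # Auto Scaling notification, e.g. an instance launch or termination.
        detail = message.get("Description") or message.get("Event", "scaling activity")
        return "{}: {}".format(message["AutoScalingGroupName"], detail)
    return json.dumps(message)


def handler(event, context):
    """Lambda entry point, assuming the function is subscribed to an SNS topic."""
    webhook = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical variable name
    for record in event["Records"]:
        body = json.loads(record["Sns"]["Message"])
        payload = json.dumps({"text": format_message(body)}).encode("utf-8")
        request = urllib.request.Request(
            webhook, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(request, timeout=5)
```

Keeping the formatting separate from the HTTP call makes it easy to check the message text locally without touching Slack.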

We were nearly there.

Third party updates

One thing we had to ensure was that any third party integrations would keep working once we moved. So I had a list of sites that would require extra testing from the new infrastructure. Once we had Ansible updating the auto-scale group (which was sitting idle as it was receiving no traffic at that point), I could direct my own traffic to the respective client sites (by updating my local hosts file) and test the process end-to-end. This gave us the reassurance that everything would be OK.

Database Migration

We moved from our Percona Galera Cluster to a MySQL RDS instance. After some playing around with security groups I finally managed to get all our data imported (thank you mydumper and the great Percona blog for explaining how to set this up).

After the 90 minutes it took to import all the data, the RDS instance was set up as a slave and another thing was checked off the list. We weren’t sure what settings to put in the parameter group, although we had a few that we could set based on our previous database setup. This would be something that would need tweaking over the coming weeks.

One thing that we had to be careful about when migrating the sites was not breaking replication. This meant some very clever DNS tricks were needed. All our sites use the same hostname for the DB connection, which is a CNAME to its actual destination, so I updated the hosts file on the old infrastructure to reference the old database and used the normal DNS to reference the new server. This meant we wouldn’t need any code updates for the new DB connections.

Traffic Manipulation and Go-Live

When we were finally happy to go live on the new infrastructure, there was a very simple process to follow:

Direct all traffic to a single node in the Percona Cluster

Set up a big instance at AWS with an elastic IP

Lock the tables on the Percona Cluster to stop writes

Run the script to update all Rackspace Load Balancers with the new elastic IP and remove any nodes

This process will send all traffic to AWS and the new infrastructure without requiring any DNS updates in the immediate future.

So late Sunday night I hit the switch to redirect the traffic, and after what can only be described as the longest single minute of my life, all the load balancers had updated and the sites were back up. On to the next task.

DNS Updates

We had access to DNS for a few clients' sites, and this was key to getting clients moved off the single server being proxied through Rackspace and onto the auto-scale infrastructure at AWS. Each client had their own CNAME record, which is managed in Route 53. This would allow us to move their sites around in the future without having to trouble them with DNS updates.

So Monday morning our account managers got busy notifying clients that it was time to update to the new DNS records. Because we could not use a CNAME on the root domain of a website, we maintain an A record for the root and a CNAME for the www record.
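To make the record layout concrete, here is a small sketch of the two halves: the per-client record we manage ourselves, expressed as a Route 53 change batch, and the records each client creates at their own DNS provider. The zone, client name, load balancer hostname and IP address used with it are placeholders, not our real values.

```python
def client_cname_change(client, zone, target, ttl=300):
    """Route 53 change batch for the per-client record we manage ourselves.

    `target` would be the DNS name of the load balancer serving the client.
    """
    return {
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"{client}.{zone}",
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ]
    }


def records_for_client(domain, client, zone, root_ip):
    """Records the client creates at their own DNS provider."""
    return [
        # The root of a domain cannot be a CNAME, so it gets an A record.
        (domain, "A", root_ip),
        # www follows our per-client record, so we can move it later.
        (f"www.{domain}", "CNAME", f"{client}.{zone}"),
    ]
```

With boto3, a batch like this would be passed as the `ChangeBatch` argument of the Route 53 client's `change_resource_record_sets` call, alongside the hosted zone ID. Moving a client later then only means changing `target` on our side.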

In conclusion

Five days on and the infrastructure has settled down. It’s been a bit of a steep learning curve with understanding how EC2 instances work and what would be best for our set-up but with the help of WireHive and www.ec2instances.info, it looks like we have a scaling solution that works well for us.

Lessons Learnt

There are a lot of settings within AWS and EC2 to get right, and it can be quite confusing at first. I had some fun with security groups and getting servers talking to each other.

Moving large amounts of files across the internet results in many errors. Mainly connection drop outs and transfer failures - build in retry attempts to allow you to carry on instead of starting again.

AWS Application Load Balancers only support one SSL certificate - this stopped my original plan, but using a classic load balancer allowed us to mimic our old infrastructure, even if it’s just for the moment.

Giving clients a bespoke DNS record for their website gives us more control over moving their site in the future.
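The lesson above about building in retry attempts could be sketched like this. It is a generic helper, not the code we ran: `action` stands for any single transfer step, and the attempt count and delays are arbitrary defaults.

```python
import time


def with_retries(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, backing off between failures.

    Only network-style errors (OSError and its subclasses, which include
    ConnectionError and TimeoutError) are retried; anything else propagates.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # out of attempts, let the caller see the error
            # Wait 1x, 2x, 4x ... the base delay before trying again.
            sleep(base_delay * 2 ** (attempt - 1))
```

Wrapping each file transfer in something like this means a dropped connection costs one retry of one file rather than a restart of the whole container.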

Performance improvements

The main point of this exercise was to improve the performance of our clients' websites and to reduce the time it takes our platform to scale. We are now scaling new servers in around 1-2 minutes, which is considerably quicker than the old method.

In terms of site performance the below graphs are from an e-commerce site, on our old platform against our new platform
