This is a guest post by Oleksiy Kovyrin, Head of Technical Operations at Swiftype. Swiftype currently powers search on over 100,000 websites and serves more than 1 billion queries every month.

When Matt and Quin founded Swiftype in 2012, they chose to build the company’s infrastructure using Amazon Web Services. The cloud seemed like the best fit because it was easy to add new servers without managing hardware and there were no upfront costs.

Unfortunately, while some of the services (like Route53 and S3) ended up being really useful and incredibly stable for us, the decision to use EC2 created several major problems that plagued the team during our first year.

Swiftype’s customers demand exceptional performance and always-on availability, and our ability to provide that depends heavily on how stable and reliable our basic infrastructure is. With Amazon we experienced networking issues, hanging VM instances, unpredictable performance degradation (probably due to noisy neighbors sharing our hardware, but there was no way to know), and numerous other problems. No matter what problems we experienced, Amazon always had the same solution: pay Amazon more money for redundant or higher-end services.

The more time we spent working around the problems with EC2, the less time we could spend developing new features for our customers. We knew it was possible to make our infrastructure work in the cloud, but the effort, time and resources it would take to do so was much greater than migrating away.

After a year of fighting the cloud, we decided to leave EC2 for real hardware. Fortunately, this no longer means buying your own servers and racking them up in a colo: managed hosting providers offer a good balance of physical hardware, virtualized instances, and rapid provisioning. Based on our previous experience with hosting providers, we chose SoftLayer. Their excellent service and infrastructure quality, provisioning speed, and customer support made them the best choice for us.

After more than a month of hard work preparing the inter-data center migration, we were able to execute the transition with zero downtime and no negative impact on our customers. The migration to real hardware resulted in enormous improvements in service stability from day one, provided a huge (~2x) performance boost to all key infrastructure components, and reduced our monthly hosting bill by ~50%.

This article will explain how we planned for and implemented the migration process, detail the performance improvements we saw after the transition, and offer insight for younger companies about when it might make sense to do the same.

Preparing for the switch

Before the migration, we had around 40 instances on Amazon EC2. We would experience a serious production issue (instance outage, networking issue, etc.) at least 2-3 times a week, sometimes daily. Once we decided to move to real hardware, we knew we had our work cut out for us, because we needed to switch data centers without bringing down the service. The preparation process involved two major steps, each of which is explained in its own section below:

Connecting EC2 and SoftLayer. First, we built a skeleton of our new infrastructure (the smallest subset of servers to be able to run all key production services with development-level load) in SoftLayer’s data center. Once the new data center was set up, we built a system of VPN tunnels between our old and our new data centers to ensure transparent network connectivity between components in both data centers.

Architectural changes to our applications. Next, we needed to make changes to our applications to make them work both in the cloud and on our new infrastructure. Once the application could live in both data centers simultaneously, we built a data-replication pipeline to make sure both the cloud infrastructure and the SoftLayer deployment (databases, search indexes, etc) were always in-sync.

Step 1: Connecting EC2 and SoftLayer

One of the first things we had to do to prepare for our migration was figure out how to connect our EC2 and our SoftLayer networks together. Unfortunately the “proper” way of connecting a set of EC2 servers to another private network – using the Virtual Private Cloud (VPC) feature of EC2 – was not an option for us since we could not convert our existing set of instances into a VPC without downtime. After some consideration and careful planning, we realized that the only servers that really needed to be able to connect to each other across the data center boundary were our MongoDB nodes. Everything else we could make data center-local (Redis clusters, search servers, application clusters, etc).

Since the number of instances we needed to interconnect was relatively small, we implemented a very simple solution that proved to be stable and effective for our needs:

Each data center had a dedicated OpenVPN server deployed in it that NAT’ed all client traffic to its private network address.

Each node that needed to connect to the other data center would establish a VPN channel to it and configure local routing to forward all connections directed at the other DC into that tunnel.

Here are some features that made this configuration very convenient for us:

Since we did not control network infrastructure on either side, we could not really force all servers on either end to funnel their traffic through a central router connected to the other DC. In our solution, each VPN server decided (with the help of some automation) which traffic to route through the tunnel to ensure complete inter-DC connectivity for all of its clients.

Even if a VPN tunnel collapsed (surprisingly, this only happened a few times during the weeks of the project), it would only mean that one server lost its outgoing connectivity to the other DC (one node dropped out of the MongoDB cluster, a worker server lost connectivity to the central Resque box, etc). None of those one-off connectivity losses affected our infrastructure, since all important infrastructure components had redundant servers on both sides.
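The per-node routing automation described above can be sketched roughly as follows. This is a minimal illustration, not our actual tooling: the subnet map and the `tun0` interface name are hypothetical values chosen for the example.

```python
# Sketch of the per-node routing automation: given a map of each data
# center's private subnets, emit the "ip route" commands that send
# cross-DC traffic into the local VPN tunnel. Subnets and the tun0
# interface name are illustrative, not real values.

REMOTE_SUBNETS = {
    "softlayer": ["10.20.0.0/16", "10.21.0.0/16"],
    "ec2": ["10.40.0.0/16"],
}

def route_commands(local_dc: str, tunnel_iface: str = "tun0") -> list:
    """Build routing commands for every subnet that lives in the *other* DC."""
    commands = []
    for dc, subnets in REMOTE_SUBNETS.items():
        if dc == local_dc:
            continue  # local subnets are reachable directly, no tunnel needed
        for subnet in subnets:
            commands.append(f"ip route add {subnet} dev {tunnel_iface}")
    return commands

if __name__ == "__main__":
    # A node in EC2 routes the SoftLayer subnets through its tunnel.
    for cmd in route_commands("ec2"):
        print(cmd)
```

Because each node only computes routes for subnets it cannot reach locally, losing one tunnel isolates only that node's cross-DC traffic, which matches the failure behavior described above.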

Step 2: Architectural changes to our applications

There were many small changes we had to make to our infrastructure in the weeks of preparation for the migration, but a deep understanding of each and every component helped us make decisions that reduced the chance of a disaster during the transitional period. I would argue that an infrastructure of almost any complexity can be migrated given enough time and engineering resources to carefully consider every network connection established between applications and backend services.

Here are the main steps we had to take to ensure smooth and transparent migration:

For each stateful backend service (database, search cluster, async queues, etc.) we had to consider whether we wanted (or could afford) to replicate the data to the other side, or whether we had to incur inter-data-center latency for all connections. Relying on the VPN was always considered a last-resort option, and eventually we were able to reduce the traffic between data centers to a few small replication streams (mostly MongoDB) and connections to the primary/main copies of services that could not be replicated.

If a service could be replicated, we did so, and then made the application servers always use (or at least prefer) the local copy of the service instead of going to the other side.

For services we could not replicate using their internal replication capabilities (like our search backends), we changed our applications to implement replication between data centers: we would write every asynchronous job into queues in both data centers, and asynchronous workers on each side would pull jobs from their respective local queues.
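The dual-queue scheme above can be illustrated with a minimal in-memory sketch. The real system used Resque-style Redis-backed queues; the queue names, job shape, and helper functions here are made up for the example.

```python
# Minimal illustration of application-level replication: every async
# job is written to queues in *both* data centers, and workers on each
# side pull only from their local queue. Plain in-memory lists stand in
# for the real (Resque/Redis-backed) queues.

from collections import defaultdict
from typing import Optional

QUEUES = defaultdict(list)  # queue name -> list of pending jobs
DATA_CENTERS = ["ec2", "softlayer"]

def enqueue_everywhere(job: dict) -> None:
    """Fan the job out to the indexing queue in every data center."""
    for dc in DATA_CENTERS:
        QUEUES[f"{dc}:index"].append(job)

def work_one(dc: str) -> Optional[dict]:
    """A worker in `dc` pulls the next job from its local queue only."""
    queue = QUEUES[f"{dc}:index"]
    return queue.pop(0) if queue else None

if __name__ == "__main__":
    enqueue_everywhere({"action": "reindex", "doc_id": 42})
    # Each data center sees and processes its own copy of the job,
    # keeping both search backends in sync without cross-DC reads.
    print(work_one("ec2"))
    print(work_one("softlayer"))
```

The key property is that writes are duplicated at enqueue time, so neither side's workers ever need to reach across the VPN to stay in sync.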

Step 3: Flipping the switch

When both sides were ready to serve 100% of our traffic, we prepared for the final switchover by reducing our DNS TTL down to a few seconds, so that the traffic change would propagate quickly.

Finally, we switched traffic to the new data center. Requests switched to the new infrastructure with zero impact on our customers. Once traffic to EC2 had drained, we disabled the old data center and forwarded all remaining connections from the old infrastructure to the new one. DNS updates take time, so some residual traffic was visible on our old servers for at least a week after the cut-off time.

A clear improvement: Results after moving from EC2 to real hardware

Stability improved. We went from 2-3 serious outages a week (most were not customer-visible, since we did our best to make the system resilient to failures, but many would wake someone up or force someone to abandon family time) down to 1-2 outages a month. We were able to handle these more thoroughly by spending engineering resources on increasing system resilience and reducing the chance of any impact on customer-visible availability.

Performance improved. Thanks to the modern hardware available from SoftLayer, we saw a consistent performance increase across all of our backend services (especially IO-bound ones like databases and search clusters, but CPU-bound app servers as well). More importantly, performance became much more predictable: no sudden dips or spikes unrelated to our own software’s activity. This allowed us to start doing real capacity planning instead of throwing more slow instances at every performance problem.

Costs decreased. Last, but certainly not least for a young startup, the monthly cost of our infrastructure dropped by at least 50%, which allowed us to over-provision some of the services to improve performance and stability even further, greatly benefiting our customers.

Provisioning flexibility improved, but provisioning time increased. We can now specify servers to exactly match their workload (lots of disk doesn’t require a powerful CPU). However, we can no longer start new servers in minutes with an API call; SoftLayer generally adds a new server to our fleet within 1-2 hours. This is a big trade-off for some companies, but one that works well for Swiftype.

Conclusion

Since switching to real hardware, we’ve grown considerably – our data and query volume is up 20x – but our API performance is better than ever. Knowing exactly how our servers will perform lets us plan for growth in a way we couldn’t before.

In our experience, the cloud may be a good idea when you need to rapidly spin up new hardware, but it only works well when you’re making a huge (Netflix-level) effort to survive in it. If your goal is to build a business from day one and you do not have spare engineering resources to spend on paying the “cloud tax”, using real hardware may be a much better idea.

If you’re passionate about engineering at the intersection of software and infrastructure, Swiftype is hiring Senior Technical Operations Engineers.

Reader Comments (12)

Great article. I just have to add some wise words from a Google engineer (I forgot where I saw it, but it must’ve been some presentation): don’t expect synchronous data replication over distances. Coding your application to deal with this is the better way; you may lose performance, but you won’t end up with inconsistent data.

I laugh and cry every time I see a company leave Amazon and experience a cost decrease :) I’m from Poland, but many countries have their own cloud providers, and everyone should check them before going to a bigger player. I did many calculations for hundreds of customers, and on average, comparing a Polish cloud with Amazon, Amazon is 7-20x more expensive than the providers I included in the comparison. I know that those providers can’t scale to 1000Gbps or PBs of data, but if people have network issues and performance degradation, it’s just funny that they still use such a bad provider.

I did many setups for few-day events and it always worked like a charm. Two setups ran for a few months (everything according to plan) and I didn’t experience any problems.

Interesting post, Oleksiy. Having used both EC2 and Softlayer extensively myself, I've reached different conclusions.

Softlayer's back-end network isn't as stable as I'd like. I've tried running distributed filesystems such as glusterfs there, and had to give up - the network was too unreliable. Ceph, for what it's worth, was able to handle the bumps in the night better, but they were still there.

Last time I checked (within the last few months), it’s not possible to get 10G networking from SL. It is from EC2. Additionally, it’s fairly common knowledge which instance sizes to provision at EC2 if you want a server to yourself.

I'm glad your customer service experience has been good so far. One of the problems with physical servers is things break, and when you're renting the server, you're stuck with those bits. Open a trouble ticket, try to get a drive/whatever replaced, hope the support tech knows enough to go and take care of things instead of trying to troubleshoot and tell you the system's fine. I suggest you remember the phrase "please escalate" and use it without hesitation. In a cloud environment, you fire up a new instance and get back to work, letting somebody else deal with the hardware issues.

That said, it's much easier to get somebody from SL on the phone compared to EC2. :)

One thing you didn't mention is if you want to get decent pricing at SL, you need to get a quote from your sales rep. This in general would lower the price of a system by a large chunk in my experience.

I think the most important thing is if you're building an app for the cloud, you need to architect a solution that won't tumble down if Amazon reboots a system you're on. This is not a unique situation, though - an instance crash is much less painful than a server going belly up.

Everyone knows Netflix takes all the best servers on EC2. They boot 30,000 machines and run performance tests, then they keep only the top 5% and close all the other instances. But on a serious note, your devops team is probably too inexperienced to use EC2 to the full potential. Netflix and Reddit have no problem running on EC2.

I'm surprised at the outage rate for this sized environment. Was this infrastructure implemented as a modern ephemeral and autoscaling stack, or was it a traditional stack design? I've managed platforms of thousands of nodes in AWS with much lower failure rates.

No infrastructure is perfect. I would like to know the long-term failure rates, and more specifically MTTR, of the platform in a traditional environment. Also, what are the opportunity costs of innovation from being in a slower physical environment?

We run 14K instances on EC2. I’m not sure you’re deploying correctly if you’re seeing issues like VM hangs on just 40 instances. This kind of migration is feasible for small environments, but there are not really many options for large-scale operations other than AWS or rolling your own.