
This will be a two-part series due to popular demand. This first part deals with the decisions behind switching to Amazon; the second part will cover the nitty-gritty implementation details we use to auto-scale our infrastructure.

For a good number of months our infrastructure was based on Hetzner, but as our service grew and we needed to scale dynamically, we decided to research cloud platforms and move to the cloud. Additionally, we experienced severe networking issues from time to time, which forced us to switch.

Requirements

We had several other requirements as well, but the following are the core needs a cloud service has to address.

Scaling

We want to be able to scale to any size at any time without limiting ourselves by having to set up or manage physical servers. A cloud provider with good automation support was thus the only viable route for us.

Flexibility

We want to stay flexible in what type of server infrastructure (number of cores, RAM size, …) we use so we can innovate continuously and also provide different infrastructure for the differing needs of our customers. This also plays a major role in our costs, as we can use the exact server type we need for a specific task.

Automation

As a hosted continuous deployment service, releasing changes and scaling automatically is a necessity. By automating the whole process of deployment and scaling we can provide a much better service with fewer points of failure and faster recovery in case of an error. Additionally, it allows us to decouple the scaling of our infrastructure from the scaling of our team. We simply do not have to hire additional server admins just because we run a much larger infrastructure.

Providers

We looked at various providers including Rackspace, Linode, ElasticHosts and Amazon EC2. Rackspace in particular was great and their support is just incredible, but in the end we decided to go with AWS EC2 as they provide a lot of flexibility with their various instance types. We looked into ElasticHosts only a little, but they seem fine as well. Linode was out of the picture fast, as they do not provide per-hour pricing.

Implementation

We will split this section into two parts, Automated Deployment and Automated Scaling, which describe how we introduce changes to our backend infrastructure and how we scale our backend respectively.

Automated deployment

One of our guiding principles is, not surprisingly, to automate deployment as much as possible. Implementing continuous deployment for a website is rather easy (especially when using Codeship) compared to continuously deploying Amazon AMIs. We use the following workflow:

1. When we push to our backend repository on GitHub, a new Codeship build is triggered.

2. The build runs the backend unit tests to make sure our scripts work fine.

3. After all unit tests succeed, a new Amazon instance is started with a default Ubuntu 12.04 image. We then connect to that instance via SSH, upload the necessary setup scripts and start a background setup process.

4. The setup process includes installing various Linux packages and setting up our LXC guest system, which we use to virtualize the EC2 servers. This process takes approximately 2 hours as we compile and install lots of software.

5. After everything is set up, a test build with our test repository is run to make sure everything is installed correctly and works as expected.

6. If everything works fine, we instruct Amazon to create a new AMI that we can use as a base image to start new servers.

As soon as the AMI is ready, it is automatically used by our scaling system. We already used this process with our Hetzner infrastructure, but improved it drastically when we switched to Amazon. It makes changing our backend incredibly easy and gives us a lot of power and control in innovating our service. We simply love it and couldn’t imagine working without it anymore. Not to forget, having this automated system makes changes much safer and less error-prone. It is now really hard to accidentally introduce errors into the system, as every change is checked several times and the process is improved continuously.
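The key property of the workflow above is that each step gates the next one: a new AMI is only created when the unit tests, the setup and the test build have all succeeded. The following is a minimal Python sketch of that gating, not the actual Codeship implementation. The step names in `WORKFLOW` and the `run_workflow` helper are hypothetical, and the real actions (starting an instance, SSH, creating the AMI) are stood in for by plain callables that return `True` on success.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step names mirroring the workflow described above.
WORKFLOW: List[str] = [
    "run_unit_tests",
    "launch_instance_and_setup",
    "run_test_build",
    "create_ami",
]


def run_workflow(
    actions: Dict[str, Callable[[], bool]],
) -> Tuple[List[str], Optional[str]]:
    """Run the steps in order and stop at the first one that fails.

    Returns (names of the steps that succeeded, name of the failed step or None).
    """
    completed: List[str] = []
    for name in WORKFLOW:
        if not actions[name]():
            # A failed step aborts the run, so later steps never execute.
            return completed, name
        completed.append(name)
    return completed, None


# Usage with stub actions: the test build fails, so no AMI is created.
stubs: Dict[str, Callable[[], bool]] = {name: (lambda: True) for name in WORKFLOW}
stubs["run_test_build"] = lambda: False
completed, failed = run_workflow(stubs)
print(completed, failed)
```

Because `create_ami` sits at the end of the list, a broken image can never reach the scaling system in this sketch; it would only pick up images from runs where `failed` is `None`.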

Automated scaling

We automatically scale our infrastructure up and down depending on the number of builds we currently need to run. Every time a new build is started, we make sure that enough resources are available for the build to start immediately. If there aren’t enough backend servers running, we start another EC2 instance. As EC2 instances start in a matter of seconds, the delay is minimal and not noticeable. Upon completion of every build, and every ten minutes through a cron job, we check whether any EC2 resources are currently not needed and stop those servers. We make sure that we use as much of every EC2 hour as possible by only stopping servers shortly before they would incur further charges.
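The two decisions described above (how many instances to start when builds arrive, and which idle instances to stop near the end of a paid hour) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the production logic: the `Server` type, the function names, the `builds_per_server` parameter and the ten-minute stop window are all hypothetical, "not necessary" is simplified to "running no builds", and a 3600-second billing period is assumed because the text talks about EC2 hours.

```python
import math
from dataclasses import dataclass
from typing import List

BILLING_PERIOD_S = 3600  # assumption: one charge per started hour
STOP_WINDOW_S = 600      # assumption: matches the ten-minute cron interval


@dataclass
class Server:
    name: str
    uptime_s: int        # seconds since the instance was launched
    running_builds: int  # builds currently running on this server


def servers_to_start(pending_builds: int, servers: List[Server],
                     builds_per_server: int) -> int:
    """Number of extra instances needed so every pending build starts at once."""
    free_slots = sum(max(builds_per_server - s.running_builds, 0) for s in servers)
    missing = max(pending_builds - free_slots, 0)
    return math.ceil(missing / builds_per_server)


def seconds_until_next_charge(uptime_s: int) -> int:
    """Seconds left in the billing period that has already been paid for."""
    return BILLING_PERIOD_S - (uptime_s % BILLING_PERIOD_S)


def servers_to_stop(servers: List[Server],
                    stop_window_s: int = STOP_WINDOW_S) -> List[str]:
    """Names of idle servers whose paid period ends within the stop window."""
    return [
        s.name
        for s in servers
        if s.running_builds == 0
        and seconds_until_next_charge(s.uptime_s) <= stop_window_s
    ]


# Usage: "a" is idle and close to its next charge, "b" is idle but has most
# of its hour left, "c" is busy.
fleet = [Server("a", 3500, 0), Server("b", 1200, 0), Server("c", 3590, 2)]
print(servers_to_stop(fleet))
print(servers_to_start(5, fleet, builds_per_server=2))
```

Choosing a stop window at least as long as the cron interval means at least one check falls inside the window before the next charge, so an idle server is not carried into another paid hour by accident, while a server with most of its hour left stays available for new builds.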

Conclusion

It took us quite some time to build the current infrastructure and the automation to get the most out of it. But this is only the beginning. We have major plans to innovate on this very stable platform to give you unparalleled speed, flexibility and price. For example, we will shortly give you the ability to run your tests on instances with up to 8 cores to parallelize your tests and dramatically reduce the time they take. We are really thrilled about the future of Codeship and can’t wait to tell you all the good news and updates we have in store over the next weeks.


Join the Discussion


Hi. I like your workflow. Why are you using a base Ubuntu image instead of a customized image with preinstalled packages?

ben

@artovuori We make a lot of changes to the image: we install new packages, we upgrade packages, etc. Having a customized base image doesn’t give a huge advantage since we would have to rebuild the base image daily. Instead of maintaining 2 images we just have one image where we pack everything in and rebuild it every day. Building our images takes around 45 min and we release a new image every day. If there are critical fixes we release more often. There is still a lot of room for improvement; building the image isn’t optimized in any way. I think we could easily cut the build time in half if we spent some time on it, but that wasn’t needed yet.