catlee already covered this in Mozilla’s Platform Meeting, but this multi-month project is a massively important step forward for Mozilla’s Release Engineering infrastructure, as well as for all Mozilla’s developers, so it’s worth calling attention to three important details:

Security

Seamless integration

Dynamic allocation

Security

The security of our RelEng infrastructure is obviously important to Mozilla, so we set up these Amazon-based VMs inside a Virtual Private Cloud (VPC). While it is technically possible to have VMs inside the VPC connect directly to the external internet, we felt it was safer to prohibit any access between the VPC and the internet. Therefore the only connection to/from our VPC is a VPN link directly into Mozilla’s existing Build Network, within Mozilla’s secured infrastructure.

If an Amazon VM needs to reach an external site for any reason, it can only do so by going from Amazon over the VPN to Mozilla’s Build Network, and then out through Mozilla’s firewalls. If a Mozilla person wants to access one of our Amazon VMs, they have to go through Mozilla’s Build Network and over the VPN link to the VPC. We designed this very restricted access to help protect these vital systems. It was also reassuring to see all the security audits that Amazon has done.
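In VPC terms, this restriction comes down to the subnet’s route table. Here is a toy model of the idea; the CIDRs and gateway names are invented for illustration, and this is not Mozilla’s actual configuration:

```python
# Toy model of the VPC subnet's route table. The CIDRs and gateway
# names are invented for illustration; the key property is the real
# one described above: there is no 0.0.0.0/0 default route to an
# internet gateway, so the only path out of the VPC is the VPN to
# the build network.
VPC_ROUTE_TABLE = {
    "10.130.0.0/16": "local",        # traffic staying within the VPC
    "10.12.0.0/16": "vpn-gateway",   # Mozilla build network, over VPN
}

def reaches_internet_directly(route_table):
    """A subnet can talk to the internet directly only if its route
    table has a default route pointing at an internet gateway."""
    return route_table.get("0.0.0.0/0") == "internet-gateway"
```

With a table like this, any traffic not destined for the VPC itself or the build network is simply dropped, which is the “no access to or from the internet” property described above.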

Seamless integration
We integrated Amazon’s VMs nicely into our existing mix of VMs and physical machines in the Mozilla build network. The easiest way to see if your specific build was handled by an Amazon VM (called an “EC2 instance” in Amazon-speak) is to look at the machine name on tbpl.mozilla.org.

The only other way that you can tell we are using AWS for some of the builds is that the additional compute capacity is helping reduce wait times for our builds!

Dynamic allocation
As you can see from our monthly load posts, the load on our RelEng infrastructure varies at different times of day and on different days of the week. To handle this efficiently, we now dynamically add and remove Amazon VMs from production to match demand at any given time. We do this as follows:

Our automation monitors the queue for pending builds

If there is a backlog of pending builds in the queue, our automation dynamically starts reviving enough VMs in our Virtual Private Cloud to handle the backlog.

As each of these VMs comes online, it connects to a buildbot master, indicating it is idle and ready to process jobs.

A buildbot master assigns a pending build job to the newly available idle slave.

Once the build job is completed, the slave goes back to the master looking for another job.

If there is no backlog of pending jobs for 60 minutes, our automation starts suspending the idle Amazon VMs. Suspending VMs like this allows us to bring them back into production within a few seconds to handle any new backlog, while also reducing costs during low-load times. Note that the 60-minute threshold worked well for us in staging, but we’ll likely adjust it in the near future as we gain more experience with real-world load.
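The steps above can be sketched as a small policy loop. This is an illustrative model only, not Mozilla’s actual automation (which drives real EC2 instances and buildbot masters); the class and threshold names are made up for the sketch:

```python
IDLE_SHUTDOWN_SECS = 60 * 60  # the 60-minute idle threshold described above


class Allocator:
    """Toy model of the allocation policy: revive VMs when builds
    queue up, suspend them once the queue has stayed empty past the
    idle threshold."""

    def __init__(self):
        self.running = 0        # VMs currently online
        self.idle_since = None  # when the pending queue last became empty

    def tick(self, pending_builds, now):
        """Called periodically by the queue monitor; returns how many
        VMs should be running after this tick."""
        if pending_builds > self.running:
            # Backlog: revive enough suspended VMs to cover it.
            self.running = pending_builds
            self.idle_since = None
        elif pending_builds == 0:
            if self.idle_since is None:
                self.idle_since = now  # start the idle timer
            elif now - self.idle_since >= IDLE_SHUTDOWN_SECS:
                self.running = 0       # suspend the idle VMs
        else:
            self.idle_since = None     # still busy; reset the timer
        return self.running
```

The appeal of suspending rather than terminating, as noted above, is that a suspended VM can rejoin the pool in seconds when the next backlog appears.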

As of today, we only let some B2G builds overflow onto AWS like this, and we continue to monitor the builds and the dynamic allocation carefully. Assuming this continues to work well, we will soon let the rest of the B2G builds overflow to AWS. Next will be fennec/android builds, and then linux desktop builds. Our focus in the immediate short term is to siphon excess load from our Mozilla build machines over to AWS, allowing us to better handle the increased number of B2G and Fennec builds enabled in production recently. This also allows us to reimage some/all of our physical linux builders as physical win64 builders, to immediately help with our win64 build wait times. Eventually, we may start running win64 builds, and maybe even some unittests, on AWS, but that needs further investigation – stay tuned!

It’s hard to overstate how important this is for us.

The increase in build types for B2G, Fennec, and desktop, combined with the increase in the number of checkins per day, has kept RelEng systems continually under heavy load. We first tried using AWS in 2008, but the Amazon VMs we were using kept being restarted, usually before the build completed; the build would automatically restart once the VM revived a few seconds later, but this still blocked us from actually being able to use them in production. Some renewed experiments in summer 2011, and discussions with other companies doing similar investigations, looked promising, so we started work on this in full force in February 2012.

We hope you like this, and of course, if you see any problems, let us know asap, or file a bug!

It’s still a bit early to tell the proper economics of this, but our early cost modelling looked comparable. As we gather real usage costs, we will, of course, share them.

Aside: Our usage over these first few months is going to be atypical. As our wait-time posts to mozilla.dev.tree-management clearly show, our current in-house capacity isn’t enough for current demand, so this pay-per-use instant-extra-capacity is improving those wait times. This extra capacity came online just in time to handle the increase in load from new B2G jobs being enabled in production. Over the coming days, we’ll continue to vent more and more backlog pressure from different types of jobs over to AWS. Each of those reduces pressure on our in-house machines, which we can then reimage to help in the next-worst-backlogged areas. For example, as more linux builds are handled on AWS, we will reimage our linux physical builders as win64 physical builders, and so on. If we coordinate all this just right, no-one should notice a thing, except that all build queues get faster! Once things settle down, we should be able to get a better understanding of “typical usage” and “typical costs”.

As for network traffic, so far so good, but it’s a valid concern, both for reliability and for costs. We’ve been setting up mirrors and caching as part of our work in AWS for exactly these reasons. Not all of these systems are fully in production yet, but we’re working as fast as we can.

This is cool. Do you have any numbers to share? Like the difference in latency between AWS jobs and Mozilla jobs, any sort of averagish price per build for the two, etc.

Are all/most AWS builds effectively clobber builds, or do the AWS builders stay up long enough to do depend builds? Probably not relevant yet, if many of the builds are try builds anyway. (Though if we manage to move to tup or something like it, maybe try builds could eventually become depend builds too!)

Seems like the security is mostly only needed for builds, not tests. If the interface by which builders report back status results is restricted enough, you could use untrusted nodes for tests. Then you could use something crazy like https://gridspot.com/compute/ (until its pool dries up).