The personal blog of Oliver Hookins and Angela Collins

As I mentioned in my presentation at PuppetCamp, one of the goals of our Puppet code testing was to run Puppet Faces-style compile tests on all of our modules. Since this requires a somewhat realistic deployment environment, a clean system, and our modules to be installed as packages (since that’s how we are distributing our own modules) it makes sense to run this stage in a throwaway virtual machine.

This is where Vagrant comes in. I won’t dive into what Vagrant is all about since the website describes it well, but instead try to focus on my testing methodology. Using Vagrant VMs is a natural fit for this use case, but there are a few aspects which also make it a bit difficult to do this kind of testing (probably irrespective of the tool you use):

VMs are typically set up on a private NATed network on the hypervisor machine. This has the added benefit of not requiring cooperation from the regular network, but means that you cannot straightforwardly connect to each VM as needed by your testing system.

Related to the previous point, your test jobs will be running on a Jenkins build master or slave, which may then need to SSH/RPC to the Vagrant/Virtualbox host, which will then need to SSH to the Vagrant VMs. Not very nice.

Vagrant has the same VM spin-up lag problems that the cloud has – you want a new instance to become available rapidly so testing completes fast. Alternatively you can pre-build your VMs but this introduces complexities in the testing system.

Related to the previous point, bugs/limitations in Vagrant/Virtualbox mean that you can’t spin up several VMs simultaneously due to some race conditions.

Your testing system exists outside the VMs – you want to get your code to test inside the VM.

These were probably the hardest problems I had to solve (aside from just writing the Rake tasks to wrap around Puppet Faces). I’ll break these down into sections and describe the solutions. I won’t go into the usual nightmare of how to get an appropriately recent version of Ruby/Rubygems etc onto a CentOS 5.5 machine so that you can actually run Vagrant – that’s surely been solved many times already.

Networking

This was a relatively easy one. Vagrant VMs run on a host-only NATed network, but we need to somehow get the Jenkins jobs running something initially on the Vagrant host. Basically, I decided the easiest course of action would be to make the physical machine a Jenkins slave right from the start. So I have the master connecting to this machine as the jenkins user, which is also the user I am using to run all of the Vagrant stuff. Jenkins owns the VMs, so to speak, which makes connectivity and permissions on them much easier.

Once a job has started it is possible, by using the vagrant command, to connect to any of the VMs and run commands on them, destroy them, bring them back up again etc.

SSH and jobs within jobs

As I mentioned in a previous post, working with Net::SSH can be fiddly, but it is by far preferable to command line SSH invocation and all of the usual string-manipulation, input/output and error-handling nightmares that entails. The basic principle for running these test jobs is:

Run the job from Jenkins, which runs an initial rake task on the Vagrant hypervisor machine (the test wrapper)

The test wrapper calls the Vagrant libraries and sets up configuration for the VM in question.

The build tools (previously checked out from version control to the Vagrant hypervisor machine) are copied to the Vagrant VM’s shared directory on the hypervisor. They are now available to the VM as well.

A Net::SSH call is made to the VM, which calls the “real” test task from the build tools in the shared folder.

Testing runs in the VM.

Net::SSH call to the VM finishes, and the VM is destroyed.

Now we only really have a single RPC via Net::SSH which is relatively easy to manage programmatically. Of course, we lose granularity in the output since we are running through an SSH session. We cannot raise exceptions or pass structured data – but have to rely on exit codes and STDOUT/STDERR. So far this doesn’t seem like a big limitation in this use case.

VM Lifetime

My first mental approach was to have a number of VMs constantly running on the machine (roughly equal to the number of physical CPU cores), and a helper script or daemon that handled allocation of VMs to incoming test jobs. After each job completes, the helper would mark the VM as unavailable, destroy and recreate it from the base box image and when ready again return it to the available pool of VMs. This seemed like a nice idea at first but I realised the coding would take some time.

As soon as I had a basic idea for what the testing tasks themselves would look like, I decided to implement them for a single job first as a proof of concept. It was hard-coded to a single VM which made things much simpler, and it just worked. Pretty soon I moved on to creating the rest of my test jobs and had to find a way to handle the parallel builds. The simplest possible solution (when using Jenkins, at least) is to prepare a fixed number of VMs (say 8, to match the physical number of cores) and tie each one to a specific build executor of the Jenkins slave (you obviously need to also configure Jenkins to use no more than this number of executors on the slave).

Thus, as long as your create/destroy sequences work (hint: put them in a begin/rescue block in the ensure section) then you will have no problem running jobs continually – each new job run gets a fresh instance of the VM, tied to that executor. The problem that I ran into here was the bug/limitation of Vagrant and/or Virtualbox that prevents multiple simultaneous VM creations. Inevitably if your number of jobs exceeds your number of build executors and the jobs are doing similar things, they will all finish at similar times and attempt to cycle the VMs at the same time, which inevitably fails.

It seemed like I was back at square one, looking at having a smart helper script which automated VM recycling. I actually implemented locking around what I judged to be the critical code section where the VMs were created but I continued to experience problems. In the end, Patrick Debois of DevOpsDays fame made a suggestion to keep all the VMs running and use his sahara add-on to Vagrant in order to do lightweight snapshots/rollbacks on the VMs rather than a full-blown create/destroy between tests. Now the general run procedure looks like:

Start wrapper task on Vagrant hypervisor machine

Call real test task through Net::SSH to VM

After Net::SSH call completes, run vagrant sandbox rollback on VM

The rollback takes a matter of seconds (the snapshot is generated just once, when I first create each VM on hypervisor boot), and now each compile job takes generally about 45 seconds in total. This is far better than the < 10 minutes time goal I had originally set, and the system is working much better than I had originally imagined, despite it being effectively a “workaround” for my mental design.

Sharing code with the VM

A while back I forcibly separated out my build toolchain from my Puppet code, so that it would be possible to store our Puppet modules in any repository so that Dev teams could maintain their own modules but build them using our common toolchain. This has generally been a success, and as a bonus it also makes them pretty easy to share and copy around. In the case of Vagrant, Jenkins automatically checks out the build toolchain at the start of the job and calls the test wrapper task.

The test wrapper task then copies the checked-out build toolchain directory into the Vagrant VM’s shared directory (in my case, /vagrant/puppetvm#{ENV[‘BUILD_EXECUTOR’]} which is the directory that the VM’s Vagrantfile lives in). Inside the VM, the files are also visible at the same path. The basic mechanism for the test is now as follows:

Copy the build tools to the VM shared directory

Call Net::SSH to the VM and install the Rake gem

Call Net::SSH to the VM and run rake on the “real” test task

Actually running the tests

So what does my “real” test task look like? This will look different for everyone, so I’ll just list the general steps:

Set up Yum repository fragments so we have access to the Puppet packages, and the repository that contains our syntax-checked Puppet modules.

Yum install puppet and supporting packages (puppet-interfaces and interface-utils which were pre-release parts of Puppet Faces we’ve been using a while) and the package of the Puppet module we wish to test.

Run Puppet Faces compile tests on the module

Exit with a reasonable exit code for the wrapper task to catch and use as the result of the job.

The hardest part of this all is making changes to the testing toolchain, since when you want to look at how something failed your VM has already been rolled back. This is just a sad fact of having to tie VMs to build executors, but we also can’t just leave failed builds lying around with their VMs (especially in the case of real test failures). If anything, it has strengthened the need to pull core logic out of the Rake tasks and into a separate set of build libraries that can be tested separately with Test::Unit or RSpec, but given the reliance of so much of the testing on shell interaction, it is difficult to test adequately (especially when you resort to mocking).

Since the base box is overlayed with the Vagrant COW image anyway, I assume you are talking about making the COW immutable? In which case, how would you make any changes to the VM during the test run. Or have I misunderstood you?

I’m all for reducing number of components to make the testing suite easier, so I’d like to use this approach if possible (although using Sahara is very fast).

Nikolay Sturm

Now I am confused, I don’t see vagrant using COW images. It seems to copy the basebox vdi file for each VM. Am I missing something?

No, you are completely right. I was under the impression COW images were being used because bringing up one of my CentOS base boxes is quite fast, although I think this is just due to my fast hardware and small box size.

I’m sure your solution will be faster in the reverting phase but my only uncertainty is whether the vagrant reload will hit the same concurrency issues as running vagrant up, but I’ll test this out today and let you know 😉

I tested your method, and indeed it works. The drawback now is that a vagrant reload takes around 1m47s on my current hardware whereas a sahara sandbox rollback takes a few seconds. This would definitely impact the performance of my build pipeline as we have 20-30 jobs in this stage.

If it were easier to enable immutable images I would probably consider it but given it is actually easier to install the sahara gem and set up the VMs to use it, I’m not sure how much gain there is. Still, handy to know – thanks!