Sensible SSH with Ansible: Vagrant Setup

This is the second in a series of several posts on how to manage ssh via Ansible. It was inspired by a warning from Venafi that gained traction in the blogosphere (read: my Google feed for two weeks). I don't know many people that observe good ssh security, so my goal is to make it more accessible and (somewhat) streamlined.

This post looks at how to quickly and easily mimick common environments in Vagrant. If you're using a different tool, feel comfortable with Vagrant multi-machine setups, or plan on running Vagrant from a pleasant operating system, you can probably skip this post.

Code

Note

If you're using Vagrant on Windows with Hyper-V, there's a good chance you'll need to append --provider=hyperv to any vagrant up commands. If you're not sure, don't worry. Windows will be very helpful and crash with a BSOD (technically green now) if you use the wrong provider. The first post has more information on this useful feature.

Creating the Status Quo

Rather than try to make up some convoluted scenario that sounds really cool but is really just a tiny special case (hello, math degree), I'm going to use what I know. My environments across work, my VPS, my home network, some mobile access, and all those ssh services (well, just GitHub at home and GitLab at work, but that's more than one) are complicated enough to serve as a good basis, and I feel like they incorporate a majority of the primary use cases. Some of the things my environment covers are

a reused ssh key, work_rsa, that serves as my primary work identity

a collection of users I am authorized to authenticate as, e.g. vertical@vertical-production

unique keys per machine for GitHub access

a reused ssh key, wizards_rsa, that serves as my primary LLC identity

a horrible mess of intertwined authorized_keys below the surface

mostly default sshd_configs, be it the base default from a fresh install or VPS (Digital Ocean, RackSpace) default/suggested configs (I did actually change the port on one personal host but we have never done that at work)

A Word of Caution

Your ssh files are private. You should never store them in a repo, be it public or private. You should never share them, which also means you shouldn't use shared keys unless you absolutely have to. Remember, your keys can be used to authenticate as you, which means anyone that has them can pretend to be you.

I'm going to display a good chunk of information that you should never expose. I'm comfortable doing that for a few reasons:

The exposed config is being used to illustrate the process, so obfuscation potentially makes understanding more difficult.

You can duplicate everything I show using the provided Vagrantfile, so there are no secrets. Everything's already public.

Absolutely none of the config here (aside from the Ansible runner) exists in the real world. If it did actually exist as presented, I wouldn't have duplicated it with Vagrant.

I plan to investigate a few secure methods for sharing ssh files, such as GitLab secrets and Ansible Vault. I'll try to update this warning with those caveats when I'm able to write (and attempt to break) that code.

Vagrant

While I could just develop all of this in the wild using my existing configuration on real machines, I try follow a solid aphorism, "Everything always breaks" (which, surprisingly, isn't common enough to have legit Google results). If you want to develop on prod and push code at 4p on a Friday, you have my blessing as well as my condolences. If not, or you like making sure things work before giving them escalated access on sensitive systems, Vagrant is an easy solution.

Vagrant supports multi-machine configurations, which makes mimicking an existing network a breeze. We can create a network of wildly different operating systems with a messy connection diagram, just like the real world. Or, for a quick demo, we can create a uniform network with common options because it's really fast to make. I went with the latter for now.

Defining the Control Machine

As Ansible doesn't work on Windows, the control machine must use a good operating system. I don't use one of those, unfortunately, so I built up a virtual control machine in Vagrant. I'd argue developing with and testing on a virtual control is a useful step even if you do use an operating system that can be used for more than running proprietary office software.

Most of the servers I work with on a day-to-day basis are in the RHEL ecosystem. Most run Centos 7, and those that don't are either dev machines running Scientific or Fedora or old Centos 6 machines that need to be rebuilt with Centos 7. As such, I'm going to use Centos 7 as a base for everything. Because Vagrant is awesome, it doesn't care about distro-specific APIs (e.g. package managers) and just works (via judicious sh usage).

For now, ignore the contents of the Ansible tasks. Assume they work (or don't, which is probably a better assumption) and focus on the Vagrant config.

I know a perfect world exists, but, working from Windows, I know it does not exist around me.

ansible_local Specificity

I had you going, didn't I? You thought this issue was going to be about Windows. Gotcha! The last issue is about Windows, and the next one is only mostly about Windows. This one is not.

Vagrant is very good at holding your hand. So much so that, in some situations, it's difficult to get Vagrant to let go. Assuming we tried to provision the perfect config with ansible.verbose = true added for good measure, we'd get this error:

An absolute path on the guest machine where the Ansible files are stored. The ansible-galaxy and ansible-playbook commands are executed from this directory. This is the location to place an ansible.cfg file, in case you need it.

On first glance, it's picked up the proper config file (which contains an explicit path to the inventory), but it's not actually finding the inventory. On second glance, it failed because Vagrant has added flags --limit and --inventory-path with defaults even though we didn't specify them. The docs are currently wrong about inventory_path, but state limit is reasonably added by default to restrict Ansible to only tasks that affect the host.

synced_folders Type

On Vagrant hosts that can easily interact with the rest of the world, you can use normal shares like nfs or rsync. If you explicitly want to make things hard for yourself or you're running Vagrant from Windows, you can run into trouble. For example, on Windows, a samba share is your best bet, via type: 'smb'. However, exotic mount types (i.e. not vanilla) might cause problems. Continuing the Windows example, when attempting to sync via type: 'smb', Centos can't run the generated mount -t cifs ... command because cifs-utils isn't a default package. Not a problem, because Vagrant can mix-and-match synced_folders with shell provisioners... except it is a problem, because Vagrant mounts everything before any provisioning. The suggested solution, adding disabled: true, doesn't actually send any mount information to the guest (to verify, you can run vagrant up --debug with a disabled folder and consume the log in such a way that you can search for the folder mount point).

An obvious solution is to use a more specific Vagrant box. However, the suggested Windows box I kept seeing, serveit/centos-7, doesn't exist anymore (404 on the external host). The others, like geerlingguy/centos7, don't support the Hyper-V provider. I could build my own and publish it, or I could just explicitly tell any consumers of my Vagrantfile that they'll have to run

Networking

Vagrant is very candid about the Hyper-V limitations. Unsurprisingly, Vagrant can't configure networking yet because Hyper-V exposes a completely foreign and proprietary networking API. Reading between the lines, it seems like Vagrant is saying it should be possible to control the networking from Windows, like with a PowerShell script called from Vagrant. However, I've now run the linked script, built directly from the official docs, on three different machines with three different results. None of them worked. As far as my experience shows, confirmed with the sysadmins at work, the Hyper-V GUI (virtmgmt.msc) is the only way to create reliable networks. That means you need to set everything up before you run Vagrant. If you set up Hyper-V and allow it to create a default switch, you'll probably have access to Default Switch. With a working switch, you can either select it manually during vagrant up|reload or use this clever hack to sidestep it:

config.vm.network 'public_network', bridge: "the name of your working switch"

Brief Aside: Default Switch

It appears the VMSwitch feature underwent some fairly major changes between whatever came before 1709 and 1709. When I described this to the sysadmins at work, they had never seen Default Switch before. All of us figured I screwed something up during the feature installation. Last week, one of them started using a copy of 1709 and immediately ran into the same issues I've been having. It's also started to turn up online. When I started trying to manage Hyper-V via PowerShell a few weeks ago, there was absolutely no documentation on the feature.

Defining the Nodes

Once again, assume the Ansible stuff works (or doesn't) because I want to focus on the Vagrant config here.

Without the Ansible detail, adding additional machines to the Vagrantfile is as simple as defining each machine and including its Ansible provisioning. For example, using my simplified example where several options are set on config rather than controller, you can set an array of machines and iterate over it, defining each one:

Because I will be using Ansible to provision each of the hosts, I'm not too concerned about actually setting up the boxes beyond starting with the correct OS and using the correct hostname. You could easily add more keys to Host and define other options like networking if you're running Vagrant from a host that respects a desire for common interfaces.

Given the volatility of Ansible in development, I'd highly recommending setting a provider-specific name for each machine which you can use in a pinch in case everything goes south and Ansible has totally removed vagrant's ability to act locally and remotely. I'm not quite sure how to screw it up locally, but I lost several remotely due to improper ssh settings or bad sudo configs, so there's a chance you might be able to screw things up worse than I did. It's always good to have a backup fubar solution in case vagrant destroy doesn't work. A quick example shows naming is pretty simple:

# inside the machines loop from above
virtual_machine.vm.provider 'virtualbox' do |virtualbox|
virtualbox.name = machine # (or machine.hostname if you went that route)
end
virtual_machine.vm.provider 'hyperv' do |hyperv|
hyperv.vmname = machine # (or machine.hostname if you went that route)
end

The final component is the Ansible integration. By default, if you define provisioners for each machine, they will be run in sequence. However, the docs show how to provision in parallel, which is a pretty neat party trick. Neither is important here; one of the jobs of the initial playbook, initial-setup.yml is (or will be soon, if it's not there by the time you read this) to provision each of the nodes. I'd suggest a similar method if you're attempting to mimick your environment; it's much easier to rigorously define and validate a playbook than it is a full VM. Another alternative would be to create a Docker container for each node, but then you'd have to build all those individual Dockerfiles with so much copypasta maintenance would be a nightmare. My method, provisioning through the ansible_local playbook, adds no additional code to the Vagrantfile and can be used to build a nice collection of roles to feed your new Ansible habit. Alternative methods, like copying existing config via Vagrant, will require extra setup in Vagrant that has zero use once you're done with it.