I was building VM images for Google Cloud with Packer and provisioning them with Ansible. Everything worked in the morning, but in the afternoon, after I upgraded Ansible with Homebrew, builds started failing on one of my two computers. I had a really tough time figuring out why Ansible and Packer ran fine on one computer and not on the other. The failing build produced the following error:

googlecompute: fatal: [default]: UNREACHABLE! => {"changed": false, "msg": "SSH Error: data could not be sent to remote host \"127.0.0.1\". Make sure this host can be reached over ssh", "unreachable": true}

After comparing the versions of Python, Ansible, and Packer on the two computers, I found one difference: on the computer that wasn’t working, ansible --version listed a config file.

I then used the extra_arguments option for the Ansible provisioner to pass [ "-vvvv" ] to Ansible. I ran the build on both computers and diffed the output. On the working computer, the dynamically generated key was offered successfully, after one other local key. On the failing computer, many SSH keys were tried before the dynamic key could be offered, and I was getting locked out.
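To make that comparison easier, here is a minimal Python sketch that pulls the offered keys out of a saved verbose log. It assumes the ssh client's debug output contains lines of the form "Offering public key: ..." (or "Offering RSA public key: ..." on older clients); that pattern and the function name offered_keys are my assumptions, so adjust them to match what your logs actually contain.

```python
import re

# Assumption: the ssh debug output includes "Offering public key: <key>"
# (older clients insert the key type, e.g. "Offering RSA public key: <key>").
# The character class stops at a newline or a backslash, because verbose
# output is sometimes embedded with escaped line endings.
OFFER_PATTERN = re.compile(r"Offering (?:\w+ )?public key: ([^\r\n\\]+)")


def offered_keys(log_text):
    """Return the keys the ssh client offered, in the order they appear."""
    return [match.strip() for match in OFFER_PATTERN.findall(log_text)]
```

Running offered_keys on the log from each computer and comparing the two lists should show how many keys are offered before the dynamic one.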

SSH servers limit the number of authentication attempts per connection (six by default in OpenSSH, via the MaxAuthTries setting). Every key loaded in your ssh-agent is tried before the dynamically generated key provided to Ansible. If you have too many SSH keys loaded in your ssh-agent, the attempts run out before the dynamic key is offered, and the Ansible provisioner fails authentication.
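The failure can be illustrated with a toy model in Python. This is not real SSH code, and the exact way a server counts attempts may differ; it only assumes that agent keys are offered first, that each rejected key uses up one attempt, and that the limit is six. The function name dynamic_key_accepted is hypothetical.

```python
# Toy model of the key-offering order; not real SSH code.
MAX_AUTH_TRIES = 6  # the default limit described above; servers can change it


def dynamic_key_accepted(agent_keys, dynamic_key, max_tries=MAX_AUTH_TRIES):
    """Return True if the dynamic key is offered before attempts run out.

    Assumes every agent key is offered (and rejected) first, and that each
    rejected key counts as one failed attempt.
    """
    offer_order = list(agent_keys) + [dynamic_key]
    for attempt, key in enumerate(offer_order, start=1):
        if attempt > max_tries:
            return False  # the server disconnects before this key is offered
        if key == dynamic_key:
            return True
    return False
```

In this model, one agent key followed by the dynamic key succeeds, while eight agent keys exhaust the limit before the dynamic key is ever offered.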

Running ssh-add -D removed all of the keys from my ssh-agent, which meant that the dynamic key Packer generated was offered first.
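Before clearing the agent, it can help to check how many keys it holds. The sketch below is one way to do that in Python, assuming ssh-add is on your PATH and that ssh-add -l prints one line per loaded key; the function names are mine, not part of any tool.

```python
import subprocess


def count_agent_keys(listing):
    """Count keys in the text printed by `ssh-add -l` (one line per key)."""
    lines = [line for line in listing.splitlines() if line.strip()]
    # With an empty agent, ssh-add prints a notice instead of key lines.
    if len(lines) == 1 and "no identities" in lines[0]:
        return 0
    return len(lines)


def loaded_key_count():
    """Run `ssh-add -l` and count the keys held by the running agent."""
    result = subprocess.run(["ssh-add", "-l"], capture_output=True, text=True)
    # ssh-add exits with status 2 when it cannot contact the agent.
    if result.returncode == 2:
        raise RuntimeError("could not contact ssh-agent: " + result.stderr.strip())
    return count_agent_keys(result.stdout)
```

If loaded_key_count() returns a number close to or above the server's attempt limit, the agent is a likely culprit. Keep in mind that ssh-add -D removes every key from the agent, so you will need to add back any keys you still use.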

I hope this helps someone else and saves them hours of debugging!

Postscript

I was very confused to see my computer trying to connect to 127.0.0.1 instead of the Google Cloud Platform VM. This appears to be Packer’s doing rather than the Google Cloud SDK’s: the Ansible provisioner runs a local SSH server and proxies Ansible’s connection through it to the VM.