TomEllis.io

Whilst writing a follow-up to my last post, I noticed that Ansible was failing to connect to a newly spun-up Linux server on the Rackspace Cloud, and I spent a bit of time troubleshooting the connection. None of the articles I've read on using Ansible with Rackspace said much about SSH connection timeouts, so I thought I'd put this together.

I started by modifying my previous playbook, adding another play that puts the server into a group and setting 'wait' to yes along with a 'wait_timeout'. These will be explained in further detail in my next post.
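As a rough illustration of the kind of play described above, here is a minimal sketch. It assumes an Ansible release that still ships the rax module, and that the module's registered result exposes a 'success' list with 'name' and 'rax_accessipv4' fields. The credentials path, server name, flavor, image placeholder, group name and the wait_timeout value are all hypothetical, not taken from my actual playbook.

```yaml
# Sketch only: every name, ID and path below is a placeholder.
- name: Launch a Rackspace Cloud server
  hosts: localhost
  connection: local
  gather_facts: no
  tasks:
    - name: Request the server and wait for the build to finish
      rax:
        credentials: ~/.rackspace_cloud_credentials  # hypothetical path
        name: web01                                  # hypothetical server name
        flavor: general1-1                           # assumed flavor ID
        image: "<image-id>"                          # fill in a real image ID
        region: LON
        wait: yes                                    # block until the build completes
        wait_timeout: 360                            # assumed value, in seconds
        state: present
      register: rax

    - name: Add each new server to an in-memory group
      add_host:
        hostname: "{{ item.name }}"
        ansible_ssh_host: "{{ item.rax_accessipv4 }}"
        groups: raxhosts                             # hypothetical group name
      with_items: "{{ rax.success }}"
```

Note that 'wait: yes' only waits for the API to report the build as finished. It says nothing about whether SSH is reachable yet.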

Upon running this playbook I noticed that the connection could fail. On a number of occasions it was fine, but on others I got the error:

SSH encountered an unknown error during the connection. We recommend you re-run the command using -v, which
will enable SSH debugging output to help diagnose the issue

I immediately tried connecting manually to the spun-up server and it appeared to be running fine. I'm well versed in Rackspace and other Cloud providers, so I know this can happen when the API call returns once the server has been created but while it is still booting. That leaves a slight race condition in which Ansible tries to connect before the SSH daemon or networking is fully running.

I also put Ansible into debug mode to provide the above output:

ansible-playbook <playbook-name> -vvv

Rackspace server launch times can vary between regions (I used the LON region), which may be why the majority of users aren't affected by a slower boot and never see this race condition. I didn't see it mentioned in the Ansible Rackspace Guide.

Fortunately the Ansible docs are pretty good, and in this instance the wait_for module documentation page provided some very useful information. This included details of a feature, added in a previous version, that searches for the 'OpenSSH' banner whilst testing a connection on SSH port 22. I could have just checked that port 22 is listening, but I found that wasn't 100% reliable as a sign that everything is up and running, so it didn't fully avoid the race condition.
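A minimal sketch of such a task follows. It runs wait_for from the control machine and only continues once the 'OpenSSH' banner is seen on port 22. It assumes the launch task registered its result as 'rax' and that each entry carries a 'rax_accessipv4' address. The delay and timeout values are illustrative rather than tuned.

```yaml
# Sketch only: 'rax' is an assumed register name from the launch task.
- name: Wait for the OpenSSH banner on each new server
  local_action:
    module: wait_for
    host: "{{ item.rax_accessipv4 }}"
    port: 22
    search_regex: OpenSSH   # succeed only when the SSH banner is returned
    delay: 10               # seconds to wait before the first check (assumed)
    timeout: 300            # give up after this many seconds (assumed)
  with_items: "{{ rax.success }}"
```

Dropping search_regex turns this into a plain "is port 22 open" check, which is the less reliable variant mentioned above.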

After this I spun up 5 servers with the same playbook (by adding count: 5 to the original launch server request). They all completed successfully, and I've not encountered the same error since. I'm not sure how much overhead the regex search in wait_for adds compared with just testing whether port 22 is up, but it seems to be minimal.

Hopefully that will help someone! If you ask me, it's worth doing even if you aren't experiencing the timeouts, just in case the Rackspace Cloud is having a busy day.

Stay tuned for the next post in which I'll cover the full end-to-end playbook.