ssh job queue

Today I started to ponder a problem that I can't be the only one to have encountered.
When you administer over 300 systems and want to perform bulk operations over ssh,
there are always one or two systems which are down or unreachable,
so your nifty little scripts which log on to each system to install a package, apply a patch,
change a configuration setting, tweak a variable or just pull statistics from the system will fail.

So I started toying with the idea of an ssh job queue which helps you keep track of bulk operations,
so you can see on which systems the operation has completed successfully.
Once I started to try this out I figured that I can't be the first one to face this problem,
so I thought I'd ask you for input.

How do you deal with this problem? And "pen and paper" isn't the answer I'm looking for :)

In an earlier life when I administered several hundred servers I had a little sh-scripted framework. It was centered around a list of all host names. It could automatically extend this host list with some information about the host, e.g. Solaris version, Veritas Volume Manager version, etc. This extended information was updated regularly to automatically reflect configuration changes. From this extended list I could select systems by feature, for example all those with Solaris 2.6 and then run a script on all selected systems. Running a script was done like this:

- determine method of access (not all systems at that time supported ssh, as it was one of our objectives to roll out ssh for that client)

- update scripting environment on target host, which usually meant copying the new script that's about to be run over to the target host, but on new machines under our control would also mean copying over the entire script framework including ~/.profile and the like.

- finally call the script with appropriate arguments on the target host and evaluate its exit code.
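The last two steps above can be sketched roughly as follows. This is a minimal Python sketch, not the original sh framework: it assumes every target is reachable over ssh (so the access-method step is skipped), and the function names, the `/tmp` staging directory and the `execute` hook are all made up for illustration.

```python
import posixpath
import shlex
import subprocess


def copy_command(host, local_script, remote_dir="/tmp"):
    """Build the scp call that copies the script to the target host."""
    return ["scp", "-q", "-o", "BatchMode=yes", local_script, f"{host}:{remote_dir}/"]


def run_command(host, remote_script, args=()):
    """Build the ssh call that runs the copied script with its arguments."""
    remote = " ".join(shlex.quote(word) for word in ["sh", remote_script, *args])
    return ["ssh", "-o", "BatchMode=yes", host, remote]


def run_script_on_host(host, local_script, args=(), remote_dir="/tmp", execute=None):
    """Copy the script, run it, and return the exit code of the first failing step."""
    if execute is None:
        # Default: really run the command and hand back its exit code.
        execute = lambda argv: subprocess.run(argv).returncode
    code = execute(copy_command(host, local_script, remote_dir))
    if code != 0:
        return code  # copying failed, so there is nothing to run
    remote_script = posixpath.join(remote_dir, posixpath.basename(local_script))
    return execute(run_command(host, remote_script, args))
```

Because ssh passes the remote command's exit code back to the caller, a script that only exits 0 on real success is all the controlling host needs to decide whether a host is done.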

For every script executed through this framework, two files were created automatically: a log file containing all the output generated during execution, and a list of hosts on which execution failed for any reason. The first line of the failed-hosts file was a comment containing the original call to the framework. So in order to retry the failed hosts, all I had to do was cut and paste the comment from the first line, replace the selection arguments for the above-mentioned host list with an input redirection from the failed-hosts list, and repeat this step if necessary until the list of failed hosts was empty.
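To make the log file and the failed-hosts list concrete, here is a small Python sketch of the same idea. It is an illustration under assumptions, not the original framework: the function names, the log layout and the pluggable `runner` are invented here, and the default runner simply shells out to `ssh` in batch mode.

```python
import shlex
import subprocess
import sys


def ssh_runner(host, command, timeout=60):
    """Run command on host over ssh; return (exit_code, combined output)."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, command],
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
            text=True, timeout=timeout,
        )
        return proc.returncode, proc.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 255, f"{exc}\n"  # treat "could not even run ssh" as a failure


def read_hosts(path):
    """Read a host list, skipping blank lines and '#' comment lines."""
    with open(path) as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def run_on_hosts(hosts, command, log_path, failed_path, runner=ssh_runner, argv=None):
    """Run command on every host, append all output to a log, list the failures."""
    original_call = " ".join(shlex.quote(arg) for arg in (argv or sys.argv))
    failed = []
    with open(log_path, "a") as log:
        for host in hosts:
            code, output = runner(host, command)
            log.write(f"=== {host}: exit code {code} ===\n")
            log.write(output if not output or output.endswith("\n") else output + "\n")
            if code != 0:
                failed.append(host)
    with open(failed_path, "w") as handle:
        handle.write(f"# {original_call}\n")  # first line: how this run was started
        handle.writelines(f"{host}\n" for host in failed)
    return failed
```

A retry round is then just `run_on_hosts(read_hosts("failed.txt"), ...)` with a new failed-hosts file, repeated until that file contains nothing but the comment line.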

BTW, all calls into the framework usually ran within a nohup'd subshell, to protect them against failures of my X-Terminal connection to the controlling host (hence the need for automatic log files). Those failures happened much more often than the case you are mainly interested in, namely a failure of the connection from the controlling host to a target host.

All of my target systems were 24/7 servers, so if a script failed, it was very unlikely to be caused by downtime or network failure. It was usually due to bugs and oversights on my part when writing the script, which then did not work on some system configurations. It was therefore important that my scripts ended with exit code 0 only if they had really done what they were supposed to do.

I hope this was interesting for you. For me it was certainly nice to indulge in memories from a better time in my life.

All my work controlling tens to hundreds of servers was many years ago, and over rsh rather than ssh. In general, where it mattered, I kept a list of the hosts whose remote jobs had not positively succeeded, so that those hosts could be dealt with in a clean-up round later. But I usually also wanted to find out *why* they had not worked, so that clean-up round would be preceded by some manual investigation on my part.

If a system is down, its patch script will be run the next time the task timer fires. Then its script will be deleted.

I used to do it this way: instead of one big script that patches everything, I had a script that generated a separate script for each system. It's an easy and clean solution. The generator script can even add tasks to an existing per-system script, for cases where additional orders (tasks) are generated for a system whose first script has not yet completed.
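One way to read the generator idea is as a spool directory with one pending script per host. The Python sketch below is only my interpretation: the spool layout, the function names and the `runner` callable (which takes a host and the script text and returns an exit code, for instance by piping the text to `ssh host sh`) are all assumptions.

```python
from pathlib import Path


def generate_tasks(spool_dir, hosts, task):
    """Append a task to each host's pending script, creating it if needed."""
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    for host in hosts:
        # Appending means a host that is still pending simply gets a longer script.
        with open(spool / f"{host}.sh", "a") as handle:
            handle.write(task.rstrip("\n") + "\n")


def run_pending(spool_dir, runner):
    """Try every pending script once; delete it on success, keep it on failure."""
    still_pending = []
    for path in sorted(Path(spool_dir).glob("*.sh")):
        host = path.stem
        if runner(host, path.read_text()) == 0:
            path.unlink()  # done, so this host drops out of the queue
        else:
            still_pending.append(host)
    return still_pending
```

Calling `run_pending` from a timer (cron, for example) gives the behaviour described above: a host that was down keeps its script until the next round, and the files left in the spool directory are the list of hosts still to be done.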

Have you looked at the N1 Service Provisioning System? I know its primary use is installation of software packages on large numbers of target systems, but it ought to be smart enough to install and run any arbitrary script. The key thing it provides is the ability to run the same operation on large numbers of servers.

I don't deal with this problem personally, but it is an interesting one.

One interesting approach would be to move to a pull model (rather than push). It can make the security design of the system more interesting, but it allows for a fairly elegant solution to the "some hosts are always down" problem. You never have to initiate the job more than once, although you have to monitor the job's progress across all your hosts somehow.
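For what it's worth, the host side of a pull model could look something like the Python sketch below. It assumes the jobs are plain shell scripts published in a directory every host can read (a shared mount or a synced copy, say) and that each host keeps its own local state directory; both directories and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def run_with_sh(job_path):
    """Default job runner: execute the job file with sh, return its exit code."""
    return subprocess.run(["sh", str(job_path)]).returncode


def pull_and_run(jobs_dir, state_dir, run=run_with_sh):
    """Run every published job that has no local success marker yet."""
    state = Path(state_dir)
    state.mkdir(parents=True, exist_ok=True)
    completed = []
    for job in sorted(Path(jobs_dir).iterdir()):
        marker = state / f"{job.name}.done"
        if not job.is_file() or marker.exists():
            continue  # not a job file, or already done on this host
        if run(job) == 0:
            marker.write_text("ok\n")  # only a clean exit counts as done
            completed.append(job.name)
    return completed
```

Each host would call `pull_and_run` from cron or at boot, so a host that was down simply catches up when it comes back. The monitoring part is not covered here: the markers would still have to be reported back somewhere central.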