Feature: Tools & Utilities

Parallel SSH execution and a single shell to control them all

Many people use SSH to log in to remote machines, copy files around, and perform general system administration. If you want to increase your productivity with SSH, you can try a tool that lets you run commands on more than one remote machine at the same time. Parallel ssh, Cluster SSH, and ClusterIt let you specify commands in a single terminal window and send them to a collection of remote machines where they can be executed.

Why would you need a utility like this when, using OpenSSH, you can create a file containing your commands and use a bash for loop to run it on a list of remote hosts, one at a time? One advantage of a parallel SSH utility is that commands can be run on several hosts at the same time. For a short-running task this might not matter much, but if a task needs an hour to complete and you need to run it on 20 hosts, parallel execution beats serial by a mile. Also, if you want to interactively edit the same file on multiple machines, it might be quicker to use a parallel SSH utility and edit the file on all nodes with vi rather than concoct a script to do the same edit.
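The difference between the serial loop and parallel execution can be sketched in plain shell. This is only an illustration: run_on_host is a hypothetical stand-in that sleeps for one second so the sketch runs without any remote machines, the host name p3 is assumed alongside the p1 and p2 used later in this article, and commands.sh is a hypothetical file of commands.

```shell
#!/bin/sh
# Hypothetical stand-in for a remote task. In real use the body might be:
#   ssh "$1" 'sh -s' < commands.sh
run_on_host() {
    sleep 1
    echo "done on $1"
}

hosts="p1 p2 p3"

# Serial: each host waits for the previous one to finish.
start=$(date +%s)
for h in $hosts; do
    run_on_host "$h"
done
serial=$(( $(date +%s) - start ))

# Parallel: start every task in the background, then wait for all of them.
start=$(date +%s)
for h in $hosts; do
    run_on_host "$h" &
done
wait
parallel=$(( $(date +%s) - start ))

echo "serial: ${serial}s, parallel: ${parallel}s"
```

The background-job version gives the speedup but none of the conveniences (grouped output, timeouts, a cap on concurrency) that the dedicated tools add.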

Many of these parallel SSH tools include support for copying to many hosts at once (a parallel version of scp) or using rsync on a collection of hosts at once. Because the parallel SSH implementations know about all the hosts in a group, some of them also offer the ability to execute a command "on one host" and will work out which host to pick using load balancing. Finally, some parallel SSH projects let you use barriers so that you can execute a collection of commands and explicitly have each node in the group wait until all the nodes have completed a stage before moving on to the next stage of processing.

Parallel ssh (pssh)

pssh is packaged for openSUSE as a 1-Click install and is available in the Ubuntu Hardy Universe and Fedora 9 repositories. I used the 64-bit package from the Fedora 9 repositories.

All of the Parallel ssh commands have the form command -h hosts-file options, where the hosts-file contains a list of all the hosts that you want to have the command executed on. For example, the first pssh command below will execute the date command on p1 and p2 as the ben user. The optional -l argument specifies the username that should be used to log in to the remote machines.
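A minimal sketch of such an invocation, built only from the options described above, might look like this. It is not the original listing: the contents of hosts-file are assumed, and the guard around the call is there only so the snippet does nothing harmful on a machine without pssh.

```shell
# Assumed hosts file: one host per line (p1 and p2 are this article's example hosts).
printf '%s\n' p1 p2 > hosts-file

if command -v pssh >/dev/null 2>&1; then
    # Run the date command on every host in hosts-file, logging in as ben.
    pssh -h hosts-file -l ben date || echo "pssh exited with a non-zero status" >&2
else
    echo "pssh is not installed; nothing was run" >&2
fi
```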

Normally the standard output from the remote hosts is not shown to you. The -P option in the last invocation displays the output from both remote hosts as well as the exit status. If you are running more complex commands you might like to use -i instead to see each remote host's output grouped nicely under its hostname rather than mixed up as the output comes in from the hosts. You can also use the --outdir pssh option to specify the path of a directory that should be used to save the output from each remote host. The output for each host is saved in a separate file named with the remote machine's hostname.
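The two output styles could be tried side by side as in the sketch below. The directory /tmp/pssh-output and the remote command uname -a are assumptions chosen for illustration; the hosts file is the assumed two-host list used throughout this article.

```shell
# Assumed hosts file and a hypothetical directory for saved output.
printf '%s\n' p1 p2 > hosts-file
outdir=/tmp/pssh-output
mkdir -p "$outdir"

if command -v pssh >/dev/null 2>&1; then
    # -i: show each host's output grouped under its hostname.
    pssh -h hosts-file -l ben -i 'uname -a' || echo "pssh exited with a non-zero status" >&2

    # --outdir: save each host's output to a file named after the host.
    pssh -h hosts-file -l ben --outdir "$outdir" 'uname -a' || echo "pssh exited with a non-zero status" >&2
    ls "$outdir"
else
    echo "pssh is not installed; nothing was run" >&2
fi
```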

You can use the --timeout option to specify how long a command can take. It defaults to 60 seconds. This means that if your command fails to complete within 60 seconds on a host, pssh will consider it an error and report it as such, as shown below. You can increase the timeout to something well above what might be acceptable (for example to 24 hours) to avoid this problem.
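Raising the limit to 24 hours might look like the following sketch. The timeout is given in seconds, and the remote command (a sleep that outlasts the 60-second default) is a hypothetical stand-in for a long-running job.

```shell
# Assumed hosts file listing the two example hosts.
printf '%s\n' p1 p2 > hosts-file

# 24 hours expressed in seconds.
timeout_secs=$((24 * 60 * 60))

if command -v pssh >/dev/null 2>&1; then
    # 'sleep 120; date' stands in for any job that outlasts the default timeout.
    pssh -h hosts-file -l ben --timeout "$timeout_secs" -i 'sleep 120; date' \
        || echo "pssh exited with a non-zero status" >&2
else
    echo "pssh is not installed; would have used --timeout $timeout_secs" >&2
fi
```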

The pscp command takes the same -h, -l, and --timeout options and includes a --recursive option to enable deep copying from the local host. At the end of the command you supply the local and remote paths you would like to copy. The first pscp command in the example below copies a single file to two remote hosts in parallel. The following ssh command checks that the file exists on the p1 machine. The second pscp command fails in a verbose manner without really telling you the simple reason why. Knowing that I was trying to copy a directory over, I added the --recursive option to the command and it executed perfectly. The final ssh command verifies that the directory now exists on the p1 remote host.
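As a sketch of the two kinds of copy described above, under assumptions: the file example.txt and the contents of example-tree are invented for illustration, and the remote destinations under /tmp are guesses rather than the paths from the original listing. Binary names can also differ between distributions; the name here follows the article.

```shell
# Assumed hosts file plus some hypothetical local content to copy.
printf '%s\n' p1 p2 > hosts-file
echo "hello" > example.txt
mkdir -p example-tree/sub
echo "nested" > example-tree/sub/file.txt

if command -v pscp >/dev/null 2>&1; then
    # Copy a single file to every host in parallel.
    pscp -h hosts-file -l ben example.txt /tmp/example.txt \
        || echo "pscp exited with a non-zero status" >&2

    # Copying a directory needs --recursive.
    pscp -h hosts-file -l ben --recursive example-tree /tmp/example-tree \
        || echo "pscp exited with a non-zero status" >&2
else
    echo "pscp is not installed; nothing was copied" >&2
fi
```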

The prsync command uses only a handful of the command-line options from rsync. In particular, you cannot use the verbose or dry-run options to get details or see what would have been done. The command shown below will rsync the example-tree into /tmp/example-tree on the remote hosts in a manner similar to the final command in the pscp example.

$ prsync -h hosts-file -l ben -a --recursive example-tree /tmp

The main gain of the prsync command over using the normal rsync command with pssh is that prsync gives a simpler command line and lets you sync from the local machine to the remote hosts directly. Using pssh and rsync, you are running the rsync command on each remote machine, so the remote machine will need to connect back to the local machine in order to sync.

The pslurp command is roughly the opposite of pscp in that it grabs a file or directory off all the remote machines and copies it to the local machine. The command below grabs the example-tree directory from both p1 and p2 and stores them in /tmp/outdir. The -r option is shorthand for --recursive. As you can see, for each remote host a new directory is created with the name of the host, and inside that directory a copy of example-tree is made using the local directory name supplied as the last argument to pslurp.
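A pslurp invocation matching that description might be sketched as follows. This is a reconstruction under assumptions, not the original listing: the remote path /tmp/example-tree is taken from the prsync example, and -L is the pslurp option that names the local directory to download into.

```shell
# Assumed hosts file and the local directory named in the text.
printf '%s\n' p1 p2 > hosts-file
localdir=/tmp/outdir
mkdir -p "$localdir"

if command -v pslurp >/dev/null 2>&1; then
    # Fetch /tmp/example-tree from every host; each copy should land in
    # $localdir/<hostname>/example-tree.
    pslurp -h hosts-file -l ben -r -L "$localdir" /tmp/example-tree example-tree \
        || echo "pslurp exited with a non-zero status" >&2
else
    echo "pslurp is not installed; nothing was fetched" >&2
fi
```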

You can use environment variables to make things easier with Parallel ssh. You can use the PSSH_HOSTS variable to name the hosts file instead of using the -h option. Likewise, the PSSH_USER environment variable lets you set the username to log in as, like the -l pssh command line option.
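With those two variables exported, the command line shrinks to just the options and the remote command, as in this sketch. It assumes the same two-host file as before, and support for these variables may vary between pssh versions.

```shell
# Assumed hosts file listing the two example hosts.
printf '%s\n' p1 p2 > hosts-file

# Stand-ins for -h hosts-file and -l ben.
export PSSH_HOSTS=hosts-file
export PSSH_USER=ben

if command -v pssh >/dev/null 2>&1; then
    # No -h or -l needed; both are taken from the environment.
    pssh -i date || echo "pssh exited with a non-zero status" >&2
else
    echo "pssh is not installed; nothing was run" >&2
fi
```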

Comments

Parallel SSH execution and a single shell to control them all

Posted by: Anonymous
[ip: 64.89.94.194]
on October 30, 2008 01:05 PM

Another product that you should check out is called Func: https://fedorahosted.org/func
From the product description:
Func is a secure, scriptable remote control framework and API. It is intended to replace SSH scripted infrastructure for a variety of datacenter automation tasks (such as taking hardware inventory, running queries, or executing remote commands) that are run against a large amount of systems at the same time. Func provides both a command line tool and a simple and powerful Python API. It is also intended to be very easy to integrate with your provisioning environment and tools like Cobbler.

That's roughly 5.15 seconds per host. If this were a 5000-node network, we'd be looking at about 7.1 hours to complete this command. Let's do the same test with pssh and a max parallel of 10:

$ time pssh -p 10 -h hosts "ls > /dev/null"
real 0m17.220s

That's a considerable saving. Let's try running every host in parallel by setting the max to 32:
$ time pssh -p 32 -h hosts "ls > /dev/null"
real 0m7.436s

If one run took about 5 seconds, doing them all at the same time also took about 5 seconds, just with a bit of overhead. I don't have a 5000-node network (anymore), but you can see there are considerable savings from doing some things in parallel. You probably wouldn't ever run 5000 commands in parallel, but really that's a limit of your hardware and network. If you had a beefy enough host machine, you could probably run 50, 100, or even 200 in parallel.
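The arithmetic behind those figures can be checked with a couple of awk one-liners. This sketch uses only the numbers quoted in the comment: 5.15 seconds per host, 5000 hosts, and the two measured pssh wall-clock times.

```shell
# Serial estimate: 5000 hosts at roughly 5.15 seconds each, converted to hours.
serial_hours=$(awk 'BEGIN { printf "%.2f", 5000 * 5.15 / 3600 }')

# Ratio of the two measured pssh runs: -p 10 (17.220 s) versus -p 32 (7.436 s).
ratio=$(awk 'BEGIN { printf "%.2f", 17.220 / 7.436 }')

echo "serial estimate: ${serial_hours} hours"
echo "-p 10 took ${ratio} times as long as -p 32"
```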

Re(1): This isn't good enough?

It's absolutely not good enough. Four or so years ago a coworker and I wrote a suite of parallel ssh tools to help perform security-related duties on the very large network in our global corp. With our tools on a MOSIX cluster using load-balanced ssh-agents across multiple nodes, we could run up to 1000 outbound sessions concurrently. This made tasks such as looking for users' processes or cronjobs on 10,000+ hosts worldwide something that could be done in a reasonable amount of time, as opposed to taking more than a day.

Parallel SSH execution and a single shell to control them all

Posted by: Anonymous
[ip: 24.14.35.105]
on November 02, 2008 12:43 PM

I use the parallel option to xargs for this. I tried shmux and some other tools, but xargs seems to work best for me. Just use a more recent GNU version; some older GNU versions, some AIX versions, and so on have issues. The only real gotcha that I've run into is that it will stop the whole run if a command exits non-zero. Just write a little wrapper that exits 0 and you're good to go. I've used this in two ~1000-server environments to push code (pipe tar over ssh for better compatibility) and to remotely execute commands.
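The xargs approach with an exit-0 wrapper could be sketched like this. It is an illustration under assumptions: the three host names are invented, the inline script stands in for a real ssh call, and p2 is treated as unreachable purely to exercise the failure path. In GNU xargs the documented stop condition is an exit status of 255, which is also the status ssh returns for its own errors, so forcing an exit status of 0 keeps the run going.

```shell
# Assumed hosts file; p3 is an invented third host.
printf '%s\n' p1 p2 p3 > hosts-file

# -P 3 runs up to three jobs at once; -I {} passes one host per job.
xargs -P 3 -I {} sh -c '
    host=$1
    # Hypothetical stand-in for: ssh "$host" uptime
    # Here p2 is treated as unreachable to exercise the failure path.
    if [ "$host" != p2 ]; then
        echo "$host ok"
    else
        echo "$host failed"
    fi
    exit 0   # always report success so xargs keeps going
' sh {} < hosts-file > results.txt

sort results.txt
```

Recording the per-host result in the output, as done here, keeps failures visible even though xargs itself no longer sees them.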

Parallel SSH execution and a single shell to control them all

What I'm curious about is this:
"""
if you want to interactively edit the same file on multiple machines, it might be quicker to use a parallel SSH utility and edit the file on all nodes with vi rather than concoct a script to do the same edit.
"""

I would have found a short note on which of these three is capable of doing so very helpful. Cluster SSH's description sounds as though it would be the tool that could do it, but I don't have the time to test it just yet.
Has anyone tried that yet? Or does anyone know which tool this statement refers to?