December 2, 2010

As system administrators, we are often faced with tasks that need to run
against a number of things, perhaps files, users, servers, etc. In most cases,
we resort of a loop of some sort, often, just a for loop in the shell. The
drawback to this seemingly obvious approach, is that we are constrained by the
fact that this approach is serial and so the time it will take increases linearly
with the number of things we are running the task against.

I am here to tell you there is a better way, it is the path of going parallel!

Tools for your shell scripts

The first place to start is with tools that can replace that for loop you
usually use and add some parallelism to the task you are running.

xargs is a tool used to build and execute command lines from standard
input, but one of its great features is that it can execute those command lines
in parallel via its -P argument. A quick example of this is:

seq 10 20 | xargs -n 1 -P 5 sleep

This will send a sequence of numbers to xargs and divide it into chunks of one
argument (-n 1) at a time and fork off 5 parallel processes (-P 5) to
execute each. You can see it in action:

gnu parallel is a lesser known tool, but has been gaining popularity
recently. It is written with the specific focus on executing processes in
parallel. From the home page description: "GNU parallel is a shell tool for
executing jobs in parallel locally or using remote machines. A job is typically
a single command or a small script that has to be run for each of the lines in
the input. The typical input is a list of files, a list of hosts, a list of
users, a list of URLs, or a list of tables."

This plist file is xml, but the output of parallel is unordered above because
each line of input is processed by one of the 8 workers and output occurs
(--group) as each worker finishes an input (line) and not necessarily in the
order of input.

While this approach is fine for small tasks on a small number of machines, the
drawback to it is that it is executed linearly, so the total time the job will
take is as long as the task takes to finish multiplied by the number of machines you are
executing it on, which means it could take a while, so you'd better get a
Snickers.

Fortunately, people have recognized this problem and have developed a number of
tools have been developed to solve this, by running your SSH commends in
parallel.

Smarter tools for multiple machines

Once you've adopted the mindset your tasks can be done in parallel and you've
started using one of the parallel ssh tools for executing ad-hoc commands in a
parallel fashion, you may find yourself thinking that you'd like to be able to
execute tasks in parallel, but in a more repeatable, extensible, and organized
fashion.

If you were thinking this, you are in a luck. There is a class of tools
commonly classified as Command and Control or Orchestration tools. These
tools include:

These tools are built to be frameworks within which you can build repeatable
systems automation. Mcollective and capistrano are written in Ruby, and Func
and Fabric are written in Python. This gives you options for whichever language
you prefer. Each has strengths and weaknesses. I'm a big fan of Mcollective in particular, because it has
the strength of being built on Puppet and its
primary author, R.I. Pienaar has a vision for it to
become an extremely versatile tool for the kinds of needs that fall within the
realm of Command and Control or Orchestration.

As it's always easiest to grasp a tool by seeing it in action, here are basic
examples of using each tool:

As you can see from these brief examples, each of these tools accomplishes
similar things, each one has a unique ecosystem, plugins available, and
strengths and weaknesses, a description of which, is beyond the scope of this
post.

Taking your own script(s) multithreaded

The kernel of this article was an article I recently wrote for my employer's
blog, Taking your script
multithreaded, in which I
detailed how I wrote a Python script to make an rsync job multithreaded and cut
the execution time of a task from approximately 6 hours, down to 45 minutes.

I've created a git repo out
of the script, so you can take my code and poke at it. If you end up using the
script and make improvements, feel free to send me patches!

With the help of David Grieser, there is
also a Ruby port of the script up on Github.

These are two good examples of how you can easily implement a multithreaded
version of your own scripts to help parallelize your tasks.

Conclusion

There are clearly many steps you can take along the path to going parallel.
I've tried to highlight how you can begin with using tools to execute commands
in a more parallel fashion, progress to tools which help you execute ad-hoc and
then repeatable tasks across many hosts, and finally, given some examples on
how to make your own scripts more parallel.

Great writeup! Fyi Yahoo just open-sourced a pretty awesome parallel job execution tool called Pogo here https://github.com/nrh/pogo. It's used heavily inside Yahoo and works great so I'm hoping it gets some wider adoption. Definitely a work in progress.

@Anonymous, tbh, Perl's community kind of stagnated for a while, all the momentum and "coolness" seemed to be with the Ruby and Python communities, so those of us who came into things in the last 5-8 years went to one of those communities, I know some hardcore Perl people and it seems like Perl is having a revival, see http://www.modernperlbooks.com/mt/index.html, but that's my take on it.

@GDR that is also a good one, there are many parallel SSH tools, too many for me to cover them all :)

@JB I know of these from a combination of following people on twitter, following blogs, listening to podcasts, and irc. There are two ways to find out about these kinds of things 1.) cast a wide net yourself 2.) find a few good people who cast a wide net and follow them ;) I'm working on another post that will highlight a lot of what I follow/listen to, so stay tuned for that.

@Phil thanks for the kind words, it's what keeps a writer going. I think every medium to large web company has written their own or taken an existing one and morphed into their own beast, it's great to see them be released, as the ecosystem grows and learns, with these kinds of things, it truly is a "rising tide lifts all boats" kind of effect.