Thursday, December 06, 2012

I've been having some great fun parallelizing R code on Amazon's cloud. Now that things are chugging away nicely, it's time to document my foibles so I can remember not to fall into the same pits of despair again.

The goal was to perform lots of trials of a randomized statistical simulation. The jobs were independent and fairly chunky, taking from a couple of minutes up to 90 minutes or so. From each simulation, we got back a couple dozen numbers. We worked our way up to running a few thousand simulations at a time on 32 EC2 nodes.

The two approaches I tried were Starcluster and the parallel package that comes with the R distribution. I'll save Starcluster for later. I ended up pushing through with the parallel package.
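
For a sense of the basic pattern, here is a minimal sketch. It assumes a hypothetical run_trial function standing in for the real simulation, and two local workers in place of EC2 hosts; neither is taken from the actual runs.

```r
library(parallel)

# Hypothetical stand-in for one randomized simulation trial:
# takes a seed and returns a few summary numbers.
run_trial <- function(seed) {
  set.seed(seed)
  x <- rnorm(1000)
  c(seed = seed, mean = mean(x), sd = sd(x))
}

# Two local workers here; on EC2 this would be a vector of worker host names or IPs.
cl <- makePSOCKcluster(2)

# Each seed is an independent job; results come back in the order of the input.
results <- parLapplyLB(cl, 1:8, run_trial)
stopCluster(cl)

# One row per trial.
summary_table <- do.call(rbind, results)
```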

That was easy. But it wouldn't be any fun if a few things didn't go wrong.

Hitting limits

Why 32 machines? As we scaled up to larger runs, I started hitting limits. The first was in the number of machines Amazon lets you start up.

According to the email Amazon sent me: "Spot Instances, which have an instance limit that is typically five times larger than your On-Demand Instance limit (On-Demand default is 20)..."

You can get the limits raised, but I'm a total cheapskate anyway, so I hacked Dan's CloudFormation template to use spot instances, adding a "SpotPrice" property in the right place. Spot instances can go away at any time, but they're so much cheaper that it's worth dealing with that.

I confirmed that I could connect to the dud machines manually, and also from there back to the head node, like so:

ssh -i ~/.ssh/id_rsa ubuntu@10.241.65.139

The bug is resistant to rebooting and even terminating the dud node. Seemingly at random, anywhere from zero to three machines out of 32 would turn out to be duds. How irritating!

Luckily, Dan from the Bioconductor group found the problem, and you can even see it, if you know where to look, in the aforementioned gobbledygook. The parameter MASTER=ip-10-4-215-155 means the worker has to do name resolution, which apparently sometimes fails. (See the notes under master in the docs for makePSOCKcluster.)

We can give it an IP address, neatly bypassing any name resolution tar-pit:

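library(parallel)
# Pass the head node's IP address as master so workers skip name resolution
cl <- makePSOCKcluster(hosts,
                       master = system("hostname -i", intern = TRUE))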

Huge props to Dan for figuring that out and giving me a serious case of geek envy.

Load unbalancing

The LB in parLapplyLB stands for load balancing. It uses a simple and sensible strategy: give each worker one job, then when a worker is finished, give it another job, until all the jobs are assigned.
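
To see why that strategy pays off when job lengths vary, here is a small simulation of the dispatch rule as described above, compared against splitting the jobs into fixed chunks up front. It is a sketch of the idea, not the parallel package's actual code; the function names and the job durations are made up for illustration.

```r
# Simulate "one job at a time" dispatch: every worker gets one job, and whoever
# finishes first gets the next unassigned job. Returns the total wall-clock time.
simulate_lb <- function(durations, n_workers) {
  free_at <- rep(0, n_workers)      # time at which each worker becomes idle
  for (d in durations) {
    w <- which.min(free_at)         # the next worker to become free
    free_at[w] <- free_at[w] + d    # it picks up the next job in the queue
  }
  max(free_at)
}

# For comparison: split the jobs into equal contiguous chunks, one per worker.
simulate_static <- function(durations, n_workers) {
  chunk <- ceiling(seq_along(durations) * n_workers / length(durations))
  max(tapply(durations, chunk, sum))
}

# Hypothetical job durations in minutes: two 90-minute jobs and a tail of short ones.
durations <- c(90, 90, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5)

simulate_lb(durations, 3)      # 90
simulate_static(durations, 3)  # 184
```

With these made-up numbers, static chunking puts both long jobs on the same worker, while one-at-a-time dispatch lets the third worker mop up all the short jobs in the meantime.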

I think I saw cases where there were idle workers at a time when there were jobs that had not yet started. The only way that could happen is if the jobs were already assigned to a busy worker.

Looking at the code, that doesn't look possible, but I have a theory. There's an option in makePSOCKcluster to specify an outfile, and outfile="" sends stdout and stderr back to the head node. I thought that might be handy for debugging.

My guess is that a chatty, long-running worker sending output back to the head node via the outfile="" option causes its socket to be readable before its job is done, so another job gets submitted to that worker. Workers that finish later then go idle for lack of work, because the remaining jobs have already been hogged (but not started) by the chatty worker.

If it's only a weird interaction between outfile="" and parLapplyLB, it's not that big a deal. A more unfortunate property of parLapplyLB is what happens when a worker goes away, say because a connection is dropped or a spot instance is terminated: parLapplyLB bombs out with a socket error, and all work on all workers is lost. Doh!

For this reason, I had the workers write out checkpoints and collected them onto the head node periodically. This way, getting a return value back from parLapplyLB wasn't all that critical. And that brings me to the topic of automation.
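
The checkpointing idea can be sketched roughly as follows. This is an illustration under assumptions, not the code from the actual runs: run_with_checkpoint, the file naming scheme and the temporary directory are all hypothetical, and the example uses two local workers so that the checkpoint directory is visible to everyone.

```r
library(parallel)

# Hypothetical checkpoint directory; on a real cluster each worker would write
# to its own disk and the head node would copy the files over.
checkpoint_dir <- file.path(tempdir(), "checkpoints")
dir.create(checkpoint_dir, showWarnings = FALSE, recursive = TRUE)

# Run one job, unless a checkpoint from an earlier attempt already exists.
run_with_checkpoint <- function(job_id, checkpoint_dir) {
  path <- file.path(checkpoint_dir, sprintf("job-%05d.rds", job_id))
  if (file.exists(path)) return(readRDS(path))
  set.seed(job_id)
  result <- c(job = job_id, estimate = mean(rnorm(1000)))  # stand-in for the simulation
  saveRDS(result, path)
  result
}

cl <- makePSOCKcluster(2)

# If a worker disappears, parLapplyLB fails as a whole; catch that and
# fall back on whatever made it to disk.
results <- tryCatch(
  parLapplyLB(cl, 1:6, run_with_checkpoint, checkpoint_dir = checkpoint_dir),
  error = function(e) NULL
)
try(stopCluster(cl), silent = TRUE)

# Completed jobs can be recovered from the checkpoint files either way.
done <- lapply(list.files(checkpoint_dir, full.names = TRUE), readRDS)
```

Because finished jobs are skipped on a rerun, restarting after a crash only repeats the jobs that were in flight when things fell over.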

Slothful automation

Automation is great. Don't tell anyone, but I take something of a lazy approach to automation: starting with a hacky mess that just barely works with lots of manual intervention and gradually refining it as needed, in the general direction of greater automation and more robust error handling.

A little snippet of Python run from cron grabs checkpoint files from the workers every 20 minutes.

All this is closer to a hacky mess than clean automation. A lot of babysitting is still required.

Features I'd like to see

shared EBS volume (via NFS?) for packages, checkpoints and results

a queuing system that doesn't require persistent socket connections to the workers

async-lapply - returns a list of futures, which can be used to ask for status and results

progress monitoring on head node

support for scalable pools of spot instances that can go away at any time

grow and shrink pool according to size of queue

The right tool for 10,000 jobs

There are many ways to parallelize R. The approach in this document uses the parallel package and RStudio on Amazon EC2. The parallel package is nice for interactive development and has the advantage of keeping R worker processes alive rather than starting a new process for each job. But, this approach only works up to a point.

Different styles include interactive vs. batch, implicit vs. explicit and reformulating the problem as a map/reduce job. For map/reduce style computations, look at Rhipe. R at 12,000 Cores describes the “Programming with Big Data in R” project (pbdR). For batch jobs, Starcluster may be a better choice.

Starcluster provides several of those features, albeit with the caveat of restarting R for each job. Having pushed R/parallel to its limits, I intend to try Starcluster a little more. So far I've only learned the term-of-art for when your Starcluster job goes horribly wrong - that's called a starclusterfuck.